The e-mail text processor makes it easy to extract the textual content of e-mail documents and pipe it to other middlewares for further processing. This middleware can extract text, HTML, and structured JSON from e-mail documents. It also optionally supports the extraction of attachments and forwarding them as new documents to other middlewares.
📨 Parsing E-mails
To use this middleware, you import it in your CDK stack and instantiate it as part of a pipeline.
```typescript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { EmailTextProcessor } from '@project-lakechain/email-text-processor';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // Create the e-mail text processor.
    // `source` is the upstream middleware or trigger defined elsewhere in the stack.
    const emailProcessor = new EmailTextProcessor.Builder()
      .withScope(this)
      .withIdentifier('EmailProcessor')
      .withCacheStorage(cache)
      .withSource(source) // 👈 Specify a data source
      .build();
  }
}
```
Output Formats
The e-mail text processor can extract the following formats from e-mail documents:
- `text` - Extracts only the textual body of the e-mail.
- `html` - Extracts the body of the e-mail as HTML.
- `json` - Extracts the body and the attributes of the e-mail as JSON.
💁 You can specify the output format by using the `withOutputFormat` method. By default, the output format is `text`.
```typescript
const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withOutputFormat('html') // 👈 Specify the output format
  .build();
```
Include Attachments
The e-mail text processor can optionally extract attachments from e-mail documents and forward them as new documents to other middlewares. You can enable the processing of attachments using the `.withIncludeAttachments` API.
```typescript
const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withIncludeAttachments(true) // 👈 Enable the processing of attachments
  .build();
```
Include Image Links
This middleware supports converting CID attachments to data URL images. You can enable this feature using the `.withIncludeImageLinks` API.
```typescript
const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withIncludeImageLinks(true) // 👈 Enable the processing of image links
  .build();
```
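To illustrate what converting CID attachments to data URL images means in practice, here is a minimal sketch of the general technique. It is not the middleware's actual implementation: the `InlineAttachment` shape and the `inlineCidImages` function are hypothetical names introduced for this example, and the attachment bytes are assumed to be base64-encoded already.

```typescript
// Hypothetical shape for an inline attachment (an assumption for this sketch,
// not a type exported by the middleware).
interface InlineAttachment {
  contentId: string;   // Content-ID without angle brackets, e.g. "logo@example"
  contentType: string; // MIME type of the attachment, e.g. "image/png"
  base64: string;      // Attachment bytes, already base64-encoded
}

// Replaces every `cid:<content-id>` reference found in the HTML body with a
// data URL built from the matching attachment. References that do not match
// any attachment are left unchanged.
function inlineCidImages(html: string, attachments: InlineAttachment[]): string {
  const byContentId = new Map<string, InlineAttachment>();
  for (const attachment of attachments) {
    byContentId.set(attachment.contentId, attachment);
  }

  return html.replace(/cid:([^"'\s>)]+)/g, (match: string, cid: string) => {
    const attachment = byContentId.get(cid);
    return attachment
      ? `data:${attachment.contentType};base64,${attachment.base64}`
      : match;
  });
}
```

With such a conversion, an `<img src="cid:...">` tag in the HTML output no longer depends on the original e-mail container, since the image bytes travel inside the document itself.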
📄 Metadata
The e-mail text processor transforms input e-mail documents into the desired output format. It also enriches the metadata of documents with information about the e-mail, such as its title, creation date, and authors. Below is an example of the metadata created by this middleware.
```json
{
  "specversion": "1.0",
  "id": "1780d5de-fd6f-4530-98d7-82ebee85ea39",
  "type": "document-created",
  "time": "2023-10-22T13:19:10.657Z",
  "data": {
    "chainId": "6ebf76e4-f70c-440c-98f9-3e3e7eb34c79",
    "source": {
      "url": "s3://bucket/email.eml",
      "type": "message/rfc822",
      "size": 24532,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "document": {
      "url": "s3://bucket/email.txt",
      "type": "text/plain",
      "size": 125,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "metadata": {
      "title": "Re: Hello World",
      "createdAt": "2023-10-22T13:19:10.657Z",
      "authors": [
        "John Doe"
      ],
      "properties": {
        "kind": "text",
        "attrs": {}
      }
    },
    "callStack": []
  }
}
```
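A downstream consumer could read these fields as in the sketch below. The interfaces are derived only from the example event above and are not the types shipped with the project; which metadata fields are optional is an assumption, and `describeEmail` is a hypothetical helper name.

```typescript
// Reference to a stored document, as it appears under `source` and `document`.
interface DocumentRef {
  url: string;
  type: string;
  size: number;
  etag: string;
}

// Event shape inferred from the example above (assumption: metadata fields
// may be absent, so they are declared optional here).
interface EmailDocumentEvent {
  specversion: string;
  id: string;
  type: string;
  time: string;
  data: {
    chainId: string;
    source: DocumentRef;
    document: DocumentRef;
    metadata: {
      title?: string;
      createdAt?: string;
      authors?: string[];
      properties?: { kind: string; attrs: Record<string, unknown> };
    };
    callStack: string[];
  };
}

// Builds a one-line summary from the metadata added by the middleware.
function describeEmail(event: EmailDocumentEvent): string {
  const { metadata, document } = event.data;
  const title = metadata.title ?? '(no title)';
  const authors = (metadata.authors ?? []).join(', ') || 'unknown';
  return `${title} by ${authors} -> ${document.url} (${document.type})`;
}
```

Keeping the metadata fields optional lets the same helper handle e-mails for which some attributes could not be extracted.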
🏗️ Architecture
This middleware runs on an ARM64 Lambda compute and packages the `mailparser` library to parse e-mail documents.
🏷️ Properties
Supported Inputs
Mime Type | Description |
---|---|
message/rfc822 | E-mail documents. |
application/vnd.ms-outlook | Outlook e-mail documents. |
Supported Outputs
The output type of this middleware depends on whether the inclusion of attachments is enabled. If attachments are included, the output type is `*/*`, as attachments can be of any type; otherwise, the output type is determined by the configured output format.
Output Format | With Attachment | Mime Type | Description |
---|---|---|---|
text | No | text/plain | Plain text document. |
text | Yes | */* | Any document. |
html | No | text/html | HTML document. |
html | Yes | */* | Any document. |
json | No | application/json | JSON document. |
json | Yes | */* | Any document. |
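The rule encoded in the table above can be summarized with a small sketch. `OutputFormat` and `outputMimeType` are hypothetical names introduced for illustration; they are not part of the middleware's API.

```typescript
// The three output formats listed in the table above.
type OutputFormat = 'text' | 'html' | 'json';

// MIME type produced for each format when attachments are not included.
const MIME_TYPE_BY_FORMAT: Record<OutputFormat, string> = {
  text: 'text/plain',
  html: 'text/html',
  json: 'application/json'
};

// Returns the output MIME type for a given configuration: `*/*` whenever
// attachments are included, otherwise the type tied to the output format.
function outputMimeType(format: OutputFormat, includeAttachments: boolean): string {
  return includeAttachments ? '*/*' : MIME_TYPE_BY_FORMAT[format];
}
```

This is useful to keep in mind when connecting the next middleware in a pipeline: once attachments are enabled, it should be prepared to receive documents of any type.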
Supported Compute Types
Type | Description |
---|---|
CPU | This middleware only supports CPU compute. |
📖 Examples
- E-mail NLP Pipeline - An example showcasing how to analyze e-mails.