The e-mail text processor makes it easy to extract the textual content of e-mail documents and pipe it to other middlewares for further processing. This middleware can extract text, HTML, and structured JSON from e-mail documents. It also optionally supports the extraction of attachments and forwarding them as new documents to other middlewares.
📨 Parsing E-mails
To use this middleware, you import it in your CDK stack and instantiate it as part of a pipeline.
```typescript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { EmailTextProcessor } from '@project-lakechain/email-text-processor';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // Create the e-mail text processor.
    // `source` is a placeholder for an upstream middleware or trigger
    // defined elsewhere in the stack.
    const emailProcessor = new EmailTextProcessor.Builder()
      .withScope(this)
      .withIdentifier('EmailProcessor')
      .withCacheStorage(cache)
      .withSource(source) // 👈 Specify a data source
      .build();
  }
}
```

Output Formats
The e-mail text processor can extract the following formats from e-mail documents:
- `text`: Extracts only the textual body of the e-mail.
- `html`: Extracts the body of the e-mail as HTML.
- `json`: Extracts the body and the attributes of the e-mail as JSON.
💁 You can specify the output format by using the `withOutputFormat` method. By default, the output format is `text`.
```typescript
const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withOutputFormat('html') // 👈 Specify the output format
  .build();
```

Include Attachments
The e-mail text processor can optionally extract attachments from e-mail documents and forward them as new documents to other middlewares. You can enable the processing of attachments using the .withIncludeAttachments API.
```typescript
const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withIncludeAttachments(true) // 👈 Enable the processing of attachments
  .build();
```

Include Image Links
This middleware supports converting CID attachments to data URL images. You can enable this feature using the .withIncludeImageLinks API.
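As a rough illustration of what this conversion involves, the sketch below replaces `cid:` references in an HTML body with `data:` URLs built from the matching inline attachments. It is not the middleware's actual implementation: the `InlineAttachment` shape and the `toBase64`, `toDataUrl`, and `inlineCidImages` helpers are hypothetical names introduced for this example only.

```typescript
// Hypothetical attachment shape, for illustration only.
interface InlineAttachment {
  contentId: string;   // Content-ID header value, e.g. '<logo@example>'
  contentType: string; // MIME type, e.g. 'image/png'
  content: Uint8Array; // Raw attachment bytes
}

const BASE64_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Small dependency-free base64 encoder. In Node.js,
// Buffer.from(bytes).toString('base64') produces the same result.
function toBase64(bytes: Uint8Array): string {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const hasB1 = i + 1 < bytes.length;
    const hasB2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = hasB1 ? bytes[i + 1] : 0;
    const b2 = hasB2 ? bytes[i + 2] : 0;
    out += BASE64_ALPHABET[b0 >> 2];
    out += BASE64_ALPHABET[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += hasB1 ? BASE64_ALPHABET[((b1 & 0x0f) << 2) | (b2 >> 6)] : '=';
    out += hasB2 ? BASE64_ALPHABET[b2 & 0x3f] : '=';
  }
  return out;
}

// Builds a data URL embedding the attachment content.
function toDataUrl(attachment: InlineAttachment): string {
  return `data:${attachment.contentType};base64,${toBase64(attachment.content)}`;
}

// Replaces every `cid:<id>` reference that matches a known attachment
// with a data URL; unknown references are left untouched.
function inlineCidImages(html: string, attachments: InlineAttachment[]): string {
  const byCid = new Map<string, InlineAttachment>();
  for (const attachment of attachments) {
    // Content-ID values are often wrapped in angle brackets.
    byCid.set(attachment.contentId.replace(/^<|>$/g, ''), attachment);
  }
  return html.replace(/cid:([^"'\s)>]+)/g, (match: string, cid: string) => {
    const attachment = byCid.get(cid);
    return attachment ? toDataUrl(attachment) : match;
  });
}
```

With such a helper, an HTML body containing `<img src="cid:logo@example">` would end up with the image bytes embedded directly in the `src` attribute, so the resulting HTML document can be rendered without access to the original e-mail attachments.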
```typescript
const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withIncludeImageLinks(true) // 👈 Enable the processing of image links
  .build();
```

📄 Metadata
The e-mail text processor transforms input e-mail documents into the desired output format. It also enriches the metadata of documents with information extracted from the e-mail, such as its title, creation date, and authors. Below is an example of the metadata created by this middleware.
```json
{
  "specversion": "1.0",
  "id": "1780d5de-fd6f-4530-98d7-82ebee85ea39",
  "type": "document-created",
  "time": "2023-10-22T13:19:10.657Z",
  "data": {
    "chainId": "6ebf76e4-f70c-440c-98f9-3e3e7eb34c79",
    "source": {
      "url": "s3://bucket/email.eml",
      "type": "message/rfc822",
      "size": 24532,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "document": {
      "url": "s3://bucket/email.txt",
      "type": "text/plain",
      "size": 125,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "metadata": {
      "title": "Re: Hello World",
      "createdAt": "2023-10-22T13:19:10.657Z",
      "authors": [
        "John Doe"
      ],
      "properties": {
        "kind": "text",
        "attrs": {}
      }
    },
    "callStack": []
  }
}
```

🏗️ Architecture
This middleware runs on an ARM64 Lambda compute and packages the mailparser library to parse e-mail documents.

🏷️ Properties
Supported Inputs
| Mime Type | Description |
|---|---|
| message/rfc822 | E-mail documents. |
| application/vnd.ms-outlook | Outlook e-mail documents. |
Supported Outputs
The output type of this middleware depends on whether the inclusion of attachments is enabled. If attachments are included, the output type is */*, as attachments can be of any type. Otherwise, the output type is the MIME type associated with the configured output format.
| Output Format | With Attachment | Mime Type | Description |
|---|---|---|---|
| text | No | text/plain | Plain text document. |
| text | Yes | */* | Any document. |
| html | No | text/html | HTML document. |
| html | Yes | */* | Any document. |
| json | No | application/json | JSON document. |
| json | Yes | */* | Any document. |
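
The rule in the table above can be summarized as a small lookup. The following is a minimal sketch and not part of the Lakechain API: the `OutputFormat` type, the `FORMAT_MIME_TYPES` map, and the `resolveOutputType` function are hypothetical names used only to make the mapping explicit.

```typescript
// Hypothetical names, for illustration only.
type OutputFormat = 'text' | 'html' | 'json';

// MIME type produced for each output format when attachments are not included.
const FORMAT_MIME_TYPES: Record<OutputFormat, string> = {
  text: 'text/plain',
  html: 'text/html',
  json: 'application/json'
};

// Returns the output type a downstream middleware should expect.
function resolveOutputType(format: OutputFormat, includeAttachments: boolean): string {
  // Attachments can be of any type, so the output type becomes a wildcard.
  return includeAttachments ? '*/*' : FORMAT_MIME_TYPES[format];
}
```

In practice, this means that a middleware connected downstream of the e-mail text processor should be able to handle any document type as soon as attachments are enabled, regardless of the selected output format.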
Supported Compute Types
| Type | Description |
|---|---|
| CPU | This middleware only supports CPU compute. |
📖 Examples
- E-mail NLP Pipeline - An example showcasing how to analyze e-mails.