Email

Unstable API 0.10.0 @project-lakechain/email-text-processor TypeScript

The e-mail text processor extracts the textual content of e-mail documents and pipes it to other middlewares for further processing. It can extract text, HTML, and structured JSON from e-mail documents. Optionally, it can also extract attachments and forward them as new documents to other middlewares.


📨 Parsing E-mails

To use this middleware, import it in your CDK stack and instantiate it as part of a pipeline.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { EmailTextProcessor } from '@project-lakechain/email-text-processor';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // Create the e-mail text processor.
    // `source` stands for an upstream middleware defined elsewhere in the pipeline.
    const emailProcessor = new EmailTextProcessor.Builder()
      .withScope(this)
      .withIdentifier('EmailProcessor')
      .withCacheStorage(cache)
      .withSource(source) // 👈 Specify a data source
      .build();
  }
}


Output Formats

The e-mail text processor can extract the following formats from e-mail documents:

  • text: Extracts only the textual body of the e-mail.
  • html: Extracts the body of the e-mail as HTML.
  • json: Extracts the body and the attributes of the e-mail as JSON.

💁 You can specify the output format by using the withOutputFormat method. By default, the output format is text.

const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withOutputFormat('html') // 👈 Specify the output format
  .build();


Include Attachments

The e-mail text processor can optionally extract attachments from e-mail documents and forward them as new documents to other middlewares. You can enable the processing of attachments using the .withIncludeAttachments API.

const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withIncludeAttachments(true) // 👈 Enable the processing of attachments
  .build();


This middleware can also convert CID attachments (inline images that the HTML body references by their Content-ID) into data URL images. You can enable this feature using the .withIncludeImageLinks API.

const emailProcessor = new EmailTextProcessor.Builder()
  .withScope(this)
  .withIdentifier('EmailProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withIncludeImageLinks(true) // 👈 Enable the processing of image links
  .build();
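
To make the idea of CID-to-data-URL conversion concrete, here is a minimal sketch of what such a rewrite could look like. It is an illustration only and not the middleware's actual implementation; the `InlineAttachment` type and the `inlineCidImages` function are hypothetical names introduced for this example.

```typescript
// Hypothetical shape of an inline attachment (assumption, not the middleware's own type).
interface InlineAttachment {
  cid: string;         // Content-ID, without the surrounding angle brackets
  contentType: string; // e.g. "image/png"
  base64: string;      // attachment bytes, already base64-encoded
}

// Replace every `cid:<id>` reference in the HTML body with a data URL
// built from the matching attachment. Unknown CIDs are left untouched.
function inlineCidImages(html: string, attachments: InlineAttachment[]): string {
  const byCid = new Map<string, InlineAttachment>();
  for (const attachment of attachments) {
    byCid.set(attachment.cid, attachment);
  }
  return html.replace(/cid:([^"'\s>)]+)/g, (match: string, cid: string) => {
    const attachment = byCid.get(cid);
    return attachment
      ? `data:${attachment.contentType};base64,${attachment.base64}`
      : match;
  });
}

const sampleHtml = '<p>Logo: <img src="cid:logo@example"></p>';
const inlined = inlineCidImages(sampleHtml, [
  { cid: 'logo@example', contentType: 'image/png', base64: 'iVBORw0KGgo=' }
]);
console.log(inlined);
// prints: <p>Logo: <img src="data:image/png;base64,iVBORw0KGgo="></p>
```

Inlining images this way keeps the HTML output self-contained, at the cost of a larger document.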


📄 Metadata

The e-mail text processor transforms input e-mail documents into the desired output format. It also enriches the metadata of documents with additional information. Below is an example of the metadata created by this middleware.

{
  "specversion": "1.0",
  "id": "1780d5de-fd6f-4530-98d7-82ebee85ea39",
  "type": "document-created",
  "time": "2023-10-22T13:19:10.657Z",
  "data": {
    "chainId": "6ebf76e4-f70c-440c-98f9-3e3e7eb34c79",
    "source": {
      "url": "s3://bucket/email.eml",
      "type": "message/rfc822",
      "size": 24532,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "document": {
      "url": "s3://bucket/email.txt",
      "type": "text/plain",
      "size": 125,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "metadata": {
      "title": "Re: Hello World",
      "createdAt": "2023-10-22T13:19:10.657Z",
      "authors": [
        "John Doe"
      ],
      "properties": {
        "kind": "text",
        "attrs": {}
      }
    },
    "callStack": []
  }
}
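
As a sketch of how a downstream consumer might read this event, the snippet below types only the fields visible in the example above and builds a one-line summary. The `DocumentEvent` interface, the optionality of the metadata fields, the element type of `callStack`, and the `summarize` helper are assumptions made for illustration, not types exported by Lakechain.

```typescript
// Hypothetical typing of the example event; only fields shown above are modeled.
interface DocumentRef {
  url: string;
  type: string;
  size: number;
  etag: string;
}

interface DocumentEvent {
  specversion: string;
  id: string;
  type: string;
  time: string;
  data: {
    chainId: string;
    source: DocumentRef;
    document: DocumentRef;
    metadata: {
      title?: string;
      createdAt?: string;
      authors?: string[];
      properties?: { kind: string; attrs: Record<string, unknown> };
    };
    callStack: string[]; // element type is an assumption
  };
}

// Build a short, human-readable summary of the processed e-mail.
function summarize(event: DocumentEvent): string {
  const { document, metadata } = event.data;
  const authors = (metadata.authors ?? []).join(', ') || 'unknown author';
  const title = metadata.title ?? '(no title)';
  return `${title} by ${authors} -> ${document.url} (${document.type})`;
}

const sampleEvent: DocumentEvent = JSON.parse(`{
  "specversion": "1.0",
  "id": "1780d5de-fd6f-4530-98d7-82ebee85ea39",
  "type": "document-created",
  "time": "2023-10-22T13:19:10.657Z",
  "data": {
    "chainId": "6ebf76e4-f70c-440c-98f9-3e3e7eb34c79",
    "source": { "url": "s3://bucket/email.eml", "type": "message/rfc822", "size": 24532, "etag": "1243cbd6cf145453c8b5519a2ada4779" },
    "document": { "url": "s3://bucket/email.txt", "type": "text/plain", "size": 125, "etag": "1243cbd6cf145453c8b5519a2ada4779" },
    "metadata": { "title": "Re: Hello World", "authors": ["John Doe"], "properties": { "kind": "text", "attrs": {} } },
    "callStack": []
  }
}`);

console.log(summarize(sampleEvent));
// prints: Re: Hello World by John Doe -> s3://bucket/email.txt (text/plain)
```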


🏗️ Architecture

This middleware runs on an ARM64 Lambda compute and packages the mailparser library to parse e-mail documents.



🏷️ Properties


Supported Inputs

| Mime Type | Description |
| --- | --- |
| message/rfc822 | E-mail documents. |
| application/vnd.ms-outlook | Outlook e-mail documents. |

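
For readers who want to pre-filter documents before they reach this middleware, the check below is a small sketch based on the supported input types listed above. The `isSupportedInput` helper is a hypothetical name, and stripping MIME parameters and ignoring case are design choices of this example rather than documented behavior of the middleware.

```typescript
// Input MIME types listed in the "Supported Inputs" table.
const SUPPORTED_INPUT_TYPES: ReadonlySet<string> = new Set([
  'message/rfc822',
  'application/vnd.ms-outlook'
]);

// Return true when the MIME type (ignoring parameters and case) is supported.
function isSupportedInput(mimeType: string): boolean {
  const essence = (mimeType.split(';')[0] ?? '').trim().toLowerCase();
  return SUPPORTED_INPUT_TYPES.has(essence);
}

console.log(isSupportedInput('message/rfc822'));                 // prints: true
console.log(isSupportedInput('Message/RFC822; charset=utf-8'));  // prints: true
console.log(isSupportedInput('text/plain'));                     // prints: false
```
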
Supported Outputs

The output type of this middleware depends on whether attachments are included. When attachments are included, the output type is */* because attachments can be of any type. Otherwise, the output type is the one associated with the selected output format.

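| Output Format | With Attachment | Mime Type | Description |
| --- | --- | --- | --- |
| text | No | text/plain | Plain text document. |
| text | Yes | */* | Any document. |
| html | No | text/html | HTML document. |
| html | Yes | */* | Any document. |
| json | No | application/json | JSON document. |
| json | Yes | */* | Any document. |
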
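
The rule described in this section can be sketched as a small lookup. This only restates the output table in code; the `OutputFormat` type and the `outputMimeType` function are hypothetical names for this example and are not part of the middleware's API.

```typescript
type OutputFormat = 'text' | 'html' | 'json';

// MIME type produced for each output format when attachments are not included.
const FORMAT_MIME_TYPES: Record<OutputFormat, string> = {
  text: 'text/plain',
  html: 'text/html',
  json: 'application/json'
};

// With attachments included, any type can be emitted, hence "*/*".
function outputMimeType(format: OutputFormat, includeAttachments: boolean): string {
  return includeAttachments ? '*/*' : FORMAT_MIME_TYPES[format];
}

console.log(outputMimeType('html', false)); // prints: text/html
console.log(outputMimeType('json', true));  // prints: */*
```
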
Supported Compute Types

| Type | Description |
| --- | --- |
| CPU | This middleware only supports CPU compute. |


📖 Examples