The e-mail text processor makes it easy to extract the textual content of e-mail documents and pipe it to other middlewares for further processing. This middleware can extract text, HTML, and structured JSON from e-mail documents. It can also optionally extract attachments and forward them as new documents to other middlewares.
📨 Parsing E-mails
To use this middleware, you import it in your CDK stack and instantiate it as part of a pipeline.
Output Formats
The e-mail text processor can extract the following formats from e-mail documents:
- text: Extracts only the textual body of the e-mail.
- html: Extracts the body of the e-mail as HTML.
- json: Extracts the body and the attributes of the e-mail as JSON.
💁 You can specify the output format using the withOutputFormat method. By default, the output format is text.
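To make the three formats concrete, here is a minimal, self-contained sketch of how a parsed e-mail could be rendered in each of them. It does not use the middleware itself: the `ParsedEmail` shape, its field names, and the `renderEmail` helper are hypothetical and only illustrate the idea, and the actual JSON structure produced by the middleware may differ.

```typescript
// Hypothetical types for illustration only; the middleware's own types may differ.
type OutputFormat = 'text' | 'html' | 'json';

interface ParsedEmail {
  subject: string;
  from: string;
  to: string[];
  text: string; // plain-text body
  html: string; // HTML body
}

// Renders a parsed e-mail in the requested format.
// The default mirrors the documented default output format, 'text'.
function renderEmail(email: ParsedEmail, format: OutputFormat = 'text'): string {
  switch (format) {
    case 'text':
      return email.text;
    case 'html':
      return email.html;
    case 'json':
      // Body and attributes serialized together (assumed layout).
      return JSON.stringify(email, null, 2);
  }
}

const sample: ParsedEmail = {
  subject: 'Quarterly report',
  from: 'alice@example.com',
  to: ['bob@example.com'],
  text: 'Please find the report attached.',
  html: '<p>Please find the report attached.</p>',
};

console.log(renderEmail(sample));         // plain-text body
console.log(renderEmail(sample, 'html')); // HTML body
console.log(renderEmail(sample, 'json')); // body and attributes as JSON
```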
Include Attachments
The e-mail text processor can optionally extract attachments from e-mail documents and forward them as new documents to other middlewares. You can enable the processing of attachments using the .withIncludeAttachments API.
Include Image Links
This middleware supports converting CID attachments to data URL images. You can enable this feature using the .withIncludeImageLinks API.
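As an illustration of what such a conversion can look like, the sketch below rewrites `cid:` references in an HTML body into `data:` URLs built from the matching inline attachments. This is one possible approach written under assumptions, not the middleware's actual implementation: the `InlineAttachment` shape and the `inlineCidImages` helper are hypothetical, and the attachment content is assumed to be base64-encoded already.

```typescript
// Hypothetical attachment shape for illustration only.
interface InlineAttachment {
  cid: string;         // Content-ID, without angle brackets
  contentType: string; // e.g. 'image/png'
  base64: string;      // attachment content, already base64-encoded
}

// Replaces every `cid:<id>` reference in the HTML with a data URL
// when an attachment with that Content-ID exists. Unknown references
// are left untouched.
function inlineCidImages(html: string, attachments: InlineAttachment[]): string {
  return html.replace(/cid:([^"'\s>]+)/g, (match: string, cid: string) => {
    const hit = attachments.filter((a) => a.cid === cid)[0];
    return hit ? `data:${hit.contentType};base64,${hit.base64}` : match;
  });
}

const body = '<img src="cid:logo@example.com">';
const inlined = inlineCidImages(body, [
  { cid: 'logo@example.com', contentType: 'image/png', base64: 'iVBORw0KGgo=' },
]);
console.log(inlined);
```

Embedding images as data URLs keeps the extracted HTML self-contained, at the cost of a larger document.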
📄 Metadata
The e-mail text processor transforms input e-mail documents into the desired output format. It also enriches the metadata of documents with additional information.
🏗️ Architecture
This middleware is based on a Lambda ARM64 compute and packages the mailparser library to parse e-mail documents.
🏷️ Properties
Supported Inputs
Mime Type | Description |
---|---|
message/rfc822 | E-mail documents. |
application/vnd.ms-outlook | Outlook e-mail documents. |
Supported Outputs
The supported output types for this middleware depend on whether the inclusion of attachments is enabled. If attachments are included, the output type is */*, as attachments can be of any type; otherwise, the output type is determined by the configured output format.
Output Format | With Attachment | Mime Type | Description |
---|---|---|---|
text | No | text/plain | Plain text document. |
text | Yes | */* | Any document. |
html | No | text/html | HTML document. |
html | Yes | */* | Any document. |
json | No | application/json | JSON document. |
json | Yes | */* | Any document. |
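
The table above can be summarized as a small lookup. The sketch below is only a restatement of that table in code; the `outputMimeType` helper and the `OutputFormat` type are hypothetical names and are not part of the middleware's API.

```typescript
// Hypothetical names for illustration only.
type OutputFormat = 'text' | 'html' | 'json';

// MIME type associated with each output format when attachments are not included.
const FORMAT_MIME_TYPES: Record<OutputFormat, string> = {
  text: 'text/plain',
  html: 'text/html',
  json: 'application/json',
};

// Resolves the declared output type for a given configuration.
function outputMimeType(format: OutputFormat, includeAttachments: boolean): string {
  // Attachments can be of any type, so the output type widens to */*.
  return includeAttachments ? '*/*' : FORMAT_MIME_TYPES[format];
}

console.log(outputMimeType('json', false));
console.log(outputMimeType('json', true));
```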
Supported Compute Types
Type | Description |
---|---|
CPU | This middleware only supports CPU compute. |
📖 Examples
- E-mail NLP Pipeline - An example showcasing how to analyze e-mails.