Text Transform
The Text transform processor operates on plain text documents and lets you perform operations on their text, such as string replacement, base64 encoding, and substring extraction.
This middleware supports semantic operations: it understands the metadata associated with a document and uses it as context when transforming the text. For example, it can leverage PII, Part-of-Speech, or Named Entity metadata to apply transformations such as substring redaction.
📝 Transforming Text
To use this middleware, import it in your CDK stack and instantiate it as part of a pipeline.
💁 In the example below, we redact Personally Identifiable Information (PII) from input documents. Note that PII metadata must be made available on the document before it is processed by the text transform processor.
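To make the idea of metadata-driven redaction concrete, here is a minimal, library-independent sketch in TypeScript. It does not use the middleware's actual API: the `PiiEntity` shape, its field names, and the `redactPii` helper are assumptions made for illustration. The sketch also assumes the entity spans do not overlap.

```typescript
// Hypothetical shape of a PII entity detected by an upstream step
// and attached to the document metadata (field names are assumptions).
interface PiiEntity {
  type: string;        // e.g. "EMAIL"
  beginOffset: number; // inclusive start index in the text
  endOffset: number;   // exclusive end index in the text
}

// Replaces every PII span with a placeholder naming its type.
// Assumes spans do not overlap.
function redactPii(text: string, entities: PiiEntity[]): string {
  // Process spans from right to left so that earlier offsets
  // remain valid while the string is being modified.
  const sorted = [...entities].sort((a, b) => b.beginOffset - a.beginOffset);
  let result = text;
  for (const entity of sorted) {
    result =
      result.slice(0, entity.beginOffset) +
      `[${entity.type}]` +
      result.slice(entity.endOffset);
  }
  return result;
}

const text = 'Contact Jane at jane@example.com today.';
const entities: PiiEntity[] = [
  { type: 'NAME', beginOffset: 8, endOffset: 12 },
  { type: 'EMAIL', beginOffset: 16, endOffset: 32 },
];

const redacted = redactPii(text, entities);
console.log(redacted); // prints "Contact [NAME] at [EMAIL] today."
```

The sketch shows why PII metadata must exist before the transform runs: the redaction step only rewrites the spans it is given and performs no detection of its own.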
Chaining transformations
You can chain operations to be applied to the input text. In the example below, we first select a substring of the text document, replace several words in it, and then encode the result in base64.
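The chaining behavior can be pictured as function composition: each operation receives the output of the previous one. The following sketch is a conceptual illustration only, not the middleware's API. The `Transform` type and the `substring`, `replace`, `base64`, and `chain` helpers are hypothetical names, and the code assumes a Node.js runtime for `Buffer`.

```typescript
import { Buffer } from 'node:buffer';

// A transform maps an input string to an output string.
type Transform = (input: string) => string;

// Hypothetical helpers, each returning a Transform.
const substring = (start: number, end: number): Transform =>
  (s) => s.slice(start, end);

// Replaces every literal occurrence of `search` with `replacement`.
const replace = (search: string, replacement: string): Transform =>
  (s) => s.split(search).join(replacement);

const base64: Transform = (s) => Buffer.from(s, 'utf8').toString('base64');

// Applies the given transforms in order, left to right.
const chain = (...ops: Transform[]): Transform =>
  (input) => ops.reduce((acc, op) => op(acc), input);

const pipeline = chain(
  substring(0, 11),            // keep "hello world"
  replace('hello', 'goodbye'), // "goodbye world"
  replace('world', 'moon'),    // "goodbye moon"
  base64                       // encode the final text
);

const encoded = pipeline('hello world, and more text');
console.log(encoded);
```

Order matters in such a chain: encoding to base64 before the replacements would leave the replacement steps with nothing to match.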
🏗️ Architecture
This middleware is based on an AWS Lambda function that runs the logic associated with the text transform processor.
🏷️ Properties
Supported Inputs
Mime Type | Description |
---|---|
text/plain | Plain text documents. |
Supported Outputs
Mime Type | Description |
---|---|
text/plain | Plain text documents. |
Supported Compute Types
Type | Description |
---|---|
CPU | This middleware only supports CPU compute. |
📖 Examples
- PII Redaction Pipeline - A PII redaction pipeline using Project Lakechain.