NLP
The NLP text processor middleware uses Amazon Comprehend to provide natural language processing capabilities to your pipelines. It allows you to analyze text documents, extract their substance, and use the results as input to other middlewares in your pipeline. For example, you can run sentiment analysis, PII detection, part-of-speech tagging, and named entity recognition on your text documents.
Intents
To use this middleware, you define an intent that specifies the type of processing you want to apply to text. Intents expose a functional API that makes it easy to describe the processing to perform on text documents.
In the following sections, we will explore several use-cases that demonstrate how to use intents.
Language Detection
Let's start with a simple example where we use Amazon Comprehend's ability to detect the language of text documents. The intent for this use-case extracts language information from text documents and stores it within the document metadata.
We're using the intent domain-specific language (DSL) to express actions within an intent.
Sentiment Analysis
Amazon Comprehend can also be used to perform sentiment analysis on text documents.
We're using the `language` intent action before the `sentiment` action, as the sentiment action needs to know the language of the document first. We're doing this by chaining both actions together.
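The ordering constraint described above can be modeled with a small chainable builder. The following is a self-contained sketch and not the middleware's DSL: `IntentSketch`, its method names, and the validation rule are hypothetical names chosen for illustration, and the set of language-dependent actions is taken from the notes on this page.

```typescript
// Hypothetical model of a chained intent; this is NOT the Lakechain DSL.
type Action = 'language' | 'sentiment' | 'pii' | 'pos' | 'entities';

// Actions that, according to this page, need the document language first.
const NEEDS_LANGUAGE: Action[] = ['sentiment', 'pii', 'pos', 'entities'];

class IntentSketch {
  private readonly actions: Action[] = [];

  language(): this { return this.add('language'); }
  sentiment(): this { return this.add('sentiment'); }
  pii(): this { return this.add('pii'); }
  pos(): this { return this.add('pos'); }
  entities(): this { return this.add('entities'); }

  // Returns a copy of the actions in the order they were chained.
  list(): Action[] { return this.actions.slice(); }

  private add(action: Action): this {
    const dependsOnLanguage = NEEDS_LANGUAGE.indexOf(action) !== -1;
    const languageKnown = this.actions.indexOf('language') !== -1;
    if (dependsOnLanguage && !languageKnown) {
      throw new Error(`'${action}' must be chained after 'language'`);
    }
    this.actions.push(action);
    return this;
  }
}

const chained = new IntentSketch().language().sentiment();
console.log(chained.list().join(' -> ')); // prints "language -> sentiment"
```

Calling `new IntentSketch().sentiment()` in this sketch throws, which mirrors the requirement that the language be detected before a language-dependent action runs.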
PII Detection
Amazon Comprehend can identify personally identifiable information (PII) entities in text documents. You can enable PII detection in the NLP text processor with the `pii` intent action.
We're using the `language` intent action before the `pii` action, as the PII action needs to know the language of the document first. We're doing this by chaining both actions together.
Part-of-Speech Tagging
Leveraging Comprehend's ability to perform part-of-speech tagging on text documents is as simple as using the `pos` intent action. POS tagging identifies the grammatical parts of speech in a sentence, such as verbs, nouns, adjectives, adverbs, etc.
We're using the `language` intent action before the `pos` action, as the POS action needs to know the language of the document first. We're doing this by chaining both actions together.
Entity Recognition
To use Amazon Comprehend's ability to recognize entities, such as events, locations, dates, and more, within your text documents, you can use the `entities` intent action.
We're using the `language` intent action before the `entities` action, as the entities action needs to know the language of the document first. We're doing this by chaining both actions together.
Reading Time
You can use the `readingTime` intent action to calculate the estimated reading time of a text document. The estimated reading time will be stored within the document metadata.
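A common way to estimate reading time is to divide the word count by an assumed reading speed. The sketch below illustrates that idea only: the function name and the default of 200 words per minute are assumptions, and the formula the middleware actually uses is not described on this page.

```typescript
// Hypothetical helper; the middleware's own formula may differ.
function estimateReadingTimeMinutes(
  text: string,
  wordsPerMinute: number = 200 // assumed reading speed
): number {
  if (wordsPerMinute <= 0) {
    throw new Error('wordsPerMinute must be positive');
  }
  // Split on runs of whitespace and ignore empty fragments.
  const words = text.trim().split(/\s+/).filter(w => w.length > 0);
  // Round up so that any non-empty text takes at least one minute.
  return Math.ceil(words.length / wordsPerMinute);
}

console.log(estimateReadingTimeMinutes('one two three', 2)); // prints 2
```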
Statistics
To extract other statistics from text documents, such as the number of sentences and words in the document, you can use the `stats` intent action.
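To make the kind of statistics mentioned above concrete, here is a minimal sketch that counts words and sentences. The `computeStats` name, the `TextStats` shape, and the naive splitting rules are assumptions for illustration; the middleware's own tokenization is not documented here and is likely more robust.

```typescript
// Hypothetical result shape; not the metadata format used by the middleware.
interface TextStats {
  words: number;
  sentences: number;
}

function computeStats(text: string): TextStats {
  // Words: non-empty fragments separated by whitespace.
  const words = text.trim().split(/\s+/).filter(w => w.length > 0);
  // Sentences: naive split on runs of '.', '!' or '?'.
  // Abbreviations such as "e.g." are miscounted by this simple rule.
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  return { words: words.length, sentences: sentences.length };
}

const sample = computeStats('Hello world. How are you?');
console.log(sample.words, sample.sentences); // prints "5 2"
```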
Combining Actions
All actions can be combined within a single intent, and the NLP text processor will execute them in the order in which they are defined.
Using Filters
Each action within the DSL supports one or more filters that you can apply to it. For example, the `pii`, `entities`, and `pos` actions support different filters.
As an illustration, an intent can extract PII with 90% or more confidence, entities associated with `PEOPLE` tags, and part-of-speech tags associated with adjectives and nouns with 90% or more confidence.
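Conceptually, a confidence filter and a tag filter both narrow down a list of detection results. The sketch below shows that idea on plain objects carrying a `type` and a `score` between 0 and 1. The `Detection` shape and the `minConfidence`, `hasType`, and `applyFilters` helpers are hypothetical and are not the filter functions exposed by the DSL; the `ADJ`, `NOUN`, and `VERB` tags follow the part-of-speech labels used by Amazon Comprehend.

```typescript
// Hypothetical, simplified detection result (not the Comprehend response type).
interface Detection {
  text: string;
  type: string;  // e.g. a part-of-speech tag or an entity type
  score: number; // confidence between 0 and 1
}

type Filter = (d: Detection) => boolean;

// Keeps results whose confidence is at or above the threshold.
const minConfidence = (threshold: number): Filter =>
  d => d.score >= threshold;

// Keeps results whose type is one of the given tags.
const hasType = (...types: string[]): Filter =>
  d => types.indexOf(d.type) !== -1;

// A result is kept only if it satisfies every filter.
function applyFilters(items: Detection[], ...filters: Filter[]): Detection[] {
  return items.filter(d => filters.every(f => f(d)));
}

const tokens: Detection[] = [
  { text: 'quick', type: 'ADJ', score: 0.98 },
  { text: 'fox', type: 'NOUN', score: 0.95 },
  { text: 'jumps', type: 'VERB', score: 0.99 },
  { text: 'lazy', type: 'ADJ', score: 0.72 },
];

const kept = applyFilters(tokens, minConfidence(0.9), hasType('ADJ', 'NOUN'));
console.log(kept.map(d => d.text).join(', ')); // prints "quick, fox"
```

Here `jumps` is dropped because it is a verb, and `lazy` is dropped because its confidence is below 90%.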
Architecture
The NLP text processor uses AWS Lambda as its compute, running on an ARM64 architecture. The Lambda function is integrated with the Amazon Comprehend service, and issues the appropriate API calls to process text documents according to the intent defined by the user.
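To give an idea of which API calls may be involved, the sketch below maps each intent action to the Amazon Comprehend operation that provides the corresponding capability. The operation names are real Comprehend operations, but the mapping itself, the `operationsFor` helper, and the treatment of `readingTime` and `stats` as needing no Comprehend call are assumptions, since this page does not list the calls the Lambda function makes.

```typescript
// Assumed mapping from intent actions to Amazon Comprehend operations.
const COMPREHEND_OPERATIONS: Record<string, string> = {
  language: 'DetectDominantLanguage',
  sentiment: 'DetectSentiment',
  pii: 'DetectPiiEntities',
  pos: 'DetectSyntax',
  entities: 'DetectEntities',
};

// Returns the Comprehend operations implied by a list of actions,
// preserving the order in which the actions were defined.
function operationsFor(actions: string[]): string[] {
  const operations: string[] = [];
  for (const action of actions) {
    // Actions without an entry (e.g. readingTime, stats) are assumed
    // to be computed without calling Comprehend.
    if (Object.prototype.hasOwnProperty.call(COMPREHEND_OPERATIONS, action)) {
      operations.push(COMPREHEND_OPERATIONS[action]);
    }
  }
  return operations;
}

console.log(operationsFor(['language', 'sentiment', 'readingTime']).join(', '));
// prints "DetectDominantLanguage, DetectSentiment"
```

The sentiment, PII, syntax, and entity operations accept a language code as input, which is consistent with the requirement to chain the `language` action before them.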
Properties
Supported Inputs
Mime Type | Description |
---|---|
text/plain | This middleware supports plain text as input. |
Supported Outputs
Mime Type | Description |
---|---|
text/plain | This middleware supports plain text as output. |
Supported Compute Types
Type | Description |
---|---|
CPU | This middleware only supports CPU compute. |
Examples
- NLP Pipeline - Builds a pipeline for extracting metadata from text-oriented documents.
- PII Redaction Pipeline - A PII redaction pipeline using Project Lakechain.