Extractor
The `StructuredEntityExtractor` middleware extracts structured entities from text documents using large language models. It relies on models hosted on Amazon Bedrock, along with Bedrock Tools, to perform the extraction.
To ground the extraction process, you specify a Zod-typed schema describing your entities and their constraints. The middleware validates the LLM-generated data against that schema and automatically re-prompts the underlying model with any validation errors to improve the quality of the extraction.
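The validate-and-re-prompt behavior described above can be pictured with a small sketch. This is not the middleware's implementation: `Validator` stands in for a Zod schema, `Model` stands in for a Bedrock invocation (which would be asynchronous in practice), and `extractWithRetries`, `maxAttempts`, and the sample data are hypothetical names chosen for illustration.

```typescript
// A validation issue, similar in spirit to what a schema library reports.
type Issue = { path: string; message: string };

// Hypothetical stand-in for a schema: returns the issues found in a candidate.
type Validator = (candidate: unknown) => Issue[];

// Hypothetical stand-in for a model call: receives the prompt and prior errors.
type Model = (prompt: string, feedback: Issue[]) => unknown;

function extractWithRetries(
  prompt: string,
  model: Model,
  validate: Validator,
  maxAttempts = 3
): unknown {
  let feedback: Issue[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Errors from the previous attempt are passed back to the model.
    const candidate = model(prompt, feedback);
    feedback = validate(candidate);
    if (feedback.length === 0) {
      return candidate; // The output satisfies the schema.
    }
  }
  const details = feedback.map((issue) => issue.message).join('; ');
  throw new Error(`Extraction failed validation: ${details}`);
}

// Example validator: requires a non-empty string `title`.
const validateArticle: Validator = (candidate) => {
  const title = (candidate as { title?: unknown } | null)?.title;
  return typeof title === 'string' && title.length > 0
    ? []
    : [{ path: 'title', message: 'title must be a non-empty string' }];
};

// Fake model: returns an invalid title first, then a valid one once it
// receives validation feedback.
const fakeModel: Model = (_prompt, feedback) =>
  feedback.length === 0 ? { title: 42 } : { title: 'Quarterly report' };

const article = extractWithRetries('Extract the title.', fakeModel, validateArticle);
```

Bounding the number of attempts keeps a model that never satisfies the schema from looping indefinitely; whether and how the middleware itself limits retries is not stated in this section.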
Data Extraction
To start using this middleware in your pipelines, import the `StructuredEntityExtractor` construct in your CDK stack and specify the schema you want to use for the extraction.
A typical use is to extract data from text documents provided by upstream middlewares.
Model Selection
You can select the Bedrock model to use with this middleware using the `.withModel` API.
By default, the extractor uses the `anthropic.claude-3-sonnet-20240229-v1:0` model from Amazon Bedrock.
You can choose among the following models for structured entity extraction.
Model Name | Model identifier |
---|---|
Claude 3 Haiku | anthropic.claude-3-haiku-20240307-v1:0 |
Claude 3 Sonnet | anthropic.claude-3-sonnet-20240229-v1:0 |
Claude 3.5 Sonnet | anthropic.claude-3-5-sonnet-20240620-v1:0 |
Claude 3 Opus | anthropic.claude-3-opus-20240229-v1:0 |
Cohere Command R | cohere.command-r-v1:0 |
Cohere Command R+ | cohere.command-r-plus-v1:0 |
Llama 3.1 8B | meta.llama3-1-8b-instruct-v1:0 |
Llama 3.1 70B | meta.llama3-1-70b-instruct-v1:0 |
Llama 3.1 405B | meta.llama3-1-405b-instruct-v1:0 |
Mistral Large | mistral.mistral-large-2402-v1:0 |
Mistral Large v2 | mistral.mistral-large-2407-v1:0 |
Mistral Small | mistral.mistral-small-2402-v1:0 |
Instructions
In some cases, it helps to provide domain-specific instructions to the model to obtain a more accurate extraction. You can provide these instructions using the `.withInstructions` API.
For example, if you want to translate a text and obtain only the translation, with no preamble or other explanation, you can state that requirement in the instructions.
Output
By default, the structured data extracted from documents by this middleware is returned as a new JSON document that can be consumed by the next middlewares in a pipeline. Alternatively, you can embed the extracted data inside the document metadata and leave the original document unchanged.
You can control this behavior using the `.withOutputType` API.
When `metadata` is set as the output type, the original document is used as the output of this middleware without any changes, and the extracted data can be accessed from the document metadata.
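To make the two output types concrete, here is a minimal sketch of how they differ. It is an illustration under assumptions, not the middleware's code: `TextDocument`, `buildOutput`, and the `extractedEntities` metadata key are hypothetical, and the actual location of the extracted data in the document metadata is not specified in this section.

```typescript
type OutputType = 'json' | 'metadata';

// Hypothetical, simplified view of a document flowing through a pipeline.
interface TextDocument {
  type: string; // MIME type of the content.
  content: string;
  metadata: Record<string, unknown>;
}

function buildOutput(
  input: TextDocument,
  extracted: Record<string, unknown>,
  outputType: OutputType = 'json'
): TextDocument {
  if (outputType === 'metadata') {
    // The original type and content are kept; the extracted data is attached
    // under a hypothetical `extractedEntities` metadata key.
    return {
      ...input,
      metadata: { ...input.metadata, extractedEntities: extracted }
    };
  }
  // Default: a new JSON document carries the extracted data.
  return {
    type: 'application/json',
    content: JSON.stringify(extracted),
    metadata: input.metadata
  };
}

// Illustrative data only.
const source: TextDocument = {
  type: 'text/plain',
  content: 'Invoice 12 issued by ACME',
  metadata: { language: 'en' }
};
const entities = { vendor: 'ACME', invoiceNumber: 12 };

const asJson = buildOutput(source, entities);
const asMetadata = buildOutput(source, entities, 'metadata');
```

With the `metadata` output type, downstream middlewares still receive the original text, which is useful when later steps need both the source content and the extracted entities.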
Region Selection
You can specify the AWS region in which to invoke the Amazon Bedrock model using the `.withRegion` API. This can be helpful if Amazon Bedrock is not yet available in your deployment region.
By default, the middleware uses the region in which it is deployed.
Model Parameters
You can optionally forward specific parameters to the underlying LLM using the `.withModelParameters` method. The supported parameters are described below.
Parameter | Description | Min | Max | Default |
---|---|---|---|---|
temperature | Controls the randomness of the generated text. | 0 | 1 | 0 |
maxTokens | The maximum number of tokens to generate. | 1 | 4096 | 4096 |
topP | The cumulative probability of the top tokens to sample from. | 0 | 1 | N/A |
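The bounds and defaults in the table can be expressed as a small validation helper. This is a sketch under assumptions: `resolveModelParameters` and `BOUNDS` are hypothetical names, and rejecting out-of-range values with an error is a choice made for this example, since the section does not describe how the middleware itself reacts to invalid values.

```typescript
interface ModelParameters {
  temperature?: number;
  maxTokens?: number;
  topP?: number;
}

type Bound = { min: number; max: number; fallback?: number };

// Bounds and defaults taken from the table above.
const BOUNDS: Record<keyof ModelParameters, Bound> = {
  temperature: { min: 0, max: 1, fallback: 0 },
  maxTokens: { min: 1, max: 4096, fallback: 4096 },
  topP: { min: 0, max: 1 } // No default (N/A in the table).
};

function resolveModelParameters(params: ModelParameters = {}): ModelParameters {
  const resolved: ModelParameters = {};
  for (const name of Object.keys(BOUNDS) as (keyof ModelParameters)[]) {
    const { min, max, fallback } = BOUNDS[name];
    const value = params[name] ?? fallback;
    if (value === undefined) {
      continue; // Leave parameters without a default unset.
    }
    if (!Number.isFinite(value) || value < min || value > max) {
      throw new RangeError(`${name} must be between ${min} and ${max}, got ${value}`);
    }
    resolved[name] = value;
  }
  return resolved;
}

// No arguments: only the documented defaults are applied.
const defaultParameters = resolveModelParameters();

// Explicit values override defaults; maxTokens falls back to 4096.
const customParameters = resolveModelParameters({ temperature: 0.3, topP: 0.9 });
```

A temperature of 0 as the default favors repeatable output, which suits extraction tasks where the same document should yield the same entities.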
Composite Events
In addition to handling single documents, this middleware also supports composite events as an input. This means that it can take multiple documents and compile them into a single input to the model.
This is useful in map-reduce pipelines where you use the Reducer to combine multiple semantically related documents into a single input, for example, multiple pages of a PDF document from which you want the model to extract data while keeping the context between the pages.
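As a rough picture of what compiling several documents into a single model input can look like, consider the sketch below. The `PageDocument` shape, the `compileDocuments` helper, the delimiter format, and the sample URLs are all assumptions made for illustration; the section does not specify how the middleware formats composite events.

```typescript
// Hypothetical, simplified view of one document inside a composite event.
interface PageDocument {
  url: string;
  text: string;
}

// Concatenates documents in order, labeling each one so that the model can
// tell pages apart while still seeing them in a single context.
function compileDocuments(documents: PageDocument[]): string {
  return documents
    .map(
      (doc, index) =>
        `<document index="${index + 1}" source="${doc.url}">\n${doc.text}\n</document>`
    )
    .join('\n');
}

// Illustrative data only.
const pages: PageDocument[] = [
  { url: 's3://example-bucket/report/page-1.txt', text: 'Revenue grew in Q1.' },
  { url: 's3://example-bucket/report/page-2.txt', text: 'The growth continued in Q2.' }
];

const modelInput = compileDocuments(pages);
```

Preserving page order matters here: an entity introduced on one page may only be fully described on the next, and a single ordered input lets the model resolve such references.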
Architecture
This middleware is based on a Lambda compute running on an ARM64 architecture, and integrates with Amazon Bedrock to extract data from the given documents.
Properties
Supported Inputs
Mime Type | Description |
---|---|
text/plain | UTF-8 text documents. |
text/markdown | Markdown documents. |
text/csv | CSV documents. |
text/html | HTML documents. |
application/x-subrip | SubRip subtitles. |
text/vtt | Web Video Text Tracks (WebVTT) subtitles. |
application/json | JSON documents. |
application/xml | XML documents. |
application/cloudevents+json | Composite events emitted by the Reducer. |
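If you want to check ahead of time whether a document type is accepted, the table above translates into a simple lookup. This is a hypothetical helper, not part of the middleware: `isSupportedInput` and `SUPPORTED_INPUT_TYPES` are names introduced for this sketch.

```typescript
// MIME types copied from the table of supported inputs.
const SUPPORTED_INPUT_TYPES: readonly string[] = [
  'text/plain',
  'text/markdown',
  'text/csv',
  'text/html',
  'application/x-subrip',
  'text/vtt',
  'application/json',
  'application/xml',
  'application/cloudevents+json'
];

function isSupportedInput(mimeType: string): boolean {
  // Drop parameters such as "; charset=utf-8" and normalize case before comparing.
  const [essence = ''] = mimeType.split(';');
  return SUPPORTED_INPUT_TYPES.indexOf(essence.trim().toLowerCase()) !== -1;
}

const acceptsMarkdown = isSupportedInput('text/markdown');
const acceptsPdf = isSupportedInput('application/pdf');
```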
Supported Outputs
The supported output depends on the output type specified using the `.withOutputType` API. By default, this middleware returns a JSON document with the `application/json` type.
If the output type is set to `metadata`, the original document is used as the output of this middleware.
Supported Compute Types
Type | Description |
---|---|
CPU | This middleware only supports CPU compute. |
Examples
- Structured Data Extraction Pipeline - Builds a pipeline for structured data extraction from documents using Amazon Bedrock.
- Bedrock Translation Pipeline - An example showcasing how to translate documents using LLMs on Amazon Bedrock.