KeyBERT
This middleware is based on the KeyBERT keyword extraction and topic modeling library. It leverages the power of embedding models to identify the most significant keywords and topics in a text document, and to enrich the document metadata with them.
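As a rough illustration of the idea behind embedding-based keyword extraction, the sketch below ranks candidate words by the cosine similarity between each word's vector and the vector of the whole document. It is a toy stand-in, not KeyBERT's implementation: the `embed` function here is a hypothetical bag-of-words counter used in place of a real sentence-embedding model, and `rank_keywords`, `top_n`, and the stop-word list are names chosen for this example only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def embed(text, vocabulary):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_keywords(text, top_n=3,
                  stop_words=frozenset({"the", "a", "and", "of", "to", "in"})):
    """Rank candidate words by similarity to the document vector."""
    vocabulary = sorted(set(tokenize(text)))
    candidates = [w for w in vocabulary if w not in stop_words]
    doc_vector = embed(text, vocabulary)
    scored = [(w, cosine(embed(w, vocabulary), doc_vector)) for w in candidates]
    # Sort by descending score, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


keywords = rank_keywords("The cat chased the mouse and the cat slept", top_n=2)
for word, score in keywords:
    print(f"{word}: {score:.3f}")
```

With a real embedding model, the similarity reflects semantic closeness rather than raw word counts, which is what lets the library surface keywords that best represent the document as a whole.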
Keyword Extraction
To use this middleware, import it in your CDK stack and connect it to a data source that provides text documents, such as the S3 Trigger if your documents are stored in S3.
Embedding Model
You can customize the embedding model that KeyBERT uses to analyze input documents.
At this time, only models from the Sentence Transformers library are supported.
Options
You can optionally customize several options that influence how the KeyBERT library extracts topics from input documents.
Output
The KeyBERT text processor middleware does not modify source documents in any way. Instead, it enriches the metadata of each document with a collection of topics extracted from its text.
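The enrichment behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions: the event layout (a `document` description next to a `metadata` object), the `keywords` field name, and the `enrich_with_keywords` helper are all hypothetical and may not match the middleware's actual output schema.

```python
import copy


def enrich_with_keywords(event, keywords):
    """Return a copy of the event whose metadata carries the extracted keywords.

    The document description is copied unchanged; only metadata is extended.
    """
    enriched = copy.deepcopy(event)
    metadata = enriched.setdefault("metadata", {})
    metadata["keywords"] = list(keywords)  # hypothetical field name
    return enriched


# Hypothetical input event describing a plain-text document.
event = {
    "document": {"url": "s3://example-bucket/report.txt", "type": "text/plain"},
    "metadata": {},
}

enriched = enrich_with_keywords(event, ["machine learning", "nlp"])
print(enriched["metadata"])
```

Working on a deep copy keeps the incoming event untouched, which mirrors the point that the source document itself is never altered and only its metadata grows.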
Architecture
The KeyBERT middleware runs in a Lambda function that executes the KeyBERT library packaged as a Docker container. The function runs within a VPC and caches KeyBERT embedding models on an EFS file system.
Properties
Supported Inputs
| Mime Type | Description |
|---|---|
| text/plain | UTF-8 text documents. |
Supported Outputs
| Mime Type | Description |
|---|---|
| text/plain | UTF-8 text documents. |
Supported Compute Types
| Type | Description |
|---|---|
| CPU | This middleware only supports CPU compute. |
Examples
- Topic Modeling Pipeline - An example showcasing how to extract relevant topics from text documents.