Bedrock

Unstable API

0.7.0

@project-lakechain/bedrock-embedding-processors

This package lets you use embedding models hosted on Amazon Bedrock to create vector embeddings for text and markdown documents within your pipelines. It exposes constructs that you can integrate into your pipelines, including Amazon Titan and Cohere embedding processors.


📝 Embedding Documents

To use the Bedrock embedding processors, you import the Titan or Cohere construct in your CDK stack and specify the embedding model you want to use.

Amazon Titan

ℹī¸ The below example demonstrates how to use the Amazon Titan embedding processor to create vector embeddings for text documents.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { TitanEmbeddingProcessor, TitanEmbeddingModel } from '@project-lakechain/bedrock-embedding-processors';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // Creates embeddings for input documents using Amazon Titan.
    // `source` is the upstream middleware defined elsewhere in your pipeline.
    const embeddingProcessor = new TitanEmbeddingProcessor.Builder()
      .withScope(this)
      .withIdentifier('BedrockEmbeddingProcessor')
      .withCacheStorage(cache)
      .withSource(source) // 👈 Specify a data source
      .withModel(TitanEmbeddingModel.AMAZON_TITAN_EMBED_TEXT_V1)
      .build();
  }
}


Cohere

ℹī¸ The below example uses one of the supported Cohere embedding models.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { CohereEmbeddingProcessor, CohereEmbeddingModel } from '@project-lakechain/bedrock-embedding-processors';
import { CacheStorage } from '@project-lakechain/core';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The cache storage.
    const cache = new CacheStorage(this, 'Cache');

    // Creates embeddings for input documents using a Cohere model.
    // `source` is the upstream middleware defined elsewhere in your pipeline.
    const embeddingProcessor = new CohereEmbeddingProcessor.Builder()
      .withScope(this)
      .withIdentifier('CohereEmbeddingProcessor')
      .withCacheStorage(cache)
      .withSource(source) // 👈 Specify a data source
      .withModel(CohereEmbeddingModel.COHERE_EMBED_MULTILINGUAL_V3)
      .build();
  }
}


Escape Hatches

Both the Titan and Cohere constructs reference the embedding models currently supported by Amazon Bedrock through the TitanEmbeddingModel and CohereEmbeddingModel classes. If a model is not yet referenced by these classes, you can specify a custom model identifier instead.

const embeddingProcessor = new TitanEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('BedrockEmbeddingProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  // Specify a custom embedding model to use.
  .withModel(TitanEmbeddingModel.of('specific.model-id'))
  .build();
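
To illustrate the escape hatch pattern described above, here is a minimal, self-contained sketch of a model descriptor class that exposes well-known models as static members alongside an `of` factory for arbitrary identifiers. The `EmbeddingModel` class below is hypothetical and is not the package's actual implementation; only the `of(...)` call shape and the `amazon.titan-embed-text-v1` identifier come from this page.

```typescript
// Hypothetical model descriptor, written for illustration only.
// It is NOT the implementation shipped in the Lakechain package.
class EmbeddingModel {
  // A well-known model exposed as a static member.
  static readonly AMAZON_TITAN_EMBED_TEXT_V1 =
    new EmbeddingModel('amazon.titan-embed-text-v1');

  // The constructor is private, so instances can only be obtained
  // through the static members or the `of` factory.
  private constructor(public readonly name: string) {}

  // Escape hatch: wraps an arbitrary model identifier.
  static of(name: string): EmbeddingModel {
    if (name.trim().length === 0) {
      throw new Error('A model identifier is required.');
    }
    return new EmbeddingModel(name);
  }
}

// A referenced model and a custom one are used interchangeably.
const known = EmbeddingModel.AMAZON_TITAN_EMBED_TEXT_V1;
const custom = EmbeddingModel.of('specific.model-id');
```

With a design along these lines, a builder method such as `withModel` can accept either value, because both are instances of the same type.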


🌐 Region Selection

You can specify the AWS region in which you want to invoke Amazon Bedrock using the .withRegion API. This can be helpful if Amazon Bedrock is not yet available in your deployment region.

💁 By default, the middleware will use the current region in which it is deployed.

const embeddingProcessor = new TitanEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('BedrockEmbeddingProcessor')
  .withCacheStorage(cache)
  .withSource(source)
  .withModel(TitanEmbeddingModel.AMAZON_TITAN_EMBED_TEXT_V1)
  .withRegion('eu-central-1') // 👈 Alternate region
  .build();


📄 Output

The Bedrock embedding processor does not modify source documents. Instead, it enriches the document metadata with a pointer to the vector embeddings created for that document, along with the model used and the number of dimensions.

💁 Example
{
  "specversion": "1.0",
  "id": "1780d5de-fd6f-4530-98d7-82ebee85ea39",
  "type": "document-created",
  "time": "2023-10-22T13:19:10.657Z",
  "data": {
    "chainId": "6ebf76e4-f70c-440c-98f9-3e3e7eb34c79",
    "source": {
      "url": "s3://bucket/document.txt",
      "type": "text/plain",
      "size": 245328,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "document": {
      "url": "s3://bucket/document.txt",
      "type": "text/plain",
      "size": 245328,
      "etag": "1243cbd6cf145453c8b5519a2ada4779"
    },
    "metadata": {
      "properties": {
        "kind": "text",
        "attrs": {
          "embeddings": {
            "vectors": "s3://cache-storage/bedrock-embedding-processor/45a42b35c3225085.json",
            "model": "amazon.titan-embed-text-v1",
            "dimensions": 1536
          }
        }
      }
    }
  }
}
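
As a rough sketch of how a downstream consumer might read this pointer, the snippet below extracts the embedding descriptor from an event shaped like the example above and splits the `vectors` URI into a bucket and a key. The helper names `getEmbeddings` and `parseS3Uri` and the `EmbeddingDescriptor` interface are hypothetical and introduced only for illustration; they are not part of the Lakechain API, and fetching the file from S3 is left out.

```typescript
// Shape of the embedding descriptor, mirroring the example event.
interface EmbeddingDescriptor {
  vectors: string;     // URI of the JSON file holding the vectors
  model: string;       // Identifier of the model that produced them
  dimensions: number;  // Size of each vector
}

// Returns the embedding descriptor attached to an event, or undefined
// when the expected metadata is absent or malformed.
function getEmbeddings(event: unknown): EmbeddingDescriptor | undefined {
  const e = (event as any)?.data?.metadata?.properties?.attrs?.embeddings;
  if (
    typeof e?.vectors === 'string' &&
    typeof e?.model === 'string' &&
    typeof e?.dimensions === 'number'
  ) {
    return { vectors: e.vectors, model: e.model, dimensions: e.dimensions };
  }
  return undefined;
}

// Splits an `s3://bucket/key` URI into its bucket and key parts.
function parseS3Uri(uri: string): { bucket: string; key: string } {
  const prefix = 's3://';
  const slash = uri.indexOf('/', prefix.length);
  if (!uri.startsWith(prefix) || slash <= prefix.length) {
    throw new Error(`Not an S3 object URI: ${uri}`);
  }
  return { bucket: uri.slice(prefix.length, slash), key: uri.slice(slash + 1) };
}

// A trimmed-down version of the example event.
const event = {
  data: { metadata: { properties: { kind: 'text', attrs: { embeddings: {
    vectors: 's3://cache-storage/bedrock-embedding-processor/45a42b35c3225085.json',
    model: 'amazon.titan-embed-text-v1',
    dimensions: 1536
  } } } } }
};

const embeddings = getEmbeddings(event);
if (embeddings) {
  const location = parseS3Uri(embeddings.vectors);
  console.log(location.bucket, location.key, embeddings.dimensions);
}
```

Validating the descriptor before use keeps a consumer from failing on documents that did not pass through an embedding processor.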


ℹī¸ Limits

Both the Titan and Cohere embedding models limit the number of input tokens they can process. See the Amazon Bedrock documentation for details on these limits.

💁 To limit the size of upstream text documents, we recommend using a text splitter, such as the Recursive Character Text Splitter, to chunk text documents before they are passed to this middleware.
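
To make the idea of chunking concrete, here is a minimal sketch of a paragraph-aware splitter. It is not the Recursive Character Text Splitter middleware; the `chunkText` function is hypothetical, and it bounds chunks by character count, which is only a rough proxy for the token limits enforced by the models.

```typescript
// Greedily packs paragraphs into chunks of at most `maxChars` characters.
// Paragraphs longer than `maxChars` are hard-split into fixed-size pieces.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) {
    throw new Error('maxChars must be positive');
  }
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of text.split(/\n{2,}/)) {
    // Break oversized paragraphs so every piece fits in a chunk.
    const pieces: string[] = [];
    for (let i = 0; i < paragraph.length; i += maxChars) {
      pieces.push(paragraph.slice(i, i + maxChars));
    }

    for (const piece of pieces) {
      // Try to append the piece to the chunk being built.
      const candidate = current ? `${current}\n\n${piece}` : piece;
      if (candidate.length <= maxChars) {
        current = candidate;
      } else {
        // The chunk is full: emit it and start a new one.
        chunks.push(current);
        current = piece;
      }
    }
  }

  if (current) {
    chunks.push(current);
  }
  return chunks;
}

const chunks = chunkText('aaaa\n\nbbbb\n\ncccc', 10);
console.log(chunks.length);
```

Each resulting chunk can then be embedded on its own, which keeps individual requests within the size the model accepts.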

Furthermore, this middleware processes at most 10 documents concurrently from its input queue, so that it does not exceed the limits of the embedding models it uses. See Bedrock Quotas for more information.



🏗ī¸ Architecture

The middlewares in this package run on AWS Lambda using the ARM64 architecture, and integrate with Amazon Bedrock to generate embeddings for text documents.

[Figure: Architecture diagram]



🏷ī¸ Properties


Supported Inputs

Mime Type        Description
text/plain       UTF-8 text documents.
text/markdown    UTF-8 markdown documents.

Supported Outputs

Mime Type        Description
text/plain       UTF-8 text documents.
text/markdown    UTF-8 markdown documents.

Supported Compute Types

Type    Description
CPU     This middleware only supports CPU compute.


📖 Examples