Ollama
Unstable API
0.8.0
@project-lakechain/ollama-embedding-processor
The Ollama embedding middleware enables customers to run Ollama embedding models within their AWS account to create vector embeddings for text and markdown documents.
To orchestrate deployments, this middleware deploys an auto-scaled cluster of CPU or GPU-enabled containers that consume documents from the middleware input queue. The cluster is deployed in the private subnet of the given VPC, and caches the models on an EFS storage to optimize cold-starts.
đĻ Embedding Documents
To use this middleware, you import it in your CDK stack and specify a VPC in which the processing cluster will be deployed. You will also need to select the specific embedding model to use.
âšī¸ The example below shows how to deploy this middleware in a VPC using the `nomic-embed-text` model.
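Below is a minimal sketch of such a deployment. It assumes the Lakechain builder pattern, an S3 trigger as the document source, and the `nomic-embed-text` model identifier; the exact builder methods and identifiers are assumptions and may differ from the current API.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { CacheStorage } from '@project-lakechain/core';
import { S3EventTrigger } from '@project-lakechain/s3-event-trigger';
import { OllamaEmbeddingProcessor, OllamaEmbeddingModel } from '@project-lakechain/ollama-embedding-processor';

export class EmbeddingStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // The VPC in which the processing cluster is deployed.
    const vpc = new ec2.Vpc(this, 'Vpc');

    // The cache storage shared by the middlewares.
    const cache = new CacheStorage(this, 'Cache');

    // Monitors an S3 bucket for new documents (assumed source middleware).
    const trigger = new S3EventTrigger.Builder()
      .withScope(this)
      .withIdentifier('Trigger')
      .withCacheStorage(cache)
      .withBucket(new s3.Bucket(this, 'Bucket'))
      .build();

    // Creates embeddings for text documents using the `nomic-embed-text` model.
    const embeddings = new OllamaEmbeddingProcessor.Builder()
      .withScope(this)
      .withIdentifier('OllamaEmbeddingProcessor')
      .withCacheStorage(cache)
      .withVpc(vpc)
      .withSource(trigger)
      .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
      .build();
  }
}
```

The infrastructure on which the model runs can be tuned as described in the Infrastructure section below.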
đ¤ Model Selection
Ollama supports a variety of embedding models, and you can specify the model and optionally the specific tag to use.
đ When no tag is provided, the `latest` tag is automatically used. The example below showcases how to use a specific tag on a model.
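As a sketch, and assuming the model references expose a `tag` helper, selecting a specific tag could look like the following; the helper name and the tag value are assumptions.

```typescript
import { OllamaEmbeddingModel } from '@project-lakechain/ollama-embedding-processor';

// References the `nomic-embed-text` model with an explicit tag
// (the `.tag()` helper is an assumption; the exact API may differ).
const model = OllamaEmbeddingModel.NOMIC_EMBED_TEXT.tag('v1.5');

// The model reference is then passed to the processor builder,
// e.g. `.withModel(model)`.
```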
Escape Hatch
The `OllamaEmbeddingModel` class provides a quick way to reference existing models and select a specific tag.
However, as Ollama adds new models, you may be in a situation where a model is not yet referenced by this middleware.
To address this situation, you can manually specify a model definition pointing to the supported Ollama embedding model you wish to run. You do so by specifying the name of the model in the Ollama library, the tag you wish to use, and its supported input and output mime-types.
đ In the example below, we define the `snowflake-arctic-embed` model.
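As an illustrative sketch, and assuming a static factory on `OllamaEmbeddingModel` for custom definitions, such a definition could look like the following; the factory name and the options shape are assumptions.

```typescript
import { OllamaEmbeddingModel } from '@project-lakechain/ollama-embedding-processor';

// Manually defines the `snowflake-arctic-embed` model by its name in the
// Ollama library, the tag to use, and its supported input and output
// mime-types (the `of` factory and the options shape are assumptions).
const model = OllamaEmbeddingModel.of('snowflake-arctic-embed', {
  tag: 'latest',
  inputs: ['text/plain', 'text/markdown'],
  outputs: ['text/plain', 'text/markdown']
});

// The definition is then passed to the processor builder,
// e.g. `.withModel(model)`.
```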
âī¸ Concurrency
The cluster of containers deployed by this middleware will auto-scale based on the number of documents that need to be processed. The cluster scales up to a maximum of 5 instances by default, and scales down to zero when there are no documents to process.
âšī¸ You can configure the maximum number of instances that the cluster can auto-scale to by using the `.withMaxConcurrency` method.
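For example, building on the deployment sketch above, raising the maximum concurrency to 10 instances could look like the following; builder methods other than `.withMaxConcurrency` are assumptions.

```typescript
const embeddings = new OllamaEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('OllamaEmbeddingProcessor')
  .withCacheStorage(cache)
  .withVpc(vpc)
  .withSource(trigger)
  .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
  // Allows the cluster to scale up to 10 instances.
  .withMaxConcurrency(10)
  .build();
```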
đĻ Batch Processing
Since version 0.2.0, Ollama supports processing documents in batches. This middleware can take advantage of Ollama's parallel requests feature to create embeddings for multiple documents in a single request, thus improving the overall throughput of the processing cluster.
âšī¸ You can configure the maximum number of documents to process in a single batch by using the `.withBatchSize` method. Note that the maximum batch size is set to 10, and that batching performance depends on the size of the chosen instance.
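As a sketch, again building on the deployment example above, setting a batch size of 5 could look like the following; apart from `.withBatchSize`, the builder methods are assumptions.

```typescript
const embeddings = new OllamaEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('OllamaEmbeddingProcessor')
  .withCacheStorage(cache)
  .withVpc(vpc)
  .withSource(trigger)
  .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
  // Creates embeddings for up to 5 documents per batch (maximum is 10).
  .withBatchSize(5)
  .build();
```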
đ Infrastructure
Every model requires a specific infrastructure to run optimally.
To ensure the `OllamaEmbeddingProcessor` orchestrates your models using the most optimal instance, memory, and GPU allocation, you need to specify an infrastructure definition.
đ The example below describes the infrastructure suited to run the `nomic-embed-text` model on a GPU instance.
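A sketch of such a definition is shown below, assuming an `InfrastructureDefinition` builder exported by `@project-lakechain/core`; the builder name and the instance sizing are illustrative assumptions.

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { InfrastructureDefinition } from '@project-lakechain/core';

const embeddings = new OllamaEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('OllamaEmbeddingProcessor')
  .withCacheStorage(cache)
  .withVpc(vpc)
  .withSource(trigger)
  .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
  // Runs the model on a single-GPU `g4dn.xlarge` instance and allocates
  // 15 GiB of memory to the container (illustrative values).
  .withInfrastructure(new InfrastructureDefinition.Builder()
    .withMaxMemory(15 * 1024)
    .withGpus(1)
    .withInstanceType(ec2.InstanceType.of(
      ec2.InstanceClass.G4DN,
      ec2.InstanceSize.XLARGE
    ))
    .build())
  .build();
```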
Below is a description of the fields associated with the infrastructure definition.
| Field | Description |
| --- | --- |
| `maxMemory` | The maximum RAM in MiB to allocate to the container. |
| `gpus` | The number of GPUs to allocate to the container (only relevant for GPU instances). |
| `instanceType` | The EC2 instance type to use for running the containers. |
đ Output
The Ollama embedding middleware does not modify or alter source documents in any way. Instead, it enriches the metadata of the documents with a pointer to the vector embeddings that were created for the document.
âšī¸ Limits
Embedding models have limits on the number of input tokens they can process. Consult the documentation of the specific model you are using to understand these limits.
đ To limit the size of upstream text documents, we recommend using a text splitter, such as the Recursive Character Text Splitter, to chunk text documents before they are passed to this middleware.
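As a sketch, and assuming the Recursive Character Text Splitter middleware from `@project-lakechain/recursive-character-text-splitter`, chunking documents before they reach the embedding processor could look like the following; the splitter builder methods and the chunk size are assumptions.

```typescript
import { RecursiveCharacterTextSplitter } from '@project-lakechain/recursive-character-text-splitter';
import { OllamaEmbeddingProcessor, OllamaEmbeddingModel } from '@project-lakechain/ollama-embedding-processor';

// Splits upstream text documents into smaller chunks
// (the chunk size value is an assumption).
const splitter = new RecursiveCharacterTextSplitter.Builder()
  .withScope(this)
  .withIdentifier('TextSplitter')
  .withCacheStorage(cache)
  .withSource(trigger)
  .withChunkSize(4096)
  .build();

// Uses the splitter as the source of the embedding processor, so that
// only chunked documents reach the embedding model.
const embeddings = new OllamaEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('OllamaEmbeddingProcessor')
  .withCacheStorage(cache)
  .withVpc(vpc)
  .withSource(splitter)
  .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
  .build();
```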
đī¸ Architecture
This middleware requires CPU or GPU-enabled instances to run the embedding models. To orchestrate deployments, it deploys an ECS auto-scaled cluster of containers that consume documents from the middleware input queue. The cluster is deployed in the private subnet of the given VPC, and caches the models on an EFS storage to optimize cold-starts.
âšī¸ The average cold-start for Ollama containers is around 3 minutes when no instances are running.
đˇī¸ Properties
Supported Inputs
| Mime Type | Description |
| --- | --- |
| `text/plain` | UTF-8 text documents. |
| `text/markdown` | UTF-8 markdown documents. |
Supported Outputs
| Mime Type | Description |
| --- | --- |
| `text/plain` | UTF-8 text documents. |
| `text/markdown` | UTF-8 markdown documents. |
Supported Compute Types
| Type | Description |
| --- | --- |
| CPU | This middleware supports CPU compute. |
| GPU | This middleware supports GPU compute. |
đ Examples
- Ollama LanceDB Pipeline - An example showcasing how to create vector embeddings for text documents using Ollama and store them in a LanceDB embedded database.