Ollama
Unstable API
0.8.0
@project-lakechain/ollama-embedding-processor
The Ollama embedding middleware enables customers to run Ollama embedding models within their AWS account to create vector embeddings for text and markdown documents.
To orchestrate deployments, this middleware deploys an auto-scaled cluster of CPU or GPU-enabled containers that consume documents from the middleware input queue. The cluster is deployed in the private subnet of the given VPC, and caches the models on an EFS storage to optimize cold-starts.
đĻ Embedding Documents
To use this middleware, you import it in your CDK stack and specify a VPC in which the processing cluster will be deployed. You will also need to select the specific embedding model to use.
âšī¸ The example below shows how to deploy this middleware in a VPC using the `nomic-embed-text` model.
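Below is a minimal sketch of such a deployment. It assumes the Lakechain builder pattern, an S3 trigger as the document source, and the `nomic-embed-text` model identifier; the exact builder methods and identifiers are assumptions and may differ from the current API.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { CacheStorage } from '@project-lakechain/core';
import { S3EventTrigger } from '@project-lakechain/s3-event-trigger';
import { OllamaEmbeddingProcessor, OllamaEmbeddingModel } from '@project-lakechain/ollama-embedding-processor';

export class EmbeddingStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // The VPC in which the processing cluster is deployed.
    const vpc = new ec2.Vpc(this, 'Vpc');

    // The cache storage shared by the middlewares.
    const cache = new CacheStorage(this, 'Cache');

    // Monitors an S3 bucket for new documents (assumed source middleware).
    const trigger = new S3EventTrigger.Builder()
      .withScope(this)
      .withIdentifier('Trigger')
      .withCacheStorage(cache)
      .withBucket(new s3.Bucket(this, 'Bucket'))
      .build();

    // Creates embeddings for text documents using the `nomic-embed-text` model.
    const embeddings = new OllamaEmbeddingProcessor.Builder()
      .withScope(this)
      .withIdentifier('OllamaEmbeddingProcessor')
      .withCacheStorage(cache)
      .withVpc(vpc)
      .withSource(trigger)
      .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
      .build();
  }
}
```

The infrastructure on which the model runs can be tuned as described in the Infrastructure section below.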
đ¤ Model Selection
Ollama supports a variety of embedding models, and you can specify the model and optionally the specific tag to use.
đ When no tag is provided, the `latest` tag is automatically used. The example below showcases how to use a specific tag on a model.
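As a sketch, and assuming the model references expose a `tag` helper, selecting a specific tag could look like the following; the helper name and the tag value are assumptions.

```typescript
import { OllamaEmbeddingModel } from '@project-lakechain/ollama-embedding-processor';

// References the `nomic-embed-text` model with an explicit tag
// (the `.tag()` helper is an assumption; the exact API may differ).
const model = OllamaEmbeddingModel.NOMIC_EMBED_TEXT.tag('v1.5');

// The model reference is then passed to the processor builder,
// e.g. `.withModel(model)`.
```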
Escape Hatch
The `OllamaEmbeddingModel` class provides a quick way to reference existing models and select a specific tag.
However, as Ollama adds new models, you may be in a situation where a model is not yet referenced by this middleware.
To address this situation, you can manually specify a model definition pointing to the supported Ollama embedding model you wish to run. You do so by specifying the name of the model in the Ollama library, the tag you wish to use, and its supported input and output mime-types.
đ In the example below, we define the `snowflake-arctic-embed` model.
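As an illustrative sketch, and assuming a static factory on `OllamaEmbeddingModel` for custom definitions, such a definition could look like the following; the factory name and the options shape are assumptions.

```typescript
import { OllamaEmbeddingModel } from '@project-lakechain/ollama-embedding-processor';

// Manually defines the `snowflake-arctic-embed` model by its name in the
// Ollama library, the tag to use, and its supported input and output
// mime-types (the `of` factory and the options shape are assumptions).
const model = OllamaEmbeddingModel.of('snowflake-arctic-embed', {
  tag: 'latest',
  inputs: ['text/plain', 'text/markdown'],
  outputs: ['text/plain', 'text/markdown']
});

// The definition is then passed to the processor builder,
// e.g. `.withModel(model)`.
```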
âī¸ Concurrency
The cluster of containers deployed by this middleware will auto-scale based on the number of documents that need to be processed. The cluster scales up to a maximum of 5 instances by default, and scales down to zero when there are no documents to process.
âšī¸ You can configure the maximum number of instances that the cluster can auto-scale to by using the `.withMaxConcurrency` method.
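For example, building on the deployment sketch above, raising the maximum concurrency to 10 instances could look like the following; builder methods other than `.withMaxConcurrency` are assumptions.

```typescript
const embeddings = new OllamaEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('OllamaEmbeddingProcessor')
  .withCacheStorage(cache)
  .withVpc(vpc)
  .withSource(trigger)
  .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
  // Allows the cluster to scale up to 10 instances.
  .withMaxConcurrency(10)
  .build();
```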
đĻ Batch Processing
Since version 0.2.0, Ollama supports processing documents in batches. This middleware can take advantage of Ollama's parallel requests feature to create embeddings for multiple documents in a single request, thus improving the overall throughput of the processing cluster.
âšī¸ You can configure the maximum number of documents to process in a single batch by using the `.withBatchSize` method. Note that the maximum batch size is set to 10, and that batching performance depends on the size of the chosen instance.
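As a sketch, again building on the deployment example above, setting a batch size of 5 could look like the following; apart from `.withBatchSize`, the builder methods are assumptions.

```typescript
const embeddings = new OllamaEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('OllamaEmbeddingProcessor')
  .withCacheStorage(cache)
  .withVpc(vpc)
  .withSource(trigger)
  .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
  // Creates embeddings for up to 5 documents per batch (maximum is 10).
  .withBatchSize(5)
  .build();
```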
đ Infrastructure
Every model requires a specific infrastructure to run optimally.
To ensure the `OllamaEmbeddingProcessor` orchestrates your models using the most optimal instance, memory, and GPU allocation, you need to specify an infrastructure definition.
đ The example below describes the infrastructure suited to run the `nomic-embed-text` model on a GPU instance.
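A sketch of such a definition is shown below, assuming an `InfrastructureDefinition` builder exported by `@project-lakechain/core`; the builder name and the instance sizing are illustrative assumptions.

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { InfrastructureDefinition } from '@project-lakechain/core';

const embeddings = new OllamaEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('OllamaEmbeddingProcessor')
  .withCacheStorage(cache)
  .withVpc(vpc)
  .withSource(trigger)
  .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
  // Runs the model on a single-GPU `g4dn.xlarge` instance and allocates
  // 15 GiB of memory to the container (illustrative values).
  .withInfrastructure(new InfrastructureDefinition.Builder()
    .withMaxMemory(15 * 1024)
    .withGpus(1)
    .withInstanceType(ec2.InstanceType.of(
      ec2.InstanceClass.G4DN,
      ec2.InstanceSize.XLARGE
    ))
    .build())
  .build();
```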
Below is a description of the fields associated with the infrastructure definition.
| Field | Description |
| --- | --- |
| `maxMemory` | The maximum RAM in MiB to allocate to the container. |
| `gpus` | The number of GPUs to allocate to the container (only relevant for GPU instances). |
| `instanceType` | The EC2 instance type to use for running the containers. |
đ Output
The Ollama embedding middleware does not modify or alter source documents in any way. Instead, it enriches the metadata of the documents with a pointer to the vector embeddings that were created for the document.
âšī¸ Limits
Embedding models have limits on the number of input tokens they can process. Consult the documentation of the specific model you are using to understand these limits.
đ To limit the size of upstream text documents, we recommend using a text splitter, such as the Recursive Character Text Splitter, to chunk text documents before they are passed to this middleware.
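As a sketch, and assuming the Recursive Character Text Splitter middleware from `@project-lakechain/recursive-character-text-splitter`, chunking documents before they reach the embedding processor could look like the following; the splitter builder methods and the chunk size are assumptions.

```typescript
import { RecursiveCharacterTextSplitter } from '@project-lakechain/recursive-character-text-splitter';
import { OllamaEmbeddingProcessor, OllamaEmbeddingModel } from '@project-lakechain/ollama-embedding-processor';

// Splits upstream text documents into smaller chunks
// (the chunk size value is an assumption).
const splitter = new RecursiveCharacterTextSplitter.Builder()
  .withScope(this)
  .withIdentifier('TextSplitter')
  .withCacheStorage(cache)
  .withSource(trigger)
  .withChunkSize(4096)
  .build();

// Uses the splitter as the source of the embedding processor, so that
// only chunked documents reach the embedding model.
const embeddings = new OllamaEmbeddingProcessor.Builder()
  .withScope(this)
  .withIdentifier('OllamaEmbeddingProcessor')
  .withCacheStorage(cache)
  .withVpc(vpc)
  .withSource(splitter)
  .withModel(OllamaEmbeddingModel.NOMIC_EMBED_TEXT)
  .build();
```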
đī¸ Architecture
This middleware requires CPU or GPU-enabled instances to run the embedding models. To orchestrate deployments, it deploys an ECS auto-scaled cluster of containers that consume documents from the middleware input queue. The cluster is deployed in the private subnet of the given VPC, and caches the models on an EFS storage to optimize cold-starts.
âšī¸ The average cold-start for Ollama containers is around 3 minutes when no instances are running.
đˇī¸ Properties
Supported Inputs
| Mime Type | Description |
| --- | --- |
| `text/plain` | UTF-8 text documents. |
| `text/markdown` | UTF-8 markdown documents. |
Supported Outputs
| Mime Type | Description |
| --- | --- |
| `text/plain` | UTF-8 text documents. |
| `text/markdown` | UTF-8 markdown documents. |
Supported Compute Types
| Type | Description |
| --- | --- |
| CPU | This middleware supports CPU compute. |
| GPU | This middleware supports GPU compute. |
đ Examples
- Ollama LanceDB Pipeline - An example showcasing how to create vector embeddings for text documents using Ollama and store them in a LanceDB embedded database.