
LanceDB

Unstable API

0.7.0

@project-lakechain/lancedb-storage-connector


The LanceDB connector makes it possible for developers to leverage the embedded nature of LanceDB databases to store document descriptions and their associated vector embeddings. This can be a particularly good choice for applications that don’t require ultra-low latency for indexing and retrieval, and are not I/O sensitive.

💁 By using LanceDB as a vector store, developers can store tens of thousands of vectors at a very low cost, benefiting from the serverless nature of LanceDB.



💾 Indexing Documents

To use the LanceDB storage connector, you import it in your CDK stack, and connect it to a data source providing document embeddings. You also define a storage provider such as S3 or EFS that will serve as the storage backend for the LanceDB database.

ℹ️ The example below shows how to create a LanceDB connector using the S3 storage provider.

import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';
import { CacheStorage } from '@project-lakechain/core';
import { LanceDbStorageConnector, S3StorageProvider } from '@project-lakechain/lancedb-storage-connector';

class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // The cache storage used by the connector.
    const cache = new CacheStorage(this, 'Cache');

    // The bucket used to store the LanceDB database.
    const bucket = new s3.Bucket(this, 'Bucket', {
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL
    });

    // Create the LanceDB storage connector.
    // `source` is the upstream middleware providing document embeddings,
    // defined elsewhere in the stack.
    const connector = new LanceDbStorageConnector.Builder()
      .withScope(this)
      .withIdentifier('LanceDbStorageConnector')
      .withCacheStorage(cache)
      .withSource(source)
      .withVectorSize(1024)
      .withStorageProvider(new S3StorageProvider.Builder()
        .withScope(this)
        .withIdentifier('S3Storage')
        .withBucket(bucket)
        .build()
      )
      .build();
  }
}


🗃️ Storage Providers

The LanceDB storage connector supports two storage providers, letting you balance cost, performance, durability, and latency.

S3 Storage

The S3 storage provider uses an S3 bucket to store the LanceDB database using a standard storage class.

💁 The provider does not create the S3 bucket, but uses a customer-provided bucket, as well as an optional path prefix to store the database.


const connector = new LanceDbStorageConnector.Builder()
  .withScope(this)
  .withIdentifier('LanceDbStorageConnector')
  .withCacheStorage(cache)
  .withSource(source)
  .withVectorSize(1024)
  .withStorageProvider(new S3StorageProvider.Builder()
    .withScope(this)
    .withIdentifier('S3Storage')
    .withBucket(bucket) // 👈 Specify the S3 bucket
    .build()
  )
  .build();


EFS Storage

The EFS storage provider uses Amazon EFS to store the LanceDB database, providing lower latency and higher IOPS than S3.

💁 The provider does not create the EFS file system, but uses a customer-provided file system placed in a VPC, as well as an optional path prefix to store the database.


const connector = new LanceDbStorageConnector.Builder()
  .withScope(this)
  .withIdentifier('LanceDbStorageConnector')
  .withCacheStorage(cache)
  .withSource(source)
  .withVectorSize(1024)
  .withStorageProvider(new EfsStorageProvider.Builder()
    .withScope(this)
    .withIdentifier('EfsStorage')
    .withFileSystem(fileSystem) // 👈 Specify the EFS file system
    .withVpc(vpc) // 👈 Specify the EFS VPC
    .build()
  )
  .build();


Include Text

When the document being processed is a text document, you can choose to include the text of the document associated with the embeddings in the LanceDB table. This allows you to retrieve the text associated with the embeddings when executing a similarity search without having to retrieve the original text from a separate database.

To do so, use the withIncludeText API. If the document is not a text document, this option is ignored.

💁 By default, the text is not included in the index.

const connector = new LanceDbStorageConnector.Builder()
  .withScope(this)
  .withIdentifier('LanceDbStorageConnector')
  .withCacheStorage(cache)
  .withSource(source)
  .withVectorSize(1024)
  .withStorageProvider(storageProvider)
  .withIncludeText(true) // 👈 Include text
  .build();
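
To make the benefit of including text concrete, the sketch below models rows in memory and runs a brute-force similarity search that returns the text directly from the matching row. The `EmbeddingRow` shape, the `buildRow` helper, and the search function are all hypothetical stand-ins; the connector's actual table schema and LanceDB's query engine are not shown here.

```typescript
// Hypothetical row shape; the connector's real table schema may differ.
interface EmbeddingRow {
  id: string;
  vector: number[];
  text?: string; // Only set when text inclusion is enabled.
}

// Builds a row, attaching the text only when `includeText` is true
// and the document actually carries text.
function buildRow(
  id: string,
  vector: number[],
  text: string | undefined,
  includeText: boolean
): EmbeddingRow {
  const row: EmbeddingRow = { id, vector };
  if (includeText && text !== undefined) {
    row.text = text;
  }
  return row;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest-neighbor lookup, standing in for a LanceDB query.
function nearestRow(rows: EmbeddingRow[], query: number[]): EmbeddingRow | undefined {
  let best: EmbeddingRow | undefined;
  let bestScore = -Infinity;
  for (const row of rows) {
    const score = cosineSimilarity(row.vector, query);
    if (score > bestScore) {
      best = row;
      bestScore = score;
    }
  }
  return best;
}

const sampleRows: EmbeddingRow[] = [
  buildRow('doc-1', [1, 0], 'First chunk of text', true),
  buildRow('doc-2', [0, 1], 'Second chunk of text', true)
];

// The matching text comes back with the row, with no second lookup.
const bestMatch = nearestRow(sampleRows, [0.9, 0.1]);
console.log(bestMatch?.text);
```

When text is not included, the same search would only yield an identifier and a vector, and the caller would have to fetch the original text from wherever it is stored.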


🏗️ Architecture

The LanceDB storage connector runs on AWS Lambda (ARM64) compute, which indexes the document embeddings provided by source middlewares into the LanceDB database. The connector uses an AWS Lambda Layer to include the LanceDB library within the Lambda environment.

💁 The architecture depends on the selected storage provider. Below is a description of the architecture for each storage provider.

S3 Storage Provider

The S3 storage provider uses a user-provided S3 bucket to store the LanceDB database.

LanceDB Storage Connector S3 Architecture

EFS Storage Provider

The EFS storage provider uses a user-provided EFS file system to store the LanceDB database.

LanceDB Storage Connector EFS Architecture



🏷️ Properties


Supported Inputs

| Mime Type | Description |
| --------- | ----------- |
| */* | This middleware supports any type of document. Note that if no embeddings are specified in the document metadata, the document is filtered out. |

Supported Outputs

This middleware does not produce any output.

Supported Compute Types

| Type | Description |
| ---- | ----------- |
| CPU | This middleware only supports CPU compute. |
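
The input rule listed under Supported Inputs (documents without embeddings in their metadata are filtered out) can be sketched as a simple predicate. The `DocumentEvent` shape and its field names are hypothetical simplifications for illustration; the real Lakechain event schema is richer and may name these fields differently.

```typescript
// Hypothetical, simplified document event; not the actual Lakechain schema.
interface DocumentEvent {
  url: string;
  mimeType: string;
  metadata: {
    embeddings?: { vectors: number[] };
  };
}

// Returns true only when the document metadata carries a non-empty
// embedding vector; documents failing this check would be skipped.
function hasEmbeddings(event: DocumentEvent): boolean {
  const vectors = event.metadata.embeddings?.vectors;
  return Array.isArray(vectors) && vectors.length > 0;
}

// Placeholder events: one with embeddings, one without.
const incomingEvents: DocumentEvent[] = [
  {
    url: 's3://bucket/a.txt',
    mimeType: 'text/plain',
    metadata: { embeddings: { vectors: [0.1, 0.2, 0.3] } }
  },
  {
    url: 's3://bucket/b.png',
    mimeType: 'image/png',
    metadata: {}
  }
];

const indexableEvents = incomingEvents.filter(hasEmbeddings);
console.log(indexableEvents.map((e) => e.url));
```

Note that the mime type plays no role in the check: any document type is accepted as long as embeddings are attached.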


📖 Examples

  • Bedrock + LanceDB - An example showcasing an embedding pipeline using Amazon Bedrock and LanceDB.