Skip to content

Indexing

Indexing in byokg-rag enables efficient entity linking by mapping natural language mentions to knowledge graph nodes. The system supports three complementary index types that work together to match entities from user queries to graph entities with varying degrees of precision and semantic understanding.

Entity linking is a critical step in Knowledge Graph Question Answering (KGQA). When a user asks a question, the system must identify which entities in the knowledge graph are relevant. Indexes provide fast lookup mechanisms to find candidate entities based on string similarity, semantic meaning, or direct graph storage.

This document covers:

  • Dense indexes for semantic similarity matching
  • Fuzzy string indexes for approximate string matching
  • Graph-store indexes for embedding-based retrieval directly from Neptune Analytics
  • Guidance on selecting the appropriate index for your use case

Dense indexes use embeddings to find entities based on semantic similarity rather than exact string matches. This approach captures meaning and context, allowing the system to link entities even when the query uses different wording than the entity labels in the graph.

The dense index stores vector embeddings of entity labels and uses similarity search to find the closest matches to a query embedding. The system supports local FAISS-based indexes for development and testing.

LocalFaissDenseIndex provides an in-memory vector index using FAISS (Facebook AI Similarity Search). It computes embeddings for entity labels and stores them in a FAISS index structure that enables fast approximate nearest neighbor search.

Dense indexes require an embedding model to generate vector representations. The system integrates with:

  • Amazon Bedrock - Provides access to foundation models for generating embeddings

To use dense indexes with Amazon Bedrock embeddings, you need the following IAM permissions:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel"
],
"Resource": "arn:aws:bedrock:<region>::foundation-model/*"
}
]
}

NOTE: Replace <region> with your AWS region (e.g., us-east-1).

Configure a local FAISS dense index:

from graphrag_toolkit.byokg_rag.indexing import LocalFaissDenseIndex, LangChainEmbedding
from langchain_aws import BedrockEmbeddings
# Set up embedding model
bedrock_embeddings = BedrockEmbeddings(
model_id="amazon.titan-embed-text-v2:0",
region_name="<region>"
)
embedding = LangChainEmbedding(bedrock_embeddings)
# Create dense index
dense_index = LocalFaissDenseIndex(
embedding=embedding,
distance_type="l2", # Options: "l2", "cosine"
embedding_dim=1024 # Must match embedding model dimension
)
# Add entities to index
entities = ["Albert Einstein", "Marie Curie", "Isaac Newton"]
dense_index.add(entities)
# Query the index
results = dense_index.query("physicist who developed relativity", topk=3)

Parameters:

  • embedding - Embedding instance that generates vector representations
  • distance_type - Distance metric for similarity (“l2” or “cosine”)
  • embedding_dim - Dimension of embedding vectors (must match model output)

Fuzzy string indexes handle variations in entity names through approximate string matching. This approach is effective for typos, abbreviations, and minor spelling differences without requiring embeddings or semantic understanding.

The fuzzy string index uses the thefuzz library to compute string similarity scores between query text and entity labels. It supports configurable matching thresholds and can filter candidates based on string length differences.

FuzzyStringIndex provides fast approximate string matching using Levenshtein distance and other string similarity algorithms. It maintains an in-memory mapping of entity labels and returns matches ranked by similarity score.

Configure a fuzzy string index:

from graphrag_toolkit.byokg_rag.indexing import FuzzyStringIndex
# Create fuzzy string index
fuzzy_index = FuzzyStringIndex()
# Add entities to index
entities = ["Albert Einstein", "Marie Curie", "Isaac Newton"]
fuzzy_index.add(entities)
# Query with fuzzy matching
results = fuzzy_index.match(
inputs=["Albert Einstien", "Mary Curie"], # Note: typos
topk=1,
max_len_difference=4
)

Parameters:

  • topk - Number of top matches to return per query
  • max_len_difference - Maximum allowed length difference between query and candidate
  • id_selector - Optional function to filter candidates before matching

TIP: Fuzzy string matching works best for entity names with consistent structure. For highly variable entity descriptions, consider using dense indexes instead.

Graph-store indexes store embeddings directly in the graph database, eliminating the need for separate index infrastructure. This approach is available for Amazon Neptune Analytics, which supports vector storage and similarity search natively.

NeptuneAnalyticsGraphStoreIndex stores entity embeddings as node properties in Neptune Analytics and uses the graph database’s built-in vector search capabilities. This provides a unified storage layer for both graph structure and semantic embeddings.

Graph-store indexes require:

  • Amazon Neptune Analytics - Graph database with native vector search support
  • Amazon Bedrock - Embedding model for generating vectors
  • Amazon S3 - Storage for embedding data during bulk loading

To use graph-store indexes with Neptune Analytics, you need:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"neptune-graph:ReadDataViaQuery",
"neptune-graph:GetGraph"
],
"Resource": "arn:aws:neptune-graph:<region>:<account-id>:graph/<graph-id>"
},
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel"
],
"Resource": "arn:aws:bedrock:<region>::foundation-model/*"
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject"
],
"Resource": "arn:aws:s3:::<bucket-name>/*"
}
]
}

NOTE: Replace <region>, <account-id>, <graph-id>, and <bucket-name> with your specific values.

Configure a Neptune Analytics graph-store index:

from graphrag_toolkit.byokg_rag.graphstore import NeptuneAnalyticsGraphStore
from graphrag_toolkit.byokg_rag.indexing import NeptuneAnalyticsGraphStoreIndex, LangChainEmbedding
from langchain_aws import BedrockEmbeddings
# Set up graph store
graph_store = NeptuneAnalyticsGraphStore(
graph_identifier="<graph-id>",
region="<region>"
)
# Set up embedding model
bedrock_embeddings = BedrockEmbeddings(
model_id="amazon.titan-embed-text-v2:0",
region_name="<region>"
)
embedding = LangChainEmbedding(bedrock_embeddings)
# Create graph-store index
graph_index = NeptuneAnalyticsGraphStoreIndex(
graphstore=graph_store,
embedding=embedding,
distance_type="l2",
embedding_s3_save_path="s3://<bucket-name>/embeddings/"
)
# Query the index
results = graph_index.query("physicist who developed relativity", topk=3)

Parameters:

  • graphstore - NeptuneAnalyticsGraphStore instance
  • embedding - Embedding instance for generating vectors
  • distance_type - Distance metric for similarity (“l2” or “cosine”)
  • embedding_s3_save_path - S3 path for storing embeddings during bulk operations

Choose the appropriate index type based on your requirements:

Index TypeBest ForProsCons
Dense IndexSemantic matching, paraphrases, synonymsCaptures meaning, handles varied wordingRequires embedding model, higher latency
Fuzzy String IndexTypos, abbreviations, exact name variationsFast, no external dependenciesLimited to string similarity, no semantic understanding
Graph Store IndexNeptune Analytics deployments, unified storageNo separate index infrastructure, integrated with graphRequires Neptune Analytics, S3 for bulk loading

Recommendations:

  • Use fuzzy string index as the default for most applications. It provides good performance with minimal setup.
  • Add dense index when queries use varied terminology or when entity labels are inconsistent.
  • Use graph-store index when deploying on Neptune Analytics to simplify infrastructure.
  • Combine multiple indexes for comprehensive coverage. The entity linker can use multiple indexes in sequence.

TIP: Start with fuzzy string matching and add semantic indexes only if you observe poor entity linking performance in testing.