Indexing
Overview
Indexing in byokg-rag enables efficient entity linking by mapping natural language mentions to knowledge graph nodes. The system supports three complementary index types that work together to match entities from user queries to graph entities with varying degrees of precision and semantic understanding.
Entity linking is a critical step in Knowledge Graph Question Answering (KGQA). When a user asks a question, the system must identify which entities in the knowledge graph are relevant. Indexes provide fast lookup mechanisms to find candidate entities based on string similarity, semantic meaning, or direct graph storage.
This document covers:
- Dense indexes for semantic similarity matching
- Fuzzy string indexes for approximate string matching
- Graph-store indexes for embedding-based retrieval directly from Neptune Analytics
- Guidance on selecting the appropriate index for your use case
Dense Index
Purpose
Dense indexes use embeddings to find entities based on semantic similarity rather than exact string matches. This approach captures meaning and context, allowing the system to link entities even when the query uses different wording than the entity labels in the graph.
Architecture
The dense index stores vector embeddings of entity labels and uses similarity search to find the closest matches to a query embedding. The system supports local FAISS-based indexes for development and testing.
LocalFaissDenseIndex provides an in-memory vector index using FAISS (Facebook AI Similarity Search). It computes embeddings for entity labels and stores them in a FAISS index structure that enables fast approximate nearest neighbor search.
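The lookup described above can be sketched in plain Python. This is a minimal illustration, not the toolkit's implementation: `ToyDenseIndex`, `toy_embed`, and the three-dimensional vectors in `TOY_VECTORS` are hypothetical stand-ins, and the search is an exhaustive scan rather than a FAISS index structure.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

class ToyDenseIndex:
    """Hypothetical brute-force index: stores one vector per entity label."""

    def __init__(self, embed, distance_type="l2"):
        if distance_type not in ("l2", "cosine"):
            raise ValueError("distance_type must be 'l2' or 'cosine'")
        self.embed = embed
        self.distance = l2_distance if distance_type == "l2" else cosine_distance
        self.labels = []
        self.vectors = []

    def add(self, labels):
        # Embed each label once, at indexing time.
        for label in labels:
            self.labels.append(label)
            self.vectors.append(self.embed(label))

    def query(self, text, topk=1):
        # Embed the query, then rank every stored vector by distance.
        q = self.embed(text)
        scored = sorted(
            (self.distance(q, v), label)
            for label, v in zip(self.labels, self.vectors)
        )
        return [(label, dist) for dist, label in scored[:topk]]

# Hand-made vectors standing in for a real embedding model (assumption).
TOY_VECTORS = {
    "Albert Einstein": [0.9, 0.1, 0.0],
    "Marie Curie": [0.1, 0.9, 0.0],
    "Isaac Newton": [0.6, 0.0, 0.6],
    "physicist who developed relativity": [0.8, 0.2, 0.1],
}

def toy_embed(text):
    return TOY_VECTORS[text]

index = ToyDenseIndex(toy_embed, distance_type="l2")
index.add(["Albert Einstein", "Marie Curie", "Isaac Newton"])
results = index.query("physicist who developed relativity", topk=2)
top_labels = [label for label, _ in results]
```

With these toy vectors, the query shares no words with any label, yet its nearest neighbour is "Albert Einstein" because the vectors are close. A real embedding model plays the role of `toy_embed`, and FAISS replaces the exhaustive scan.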
AWS Services
Dense indexes require an embedding model to generate vector representations. The system integrates with:
- Amazon Bedrock - Provides access to foundation models for generating embeddings
IAM Permissions
To use dense indexes with Amazon Bedrock embeddings, you need the following IAM permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:<region>::foundation-model/*"
    }
  ]
}
```

NOTE: Replace <region> with your AWS region (e.g., us-east-1).
Configuration
Configure a local FAISS dense index:

```python
from graphrag_toolkit.byokg_rag.indexing import LocalFaissDenseIndex, LangChainEmbedding
from langchain_aws import BedrockEmbeddings

# Set up embedding model
bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v2:0",
    region_name="<region>"
)
embedding = LangChainEmbedding(bedrock_embeddings)

# Create dense index
dense_index = LocalFaissDenseIndex(
    embedding=embedding,
    distance_type="l2",  # Options: "l2", "cosine"
    embedding_dim=1024   # Must match embedding model dimension
)

# Add entities to index
entities = ["Albert Einstein", "Marie Curie", "Isaac Newton"]
dense_index.add(entities)

# Query the index
results = dense_index.query("physicist who developed relativity", topk=3)
```

Parameters:
- `embedding` - Embedding instance that generates vector representations
- `distance_type` - Distance metric for similarity ("l2" or "cosine")
- `embedding_dim` - Dimension of embedding vectors (must match model output)
Fuzzy String Index
Purpose
Fuzzy string indexes handle variations in entity names through approximate string matching. This approach is effective for typos, abbreviations, and minor spelling differences without requiring embeddings or semantic understanding.
Architecture
The fuzzy string index uses the `thefuzz` library to compute string similarity scores between query text and entity labels. It supports configurable matching thresholds and can filter candidates based on string length differences.
FuzzyStringIndex provides fast approximate string matching using Levenshtein distance and other string similarity algorithms. It maintains an in-memory mapping of entity labels and returns matches ranked by similarity score.
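To make the idea concrete, here is a minimal sketch of edit-distance ranking with a length filter. It is an illustration under stated assumptions: `levenshtein`, `similarity`, and `fuzzy_match` are hypothetical helpers written for this example, and the 0 to 100 score is a simple normalization that differs in detail from the scoring formulas `thefuzz` uses.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale edit distance to a 0-100 score; 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)

def fuzzy_match(query, labels, topk=1, max_len_difference=4):
    """Rank labels by similarity, skipping those whose length differs too much."""
    candidates = [
        label for label in labels
        if abs(len(label) - len(query)) <= max_len_difference
    ]
    ranked = sorted(
        candidates,
        key=lambda label: similarity(query.lower(), label.lower()),
        reverse=True,
    )
    return [
        (label, round(similarity(query.lower(), label.lower()), 1))
        for label in ranked[:topk]
    ]

labels = ["Albert Einstein", "Marie Curie", "Isaac Newton"]
einstein_hit = fuzzy_match("Albert Einstien", labels)
curie_hit = fuzzy_match("Mary Curie", labels)
```

The length filter is a cheap pre-check: a candidate whose length differs from the query by more than `max_len_difference` characters is dropped before any edit distance is computed.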
Configuration
Configure a fuzzy string index:

```python
from graphrag_toolkit.byokg_rag.indexing import FuzzyStringIndex

# Create fuzzy string index
fuzzy_index = FuzzyStringIndex()

# Add entities to index
entities = ["Albert Einstein", "Marie Curie", "Isaac Newton"]
fuzzy_index.add(entities)

# Query with fuzzy matching
results = fuzzy_index.match(
    inputs=["Albert Einstien", "Mary Curie"],  # Note: typos
    topk=1,
    max_len_difference=4
)
```

Parameters:
- `topk` - Number of top matches to return per query
- `max_len_difference` - Maximum allowed length difference between query and candidate
- `id_selector` - Optional function to filter candidates before matching
TIP: Fuzzy string matching works best for entity names with consistent structure. For highly variable entity descriptions, consider using dense indexes instead.
Graph Store Index
Purpose
Graph-store indexes store embeddings directly in the graph database, eliminating the need for separate index infrastructure. This approach is available for Amazon Neptune Analytics, which supports vector storage and similarity search natively.
Architecture
NeptuneAnalyticsGraphStoreIndex stores entity embeddings as node properties in Neptune Analytics and uses the graph database’s built-in vector search capabilities. This provides a unified storage layer for both graph structure and semantic embeddings.
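The benefit of a unified storage layer can be illustrated with a small in-memory stand-in. Everything here is hypothetical: the `nodes` and `edges` structures, the toy vectors, and the helper functions are plain Python written for this example and do not reflect the Neptune Analytics API or its query language.

```python
import math

# Hypothetical graph: each node carries its embedding as an ordinary property.
nodes = {
    "n1": {"label": "Albert Einstein", "embedding": [0.9, 0.1, 0.0]},
    "n2": {"label": "Marie Curie", "embedding": [0.1, 0.9, 0.0]},
    "n3": {"label": "Theory of relativity", "embedding": [0.7, 0.0, 0.5]},
}
edges = [("n1", "DEVELOPED", "n3")]

def top_k_nodes(query_vector, k=1):
    """Return the ids of the k nodes whose embeddings are closest (L2)."""
    def l2(node_id):
        return math.dist(query_vector, nodes[node_id]["embedding"])
    return sorted(nodes, key=l2)[:k]

def outgoing(node_id):
    """Read the node's outgoing relations from the same store."""
    return [
        (relation, nodes[target]["label"])
        for source, relation, target in edges
        if source == node_id
    ]

# Toy query vector standing in for an embedded user question (assumption).
best = top_k_nodes([0.8, 0.2, 0.1], k=1)[0]
context = outgoing(best)
```

Because the vector and the edges live in the same store, the node found by similarity search can be expanded into its graph neighbourhood without consulting a second system.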
AWS Services
Graph-store indexes require:
- Amazon Neptune Analytics - Graph database with native vector search support
- Amazon Bedrock - Embedding model for generating vectors
- Amazon S3 - Storage for embedding data during bulk loading
IAM Permissions
To use graph-store indexes with Neptune Analytics, you need:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "neptune-graph:ReadDataViaQuery",
        "neptune-graph:GetGraph"
      ],
      "Resource": "arn:aws:neptune-graph:<region>:<account-id>:graph/<graph-id>"
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:<region>::foundation-model/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::<bucket-name>/*"
    }
  ]
}
```

NOTE: Replace <region>, <account-id>, <graph-id>, and <bucket-name> with your specific values.
Configuration
Configure a Neptune Analytics graph-store index:

```python
from graphrag_toolkit.byokg_rag.graphstore import NeptuneAnalyticsGraphStore
from graphrag_toolkit.byokg_rag.indexing import NeptuneAnalyticsGraphStoreIndex, LangChainEmbedding
from langchain_aws import BedrockEmbeddings

# Set up graph store
graph_store = NeptuneAnalyticsGraphStore(
    graph_identifier="<graph-id>",
    region="<region>"
)

# Set up embedding model
bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v2:0",
    region_name="<region>"
)
embedding = LangChainEmbedding(bedrock_embeddings)

# Create graph-store index
graph_index = NeptuneAnalyticsGraphStoreIndex(
    graphstore=graph_store,
    embedding=embedding,
    distance_type="l2",
    embedding_s3_save_path="s3://<bucket-name>/embeddings/"
)

# Query the index
results = graph_index.query("physicist who developed relativity", topk=3)
```

Parameters:
- `graphstore` - NeptuneAnalyticsGraphStore instance
- `embedding` - Embedding instance for generating vectors
- `distance_type` - Distance metric for similarity ("l2" or "cosine")
- `embedding_s3_save_path` - S3 path for storing embeddings during bulk operations
Index Selection Guide
Choose the appropriate index type based on your requirements:
| Index Type | Best For | Pros | Cons |
|---|---|---|---|
| Dense Index | Semantic matching, paraphrases, synonyms | Captures meaning, handles varied wording | Requires embedding model, higher latency |
| Fuzzy String Index | Typos, abbreviations, minor name variations | Fast, no embedding model required | Limited to string similarity, no semantic understanding |
| Graph Store Index | Neptune Analytics deployments, unified storage | No separate index infrastructure, integrated with graph | Requires Neptune Analytics, S3 for bulk loading |
Recommendations:
- Use fuzzy string index as the default for most applications. It provides good performance with minimal setup.
- Add dense index when queries use varied terminology or when entity labels are inconsistent.
- Use graph-store index when deploying on Neptune Analytics to simplify infrastructure.
- Combine multiple indexes for comprehensive coverage. The entity linker can use multiple indexes in sequence.
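One way to read "multiple indexes in sequence" is as a cascade: try the cheap string lookup first and fall back to a semantic lookup only when it finds nothing. The sketch below assumes that reading. `string_lookup`, `link_entity`, the `0.85` threshold, and the dictionary-backed `semantic_lookup` are all hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for a fuzzy string index. The toolkit's entity linker may combine index results differently.

```python
from difflib import SequenceMatcher

def string_lookup(mention, labels, threshold=0.85):
    """Return labels whose character-level similarity meets the threshold, best first."""
    scored = [
        (SequenceMatcher(None, mention.lower(), label.lower()).ratio(), label)
        for label in labels
    ]
    return [label for score, label in sorted(scored, reverse=True) if score >= threshold]

def link_entity(mention, labels, semantic_lookup, threshold=0.85):
    """Try string matching first; fall back to the semantic lookup if it fails."""
    hits = string_lookup(mention, labels, threshold)
    if hits:
        return hits[0], "string"
    hits = semantic_lookup(mention)
    if hits:
        return hits[0], "semantic"
    return None, "unlinked"

labels = ["Albert Einstein", "Marie Curie", "Isaac Newton"]

# Stand-in for a dense index query: a fixed paraphrase table (assumption).
PARAPHRASES = {"physicist who developed relativity": ["Albert Einstein"]}

def semantic_lookup(mention):
    return PARAPHRASES.get(mention, [])

typo_result = link_entity("Albert Einstien", labels, semantic_lookup)
paraphrase_result = link_entity("physicist who developed relativity", labels, semantic_lookup)
unknown_result = link_entity("Nikola Tesla", labels, semantic_lookup)
```

Ordering the cheap lookup first means the embedding model is only called for mentions that string matching cannot resolve, which keeps latency low for the common case of near-exact names.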
TIP: Start with fuzzy string matching and add semantic indexes only if you observe poor entity linking performance in testing.