Indexing
Overview
Section titled “Overview”There are two stages to indexing: extract, and build. The lexical-graph uses separate pipelines for each of these stages, plus micro-batching, to provide a continous ingest capability. This means that your graph will start being populated soon after extraction begins.
You can run the extract and build pipelines together, to provide for the continuous ingest described above. Or you can run the two pipelines separately, extracting first to file-based chunks, and then later building a graph from these chunks.
The LexicalGraphIndex allows you to run the extract and build pipelines together or separately. See the Using the LexicalGraphIndex to construct a graph section below.
Indexing supports multi-tenancy, whereby you can store separate lexical graphs in the same backend graph and vector stores.
Extract
Section titled “Extract”The extraction stage is, by default, a three-step process:
- The source documents are broken down into chunks.
- For each chunk, an LLM extracts a set of propositions from the unstructured content. This proposition extraction helps ‘clean’ the content and improve the subsequent entity/topic/statement/fact extraction by breaking complex sentences into simpler sentences, replacing pronouns with specific names, and replacing acronyms where possible. These propositions are added to the chunk’s metadata under the
aws::graph::propositionskey. - Following the proposition extraction, a second LLM call extracts entities, relations, topics, statements and facts from the set of extracted propositions. These details are added to the chunk’s metadata under the
aws::graph::topicskey.
Only the third step here is mandatory. If your source data has already been chunked, you can omit step 1. If you’re willing to trade a reduction in LLM calls and improved performance for a reduction in the quality of the entity/topic/statement/fact extraction, you can omit step 2.
Extraction uses a lightly guided strategy whereby the extraction process is seeded with a list of preferred entity classifications. The LLM is instructed to use an existing classification from the list before creating new ones. Any new classifications introduced by the LLM are then carried forward to subsequent invocations. This approach reduces but doesn’t eliminate unwanted variations in entity classification.
The list of DEFAULT_ENTITY_CLASSIFICATIONS used to seed the extraction process can be found here. If these classifications are not appropriate to your workload you can replace them (see the Configuring the extract and build stages section below).
Relationship values are currently unguided (though relatively concise).
In the build stage, the LlamaIndex chunk nodes emitted from the extract stage are broken down further into a stream of individual source, chunk, topic, statement and fact LlamaIndex nodes. Graph construction and vector indexing handlers process these nodes to build and index the graph content. Each of these nodes has an aws::graph::index metadata item containing data that can be used to index the node in a vector store (though only the chunk and statement nodes are actually indexed in the current implementation).
Using the LexicalGraphIndex to construct a graph
Section titled “Using the LexicalGraphIndex to construct a graph”The LexicalGraphIndex provides a convenient means of constructing a graph – via either continuous ingest, or separate extract and build stages. When constructing a LexicalGraphIndex you must supply a graph store and a vector store (see Storage Model for more details). In the examples below, the graph store and vector store connection strings are fetched from environment variables.
The LexicalGraphIndex constructor has an extraction_dir named argument. This is the path to a local directory to which intermediate artefacts (such as checkpoints) will be written. By default, the value of extraction_dir is set to the value of GraphRAGConfig.local_output_dir, which defaults to 'output'. For containerized deployments (EKS/Kubernetes), you can configure this via the LOCAL_OUTPUT_DIR environment variable or by setting GraphRAGConfig.local_output_dir programmatically. See Configuration for more details.
Continous ingest
Section titled “Continous ingest”Use LexicalGraphIndex.extract_and_build() to extract and build a graph in a manner that supports continous ingest.
The extraction stage consumes LlamaIndex nodes – either documents, which will be chunked during extraction, or pre-chunked text nodes. Use a LlamaIndex reader to load source documents. The example below uses a LlamaIndex SimpleWebReader to load several HTML pages.
import os
from graphrag_toolkit.lexical_graph import LexicalGraphIndexfrom graphrag_toolkit.lexical_graph.storage import GraphStoreFactoryfrom graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from llama_index.readers.web import SimpleWebPageReader
doc_urls = [ 'https://docs.aws.amazon.com/neptune/latest/userguide/intro.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html']
docs = SimpleWebPageReader( html_to_text=True, metadata_fn=lambda url:{'url': url}).load_data(doc_urls)
with ( GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store, VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store):
graph_index = LexicalGraphIndex( graph_store, vector_store )
graph_index.extract_and_build(docs)The diff below shows what changes when you split the pipelines:
graph_index.extract_and_build(docs)extracted_docs = S3BasedDocs( region=os.environ['AWS_REGION'], bucket_name=os.environ['EXTRACTION_BUCKET'], key_prefix='extracted',)graph_index.extract(docs, handler=extracted_docs, show_progress=True)graph_index.build(extracted_docs, show_progress=True)Run the extract and build stages separately
Section titled “Run the extract and build stages separately”Using the LexicalGraphIndex you can perform the extract and build stages separately. This is useful if you want to extract the graph once, and then build it multiple times (in different environments, for example.)
When you run the extract and build stages separately, you can persist the extracted documents to Amazon S3 or to the filesystem at the end of the extract stage, and then consume these same documents in the build stage. Use the graphrag-toolkit’s S3BasedDocss and FileBasedDocs classes to persist and then retrieve JSON-serialized LlamaIndex nodes.
The following example shows how to use a S3BasedDocs handler to persist extracted documents to an Amazon S3 bucket at the end of the extract stage:
import os
from graphrag_toolkit.lexical_graph import LexicalGraphIndexfrom graphrag_toolkit.lexical_graph.storage import GraphStoreFactoryfrom graphrag_toolkit.lexical_graph.storage import VectorStoreFactoryfrom graphrag_toolkit.lexical_graph.indexing.load import S3BasedDocs
from llama_index.readers.web import SimpleWebPageReader
extracted_docs = S3BasedDocs( region='us-east-1', bucket_name='my-bucket', key_prefix='extracted', collection_id='12345')
with ( GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store, VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store):
graph_index = LexicalGraphIndex( graph_store, vector_store )
doc_urls = [ 'https://docs.aws.amazon.com/neptune/latest/userguide/intro.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html' ]
docs = SimpleWebPageReader( html_to_text=True, metadata_fn=lambda url:{'url': url} ).load_data(doc_urls)
graph_index.extract(docs, handler=extracted_docs)Following the extract stage, you can then build the graph from the previously extracted documents. Whereas in the extract stage the S3BasedDocs object acted as a handler to persist extracted documents, in the build stage the S3BasedDocs object acts as a source of LlamaIndex nodes, and is thus passed as the first argument to the build() method:
import os
from graphrag_toolkit.lexical_graph import LexicalGraphIndexfrom graphrag_toolkit.lexical_graph.storage import GraphStoreFactoryfrom graphrag_toolkit.lexical_graph.storage import VectorStoreFactoryfrom graphrag_toolkit.lexical_graph.indexing.load import S3BasedDocs
docs = S3BasedDocs( region='us-east-1', bucket_name='my-bucket', key_prefix='extracted', collection_id='12345')
with ( GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store, VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store):
graph_index = LexicalGraphIndex( graph_store, vector_store )
graph_index.build(docs)The S3BasedDocs object has the following parameters:
| Parameter | Description | Mandatory |
|---|---|---|
region | AWS Region in which the S3 bucket is located (e.g. us-east-1) | Yes |
bucket_name | Amazon S3 bucket name | Yes |
key_prefix | S3 key prefix | Yes |
collection_id | Id for a particular collection of extracted documents. Optional: if no collection_id is supplied, the lexical-graph will create a timestamp value. Extracted documents will be written to s3://<bucket>/<key_prefix>/<collection_id>/. | No |
s3_encryption_key_id | KMS key id (Key ID, Key ARN, or Key Alias) to use for object encryption. Optional: if no s3_encryption_key_id is supplied, the lexical-graph will encrypt objects in S3 using Amazon S3 managed keys. | No |
If you use Amazon Web Services KMS keys to encrypt objects in S3, the identity under which the lexical-graph runs should include the following IAM policy. Replace <kms-key-arn> with the ARN of the KMS key you want to use to encrypt objects:
{ "Version": "2012-10-17", "Statement": [ { "Action": [ "kms:GenerateDataKey", "kms:Decrypt" ], "Resource": [ "<kms-key-arn>" ], "Effect": "Allow" } ]}If you want to persist extracted documents to the local filesystem instead of an S3 bucket, use a FileBasedDocs object instead:
from graphrag_toolkit.lexical_graph.indexing.load import FileBasedDocs
chunks = FileBasedDocs( docs_directory='./extracted/', collection_id='12345')The FileBasedChunks object has the following parameters:
| Parameter | Description | Mandatory |
|---|---|---|
docs_directory | Root directory for the extracted documents | Yes |
collection_id | Id for a particular collection of extracted documents. Optional: if no collection_id is supplied, the lexical-graph will create a timestamp value. Extracted documents will be written to /<docs_directory>/<collection_id>/. | No |
Configuring the extract and build stages
Section titled “Configuring the extract and build stages”You can configure the number of workers and batch sizes for the extract and build stages of the LexicalGraphIndex using the GraphRAGConfig object. See Configuration for more details on using the configuration object.
Besides configuring the workers and batch sizes, you can also configure the indexing process with regard to chunking, proposition extraction and entity classification, and graph and vector store contents by passing an instance of IndexingConfig to the LexicalGraphIndex constructor:
from graphrag_toolkit.lexical_graph import LexicalGraphIndex, IndexingConfig, ExtractionConfig
...
graph_index = LexicalGraphIndex( graph_store, vector_store, indexing_config = IndexingConfig( chunking=None, extraction=ExtractionConfig( enable_proposition_extraction=False )
))The IndexingConfig object has the following parameters:
| Parameter | Description | Default Value |
|---|---|---|
chunking | A list of node parsers (e.g. LlamaIndex SentenceSplitter) to be used for chunking source documents. Set chunking to None to skip chunking. | SentenceSplitter with chunk_size=256 and chunk_overlap=25 |
extraction | An ExtractionConfig object specifying extraction options | ExtractionConfig with default values |
build | A BuildConfig object specifying build options | BuildConfig with default values |
batch_config | Batch configuration to be used if performing batch extraction. If batch_config is None, the toolkit will perform chunk-by-chunk extraction. | None |
The ExtractionConfig object has the following parameters:
| Parameter | Description | Default Value |
|---|---|---|
enable_proposition_extraction | Perform proposition extraction before extracting topics, statements, facts and entities | True |
preferred_entity_classifications | Comma-separated list of preferred entity classifications used to seed the entity extraction | DEFAULT_ENTITY_CLASSIFICATIONS |
preferred_topics | List of preferred topic names (or a callable that returns them) supplied to the LLM to seed topic extraction. Accepts the same type as preferred_entity_classifications. | [] |
infer_entity_classifications | Determines whether to pre-process documents to identify significant domain entity classifications. Supply either True or False, or an InferClassificationsConfig object. When True, an InferClassifications step runs as a pre-processor before the main extraction loop — one extra LLM round-trip per batch, not per document. | False |
extract_propositions_prompt_template | Prompt used to extract propositions from chunks. If None, the default extract propositions template is used. See Custom prompts below. | None |
extract_topics_prompt_template | Prompt used to extract topics, statements and entities from chunks. If None, the default extract topics template is used. See Custom prompts below. | None |
extraction_llm | LLM used to perform extraction and infer classifications. Accepts the model id of an Amazon Bedrock model, an Amazon Bedrock inference profile, a JSON string representation of a LlamaIndex BedrockConverse instance, or an instance of a LlamaIndex LLM object (see the LLM configuration section for more details). If None, the GraphRAG.extraction_llm configuration parameter is used. | None |
The BuildConfig object has the following parameters:
| Parameter | Description | Default Value |
|---|---|---|
build_filters | A BuildFilters object to include or exclude specific node types during the build stage | BuildFilters() |
include_domain_labels | Whether to add a domain-specific label (e.g. Company) to entity nodes in addition to __Entity__ | None (falls back to GraphRAGConfig.include_domain_labels) |
include_local_entities | Whether to include local-context entities in the graph | None (falls back to GraphRAGConfig.include_local_entities) |
source_metadata_formatter | A SourceMetadataFormatter instance for customising source metadata written to the graph | DefaultSourceMetadataFormatter() |
enable_versioning | Whether to enable versioned updates. Overrides GraphRAGConfig.enable_versioning when set. | None |
The InferClassificationsConfig object has the following parameters:
| Parameter | Description | Default Value |
|---|---|---|
num_iterations | Number of times to run the pre-processing over the source documents | 1 |
num_samples | Number of chunks (selected at random) from which classifications are extracted per iteration | 5 |
prompt_template | Prompt used to extract classifications from sampled chunks. If None, the default domain entity classifications template is used. See Custom prompts below. | None |
Custom prompts
Section titled “Custom prompts”The extract stage uses up to three LLM prompts:
- Domain entity classifications: Extracts significant domain entity classifications from a sample of source documents prior to processing the documents. These classificatiosn are then supplied to the extract topics prompt as the list of preferred entity classifications.
- Extract propositions: Extracts a set of standalone, well-formed propositions from a chunk.
- Extract topics: Extracts topics, statements and entities and their relations from either a set of propositions, or from the raw chunk text.
Using the ExtractionConfig and InferClassificationsConfig you can customize one or more of these prompts.
Domain entity classifications:
The prompt template should included a {text_chunks} placeholder, into which the sampled chunks will be inserted.
The template should return classifications in the following format:
<entity_classifications>Classification1Classification2Classification3</entity_classifications>Extract propositions:
The prompt template should include a {text} placeholder, into which the chunk text will be inserted.
The template should return propositions in the following format:
propositionpropositionpropositionExtract topics:
The prompt template should include a {text} placeholder, into which a set of propositions (or the raw chunk text) will be inserted, a {preferred_topics} placeholder, into which a list of topics will be inserted, and a {preferred_entity_classifications} placeholder, into which a liist of entity classifications will be inserted.
The template should return extracted topics, statements, entities and relations in the following format:
topic: topic
entities: entity|classification entity|classification
proposition: [exact proposition text] entity-attribute relationships: entity|RELATIONSHIP|attribute entity|RELATIONSHIP|attribute
entity-entity relationships: entity|RELATIONSHIP|entity entity|RELATIONSHIP|entity
proposition: [exact proposition text] entity-attribute relationships: entity|RELATIONSHIP|attribute entity|RELATIONSHIP|attribute
entity-entity relationships: entity|RELATIONSHIP|entity entity|RELATIONSHIP|entityBatch extraction
Section titled “Batch extraction”You can use Amazon Bedrock batch inference with the extract stage of the indexing process. See Batch Extraction for more details.
BatchConfig (indexing/extract/batch_config.py) accepts the following parameters:
| Parameter | Description | Required |
|---|---|---|
role_arn | ARN of the IAM role Bedrock will assume to run batch jobs | Yes |
region | AWS region where batch jobs will run | Yes |
bucket_name | S3 bucket for batch job input/output | Yes |
key_prefix | S3 key prefix for job files | No |
s3_encryption_key_id | KMS key ID for S3 object encryption | No |
subnet_ids | VPC subnet IDs for the batch job network configuration | No |
security_group_ids | VPC security group IDs | No |
max_batch_size | Maximum records per batch job (Bedrock limit: 50,000; jobs under 100 records are skipped and processed inline) | 25000 |
max_num_concurrent_batches | Maximum concurrent batch jobs per worker | 3 |
delete_on_success | Whether to delete S3 job files after a successful run | True |
Metadata filtering
Section titled “Metadata filtering”You can add metadata to source documents on ingest, and then use this metadata to filter documents during the extract and build stages. Source metadata is also used for metadata filtering when querying a lexical graph. See the Metadata Filtering section for more details.
Versioned updates
Section titled “Versioned updates”The lexical graphs supports versioned updates. With versioned updates, if you re-ingest a document whose contents and/or metadata have changed since it was last extracted, any old documents will be archived, and the newly ingested document treated as the current version of the source document.
Checkpoints
Section titled “Checkpoints”The lexical-graph retries upsert operations and calls to LLMs and embedding models that don’t succeed. However, failures can still happen. If an extract or build stage fails partway through, you typically don’t want to reprocess chunks that have successfully made their way through the entire graph construction pipeline.
To avoid having to reprocess chunks that have been successfully processed in a previous run, provide a Checkpoint instance to the extract_and_build(), extract() and/or build() methods. A checkpoint adds a checkpoint filter to steps in the extract and build stages, and a checkpoint writer to the end of the build stage. When a chunk is emitted from the build stage, after having been successfully handled by both the graph construction and vector indexing handlers, its id will be written to a save point in the graph index extraction_dir. If a chunk with the same id is subsequently introduced into either the extract or build stage, it will be filtered out by the checkpoint filter.
The following example passes a checkpoint to the extract_and_build() method:
from graphrag_toolkit.lexical_graph.indexing.build import Checkpoint
checkpoint = Checkpoint('my-checkpoint')
...
graph_index.extract_and_build(docs, checkpoint=checkpoint)When you create a Checkpoint, you must give it a name. A checkpoint filter will only filter out chunks that were checkpointed by a checkpoint writer with the same name. If you use checkpoints when running separate extract and build processes, ensure the checkpoints have different names. If you use the same name across separate extract and build processes, the build stage will ignore all the chunks created by the extract stage.
Checkpoints do not provide any transactional guarantees. If a chunk is successfully processed by the graph construction handlers, but then fails in a vector indexing handler, it will not make it to the end of the build pipeline, and so will not be checkpointed. If the build stage is restarted, the chunk will be reprocessed by both the graph construction and vector indexing handlers. For stores that support upserts (e.g. Amazon Neptune Database and Amazon Neptune Analytics) this is not an issue.
The lexical-graph does not clean up checkpoints. If you use checkpoints, periodically clean the checkpoint directory of old checkpoint files.