
Indexing

There are two stages to indexing: extract and build. The lexical-graph uses separate pipelines for each of these stages, plus micro-batching, to provide a continuous ingest capability. This means that your graph will start being populated soon after extraction begins.

You can run the extract and build pipelines together, to provide for the continuous ingest described above. Or you can run the two pipelines separately, extracting first to file-based chunks, and then later building a graph from these chunks.

The LexicalGraphIndex allows you to run the extract and build pipelines together or separately. See the Using the LexicalGraphIndex to construct a graph section below.

Indexing supports multi-tenancy, whereby you can store separate lexical graphs in the same backend graph and vector stores.

The extraction stage is, by default, a three-step process:

  1. The source documents are broken down into chunks.
  2. For each chunk, an LLM extracts a set of propositions from the unstructured content. This proposition extraction helps ‘clean’ the content and improve the subsequent entity/topic/statement/fact extraction by breaking complex sentences into simpler sentences, replacing pronouns with specific names, and expanding acronyms where possible. These propositions are added to the chunk’s metadata under the aws::graph::propositions key.
  3. Following the proposition extraction, a second LLM call extracts entities, relations, topics, statements and facts from the set of extracted propositions. These details are added to the chunk’s metadata under the aws::graph::topics key.

Only the third step here is mandatory. If your source data has already been chunked, you can omit step 1. If you’re willing to trade a reduction in the quality of the entity/topic/statement/fact extraction for fewer LLM calls and improved performance, you can omit step 2.
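
As a rough sketch of where these results land (the node handling here is hypothetical, and the exact shape of the metadata values is an assumption; inspect your own extracted nodes to confirm):

# Hypothetical: iterate over chunk nodes emitted by the extract stage
for node in extracted_nodes:
    propositions = node.metadata.get('aws::graph::propositions')  # step 2 output
    topics = node.metadata.get('aws::graph::topics')              # step 3 output
    print(node.node_id, propositions, topics)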

Extraction uses a lightly guided strategy whereby the extraction process is seeded with a list of preferred entity classifications. The LLM is instructed to use an existing classification from the list before creating new ones. Any new classifications introduced by the LLM are then carried forward to subsequent invocations. This approach reduces but doesn’t eliminate unwanted variations in entity classification.

The list of DEFAULT_ENTITY_CLASSIFICATIONS used to seed the extraction process can be found here. If these classifications are not appropriate to your workload you can replace them (see the Configuring the extract and build stages section below).
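
For example, a minimal sketch of replacing the defaults via ExtractionConfig (the classification values are illustrative, and the Python list form assumes the same type as DEFAULT_ENTITY_CLASSIFICATIONS):

from graphrag_toolkit.lexical_graph import IndexingConfig, ExtractionConfig

# Illustrative classifications for a financial-services workload
indexing_config = IndexingConfig(
    extraction=ExtractionConfig(
        preferred_entity_classifications=['Company', 'Product', 'Regulation', 'Person']
    )
)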

Relationship values are currently unguided (though relatively concise).

In the build stage, the LlamaIndex chunk nodes emitted from the extract stage are broken down further into a stream of individual source, chunk, topic, statement and fact LlamaIndex nodes. Graph construction and vector indexing handlers process these nodes to build and index the graph content. Each of these nodes has an aws::graph::index metadata item containing data that can be used to index the node in a vector store (though only the chunk and statement nodes are actually indexed in the current implementation).

Using the LexicalGraphIndex to construct a graph


The LexicalGraphIndex provides a convenient means of constructing a graph – via either continuous ingest or separate extract and build stages. When constructing a LexicalGraphIndex you must supply a graph store and a vector store (see Storage Model for more details). In the examples below, the graph store and vector store connection strings are fetched from environment variables.

The LexicalGraphIndex constructor has an extraction_dir named argument. This is the path to a local directory to which intermediate artefacts (such as checkpoints) will be written. By default, the value of extraction_dir is set to the value of GraphRAGConfig.local_output_dir, which defaults to 'output'. For containerized deployments (EKS/Kubernetes), you can configure this via the LOCAL_OUTPUT_DIR environment variable or by setting GraphRAGConfig.local_output_dir programmatically. See Configuration for more details.
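
For example, a sketch of overriding the default directory (the paths are illustrative, and graph_store and vector_store are assumed to have been constructed already):

from graphrag_toolkit.lexical_graph import GraphRAGConfig, LexicalGraphIndex

# Set the output directory globally...
GraphRAGConfig.local_output_dir = '/mnt/shared/graphrag'  # illustrative path

# ...or per index, via the extraction_dir named argument
graph_index = LexicalGraphIndex(
    graph_store,
    vector_store,
    extraction_dir='/mnt/shared/graphrag'
)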

Use LexicalGraphIndex.extract_and_build() to extract and build a graph in a manner that supports continuous ingest.

The extraction stage consumes LlamaIndex nodes – either documents, which will be chunked during extraction, or pre-chunked text nodes. Use a LlamaIndex reader to load source documents. The example below uses a LlamaIndex SimpleWebPageReader to load several HTML pages.

continuous_ingest.py
import os

from graphrag_toolkit.lexical_graph import LexicalGraphIndex
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory

from llama_index.readers.web import SimpleWebPageReader

doc_urls = [
    'https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html'
]

docs = SimpleWebPageReader(
    html_to_text=True,
    metadata_fn=lambda url: {'url': url}
).load_data(doc_urls)

with (
    GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
    VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
):
    graph_index = LexicalGraphIndex(
        graph_store,
        vector_store
    )

    graph_index.extract_and_build(docs)

The diff below shows what changes when you split the pipelines:

-graph_index.extract_and_build(docs)
+extracted_docs = S3BasedDocs(
+    region=os.environ['AWS_REGION'],
+    bucket_name=os.environ['EXTRACTION_BUCKET'],
+    key_prefix='extracted',
+)
+
+graph_index.extract(docs, handler=extracted_docs, show_progress=True)
+graph_index.build(extracted_docs, show_progress=True)

Run the extract and build stages separately


Using the LexicalGraphIndex you can perform the extract and build stages separately. This is useful if you want to extract the graph once, and then build it multiple times (in different environments, for example).

When you run the extract and build stages separately, you can persist the extracted documents to Amazon S3 or to the filesystem at the end of the extract stage, and then consume these same documents in the build stage. Use the graphrag-toolkit’s S3BasedDocs and FileBasedDocs classes to persist and then retrieve JSON-serialized LlamaIndex nodes.

The following example shows how to use an S3BasedDocs handler to persist extracted documents to an Amazon S3 bucket at the end of the extract stage:

import os

from graphrag_toolkit.lexical_graph import LexicalGraphIndex
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from graphrag_toolkit.lexical_graph.indexing.load import S3BasedDocs

from llama_index.readers.web import SimpleWebPageReader

extracted_docs = S3BasedDocs(
    region='us-east-1',
    bucket_name='my-bucket',
    key_prefix='extracted',
    collection_id='12345'
)

with (
    GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
    VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
):
    graph_index = LexicalGraphIndex(
        graph_store,
        vector_store
    )

    doc_urls = [
        'https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
        'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html',
        'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html',
        'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html'
    ]

    docs = SimpleWebPageReader(
        html_to_text=True,
        metadata_fn=lambda url: {'url': url}
    ).load_data(doc_urls)

    graph_index.extract(docs, handler=extracted_docs)

Following the extract stage, you can then build the graph from the previously extracted documents. Whereas in the extract stage the S3BasedDocs object acted as a handler to persist extracted documents, in the build stage the S3BasedDocs object acts as a source of LlamaIndex nodes, and is thus passed as the first argument to the build() method:

import os

from graphrag_toolkit.lexical_graph import LexicalGraphIndex
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from graphrag_toolkit.lexical_graph.indexing.load import S3BasedDocs

docs = S3BasedDocs(
    region='us-east-1',
    bucket_name='my-bucket',
    key_prefix='extracted',
    collection_id='12345'
)

with (
    GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
    VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
):
    graph_index = LexicalGraphIndex(
        graph_store,
        vector_store
    )

    graph_index.build(docs)

The S3BasedDocs object has the following parameters:

| Parameter | Description | Mandatory |
| --- | --- | --- |
| region | AWS Region in which the S3 bucket is located (e.g. us-east-1) | Yes |
| bucket_name | Amazon S3 bucket name | Yes |
| key_prefix | S3 key prefix | Yes |
| collection_id | Id for a particular collection of extracted documents. If no collection_id is supplied, the lexical-graph will create a timestamp value. Extracted documents will be written to s3://<bucket>/<key_prefix>/<collection_id>/. | No |
| s3_encryption_key_id | KMS key id (Key ID, Key ARN, or Key Alias) to use for object encryption. If no s3_encryption_key_id is supplied, the lexical-graph will encrypt objects in S3 using Amazon S3 managed keys. | No |

If you use Amazon Web Services KMS keys to encrypt objects in S3, the identity under which the lexical-graph runs should include the following IAM policy. Replace <kms-key-arn> with the ARN of the KMS key you want to use to encrypt objects:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "kms:GenerateDataKey",
                "kms:Decrypt"
            ],
            "Resource": [
                "<kms-key-arn>"
            ],
            "Effect": "Allow"
        }
    ]
}
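
With the policy in place, pass the key to the S3BasedDocs constructor. A sketch (the <kms-key-arn> value is a placeholder, as above):

extracted_docs = S3BasedDocs(
    region='us-east-1',
    bucket_name='my-bucket',
    key_prefix='extracted',
    s3_encryption_key_id='<kms-key-arn>'  # Key ID, Key ARN, or Key Alias
)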

If you want to persist extracted documents to the local filesystem instead of an S3 bucket, use a FileBasedDocs object instead:

from graphrag_toolkit.lexical_graph.indexing.load import FileBasedDocs

extracted_docs = FileBasedDocs(
    docs_directory='./extracted/',
    collection_id='12345'
)

The FileBasedDocs object has the following parameters:

| Parameter | Description | Mandatory |
| --- | --- | --- |
| docs_directory | Root directory for the extracted documents | Yes |
| collection_id | Id for a particular collection of extracted documents. If no collection_id is supplied, the lexical-graph will create a timestamp value. Extracted documents will be written to /<docs_directory>/<collection_id>/. | No |

Configuring the extract and build stages

You can configure the number of workers and batch sizes for the extract and build stages of the LexicalGraphIndex using the GraphRAGConfig object. See Configuration for more details on using the configuration object.
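
As a sketch (the attribute names below are assumptions based on this pattern; see Configuration for the authoritative names):

from graphrag_toolkit.lexical_graph import GraphRAGConfig

# Attribute names assumed; check the Configuration docs for your version
GraphRAGConfig.extraction_num_workers = 4
GraphRAGConfig.extraction_batch_size = 100
GraphRAGConfig.build_num_workers = 4
GraphRAGConfig.build_batch_size = 50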

Besides configuring the workers and batch sizes, you can also configure the indexing process with regard to chunking, proposition extraction and entity classification, and graph and vector store contents by passing an instance of IndexingConfig to the LexicalGraphIndex constructor:

from graphrag_toolkit.lexical_graph import LexicalGraphIndex, IndexingConfig, ExtractionConfig

...

graph_index = LexicalGraphIndex(
    graph_store,
    vector_store,
    indexing_config=IndexingConfig(
        chunking=None,
        extraction=ExtractionConfig(
            enable_proposition_extraction=False
        )
    )
)

The IndexingConfig object has the following parameters:

| Parameter | Description | Default Value |
| --- | --- | --- |
| chunking | A list of node parsers (e.g. LlamaIndex SentenceSplitter) to be used for chunking source documents. Set chunking to None to skip chunking. | SentenceSplitter with chunk_size=256 and chunk_overlap=25 |
| extraction | An ExtractionConfig object specifying extraction options | ExtractionConfig with default values |
| build | A BuildConfig object specifying build options | BuildConfig with default values |
| batch_config | Batch configuration to be used if performing batch extraction. If batch_config is None, the toolkit will perform chunk-by-chunk extraction. | None |
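
For example, a sketch of overriding the default chunking (the sizes are illustrative):

from llama_index.core.node_parser import SentenceSplitter
from graphrag_toolkit.lexical_graph import IndexingConfig

# Larger chunks than the default 256/25
indexing_config = IndexingConfig(
    chunking=[SentenceSplitter(chunk_size=512, chunk_overlap=50)]
)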

The ExtractionConfig object has the following parameters:

| Parameter | Description | Default Value |
| --- | --- | --- |
| enable_proposition_extraction | Perform proposition extraction before extracting topics, statements, facts and entities | True |
| preferred_entity_classifications | Comma-separated list of preferred entity classifications used to seed the entity extraction | DEFAULT_ENTITY_CLASSIFICATIONS |
| preferred_topics | List of preferred topic names (or a callable that returns them) supplied to the LLM to seed topic extraction. Accepts the same type as preferred_entity_classifications. | [] |
| infer_entity_classifications | Determines whether to pre-process documents to identify significant domain entity classifications. Supply either True or False, or an InferClassificationsConfig object. When True, an InferClassifications step runs as a pre-processor before the main extraction loop: one extra LLM round-trip per batch, not per document. | False |
| extract_propositions_prompt_template | Prompt used to extract propositions from chunks. If None, the default extract propositions template is used. See Custom prompts below. | None |
| extract_topics_prompt_template | Prompt used to extract topics, statements and entities from chunks. If None, the default extract topics template is used. See Custom prompts below. | None |
| extraction_llm | LLM used to perform extraction and infer classifications. Accepts the model id of an Amazon Bedrock model, an Amazon Bedrock inference profile, a JSON string representation of a LlamaIndex BedrockConverse instance, or an instance of a LlamaIndex LLM object (see the LLM configuration section for more details). If None, the GraphRAGConfig.extraction_llm configuration parameter is used. | None |
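
For example, a sketch that turns on classification inference and overrides the extraction LLM (the model id is a placeholder):

extraction_config = ExtractionConfig(
    infer_entity_classifications=True,
    extraction_llm='<bedrock-model-id>'  # placeholder: model id, inference profile, or LLM instance
)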

The BuildConfig object has the following parameters:

| Parameter | Description | Default Value |
| --- | --- | --- |
| build_filters | A BuildFilters object to include or exclude specific node types during the build stage | BuildFilters() |
| include_domain_labels | Whether to add a domain-specific label (e.g. Company) to entity nodes in addition to __Entity__ | None (falls back to GraphRAGConfig.include_domain_labels) |
| include_local_entities | Whether to include local-context entities in the graph | None (falls back to GraphRAGConfig.include_local_entities) |
| source_metadata_formatter | A SourceMetadataFormatter instance for customising source metadata written to the graph | DefaultSourceMetadataFormatter() |
| enable_versioning | Whether to enable versioned updates. Overrides GraphRAGConfig.enable_versioning when set. | None |
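
For example, a sketch of supplying a BuildConfig (the BuildConfig import path is an assumption; adjust for your version of the toolkit):

from graphrag_toolkit.lexical_graph import IndexingConfig
from graphrag_toolkit.lexical_graph.indexing.build import BuildConfig  # import path assumed

indexing_config = IndexingConfig(
    build=BuildConfig(
        include_domain_labels=True,
        include_local_entities=False
    )
)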

The InferClassificationsConfig object has the following parameters:

| Parameter | Description | Default Value |
| --- | --- | --- |
| num_iterations | Number of times to run the pre-processing over the source documents | 1 |
| num_samples | Number of chunks (selected at random) from which classifications are extracted per iteration | 5 |
| prompt_template | Prompt used to extract classifications from sampled chunks. If None, the default domain entity classifications template is used. See Custom prompts below. | None |
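
For example, a sketch that samples more chunks over two iterations (the InferClassificationsConfig import path is an assumption):

from graphrag_toolkit.lexical_graph import ExtractionConfig
from graphrag_toolkit.lexical_graph.indexing.extract import InferClassificationsConfig  # import path assumed

extraction_config = ExtractionConfig(
    infer_entity_classifications=InferClassificationsConfig(
        num_iterations=2,
        num_samples=10
    )
)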

Custom prompts

The extract stage uses up to three LLM prompts:

  • Domain entity classifications: Extracts significant domain entity classifications from a sample of source documents prior to processing the documents. These classifications are then supplied to the extract topics prompt as the list of preferred entity classifications.
  • Extract propositions: Extracts a set of standalone, well-formed propositions from a chunk.
  • Extract topics: Extracts topics, statements and entities and their relations from either a set of propositions, or from the raw chunk text.

Using the ExtractionConfig and InferClassificationsConfig objects, you can customize one or more of these prompts.
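
For example, a sketch of supplying a custom propositions prompt (the template text is illustrative; it must include the {text} placeholder described below):

CUSTOM_PROPOSITIONS_TEMPLATE = """Rewrite the following text as a list of simple,
standalone propositions, one proposition per line:

{text}"""

extraction_config = ExtractionConfig(
    extract_propositions_prompt_template=CUSTOM_PROPOSITIONS_TEMPLATE
)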

Domain entity classifications:

The prompt template should include a {text_chunks} placeholder, into which the sampled chunks will be inserted.

The template should return classifications in the following format:

<entity_classifications>
Classification1
Classification2
Classification3
</entity_classifications>

Extract propositions:

The prompt template should include a {text} placeholder, into which the chunk text will be inserted.

The template should return propositions in the following format:

proposition
proposition
proposition

Extract topics:

The prompt template should include a {text} placeholder, into which a set of propositions (or the raw chunk text) will be inserted; a {preferred_topics} placeholder, into which a list of topics will be inserted; and a {preferred_entity_classifications} placeholder, into which a list of entity classifications will be inserted.

The template should return extracted topics, statements, entities and relations in the following format:

topic: topic
entities:
entity|classification
entity|classification
proposition: [exact proposition text]
entity-attribute relationships:
entity|RELATIONSHIP|attribute
entity|RELATIONSHIP|attribute
entity-entity relationships:
entity|RELATIONSHIP|entity
entity|RELATIONSHIP|entity
proposition: [exact proposition text]
entity-attribute relationships:
entity|RELATIONSHIP|attribute
entity|RELATIONSHIP|attribute
entity-entity relationships:
entity|RELATIONSHIP|entity
entity|RELATIONSHIP|entity

You can use Amazon Bedrock batch inference with the extract stage of the indexing process. See Batch Extraction for more details.

BatchConfig (indexing/extract/batch_config.py) accepts the following parameters:

| Parameter | Description | Required |
| --- | --- | --- |
| role_arn | ARN of the IAM role Bedrock will assume to run batch jobs | Yes |
| region | AWS region where batch jobs will run | Yes |
| bucket_name | S3 bucket for batch job input/output | Yes |
| key_prefix | S3 key prefix for job files | No |
| s3_encryption_key_id | KMS key ID for S3 object encryption | No |
| subnet_ids | VPC subnet IDs for the batch job network configuration | No |
| security_group_ids | VPC security group IDs | No |
| max_batch_size | Maximum records per batch job (Bedrock limit: 50,000; jobs under 100 records are skipped and processed inline) | 25000 |
| max_num_concurrent_batches | Maximum concurrent batch jobs per worker | 3 |
| delete_on_success | Whether to delete S3 job files after a successful run | True |
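
For example, a sketch of enabling batch extraction (the import path follows the file location given above but is an assumption; the role ARN and bucket name are placeholders):

from graphrag_toolkit.lexical_graph import IndexingConfig
from graphrag_toolkit.lexical_graph.indexing.extract import BatchConfig  # import path assumed

indexing_config = IndexingConfig(
    batch_config=BatchConfig(
        role_arn='<batch-inference-role-arn>',
        region='us-east-1',
        bucket_name='<batch-bucket-name>'
    )
)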

You can add metadata to source documents on ingest, and then use this metadata to filter documents during the extract and build stages. Source metadata is also used for metadata filtering when querying a lexical graph. See the Metadata Filtering section for more details.

The lexical-graph supports versioned updates. With versioned updates, if you re-ingest a document whose contents and/or metadata have changed since it was last extracted, the old version is archived and the newly ingested document is treated as the current version of the source document.
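
A sketch of enabling versioned updates globally (GraphRAGConfig.enable_versioning is the fallback named in the BuildConfig table above):

from graphrag_toolkit.lexical_graph import GraphRAGConfig

GraphRAGConfig.enable_versioning = True  # or per index: BuildConfig(enable_versioning=True)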

The lexical-graph retries upsert operations and calls to LLMs and embedding models that don’t succeed. However, failures can still happen. If an extract or build stage fails partway through, you typically don’t want to reprocess chunks that have successfully made their way through the entire graph construction pipeline.

To avoid having to reprocess chunks that were successfully processed in a previous run, provide a Checkpoint instance to the extract_and_build(), extract() and/or build() methods. A checkpoint adds a checkpoint filter to steps in the extract and build stages, and a checkpoint writer to the end of the build stage. When a chunk is emitted from the build stage, after having been successfully handled by both the graph construction and vector indexing handlers, its id is written to a save point in the graph index’s extraction_dir. If a chunk with the same id is subsequently introduced into either the extract or build stage, it will be filtered out by the checkpoint filter.

The following example passes a checkpoint to the extract_and_build() method:

from graphrag_toolkit.lexical_graph.indexing.build import Checkpoint
checkpoint = Checkpoint('my-checkpoint')
...
graph_index.extract_and_build(docs, checkpoint=checkpoint)

When you create a Checkpoint, you must give it a name. A checkpoint filter will only filter out chunks that were checkpointed by a checkpoint writer with the same name. If you use checkpoints when running separate extract and build processes, ensure the checkpoints have different names. If you use the same name across separate extract and build processes, the build stage will ignore all the chunks created by the extract stage.
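
For example, a sketch of using differently named checkpoints for separate extract and build runs (the names are illustrative):

from graphrag_toolkit.lexical_graph.indexing.build import Checkpoint

extract_checkpoint = Checkpoint('extract-2025-01')  # illustrative name
build_checkpoint = Checkpoint('build-2025-01')      # illustrative name

graph_index.extract(docs, handler=extracted_docs, checkpoint=extract_checkpoint)
graph_index.build(extracted_docs, checkpoint=build_checkpoint)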

Checkpoints do not provide any transactional guarantees. If a chunk is successfully processed by the graph construction handlers, but then fails in a vector indexing handler, it will not make it to the end of the build pipeline, and so will not be checkpointed. If the build stage is restarted, the chunk will be reprocessed by both the graph construction and vector indexing handlers. For stores that support upserts (e.g. Amazon Neptune Database and Amazon Neptune Analytics) this is not an issue.

The lexical-graph does not clean up checkpoints. If you use checkpoints, periodically clean the checkpoint directory of old checkpoint files.