Batch Extraction

You can use Amazon Bedrock batch inference in the extract stage of the indexing process to improve extraction performance for large datasets.

See Configuring Batch Extraction for details on configuring batch extraction for large ingests.

Using batch inference with the LexicalGraphIndex

To use batch inference in the extract stage of the indexing process, create a BatchConfig object and supply it to the LexicalGraphIndex as part of the IndexingConfig:

import os

from graphrag_toolkit.lexical_graph import LexicalGraphIndex
from graphrag_toolkit.lexical_graph import GraphRAGConfig, IndexingConfig
from graphrag_toolkit.lexical_graph.storage import GraphStoreFactory
from graphrag_toolkit.lexical_graph.storage import VectorStoreFactory
from graphrag_toolkit.lexical_graph.indexing.extract import BatchConfig

from llama_index.core import SimpleDirectoryReader

def batch_extract_and_load():

    GraphRAGConfig.extraction_batch_size = 1000

    batch_config = BatchConfig(
        region='us-west-2',
        bucket_name='my-bucket',
        key_prefix='batch-extract',
        role_arn='arn:aws:iam::111111111111:role/my-batch-inference-role',
        max_batch_size=40000
    )

    indexing_config = IndexingConfig(batch_config=batch_config)

    with (
        GraphStoreFactory.for_graph_store(os.environ['GRAPH_STORE']) as graph_store,
        VectorStoreFactory.for_vector_store(os.environ['VECTOR_STORE']) as vector_store
    ):
        graph_index = LexicalGraphIndex(
            graph_store,
            vector_store,
            indexing_config=indexing_config
        )

        reader = SimpleDirectoryReader(input_dir='path/to/directory')
        docs = reader.load_data()

        graph_index.extract_and_build(docs, show_progress=True)

batch_extract_and_load()

When using batch extraction, update the GraphRAGConfig.extraction_batch_size configuration parameter so that a large number of source documents are passed to a batch inference job in a single batch. In the example above, GraphRAGConfig.extraction_batch_size has been set to 1000, meaning that 1000 source documents will be chunked simultaneously, and these chunks are then sent to the batch inference job. At 10-50 chunks per document, the batch inference job here will process several thousand records in a single batch, up to a maximum of 40,000 records (the configured max_batch_size value).
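As a rough sizing aid, the following sketch estimates the number of records a single batch will contain; the chunks-per-document figure is an illustrative assumption, not a toolkit default:

# Back-of-envelope sizing for a batch inference job (illustrative numbers).
extraction_batch_size = 1000   # documents chunked per batch (as configured above)
avg_chunks_per_doc = 25        # assumed average; measure your own corpus
max_batch_size = 40000         # the configured BatchConfig.max_batch_size

estimated_records = extraction_batch_size * avg_chunks_per_doc
print(f'Estimated records per batch: {estimated_records}')  # 25000

# If the estimate approaches max_batch_size, reduce extraction_batch_size.
assert estimated_records <= max_batch_size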

Before running batch extraction for the first time, you must fulfill the following prerequisites:

In the examples below, replace <account-id> with your AWS account ID, <region> with the name of the AWS Region where you will be running batch extraction, <model-id> with the ID of the foundation model in Amazon Bedrock that you want to use for batch extraction, and <custom-service-role-arn> with the ARN of your new custom service role.

Create a custom service role for batch inference with the following trust relationship:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "<account-id>"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:<region>:<account-id>:model-invocation-job/*"
                }
            }
        }
    ]
}
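If you script your setup, a minimal boto3 sketch for creating this role might look as follows (the role name my-batch-inference-role is taken from the example above; substitute the placeholders before running):

import json
import boto3

iam = boto3.client('iam')

# Trust relationship from above; substitute <account-id> and <region>.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "bedrock.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {
            "StringEquals": {"aws:SourceAccount": "<account-id>"},
            "ArnEquals": {"aws:SourceArn": "arn:aws:bedrock:<region>:<account-id>:model-invocation-job/*"}
        }
    }]
}

response = iam.create_role(
    RoleName='my-batch-inference-role',
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)
print(response['Role']['Arn'])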

Create and attach a policy to your custom service role that allows access to the Amazon S3 bucket where batch inference input and output files will be stored:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>",
                "arn:aws:s3:::<bucket>/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": [
                        "<account-id>"
                    ]
                }
            }
        }
    ]
}
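Attaching the policy can likewise be scripted; a hedged boto3 sketch (the inline policy name s3-batch-inference-access is illustrative):

import json
import boto3

iam = boto3.client('iam')

# S3 access policy from above; substitute <bucket> and <account-id>.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
        "Resource": ["arn:aws:s3:::<bucket>", "arn:aws:s3:::<bucket>/*"],
        "Condition": {"StringEquals": {"aws:ResourceAccount": ["<account-id>"]}}
    }]
}

# Attach the S3 access policy inline on the custom service role.
iam.put_role_policy(
    RoleName='my-batch-inference-role',
    PolicyName='s3-batch-inference-access',  # illustrative name
    PolicyDocument=json.dumps(s3_policy)
)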

To run batch inference with an inference profile, the service role must also have permission to invoke the inference profile, in addition to the model in each AWS Region covered by the inference profile.

You will also need to update the IAM identity under which the indexing process runs (not the custom service role) to allow it to submit and manage batch inference jobs:

{
    "Version": "2012-10-17",
    "Statement": [
        ...
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelInvocationJob",
                "bedrock:GetModelInvocationJob",
                "bedrock:ListModelInvocationJobs",
                "bedrock:StopModelInvocationJob"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>::foundation-model/<model-id>",
                "arn:aws:bedrock:<region>:<account-id>:model-invocation-job/*"
            ]
        }
    ]
}
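Before running a full ingest, you can sanity-check these permissions by calling one of the granted APIs directly under the indexing identity; a minimal sketch:

import boto3

# Listing batch inference jobs exercises bedrock:ListModelInvocationJobs;
# an AccessDeniedException here indicates the identity policy is missing.
bedrock = boto3.client('bedrock', region_name='us-west-2')

jobs = bedrock.list_model_invocation_jobs(maxResults=5)
for job in jobs.get('invocationJobSummaries', []):
    print(job['jobArn'], job['status'])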

Add the iam:PassRole permission so that the IAM identity under which the indexing process runs can pass the custom service role to Bedrock:

{
    "Effect": "Allow",
    "Action": [
        "iam:PassRole"
    ],
    "Resource": "<custom-service-role-arn>"
}

Each batch extraction job must follow Amazon Bedrock’s batch inference quotas. The lexical-graph’s batch extraction feature uses one input file per job.

  • Each batch job must contain between 100 and 50,000 records
  • Jobs with fewer than 100 records are processed individually, rather than in a batch inference job
  • The feature doesn’t check input file sizes; jobs will fail if they exceed Bedrock quotas

Batch extraction can use multiple workers that trigger concurrent batch jobs:

  • If (workers × concurrent batches) exceeds your account’s Bedrock quotas, jobs will wait until capacity becomes available (see the sketch below)
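As a minimal sketch of staying within quota (the quota value of 20, and the extraction_num_workers and max_num_concurrent_batches parameter names, are assumptions; confirm them against your account’s Service Quotas and the toolkit’s configuration reference):

from graphrag_toolkit.lexical_graph import GraphRAGConfig
from graphrag_toolkit.lexical_graph.indexing.extract import BatchConfig

# Assumed account quota for concurrent Bedrock batch inference jobs;
# check the actual value in the Service Quotas console.
BEDROCK_CONCURRENT_JOB_QUOTA = 20

num_workers = 2
concurrent_batches_per_worker = 3

# Keep workers x concurrent batches at or below the quota so that jobs
# start immediately instead of queueing for capacity.
assert num_workers * concurrent_batches_per_worker <= BEDROCK_CONCURRENT_JOB_QUOTA

GraphRAGConfig.extraction_num_workers = num_workers  # assumed parameter name

batch_config = BatchConfig(
    region='us-west-2',
    bucket_name='my-bucket',
    key_prefix='batch-extract',
    role_arn='arn:aws:iam::111111111111:role/my-batch-inference-role',
    max_num_concurrent_batches=concurrent_batches_per_worker  # assumed parameter name
)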