Configuring Batch Extraction
Topics
Overview
BatchConfig parameters
The BatchConfig object manages the configuration settings for Amazon Bedrock batch inference jobs. Here’s a detailed explanation of each parameter:
Required parameters
bucket_name
You must specify the name of an Amazon S3 bucket where your batch processing files (both input and output) will be stored.
region
Section titled “region”You need to provide the AWS Region name (such as “us-east-1”) where both your S3 bucket is located and where the Amazon Bedrock batch inference job will run.
role_arn
This is the Amazon Resource Name (ARN) for the service role that handles batch inference operations. You can either create a default service role through the console or follow the instructions in the Create a service role for batch inference documentation.
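A minimal construction sketch using only the required parameters. The import path, bucket name, and role ARN below are illustrative placeholders, not values from this documentation — check your installed toolkit version for the actual module layout:

```python
# Assumed import path -- verify against your graphrag-toolkit installation.
from graphrag_toolkit.lexical_graph.indexing.extract import BatchConfig

batch_config = BatchConfig(
    bucket_name="my-batch-extraction-bucket",   # S3 bucket for input and output files
    region="us-east-1",                         # Region of both the bucket and the batch job
    role_arn="arn:aws:iam::111122223333:role/BedrockBatchInferenceRole",  # placeholder ARN
)
```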
Optional parameters
key_prefix
If desired, you can specify an S3 key prefix for organizing your input and output files.
max_batch_size
Controls how many records (chunks) can be included in each batch inference job. The default value is 25,000 records.
max_num_concurrent_batches
Determines how many batch inference jobs can run simultaneously per worker. This setting works in conjunction with GraphRAGConfig.extraction_num_workers. The default is 3 concurrent batches per worker.
s3_encryption_key_id
You can provide the unique identifier for an encryption key to secure the output data in S3.
VPC security parameters (optional)
For more information about VPC protection, see Protect batch inference jobs using a VPC.
subnet_ids
An array of subnet IDs within your Virtual Private Cloud (VPC) for protecting batch inference jobs.
security_group_ids
An array of security group IDs within your VPC for protecting batch inference jobs.
File management
delete_on_success
Controls whether input and output JSON files are automatically deleted from the local filesystem after successful batch job completion. By default, this is set to True. Note that this setting does not affect files stored in S3, which are preserved regardless.
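Taken together, a fuller configuration might look like the following sketch. Every keyword corresponds to a parameter described in the sections above, but the import path and all values are illustrative placeholders — adjust them to your installation, account, and VPC:

```python
# Assumed import path -- verify against your graphrag-toolkit installation.
from graphrag_toolkit.lexical_graph.indexing.extract import BatchConfig

batch_config = BatchConfig(
    bucket_name="my-batch-extraction-bucket",                             # placeholder bucket
    region="us-east-1",
    role_arn="arn:aws:iam::111122223333:role/BedrockBatchInferenceRole",  # placeholder ARN
    key_prefix="extraction/run-01",          # group this run's files under one prefix
    max_batch_size=40_000,                   # fewer, larger jobs (default 25000)
    max_num_concurrent_batches=3,            # per worker (default 3)
    s3_encryption_key_id="1234abcd-12ab-34cd-56ef-1234567890ab",  # placeholder KMS key ID
    subnet_ids=["subnet-0abc1234"],          # optional VPC protection
    security_group_ids=["sg-0def5678"],      # optional VPC protection
    delete_on_success=False,                 # keep local JSON files for debugging
)
```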
Optimizing batch extraction performance
The most important settings for controlling batch extraction performance are:
- GraphRAGConfig.extraction_batch_size: Sets how many source documents go to the extraction pipeline. When calculating this value, consider that the total number of chunks (source documents × average chunks per document) should be sufficient to fill your planned simultaneous batch jobs.
- GraphRAGConfig.extraction_num_workers: Sets how many CPUs run batch jobs simultaneously.
- BatchConfig.max_num_concurrent_batches: Sets how many concurrent batch jobs each worker runs.
- BatchConfig.max_batch_size: Sets the maximum number of chunks per batch job.
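The interaction between these settings can be checked with simple arithmetic. The following sketch (plain Python, not part of the toolkit; all numbers are illustrative assumptions) compares the chunks produced by one extraction batch against the capacity of the batch jobs you plan to run:

```python
# Rough sizing check for batch extraction settings. All values are
# illustrative assumptions -- substitute your own corpus statistics.
extraction_batch_size = 1_000       # GraphRAGConfig.extraction_batch_size (documents)
avg_chunks_per_document = 50        # depends on your corpus and chunking settings
extraction_num_workers = 2          # GraphRAGConfig.extraction_num_workers
max_num_concurrent_batches = 3      # BatchConfig.max_num_concurrent_batches
max_batch_size = 25_000             # BatchConfig.max_batch_size (default)

total_chunks = extraction_batch_size * avg_chunks_per_document
planned_jobs = extraction_num_workers * max_num_concurrent_batches
job_capacity = planned_jobs * max_batch_size

print(f"total chunks: {total_chunks}")   # 50000
print(f"job capacity: {job_capacity}")   # 150000
# Here the planned jobs would run underfilled -- raise extraction_batch_size
# or lower the number of simultaneous jobs to fill each one.
print("jobs are full" if total_chunks >= job_capacity else "jobs will be underfilled")
```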
To maximize the efficiency of batch extraction, follow these three key principles:
- Maximize file capacity: Each batch job file can hold up to 50,000 records. However, Amazon Bedrock enforces input file size limits, typically between 1-5 GB; see the Batch inference job size quotas in the Amazon Bedrock service quotas section for the limits particular to the model you are using. Note that the toolkit doesn’t automatically verify file sizes, so jobs may fail if they exceed these quotas. You may need to use fewer records than the maximum limit to stay within file size boundaries. Configure BatchConfig.max_batch_size to set the maximum number of records per batch job.
- Use larger, fewer files: Focus on using a minimal number of large files rather than splitting the work across many smaller ones. For example, it’s more efficient to process 40,000 records in a single job than to divide them into four parallel jobs of 10,000 records each.
- Leverage parallel processing: Take advantage of parallel job execution using GraphRAGConfig.extraction_num_workers and BatchConfig.max_num_concurrent_batches. The total number of jobs (number of workers × number of concurrent batches) must stay within Bedrock’s quota of 20 combined in-progress and submitted batch inference jobs per Region. If you exceed this limit, additional jobs will wait in the queue until capacity becomes available.
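The quota constraint on parallelism can be expressed as a small guard function — a sketch in plain Python (the function name is hypothetical, and the quota value of 20 should be verified against your account's current service quotas):

```python
# Bedrock allows at most 20 combined in-progress and submitted batch
# inference jobs per Region (verify the current quota for your account).
BEDROCK_MAX_INFLIGHT_BATCH_JOBS = 20

def within_bedrock_job_quota(num_workers: int, concurrent_batches_per_worker: int) -> bool:
    """Return True if the planned job fan-out fits within the per-Region quota.

    Exceeding the quota does not fail jobs outright -- extra jobs simply
    queue until capacity frees up -- but staying under it avoids queueing.
    """
    return num_workers * concurrent_batches_per_worker <= BEDROCK_MAX_INFLIGHT_BATCH_JOBS

print(within_bedrock_job_quota(4, 5))   # True: 20 jobs, exactly at the limit
print(within_bedrock_job_quota(4, 6))   # False: 24 jobs, 4 would wait in the queue
```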