Metadata Filtering
Topics
Section titled “Topics”- Overview
- Adding metadata when indexing
- Using metadata to filter queries
- Dates and datetimes
- Using metadata to filter documents in the extract and build stages
- Metadata and document identity
- Metadata filtering and multi-tenancy
Overview
Section titled “Overview”Metadata filtering allows you to retrieve a constrained set of sources, topics and statements based on metadata filters and associated values when querying a lexical graph.
Metadata is any data added to the metadata dictionary of a source document. Depending on the source document, examples of metadata may include title, url, filepath, date published, and author. A source document’s metadata is then associated with any chunks, topics and statements extracted from that document.
There are two parts to metadata filtering:
- Indexing Add metadata to source documents passed to the indexing process
- Querying Supply metadata filters when querying a lexical graph
You can also use metadata filtering to filter documents and chunks during the extract and build stages of the indexing process.
Adding metadata when indexing
Section titled “Adding metadata when indexing”The effectiveness of metadata filtering during querying is dependent on the quality of the metadata attached to source documents during ingestion. Different loaders have different mechanisms for adding metadata to ingested documents. Here are some examples.
Metadata and versioned updates
Section titled “Metadata and versioned updates”The lexical graphs supports versioned updates. With versioned updates, if you re-ingest a document whose contents and/or metadata have changed since it was last extracted, any old documents will be archived, and the newly ingested document treated as the current version of the source document.
Versioned updates uses a concept of version-independent metadata fields to represent a documents’ stable (i.e. version-independent) identify. When you index a document, you can specify which of that document’s metadata fields represent its stable identify. For example, if a document has title, author and last_updated metadata fields, you might specify that a combination of the title and author metadata fields represent that document’s stable identify. When the document is indexed, any previously indexed, non-versioned documents whose title and author field values match those of the newly ingested document will be archived.
When choosing which metadata to add to each source document that you ingest, bear in mind this use of metadata for versioning updates. Try to ensure that one of the fields, or a combination of multiple field values, constitute a stable identity.
Adding metadata to web pages
Section titled “Adding metadata to web pages”The LlamaIndex SimpleWebPageReader accepts a function that takes a url and returns a metadata dictionary. The following example populates the metadata dictionary with the url and the date on which the page was accessed.
from datetime import datefrom llama_index.readers.web import SimpleWebPageReader
doc_urls = [ 'https://docs.aws.amazon.com/neptune/latest/userguide/intro.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html', 'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html']
def web_page_metadata(url): return { 'url': url, 'last_accessed_date': date.today() }
docs = SimpleWebPageReader( html_to_text=True, metadata_fn=web_page_metadata).load_data(doc_urls)Adding metadata to JSON documents
Section titled “Adding metadata to JSON documents”The JSONArrayReader allows you to split a JSON array document into separate documents, one per element in the array, and extract metadata from each sub-document. The following example splits a JSON source document containing news articles into separate documents, one per article. The get_text() and get_metadata() functions extract each article’s body text and associated metadata.
from graphrag_toolkit.lexical_graph.indexing.load import JSONArrayReader
def get_text(data): return data.get('body', '')
def get_metadata(data): return { field : data[field] for field in ['title', 'author', 'source', 'published_date'] if field in data }
docs = JSONArrayReader( text_fn=get_text, metadata_fn=get_metadata).load_data('./articles.json')Adding metadata to PDF documents
Section titled “Adding metadata to PDF documents”The following example shows one way of loading PDF documents and attaching metadata to each document.
from pathlib import Pathfrom pypdf import PdfReaderfrom llama_index.core.schema import Document
def get_pdf_docs(pdf_dir):
pdf_dir_path = Path(pdf_dir)
file_paths = [ file_path for file_path in pdf_dir_path.iterdir() if file_path.is_file() ]
for pdf_path in file_paths: reader = PdfReader(pdf_path) for page_num, page_content in enumerate(reader.pages): doc = Document( text=page_content.extract_text(), metadata={ 'filename': pdf_path.name, 'page_num': page_num } ) yield doc
docs = get_pdf_docs('./pdfs')Restrictions
Section titled “Restrictions”Metadata field values may comprise string, int, float, date and datetime single values. Lists, arrays, sets and nested dictionaries are not supported.
Using metadata to filter queries
Section titled “Using metadata to filter queries”The lexical graph uses the LlamaIndex vector store types MetadataFilters, MetadataFilter, FilterOperator, and FilterCondition to specify filter criteria. You supply these to a query engine in a FilterConfig object. The following example configures a traversal-based retriever to filter the lexical graph based on the url of source documents:
from graphrag_toolkit.lexical_graph import LexicalGraphQueryEnginefrom graphrag_toolkit.lexical_graph.metadata import FilterConfigfrom llama_index.core.vector_stores.types import FilterOperator, MetadataFilter
query_engine = LexicalGraphQueryEngine.for_traversal_based_search( graph_store, vector_store, filter_config = FilterConfig( MetadataFilter( key='url', value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html', operator=FilterOperator.EQ ) ))How are metadata filters applied?
Section titled “How are metadata filters applied?”Metadata filters that you supply to a query engine are applied at two points in the retrieval process:
- The filters are applied to all vector store top-k queries. The vector store is typically used to find starting points for graph traversals: filters therefore effectively constrain a retriever’s entry points into the graph.
- The filters are subsequently applied to all the results returned from the graph.
By its very nature, a graph can often connect disparate sources: traversals can hop from topics and statements belonging to one source, to topics and statements associated with an entirely different source. It’s not sufficient, therefore, to simply limit the starting points for a traversal; the retriever must also filter the results. The benefit of the dual application of a metadata filter is that it restricts the semantic similarity-based lookups that provide the start points of a query to a well-defined set of sources, but then allows the query to access structurally relevant but semantically dissimilar parts of the lexical graph, some of which may be allowed by the filter, some disallowed, before finally constraining the results to only those elements that pass the filter criteria.
Complex and nested filter expressions
Section titled “Complex and nested filter expressions”The constructor of the FilterConfig object accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.
A MetadataFilters object can hold a collection of MetadataFilter objects as well as other, nested MetadataFilters objects. Elements in a MetadataFilters object’s filters collection are chained to form complex conditions using either a FilterCondition.AND or FilterCondition.OR condition.
MetadataFilters also supports a third condition: FilterCondition.NOT. If you use the FilterCondition.NOT condition with a MetadataFilters object, the filters collection of that object must contain a single nested MetadataFilters object.
The following example shows the use of a nested MetadataFilters object to express a complex condition: either the source must be from https://docs.aws.amazon.com/neptune/latest/userguide/intro.html, OR its publication date must fall between 2024-01-01 and 2024-12-31:
FilterConfig( MetadataFilters( filters=[ MetadataFilter( key='url', value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html', operator=FilterOperator.EQ ), MetadataFilters( filters=[ MetadataFilter( key='pub_date', value='2024-01-01', operator=FilterOperator.GT ), MetadataFilter( key='pub_date', value='2024-12-31', operator=FilterOperator.LT ) ], condition=FilterCondition.AND ) ], condition=FilterCondition.OR ))The following example shows the use of a nested MetadataFilters object with a FilterCondition.NOT condition. Even though there is only one MetadataFilter that is being negated here, it must be nested inside a MetadataFilters object.
FilterConfig( MetadataFilters( filters=[ MetadataFilters( filters=[ MetadataFilter( key='url', value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html', operator=FilterOperator.EQ ) ] ) ], condition=FilterCondition.NOT ))Supported filter operators
Section titled “Supported filter operators”The lexical graph supports the following filter operators:
| Operator | Description | Data Types |
|---|---|---|
EQ | Equals – default operator | string, int, float, date/datetime |
GT | Greater than | int, float, date/datetime |
LT | Less than | int, float, date/datetime |
NE | Not equal to | string, int, float, date/datetime |
GTE | Greater than or equal to | int, float, date/datetime |
LTE | Less than or equal to | int, float, date/datetime |
TEXT_MATCH | Full text match (allows you to search for a specific substring, token or phrase within the text field) | string |
TEXT_MATCH_INSENSITIVE | Full text match (case insensitive) | string |
IS_EMPTY | The field does not exist |
The following operators are not supported:
| Operator | Description | Data Types |
|---|---|---|
IN | In array | string or number |
NIN | Not in array | string or number |
ANY | Contains any | array of strings |
ALL | Contains all | array of strings |
CONTAINS | Metadata array contains value (string or number) |
Dates and datetimes
Section titled “Dates and datetimes”Matadata filtering supports filtering by date and datetime values. There are two ways in which you can ensure datetime filtering is applied during indexing and querying:
- Supply Python
dateordatetimeobjects in the metadata fields attached to source documents, and in the metadata filters applied when querying. - Indicate that a field is to be treated as a datetime value by suffixing the field name with
_dateor_datetime. You can then supply eitherdateordatetimeobjects, or string representations of dates and datetime values, when indexing and querying.
In the build stage, Python date and datetime metadata values are converted to ISO-formatted datetime values before being persisted to the graph and vector stores. During querying, Python date and datetime metadata values are similarly converted to ISO-formatted datetime values before being applied in a filter. date and datetime Pyton objects explictly communicate that a value should be treated as a date or datetime. With this approach, you do not need to add a _date or _datetime suffix to a metadata field name. However, you must ensure that date and/or datetime objects are used both during indexing and querying: if one or other of these stages receives a string representation of a date or datetime, filtering may not work as intended.
Metadata fields that end with _date or _datetime are converted to ISO-formatted datetime values before being persisted to the graph and vector stores. Similarly, the values of metadata filters whose keys end with _date or _datetime are converted to ISO-formatted datetime values before being evaluated.
Using metadata to filter documents in the extract and build stages
Section titled “Using metadata to filter documents in the extract and build stages”Besides using metadata filtering to constrain the retrieval process, you can also use it to filter documents during the extract and build stages of the indexing process.
Using metadata filtering in the extract stage
Section titled “Using metadata filtering in the extract stage”You can filter the documents that pass through the extract stage by supplying filter criteria to the extraction_filters of an ExtractionConfig object. extraction_filters accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.
The following example shows how to filter source documents so that only documents with an email metadata field containing an amazon.com email address proceeed through the extraction pipeline. All other source documents will be discarded.
from graphrag_toolkit.lexical_graph import LexicalGraphIndex, ExtractionConfigfrom llama_index.core.vector_stores.types import FilterOperator, MetadataFilter
graph_index = LexicalGraphIndex( graph_store, vector_store, indexing_config=ExtractionConfig( extraction_filters=MetadataFilter( key='email', value='amazon.com', operator=FilterOperator.TEXT_MATCH ) ))Use extraction stage metadata filtering if you only want to extract a lexical graph from a subset of documents, but can’t control which documents are submitted to the ingestion process.
Using metadata filtering in the build stage
Section titled “Using metadata filtering in the build stage”You can filter the documents that are used to build a lexical graph by supplying a BuildFilters object whose source_filters property contains filter criteria to a BuildConfig object. source_filters accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.
The following example shows how to filter extracted documents so that only documents whose url metadata field contains https://docs.aws.amazon.com/neptune/ will proceed through the build pipeline. All other extracted documents will be ignored. The resulting lexical graph is assigned to the neptune tenant.
from graphrag_toolkit.lexical_graph import LexicalGraphIndex, BuildConfigfrom graphrag_toolkit.lexical_graph.indexing.build import BuildFiltersfrom llama_index.core.vector_stores.types import FilterOperator, MetadataFilter
graph_index = LexicalGraphIndex( graph_store, vector_store, indexing_config=BuildConfig( build_filters=BuildFilters( source_filters=MetadataFilter( key='url', value='https://docs.aws.amazon.com/neptune/', operator=FilterOperator.TEXT_MATCH ) ) ), tenant_id='neptune')Build-stage metadata filtering works well in an extract-once, build-many-times workload. You can extract the entire corpus to an S3BasedDocs sink or FileBasedDocs sink (see Run the extract and build stages separately), and then build multiple lexical graphs from the extracted documents. Using different sets of filtering criteria and the multi-tenancy feature, you can build multiple, discrete lexical graphs with different contents from the same underlying sources.
Metadata and document identity
Section titled “Metadata and document identity”The metadata associated with a source document comprises part of that document’s identity. A source document’s id is a function of the contents of the document and the metadata. Chunk, topic and statement ids are in turn a function of the source id. If you change a source document’s metadata (adding or removing fields, or changing field values), and reprocess the document, it will be indexed into new source, chunk, topic and statement nodes in the lexical graph.
Metadata filtering and multi-tenancy
Section titled “Metadata filtering and multi-tenancy”Metadata filtering constrains retrieval to one or more subgraphs within a particular lexical graph. Multi tenancy creates wholly separate lexical graphs within the same underlying graph and vector stores. Metadata filtering and multi-tenancy work well together. As described above, you can use metadata filtering to build different tenant graphs from the same extracted corpus. You can also use metadata filtering and multi tenancy when querying. The following example applies metadata filtering to a query in the context of the neptune tenant’s lexical graph:
from graphrag_toolkit.lexical_graph import LexicalGraphQueryEnginefrom graphrag_toolkit.lexical_graph.metadata import FilterConfigfrom llama_index.core.vector_stores.types import FilterOperator, MetadataFilter
query_engine = LexicalGraphQueryEngine.for_traversal_based_search( graph_store, vector_store, filter_config = FilterConfig( MetadataFilter( key='url', value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html', operator=FilterOperator.EQ ) ), tenant_id='neptune')