Metadata Filtering

Metadata filtering allows you to retrieve a constrained set of sources, topics and statements based on metadata filters and associated values when querying a lexical graph.

Metadata is any data added to the metadata dictionary of a source document. Depending on the source document, examples of metadata may include title, url, filepath, date published, and author. A source document’s metadata is then associated with any chunks, topics and statements extracted from that document.

There are two parts to metadata filtering:

  • Indexing: Add metadata to source documents passed to the indexing process
  • Querying: Supply metadata filters when querying a lexical graph

You can also use metadata filtering to filter documents and chunks during the extract and build stages of the indexing process.

The effectiveness of metadata filtering during querying depends on the quality of the metadata attached to source documents during ingestion. Different loaders have different mechanisms for adding metadata to ingested documents. Examples for web pages, JSON documents and PDF documents appear below.

The lexical graph supports versioned updates. With versioned updates, if you re-ingest a document whose contents and/or metadata have changed since it was last extracted, any old documents will be archived, and the newly ingested document treated as the current version of the source document.

Versioned updates use the concept of version-independent metadata fields to represent a document’s stable (i.e. version-independent) identity. When you index a document, you can specify which of that document’s metadata fields represent its stable identity. For example, if a document has title, author and last_updated metadata fields, you might specify that the combination of the title and author metadata fields represents that document’s stable identity. When the document is indexed, any previously indexed, non-versioned documents whose title and author field values match those of the newly ingested document will be archived.

When choosing which metadata to add to each source document that you ingest, bear in mind this use of metadata for versioned updates. Try to ensure that one of the fields, or a combination of multiple field values, constitutes a stable identity.
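
The idea of a stable identity can be sketched in plain Python. This is a minimal illustration only, not the toolkit's implementation: the helper names stable_identity and archive_previous_versions, the dictionary layout, and the archived flag are all hypothetical and chosen for this example.

```python
from datetime import date

def stable_identity(metadata, id_fields):
    """Build a version-independent key from the chosen metadata fields."""
    return tuple((field, metadata.get(field)) for field in sorted(id_fields))

def archive_previous_versions(indexed_docs, new_doc, id_fields):
    """Archive current documents that share the new document's stable identity,
    then add the new document as the current version (hypothetical helper)."""
    new_key = stable_identity(new_doc['metadata'], id_fields)
    for doc in indexed_docs:
        same_identity = stable_identity(doc['metadata'], id_fields) == new_key
        if not doc['archived'] and same_identity:
            doc['archived'] = True
    indexed_docs.append({**new_doc, 'archived': False})
    return indexed_docs

indexed = [{
    'text': 'First draft',
    'metadata': {
        'title': 'Graph basics',
        'author': 'A. Writer',
        'last_updated': date(2024, 1, 5)
    },
    'archived': False
}]

updated = {
    'text': 'Second draft',
    'metadata': {
        'title': 'Graph basics',
        'author': 'A. Writer',
        'last_updated': date(2024, 6, 1)
    }
}

# last_updated differs between versions, so it is left out of the identity
archive_previous_versions(indexed, updated, id_fields=['title', 'author'])
versions = [(doc['text'], doc['archived']) for doc in indexed]
```

Because last_updated changes from one version to the next, including it in the identity fields would prevent the earlier version from being matched and archived.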

The LlamaIndex SimpleWebPageReader accepts a function that takes a url and returns a metadata dictionary. The following example populates the metadata dictionary with the url and the date on which the page was accessed.

from datetime import date
from llama_index.readers.web import SimpleWebPageReader

doc_urls = [
    'https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html'
]

def web_page_metadata(url):
    return {
        'url': url,
        'last_accessed_date': date.today()
    }

docs = SimpleWebPageReader(
    html_to_text=True,
    metadata_fn=web_page_metadata
).load_data(doc_urls)

The JSONArrayReader allows you to split a JSON array document into separate documents, one per element in the array, and extract metadata from each sub-document. The following example splits a JSON source document containing news articles into separate documents, one per article. The get_text() and get_metadata() functions extract each article’s body text and associated metadata.

from graphrag_toolkit.lexical_graph.indexing.load import JSONArrayReader

def get_text(data):
    return data.get('body', '')

def get_metadata(data):
    return {
        field: data[field]
        for field in ['title', 'author', 'source', 'published_date']
        if field in data
    }

docs = JSONArrayReader(
    text_fn=get_text,
    metadata_fn=get_metadata
).load_data('./articles.json')

The following example shows one way of loading PDF documents and attaching metadata to each document.

from pathlib import Path
from pypdf import PdfReader
from llama_index.core.schema import Document

def get_pdf_docs(pdf_dir):
    pdf_dir_path = Path(pdf_dir)
    file_paths = [
        file_path for file_path in pdf_dir_path.iterdir()
        if file_path.is_file()
    ]
    for pdf_path in file_paths:
        reader = PdfReader(pdf_path)
        for page_num, page_content in enumerate(reader.pages):
            doc = Document(
                text=page_content.extract_text(),
                metadata={
                    'filename': pdf_path.name,
                    'page_num': page_num
                }
            )
            yield doc

docs = get_pdf_docs('./pdfs')

Metadata field values must be single values of type string, int, float, date or datetime. Lists, arrays, sets and nested dictionaries are not supported.
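
Before handing documents to the indexing process, you may want to check their metadata against this restriction. The following is a minimal sketch; the unsupported_fields helper is a hypothetical name introduced here and is not part of the toolkit.

```python
from datetime import date, datetime

# Single-value types listed as supported for metadata field values
SUPPORTED_TYPES = (str, int, float, date, datetime)

def unsupported_fields(metadata):
    """Return the names of fields whose values are not single supported values."""
    return [
        key for key, value in metadata.items()
        if not isinstance(value, SUPPORTED_TYPES)
    ]

metadata = {
    'title': 'Intro',
    'page_num': 3,
    'published_date': date(2024, 5, 1),
    'tags': ['graph', 'rag'],      # list: not supported
    'extra': {'nested': 'value'}   # nested dictionary: not supported
}

bad = unsupported_fields(metadata)
```

One way to deal with a rejected field such as tags is to flatten it into a single string value before indexing, at the cost of only being able to filter on it with text matching.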

The lexical graph uses the LlamaIndex vector store types MetadataFilters, MetadataFilter, FilterOperator, and FilterCondition to specify filter criteria. You supply these to a query engine in a FilterConfig object. The following example configures a traversal-based retriever to filter the lexical graph based on the url of source documents:

from graphrag_toolkit.lexical_graph import LexicalGraphQueryEngine
from graphrag_toolkit.lexical_graph.metadata import FilterConfig
from llama_index.core.vector_stores.types import FilterOperator, MetadataFilter

query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    filter_config=FilterConfig(
        MetadataFilter(
            key='url',
            value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
            operator=FilterOperator.EQ
        )
    )
)

Metadata filters that you supply to a query engine are applied at two points in the retrieval process:

  • The filters are applied to all vector store top-k queries. The vector store is typically used to find starting points for graph traversals: filters therefore effectively constrain a retriever’s entry points into the graph.
  • The filters are subsequently applied to all the results returned from the graph.

By its very nature, a graph can often connect disparate sources: traversals can hop from topics and statements belonging to one source, to topics and statements associated with an entirely different source. It’s not sufficient, therefore, to simply limit the starting points for a traversal; the retriever must also filter the results. The benefit of the dual application of a metadata filter is that it restricts the semantic similarity-based lookups that provide the start points of a query to a well-defined set of sources, but then allows the query to access structurally relevant but semantically dissimilar parts of the lexical graph, some of which may be allowed by the filter, some disallowed, before finally constraining the results to only those elements that pass the filter criteria.
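
The dual application of a filter can be illustrated with a toy graph in plain Python. This is a sketch of the idea only, under invented data: the statements, edges, example.com URLs and the retrieve function are all hypothetical and do not reflect how the toolkit's retrievers are implemented.

```python
# Toy statements keyed by id; each carries the url of its source document.
statements = {
    's1': {'url': 'https://example.com/a'},
    's2': {'url': 'https://example.com/b'},
    's3': {'url': 'https://example.com/a'},
    's4': {'url': 'https://example.com/b'},
}

# Toy graph edges between statements (s1 -> s2 -> s3, s4 -> s3).
edges = {'s1': ['s2'], 's2': ['s3'], 's3': [], 's4': ['s3']}

def passes(statement_id, allowed_url):
    """Stand-in for a metadata filter: match on the source url."""
    return statements[statement_id]['url'] == allowed_url

def retrieve(top_k_candidates, allowed_url, max_hops=2):
    # 1. Apply the filter to the vector store results (entry points).
    starts = [s for s in top_k_candidates if passes(s, allowed_url)]

    # 2. Traverse without filtering, so the walk may pass through
    #    statements that belong to disallowed sources.
    visited = set(starts)
    frontier = list(starts)
    for _ in range(max_hops):
        frontier = [
            neighbour
            for s in frontier
            for neighbour in edges[s]
            if neighbour not in visited
        ]
        visited.update(frontier)

    # 3. Apply the filter again to everything the traversal returned.
    return sorted(s for s in visited if passes(s, allowed_url))

results = retrieve(['s1', 's4'], 'https://example.com/a')
```

Here s4 is rejected as an entry point, s3 is reached only by passing through s2 (which belongs to the other source), and s2 itself is dropped from the final results.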

The constructor of the FilterConfig object accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.

A MetadataFilters object can hold a collection of MetadataFilter objects as well as other, nested MetadataFilters objects. Elements in a MetadataFilters object’s filters collection are chained to form complex conditions using either a FilterCondition.AND or FilterCondition.OR condition.

MetadataFilters also supports a third condition: FilterCondition.NOT. If you use the FilterCondition.NOT condition with a MetadataFilters object, the filters collection of that object must contain a single nested MetadataFilters object.

The following example shows the use of a nested MetadataFilters object to express a complex condition: either the source must be from https://docs.aws.amazon.com/neptune/latest/userguide/intro.html, OR its publication date must fall between 2024-01-01 and 2024-12-31:

from graphrag_toolkit.lexical_graph.metadata import FilterConfig
from llama_index.core.vector_stores.types import (
    FilterCondition,
    FilterOperator,
    MetadataFilter,
    MetadataFilters
)

FilterConfig(
    MetadataFilters(
        filters=[
            MetadataFilter(
                key='url',
                value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
                operator=FilterOperator.EQ
            ),
            MetadataFilters(
                filters=[
                    MetadataFilter(
                        key='pub_date',
                        value='2024-01-01',
                        operator=FilterOperator.GT
                    ),
                    MetadataFilter(
                        key='pub_date',
                        value='2024-12-31',
                        operator=FilterOperator.LT
                    )
                ],
                condition=FilterCondition.AND
            )
        ],
        condition=FilterCondition.OR
    )
)

The following example shows the use of a nested MetadataFilters object with a FilterCondition.NOT condition. Even though there is only one MetadataFilter that is being negated here, it must be nested inside a MetadataFilters object.

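from graphrag_toolkit.lexical_graph.metadata import FilterConfig
from llama_index.core.vector_stores.types import (
    FilterCondition,
    FilterOperator,
    MetadataFilter,
    MetadataFilters
)

FilterConfig(
    MetadataFilters(
        filters=[
            MetadataFilters(
                filters=[
                    MetadataFilter(
                        key='url',
                        value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
                        operator=FilterOperator.EQ
                    )
                ]
            )
        ],
        condition=FilterCondition.NOT
    )
)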
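
To make the AND, OR and NOT semantics concrete, the following sketch evaluates a nested filter tree against a single document's metadata. It uses plain dictionaries rather than the LlamaIndex classes, and the matches function, the dictionary layout and the example.com URLs are hypothetical. It is meant to show how nested conditions compose, not how the toolkit evaluates filters against its stores.

```python
from datetime import date

def matches(node, metadata):
    """Evaluate a toy filter tree against one document's metadata."""
    if 'key' in node:
        # Leaf filter: compare one metadata field with the filter value
        value = metadata.get(node['key'])
        if value is None:
            return False
        operator = node.get('operator', 'EQ')
        if operator == 'EQ':
            return value == node['value']
        if operator == 'GT':
            return value > node['value']
        if operator == 'LT':
            return value < node['value']
        raise ValueError(f'Operator not covered by this sketch: {operator}')

    # Group of filters: evaluate the children, then combine them
    results = [matches(child, metadata) for child in node['filters']]
    condition = node.get('condition', 'AND')
    if condition == 'AND':
        return all(results)
    if condition == 'OR':
        return any(results)
    if condition == 'NOT':
        if len(results) != 1:
            raise ValueError('NOT expects a single nested filter group')
        return not results[0]
    raise ValueError(f'Unknown condition: {condition}')

intro_or_published_2024 = {
    'condition': 'OR',
    'filters': [
        {'key': 'url', 'value': 'https://example.com/intro', 'operator': 'EQ'},
        {
            'condition': 'AND',
            'filters': [
                {'key': 'pub_date', 'value': date(2024, 1, 1), 'operator': 'GT'},
                {'key': 'pub_date', 'value': date(2024, 12, 31), 'operator': 'LT'}
            ]
        }
    ]
}

not_intro = {
    'condition': 'NOT',
    'filters': [
        {'filters': [{'key': 'url', 'value': 'https://example.com/intro'}]}
    ]
}

doc_a = {'url': 'https://example.com/intro', 'pub_date': date(2023, 3, 1)}
doc_b = {'url': 'https://example.com/other', 'pub_date': date(2024, 6, 15)}
doc_c = {'url': 'https://example.com/other', 'pub_date': date(2025, 2, 1)}

or_results = [matches(intro_or_published_2024, d) for d in (doc_a, doc_b, doc_c)]
not_results = [matches(not_intro, d) for d in (doc_a, doc_b)]
```

Note that GT and LT are strict comparisons, so a range built from them excludes both boundary dates; use GTE and LTE if the boundaries should be included.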

The lexical graph supports the following filter operators:

| Operator | Description | Data Types |
| --- | --- | --- |
| EQ | Equals (default operator) | string, int, float, date/datetime |
| GT | Greater than | int, float, date/datetime |
| LT | Less than | int, float, date/datetime |
| NE | Not equal to | string, int, float, date/datetime |
| GTE | Greater than or equal to | int, float, date/datetime |
| LTE | Less than or equal to | int, float, date/datetime |
| TEXT_MATCH | Full text match (allows you to search for a specific substring, token or phrase within the text field) | string |
| TEXT_MATCH_INSENSITIVE | Full text match (case insensitive) | string |
| IS_EMPTY | The field does not exist | |

The following operators are not supported:

| Operator | Description | Data Types |
| --- | --- | --- |
| IN | In array | string or number |
| NIN | Not in array | string or number |
| ANY | Contains any | array of strings |
| ALL | Contains all | array of strings |
| CONTAINS | Metadata array contains value (string or number) | |

Metadata filtering supports filtering by date and datetime values. There are two ways in which you can ensure datetime filtering is applied during indexing and querying:

  • Supply Python date or datetime objects in the metadata fields attached to source documents, and in the metadata filters applied when querying.
  • Indicate that a field is to be treated as a datetime value by suffixing the field name with _date or _datetime. You can then supply either date or datetime objects, or string representations of dates and datetime values, when indexing and querying.

In the build stage, Python date and datetime metadata values are converted to ISO-formatted datetime values before being persisted to the graph and vector stores. During querying, Python date and datetime metadata values are similarly converted to ISO-formatted datetime values before being applied in a filter. Python date and datetime objects explicitly communicate that a value should be treated as a date or datetime. With this approach, you do not need to add a _date or _datetime suffix to a metadata field name. However, you must ensure that date and/or datetime objects are used during both indexing and querying: if either stage receives a string representation of a date or datetime, filtering may not work as intended.

Metadata fields that end with _date or _datetime are converted to ISO-formatted datetime values before being persisted to the graph and vector stores. Similarly, the values of metadata filters whose keys end with _date or _datetime are converted to ISO-formatted datetime values before being evaluated.
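
The two approaches can be compared with a small sketch. The normalize_value function below is a hypothetical stand-in written for this example, and it assumes string values are already in ISO format; the toolkit's own conversion may accept other formats and may differ in detail.

```python
from datetime import date, datetime

DATE_SUFFIXES = ('_date', '_datetime')

def normalize_value(key, value):
    """Return an ISO-formatted datetime string for date-like values,
    and any other value unchanged (hypothetical helper)."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, date):
        # Promote a plain date to midnight on that day
        return datetime(value.year, value.month, value.day).isoformat()
    if isinstance(value, str) and key.endswith(DATE_SUFFIXES):
        # Assumes an ISO-formatted string such as '2024-01-01'
        return datetime.fromisoformat(value).isoformat()
    return value

# Indexed with a date object, but filtered with a string and no suffix:
indexed = normalize_value('published', date(2024, 1, 1))
filter_as_string = normalize_value('published', '2024-01-01')

# With a _date suffix, the string form is converted as well:
suffixed = normalize_value('published_date', '2024-01-01')
```

In this sketch the indexed value and the unsuffixed string filter value end up in different forms, which illustrates why mixing date objects and strings on a field without a suffix can cause a filter to miss.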

Using metadata to filter documents in the extract and build stages

Besides using metadata filtering to constrain the retrieval process, you can also use it to filter documents during the extract and build stages of the indexing process.

Using metadata filtering in the extract stage

You can filter the documents that pass through the extract stage by supplying filter criteria to the extraction_filters of an ExtractionConfig object. extraction_filters accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.

The following example shows how to filter source documents so that only documents with an email metadata field containing an amazon.com email address proceed through the extraction pipeline. All other source documents will be discarded.

from graphrag_toolkit.lexical_graph import LexicalGraphIndex, ExtractionConfig
from llama_index.core.vector_stores.types import FilterOperator, MetadataFilter

graph_index = LexicalGraphIndex(
    graph_store,
    vector_store,
    indexing_config=ExtractionConfig(
        extraction_filters=MetadataFilter(
            key='email',
            value='amazon.com',
            operator=FilterOperator.TEXT_MATCH
        )
    )
)

Use extraction stage metadata filtering if you only want to extract a lexical graph from a subset of documents, but can’t control which documents are submitted to the ingestion process.

Using metadata filtering in the build stage

You can filter the documents that are used to build a lexical graph by supplying filter criteria in the source_filters property of a BuildFilters object, and passing that object to a BuildConfig object. source_filters accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.

The following example shows how to filter extracted documents so that only documents whose url metadata field contains https://docs.aws.amazon.com/neptune/ will proceed through the build pipeline. All other extracted documents will be ignored. The resulting lexical graph is assigned to the neptune tenant.

from graphrag_toolkit.lexical_graph import LexicalGraphIndex, BuildConfig
from graphrag_toolkit.lexical_graph.indexing.build import BuildFilters
from llama_index.core.vector_stores.types import FilterOperator, MetadataFilter

graph_index = LexicalGraphIndex(
    graph_store,
    vector_store,
    indexing_config=BuildConfig(
        build_filters=BuildFilters(
            source_filters=MetadataFilter(
                key='url',
                value='https://docs.aws.amazon.com/neptune/',
                operator=FilterOperator.TEXT_MATCH
            )
        )
    ),
    tenant_id='neptune'
)

Build-stage metadata filtering works well in an extract-once, build-many-times workload. You can extract the entire corpus to an S3BasedDocs sink or FileBasedDocs sink (see Run the extract and build stages separately), and then build multiple lexical graphs from the extracted documents. Using different sets of filtering criteria and the multi-tenancy feature, you can build multiple, discrete lexical graphs with different contents from the same underlying sources.

The metadata associated with a source document comprises part of that document’s identity. A source document’s id is a function of the contents of the document and the metadata. Chunk, topic and statement ids are in turn a function of the source id. If you change a source document’s metadata (adding or removing fields, or changing field values), and reprocess the document, it will be indexed into new source, chunk, topic and statement nodes in the lexical graph.
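
The consequence of deriving ids from both content and metadata can be shown with a short sketch. The hashing scheme below is an assumption made for illustration; the source_id and chunk_id functions are hypothetical and the toolkit's actual id format is not shown here.

```python
import hashlib
import json

def source_id(text, metadata):
    """Derive a deterministic id from document text plus metadata."""
    payload = json.dumps(
        {'text': text, 'metadata': metadata},
        sort_keys=True,   # field order must not affect the id
        default=str       # render dates and similar values as strings
    )
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()[:16]

def chunk_id(src_id, chunk_text):
    """Derive a chunk id from its parent source id and the chunk text."""
    return hashlib.sha256(f'{src_id}:{chunk_text}'.encode('utf-8')).hexdigest()[:16]

text = 'Neptune is a graph database.'

id_v1 = source_id(text, {'title': 'Intro', 'author': 'A'})
id_v1_again = source_id(text, {'author': 'A', 'title': 'Intro'})
id_v2 = source_id(text, {'title': 'Intro', 'author': 'A', 'reviewed': 'yes'})

chunk_v1 = chunk_id(id_v1, text)
chunk_v2 = chunk_id(id_v2, text)
```

The text is identical in both versions, yet adding the reviewed field yields a different source id and therefore different downstream ids, which is why reprocessing a document with changed metadata produces new nodes.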

Metadata filtering constrains retrieval to one or more subgraphs within a particular lexical graph. Multi-tenancy creates wholly separate lexical graphs within the same underlying graph and vector stores. Metadata filtering and multi-tenancy work well together. As described above, you can use metadata filtering to build different tenant graphs from the same extracted corpus. You can also use metadata filtering and multi-tenancy when querying. The following example applies metadata filtering to a query in the context of the neptune tenant’s lexical graph:

from graphrag_toolkit.lexical_graph import LexicalGraphQueryEngine
from graphrag_toolkit.lexical_graph.metadata import FilterConfig
from llama_index.core.vector_stores.types import FilterOperator, MetadataFilter

query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    filter_config=FilterConfig(
        MetadataFilter(
            key='url',
            value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
            operator=FilterOperator.EQ
        )
    ),
    tenant_id='neptune'
)