Metadata Filtering

Topics

Overview
Adding metadata when indexing
Using metadata to filter queries
Dates and datetimes
Using metadata to filter documents in the extract and build stages
- Using metadata filtering in the extract stage
- Using metadata filtering in the build stage
Metadata and document identity
Metadata filtering and multi-tenancy

Overview

Metadata filtering allows you to retrieve a constrained set of sources, topics and statements based on metadata filters and associated values when querying a lexical graph.

Metadata is any data added to the metadata dictionary of a source document. Depending on the source document, examples of metadata may include title, url, filepath, date published, and author. A source document’s metadata is then associated with any chunks, topics and statements extracted from that document.

There are two parts to metadata filtering:

Indexing Add metadata to source documents passed to the indexing process
Querying Supply metadata filters when querying a lexical graph

You can also use metadata filtering to filter documents and chunks during the extract and build stages of the indexing process.

Adding metadata when indexing

The effectiveness of metadata filtering during querying is dependent on the quality of the metadata attached to source documents during ingestion. Different loaders have different mechanisms for adding metadata to ingested documents. Here are some examples.

Metadata and versioned updates

The lexical graphs supports versioned updates. With versioned updates, if you re-ingest a document whose contents and/or metadata have changed since it was last extracted, any old documents will be archived, and the newly ingested document treated as the current version of the source document.

Versioned updates uses a concept of version-independent metadata fields to represent a documents’ stable (i.e. version-independent) identify. When you index a document, you can specify which of that document’s metadata fields represent its stable identify. For example, if a document has title, author and last_updated metadata fields, you might specify that a combination of the title and author metadata fields represent that document’s stable identify. When the document is indexed, any previously indexed, non-versioned documents whose title and author field values match those of the newly ingested document will be archived.

When choosing which metadata to add to each source document that you ingest, bear in mind this use of metadata for versioning updates. Try to ensure that one of the fields, or a combination of multiple field values, constitute a stable identity.

Adding metadata to web pages

The LlamaIndex SimpleWebPageReader accepts a function that takes a url and returns a metadata dictionary. The following example populates the metadata dictionary with the url and the date on which the page was accessed.

from datetime import date
from llama_index.readers.web import SimpleWebPageReader

doc_urls = [
    'https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/what-is-neptune-analytics.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-features.html',
    'https://docs.aws.amazon.com/neptune-analytics/latest/userguide/neptune-analytics-vs-neptune-database.html'
]

def web_page_metadata(url):
    return {
        'url': url,
        'last_accessed_date': date.today()
    }

docs = SimpleWebPageReader(
    html_to_text=True,
    metadata_fn=web_page_metadata
).load_data(doc_urls)

Adding metadata to JSON documents

The JSONArrayReader allows you to split a JSON array document into separate documents, one per element in the array, and extract metadata from each sub-document. The following example splits a JSON source document containing news articles into separate documents, one per article. The get_text() and get_metadata() functions extract each article’s body text and associated metadata.

from graphrag_toolkit.lexical_graph.indexing.load import JSONArrayReader

def get_text(data):
    return data.get('body', '')

def get_metadata(data):
  return {
    field : data[field]
    for field in ['title', 'author', 'source', 'published_date']
    if field in data
  }

docs = JSONArrayReader(
    text_fn=get_text,
    metadata_fn=get_metadata
).load_data('./articles.json')

Adding metadata to PDF documents

The following example shows one way of loading PDF documents and attaching metadata to each document.

from pathlib import Path
from pypdf import PdfReader
from llama_index.core.schema import Document

def get_pdf_docs(pdf_dir):

    pdf_dir_path = Path(pdf_dir)

    file_paths = [
        file_path for file_path in pdf_dir_path.iterdir()
        if file_path.is_file()
    ]

    for pdf_path in file_paths:
        reader = PdfReader(pdf_path)
        for page_num, page_content in enumerate(reader.pages):
            doc = Document(
                text=page_content.extract_text(),
                metadata={
                    'filename': pdf_path.name,
                    'page_num': page_num
                }
            )
            yield doc

docs = get_pdf_docs('./pdfs')

Restrictions

Metadata field values may comprise string, int, float, date and datetime single values. Lists, arrays, sets and nested dictionaries are not supported.

Using metadata to filter queries

The lexical graph uses the LlamaIndex vector store types MetadataFilters, MetadataFilter, FilterOperator, and FilterCondition to specify filter criteria. You supply these to a query engine in a FilterConfig object. The following example configures a traversal-based retriever to filter the lexical graph based on the url of source documents:

from graphrag_toolkit.lexical_graph import LexicalGraphQueryEngine
from graphrag_toolkit.lexical_graph.metadata import FilterConfig
from llama_index.core.vector_stores.types import FilterOperator, MetadataFilter

query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    filter_config = FilterConfig(
        MetadataFilter(
            key='url',
            value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
            operator=FilterOperator.EQ
        )
    )
)

How are metadata filters applied?

Metadata filters that you supply to a query engine are applied at two points in the retrieval process:

The filters are applied to all vector store top-k queries. The vector store is typically used to find starting points for graph traversals: filters therefore effectively constrain a retriever’s entry points into the graph.
The filters are subsequently applied to all the results returned from the graph.

By its very nature, a graph can often connect disparate sources: traversals can hop from topics and statements belonging to one source, to topics and statements associated with an entirely different source. It’s not sufficient, therefore, to simply limit the starting points for a traversal; the retriever must also filter the results. The benefit of the dual application of a metadata filter is that it restricts the semantic similarity-based lookups that provide the start points of a query to a well-defined set of sources, but then allows the query to access structurally relevant but semantically dissimilar parts of the lexical graph, some of which may be allowed by the filter, some disallowed, before finally constraining the results to only those elements that pass the filter criteria.

Complex and nested filter expressions

The constructor of the FilterConfig object accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.

A MetadataFilters object can hold a collection of MetadataFilter objects as well as other, nested MetadataFilters objects. Elements in a MetadataFilters object’s filters collection are chained to form complex conditions using either a FilterCondition.AND or FilterCondition.OR condition.

MetadataFilters also supports a third condition: FilterCondition.NOT. If you use the FilterCondition.NOT condition with a MetadataFilters object, the filters collection of that object must contain a single nested MetadataFilters object.

The following example shows the use of a nested MetadataFilters object to express a complex condition: either the source must be from https://docs.aws.amazon.com/neptune/latest/userguide/intro.html, OR its publication date must fall between 2024-01-01 and 2024-12-31:

FilterConfig(
    MetadataFilters(
        filters=[
            MetadataFilter(
                key='url',
                value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
                operator=FilterOperator.EQ
            ),
            MetadataFilters(
                filters=[
                    MetadataFilter(
                        key='pub_date',
                        value='2024-01-01',
                        operator=FilterOperator.GT
                    ),
                    MetadataFilter(
                        key='pub_date',
                        value='2024-12-31',
                        operator=FilterOperator.LT
                    )
                ],
                condition=FilterCondition.AND
            )
        ],
        condition=FilterCondition.OR
    )
)

The following example shows the use of a nested MetadataFilters object with a FilterCondition.NOT condition. Even though there is only one MetadataFilter that is being negated here, it must be nested inside a MetadataFilters object.

FilterConfig(
    MetadataFilters(
        filters=[
            MetadataFilters(
                filters=[
                    MetadataFilter(
                        key='url',
                        value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
                        operator=FilterOperator.EQ
                    )
                ]
            )
        ],
        condition=FilterCondition.NOT
    )
)

Supported filter operators

The lexical graph supports the following filter operators:

Operator	Description	Data Types
`EQ`	Equals – default operator	string, int, float, date/datetime
`GT`	Greater than	int, float, date/datetime
`LT`	Less than	int, float, date/datetime
`NE`	Not equal to	string, int, float, date/datetime
`GTE`	Greater than or equal to	int, float, date/datetime
`LTE`	Less than or equal to	int, float, date/datetime
`TEXT_MATCH`	Full text match (allows you to search for a specific substring, token or phrase within the text field)	string
`TEXT_MATCH_INSENSITIVE`	Full text match (case insensitive)	string
`IS_EMPTY`	The field does not exist

The following operators are not supported:

Operator	Description	Data Types
`IN`	In array	string or number
`NIN`	Not in array	string or number
`ANY`	Contains any	array of strings
`ALL`	Contains all	array of strings
`CONTAINS`	Metadata array contains value (string or number)

Dates and datetimes

Matadata filtering supports filtering by date and datetime values. There are two ways in which you can ensure datetime filtering is applied during indexing and querying:

Supply Python date or datetime objects in the metadata fields attached to source documents, and in the metadata filters applied when querying.
Indicate that a field is to be treated as a datetime value by suffixing the field name with _date or _datetime. You can then supply either date or datetime objects, or string representations of dates and datetime values, when indexing and querying.

In the build stage, Python date and datetime metadata values are converted to ISO-formatted datetime values before being persisted to the graph and vector stores. During querying, Python date and datetime metadata values are similarly converted to ISO-formatted datetime values before being applied in a filter. date and datetime Pyton objects explictly communicate that a value should be treated as a date or datetime. With this approach, you do not need to add a _date or _datetime suffix to a metadata field name. However, you must ensure that date and/or datetime objects are used both during indexing and querying: if one or other of these stages receives a string representation of a date or datetime, filtering may not work as intended.

Metadata fields that end with _date or _datetime are converted to ISO-formatted datetime values before being persisted to the graph and vector stores. Similarly, the values of metadata filters whose keys end with _date or _datetime are converted to ISO-formatted datetime values before being evaluated.

Using metadata to filter documents in the extract and build stages

Besides using metadata filtering to constrain the retrieval process, you can also use it to filter documents during the extract and build stages of the indexing process.

Using metadata filtering in the extract stage

You can filter the documents that pass through the extract stage by supplying filter criteria to the extraction_filters of an ExtractionConfig object. extraction_filters accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.

The following example shows how to filter source documents so that only documents with an email metadata field containing an amazon.com email address proceeed through the extraction pipeline. All other source documents will be discarded.

from graphrag_toolkit.lexical_graph import LexicalGraphIndex, ExtractionConfig
from llama_index.core.vector_stores.types import FilterOperator, MetadataFilter

graph_index = LexicalGraphIndex(
    graph_store,
    vector_store,
    indexing_config=ExtractionConfig(
        extraction_filters=MetadataFilter(
            key='email',
            value='amazon.com',
            operator=FilterOperator.TEXT_MATCH
        )
    )
)

Use extraction stage metadata filtering if you only want to extract a lexical graph from a subset of documents, but can’t control which documents are submitted to the ingestion process.

Using metadata filtering in the build stage

You can filter the documents that are used to build a lexical graph by supplying a BuildFilters object whose source_filters property contains filter criteria to a BuildConfig object. source_filters accepts either a MetadataFilters object, a single MetadataFilter or a list of MetadataFilter objects.

The following example shows how to filter extracted documents so that only documents whose url metadata field contains https://docs.aws.amazon.com/neptune/ will proceed through the build pipeline. All other extracted documents will be ignored. The resulting lexical graph is assigned to the neptune tenant.

from graphrag_toolkit.lexical_graph import LexicalGraphIndex, BuildConfig
from graphrag_toolkit.lexical_graph.indexing.build import BuildFilters
from llama_index.core.vector_stores.types import FilterOperator, MetadataFilter

graph_index = LexicalGraphIndex(
    graph_store,
    vector_store,
    indexing_config=BuildConfig(
        build_filters=BuildFilters(
            source_filters=MetadataFilter(
                key='url',
                value='https://docs.aws.amazon.com/neptune/',
                operator=FilterOperator.TEXT_MATCH
            )
        )
    ),
    tenant_id='neptune'
)

Build-stage metadata filtering works well in an extract-once, build-many-times workload. You can extract the entire corpus to an S3BasedDocs sink or FileBasedDocs sink (see Run the extract and build stages separately), and then build multiple lexical graphs from the extracted documents. Using different sets of filtering criteria and the multi-tenancy feature, you can build multiple, discrete lexical graphs with different contents from the same underlying sources.

Metadata and document identity

The metadata associated with a source document comprises part of that document’s identity. A source document’s id is a function of the contents of the document and the metadata. Chunk, topic and statement ids are in turn a function of the source id. If you change a source document’s metadata (adding or removing fields, or changing field values), and reprocess the document, it will be indexed into new source, chunk, topic and statement nodes in the lexical graph.

Metadata filtering and multi-tenancy

Metadata filtering constrains retrieval to one or more subgraphs within a particular lexical graph. Multi tenancy creates wholly separate lexical graphs within the same underlying graph and vector stores. Metadata filtering and multi-tenancy work well together. As described above, you can use metadata filtering to build different tenant graphs from the same extracted corpus. You can also use metadata filtering and multi tenancy when querying. The following example applies metadata filtering to a query in the context of the neptune tenant’s lexical graph:

from graphrag_toolkit.lexical_graph import LexicalGraphQueryEngine
from graphrag_toolkit.lexical_graph.metadata import FilterConfig
from llama_index.core.vector_stores.types import FilterOperator, MetadataFilter

query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    filter_config = FilterConfig(
        MetadataFilter(
            key='url',
            value='https://docs.aws.amazon.com/neptune/latest/userguide/intro.html',
            operator=FilterOperator.EQ
        )
    ),
  tenant_id='neptune'
)