Traversal-Based Search Configuration
Topics
- Overview
- Search results configuration
- Retriever selection
- Reranking strategy
- Graph and vector search parameters
- Entity network context selection
Overview
You can use the traversal-based search configuration options to customize traversal-based search operations to better suit your specific application, dataset, and query types. The following configuration options are available to help you optimize search performance:
- Search results configuration: Adjust the number of search results and statements returned, and set scoring thresholds to filter out low-quality statements and results
- Retriever selection: Specify which retrievers to use when fetching information
- Reranking strategy: Modify how statements and results are reranked and sorted
- Graph and vector search parameters: Customize parameters that control graph queries and vector searches
- Entity network context selection: Configure parameters used to select entity network contexts
These options allow you to fine-tune your search behavior based on your specific requirements and improve the relevance of returned results.
Search results configuration
When configuring search functionality, you can use the following parameters to control the number and quality of returned results:
max_search_results
Defines the maximum number of search results to return. Each search result contains one or more statements that belong to the same topic (and source). If you set this to None, all matching search results will be returned. The default value is 10.
max_statements_per_topic
Controls how many statements can be included with a single topic, effectively limiting the size of each search result. If set to None, all statements belonging to the topic that match the search will be included in the result. The default value is 10.
max_statements
Limits the total number of statements across the entire result set. If you set this to None, all statements from all results will be returned. The default value is 100.
statement_pruning_factor
This parameter helps filter out lower-quality statements based on a percentage of the highest statement score in the entire set of results. Any statement with a score less than <maximum_statement_score> * statement_pruning_factor will be removed from the results. The default value is 0.05 (5% of the maximum score).
statement_pruning_threshold
Sets an absolute minimum score threshold for statements. Any statement with a score lower than this threshold will be removed from the results. The default value is None.
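The two pruning rules described above can be sketched in a few lines of plain Python. This is an illustration only, not the toolkit's implementation: the function name `prune_statements` and the `(text, score)` tuple layout are assumptions made for the example.

```python
def prune_statements(scored, pruning_factor=0.05, pruning_threshold=None):
    """Drop statements below the relative or absolute cut-off.

    `scored` is a list of (statement_text, score) pairs (hypothetical layout).
    """
    if not scored:
        return []
    # Relative cut-off: a fraction of the best score in the whole result set.
    max_score = max(score for _, score in scored)
    relative_cutoff = max_score * pruning_factor
    kept = []
    for text, score in scored:
        if score < relative_cutoff:
            continue
        # Absolute cut-off: only applied when a threshold is configured.
        if pruning_threshold is not None and score < pruning_threshold:
            continue
        kept.append((text, score))
    return kept


statements = [("A", 0.90), ("B", 0.30), ("C", 0.04), ("D", 0.15)]
default_pruned = prune_statements(statements)
strict_pruned = prune_statements(statements, pruning_threshold=0.2)
```

With the default factor, only statement C falls below 5% of the top score. Adding an absolute threshold of 0.2 also removes D.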
Example
query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    statement_pruning_threshold=0.2
)

When to use search results configuration
The max_search_results, max_statements_per_topic and max_statements parameters allow you to control the overall size of the results.
Each search result comprises one or more statements belonging to a single topic from a single source. Statements from the same source but different topics appear as separate search results. Increasing max_search_results increases the variety of sources in your results. Increasing max_statements_per_topic adds more detail to each individual search result.
When increasing the number of statements (either overall or per topic), you should consider increasing the statement pruning parameters as well. This helps ensure that even with larger result sets, you’re still getting highly relevant statements rather than less relevant information.
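To make the interaction of the three size limits concrete, here is a minimal sketch. It assumes results and statements are already sorted best-first, and that the limits are applied in the order shown. Both are assumptions for illustration, as are the function name `limit_results` and the dictionary layout; the toolkit may apply the limits differently.

```python
def limit_results(results, max_search_results=10,
                  max_statements_per_topic=10, max_statements=100):
    """Trim a list of search results; a limit of None disables that limit.

    Each result is a dict with "source", "topic" and "statements" keys
    (a hypothetical layout chosen for this example).
    """
    limited = []
    remaining = max_statements  # overall statement budget, None = unlimited
    for result in results[:max_search_results]:
        statements = result["statements"][:max_statements_per_topic]
        if remaining is not None:
            statements = statements[:remaining]
            remaining -= len(statements)
        if statements:
            limited.append({**result, "statements": statements})
    return limited


results = [
    {"source": "doc-1", "topic": "pricing", "statements": ["s1", "s2", "s3"]},
    {"source": "doc-1", "topic": "support", "statements": ["s4", "s5"]},
    {"source": "doc-2", "topic": "pricing", "statements": ["s6"]},
]
trimmed = limit_results(results, max_search_results=2,
                        max_statements_per_topic=2, max_statements=3)
unlimited = limit_results(results, None, None, None)
```

In this example the third result is dropped by max_search_results, the first result is cut to two statements by max_statements_per_topic, and the second is cut to one statement by the overall max_statements budget.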
Retriever selection
You can use the retrievers parameter to configure traversal-based search with up to three different retrievers.
retrievers
Accepts an array of retriever class names. Choose from:
- ChunkBasedSearch: This retriever uses a vector similarity search to find information that is similar to the original query. The retriever first finds relevant chunks using vector similarity search. From these chunks, the retriever traverses topics, statements, and facts. Chunk-based search tends to return a narrowly-scoped set of results based on the statement and fact neighbourhoods of chunks that match the original query.
- EntityBasedSearch: This retriever uses as its starting points the entities in an entity network context. From these entities, the retriever traverses facts, statements and topics. Entity-based search tends to return a broadly-scoped set of results, based on the neighbourhoods of individual entities and the facts that connect entities.
- EntityNetworkSearch: This retriever uses textual transcriptions of an entity network context to drive vector searches for information that is dissimilar to the original query but nonetheless structurally relevant for creating an accurate and full response. These vector searches return chunks that are similar to ‘something different from the question being asked’. From these chunks, the retriever traverses topics, statements, and facts to explore the structurally relevant space of dissimilar content.
Example
from graphrag_toolkit.lexical_graph.retrieval.retrievers import *

query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    retrievers=[ChunkBasedSearch, EntityBasedSearch]
)

When to use different retrievers
By default, traversal-based search is configured to use a combination of ChunkBasedSearch and EntityNetworkSearch. This combination provides access to content that is both directly similar to the question and content that may be relevant but not explicitly mentioned in the query.
Consider using the ChunkBasedSearch retriever by itself if:
- Your queries need primarily similarity-based search
- You want to focus on individual relevant statements rather than entire chunks
- You need broader search scope than traditional vector search
This retriever uses local connectivity to find relevant statements in other chunks from the same source, expanding beyond basic vector similarity.
The EntityBasedSearch and EntityNetworkSearch retrievers provide different ways of utilising entity networks in a search:
- The EntityBasedSearch retriever uses global connectivity to find statements from different sources connected by the same facts. It often produces more diverse results than other retrievers.
- The EntityNetworkSearch retriever converts an entity network (retrieved through graph traversal) into a set of similarity searches. This approach balances global and local connectivity.
Reranking strategy
Traversal-based search incorporates reranking at two key points during the retrieval process:
- When generating entity network contexts, both entities and entity networks are reranked
- Before finalizing search results, the complete set of statements undergoes reranking
Reranking is managed through a single parameter:
reranker
Parameter options:
- model: Uses a LlamaIndex-based SentenceReranker to rerank all statements in the result set
- tfidf (default): Applies a term frequency-inverse document frequency measure to rank statements
- None: Disables the reranking feature completely
The tfidf-based option is significantly faster than the model-based approach. To use the model reranker, you must first install the following additional dependencies:
    pip install torch sentence_transformers

Example
query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    reranker='model'
)

Choosing a reranker strategy
The tfidf reranker option provides a fast, cost-effective, and generally effective solution for most use cases. However, if you find that the results don’t meet your requirements, consider switching to the model reranker. Be aware that while model may provide different results, it operates significantly slower than tfidf and doesn’t guarantee improved outcomes.
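For intuition about why a tfidf-style reranker is cheap, the following sketch ranks statements against a query using only the standard library. It is a simplified stand-in, not the toolkit's reranker: the tokenizer, the smoothed IDF formula, and the function names are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; a deliberately simple tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, statements):
    """Return (statement, score) pairs sorted by TF-IDF overlap with the query."""
    docs = [Counter(tokenize(s)) for s in statements]
    n_docs = len(docs)
    query_terms = set(tokenize(query))
    scored = []
    for statement, counts in zip(statements, docs):
        total = sum(counts.values()) or 1
        score = 0.0
        for term in query_terms:
            if term not in counts:
                continue
            # Document frequency: how many statements contain the term.
            df = sum(1 for d in docs if term in d)
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            score += (counts[term] / total) * idf
        scored.append((statement, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


statements = [
    "Neptune is a graph database service",
    "The invoice was paid in March",
    "A graph database stores nodes and edges",
]
ranked = tfidf_rank("graph database", statements)
```

The scoring needs only token counts, so it runs without loading a model. A model-based reranker instead scores each query and statement pair with a neural network, which is why it is slower.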
Troubleshooting reranking results
An effective reranking strategy should ensure that only highly relevant statements appear in your final results. For reranking to work properly, the relevant statements must first be captured by your retrievers before the reranking process begins.
If your search results don’t include content you expect to see, verify whether this content is present in the pre-ranked results by:
- Disabling the reranker by setting reranker=None
- Increasing the size limits in your search results configuration, such as max_search_results, max_statements_per_topic, and max_statements
After making these adjustments, review the results returned by the retrieve() operation. If the expected content still doesn’t appear, the issue isn’t related to reranking. Instead, consider other tuning approaches described elsewhere in the documentation, such as:
- Changing your retriever configuration
- Adjusting pruning thresholds
- Configuring entity network contexts
Graph and vector search parameters
These settings govern how the system queries both the graph and vector stores. When a user submits a query, multiple searches run across both stores, with some executing in parallel. The vector store returns the most similar items based on a top K approach. Results can be diversified across different sources. Graph store queries return statement sets, grouped by their source. Graph queries use a two-phase process: initial statement identification followed by connection exploration.
intermediate_limit
Controls how many statements are identified in the first phase of a graph query, before exploring their connections (both local and global). The default value is 50.
query_limit
Defines how many results each graph query returns. Each result consists of statements from a single source. The default value is 10.
vss_top_k
Specifies how many top matching results are used to begin similarity-based traversals. The default value is 10.
vss_diversity_factor
Ensures results come from a diverse range of sources. Queries to a vector store retrieve (vss_top_k × vss_diversity_factor) initial matches, and then iteratively select the most relevant result from previously unused sources. This process continues until reaching vss_top_k total results. If set to None, simply returns the first vss_top_k matches. The default value is 5.
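The selection procedure described above can be sketched as follows. This is one possible reading of it, not the toolkit's code: the function name `select_diverse`, the `(item_id, source_id, score)` tuple layout, and the choice to start a new pass once every source has been used are assumptions.

```python
def select_diverse(hits, top_k=10, diversity_factor=5):
    """Pick up to top_k hits, preferring sources not yet used in the current pass.

    `hits` is a list of (item_id, source_id, score) tuples (hypothetical layout).
    """
    ranked = sorted(hits, key=lambda h: h[2], reverse=True)
    if diversity_factor is None:
        # No diversification: simply keep the best top_k matches.
        return ranked[:top_k]
    # Stand-in for asking the vector store for top_k * factor matches.
    candidates = ranked[: top_k * diversity_factor]
    selected = []
    while candidates and len(selected) < top_k:
        seen_sources = set()
        leftovers = []
        for hit in candidates:
            if len(selected) < top_k and hit[1] not in seen_sources:
                selected.append(hit)
                seen_sources.add(hit[1])
            else:
                leftovers.append(hit)
        candidates = leftovers  # next pass may reuse sources
    return selected


hits = [
    ("c1", "doc-A", 0.95), ("c2", "doc-A", 0.93), ("c3", "doc-A", 0.90),
    ("c4", "doc-B", 0.80), ("c5", "doc-C", 0.70), ("c6", "doc-B", 0.60),
]
diverse = select_diverse(hits, top_k=3, diversity_factor=2)
plain = select_diverse(hits, top_k=3, diversity_factor=None)
```

Without diversification, all three results come from doc-A. With a factor of 2, the selection reaches further down the ranking and returns one chunk from each of three sources.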
num_workers
Sets the number of threads available for running graph queries in parallel. The default value is 10.
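As a rough illustration of what a worker count controls, the sketch below fans a batch of queries out over a thread pool from the standard library. The helper `run_graph_queries` and the stand-in `fake_graph_query` are hypothetical; the toolkit's internal scheduling is not shown in this document.

```python
from concurrent.futures import ThreadPoolExecutor


def run_graph_queries(queries, run_query, num_workers=10):
    """Run `run_query` for each query on at most `num_workers` threads."""
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(run_query, queries))


def fake_graph_query(start_id):
    # Stand-in for a real graph store call.
    return {"start": start_id, "statements": [f"statement about {start_id}"]}


results = run_graph_queries(["e1", "e2", "e3"], fake_graph_query, num_workers=3)
```

Fewer workers means fewer queries in flight at once, and so fewer intermediate result sets held in memory at the same time.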
Example
query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    intermediate_limit=25,
    num_workers=3
)

When to change the graph and vector search parameters
Whereas the search results configuration parameters control the handling of the search results, the graph and vector store configuration parameters control the query processing used to generate the results.
If your queries require finding highly diverse content from across multiple sources, increase the vss_diversity_factor. If your queries require content that derives directly from primary sources, reduce vss_diversity_factor, or set it to None.
If you experience out-of-memory issues while running user queries, reduce intermediate_limit and num_workers. This reduces the size of the working set for each graph query and the number of graph queries running in parallel.
If your application requires a large number of search results, you should consider increasing the intermediate_limit, query_limit and/or vss_top_k. Note that increasing these parameters can increase query latencies, and require more memory.
Entity network context selection
The system creates focused entity network contexts based on the user’s query terms. These contextual networks guide both retrieval and response generation phases.
Entity network generation
The process for generating entity network contexts is as follows:
1. Initial entity discovery: Match query terms to entities using various search methods: lookup by id, exact match, partial match, full text search, or any other search technique offered by the graph store.
2. Entity prioritization: Sort matched entities by relevance to the query. Calculate the degree centrality of the top entity: this will be used as a benchmark for subsequent filtering.
3. Network expansion: Starting from each root entity node, follow entity-to-entity relationships, expanding to a depth of 2-3 levels.
4. Network pruning: Apply filtering based on degree centrality thresholds derived from the benchmark created in step 2. Remove entities above and below these thresholds along each path.
5. Path selection: Rerank all valid paths and select the top N highest-ranking paths. These form the final set of entity network contexts.
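The expansion, pruning, and path selection steps above can be sketched on a toy in-memory graph. Everything here is an assumption for illustration: the adjacency-dict graph, raw neighbour count as degree centrality, the `min_degree` and `max_degree` arguments, stopping a path at an out-of-range entity, and query-term overlap as a stand-in for reranking. The toolkit's actual traversal runs against the graph store.

```python
def entity_paths(graph, roots, max_depth=3, min_degree=0, max_degree=float("inf")):
    """Enumerate simple paths of at most `max_depth` entities from each root."""
    paths = []

    def extend(path):
        paths.append(path)
        if len(path) == max_depth:
            return
        for neighbour in sorted(graph.get(path[-1], ())):
            degree = len(graph.get(neighbour, ()))  # raw degree as centrality
            if neighbour in path or not (min_degree <= degree <= max_degree):
                continue  # skip cycles and entities outside the degree bounds
            extend(path + [neighbour])

    for root in roots:
        extend([root])
    return paths


def query_overlap(path, query):
    # Hypothetical path score: entities whose name appears in the query.
    terms = set(query.lower().split())
    return sum(1 for entity in path if entity.lower() in terms)


def select_contexts(paths, score_path, max_contexts=3):
    """Keep the top-scoring paths as entity network contexts."""
    return sorted(paths, key=score_path, reverse=True)[:max_contexts]


graph = {
    "Acme": {"Widget", "Bob", "Globex"},
    "Widget": {"Acme", "Steel"},
    "Bob": {"Acme"},
    "Globex": {"Acme", "Steel"},
    "Steel": {"Widget", "Globex"},
}
query = "steel suppliers for acme widget"
paths = entity_paths(graph, roots=["Acme"], max_depth=3, min_degree=2)
contexts = select_contexts(paths, lambda p: query_overlap(p, query), max_contexts=2)
```

Here Bob is pruned because its degree falls below the lower bound, and the two paths that mention the most query entities are kept as contexts.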
You can configure entity network generation using the following parameters:
ec_max_depth
Determines the maximum number of entities in each entity network path. The default value is 3.
ec_max_contexts
Limits the number of entity contexts returned by providers. Note: Multiple entity contexts may originate from the same root entity. The default value is 3.
ec_max_score_factor
Filters out entities whose degree centrality exceeds a threshold based on a percentage of the degree centrality of the top entity. The default value is 10 (1000% of the top entity’s score).
ec_min_score_factor
Filters out entities whose degree centrality falls below a threshold based on a percentage of the degree centrality of the top entity. The default value is 0.1 (10% of the top entity’s score).
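As a worked example of the two factors, the sketch below derives the lower and upper degree bounds from the top entity's degree centrality and applies them. The function names, the entity names, and the degree values are invented for illustration.

```python
def degree_bounds(top_entity_degree, min_score_factor=0.1, max_score_factor=10):
    """Return the (lower, upper) degree centrality bounds for filtering."""
    return (top_entity_degree * min_score_factor,
            top_entity_degree * max_score_factor)


def within_bounds(degree, bounds):
    low, high = bounds
    return low <= degree <= high


# Hypothetical degree centralities; "Acme" is the top-ranked entity.
degrees = {"Acme": 40, "Minnow": 2, "Whale": 900, "Peer": 55}
bounds = degree_bounds(degrees["Acme"])
kept = [name for name, degree in degrees.items() if within_bounds(degree, bounds)]
```

With the default factors and a top entity of degree 40, entities with a degree below 4 or above 400 are filtered out, so only Acme and Peer remain.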
Example
query_engine = LexicalGraphQueryEngine.for_traversal_based_search(
    graph_store,
    vector_store,
    ec_max_depth=3,
    ec_max_contexts=3
)

When to adjust entity network generation
The entity network context settings control how extensively the system searches for related content and how it filters results based on entity relationships. Increase the search scope to find structurally relevant but dissimilar content. Reduce the search scope to focus on content similar to the query.
A broad but shallow search – e.g. ec_max_depth=1 and ec_max_contexts=5 – helps explore diverse contexts focused on direct matches to the query.
A deep but narrow search – e.g. ec_max_depth=3 and ec_max_contexts=2 – helps explore distantly related content through key entities.
The ec_max_score_factor and ec_min_score_factor parameters allow you to filter out ‘whales’ and ‘minnows’ in proportion to the significance of the top entity.
ec_max_score_factor controls how prominently high-scoring distant entities appear in the search results. Higher values will include well-connected entities even if they’re distantly related. Increase ec_max_score_factor when you want to see important entities that aren’t directly connected.
ec_min_score_factor controls the inclusion of less significant distant entities. Lower values will result in the inclusion of rarely mentioned entities even if they’re distantly related. Decrease ec_min_score_factor to find niche or uncommon connections.