Customizing Your Evaluation
Stickler gives you fine-grained control over how every field in your structured data is compared. You can tune comparison algorithms, thresholds, and weights at the field level -- whether you define your models in Python or through JSON Schema configuration files.
This guide covers three ways to configure evaluation behavior:
- ComparableField parameters in Python model definitions
- The compare_with() method for running comparisons
- JSON Schema extensions for configuration-driven evaluation
Evaluating a test set?
If you need to evaluate many document pairs (not just one), use BulkStructuredModelEvaluator — it handles streaming aggregation, progress reporting, and metrics export. See the Bulk Evaluation guide.
How Evaluation Works
Stickler evaluates structured data from the inside out. At the innermost layer, Comparators compute a raw similarity score (0.0–1.0) between two primitive values. Each ComparableField wraps a comparator with a threshold and weight — scores below the threshold are clipped to zero, and weights control how much the field matters. StructuredModels aggregate field scores into a weighted average. For list fields, the Hungarian algorithm finds the optimal one-to-one pairing between ground truth and prediction items before scoring.
graph LR
A["Comparator<br/>(raw score 0.0–1.0)"] --> B["ComparableField<br/>(threshold + clip + weight)"]
B --> C["StructuredModel<br/>(weighted average)"]
C --> D["List matching<br/>(Hungarian algorithm)"]
D --> E["Overall Score"]
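The list-matching step can be illustrated with a small sketch. It is not Stickler's implementation: it brute-forces every pairing instead of running the Hungarian algorithm (the optimum is the same, but this only scales to tiny lists), and it uses difflib's SequenceMatcher as a stand-in comparator. The names similarity and best_pairing are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Stand-in comparator: a 0.0-1.0 ratio from difflib, not Stickler's own scoring."""
    return SequenceMatcher(None, a, b).ratio()


def best_pairing(truth: list[str], pred: list[str]) -> list[tuple[int, int]]:
    """Return (truth_index, pred_index) pairs with the highest total similarity.

    Brute force over permutations; assumes len(truth) <= len(pred).
    """
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(pred)), len(truth)):
        total = sum(similarity(truth[i], pred[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    return best_pairs


# Prediction items arrive in a different order and with one typo.
truth = ["widget", "gasket", "sprocket"]
pred = ["sprocket", "widgit", "gasket"]

pairs = best_pairing(truth, pred)
for t_idx, p_idx in pairs:
    print(f"{truth[t_idx]!r} <-> {pred[p_idx]!r}")
```

Pairing before scoring means a reordered list is not penalized for position; only the item-level differences (here, the typo in "widgit") affect the score.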
ComparableField Parameters
When defining a StructuredModel subclass in Python, each field is declared with ComparableField(...). This function accepts comparison parameters that control how that specific field is evaluated.
Parameter Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| comparator | BaseComparator | LevenshteinComparator() | The comparison algorithm to use. See Comparators for the full list. |
| threshold | float (0.0--1.0) | 0.5 | Minimum similarity score required for a field to be classified as a match. |
| weight | float (> 0.0) | 1.0 | Relative importance of this field when computing aggregate scores. |
| clip_under_threshold | bool | True | When True, scores below threshold are zeroed out before contributing to the weighted average. |
| aggregate | bool | False | Deprecated. Previously controlled inclusion in parent-level metrics. All nodes now include an automatic aggregate field in the confusion matrix output. |
How Each Parameter Affects Scoring
comparator -- Determines which algorithm computes the raw similarity score between two field values. Different comparators suit different data types:
- LevenshteinComparator for names and addresses (edit-distance based)
- ExactComparator for IDs and codes (binary match)
- NumericComparator for prices and quantities (tolerance-based)
- FuzzyComparator for descriptions (token-based, order-independent)
See the Comparators section for the complete list and detailed descriptions.
threshold -- Acts as a binary classification cutoff. A similarity score at or above the threshold counts as a True Positive; below it counts as a False Discovery. Choose stricter thresholds (0.9--1.0) for critical fields and looser thresholds (0.5--0.7) for flexible fields.
weight -- Controls how much a field contributes to the overall score. The overall score is computed as:
overall_score = sum(field_score * weight) / sum(weights)
Fields with higher weights pull the overall score toward their individual result.
clip_under_threshold -- When enabled (the default), a field that scores below its threshold contributes 0.0 to the weighted average instead of its partial similarity. This prevents low-confidence matches from inflating the overall score.
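To make the interaction between threshold, weight, and clip_under_threshold concrete, here is a minimal sketch of the arithmetic described above. It does not call Stickler; the helper names and the raw scores are hypothetical, chosen only to show how clipping changes the weighted average.

```python
def field_contribution(raw_score: float, threshold: float, clip: bool = True) -> float:
    """Score a field contributes: zeroed when clipping is on and the raw score is below threshold."""
    if clip and raw_score < threshold:
        return 0.0
    return raw_score


def overall_score(fields: list[dict], clip: bool) -> float:
    """Weighted average: sum(field_score * weight) / sum(weight)."""
    weighted = sum(
        field_contribution(f["raw"], f["threshold"], clip) * f["weight"] for f in fields
    )
    return weighted / sum(f["weight"] for f in fields)


# Hypothetical raw comparator scores for three fields.
fields = [
    {"name": "id", "raw": 1.0, "threshold": 1.0, "weight": 3.0},
    {"name": "name", "raw": 0.6, "threshold": 0.8, "weight": 1.0},
    {"name": "total", "raw": 1.0, "threshold": 0.95, "weight": 2.0},
]

clipped = overall_score(fields, clip=True)     # "name" contributes 0.0
unclipped = overall_score(fields, clip=False)  # "name" contributes 0.6
print(f"clipped={clipped:.3f} unclipped={unclipped:.3f}")
```

With clipping the partial 0.6 similarity is discarded, so the overall score drops from about 0.933 to about 0.833.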
Example Model
from stickler.comparators.levenshtein import LevenshteinComparator
from stickler.comparators.exact import ExactComparator
from stickler.comparators.numeric import NumericComparator
from stickler.comparators.fuzzy import FuzzyComparator
from stickler.structured_object_evaluator.models.comparable_field import ComparableField
from stickler.structured_object_evaluator.models.structured_model import StructuredModel

class Invoice(StructuredModel):
    invoice_id: str = ComparableField(
        comparator=ExactComparator(),  # Must match exactly
        threshold=1.0,
        weight=3.0,  # Highest weight: wrong ID = wrong customer
        clip_under_threshold=True,
    )
    customer_name: str = ComparableField(
        comparator=LevenshteinComparator(),  # Tolerates typos
        threshold=0.8,
        weight=1.5,
    )
    total_amount: float = ComparableField(
        comparator=NumericComparator(),  # Tolerance-based numeric comparison
        threshold=0.95,
        weight=2.5,
    )
    notes: str = ComparableField(
        comparator=FuzzyComparator(),  # Low threshold, minimal weight: cosmetic field
        threshold=0.6,
        weight=0.3,
    )
The compare_with() Method
Once you have two model instances -- a ground truth and a prediction -- call compare_with() to evaluate them.
Basic Usage
result = ground_truth.compare_with(prediction)
print(f"Overall score: {result['overall_score']:.2%}")
print(f"All fields matched: {result['all_fields_matched']}")
for field, score in result['field_scores'].items():
    print(f"  {field}: {score:.3f}")
The default output contains three keys:
- overall_score (float) -- Weighted average of all field scores (0.0 to 1.0).
- field_scores (dict) -- Maps each field name to its similarity score.
- all_fields_matched (bool) -- True when every field meets or exceeds its threshold.
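As a sketch of working with these three keys, the snippet below ranks fields by score to find what dragged a comparison down. The result dict is hand-written to mirror the default output shape (the values are illustrative), and weakest_fields is a hypothetical helper, not part of Stickler.

```python
# Hand-written stand-in for a compare_with() result; the values are illustrative.
result = {
    "overall_score": 0.786,
    "field_scores": {"invoice_id": 1.0, "customer_name": 0.0, "total_amount": 1.0},
    "all_fields_matched": False,
}


def weakest_fields(result: dict, limit: int = 3) -> list[tuple[str, float]]:
    """Return up to `limit` (field, score) pairs, lowest score first."""
    ranked = sorted(result["field_scores"].items(), key=lambda item: item[1])
    return ranked[:limit]


if not result["all_fields_matched"]:
    for name, score in weakest_fields(result, limit=1):
        print(f"Lowest-scoring field: {name} ({score:.3f})")
```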
Key Parameters
| Parameter | Type | Default | What It Enables |
|---|---|---|---|
| include_confusion_matrix | bool | False | Adds a confusion_matrix key with TP/FP/TN/FN/FD/FA counts and derived precision, recall, F1, and accuracy metrics at both the overall and field levels. |
| document_non_matches | bool | False | Adds a non_matches list with details about every field that failed to match, including the field path, non-match type, both values, and a human-readable reason. |
| document_field_comparisons | bool | False | Adds a field_comparisons list documenting every field-level comparison (both matches and non-matches) with expected/actual keys and values, scores, and reasons. |
| add_confidence_metrics | bool | False | Adds an auroc_confidence_metric for evaluating confidence calibration. |
| evaluator_format | bool | False | Restructures the output for bulk evaluation integration. See Evaluator Format below. |
Example with Detailed Metrics
result = ground_truth.compare_with(
prediction,
include_confusion_matrix=True,
document_non_matches=True,
)
# Overall score
print(f"Score: {result['overall_score']:.3f}")
# Confusion matrix totals
cm = result['confusion_matrix']['aggregate']
print(f"Precision: {cm['derived']['cm_precision']:.3f}")
print(f"Recall: {cm['derived']['cm_recall']:.3f}")
print(f"F1: {cm['derived']['cm_f1']:.3f}")
# Inspect non-matches for debugging
for nm in result.get('non_matches', []):
    print(f"  {nm['field_path']}: {nm['non_match_type']} "
          f"(score={nm['similarity_score']:.3f})")
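The derived metrics follow from the raw counts. The sketch below uses the standard precision, recall, F1, and accuracy formulas and assumes that false discoveries (FD) and FA counts are both treated as false positives; that combination rule is an assumption here, not something this guide confirms. The function name and the counts are hypothetical.

```python
def derived_metrics(tp: int, fd: int, fa: int, fn: int, tn: int = 0) -> dict:
    """Precision, recall, F1, and accuracy from raw confusion-matrix counts.

    Assumption: FD and FA are two kinds of false positive, so FP = FD + FA.
    """
    fp = fd + fa
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: five matched fields and one below-threshold field.
metrics = derived_metrics(tp=5, fd=1, fa=0, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```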
JSON Schema Extensions
For configuration-driven evaluation -- where you want to define models and comparison logic without writing Python code -- Stickler supports standard JSON Schema (Draft 7+) with custom x-aws-stickler-* extensions.
Extension Reference
Add these extensions to any property in your JSON Schema to control comparison behavior:
| Extension | Type | Default | Purpose |
|---|---|---|---|
| x-aws-stickler-comparator | string | Type-dependent | Comparison algorithm (e.g., "ExactComparator", "LevenshteinComparator") |
| x-aws-stickler-threshold | number (0.0--1.0) | 0.5 or 1.0 | Match classification cutoff |
| x-aws-stickler-weight | number (> 0.0) | 1.0 | Field importance multiplier |
| x-aws-stickler-clip-under-threshold | boolean | false | Zero out scores below threshold |
| x-aws-stickler-aggregate | boolean | false | Include in parent-level aggregate metrics |
| x-aws-stickler-model-name | string | "DynamicModel" | Name of the generated Python class (root level) |
| x-aws-stickler-match-threshold | number (0.0--1.0) | 0.7 | Model-level matching threshold for Hungarian algorithm (root level) |
Example Schema
{
  "type": "object",
  "x-aws-stickler-model-name": "Invoice",
  "x-aws-stickler-match-threshold": 0.75,
  "properties": {
    "invoice_id": {
      "type": "string",
      "x-aws-stickler-comparator": "ExactComparator",
      "x-aws-stickler-threshold": 1.0,
      "x-aws-stickler-weight": 3.0,
      "x-aws-stickler-clip-under-threshold": true
    },
    "customer_name": {
      "type": "string",
      "x-aws-stickler-comparator": "LevenshteinComparator",
      "x-aws-stickler-threshold": 0.8,
      "x-aws-stickler-weight": 1.5
    },
    "total_amount": {
      "type": "number",
      "x-aws-stickler-comparator": "NumericComparator",
      "x-aws-stickler-threshold": 0.95,
      "x-aws-stickler-weight": 2.5
    }
  },
  "required": ["invoice_id", "customer_name", "total_amount"]
}
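When reviewing a schema like this, it can help to list which extensions each property actually sets. The sketch below does that with the standard json module only. It is an auditing aid, not how Stickler parses schemas; field_settings is a hypothetical helper and the embedded schema is a trimmed variant of the example.

```python
import json

PREFIX = "x-aws-stickler-"

# Trimmed variant of the example schema, embedded so the sketch is self-contained.
schema = json.loads("""
{
  "type": "object",
  "properties": {
    "invoice_id": {
      "type": "string",
      "x-aws-stickler-comparator": "ExactComparator",
      "x-aws-stickler-threshold": 1.0,
      "x-aws-stickler-weight": 3.0
    },
    "customer_name": {
      "type": "string",
      "x-aws-stickler-threshold": 0.8
    }
  }
}
""")


def field_settings(schema: dict) -> dict:
    """Map each property name to its x-aws-stickler-* settings, with the prefix stripped."""
    return {
        name: {
            key[len(PREFIX):]: value
            for key, value in spec.items()
            if key.startswith(PREFIX)
        }
        for name, spec in schema.get("properties", {}).items()
    }


for name, settings in field_settings(schema).items():
    print(name, settings)
```

Any extension missing from a property's output falls back to the defaults in the table above.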
Loading a Schema
from stickler.structured_object_evaluator.models.structured_model import StructuredModel
import json
with open("invoice_schema.json") as f:
    schema = json.load(f)
Invoice = StructuredModel.from_json_schema(schema)
ground_truth = Invoice(**{"invoice_id": "INV-001", "customer_name": "Acme Corp", "total_amount": 1250.00})
prediction = Invoice(**{"invoice_id": "INV-001", "customer_name": "ACME Corporation", "total_amount": 1250.00})
result = ground_truth.compare_with(prediction)
print(f"Overall Score: {result['overall_score']:.3f}")
Sample Output
{
  "field_scores": {
    "invoice_id": 1.0,
    "customer_name": 0.0,
    "total_amount": 1.0
  },
  "overall_score": 0.786,
  "all_fields_matched": false
}
Note that customer_name scores 0.0: "Acme Corp" vs "ACME Corporation" produces a Levenshtein similarity below the 0.8 threshold, so clip_under_threshold zeros it out.
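The numbers in the sample output can be reproduced by hand. The sketch below implements a plain case-sensitive Levenshtein distance and assumes the similarity is normalized as 1 - distance / length of the longer string; Stickler's LevenshteinComparator may normalize differently, so treat the raw value as illustrative. The weights are the ones from the example schema.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


name_score = similarity("Acme Corp", "ACME Corporation")
clipped = name_score if name_score >= 0.8 else 0.0  # below threshold, so zeroed

# Weights from the example schema: invoice_id 3.0, customer_name 1.5, total_amount 2.5.
overall = (1.0 * 3.0 + clipped * 1.5 + 1.0 * 2.5) / (3.0 + 1.5 + 2.5)
print(f"customer_name raw={name_score:.3f} clipped={clipped:.1f} overall={overall:.3f}")
```

Under these assumptions the raw similarity is 0.375 (ten edits over sixteen characters), and the weighted average is 5.5 / 7, which rounds to the 0.786 shown above.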
For the complete reference on all extensions, default comparators by type, and a full production example, see the JSON Schema Extensions section in the project README.
Evaluator Format
When you need output structured for bulk evaluation pipelines, pass evaluator_format=True. This restructures the result into a format optimized for aggregation.
result = ground_truth.compare_with(prediction, evaluator_format=True)
The output changes to:
{
  "overall": {
    "precision": 0.83,
    "recall": 1.0,
    "f1": 0.91,
    "accuracy": 0.83,
    "anls_score": 0.83
  },
  "fields": {
    "invoice_id": { "precision": 1.0, "recall": 1.0, "f1": 1.0, "accuracy": 1.0, "anls_score": 1.0 },
    "customer_name": { "precision": 1.0, "recall": 1.0, "f1": 1.0, "accuracy": 1.0, "anls_score": 0.85 }
  },
  "confusion_matrix": {},
  "non_matches": []
}
Key differences from the default output:
- The top-level key is overall (a dict of metrics), not overall_score (a float).
- Each field in fields contains precision, recall, F1, accuracy, and ANLS score.
- The confusion_matrix and non_matches keys are present but empty in this format.
This format is what BulkStructuredModelEvaluator expects when aggregating results across many documents.
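If you want to inspect evaluator-format results yourself, one option is to flatten each result into rows, one per scope. The sketch below works on a hand-written dict shaped like the output above. It is not what BulkStructuredModelEvaluator does internally; flatten_metrics and doc_id are hypothetical names.

```python
# Hand-written stand-in shaped like the evaluator-format output above.
result = {
    "overall": {"precision": 0.83, "recall": 1.0, "f1": 0.91, "accuracy": 0.83, "anls_score": 0.83},
    "fields": {
        "invoice_id": {"precision": 1.0, "recall": 1.0, "f1": 1.0, "accuracy": 1.0, "anls_score": 1.0},
        "customer_name": {"precision": 1.0, "recall": 1.0, "f1": 1.0, "accuracy": 1.0, "anls_score": 0.85},
    },
    "confusion_matrix": {},
    "non_matches": [],
}


def flatten_metrics(result: dict, doc_id: str) -> list[dict]:
    """Turn one evaluator-format result into rows: one for 'overall', one per field."""
    rows = [{"doc_id": doc_id, "scope": "overall", **result["overall"]}]
    for field, metrics in result["fields"].items():
        rows.append({"doc_id": doc_id, "scope": field, **metrics})
    return rows


rows = flatten_metrics(result, doc_id="doc-001")
for row in rows:
    print(row["doc_id"], row["scope"], row["f1"], row["anls_score"])
```

Rows in this shape can be written out with csv.DictWriter or loaded into a dataframe for ad hoc analysis.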
Next Steps
- Bulk Evaluation -- Evaluate hundreds or thousands of document pairs with memory-efficient streaming.
- Understanding Results -- Learn how to read confusion matrices, derived metrics, and non-match reports.