Getting Started
What is Stickler?
You're building an invoice extraction pipeline. Your AI reads scanned documents and produces structured JSON — invoice IDs, amounts, line items. How accurate is it? Do the errors matter? A wrong total is a billing error. A wrong ID routes a package to the wrong warehouse. A minor typo in a vendor name is cosmetic.
Stickler answers these questions. It compares structured AI output against ground truth using specialized comparators tailored to each data type (exact, numeric, fuzzy, semantic), business-weighted scoring so critical fields count more than cosmetic ones, and Hungarian algorithm matching for lists regardless of order. The result is a single weighted score that reflects real operational impact, not just raw accuracy.
Your First Evaluation in 30 Seconds
# pip install stickler-eval
from typing import List
from stickler import StructuredModel, ComparableField
from stickler.comparators import ExactComparator, NumericComparator, LevenshteinComparator
# Define your models
class LineItem(StructuredModel):
    product: str = ComparableField(comparator=LevenshteinComparator(), weight=1.0)
    quantity: int = ComparableField(weight=0.8)
    price: float = ComparableField(comparator=NumericComparator(tolerance=0.01), weight=1.2)

class Invoice(StructuredModel):
    shipment_id: str = ComparableField(comparator=ExactComparator(), weight=3.0)  # Critical
    amount: float = ComparableField(comparator=NumericComparator(tolerance=0.01), weight=2.0)
    line_items: List[LineItem] = ComparableField(weight=2.0)  # Hungarian matching!

# JSON from your systems (agent output, ground truth, etc.)
ground_truth_json = {
    "shipment_id": "SHP-2024-001",
    "amount": 1247.50,
    "line_items": [
        {"product": "Wireless Mouse", "quantity": 2, "price": 29.99},
        {"product": "USB Cable", "quantity": 5, "price": 12.99}
    ]
}
prediction_json = {
    "shipment_id": "SHP-2024-001",  # Perfect match
    "amount": 1247.48,  # Off by 0.02, outside the 0.01 tolerance
    "line_items": [
        {"product": "USB Cord", "quantity": 5, "price": 12.99},  # Name variation
        {"product": "Wireless Mouse", "quantity": 2, "price": 29.99}  # Reordered
    ]
}
# Construct from JSON and compare
ground_truth = Invoice(**ground_truth_json)
prediction = Invoice(**prediction_json)
result = ground_truth.compare_with(prediction)
print(f"Overall Score: {result['overall_score']:.3f}") # 0.693
print(f"Shipment ID: {result['field_scores']['shipment_id']:.3f}") # 1.000 - exact match
print(f"Line Items: {result['field_scores']['line_items']:.3f}") # 0.926 - Hungarian optimal matching
Sample Output
Console output:
Overall Score: 0.693
Shipment ID: 1.000
Line Items: 0.926
Full result dictionary:
{
    "field_scores": {
        "shipment_id": 1.0,
        "amount": 0.0,
        "line_items": 0.926
    },
    "overall_score": 0.693,
    "all_fields_matched": false
}
The amount field scores 0.0 for two reasons. First, the difference between 1247.50 and 1247.48 is 0.02, which exceeds the absolute tolerance of 0.01 set on the field's NumericComparator. Second, the resulting score falls below the default threshold, and the default clip_under_threshold behavior zeros out any score under that threshold.
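The overall score can be checked by hand from the numbers in the sample output. The sketch below assumes the overall score is a plain weighted average of the per-field scores, using the weights declared on Invoice. The helper name weighted_overall is made up for this illustration and is not part of Stickler.

```python
# Weights copied from the Invoice model; scores copied from the sample output.
weights = {"shipment_id": 3.0, "amount": 2.0, "line_items": 2.0}
field_scores = {"shipment_id": 1.0, "amount": 0.0, "line_items": 0.926}


def weighted_overall(scores, weights):
    """Weighted average of per-field scores (assumed aggregation rule)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


overall = weighted_overall(field_scores, weights)
print(f"{overall:.3f}")  # 0.693

# The amount difference is about 0.02, which is larger than the 0.01 tolerance.
amount_diff = abs(1247.50 - 1247.48)
print(amount_diff > 0.01)  # True
```

Under this assumption the arithmetic is (3.0 × 1.0 + 2.0 × 0.0 + 2.0 × 0.926) / 7.0 ≈ 0.693, which agrees with the reported overall score. It also shows how much the zeroed amount field costs: it carries 2 of the 7 total weight units.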
What You Just Did
- Defined models with ComparableField: Each field declares its own comparator and weight, turning a plain data class into an evaluation-aware structure.
- Chose specialized comparators: ExactComparator for the shipment ID that must match perfectly, NumericComparator with a tolerance for currency amounts, and LevenshteinComparator for product names that may have minor variations.
- Applied business-weighted scoring: The shipment ID carries a weight of 3.0 because a wrong ID routes packages to the wrong warehouse. Lower-priority fields have smaller weights. The overall score is a weighted average that reflects operational impact.
- Used Hungarian matching for lists: The line_items field contains a list of LineItem objects. Stickler uses the Hungarian algorithm to find the optimal one-to-one pairing between ground truth and prediction items, regardless of order.
- Compared and got results: compare_with returns a dictionary with an overall_score and per-field field_scores, so you can see exactly where the differences are and how much they matter.
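To make the list-matching step concrete, here is a standalone sketch that reproduces the 0.926 line_items score without importing Stickler. It rests on several assumptions that the text above does not confirm: name similarity is 1 minus the edit distance divided by the longer string's length, quantity and price score 1.0 when they match and 0.0 otherwise, each item's score is a weighted average using the LineItem weights, and the list score is the mean over matched pairs. It also swaps the Hungarian algorithm for a brute-force search over all pairings, which finds the same optimum on tiny lists but does not scale. All function names are hypothetical.

```python
from itertools import permutations


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def item_score(gt, pred):
    """Weighted average over LineItem fields, using the weights from the model."""
    product = name_similarity(gt["product"], pred["product"])
    quantity = 1.0 if gt["quantity"] == pred["quantity"] else 0.0
    price = 1.0 if abs(gt["price"] - pred["price"]) <= 0.01 else 0.0
    return (1.0 * product + 0.8 * quantity + 1.2 * price) / (1.0 + 0.8 + 1.2)


def best_pairing(gt_items, pred_items):
    """Try every one-to-one pairing and keep the best (equal-length lists only)."""
    best_score, best_order = -1.0, None
    for order in permutations(range(len(pred_items))):
        total = sum(item_score(g, pred_items[j]) for g, j in zip(gt_items, order))
        if total > best_score:
            best_score, best_order = total, order
    return best_score / len(gt_items), best_order


gt_items = [
    {"product": "Wireless Mouse", "quantity": 2, "price": 29.99},
    {"product": "USB Cable", "quantity": 5, "price": 12.99},
]
pred_items = [
    {"product": "USB Cord", "quantity": 5, "price": 12.99},
    {"product": "Wireless Mouse", "quantity": 2, "price": 29.99},
]

score, order = best_pairing(gt_items, pred_items)
print(f"{score:.3f}", order)  # 0.926 (1, 0)
```

The pairing (1, 0) means the first ground-truth item is matched with the second predicted item and vice versa, so the reordering costs nothing. Only the "USB Cable" versus "USB Cord" name difference (edit distance 4 over 9 characters) pulls the score below 1.0. For longer lists, a polynomial-time solver such as SciPy's linear_sum_assignment would be the usual replacement for the permutation loop.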
Evaluate a Test Set
In production, you'll compare many document pairs at once. BulkStructuredModelEvaluator handles this with streaming aggregation and progress reporting:
from stickler.structured_object_evaluator.bulk_structured_model_evaluator import BulkStructuredModelEvaluator
evaluator = BulkStructuredModelEvaluator(target_schema=Invoice, verbose=True)
# your_test_set: any iterable of (ground-truth dict, prediction dict, document ID) tuples
for gt_json, pred_json, doc_id in your_test_set:
    gt = Invoice(**gt_json)
    pred = Invoice(**pred_json)
    evaluator.update(gt, pred, doc_id)
result = evaluator.compute()
print(f"Aggregate F1: {result.overall_metrics['f1']:.3f}")
See Bulk Evaluation for the full guide.
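As a rough illustration of what streaming aggregation means here, the sketch below keeps only running counts while documents pass through, then derives precision, recall, and F1 at the end. It is a generic pattern, not Stickler's implementation: the StreamingF1 class, the exact-equality comparison, and the way true positives, false positives, and false negatives are counted are all assumptions made for this example.

```python
class StreamingF1:
    """Hypothetical accumulator: keeps running counts, not the documents."""

    def __init__(self):
        self.tp = 0  # fields predicted correctly
        self.fp = 0  # fields predicted with a wrong or unexpected value
        self.fn = 0  # expected fields that were missed or wrong

    def update(self, gt, pred):
        """Fold one (ground truth, prediction) pair of flat dicts into the counts."""
        for name, expected in gt.items():
            if name in pred and pred[name] == expected:
                self.tp += 1
            else:
                self.fn += 1
                if name in pred:
                    self.fp += 1
        # Fields the prediction invented that the ground truth does not have.
        self.fp += sum(1 for name in pred if name not in gt)

    def compute(self):
        """Derive the aggregate metrics from the counts seen so far."""
        precision = self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0
        recall = self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}


acc = StreamingF1()
acc.update({"id": "A1", "total": 10.0}, {"id": "A1", "total": 12.0})  # one wrong field
acc.update({"id": "B2", "total": 5.0}, {"id": "B2", "total": 5.0})    # all correct
metrics = acc.compute()
print(f"{metrics['f1']:.3f}")  # 0.750
```

Because each update only adjusts three integers, memory use stays constant no matter how many document pairs are processed, which is the main appeal of aggregating this way on large test sets.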