# Use Cases
Stickler is designed for any scenario where you need to compare structured JSON outputs against expected results. Whether you are evaluating GenAI extraction pipelines, validating ETL transformations, or monitoring data quality, Stickler provides field-level control over how comparisons are performed and scored.
Below are common patterns with model examples and comparator recommendations for each.
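As a mental model for how field-level weights shape an overall score, here is a minimal sketch of weighted aggregation. This is illustrative only: the function name and formula are not Stickler's actual implementation, which may normalize or aggregate differently.

```python
def weighted_score(field_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Illustrative weighted average of per-field similarity scores."""
    total_weight = sum(weights.values())
    return sum(field_scores[name] * weights[name] for name in field_scores) / total_weight

scores = {"invoice_id": 1.0, "vendor_name": 0.9, "total_amount": 1.0}
weights = {"invoice_id": 3.0, "vendor_name": 1.5, "total_amount": 2.5}
print(weighted_score(scores, weights))  # fields with higher weights dominate the result
```

The takeaway: a mismatch on a weight-3.0 field hurts the overall score far more than the same mismatch on a weight-0.5 field, which is why the examples below weight fields by business impact.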
## Document Extraction
The primary use case. Extract structured data from documents (invoices, forms, receipts) and evaluate extraction accuracy against ground truth annotations.
Comparator recommendations:
- `ExactComparator` for IDs and codes (invoice numbers, PO numbers)
- `NumericComparator` for monetary amounts and quantities
- `LevenshteinComparator` for names, addresses, and short text
- `FuzzyComparator` for descriptions and free-form notes
```python
from typing import List

from stickler import StructuredModel, ComparableField
from stickler.comparators import ExactComparator, NumericComparator, LevenshteinComparator, FuzzyComparator


class LineItem(StructuredModel):
    description: str = ComparableField(comparator=FuzzyComparator(), weight=1.0)
    quantity: int = ComparableField(comparator=NumericComparator(tolerance=0), weight=1.2)
    unit_price: float = ComparableField(comparator=NumericComparator(tolerance=0.01), weight=1.5)


class Invoice(StructuredModel):
    invoice_id: str = ComparableField(comparator=ExactComparator(), weight=3.0, threshold=1.0)
    vendor_name: str = ComparableField(comparator=LevenshteinComparator(), weight=1.5, threshold=0.8)
    total_amount: float = ComparableField(comparator=NumericComparator(tolerance=0.01), weight=2.5)
    line_items: List[LineItem] = ComparableField(weight=2.0)
```
See the Comparators documentation for details on each comparator.
> **Evaluate your full test set**
>
> Use `BulkStructuredModelEvaluator` to evaluate all your document pairs at once with streaming aggregation and metrics export. See Bulk Evaluation.
## OCR Evaluation
Compare OCR engine output against ground truth transcriptions. LevenshteinComparator is well suited here because it measures character-level edit distance, directly reflecting the kinds of errors OCR systems produce (substitutions, insertions, deletions).
```python
from stickler import StructuredModel, ComparableField
from stickler.comparators import LevenshteinComparator, NumericComparator


class OCRTextBlock(StructuredModel):
    text: str = ComparableField(
        comparator=LevenshteinComparator(), weight=3.0, threshold=0.7
    )
    x_position: float = ComparableField(
        comparator=NumericComparator(tolerance=5.0), weight=0.5
    )
    y_position: float = ComparableField(
        comparator=NumericComparator(tolerance=5.0), weight=0.5
    )
    page_number: int = ComparableField(weight=1.0, threshold=1.0)
```
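To make the edit-distance intuition concrete, here is a self-contained sketch of a normalized Levenshtein similarity. This is the standard dynamic-programming algorithm, not Stickler's internal code; `LevenshteinComparator`'s exact normalization may differ.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    # Classic row-by-row edit distance (substitutions, insertions, deletions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

# A single OCR substitution ("0" misread for "O") costs exactly one edit:
print(levenshtein_similarity("INV0ICE", "INVOICE"))  # 1 - 1/7 ≈ 0.857
```

Because each OCR error maps to one edit operation, the similarity degrades gracefully with the number of misread characters, which is exactly the behavior you want when scoring transcriptions.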
Use `BulkStructuredModelEvaluator` when evaluating OCR across many documents. See Bulk Evaluation for guidance.
## Entity Extraction
Named entity recognition (NER) and entity extraction from text. Compare extracted entities against a labeled ground truth set. The Hungarian algorithm handles reordered entities automatically.
```python
from typing import List

from stickler import StructuredModel, ComparableField
from stickler.comparators import ExactComparator, LevenshteinComparator, NumericComparator


class Entity(StructuredModel):
    name: str = ComparableField(
        comparator=LevenshteinComparator(), weight=2.0, threshold=0.8
    )
    entity_type: str = ComparableField(
        comparator=ExactComparator(), weight=2.5, threshold=1.0
    )
    confidence: float = ComparableField(
        comparator=NumericComparator(tolerance=0.1), weight=0.3
    )


class ExtractionResult(StructuredModel):
    document_id: str = ComparableField(comparator=ExactComparator(), weight=1.0)
    entities: List[Entity] = ComparableField(weight=3.0)
```
When entity type must be exact but the entity name can tolerate minor variations (e.g., "John Smith" vs. "Jon Smith"), assign a higher weight and stricter threshold to `entity_type` and a more lenient threshold to `name`.
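The order-invariant matching mentioned above can be illustrated with a brute-force optimal assignment over permutations. This is a stand-in for the Hungarian algorithm, which finds the same optimal pairing in polynomial time; it is not Stickler's internal code, and the exact-match `pair_score` is a simplification of a real comparator.

```python
from itertools import permutations

def best_matching_score(predicted: list[str], truth: list[str]) -> float:
    """Average per-pair score under the best pairing of predicted to truth.

    Assumes equal-length lists. Brute force over permutations is fine for
    tiny lists but exponential in general; the Hungarian algorithm computes
    the same optimum efficiently.
    """
    def pair_score(a: str, b: str) -> float:
        return 1.0 if a == b else 0.0  # exact-match stand-in for a comparator

    best = 0.0
    for perm in permutations(truth):
        total = sum(pair_score(p, t) for p, t in zip(predicted, perm))
        best = max(best, total / len(truth))
    return best

# Entities extracted in a different order still match perfectly:
print(best_matching_score(["Acme Corp", "John Smith"], ["John Smith", "Acme Corp"]))  # 1.0
```

The point is that list scoring depends on which pairs exist, not on the order the model emitted them, so a correct but reordered extraction is not penalized.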
## ML Model Evaluation
Compare ML model predictions against ground truth labels. Useful for both regression outputs (use `NumericComparator` with tolerance) and classification outputs (use `ExactComparator` for labels).
```python
from typing import List

from stickler import StructuredModel, ComparableField
from stickler.comparators import ExactComparator, NumericComparator


class Prediction(StructuredModel):
    sample_id: str = ComparableField(comparator=ExactComparator(), weight=1.0)
    predicted_label: str = ComparableField(
        comparator=ExactComparator(), weight=3.0, threshold=1.0
    )
    predicted_score: float = ComparableField(
        comparator=NumericComparator(tolerance=0.05), weight=1.5
    )


class RegressionOutput(StructuredModel):
    sample_id: str = ComparableField(comparator=ExactComparator(), weight=1.0)
    predicted_value: float = ComparableField(
        comparator=NumericComparator(tolerance=0.1), weight=3.0
    )
```
Enable `include_confusion_matrix=True` when calling `compare_with()` to get precision, recall, and F1 metrics alongside the similarity scores.
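As a refresher on what those metrics mean, here is the standard derivation of precision, recall, and F1 from confusion-matrix counts. This is textbook math, independent of Stickler's API.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Precision, recall, and F1 from true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(classification_metrics(tp=8, fp=2, fn=2))
# precision = 8/10 = 0.8, recall = 8/10 = 0.8, f1 = 0.8
```

Precision penalizes spurious predictions, recall penalizes missed ones, and F1 is their harmonic mean, so a field that is frequently hallucinated will show low precision even when recall looks healthy.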
## ETL Validation
Validate that ETL pipeline outputs match expected results. Stickler ensures data transformations produce the correct structured output by comparing field-by-field with appropriate tolerances.
```python
from typing import List

from stickler import StructuredModel, ComparableField
from stickler.comparators import ExactComparator, NumericComparator, LevenshteinComparator


class TransformedRecord(StructuredModel):
    record_id: str = ComparableField(
        comparator=ExactComparator(), weight=3.0, threshold=1.0
    )
    category: str = ComparableField(
        comparator=ExactComparator(), weight=2.0, threshold=1.0
    )
    normalized_name: str = ComparableField(
        comparator=LevenshteinComparator(), weight=1.5, threshold=0.9
    )
    computed_total: float = ComparableField(
        comparator=NumericComparator(tolerance=0.001), weight=2.5, threshold=0.99
    )


class ETLBatch(StructuredModel):
    batch_id: str = ComparableField(comparator=ExactComparator(), weight=1.0)
    records: List[TransformedRecord] = ComparableField(weight=3.0)
```
For ETL validation, use tight tolerances and high thresholds. Deterministic transformations should produce near-exact results, so the evaluation criteria should reflect that expectation.
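The effect of a tight tolerance can be sketched as follows. This assumes tolerance means a maximum allowed absolute difference, as the 0.01-for-currency examples above suggest; check the `NumericComparator` documentation for the exact semantics.

```python
def within_tolerance(expected: float, actual: float, tolerance: float) -> bool:
    # Illustrative: treat tolerance as a maximum allowed absolute difference.
    return abs(expected - actual) <= tolerance

# A loose tolerance accepts a half-cent discrepancy; a tight ETL tolerance flags it:
print(within_tolerance(100.0, 100.005, tolerance=0.01))   # True
print(within_tolerance(100.0, 100.005, tolerance=0.001))  # False
```

For deterministic pipelines, prefer the tight setting: any nonzero drift usually indicates a real bug (rounding, type coercion, locale parsing) rather than acceptable noise.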
## Data Quality Monitoring
Ongoing monitoring of data quality by comparing incoming data against baseline or expected patterns. Run evaluations periodically and track scores over time to catch regressions.
```python
from stickler import StructuredModel, ComparableField
from stickler.comparators import ExactComparator, NumericComparator, LevenshteinComparator


class CustomerRecord(StructuredModel):
    customer_id: str = ComparableField(
        comparator=ExactComparator(), weight=3.0, threshold=1.0
    )
    full_name: str = ComparableField(
        comparator=LevenshteinComparator(), weight=1.5, threshold=0.85
    )
    account_balance: float = ComparableField(
        comparator=NumericComparator(tolerance=0.01), weight=2.0
    )
    status: str = ComparableField(
        comparator=ExactComparator(), weight=2.0, threshold=1.0
    )
```
Combine with `BulkStructuredModelEvaluator` and `save_metrics()` to produce evaluation reports. Compare metrics across runs to detect data quality drift. See Bulk Evaluation for batch processing patterns.
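Run-over-run drift detection on exported metrics can be sketched like this. The field names, scores, and drop threshold are illustrative, and Stickler's actual metrics format may differ.

```python
def detect_drift(baseline: dict[str, float], current: dict[str, float],
                 max_drop: float = 0.05) -> list[str]:
    """Return the fields whose score dropped by more than max_drop since baseline."""
    return [
        name for name, base_score in baseline.items()
        if base_score - current.get(name, 0.0) > max_drop
    ]

baseline = {"full_name": 0.96, "account_balance": 0.99, "status": 1.0}
current = {"full_name": 0.88, "account_balance": 0.99, "status": 1.0}
print(detect_drift(baseline, current))  # ['full_name']
```

A field absent from the current run is treated as a score of 0.0 here, so disappearing fields are flagged as drift rather than silently ignored.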
## Choosing the Right Pattern
| Use Case | Primary Comparators | Key Consideration |
|---|---|---|
| Document Extraction | Exact, Numeric, Levenshtein, Fuzzy | Weight fields by business impact |
| OCR Evaluation | Levenshtein | Character-level accuracy matters |
| Entity Extraction | Exact, Levenshtein | Entity type must be exact; names can be lenient |
| ML Model Evaluation | Exact, Numeric | Use confusion matrix for classification metrics |
| ETL Validation | Exact, Numeric | Tight tolerances for deterministic pipelines |
| Data Quality Monitoring | Exact, Numeric, Levenshtein | Track scores over time to detect drift |
For all use cases, `BulkStructuredModelEvaluator` is the recommended way to evaluate a test set. Define your `StructuredModel`, choose comparators based on field semantics, set thresholds and weights based on business impact, and use `BulkStructuredModelEvaluator` for production evaluation. See Bulk Evaluation for the full guide and Best Practices for guidance on threshold tuning and weight assignment.