Skip to content

Customizing Your Evaluation

Stickler gives you fine-grained control over how every field in your structured data is compared. You can tune comparison algorithms, thresholds, and weights at the field level -- whether you define your models in Python or through JSON Schema configuration files.

This guide covers three ways to configure evaluation behavior:

  1. ComparableField parameters in Python model definitions
  2. The compare_with() method for running comparisons
  3. JSON Schema extensions for configuration-driven evaluation

Evaluating a test set?

If you need to evaluate many document pairs (not just one), use BulkStructuredModelEvaluator — it handles streaming aggregation, progress reporting, and metrics export. See the Bulk Evaluation guide.


How Evaluation Works

Stickler evaluates structured data from the inside out. At the innermost layer, Comparators compute a raw similarity score (0.0–1.0) between two primitive values. Each ComparableField wraps a comparator with a threshold and weight — scores below the threshold are clipped to zero, and weights control how much the field matters. StructuredModels aggregate field scores into a weighted average. For list fields, the Hungarian algorithm finds the optimal one-to-one pairing between ground truth and prediction items before scoring.

graph LR
    A["Comparator<br/>(raw score 0.0–1.0)"] --> B["ComparableField<br/>(threshold + clip + weight)"]
    B --> C["StructuredModel<br/>(weighted average)"]
    C --> D["List matching<br/>(Hungarian algorithm)"]
    D --> E["Overall Score"]

ComparableField Parameters

When defining a StructuredModel subclass in Python, each field is declared with ComparableField(...). This function accepts comparison parameters that control how that specific field is evaluated.

Parameter Reference

Parameter Type Default Description
comparator BaseComparator LevenshteinComparator() The comparison algorithm to use. See Comparators for the full list.
threshold float (0.0--1.0) 0.5 Minimum similarity score required for a field to be classified as a match.
weight float (> 0.0) 1.0 Relative importance of this field when computing aggregate scores.
clip_under_threshold bool True When True, scores below threshold are zeroed out before contributing to the weighted average.
aggregate bool False Deprecated. Previously controlled inclusion in parent-level metrics. All nodes now include an automatic aggregate field in the confusion matrix output.

How Each Parameter Affects Scoring

comparator -- Determines which algorithm computes the raw similarity score between two field values. Different comparators suit different data types:

  • LevenshteinComparator for names and addresses (edit-distance based)
  • ExactComparator for IDs and codes (binary match)
  • NumericComparator for prices and quantities (tolerance-based)
  • FuzzyComparator for descriptions (token-based, order-independent)

See the Comparators section for the complete list and detailed descriptions.

threshold -- Acts as a binary classification cutoff. A similarity score at or above the threshold counts as a True Positive; below it counts as a False Discovery. Choose stricter thresholds (0.9--1.0) for critical fields and looser thresholds (0.5--0.7) for flexible fields.

weight -- Controls how much a field contributes to the overall score. The overall score is computed as:

overall_score = sum(field_score * weight) / sum(weights)

Fields with higher weights pull the overall score toward their individual result.

clip_under_threshold -- When enabled (the default), a field that scores below its threshold contributes 0.0 to the weighted average instead of its partial similarity. This prevents low-confidence matches from inflating the overall score.

Example Model

from stickler.comparators.levenshtein import LevenshteinComparator
from stickler.comparators.exact import ExactComparator
from stickler.comparators.numeric import NumericComparator
from stickler.comparators.fuzzy import FuzzyComparator
from stickler.structured_object_evaluator.models.comparable_field import ComparableField
from stickler.structured_object_evaluator.models.structured_model import StructuredModel


class Invoice(StructuredModel):
    invoice_id: str = ComparableField(
        comparator=ExactComparator(),  # Must match exactly
        threshold=1.0,
        weight=3.0,                    # Highest weight — wrong ID = wrong customer
        clip_under_threshold=True,
    )

    customer_name: str = ComparableField(
        comparator=LevenshteinComparator(),  # Tolerates typos
        threshold=0.8,
        weight=1.5,
    )

    total_amount: float = ComparableField(
        comparator=NumericComparator(),  # Tolerance-based numeric comparison
        threshold=0.95,
        weight=2.5,
    )

    notes: str = ComparableField(
        comparator=FuzzyComparator(),  # Low threshold, minimal weight — cosmetic field
        threshold=0.6,
        weight=0.3,
    )

The compare_with() Method

Once you have two model instances -- a ground truth and a prediction -- call compare_with() to evaluate them.

Basic Usage

result = ground_truth.compare_with(prediction)

print(f"Overall score: {result['overall_score']:.2%}")
print(f"All fields matched: {result['all_fields_matched']}")

for field, score in result['field_scores'].items():
    print(f"  {field}: {score:.3f}")

The default output contains three keys:

  • overall_score (float) -- Weighted average of all field scores (0.0 to 1.0).
  • field_scores (dict) -- Maps each field name to its similarity score.
  • all_fields_matched (bool) -- True when every field meets or exceeds its threshold.

Key Parameters

Parameter Type Default What It Enables
include_confusion_matrix bool False Adds a confusion_matrix key with TP/FP/TN/FN/FD/FA counts and derived precision, recall, F1, and accuracy metrics at both the overall and field levels.
document_non_matches bool False Adds a non_matches list with details about every field that failed to match, including the field path, non-match type, both values, and a human-readable reason.
document_field_comparisons bool False Adds a field_comparisons list documenting every field-level comparison (both matches and non-matches) with expected/actual keys and values, scores, and reasons.
add_confidence_metrics bool False Adds an auroc_confidence_metric for evaluating confidence calibration.
evaluator_format bool False Restructures the output for bulk evaluation integration. See Evaluator Format below.

Example with Detailed Metrics

result = ground_truth.compare_with(
    prediction,
    include_confusion_matrix=True,
    document_non_matches=True,
)

# Overall score
print(f"Score: {result['overall_score']:.3f}")

# Confusion matrix totals
cm = result['confusion_matrix']['aggregate']
print(f"Precision: {cm['derived']['cm_precision']:.3f}")
print(f"Recall:    {cm['derived']['cm_recall']:.3f}")
print(f"F1:        {cm['derived']['cm_f1']:.3f}")

# Inspect non-matches for debugging
for nm in result.get('non_matches', []):
    print(f"  {nm['field_path']}: {nm['non_match_type']} "
          f"(score={nm['similarity_score']:.3f})")

JSON Schema Extensions

For configuration-driven evaluation -- where you want to define models and comparison logic without writing Python code -- Stickler supports standard JSON Schema (Draft 7+) with custom x-aws-stickler-* extensions.

Extension Reference

Add these extensions to any property in your JSON Schema to control comparison behavior:

Extension Type Default Purpose
x-aws-stickler-comparator string Type-dependent Comparison algorithm (e.g., "ExactComparator", "LevenshteinComparator")
x-aws-stickler-threshold number (0.0--1.0) 0.5 or 1.0 Match classification cutoff
x-aws-stickler-weight number (> 0.0) 1.0 Field importance multiplier
x-aws-stickler-clip-under-threshold boolean false Zero out scores below threshold
x-aws-stickler-aggregate boolean false Include in parent-level aggregate metrics
x-aws-stickler-model-name string "DynamicModel" Name of the generated Python class (root level)
x-aws-stickler-match-threshold number (0.0--1.0) 0.7 Model-level matching threshold for Hungarian algorithm (root level)

Example Schema

{
  "type": "object",
  "x-aws-stickler-model-name": "Invoice",
  "x-aws-stickler-match-threshold": 0.75,
  "properties": {
    "invoice_id": {
      "type": "string",
      "x-aws-stickler-comparator": "ExactComparator",
      "x-aws-stickler-threshold": 1.0,
      "x-aws-stickler-weight": 3.0,
      "x-aws-stickler-clip-under-threshold": true
    },
    "customer_name": {
      "type": "string",
      "x-aws-stickler-comparator": "LevenshteinComparator",
      "x-aws-stickler-threshold": 0.8,
      "x-aws-stickler-weight": 1.5
    },
    "total_amount": {
      "type": "number",
      "x-aws-stickler-comparator": "NumericComparator",
      "x-aws-stickler-threshold": 0.95,
      "x-aws-stickler-weight": 2.5
    }
  },
  "required": ["invoice_id", "customer_name", "total_amount"]
}

Loading a Schema

from stickler.structured_object_evaluator.models.structured_model import StructuredModel
import json

with open("invoice_schema.json") as f:
    schema = json.load(f)

Invoice = StructuredModel.from_json_schema(schema)

ground_truth = Invoice(**{"invoice_id": "INV-001", "customer_name": "Acme Corp", "total_amount": 1250.00})
prediction = Invoice(**{"invoice_id": "INV-001", "customer_name": "ACME Corporation", "total_amount": 1250.00})

result = ground_truth.compare_with(prediction)
print(f"Overall Score: {result['overall_score']:.3f}")
Sample Output
{
  "field_scores": {
    "invoice_id": 1.0,
    "customer_name": 0.0,
    "total_amount": 1.0
  },
  "overall_score": 0.786,
  "all_fields_matched": false
}

Note that customer_name scores 0.0: "Acme Corp" vs "ACME Corporation" produces a Levenshtein similarity below the 0.8 threshold, so clip_under_threshold zeros it out.

For the complete reference on all extensions, default comparators by type, and a full production example, see the JSON Schema Extensions section in the project README.


Evaluator Format

When you need output structured for bulk evaluation pipelines, pass evaluator_format=True. This restructures the result into a format optimized for aggregation.

result = ground_truth.compare_with(prediction, evaluator_format=True)

The output changes to:

{
  "overall": {
    "precision": 0.83,
    "recall": 1.0,
    "f1": 0.91,
    "accuracy": 0.83,
    "anls_score": 0.83
  },
  "fields": {
    "invoice_id": { "precision": 1.0, "recall": 1.0, "f1": 1.0, "accuracy": 1.0, "anls_score": 1.0 },
    "customer_name": { "precision": 1.0, "recall": 1.0, "f1": 1.0, "accuracy": 1.0, "anls_score": 0.85 }
  },
  "confusion_matrix": {},
  "non_matches": []
}

Key differences from the default output:

  • The top-level key is overall (a dict of metrics), not overall_score (a float).
  • Each field in fields contains precision, recall, F1, accuracy, and ANLS score.
  • The confusion_matrix and non_matches keys are present but empty in this format.

This format is what BulkStructuredModelEvaluator expects when aggregating results across many documents.


Next Steps

  • Bulk Evaluation -- Evaluate hundreds or thousands of document pairs with memory-efficient streaming.
  • Understanding Results -- Learn how to read confusion matrices, derived metrics, and non-match reports.