Classification Logic for Evaluation Metrics
This document defines the classification logic used in the stickler library for evaluating predictions against ground truth.
Core Definitions
The confusion matrix metrics classify comparisons into five categories:
| Category | Abbreviation | Definition |
|---|---|---|
| True Positive (TP) | TP | GT != null, EST != null, GT == EST (similarity at or above threshold) |
| False Alarm (FA) | FA | GT == null, EST != null |
| True Negative (TN) | TN | GT == null, EST == null |
| False Negative (FN) | FN | GT != null, EST == null |
| False Discovery (FD) | FD | GT != null, EST != null, GT != EST (similarity below threshold) |
False alarm (FA) and false discovery (FD) are the two kinds of false positive (FP), so the FP count is their sum:
FP = FA + FD
Where:
- GT = Ground Truth
- EST = Estimate (Prediction)
Classification Logic by Data Type
Simple Values (Strings, Numbers, etc.)
| Ground Truth | Prediction | Classification | Explanation |
|---|---|---|---|
| "value" | "value" | TP | Exact match |
| "value" | "similar" | FD | Both non-null but don't match above threshold |
| "value" | null | FN | Missing prediction for existing ground truth |
| null | "value" | FA | Prediction exists but no ground truth (False Alarm, counted as FP) |
| null | null | TN | Correctly predicted absence |
| "" (empty) | null | Treated as TN | Empty strings are treated as null |
| null | "" (empty) | Treated as TN | Empty strings are treated as null |
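The table above can be expressed as a small decision function. This is a minimal sketch, not the stickler implementation: `classify_value`, `is_null`, and `similarity` are hypothetical names, the `difflib`-based similarity is a stand-in for whatever comparator the library actually uses, and the 0.7 threshold is only an illustrative default.

```python
from difflib import SequenceMatcher


def is_null(value):
    """Treat None and empty strings as 'no data', as in the table above."""
    return value is None or value == ""


def similarity(a, b):
    """Hypothetical similarity in [0, 1]; the real comparator may differ."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def classify_value(gt, est, threshold=0.7):
    """Classify one ground-truth/prediction pair as TP, FD, FN, FA, or TN."""
    gt_null, est_null = is_null(gt), is_null(est)
    if gt_null and est_null:
        return "TN"
    if gt_null:
        return "FA"  # prediction without ground truth
    if est_null:
        return "FN"  # ground truth without prediction
    # Both sides carry data: the threshold decides between TP and FD.
    return "TP" if similarity(gt, est) >= threshold else "FD"
```

The null checks run before any similarity computation, so the comparator is only ever called on two non-null values.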
Lists
For lists, we use the Hungarian algorithm to find the optimal one-to-one matching between GT and EST elements:
- Empty Lists:
  - GT = [], EST = [] → TN (both empty)
  - GT = [], EST = ["item"] → FA (False Alarm for each prediction item)
  - GT = ["item"], EST = [] → FN (False Negative for each ground truth item)
- Element Matching:
  - Each element in GT is matched with at most one element in EST
  - Each element in EST is matched with at most one element in GT
  - Matching maximizes overall similarity
- Classification of Matched Elements:
  - If similarity ≥ threshold → TP
  - If similarity < threshold → FD (False Discovery)
- Classification of Unmatched Elements:
  - Unmatched GT elements → FN
  - Unmatched EST elements → FA (False Alarm)
Example 1: Mixed Matching
GT = ["red", "blue", "green"]
EST = ["red", "yellow", "orange", "blue"]
Matching:
- "red" matches "red" → TP
- "blue" matches "blue" → TP
- "green" has no match → FN
- "yellow" has no match in GT → FA (False Alarm)
- "orange" has no match in GT → FA (False Alarm)
Result: TP=2, FA=2, TN=0, FN=1, FD=0
Example 2: Similar But Not Exact
GT = ["apple", "banana", "cherry"]
EST = ["aple", "bananna", "cheery"]
Matching (assuming threshold = 0.7):
- "apple" matches "aple" with similarity 0.8 → TP
- "banana" matches "bananna" with similarity 0.85 → TP
- "cherry" matches "cheery" with similarity 0.83 → TP
Result: TP=3, FA=0, TN=0, FN=0, FD=0
Example 3: Below Threshold
GT = ["apple", "banana", "cherry"]
EST = ["appx", "bnn", "chry"]
Matching (assuming threshold = 0.7):
- "apple" matches "appx" with similarity 0.5 → FD
- "banana" matches "bnn" with similarity 0.6 → FD
- "cherry" matches "chry" with similarity 0.65 → FD
Result: TP=0, FA=0, TN=0, FN=0, FD=3
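The list rules can be sketched in a few lines of Python. To stay dependency-free, this sketch finds the best pairing by exhaustive search, which reaches the same optimum as the Hungarian algorithm but has factorial cost, so it is only suitable for very short lists. All names here (`best_pairs`, `classify_lists`, `similarity`) are hypothetical, the `difflib` similarity is a stand-in comparator, and two choices are assumptions rather than documented behaviour: every element of the shorter list is paired with something, and a pair of empty lists is counted as a single TN.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """Hypothetical similarity in [0, 1]; the real comparator may differ."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def best_pairs(gt, est):
    """Return (gt_index, est_index) pairs maximizing total similarity.

    Exhaustive search stands in for the Hungarian algorithm here.
    """
    swap = len(gt) > len(est)
    small, large = (est, gt) if swap else (gt, est)
    best, best_score = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(similarity(small[i], large[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score, best = score, list(enumerate(perm))
    # Indices were taken as (small, large); restore (gt, est) order.
    return [(j, i) if swap else (i, j) for i, j in best]


def classify_lists(gt, est, threshold=0.7):
    """Tally TP/FD for matched pairs and FN/FA for unmatched elements."""
    counts = dict.fromkeys(("TP", "FA", "TN", "FN", "FD"), 0)
    if not gt and not est:
        counts["TN"] += 1  # assumption: two empty lists count as one TN
        return counts
    pairs = best_pairs(gt, est)
    for g, e in pairs:
        matched = similarity(gt[g], est[e]) >= threshold
        counts["TP" if matched else "FD"] += 1
    counts["FN"] += len(gt) - len(pairs)   # unmatched ground-truth elements
    counts["FA"] += len(est) - len(pairs)  # unmatched predicted elements
    return counts
```

Calling `classify_lists(["apple", "banana", "cherry"], ["aple", "bananna", "cheery"])` yields three TPs, in line with Example 2, although the individual similarity scores from `difflib` differ from the illustrative values quoted there. How the library treats a forced pairing between very dissimilar elements is not covered by this sketch.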
Nested Objects/Dictionaries
For nested objects, we apply the classification logic recursively:
- Empty Objects:
  - GT = {}, EST = {} → TN
  - GT = {}, EST = {key: value} → FA (False Alarm)
  - GT = {key: value}, EST = {} → FN
- Field Matching:
  - Each field is evaluated independently
  - Fields present in both GT and EST are compared for similarity
  - Fields present in only one are classified as FA or FN
- Classification of Fields:
  - If both have field and similarity ≥ threshold → TP
  - If both have field and similarity < threshold → FD (False Discovery)
  - If only GT has field → FN
  - If only EST has field → FA (False Alarm)
Example:
GT = {name: "John", age: 30, address: "123 Main St"}
EST = {name: "John", age: 31, phone: "555-1234"}
Field-by-field:
- name: Both have it, exact match → TP
- age: Both have it, but different → FD
- address: Only in GT → FN
- phone: Only in EST → FA (False Alarm)
Result: TP=1, FA=1, TN=0, FN=1, FD=1
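The recursive field-by-field evaluation might look like the following sketch. It is illustrative only: `classify_fields` and `is_null` are hypothetical names, and leaf values are compared with plain equality (effectively a threshold of 1.0) rather than a similarity score, to keep the example short.

```python
def is_null(value):
    """Treat None and empty strings, lists, and dicts as 'no data'."""
    return value is None or value == "" or value == [] or value == {}


def classify_fields(gt, est, counts=None):
    """Recursively tally classifications for two dictionaries."""
    if counts is None:
        counts = dict.fromkeys(("TP", "FA", "TN", "FN", "FD"), 0)
    for key in sorted(set(gt) | set(est)):
        g, e = gt.get(key), est.get(key)  # a missing key reads as None
        if isinstance(g, dict) and isinstance(e, dict) and g and e:
            classify_fields(g, e, counts)  # recurse into nested objects
        elif is_null(g) and is_null(e):
            counts["TN"] += 1
        elif is_null(g):
            counts["FA"] += 1  # only EST has the field
        elif is_null(e):
            counts["FN"] += 1  # only GT has the field
        else:
            # Simplification: exact equality instead of a similarity threshold.
            counts["TP" if g == e else "FD"] += 1
    return counts


gt = {"name": "John", "age": 30, "address": "123 Main St"}
est = {"name": "John", "age": 31, "phone": "555-1234"}
result = classify_fields(gt, est)
```

For the example above, `result` holds TP=1, FA=1, TN=0, FN=1, FD=1.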
Derived Metrics
From the base confusion matrix counts, we derive the following metrics:
- Precision: TP / (TP + FP)
  - Measures how many of the predicted values are correct
- Recall: TP / (TP + FN)
  - Measures how many of the ground truth values are correctly predicted
- F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
  - Harmonic mean of precision and recall
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
  - Overall correctness of predictions
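As a worked illustration, the formulas above can be computed from the five base counts, with FP taken as FA + FD. The function name `derived_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumption made for this sketch, not documented library behaviour.

```python
def derived_metrics(tp, fa, fd, fn, tn):
    """Compute precision, recall, F1, and accuracy from base counts."""
    fp = fa + fd  # both false alarms and false discoveries are false positives

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the nested-object example: TP=1, FA=1, TN=0, FN=1, FD=1
metrics = derived_metrics(tp=1, fa=1, fd=1, fn=1, tn=0)
```

With those counts, precision is 1/3, recall is 1/2, F1 is 0.4, and accuracy is 1/4.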
Edge Cases and Clarifications
1. Null vs. Empty Equivalence
Design Decision: Empty collections and null values are treated as equivalent in all comparisons.
- Empty strings (""), empty lists ([]), and empty objects ({}) are treated as null values
- This means comparing null with an empty collection results in TN (True Negative)
- Examples:
  - GT = null, EST = [] → TN (equivalent states representing "no data")
  - GT = [], EST = null → TN (equivalent states representing "no data")
  - GT = "", EST = null → TN (equivalent states representing "no data")
  - GT = {}, EST = null → TN (equivalent states representing "no data")
Rationale: Semantically, both null values and empty collections represent the absence of meaningful data. Distinguishing between these states would introduce unnecessary complexity and inconsistency in evaluation metrics. For practical evaluation purposes, "no data" should be treated uniformly regardless of its representation.
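A null check that follows this rule has to be more specific than a generic truthiness test, because values such as `0` and `False` are falsy in Python but are not listed above as "no data". The helper below is a sketch with a hypothetical name; keeping `0` and `False` as real data is an assumption of this example.

```python
def is_null(value):
    """Return True for None and for empty strings, lists, and dicts only."""
    if value is None:
        return True
    if isinstance(value, (str, list, dict)):
        return len(value) == 0
    # Assumption: numbers and booleans always count as data, even 0 and False.
    return False


def both_absent(gt, est):
    """True when the comparison should be classified as TN."""
    return is_null(gt) and is_null(est)
```

With this helper, `both_absent(None, [])` and `both_absent({}, None)` are both true, while `both_absent(0, None)` is false.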
2. Threshold Boundary
- Values exactly at the threshold are considered matches (TP)
- For example, if threshold = 0.7 and similarity = 0.7, this is a TP
3. List Order
- List order doesn't matter for matching
- The Hungarian algorithm finds the optimal matching regardless of order
4. Partial Matches in Lists
- Individual list elements do not receive partial credit
- Each element is classified as TP, FA, FN, or FD independently
5. Nested Lists
- For lists of objects, we apply the Hungarian algorithm at the list level
- Each matched pair of objects is then evaluated recursively
6. Missing Fields vs. Null Fields
- A missing field and a field with a null value are classified the same way:
  - Missing field in EST when GT has it → FN
  - Null field in EST when GT has non-null → FN
  - Missing field in GT when EST has it → FA (False Alarm)
  - Null field in GT when EST has non-null → FA (False Alarm)
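In the four cases above, a missing key and an explicit null lead to the same classification. One simple way to get that behaviour, assuming fields are held in plain dictionaries, is to read each field with `dict.get`, which returns `None` for an absent key. The function name `field_presence` is hypothetical.

```python
def field_presence(gt, est, field):
    """Classify a field by presence only; value comparison happens elsewhere."""
    g = gt.get(field)   # None if the key is missing or explicitly null
    e = est.get(field)
    if g is None and e is None:
        return "TN"
    if g is None:
        return "FA"
    if e is None:
        return "FN"
    return "compare"  # both present: the similarity threshold decides TP vs FD
```

Both `field_presence({"a": 1}, {}, "a")` and `field_presence({"a": 1}, {"a": None}, "a")` return `"FN"`.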
Summary of Key Points
- False Alarm (FA) occurs when the prediction includes something that doesn't exist in the ground truth (GT is null, EST is not null)
- False Discovery (FD) occurs when the prediction recognizes something that exists but gets it wrong (both GT and EST are not null, but they don't match)
- List matching uses the Hungarian algorithm to find optimal pairings, with unmatched items classified as FA or FN
- Nested structures are evaluated recursively, with each field or element classified independently
- Empty strings, lists, and objects are treated as null values for classification purposes