Evaluator

stickler.structured_object_evaluator.bulk_structured_model_evaluator

Stateful Bulk Evaluator for StructuredModel objects.

This module provides a modern stateful bulk evaluator inspired by PyTorch Lightning's stateful metrics and scikit-learn's incremental learning patterns. It supports memory-efficient processing of large datasets through accumulation-based evaluation.

stickler.structured_object_evaluator.bulk_structured_model_evaluator.BulkStructuredModelEvaluator

Stateful bulk evaluator for StructuredModel objects.

Inspired by PyTorch Lightning's stateful metrics and scikit-learn's incremental learning patterns. This evaluator accumulates evaluation state across multiple document processing calls, enabling memory-efficient evaluation of arbitrarily large datasets without loading everything into memory at once.

Key Features:

- Stateful accumulation (like PyTorch Lightning metrics)
- Memory-efficient streaming processing (like scikit-learn partial_fit)
- External control over data flow and error handling
- Checkpointing and recovery capabilities
- Distributed processing support via state merging
- Uses StructuredModel.compare_with() method directly
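
For orientation, here is a minimal usage sketch of the accumulation workflow. The Invoice schema class and the load_document_pairs() loader are hypothetical placeholders, not part of this module; only the evaluator calls themselves come from this API.

from stickler.structured_object_evaluator.bulk_structured_model_evaluator import (
    BulkStructuredModelEvaluator,
)

# Hypothetical StructuredModel subclass used purely for illustration.
# from my_project.schemas import Invoice

evaluator = BulkStructuredModelEvaluator(target_schema=Invoice, verbose=True)

# Stream document pairs from any source; nothing is kept in memory beyond
# the accumulated confusion-matrix counts and error records.
for doc_id, gt_dict, pred_dict in load_document_pairs():  # hypothetical loader
    gt_model = Invoice(**gt_dict)
    pred_model = Invoice(**pred_dict)
    evaluator.update(gt_model, pred_model, doc_id=doc_id)

result = evaluator.compute()          # ProcessEvaluation with aggregated metrics
print(result.metrics.get("cm_f1"))    # derived overall F1 from accumulated counts
evaluator.pretty_print_metrics()      # human-readable report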

Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
class BulkStructuredModelEvaluator:
    """
    Stateful bulk evaluator for StructuredModel objects.

    Inspired by PyTorch Lightning's stateful metrics and scikit-learn's incremental
    learning patterns. This evaluator accumulates evaluation state across multiple
    document processing calls, enabling memory-efficient evaluation of arbitrarily
    large datasets without loading everything into memory at once.

    Key Features:
    - Stateful accumulation (like PyTorch Lightning metrics)
    - Memory-efficient streaming processing (like scikit-learn partial_fit)
    - External control over data flow and error handling
    - Checkpointing and recovery capabilities
    - Distributed processing support via state merging
    - Uses StructuredModel.compare_with() method directly
    """

    def __init__(
        self,
        target_schema: Optional[Type[StructuredModel]] = None,
        verbose: bool = False,
        document_non_matches: bool = True,
        elide_errors: bool = False,
        individual_results_jsonl: Optional[str] = None,
    ):
        """
        Initialize the stateful bulk evaluator.

        Args:
            target_schema: Optional StructuredModel class for validation and processing.
                Required for update() and evaluate_dataframe(). Not required when using
                update_from_comparison_result() with pre-computed results.
            verbose: Whether to print detailed progress information
            document_non_matches: Whether to document detailed non-match information
            elide_errors: If True, skip documents with errors; if False, accumulate error metrics
            individual_results_jsonl: Optional path to JSONL file for appending individual comparison results
        """
        self.target_schema = target_schema
        self.verbose = verbose
        self.document_non_matches = document_non_matches
        self.elide_errors = elide_errors
        self.individual_results_jsonl = individual_results_jsonl

        # Initialize state
        self.reset()

        self._schema_name = target_schema.__name__ if target_schema else "unknown"

        if self.verbose:
            print(f"Initialized BulkStructuredModelEvaluator for {self._schema_name}")
            if self.individual_results_jsonl:
                print(
                    f"Individual results will be appended to: {self.individual_results_jsonl}"
                )

    def reset(self) -> None:
        """
        Clear all accumulated state and start fresh evaluation.

        This method resets all internal counters, metrics, and error tracking
        to initial state, enabling reuse of the same evaluator instance for
        multiple evaluation runs.
        """
        # Accumulated confusion matrix state using nested defaultdicts
        self._confusion_matrix = {
            "overall": defaultdict(int),
            "fields": defaultdict(lambda: defaultdict(int)),
        }

        # Non-match tracking (when document_non_matches=True)
        self._non_matches = []

        # Error tracking
        self._errors = []

        # Processing statistics
        self._processed_count = 0
        self._start_time = time.time()

        if self.verbose:
            print("Reset evaluator state")

    def update(
        self,
        gt_model: StructuredModel,
        pred_model: StructuredModel,
        doc_id: Optional[str] = None,
    ) -> None:
        """
        Process a single document pair and accumulate the results in internal state.

        Runs compare_with() on the model pair, optionally writes the raw result
        to JSONL, then delegates accumulation to update_from_comparison_result().

        Args:
            gt_model: Ground truth StructuredModel instance
            pred_model: Predicted StructuredModel instance
            doc_id: Optional document identifier for error tracking
        """
        if doc_id is None:
            doc_id = f"doc_{self._processed_count}"

        try:
            comparison_result = gt_model.compare_with(
                pred_model,
                include_confusion_matrix=True,
                document_non_matches=self.document_non_matches,
            )

            # JSONL append of raw comparison result before accumulation
            if self.individual_results_jsonl:
                record = {"doc_id": doc_id, "comparison_result": comparison_result}
                with open(self.individual_results_jsonl, "a", encoding="utf-8") as f:
                    f.write(json.dumps(record) + "\n")

            self.update_from_comparison_result(comparison_result, doc_id)

        except Exception as e:
            error_record = {
                "doc_id": doc_id,
                "error": str(e),
                "error_type": type(e).__name__,
            }

            if not self.elide_errors:
                self._errors.append(error_record)
                self._confusion_matrix["overall"]["fn"] += 1

            if self.verbose:
                print(f"Error processing document {doc_id}: {str(e)}")

    def update_from_comparison_result(
        self,
        comparison_result: Dict[str, Any],
        doc_id: Optional[str] = None,
    ) -> None:
        """
        Accumulate a pre-computed compare_with() result into internal state.

        Unlike update(), this method does not require StructuredModel instances
        or re-run comparisons. It accepts the raw dictionary output of
        StructuredModel.compare_with(include_confusion_matrix=True) and
        accumulates its confusion matrix.

        Args:
            comparison_result: Dictionary returned by StructuredModel.compare_with()
                with include_confusion_matrix=True. Must contain a "confusion_matrix" key.
            doc_id: Optional document identifier for error tracking
        """
        if doc_id is None:
            doc_id = f"doc_{self._processed_count}"

        try:
            if "confusion_matrix" not in comparison_result:
                raise ValueError(
                    "comparison_result must contain a 'confusion_matrix' key. "
                    "Ensure compare_with() was called with include_confusion_matrix=True."
                )

            # Collect non-matches if enabled and present
            if self.document_non_matches and "non_matches" in comparison_result:
                for non_match in comparison_result["non_matches"]:
                    non_match_with_doc = non_match.copy()
                    non_match_with_doc["doc_id"] = doc_id
                    self._non_matches.append(non_match_with_doc)

            # Accumulate the confusion matrix
            self._accumulate_confusion_matrix(comparison_result["confusion_matrix"])

            self._processed_count += 1

            if self.verbose and self._processed_count % 1000 == 0:
                elapsed = time.time() - self._start_time
                print(f"Processed {self._processed_count} documents ({elapsed:.2f}s)")

        except Exception as e:
            error_record = {
                "doc_id": doc_id,
                "error": str(e),
                "error_type": type(e).__name__,
            }

            if not self.elide_errors:
                self._errors.append(error_record)
                self._confusion_matrix["overall"]["fn"] += 1

            if self.verbose:
                print(f"Error processing document {doc_id}: {str(e)}")

    def update_batch(
        self, batch_data: List[Tuple[StructuredModel, StructuredModel, Optional[str]]]
    ) -> None:
        """
        Process multiple document pairs efficiently in a batch.

        This method provides efficient batch processing by calling update()
        multiple times with optional garbage collection for memory management.

        Args:
            batch_data: List of tuples containing (gt_model, pred_model, doc_id)
        """
        batch_start = self._processed_count

        for gt_model, pred_model, doc_id in batch_data:
            self.update(gt_model, pred_model, doc_id)

        # Garbage collection for large batches
        if len(batch_data) >= 1000:
            gc.collect()

        if self.verbose:
            batch_size = self._processed_count - batch_start
            print(f"Processed batch of {batch_size} documents")

    def get_current_metrics(self) -> ProcessEvaluation:
        """
        Get current accumulated metrics without clearing state.

        This method allows monitoring evaluation progress by returning current
        metrics computed from accumulated state. Unlike compute(), this does
        not clear the internal state.

        Returns:
            ProcessEvaluation with current accumulated metrics
        """
        return self._build_process_evaluation()

    def compute(self) -> ProcessEvaluation:
        """
        Calculate final aggregated metrics from accumulated state.

        This method performs the final computation of all derived metrics from
        the accumulated confusion matrix state, similar to PyTorch Lightning's
        training_epoch_end pattern.

        Returns:
            ProcessEvaluation with final aggregated metrics
        """
        result = self._build_process_evaluation()

        if self.verbose:
            total_time = time.time() - self._start_time
            print(
                f"Final computation completed: {self._processed_count} documents in {total_time:.2f}s"
            )
            print(f"Overall accuracy: {result.metrics.get('cm_accuracy', 0.0):.3f}")

        return result

    def _accumulate_confusion_matrix(self, cm_result: Dict[str, Any]) -> None:
        """
        Accumulate confusion matrix results from a single document evaluation.

        This method handles the core accumulation logic, properly aggregating
        both overall metrics and field-level metrics while maintaining correct
        nested field paths.

        Args:
            cm_result: Confusion matrix result from compare_with method
        """
        # Accumulate overall metrics
        if "overall" in cm_result:
            for metric_name, value in cm_result["overall"].items():
                if isinstance(value, (int, float)) and metric_name in [
                    "tp",
                    "fp",
                    "tn",
                    "fn",
                    "fd",
                    "fa",
                ]:
                    self._confusion_matrix["overall"][metric_name] += value

        # Accumulate field-level metrics with proper path handling
        if "fields" in cm_result:
            self._accumulate_field_metrics(cm_result["fields"], "")

    def _accumulate_field_metrics(
        self, fields_dict: Dict[str, Any], path_prefix: str
    ) -> None:
        """
        Recursively accumulate field-level metrics with proper nested path construction.

        This method fixes the nested field aggregation bugs from the original implementation
        by properly handling different field structure formats and maintaining correct
        dotted notation paths for nested fields.

        Args:
            fields_dict: Dictionary containing field metrics to accumulate
            path_prefix: Current path prefix for building nested field paths
        """
        for field_name, field_data in fields_dict.items():
            current_path = f"{path_prefix}.{field_name}" if path_prefix else field_name

            if not isinstance(field_data, dict):
                continue

            # Handle field with direct confusion matrix metrics (simple leaf field)
            direct_metrics = {
                k: v
                for k, v in field_data.items()
                if k in ["tp", "fp", "tn", "fn", "fd", "fa"]
                and isinstance(v, (int, float))
            }
            if direct_metrics:
                self._accumulate_single_field_metrics(current_path, direct_metrics)

            # Handle hierarchical field structure (object fields with overall + fields)
            if "overall" in field_data:
                # Accumulate the overall metrics for this field
                self._accumulate_single_field_metrics(
                    current_path, field_data["overall"]
                )

            # Handle nested fields - check if there's a "fields" structure
            if "fields" in field_data and isinstance(field_data["fields"], dict):
                # For each nested field, create the proper dotted path
                for nested_field_name, nested_field_data in field_data[
                    "fields"
                ].items():
                    nested_path = f"{current_path}.{nested_field_name}"

                    if isinstance(nested_field_data, dict):
                        # If nested field has "overall", use those metrics
                        if "overall" in nested_field_data:
                            self._accumulate_single_field_metrics(
                                nested_path, nested_field_data["overall"]
                            )
                        else:
                            # Otherwise, look for direct metrics
                            nested_metrics = {
                                k: v
                                for k, v in nested_field_data.items()
                                if k in ["tp", "fp", "tn", "fn", "fd", "fa"]
                                and isinstance(v, (int, float))
                            }
                            if nested_metrics:
                                self._accumulate_single_field_metrics(
                                    nested_path, nested_metrics
                                )

                        # Continue recursion if there are more nested fields
                        if "fields" in nested_field_data:
                            self._accumulate_field_metrics(
                                nested_field_data["fields"], nested_path
                            )

            # Handle list field structure with nested_fields
            elif "nested_fields" in field_data:
                # Accumulate list-level metrics
                list_metrics = {
                    k: v
                    for k, v in field_data.items()
                    if k in ["tp", "fp", "tn", "fn", "fd", "fa"]
                    and isinstance(v, (int, float))
                }
                if list_metrics:
                    self._accumulate_single_field_metrics(current_path, list_metrics)

                # Accumulate nested field metrics from the list items
                for nested_field_name, nested_metrics in field_data[
                    "nested_fields"
                ].items():
                    nested_path = f"{current_path}.{nested_field_name}"
                    self._accumulate_single_field_metrics(nested_path, nested_metrics)

    def _accumulate_single_field_metrics(
        self, field_path: str, metrics: Dict[str, Union[int, float]]
    ) -> None:
        """
        Accumulate metrics for a single field path.

        Args:
            field_path: Dotted path to the field (e.g., 'transactions.date')
            metrics: Dictionary of confusion matrix metrics to accumulate
        """
        for metric_name, value in metrics.items():
            if metric_name in ["tp", "fp", "tn", "fn", "fd", "fa"] and isinstance(
                value, (int, float)
            ):
                self._confusion_matrix["fields"][field_path][metric_name] += value

    def _calculate_derived_metrics(
        self, cm_dict: Dict[str, Union[int, float]]
    ) -> Dict[str, float]:
        """
        Calculate derived confusion matrix metrics (precision, recall, f1, accuracy).

        This method replicates the derivation logic for confusion matrix metrics.

        Args:
            cm_dict: Dictionary with basic confusion matrix counts

        Returns:
            Dictionary with derived metrics
        """
        tp = cm_dict.get("tp", 0)
        fp = cm_dict.get("fp", 0)
        tn = cm_dict.get("tn", 0)
        fn = cm_dict.get("fn", 0)

        # Calculate derived metrics with safe division
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = (
            2 * (precision * recall) / (precision + recall)
            if (precision + recall) > 0
            else 0.0
        )
        accuracy = (tp + tn) / (tp + tn + fp + fn) if (tp + tn + fp + fn) > 0 else 0.0

        return {
            "cm_precision": precision,
            "cm_recall": recall,
            "cm_f1": f1,
            "cm_accuracy": accuracy,
        }

    def _build_process_evaluation(self) -> ProcessEvaluation:
        """
        Build ProcessEvaluation from current accumulated state.

        Returns:
            ProcessEvaluation with computed metrics from accumulated state
        """
        # Calculate derived metrics for overall results
        overall_cm = dict(self._confusion_matrix["overall"])
        overall_derived = self._calculate_derived_metrics(overall_cm)
        overall_metrics = {**overall_cm, **overall_derived}

        # Calculate derived metrics for each field
        field_metrics = {}
        for field_path, field_cm in self._confusion_matrix["fields"].items():
            field_cm_dict = dict(field_cm)
            field_derived = self._calculate_derived_metrics(field_cm_dict)
            field_metrics[field_path] = {**field_cm_dict, **field_derived}

        total_time = time.time() - self._start_time

        return ProcessEvaluation(
            document_count=self._processed_count,
            metrics=overall_metrics,
            field_metrics=field_metrics,
            errors=list(self._errors),  # Copy to avoid external modification
            total_time=total_time,
            non_matches=list(self._non_matches) if self.document_non_matches else None,
        )

    def save_metrics(self, filepath: str) -> None:
        """
        Save current accumulated metrics to a JSON file.

        Args:
            filepath: Path where metrics will be saved as JSON
        """
        process_eval = self._build_process_evaluation()

        # Build comprehensive metrics dictionary
        metrics_data = {
            "overall_metrics": process_eval.metrics,
            "field_metrics": process_eval.field_metrics,
            "evaluation_summary": {
                "total_documents_processed": self._processed_count,
                "total_evaluation_time": process_eval.total_time,
                "documents_per_second": self._processed_count / process_eval.total_time
                if process_eval.total_time > 0
                else 0,
                "error_count": len(process_eval.errors),
                "error_rate": len(process_eval.errors) / self._processed_count
                if self._processed_count > 0
                else 0,
                "target_schema": self._schema_name,
            },
            "errors": process_eval.errors,
            "metadata": {
                "saved_at": time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime()),
                "evaluator_config": {
                    "verbose": self.verbose,
                    "document_non_matches": self.document_non_matches,
                    "elide_errors": self.elide_errors,
                    "individual_results_jsonl": self.individual_results_jsonl,
                },
            },
        }

        # Ensure directory exists
        import os

        os.makedirs(os.path.dirname(os.path.abspath(filepath)), exist_ok=True)

        # Write to file
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(metrics_data, f, indent=2, default=str)

        if self.verbose:
            print(f"Metrics saved to: {filepath}")

    def pretty_print_metrics(self) -> None:
        """
        Pretty print current accumulated metrics in a format similar to StructuredModel.

        Displays overall metrics, field-level metrics, and evaluation summary
        in a human-readable format.
        """
        process_eval = self._build_process_evaluation()

        # Header
        print("\n" + "=" * 80)
        print(f"BULK EVALUATION RESULTS - {self._schema_name}")
        print("=" * 80)

        # Overall metrics
        overall_metrics = process_eval.metrics
        print("\nOVERALL METRICS:")
        print("-" * 40)
        print(f"Documents Processed: {self._processed_count:,}")
        print(f"Evaluation Time: {process_eval.total_time:.2f}s")
        print(
            f"Processing Rate: {self._processed_count / process_eval.total_time:.1f} docs/sec"
            if process_eval.total_time > 0
            else "Processing Rate: N/A"
        )

        # Confusion matrix
        print("\nCONFUSION MATRIX:")
        print(f"  True Positives (TP):    {overall_metrics.get('tp', 0):,}")
        print(f"  False Positives (FP):   {overall_metrics.get('fp', 0):,}")
        print(f"  True Negatives (TN):    {overall_metrics.get('tn', 0):,}")
        print(f"  False Negatives (FN):   {overall_metrics.get('fn', 0):,}")
        print(f"  False Discovery (FD):   {overall_metrics.get('fd', 0):,}")
        print(f"  False Alarm (FA):   {overall_metrics.get('fa', 0):,}")

        # Derived metrics
        print("\nDERIVED METRICS:")
        print(f"  Precision:     {overall_metrics.get('cm_precision', 0.0):.4f}")
        print(f"  Recall:        {overall_metrics.get('cm_recall', 0.0):.4f}")
        print(f"  F1 Score:      {overall_metrics.get('cm_f1', 0.0):.4f}")
        print(f"  Accuracy:      {overall_metrics.get('cm_accuracy', 0.0):.4f}")

        # Field-level metrics
        if process_eval.field_metrics:
            print("\nFIELD-LEVEL METRICS:")
            print("-" * 40)

            # Sort fields by F1 score descending for better readability
            sorted_fields = sorted(
                process_eval.field_metrics.items(),
                key=lambda x: x[1].get("cm_f1", 0.0),
                reverse=True,
            )

            for field_path, field_metrics in sorted_fields:
                tp = field_metrics.get("tp", 0)
                fp = field_metrics.get("fp", 0)
                fn = field_metrics.get("fn", 0)
                precision = field_metrics.get("cm_precision", 0.0)
                recall = field_metrics.get("cm_recall", 0.0)
                f1 = field_metrics.get("cm_f1", 0.0)

                # Only show fields with some activity
                if tp + fp + fn > 0:
                    print(
                        f"  {field_path:30} P: {precision:.3f} | R: {recall:.3f} | F1: {f1:.3f} | TP: {tp:,} | FP: {fp:,} | FN: {fn:,}"
                    )

        # Error summary
        if process_eval.errors:
            print("\nERROR SUMMARY:")
            print("-" * 40)
            print(f"Total Errors: {len(process_eval.errors):,}")
            print(
                f"Error Rate: {len(process_eval.errors) / self._processed_count * 100:.2f}%"
                if self._processed_count > 0
                else "Error Rate: N/A"
            )

            # Group errors by type
            error_types = {}
            for error in process_eval.errors:
                error_type = error.get("error_type", "Unknown")
                error_types[error_type] = error_types.get(error_type, 0) + 1

            if error_types:
                print("Error Types:")
                for error_type, count in sorted(
                    error_types.items(), key=lambda x: x[1], reverse=True
                ):
                    print(f"  {error_type}: {count:,}")

        # Configuration info
        print("\nCONFIGURATION:")
        print("-" * 40)
        print(f"Target Schema: {self._schema_name}")
        print(f"Document Non-matches: {'Yes' if self.document_non_matches else 'No'}")
        print(f"Elide Errors: {'Yes' if self.elide_errors else 'No'}")
        if self.individual_results_jsonl:
            print(f"Individual Results JSONL: {self.individual_results_jsonl}")

        print("=" * 80)

    def get_state(self) -> Dict[str, Any]:
        """
        Get serializable state for checkpointing and recovery.

        Returns a dictionary containing all internal state that can be serialized
        and later restored using load_state(). This enables checkpointing for
        long-running evaluation jobs.

        Returns:
            Dictionary containing serializable evaluator state
        """
        return {
            "confusion_matrix": {
                "overall": dict(self._confusion_matrix["overall"]),
                "fields": {
                    path: dict(metrics)
                    for path, metrics in self._confusion_matrix["fields"].items()
                },
            },
            "errors": list(self._errors),
            "processed_count": self._processed_count,
            "start_time": self._start_time,
            # Configuration
            "target_schema": self._schema_name,
            "elide_errors": self.elide_errors,
        }

    def load_state(self, state: Dict[str, Any]) -> None:
        """
        Restore evaluator state from serialized data.

        This method restores the internal state from data previously saved
        with get_state(), enabling recovery from checkpoints.

        Args:
            state: State dictionary from get_state()
        """
        # Validate state compatibility
        if state.get("target_schema") != self._schema_name:
            raise ValueError(
                f"State schema {state.get('target_schema')} doesn't match evaluator schema {self._schema_name}"
            )

        # Restore confusion matrix state
        cm_state = state["confusion_matrix"]
        self._confusion_matrix = {
            "overall": defaultdict(int, cm_state["overall"]),
            "fields": defaultdict(lambda: defaultdict(int)),
        }

        for field_path, field_metrics in cm_state["fields"].items():
            self._confusion_matrix["fields"][field_path] = defaultdict(
                int, field_metrics
            )

        # Restore other state
        self._errors = list(state["errors"])
        self._processed_count = state["processed_count"]
        self._start_time = state["start_time"]

        if self.verbose:
            print(f"Loaded state: {self._processed_count} documents processed")

    def merge_state(self, other_state: Dict[str, Any]) -> None:
        """
        Merge results from another evaluator instance.

        This method enables distributed processing by merging confusion matrix
        counts from multiple evaluator instances that processed different
        portions of a dataset.

        Args:
            other_state: State dictionary from another evaluator instance
        """
        # Validate compatibility
        if other_state.get("target_schema") != self._schema_name:
            raise ValueError(
                f"Cannot merge incompatible schemas: {other_state.get('target_schema')} vs {self._schema_name}"
            )

        # Merge overall metrics
        other_cm = other_state["confusion_matrix"]
        for metric, value in other_cm["overall"].items():
            self._confusion_matrix["overall"][metric] += value

        # Merge field-level metrics
        for field_path, field_metrics in other_cm["fields"].items():
            for metric, value in field_metrics.items():
                self._confusion_matrix["fields"][field_path][metric] += value

        # Merge errors and counts
        self._errors.extend(other_state["errors"])
        self._processed_count += other_state["processed_count"]

        if self.verbose:
            print(
                f"Merged state: now {self._processed_count} total documents processed"
            )

    # Legacy compatibility methods

    def evaluate_dataframe(self, df) -> ProcessEvaluation:
        """
        Legacy compatibility method for DataFrame-based evaluation.

        This method provides backward compatibility with the original DataFrame-based
        API while leveraging the new stateful processing internally.

        Args:
            df: DataFrame with columns for ground truth and predictions

        Returns:
            ProcessEvaluation with aggregated results
        """
        # Reset state for clean evaluation
        self.reset()

        # Process each row
        for idx, row in df.iterrows():
            doc_id = row.get("doc_id", f"row_{idx}")

            try:
                # Parse JSON data
                gt_data = json.loads(row["expected"])
                pred_data = json.loads(row["predicted"])

                # Create StructuredModel instances
                gt_model = self.target_schema(**gt_data)
                pred_model = self.target_schema(**pred_data)

                # Process using stateful update
                self.update(gt_model, pred_model, doc_id)

            except Exception as e:
                if self.verbose:
                    print(f"Error processing row {idx}: {e}")
                continue

        return self.compute()

__init__(target_schema=None, verbose=False, document_non_matches=True, elide_errors=False, individual_results_jsonl=None)

Initialize the stateful bulk evaluator.

Parameters:

target_schema (Optional[Type[StructuredModel]], default: None)
    Optional StructuredModel class for validation and processing. Required for update() and evaluate_dataframe(). Not required when using update_from_comparison_result() with pre-computed results.
verbose (bool, default: False)
    Whether to print detailed progress information.
document_non_matches (bool, default: True)
    Whether to document detailed non-match information.
elide_errors (bool, default: False)
    If True, skip documents with errors; if False, accumulate error metrics.
individual_results_jsonl (Optional[str], default: None)
    Optional path to a JSONL file for appending individual comparison results.
Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def __init__(
    self,
    target_schema: Optional[Type[StructuredModel]] = None,
    verbose: bool = False,
    document_non_matches: bool = True,
    elide_errors: bool = False,
    individual_results_jsonl: Optional[str] = None,
):
    """
    Initialize the stateful bulk evaluator.

    Args:
        target_schema: Optional StructuredModel class for validation and processing.
            Required for update() and evaluate_dataframe(). Not required when using
            update_from_comparison_result() with pre-computed results.
        verbose: Whether to print detailed progress information
        document_non_matches: Whether to document detailed non-match information
        elide_errors: If True, skip documents with errors; if False, accumulate error metrics
        individual_results_jsonl: Optional path to JSONL file for appending individual comparison results
    """
    self.target_schema = target_schema
    self.verbose = verbose
    self.document_non_matches = document_non_matches
    self.elide_errors = elide_errors
    self.individual_results_jsonl = individual_results_jsonl

    # Initialize state
    self.reset()

    self._schema_name = target_schema.__name__ if target_schema else "unknown"

    if self.verbose:
        print(f"Initialized BulkStructuredModelEvaluator for {self._schema_name}")
        if self.individual_results_jsonl:
            print(
                f"Individual results will be appended to: {self.individual_results_jsonl}"
            )
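
As a brief illustration of the configuration options, the following sketch enables per-document JSONL logging and keeps error metrics; the Invoice schema class and the output path are placeholders, not values defined by this module.

evaluator = BulkStructuredModelEvaluator(
    target_schema=Invoice,                          # hypothetical StructuredModel subclass
    document_non_matches=True,                      # collect detailed non-match records
    elide_errors=False,                             # accumulate error metrics instead of skipping
    individual_results_jsonl="runs/results.jsonl",  # append one raw comparison result per line
)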

compute()

Calculate final aggregated metrics from accumulated state.

This method performs the final computation of all derived metrics from the accumulated confusion matrix state, similar to PyTorch Lightning's training_epoch_end pattern.

Returns:

ProcessEvaluation
    ProcessEvaluation with final aggregated metrics.

Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def compute(self) -> ProcessEvaluation:
    """
    Calculate final aggregated metrics from accumulated state.

    This method performs the final computation of all derived metrics from
    the accumulated confusion matrix state, similar to PyTorch Lightning's
    training_epoch_end pattern.

    Returns:
        ProcessEvaluation with final aggregated metrics
    """
    result = self._build_process_evaluation()

    if self.verbose:
        total_time = time.time() - self._start_time
        print(
            f"Final computation completed: {self._processed_count} documents in {total_time:.2f}s"
        )
        print(f"Overall accuracy: {result.metrics.get('cm_accuracy', 0.0):.3f}")

    return result

evaluate_dataframe(df)

Legacy compatibility method for DataFrame-based evaluation.

This method provides backward compatibility with the original DataFrame-based API while leveraging the new stateful processing internally.

Parameters:

df (required)
    DataFrame with columns for ground truth and predictions.

Returns:

ProcessEvaluation
    ProcessEvaluation with aggregated results.

Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def evaluate_dataframe(self, df) -> ProcessEvaluation:
    """
    Legacy compatibility method for DataFrame-based evaluation.

    This method provides backward compatibility with the original DataFrame-based
    API while leveraging the new stateful processing internally.

    Args:
        df: DataFrame with columns for ground truth and predictions

    Returns:
        ProcessEvaluation with aggregated results
    """
    # Reset state for clean evaluation
    self.reset()

    # Process each row
    for idx, row in df.iterrows():
        doc_id = row.get("doc_id", f"row_{idx}")

        try:
            # Parse JSON data
            gt_data = json.loads(row["expected"])
            pred_data = json.loads(row["predicted"])

            # Create StructuredModel instances
            gt_model = self.target_schema(**gt_data)
            pred_model = self.target_schema(**pred_data)

            # Process using stateful update
            self.update(gt_model, pred_model, doc_id)

        except Exception as e:
            if self.verbose:
                print(f"Error processing row {idx}: {e}")
            continue

    return self.compute()
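
For instance, a pandas DataFrame whose "expected" and "predicted" columns hold JSON strings (the column names the method actually reads) could be evaluated as in this sketch; the payloads and the Invoice schema are illustrative only.

import pandas as pd

# "doc_id" is optional and falls back to the row index; "expected" and
# "predicted" are parsed with json.loads() before model construction.
df = pd.DataFrame(
    {
        "doc_id": ["doc_001"],
        "expected": ['{"total": 100.0}'],   # illustrative payload for a hypothetical schema
        "predicted": ['{"total": 100.0}'],
    }
)

evaluator = BulkStructuredModelEvaluator(target_schema=Invoice)  # hypothetical schema
evaluation = evaluator.evaluate_dataframe(df)
print(evaluation.document_count, evaluation.metrics.get("cm_accuracy"))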

get_current_metrics()

Get current accumulated metrics without clearing state.

This method allows monitoring evaluation progress by returning current metrics computed from accumulated state. Unlike compute(), this does not clear the internal state.

Returns:

ProcessEvaluation
    ProcessEvaluation with current accumulated metrics.

Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def get_current_metrics(self) -> ProcessEvaluation:
    """
    Get current accumulated metrics without clearing state.

    This method allows monitoring evaluation progress by returning current
    metrics computed from accumulated state. Unlike compute(), this does
    not clear the internal state.

    Returns:
        ProcessEvaluation with current accumulated metrics
    """
    return self._build_process_evaluation()
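
One way to use this for progress monitoring, as a sketch: peek at interim metrics every N documents without disturbing accumulation. The pairs iterable and the reporting interval are illustrative.

for i, (gt_model, pred_model) in enumerate(pairs):  # `pairs` is a hypothetical iterable
    evaluator.update(gt_model, pred_model)
    if (i + 1) % 500 == 0:
        interim = evaluator.get_current_metrics()    # internal state is left untouched
        print(f"{interim.document_count} docs, F1 so far: {interim.metrics.get('cm_f1', 0.0):.3f}")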

get_state()

Get serializable state for checkpointing and recovery.

Returns a dictionary containing all internal state that can be serialized and later restored using load_state(). This enables checkpointing for long-running evaluation jobs.

Returns:

Dict[str, Any]
    Dictionary containing serializable evaluator state.

Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def get_state(self) -> Dict[str, Any]:
    """
    Get serializable state for checkpointing and recovery.

    Returns a dictionary containing all internal state that can be serialized
    and later restored using load_state(). This enables checkpointing for
    long-running evaluation jobs.

    Returns:
        Dictionary containing serializable evaluator state
    """
    return {
        "confusion_matrix": {
            "overall": dict(self._confusion_matrix["overall"]),
            "fields": {
                path: dict(metrics)
                for path, metrics in self._confusion_matrix["fields"].items()
            },
        },
        "errors": list(self._errors),
        "processed_count": self._processed_count,
        "start_time": self._start_time,
        # Configuration
        "target_schema": self._schema_name,
        "elide_errors": self.elide_errors,
    }
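
A minimal checkpointing sketch, assuming the state is serialized as JSON (the counts, error records, and timestamps it contains are JSON-friendly); the checkpoint filename is a placeholder.

import json

# Periodically persist accumulated counts so a long run can be resumed later.
state = evaluator.get_state()
with open("evaluator_state.json", "w", encoding="utf-8") as f:
    json.dump(state, f)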

load_state(state)

Restore evaluator state from serialized data.

This method restores the internal state from data previously saved with get_state(), enabling recovery from checkpoints.

Parameters:

state (Dict[str, Any], required)
    State dictionary from get_state().
Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def load_state(self, state: Dict[str, Any]) -> None:
    """
    Restore evaluator state from serialized data.

    This method restores the internal state from data previously saved
    with get_state(), enabling recovery from checkpoints.

    Args:
        state: State dictionary from get_state()
    """
    # Validate state compatibility
    if state.get("target_schema") != self._schema_name:
        raise ValueError(
            f"State schema {state.get('target_schema')} doesn't match evaluator schema {self._schema_name}"
        )

    # Restore confusion matrix state
    cm_state = state["confusion_matrix"]
    self._confusion_matrix = {
        "overall": defaultdict(int, cm_state["overall"]),
        "fields": defaultdict(lambda: defaultdict(int)),
    }

    for field_path, field_metrics in cm_state["fields"].items():
        self._confusion_matrix["fields"][field_path] = defaultdict(
            int, field_metrics
        )

    # Restore other state
    self._errors = list(state["errors"])
    self._processed_count = state["processed_count"]
    self._start_time = state["start_time"]

    if self.verbose:
        print(f"Loaded state: {self._processed_count} documents processed")
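
Recovery from such a checkpoint could then look like this sketch; the fresh evaluator must be configured with the same target_schema, since load_state() validates the stored schema name. The checkpoint path and Invoice schema are placeholders.

import json

with open("evaluator_state.json", "r", encoding="utf-8") as f:
    state = json.load(f)

evaluator = BulkStructuredModelEvaluator(target_schema=Invoice)  # same hypothetical schema
evaluator.load_state(state)
# ...continue calling update() from where the previous run stopped...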

merge_state(other_state)

Merge results from another evaluator instance.

This method enables distributed processing by merging confusion matrix counts from multiple evaluator instances that processed different portions of a dataset.

Parameters:

other_state (Dict[str, Any], required)
    State dictionary from another evaluator instance.
Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def merge_state(self, other_state: Dict[str, Any]) -> None:
    """
    Merge results from another evaluator instance.

    This method enables distributed processing by merging confusion matrix
    counts from multiple evaluator instances that processed different
    portions of a dataset.

    Args:
        other_state: State dictionary from another evaluator instance
    """
    # Validate compatibility
    if other_state.get("target_schema") != self._schema_name:
        raise ValueError(
            f"Cannot merge incompatible schemas: {other_state.get('target_schema')} vs {self._schema_name}"
        )

    # Merge overall metrics
    other_cm = other_state["confusion_matrix"]
    for metric, value in other_cm["overall"].items():
        self._confusion_matrix["overall"][metric] += value

    # Merge field-level metrics
    for field_path, field_metrics in other_cm["fields"].items():
        for metric, value in field_metrics.items():
            self._confusion_matrix["fields"][field_path][metric] += value

    # Merge errors and counts
    self._errors.extend(other_state["errors"])
    self._processed_count += other_state["processed_count"]

    if self.verbose:
        print(
            f"Merged state: now {self._processed_count} total documents processed"
        )
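
A sketch of the map-reduce style this enables: shard the dataset, run one evaluator per shard, then merge states on a coordinator. The split_into_shards() helper and the Invoice schema are hypothetical.

# Each worker evaluates its own shard with an independent evaluator instance.
worker_states = []
for shard in split_into_shards(document_pairs, n_shards=4):   # hypothetical sharding helper
    worker = BulkStructuredModelEvaluator(target_schema=Invoice)
    for gt_model, pred_model, doc_id in shard:
        worker.update(gt_model, pred_model, doc_id)
    worker_states.append(worker.get_state())

# The coordinator merges the per-shard confusion-matrix counts and error lists.
combined = BulkStructuredModelEvaluator(target_schema=Invoice)
for state in worker_states:
    combined.merge_state(state)
final = combined.compute()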

pretty_print_metrics()

Pretty print current accumulated metrics in a format similar to StructuredModel.

Displays overall metrics, field-level metrics, and evaluation summary in a human-readable format.

Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def pretty_print_metrics(self) -> None:
    """
    Pretty print current accumulated metrics in a format similar to StructuredModel.

    Displays overall metrics, field-level metrics, and evaluation summary
    in a human-readable format.
    """
    process_eval = self._build_process_evaluation()

    # Header
    print("\n" + "=" * 80)
    print(f"BULK EVALUATION RESULTS - {self._schema_name}")
    print("=" * 80)

    # Overall metrics
    overall_metrics = process_eval.metrics
    print("\nOVERALL METRICS:")
    print("-" * 40)
    print(f"Documents Processed: {self._processed_count:,}")
    print(f"Evaluation Time: {process_eval.total_time:.2f}s")
    print(
        f"Processing Rate: {self._processed_count / process_eval.total_time:.1f} docs/sec"
        if process_eval.total_time > 0
        else "Processing Rate: N/A"
    )

    # Confusion matrix
    print("\nCONFUSION MATRIX:")
    print(f"  True Positives (TP):    {overall_metrics.get('tp', 0):,}")
    print(f"  False Positives (FP):   {overall_metrics.get('fp', 0):,}")
    print(f"  True Negatives (TN):    {overall_metrics.get('tn', 0):,}")
    print(f"  False Negatives (FN):   {overall_metrics.get('fn', 0):,}")
    print(f"  False Discovery (FD):   {overall_metrics.get('fd', 0):,}")
    print(f"  False Alarm (FA):   {overall_metrics.get('fa', 0):,}")

    # Derived metrics
    print("\nDERIVED METRICS:")
    print(f"  Precision:     {overall_metrics.get('cm_precision', 0.0):.4f}")
    print(f"  Recall:        {overall_metrics.get('cm_recall', 0.0):.4f}")
    print(f"  F1 Score:      {overall_metrics.get('cm_f1', 0.0):.4f}")
    print(f"  Accuracy:      {overall_metrics.get('cm_accuracy', 0.0):.4f}")

    # Field-level metrics
    if process_eval.field_metrics:
        print("\nFIELD-LEVEL METRICS:")
        print("-" * 40)

        # Sort fields by F1 score descending for better readability
        sorted_fields = sorted(
            process_eval.field_metrics.items(),
            key=lambda x: x[1].get("cm_f1", 0.0),
            reverse=True,
        )

        for field_path, field_metrics in sorted_fields:
            tp = field_metrics.get("tp", 0)
            fp = field_metrics.get("fp", 0)
            fn = field_metrics.get("fn", 0)
            precision = field_metrics.get("cm_precision", 0.0)
            recall = field_metrics.get("cm_recall", 0.0)
            f1 = field_metrics.get("cm_f1", 0.0)

            # Only show fields with some activity
            if tp + fp + fn > 0:
                print(
                    f"  {field_path:30} P: {precision:.3f} | R: {recall:.3f} | F1: {f1:.3f} | TP: {tp:,} | FP: {fp:,} | FN: {fn:,}"
                )

    # Error summary
    if process_eval.errors:
        print("\nERROR SUMMARY:")
        print("-" * 40)
        print(f"Total Errors: {len(process_eval.errors):,}")
        print(
            f"Error Rate: {len(process_eval.errors) / self._processed_count * 100:.2f}%"
            if self._processed_count > 0
            else "Error Rate: N/A"
        )

        # Group errors by type
        error_types = {}
        for error in process_eval.errors:
            error_type = error.get("error_type", "Unknown")
            error_types[error_type] = error_types.get(error_type, 0) + 1

        if error_types:
            print("Error Types:")
            for error_type, count in sorted(
                error_types.items(), key=lambda x: x[1], reverse=True
            ):
                print(f"  {error_type}: {count:,}")

    # Configuration info
    print("\nCONFIGURATION:")
    print("-" * 40)
    print(f"Target Schema: {self._schema_name}")
    print(f"Document Non-matches: {'Yes' if self.document_non_matches else 'No'}")
    print(f"Elide Errors: {'Yes' if self.elide_errors else 'No'}")
    if self.individual_results_jsonl:
        print(f"Individual Results JSONL: {self.individual_results_jsonl}")

    print("=" * 80)

reset()

Clear all accumulated state and start fresh evaluation.

This method resets all internal counters, metrics, and error tracking to initial state, enabling reuse of the same evaluator instance for multiple evaluation runs.

Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def reset(self) -> None:
    """
    Clear all accumulated state and start fresh evaluation.

    This method resets all internal counters, metrics, and error tracking
    to initial state, enabling reuse of the same evaluator instance for
    multiple evaluation runs.
    """
    # Accumulated confusion matrix state using nested defaultdicts
    self._confusion_matrix = {
        "overall": defaultdict(int),
        "fields": defaultdict(lambda: defaultdict(int)),
    }

    # Non-match tracking (when document_non_matches=True)
    self._non_matches = []

    # Error tracking
    self._errors = []

    # Processing statistics
    self._processed_count = 0
    self._start_time = time.time()

    if self.verbose:
        print("Reset evaluator state")

save_metrics(filepath)

Save current accumulated metrics to a JSON file.

Parameters:

filepath (str, required)
    Path where metrics will be saved as JSON.
Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def save_metrics(self, filepath: str) -> None:
    """
    Save current accumulated metrics to a JSON file.

    Args:
        filepath: Path where metrics will be saved as JSON
    """
    process_eval = self._build_process_evaluation()

    # Build comprehensive metrics dictionary
    metrics_data = {
        "overall_metrics": process_eval.metrics,
        "field_metrics": process_eval.field_metrics,
        "evaluation_summary": {
            "total_documents_processed": self._processed_count,
            "total_evaluation_time": process_eval.total_time,
            "documents_per_second": self._processed_count / process_eval.total_time
            if process_eval.total_time > 0
            else 0,
            "error_count": len(process_eval.errors),
            "error_rate": len(process_eval.errors) / self._processed_count
            if self._processed_count > 0
            else 0,
            "target_schema": self._schema_name,
        },
        "errors": process_eval.errors,
        "metadata": {
            "saved_at": time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime()),
            "evaluator_config": {
                "verbose": self.verbose,
                "document_non_matches": self.document_non_matches,
                "elide_errors": self.elide_errors,
                "individual_results_jsonl": self.individual_results_jsonl,
            },
        },
    }

    # Ensure directory exists
    import os

    os.makedirs(os.path.dirname(os.path.abspath(filepath)), exist_ok=True)

    # Write to file
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(metrics_data, f, indent=2, default=str)

    if self.verbose:
        print(f"Metrics saved to: {filepath}")
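
For example (the path is a placeholder), the accumulated metrics can be persisted at the end of a run; the method creates the parent directory if needed.

evaluator.save_metrics("reports/bulk_eval_metrics.json")  # writes overall, per-field, summary, error, and metadata sections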

update(gt_model, pred_model, doc_id=None)

Process a single document pair and accumulate the results in internal state.

Runs compare_with() on the model pair, optionally writes the raw result to JSONL, then delegates accumulation to update_from_comparison_result().

Parameters:

gt_model (StructuredModel, required)
    Ground truth StructuredModel instance.
pred_model (StructuredModel, required)
    Predicted StructuredModel instance.
doc_id (Optional[str], default: None)
    Optional document identifier for error tracking.
Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def update(
    self,
    gt_model: StructuredModel,
    pred_model: StructuredModel,
    doc_id: Optional[str] = None,
) -> None:
    """
    Process a single document pair and accumulate the results in internal state.

    Runs compare_with() on the model pair, optionally writes the raw result
    to JSONL, then delegates accumulation to update_from_comparison_result().

    Args:
        gt_model: Ground truth StructuredModel instance
        pred_model: Predicted StructuredModel instance
        doc_id: Optional document identifier for error tracking
    """
    if doc_id is None:
        doc_id = f"doc_{self._processed_count}"

    try:
        comparison_result = gt_model.compare_with(
            pred_model,
            include_confusion_matrix=True,
            document_non_matches=self.document_non_matches,
        )

        # JSONL append of raw comparison result before accumulation
        if self.individual_results_jsonl:
            record = {"doc_id": doc_id, "comparison_result": comparison_result}
            with open(self.individual_results_jsonl, "a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")

        self.update_from_comparison_result(comparison_result, doc_id)

    except Exception as e:
        error_record = {
            "doc_id": doc_id,
            "error": str(e),
            "error_type": type(e).__name__,
        }

        if not self.elide_errors:
            self._errors.append(error_record)
            self._confusion_matrix["overall"]["fn"] += 1

        if self.verbose:
            print(f"Error processing document {doc_id}: {str(e)}")

update_batch(batch_data)

Process multiple document pairs efficiently in a batch.

This method provides efficient batch processing by calling update() multiple times with optional garbage collection for memory management.

Parameters:

batch_data (List[Tuple[StructuredModel, StructuredModel, Optional[str]]], required)
    List of tuples containing (gt_model, pred_model, doc_id).
Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def update_batch(
    self, batch_data: List[Tuple[StructuredModel, StructuredModel, Optional[str]]]
) -> None:
    """
    Process multiple document pairs efficiently in a batch.

    This method provides efficient batch processing by calling update()
    multiple times with optional garbage collection for memory management.

    Args:
        batch_data: List of tuples containing (gt_model, pred_model, doc_id)
    """
    batch_start = self._processed_count

    for gt_model, pred_model, doc_id in batch_data:
        self.update(gt_model, pred_model, doc_id)

    # Garbage collection for large batches
    if len(batch_data) >= 1000:
        gc.collect()

    if self.verbose:
        batch_size = self._processed_count - batch_start
        print(f"Processed batch of {batch_size} documents")
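
A brief sketch of batch usage; the model instances and document IDs below are illustrative placeholders.

batch = [
    (gt_a, pred_a, "doc_a"),   # illustrative StructuredModel instances
    (gt_b, pred_b, "doc_b"),
    (gt_c, pred_c, None),      # doc_id may be None; update() assigns doc_<n>
]
evaluator.update_batch(batch)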

update_from_comparison_result(comparison_result, doc_id=None)

Accumulate a pre-computed compare_with() result into internal state.

Unlike update(), this method does not require StructuredModel instances or re-run comparisons. It accepts the raw dictionary output of StructuredModel.compare_with(include_confusion_matrix=True) and accumulates its confusion matrix.

Parameters:

comparison_result (Dict[str, Any], required)
    Dictionary returned by StructuredModel.compare_with() with include_confusion_matrix=True. Must contain a "confusion_matrix" key.
doc_id (Optional[str], default: None)
    Optional document identifier for error tracking.
Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def update_from_comparison_result(
    self,
    comparison_result: Dict[str, Any],
    doc_id: Optional[str] = None,
) -> None:
    """
    Accumulate a pre-computed compare_with() result into internal state.

    Unlike update(), this method does not require StructuredModel instances
    or re-run comparisons. It accepts the raw dictionary output of
    StructuredModel.compare_with(include_confusion_matrix=True) and
    accumulates its confusion matrix.

    Args:
        comparison_result: Dictionary returned by StructuredModel.compare_with()
            with include_confusion_matrix=True. Must contain a "confusion_matrix" key.
        doc_id: Optional document identifier for error tracking
    """
    if doc_id is None:
        doc_id = f"doc_{self._processed_count}"

    try:
        if "confusion_matrix" not in comparison_result:
            raise ValueError(
                "comparison_result must contain a 'confusion_matrix' key. "
                "Ensure compare_with() was called with include_confusion_matrix=True."
            )

        # Collect non-matches if enabled and present
        if self.document_non_matches and "non_matches" in comparison_result:
            for non_match in comparison_result["non_matches"]:
                non_match_with_doc = non_match.copy()
                non_match_with_doc["doc_id"] = doc_id
                self._non_matches.append(non_match_with_doc)

        # Accumulate the confusion matrix
        self._accumulate_confusion_matrix(comparison_result["confusion_matrix"])

        self._processed_count += 1

        if self.verbose and self._processed_count % 1000 == 0:
            elapsed = time.time() - self._start_time
            print(f"Processed {self._processed_count} documents ({elapsed:.2f}s)")

    except Exception as e:
        error_record = {
            "doc_id": doc_id,
            "error": str(e),
            "error_type": type(e).__name__,
        }

        if not self.elide_errors:
            self._errors.append(error_record)
            self._confusion_matrix["overall"]["fn"] += 1

        if self.verbose:
            print(f"Error processing document {doc_id}: {str(e)}")
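
This is how previously stored compare_with() outputs might be replayed without the original models. The record layout ({"doc_id": ..., "comparison_result": ...}) mirrors what update() appends when individual_results_jsonl is set; the input path is a placeholder.

import json

evaluator = BulkStructuredModelEvaluator()  # no target_schema needed for pre-computed results

with open("runs/results.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        evaluator.update_from_comparison_result(
            record["comparison_result"], doc_id=record.get("doc_id")
        )

print(evaluator.compute().metrics)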

stickler.structured_object_evaluator.bulk_structured_model_evaluator.aggregate_from_comparisons(comparison_results)

Aggregate a list of pre-computed compare_with() results into field-level metrics.

This is a convenience function for aggregating stored comparison results without needing the original StructuredModel instances. It accepts the raw dictionary outputs of StructuredModel.compare_with(include_confusion_matrix=True).

Parameters:

comparison_results (List[Dict[str, Any]], required)
    List of dictionaries, each returned by StructuredModel.compare_with(include_confusion_matrix=True).

Returns:

ProcessEvaluation
    ProcessEvaluation with aggregated metrics including overall and per-field precision, recall, F1, and accuracy.

Source code in stickler/structured_object_evaluator/bulk_structured_model_evaluator.py
def aggregate_from_comparisons(
    comparison_results: List[Dict[str, Any]],
) -> ProcessEvaluation:
    """
    Aggregate a list of pre-computed compare_with() results into field-level metrics.

    This is a convenience function for aggregating stored comparison results
    without needing the original StructuredModel instances. It accepts the raw
    dictionary outputs of StructuredModel.compare_with(include_confusion_matrix=True).

    Args:
        comparison_results: List of dictionaries, each returned by
            StructuredModel.compare_with(include_confusion_matrix=True).

    Returns:
        ProcessEvaluation with aggregated metrics including overall and
        per-field precision, recall, F1, and accuracy.
    """
    evaluator = BulkStructuredModelEvaluator()
    for result in comparison_results:
        evaluator.update_from_comparison_result(result)
    return evaluator.compute()
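
As a closing sketch, aggregating an in-memory list of stored results; the comparison_results list itself is illustrative and assumed to come from earlier compare_with() calls.

from stickler.structured_object_evaluator.bulk_structured_model_evaluator import (
    aggregate_from_comparisons,
)

# comparison_results: list of dicts previously produced by
# StructuredModel.compare_with(include_confusion_matrix=True)
evaluation = aggregate_from_comparisons(comparison_results)
print(evaluation.metrics.get("cm_precision"), evaluation.metrics.get("cm_recall"))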