Data-Driven Binary Parser

This document describes the data-driven binary parser infrastructure in src/parser/. The parser uses declarative YAML-based structure definitions inspired by Kaitai Struct to parse and write binary data without hardcoding format details.

Relationship to Kaitai Struct

Our definition format is inspired by Kaitai Struct’s .ksy YAML format but is not fully compatible. We implement a subset of Kaitai features tailored for NITF parsing, with some extensions (like NITF-specific character encodings) and some omissions (like instances and params).

Key differences from Kaitai Struct:

Aspect

Kaitai Struct

Our Implementation

Execution model

Compiles to target language code

Runtime interpretation

Expression syntax

Full Kaitai expression language

Subset (see below)

instances

Supported

Not implemented

params

Supported

Not implemented

Array indexing

arr[0] bracket syntax

arr_0 indexed paths (internal); Value::Array (public API)

NITF encodings

Not built-in

BCS-A, BCS-N, BCS-NPI, ECS-A support

Writing support

Limited

Full bidirectional read/write

Our definition files use the .ksy extension for familiarity but should be considered a Kaitai-inspired format rather than true Kaitai Struct files. They may not work with the official Kaitai Struct compiler.

Architecture Overview

The parser is organized around six key types that separate concerns between schema definition, reading, writing, expression evaluation, and definition management.

        classDiagram
    class StructureRegistry {
        -search_paths: Vec~PathBuf~
        -file_cache: HashMap~String, Arc~StructureDefinition~~
        -runtime_defs: HashMap~String, Arc~StructureDefinition~~
        +add_search_path(path)
        +get(name) Option~Arc~StructureDefinition~~
        +register(name, def)
        +unregister(name)
        +list() Vec~String~
        +reload()
    }

    class DefinitionLoader {
        +load_file(path)$ StructureDefinition
        +load_str(yaml)$ StructureDefinition
        +load_reader(reader)$ StructureDefinition
        +validate_type_references(def)$
    }

    class StructureDefinition {
        +id: String
        +title: Option~String~
        +endian: Endian
        +fields: Vec~FieldDefinition~
        +types: HashMap~String, StructureDefinition~
        +enums: HashMap~String, EnumDefinition~
    }

    class FieldDefinition {
        +id: String
        +field_type: FieldType
        +size: SizeSpec
        +encoding: Option~Encoding~
        +condition: Option~Expression~
        +repeat: Option~RepeatSpec~
        +doc: Option~String~
    }

    class StructureAccessor {
        -definition: Arc~StructureDefinition~
        -data: &[u8]
        -repeat_offsets: RefCell~HashMap~
        -parsed: RefCell~bool~
        +new(definition, data) Result
        +get(path) Result~Value~
        +has(path) bool
        +fields() Iterator~String~
        +raw_slice(path) Result~&[u8]~
        +field_info(path) Result~FieldInfo~
    }

    class StructureWriter {
        -definition: Arc~StructureDefinition~
        -buffer: Vec~u8~
        -position: usize
        -written: HashSet~String~
        -next_field_index: usize
        -current_repeat_written: usize
        +new(definition) Self
        +set(path, value) Result
        +finish() Result~Vec~u8~~
        +write_to(writer) Result~usize~
    }

    class ExpressionEvaluator {
        +parse(expr)$ Result~Expression~
        +evaluate(expr, context) Result~EvalResult~
    }

    class EvalContext {
        +fields: HashMap~String, EvalResult~
        +index: Option~usize~
        +with_field(path, value) Self
        +with_index(index) Self
    }

    class Value {
        <<enum>>
        String
        Bytes
        Unsigned(u64)
        Struct(StructValue)
        Array(Vec~Value~)
    }

    StructureRegistry --> DefinitionLoader : uses to load
    StructureRegistry o-- StructureDefinition : caches
    DefinitionLoader ..> StructureDefinition : produces
    StructureDefinition *-- FieldDefinition : contains
    StructureDefinition *-- StructureDefinition : nested types
    StructureAccessor --> StructureDefinition : reads against
    StructureAccessor --> ExpressionEvaluator : evaluates conditions/sizes
    StructureAccessor ..> EvalContext : builds for evaluation
    StructureAccessor ..> Value : returns
    StructureWriter --> StructureDefinition : writes against
    StructureWriter --> ExpressionEvaluator : evaluates expressions
    StructureWriter ..> EvalContext : builds for evaluation
    FieldDefinition --> Expression : condition/repeat/size expressions
    ExpressionEvaluator --> EvalContext : reads field values from
    

Key Types

  • StructureRegistry is the top-level manager that loads and caches StructureDefinitions from KSY (Kaitai Struct YAML) files on disk, delegating parsing to DefinitionLoader. It supports multiple search paths with last-wins priority, runtime registration of definitions, and lazy loading with caching.

  • DefinitionLoader deserializes KSY YAML into intermediate Raw* structs via serde_yaml, then converts them to the final StructureDefinition type. All methods are static — the loader is stateless.

  • StructureDefinition / FieldDefinition form the schema layer — a declarative description of a binary format’s fields, types, sizes, conditions, and repeats. Definitions can contain nested type definitions and enum mappings.

  • StructureAccessor is the read path. Given a definition and a byte slice, it lazily parses field offsets via a single O(n) pass on first access, caching repeat element offsets for efficient indexed access. All subsequent field accesses are O(1) lookups. It provides a map-like get(path) interface returning Value instances, with repeated fields returned as Value::Array. It uses ExpressionEvaluator to resolve dynamic sizes and conditional fields.

  • StructureWriter is the write path. It uses streaming mode where fields must be written in definition order. It accepts field values via set(path, value) and serializes them into bytes according to the definition. For repeated fields, callers can pass a WriteValue::Array with all elements at once, or write elements sequentially with indexed paths.

  • ExpressionEvaluator + EvalContext handle the expression language (arithmetic, comparisons, bitwise operations, field references) used in conditional fields, computed sizes, and repeat counts. Both the accessor and writer build an EvalContext from already-parsed fields to evaluate expressions.

  • Value is the parsed field value enum with conversion methods that handle NITF’s ASCII-numeric conventions (e.g., parsing "003" as integer 3 via .as_i64()).

Key Components

1. Structure Definitions

Structure definitions are loaded from YAML files using serde_yaml. The DefinitionLoader deserializes into intermediate Raw* structs, then converts to the final StructureDefinition type.

// Loading a definition
let def = DefinitionLoader::load_file(Path::new("nitf_file_header.ksy"))?;
let def = DefinitionLoader::load_str(yaml_string)?;

Supported KSY Features

Feature

Status

Notes

meta.id, meta.title, meta.endian

Required id, optional title

seq fields

Sequential field definitions

types (nested)

Recursive type definitions

enums

Integer-to-name mappings

Field types: u1-u8, s1-s8, str

Integer and string types

Endian-specific types: u2be, u4le

Parsed but endian from meta

size (fixed and expression)

Fixed integer or expression

encoding

ASCII, BCS-A, BCS-N, BCS-NPI, ECS-A

pad-right

Padding character

if (conditional)

Expression-based conditions

repeat: expr

Expression-based count

repeat: until

Condition-based termination

repeat: eos

Read until end of stream

doc

Documentation strings

instances

Not implemented

params

Not implemented

Bit fields (b1, b4)

Partial

Parsed as bytes

2. StructureAccessor (Reading)

The StructureAccessor provides lazy, map-like access to binary data. On first field access, a single O(n) pass walks all fields and caches repeat element offsets. All subsequent accesses are O(1) lookups.

let accessor = StructureAccessor::new(Arc::new(def), &data)?;

// Access fields by path
let version = accessor.get("fver")?.as_str()?;
let num_images = accessor.get("numi")?.as_i64()?;

// Repeated fields return Value::Array
let all_info = accessor.get("image_info")?;  // Returns Value::Array
if let Value::Array(entries) = all_info {
    for entry in &entries {
        // Each entry is a Value::Struct with nested fields
    }
}

// Indexed access for individual repeated elements
let first_len = accessor.get("image_info_0.li")?.as_str()?;

// Check field existence (including conditional evaluation)
if accessor.has("optional_field") { ... }

// Zero-copy raw slice access
let raw_bytes: &[u8] = accessor.raw_slice("data_field")?;

// Iterate all accessible fields (skips false conditionals)
for field_path in accessor.fields() { ... }

Offset Calculation

Field offsets are computed in a single pass on first access:

  1. Fixed-size fields: offset computed from preceding field sizes

  2. Variable-size fields: preceding fields parsed to determine sizes

  3. Conditional fields: condition evaluated to determine presence (skipped if false)

  4. Repeated fields: repeat count evaluated, each element’s offset cached internally

The single-pass result is cached in a repeat offset map for efficient indexed access. The fields() iterator returns base field names only — repeated fields appear once (not once per element).

3. StructureWriter (Writing)

The StructureWriter uses streaming mode where fields must be written in definition order:

// Create a writer
let mut writer = StructureWriter::new(Arc::new(def));
writer.set("field_a", "value")?;  // Must be in definition order
writer.set("field_b", 42i64)?;
let bytes = writer.finish()?;

// Write directly to an io::Write target
let bytes_written = writer.write_to(&mut output_file)?;

For repeated fields, pass a WriteValue::Array with all elements:

writer.set("items", vec!["AAA", "BBB", "CCC"])?;

The streaming writer supports expression-based sizes and repeat counts by evaluating them against previously written values. Fields must be written in the order they appear in the definition.

4. Expression Evaluator

The expression system parses and evaluates Kaitai Struct-style expressions using a hand-written recursive descent parser:

let expr = ExpressionEvaluator::parse("numi.to_i * 16 + 388")?;
let evaluator = ExpressionEvaluator::new();
let result = evaluator.evaluate(&expr, &context)?;

Supported Expression Syntax

Category

Syntax

Example

Literals

integers, floats, strings, booleans

42, 3.14, "text", true

Hex literals

0x prefix

0xFF, 0x1A

Field references

dot-notation paths

header.version, items_0.value

Arithmetic

+, -, *, /, %

width * height

Comparison

==, !=, <, >, <=, >=

version >= 2

Logical

and, or, not

a > 0 and b < 10

Bitwise

&, |, ^, ~, <<, >>

flags & 0xFF, mask | 0x01

Methods

.to_i, .to_s, .length

numi.to_i, data.length

Special vars

_index

Current repeat index

Parentheses

(expr)

(a + b) * c

Unary

-, not, ~

-offset, ~mask

Operator Precedence (lowest to highest)

  1. or

  2. and

  3. ==, !=, <, >, <=, >=

  4. | (bitwise OR)

  5. ^ (bitwise XOR)

  6. & (bitwise AND)

  7. +, -

  8. *, /, %

  9. Unary -, not, ~

  10. Postfix .method, .field

Not Supported

  • _root, _parent, _io (parsed but not evaluated)

  • Ternary operator: a ? b : c

Why a Custom Expression Evaluator?

We use a hand-written expression evaluator rather than a third-party library for several reasons:

  1. Kaitai-specific syntax: The .to_i, .to_s, and .length method call syntax is specific to Kaitai Struct. Generic expression libraries like meval don’t support method calls, and adapting them would require significant preprocessing.

  2. License constraints: This project requires Apache-2.0, MIT, BSD, ISC, Zlib, or public domain licenses. Some expression evaluation crates (e.g., evalexpr) use AGPL licensing which is incompatible.

  3. NITF-specific needs: NITF fields are often ASCII strings representing numbers (e.g., "003" for the count 3). The .to_i method handles this conversion naturally. Generic evaluators would need custom type coercion.

  4. Minimal dependencies: The custom evaluator adds no external dependencies and is well-tested code organized into lexer, parser, evaluator, and operator modules.

5. Structure Registry

The registry manages loading and caching of structure definitions:

let mut registry = StructureRegistry::new();
registry.add_search_path("data/structures/tre");

// Get definition (loads from search paths and caches)
let def = registry.get("TRE_GEOLOB");

// Mutable access (loads and caches if not already present)
let def = registry.get_mut("TRE_GEOLOB");

// Runtime registration (takes priority over file-loaded definitions)
registry.register("CUSTOM", custom_def);

// List all available definitions (file + runtime)
for name in registry.list() { ... }

// Reload file-based definitions (preserves runtime registrations)
registry.reload()?;

Naming Convention

Prefix

Example

File Path

TRE_

TRE_GEOLOB

tre/tre_geolob.ksy

DES_

DES_TRE_OVERFLOW

des/des_tre_overflow.ksy

NITF_

NITF_02.10_FileHeader

nitf/nitf_02.10_file_header.ksy

NSIF_

NSIF_01.00_FileHeader

nsif/nsif_01.00_file_header.ksy

6. NITF Character Set Encoding

The encoding module validates NITF-specific character sets:

Encoding

Description

Valid Bytes

BCS-A

Basic Character Set - Alphanumeric

0x200x7E

BCS-N

Basic Character Set - Numeric

0x20, 0x2B0x2F, 0x300x39

BCS-NPI

Basic Character Set - Numeric Positive Integer

0x20, 0x300x39

ECS-A

Extended Character Set - Alphanumeric

0x20+

Each encoding provides validate() for bulk validation and is_valid_byte() for per-byte checks. The validate_*_detailed() functions return a ValidationResult with the index and byte of the first invalid character.

Value Type

The Value enum represents parsed field values with conversion methods:

pub enum Value<'a> {
    String(Cow<'a, str>),      // String fields
    Bytes(&'a [u8]),           // Raw byte fields
    Unsigned(u64),             // Unsigned integers
    Struct(StructValue<'a>),   // Nested structures
    Array(Vec<Value<'a>>),     // Repeated fields
}

// Conversions (handle NITF's ASCII-numeric fields)
value.as_str()?      // Trims padding
value.as_i64()?      // Parses numeric strings or casts unsigned
value.as_u64()?      // Parses unsigned strings or casts unsigned
value.as_f64()?      // Parses float strings or casts unsigned
value.as_bytes()     // Raw bytes (always succeeds)

Error Types

LoadError       // Definition loading errors (YAML, missing fields, invalid types, undefined type refs)
AccessError     // Reading errors (unknown field, EOF, conditional not present, encoding, expression)
WriteError      // Writing errors (out of order, too large, missing required, validation, conversion)
ConversionError // Value conversion errors (type mismatch, parse failure)
ExpressionError // Expression errors (syntax, unknown field, type error, division by zero)

Dependencies

Crate

Purpose

serde_yaml

YAML deserialization for KSY files

serde

Derive macros for deserialization

thiserror

Error type derivation

Repeated Field Access

Repeated fields are accessed primarily through the base field name, which returns a Value::Array. Individual elements can also be accessed via {field_id}_{index} indexed paths (zero-based) for convenience.

seq:
  - id: num_segments
    type: u2
  - id: segment_info
    type: segment_entry
    repeat: expr
    repeat-expr: num_segments

If num_segments is 3:

  • segment_info — Entire array as Value::Array (preferred)

  • segment_info_0 — First entry (indexed access)

  • segment_info_0.offset — Nested field access on first entry

  • segment_info_1 — Second entry

  • segment_info_2 — Third entry

For writing, pass arrays directly:

writer.set("items", vec!["AAA", "BBB", "CCC"])?;

Example: NITF File Header

meta:
  id: nitf_file_header
  title: NITF 2.1 File Header
  endian: be

seq:
  - id: fhdr
    type: str
    size: 4
    encoding: BCS-A
    doc: File profile name (NITF or NSIF)
    
  - id: fver
    type: str
    size: 5
    encoding: BCS-A
    doc: File version (02.10)
    
  - id: numi
    type: str
    size: 3
    encoding: BCS-N
    doc: Number of image segments
    
  - id: image_info
    type: image_segment_info
    repeat: expr
    repeat-expr: numi.to_i

types:
  image_segment_info:
    seq:
      - id: lish
        type: str
        size: 6
        encoding: BCS-N
      - id: li
        type: str
        size: 10
        encoding: BCS-N
let accessor = StructureAccessor::new(def, &data)?;
let version = accessor.get("fver")?.as_str()?;  // "02.10"
let num_images = accessor.get("numi")?.as_i64()?;  // 2

// Access the entire array
let image_info = accessor.get("image_info")?;  // Value::Array

// Or access individual elements by index
for i in 0..num_images {
    let path = format!("image_info_{}.li", i);
    let image_len = accessor.get(&path)?.as_i64()?;
}

Comparison with Kaitai Struct

Feature

Kaitai Struct

Implementation Status

instances

Supported

Not implemented

params

Supported

Not implemented

_root, _parent, _io

Supported

Parsed but not evaluated

Ternary operator

Supported

Not supported

Bitwise operators

Supported

Supported (&, |, ^, ~, <<, >>)

Array indexing arr[0]

Supported

Value::Array for whole array; arr_0 indexed paths for elements

Method calls

.to_i, .to_s, .length, etc.

.to_i, .to_s, .length

Hex literals

0xFF

Supported