# Metadata

## The Simple Path

For a quick look at image properties without reading any pixels, `iminfo` gives
you dimensions, band count, pixel type, and block layout in one call:

```python
from aws.osml.io import iminfo

info = iminfo("image.ntf")
print(f"{info.width}x{info.height}, {info.bands} bands, {info.dtype}")
print(f"Block size: {info.block_size}")
print(f"Resolution levels: {info.num_resolution_levels}")
```

`iminfo` also includes the full format-specific metadata dictionary for the
image segment, so you can inspect compression, TREs, TIFF tags, and other
fields without dropping down to the low-level API:

```python
# NITF: subheader fields and TREs
info = iminfo("image.ntf")
print(info.metadata["IC"])          # "C8" (JPEG 2000)
print(info.metadata["IGEOLO"])      # 60-char geographic location string

# TREs are nested dicts
if "GEOLOB" in info.metadata:
    print(info.metadata["GEOLOB"]["ARV"])

# GeoTIFF: IFD tags keyed by numeric tag ID
info = iminfo("image.tif")
print(info.metadata["259"])         # Compression tag value
print(info.metadata["33550"])       # ModelPixelScale
```

The `metadata` dict is a snapshot — a plain Python dictionary captured when
`iminfo` is called. It is not a live reference to the file.

When you need more control — prefix filtering, dataset-level metadata, or
write-side metadata — the full `MetadataProvider` interface described below
gives you access to everything in the file.

## The MetadataProvider Interface

All assets and datasets expose metadata through the `MetadataProvider` interface,
regardless of the underlying file format. `MetadataProvider` implements the standard
Python `collections.abc.Mapping` protocol, so you access metadata fields the same
way you access items in a dictionary:

```python
from aws.osml.io import IO

with IO.open(["image.ntf"], "r") as dataset:
    # Dataset-level metadata — dict-style access
    dataset.metadata["FTITLE"]              # KeyError if missing
    dataset.metadata.get("FTITLE")          # None if missing
    "FTITLE" in dataset.metadata            # membership test
    len(dataset.metadata)                   # number of fields

    # Bulk export
    all_meta = dataset.metadata.entries()   # full dict (single call, fast path)
    filtered = dataset.metadata.entries("FS")  # keys starting with "FS"

    # Asset-level metadata
    image = dataset.get_asset("image:0")
    image_meta = image.metadata.entries()
```

When writing, use `BufferedMetadataProvider` to build metadata. It implements
`collections.abc.MutableMapping`, so you write fields with dictionary syntax:

```python
from aws.osml.io import BufferedMetadataProvider

meta = BufferedMetadataProvider()
meta["FTITLE"] = "My File Title"
```

The dictionary keys and value types are format-specific. A NITF file uses
field names like `FTITLE` and `ISCLAS`; a GeoTIFF uses numeric tag IDs like
`"256"` and `"33550"`. There is no translation layer — you work directly with
the native field names from whatever format you opened.

The rest of this page covers each format's metadata conventions in detail.

---

## NITF / NSIF Metadata

NITF files carry metadata in fixed-width ASCII header fields, security
classification blocks, and Tagged Record Extensions (TREs). The library
exposes all of these through the Mapping interface using the standard NITF field names.

### Reading NITF Metadata

#### Header and Subheader Fields

Most NITF fields are ASCII strings — even numeric values like row counts and
compression ratios. A few TREs use binary integer fields, which come through
as Python `int` directly.

```python
from aws.osml.io import IO

with IO.open(["image.ntf"], "r") as dataset:
    # File header fields — dict-style access
    title = dataset.metadata["FTITLE"]
    classification = dataset.metadata["FSCLAS"]       # one of "T", "S", "C", "R", "U"

    # Image subheader fields
    image = dataset.get_asset("image:0")
    meta = image.metadata

    image_id = meta["IID1"]              # "IMG_00001"
    compression = meta["IC"]             # "C8"
    date_time = meta["IDATIM"]           # "20231215103045"

    # Numeric fields are ASCII strings — cast as needed
    num_rows = int(meta["NROWS"])         # 2048
    num_cols = int(meta["NCOLS"])         # 2048
    num_bands = int(meta["NBANDS"])       # 3

    # Coordinate strings — NITF packs 4 corners into a single field
    if "IGEOLO" in meta:
        geo = meta["IGEOLO"]              # 60-char geographic location string

    # Safe access for conditional fields
    comrat = meta.get("COMRAT")           # None if IC is "NC" or "NM"
```
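
When `ICORDS` is `"G"`, the 60-character `IGEOLO` value holds four corners, each encoded as `ddmmssX` latitude followed by `dddmmssY` longitude. The following is a minimal sketch of converting that string to decimal degrees. `parse_igeolo_geographic` is a hypothetical helper, not part of the library, and it does not handle the other `ICORDS` representations.

```python
def parse_igeolo_geographic(igeolo: str) -> list[tuple[float, float]]:
    """Split an ICORDS="G" IGEOLO string into four (lat, lon) pairs in decimal degrees."""
    if len(igeolo) != 60:
        raise ValueError(f"expected 60 characters, got {len(igeolo)}")
    corners = []
    for start in range(0, 60, 15):
        corner = igeolo[start:start + 15]
        lat_text, lon_text = corner[:7], corner[7:]   # ddmmssX, dddmmssY
        lat = int(lat_text[0:2]) + int(lat_text[2:4]) / 60 + int(lat_text[4:6]) / 3600
        lon = int(lon_text[0:3]) + int(lon_text[3:5]) / 60 + int(lon_text[5:7]) / 3600
        if lat_text[6] == "S":
            lat = -lat
        if lon_text[7] == "W":
            lon = -lon
        corners.append((lat, lon))
    return corners


# Made-up sample value, for illustration only
sample = "383000N0770000W383000N0763000W380000N0763000W380000N0770000W"
print(parse_igeolo_geographic(sample)[0])   # (38.5, -77.0)
```

The corners come back in the order they appear in the field, which starts at the first pixel and proceeds clockwise.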

#### TRE Fields as Nested Dictionaries

TRE (Tagged Record Extension) fields are grouped under their CETAG as nested
dictionaries. Each TRE with a known definition in the `StructureRegistry`
appears as a top-level key mapped to a dict of its fields:

```python
# Access TRE fields through nested dictionaries
geolob = meta["GEOLOB"]              # dict
arv = geolob["ARV"]                  # "000360000"
brv = geolob["BRV"]                  # "000360000"

# Or access in one step
arv = meta["GEOLOB"]["ARV"]

# TREs with repeated fields contain arrays
j2klra = meta["J2KLRA"]              # dict
layers = j2klra["LAYERS"]            # list of dicts
first_layer = layers[0]              # {"LAYER_ID": "000", "BITRATE": "0.031250"}
```

Unknown TREs (those without a definition in the registry) appear with their
raw data preserved:

```python
# Unknown TRE — raw hex data and byte length
unknown = meta["UNKNWN"]             # {"_raw": "0102030405", "_length": 5}
raw_hex = unknown["_raw"]
byte_count = unknown["_length"]
```
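
If you need the original bytes of an unknown TRE, the hex string can be decoded with the standard library. This sketch works on a sample dictionary shaped like the entry above, so it runs without a file:

```python
# Sample entry shaped like the unknown-TRE dictionary shown above
unknown = {"_raw": "0102030405", "_length": 5}

payload = bytes.fromhex(unknown["_raw"])

# The decoded payload should agree with the recorded byte length
if len(payload) != unknown["_length"]:
    raise ValueError("hex payload does not match _length")

print(payload)   # b'\x01\x02\x03\x04\x05'
```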

Overflow TREs stored in data extension segments are resolved automatically —
you don't need to chase them across segments.

#### Repeated Fields as Arrays

Repeated fields in the image subheader (like band info) appear as Python lists
instead of individual indexed entries:

```python
# Band info is a list of dicts, one per band
bands = meta["BAND_INFO"]            # list of dicts
for i, band in enumerate(bands):
    print(f"Band {i}: IREPBAND={band['IREPBAND']}, NLUTS={band['NLUTS']}")

# Access a specific band directly
first_band = meta["BAND_INFO"][0]
irepband = first_band["IREPBAND"]          # "R"
```

#### Prefix Filtering

Use `entries(prefix)` to retrieve a subset of metadata. For subheader fields,
the prefix matches field names. For TREs, the prefix matches the CETAG:

```python
# Get all fields starting with "FS" (file security fields)
# Returns: FSCLAS, FSCLSY, FSCODE, FSCTLH, FSREL, FSDCTP, ...
security = dataset.metadata.entries("FS")

# Get a specific TRE by CETAG
geolob_only = image.metadata.entries("GEOLOB")
# Returns: {"GEOLOB": {"ARV": "...", "BRV": "...", ...}}
```

#### Value Types Summary

The Python types you get back depend on how the field is defined in the
underlying structure definition:

| Definition | Python type | Example |
|------------|-------------|---------|
| `type: str` (most fields) | `str` | `"U"`, `"00002048"` |
| Binary integers (`u1`, `u2`, `u4`, `u8`) | `int` | `42` |
| Repeated fields (band info, etc.) | `list` of `dict` | `[{"IREPBAND": "R", ...}]` |
| Known TREs | `dict` of `dict` | `{"GEOLOB": {"ARV": "..."}}` |
| Unknown TREs | `dict` with `_raw`, `_length` | `{"_raw": "0102", "_length": 2}` |
| Binary byte fields | `str` (hex-encoded) | `"ff8000"` |
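
Because a metadata snapshot mixes strings, integers, lists, and nested dictionaries, it can help to flatten it before logging or comparing files. The sketch below walks a plain dictionary with those value shapes. `flatten_metadata` and the dotted path format are illustrative choices, not library features.

```python
def flatten_metadata(value, prefix=""):
    """Yield (path, leaf) pairs from a nested metadata snapshot."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_metadata(child, path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_metadata(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


# Hand-written sample covering the value shapes in the table above
snapshot = {
    "FSCLAS": "U",
    "NROWS": "00002048",
    "BAND_INFO": [{"IREPBAND": "R"}, {"IREPBAND": "G"}],
    "GEOLOB": {"ARV": "000360000"},
    "UNKNWN": {"_raw": "0102", "_length": 2},
}

flat = dict(flatten_metadata(snapshot))
for path, leaf in flat.items():
    print(f"{path} = {leaf!r}")
```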

### Writing NITF Metadata

When writing NITF files, you control header fields by setting metadata on the
writer and on individual assets. The writer reads user-settable fields from
the metadata provider and falls back to sensible defaults when a field is absent.

#### File Header Fields

Set file-level metadata using `BufferedMetadataProvider` and assign it to the
writer's `metadata` property:

```python
from aws.osml.io import IO, BufferedMetadataProvider

file_meta = BufferedMetadataProvider()
file_meta["FTITLE"] = "Reconnaissance Mission 2026-03-15"
file_meta["ONAME"] = "Sensor Operator"
file_meta["OPHONE"] = "555-0100"
file_meta["FDT"] = "20260315120000"
file_meta["OSTAID"] = "STATION1"
file_meta["CLEVEL"] = "05"

# Security classification fields use the FS prefix
file_meta["FSCLAS"] = "S"
file_meta["FSCLSY"] = "US"
file_meta["FSCODE"] = "SECRET"
file_meta["FSREL"] = "USA GBR"

# FBKGC is a 3-byte binary field (RGB background color)
# Set it as a list of integers
file_meta["FBKGC"] = [255, 255, 255]

writer = IO.open(["output.ntf"], "w", "nitf")
writer.metadata = file_meta
# ... add assets and close
```

Fields you don't set keep their defaults — `FSCLAS` defaults to `"U"`,
`OSTAID` defaults to `"OSML_IO"`, `CLEVEL` defaults to `"03"`, and text
fields default to blank.

#### Image Subheader Fields

Image assets read several fields from metadata (`IID1`, `IDATIM`, `TGTID`,
`IID2`, `ISORCE`). The security classification block and category fields are
also metadata-driven:

```python
image_meta = BufferedMetadataProvider()

# Identification fields
image_meta["IID1"] = "IMG_00001"
image_meta["IDATIM"] = "20260315103045"
image_meta["ISORCE"] = "Satellite XYZ"

# Security fields use the IS prefix
image_meta["ISCLAS"] = "S"
image_meta["ISCLSY"] = "US"
image_meta["ISREL"] = "USA"

# Image category and coordinate representation
image_meta["ICAT"] = "SAR"
image_meta["ICORDS"] = "G"
```

Fields derived from the image data itself — `NROWS`, `NCOLS`, `PVTYPE`,
`IREP`, `NBPP`, `ABPP`, `NBANDS`, and blocking parameters — are always
computed from the `ImageAssetProvider` and cannot be overridden through
metadata.

#### Text, Graphic, and DES Subheader Fields

Text, graphic, and data extension segment subheaders follow the same pattern.
Set fields on the asset's metadata provider before adding it to the writer:

```python
# Text asset metadata (TS prefix for security fields)
text_meta = BufferedMetadataProvider()
text_meta["TXTDT"] = "20260315120000"
text_meta["TXTFMT"] = "STA"
text_meta["TSCLAS"] = "C"

# Graphic asset metadata (SS prefix for security fields)
graphic_meta = BufferedMetadataProvider()
graphic_meta["SFMT"] = "C"
graphic_meta["SDLVL"] = "002"
graphic_meta["SLOC"] = "0050000100"
graphic_meta["SSCLAS"] = "U"

# DES metadata (DES prefix for security fields, but DECLAS for classification)
des_meta = BufferedMetadataProvider()
des_meta["DESVER"] = "02"
des_meta["DECLAS"] = "U"
des_meta["DESCLSY"] = "US"
```

#### Security Classification Fields

Every NITF subheader contains the same 16-field security classification block.
The field names use a prefix that varies by segment type:

| Segment | Prefix | Example |
|---------|--------|---------|
| File header | `FS` | `FSCLAS`, `FSCLSY`, `FSCODE`, … |
| Image | `IS` | `ISCLAS`, `ISCLSY`, `ISCODE`, … |
| Text | `TS` | `TSCLAS`, `TSCLSY`, `TSCODE`, … |
| Graphic | `SS` | `SSCLAS`, `SSCLSY`, `SSCODE`, … |
| DES | `DE`/`DES` | `DECLAS`, `DESCLSY`, `DESCODE`, … |

The 16 fields in each block (after the prefix) are: `CLAS`, `CLSY`, `CODE`,
`CTLH`, `REL`, `DCTP`, `DCDT`, `DCXM`, `DG`, `DGDT`, `CLTX`, `CATP`,
`CAUT`, `CRSN`, `SRDT`, `CTLN`.

The classification field (`CLAS`) defaults to `"U"`; every other field in the block defaults to blank.
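
Since only the prefix changes between segment types, one set of security values can be expanded into the right field names for each segment. This is a sketch under that assumption; `security_fields` and the segment labels (`"file"`, `"image"`, and so on) are hypothetical names introduced here.

```python
SECURITY_SUFFIXES = (
    "CLAS", "CLSY", "CODE", "CTLH", "REL", "DCTP", "DCDT", "DCXM",
    "DG", "DGDT", "CLTX", "CATP", "CAUT", "CRSN", "SRDT", "CTLN",
)
SEGMENT_PREFIXES = {"file": "FS", "image": "IS", "text": "TS", "graphic": "SS", "des": "DES"}


def security_field_name(segment: str, suffix: str) -> str:
    """Build the full field name, e.g. ("image", "CLAS") becomes "ISCLAS"."""
    if suffix not in SECURITY_SUFFIXES:
        raise KeyError(f"unknown security field suffix: {suffix}")
    if segment == "des" and suffix == "CLAS":
        return "DECLAS"   # the DES classification field uses the shorter DE prefix
    return SEGMENT_PREFIXES[segment] + suffix


def security_fields(segment: str, values: dict) -> dict:
    """Expand {suffix: value} into {field name: value} for one segment type."""
    return {security_field_name(segment, suffix): value for suffix, value in values.items()}


common = {"CLAS": "U", "CLSY": "US"}
print(security_fields("image", common))   # {'ISCLAS': 'U', 'ISCLSY': 'US'}
print(security_fields("des", common))     # {'DECLAS': 'U', 'DESCLSY': 'US'}
```

Each resulting pair can then be assigned to a `BufferedMetadataProvider` with the usual dictionary syntax.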

#### Computed vs. User-Settable Fields

Some fields are always computed by the writer and cannot be overridden:

- `FHDR`, `FVER` — determined by the output format (NITF 2.1 / NSIF 1.0)
- `FL`, `HL` — computed from actual file and header lengths
- `NUMI`, `NUMS`, `NUMT`, `NUMDES`, `NUMRES` — segment counts
- Segment length arrays (`LISH`/`LI`, `LSSH`/`LS`, etc.)
- `ENCRYP` — always `"0"` (unencrypted)
- Image dimensions, pixel type, blocking parameters — derived from image data

#### Writing TREs

Set TREs as nested dicts using dictionary syntax, matching the format returned by
the reader:

```python
image_meta["GEOLOB"] = {
    "ARV": "000360000",
    "BRV": "000180000",
    "LSO": "-077.0000000000",
    "PSO": "+038.0000000000",
}
```

Numeric fields (BCS-N encoding) are auto-formatted to their defined width:
short values are left-padded with zeros, and overly precise values are
reformatted to fit. You can pass natural representations:

```python
image_meta["ICHIPB"] = {
    "OP_ROW_11": "0.5",     # auto-padded to "0000000000.5" (12 bytes)
    "FI_ROW": "768",        # auto-padded to "00000768" (8 bytes)
    # ...
}
```

Text fields (BCS-A) are right-padded with spaces if short and rejected if
too long. A value that still does not fit its field after formatting raises an error.
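
The padding rules can be illustrated in plain Python. This is a sketch of the padding behaviour only: `pad_bcs_n` and `pad_bcs_a` are hypothetical helpers, and they do not attempt the precision reformatting that the writer applies to numeric values.

```python
def pad_bcs_n(value: str, width: int) -> str:
    """Left-pad a numeric string with zeros; a leading sign stays in front."""
    padded = value.zfill(width)
    if len(padded) > width:
        raise ValueError(f"{value!r} does not fit in {width} bytes")
    return padded


def pad_bcs_a(value: str, width: int) -> str:
    """Right-pad a text string with spaces; reject values that are too long."""
    if len(value) > width:
        raise ValueError(f"{value!r} does not fit in {width} bytes")
    return value.ljust(width)


print(pad_bcs_n("0.5", 12))       # 0000000000.5
print(pad_bcs_n("768", 8))        # 00000768
print(repr(pad_bcs_a("STA", 5)))  # 'STA  '
```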

#### Encoding Tolerance

NITF fields declare a character encoding that constrains what bytes are valid.
For example, BCS-NPI (Numeric Positive Integer) only permits digits and spaces
per the JBP specification. In practice, real-world NITF producers frequently
violate these constraints — the RPC00B TRE's `HEIGHT_SCALE` field is commonly
written with a leading `+` sign despite being declared BCS-NPI.

By default, the writer uses **permissive** validation: numeric fields (BCS-N
and BCS-NPI) accept any printable ASCII character (the BCS-A range, 0x20–0x7E).
This ensures that metadata read from real-world files can be written back
without error — a round-trip that would otherwise fail on spec-violating values.

If you need output that is strictly compliant with the NITF encoding
specifications, enable strict mode on the writer:

```python
with IO.open(["output.ntf"], "w", "nitf") as writer:
    writer.strict_encoding = True  # reject values that violate declared encodings
    # provider: an image asset provider created earlier (not shown)
    writer.add_asset("image:0", provider, "Image", "", ["data"])
```

In strict mode, writing `"+0697"` to a BCS-NPI field raises a validation error
because `+` is not in the BCS-NPI character set. This is useful when producing
files that must pass formal conformance checks.
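
The difference between the two modes can be sketched with two character-set checks. The function below is an illustration of the rule as described in this section, not the writer's implementation, and `check_bcs_npi` is a hypothetical name.

```python
import re

PRINTABLE_ASCII = re.compile(r"[\x20-\x7e]*")   # BCS-A range, used in permissive mode
BCS_NPI_STRICT = re.compile(r"[0-9 ]*")         # digits and spaces, as described above


def check_bcs_npi(value: str, strict: bool = False) -> None:
    """Raise ValueError if value is not acceptable for a BCS-NPI field."""
    pattern = BCS_NPI_STRICT if strict else PRINTABLE_ASCII
    if not pattern.fullmatch(value):
        mode = "strict" if strict else "permissive"
        raise ValueError(f"{value!r} is not valid BCS-NPI in {mode} mode")


check_bcs_npi("+0697")                 # permissive: accepted
check_bcs_npi("00697", strict=True)    # strict: accepted
try:
    check_bcs_npi("+0697", strict=True)
except ValueError as err:
    print(err)                         # '+0697' is not valid BCS-NPI in strict mode
```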

### Extending NITF Metadata with Structure Definitions

The metadata you access from NITF files is produced by a
data-driven parsing framework. The library uses declarative YAML-based
structure definition files (`.ksy` format, inspired by
[Kaitai Struct](https://kaitai.io/)) to describe binary layouts. These
definitions control both reading and writing — the same file that tells the
parser how to extract fields from a binary header also tells the writer how
to serialize them back.

This means you can extend the metadata the library understands by adding new
structure definition files. If a TRE, DES, or other NITF metadata structure
isn't already supported, you can write a `.ksy` definition for it and register
it with the `StructureRegistry`.

#### Configuring the StructureRegistry

The `StructureRegistry` manages all structure definitions. By default it loads
definitions from the package's built-in `data/structures/` directory, which
includes NITF file headers, image subheaders, and many common TREs. You can
extend it with your own definitions:

```python
from aws.osml.io import StructureRegistry

# Create a registry (loads built-in definitions automatically)
registry = StructureRegistry()

# See what's already available
for name in registry.list():
    print(name)
# NITF_02.10_FileHeader, NITF_02.10_ImageSubheader, TRE_GEOLOB,
# TRE_RPC00B, TRE_SENSRB, TRE_USE00A, ... (70+ definitions)

# Add a directory containing your custom .ksy files
registry.add_search_path("/path/to/my/structures")

# Retrieve a specific definition
geolob_def = registry.get("TRE_GEOLOB")

# Reload definitions after editing .ksy files on disk
registry.reload()
```

You can also set the `OSML_IO_STRUCTURE_PATH` environment variable to add
search paths without changing code. Separate multiple paths with `:`.

```bash
export OSML_IO_STRUCTURE_PATH="/team/shared/structures:/project/custom/structures"
```

#### How Definitions Are Used

Structure definitions drive both directions of the pipeline:

- When reading, the parser uses the definition to locate fields in the binary
  data, apply the correct encoding (BCS-A, BCS-N, etc.), evaluate conditional
  and repeated fields, and populate the metadata dictionary.

- When writing, the writer uses the same definition to serialize metadata
  values back into the correct binary layout, validating field sizes and
  encodings along the way.

Adding a new `.ksy` file for a TRE automatically enables both reading and
writing that TRE — no code changes required.

#### Lower-Level Access with StructureAccessor

For lower-level access with built-in type conversion, the `StructureAccessor`
returns `Value` objects with `as_str()`, `as_int()`, and `as_float()` methods
that handle NITF's ASCII-numeric conventions (e.g. parsing `"003"` as `3`).

#### Writing Your Own Structure Definitions

Structure definition files use a YAML-based format with support for field types,
conditional presence, repeat expressions, and nested structures. For the full
syntax reference, expression language details, and examples, see the
[Structure Definition Guide](structure-definitions.md).

---

## TIFF and GeoTIFF Metadata

For TIFF and GeoTIFF files, the metadata dictionary uses numeric TIFF tag IDs
as keys. Each key is the string representation of the tag number from the
TIFF 6.0 specification — for example, `"256"` for ImageWidth, `"259"` for
Compression, `"33550"` for ModelPixelScale.

This design means every tag in the IFD is preserved, including private-use
tags (32768+) and vendor-specific tags that would otherwise be dropped by a
hardcoded name list. The raw tag values are stored directly, with no
interpretation or transformation applied.
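
Because values are stored raw, structured tags are left for you to decode. GeoKeyDirectory (tag 34735) is a typical case: it is a flat array of SHORTs with a four-value header followed by four values per key. The sketch below decodes that layout as defined by the GeoTIFF specification. It assumes the tag value arrives as a flat Python list of integers, and `decode_geokey_directory` is a hypothetical helper.

```python
def decode_geokey_directory(shorts):
    """Decode a flat GeoKeyDirectory array into a version tuple and a key table."""
    version, revision, minor, num_keys = shorts[:4]
    keys = {}
    for i in range(num_keys):
        key_id, tag_location, count, value_offset = shorts[4 + 4 * i: 8 + 4 * i]
        if tag_location == 0:
            # The value is stored inline in the last slot
            keys[key_id] = value_offset
        else:
            # The value lives in another tag (for example 34736 or 34737)
            keys[key_id] = (tag_location, value_offset, count)
    return {"version": (version, revision, minor), "keys": keys}


# Hand-written sample directory with three inline keys
sample = [
    1, 1, 0, 3,
    1024, 0, 1, 2,      # GTModelTypeGeoKey: 2 (geographic)
    1025, 0, 1, 1,      # GTRasterTypeGeoKey: 1 (pixel is area)
    2048, 0, 1, 4326,   # GeographicTypeGeoKey: EPSG 4326
]
print(decode_geokey_directory(sample)["keys"])   # {1024: 2, 1025: 1, 2048: 4326}
```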

### Reading TIFF Metadata

```python
from aws.osml.io import IO

with IO.open(["image.tif"], "r") as dataset:
    meta = dataset.metadata

    # Tags are keyed by their numeric ID as a string
    width = meta["256"]           # ImageWidth
    height = meta["257"]          # ImageLength
    bits = meta["258"]            # BitsPerSample
    compression = meta["259"]     # Compression

    # GeoTIFF tags use the same numeric key convention
    pixel_scale = meta["33550"]   # ModelPixelScale — e.g. [0.5, 0.5, 0.0]
    tiepoints = meta["33922"]     # ModelTiepoint
    geokeys = meta["34735"]       # GeoKeyDirectory (raw SHORT array)

    # Dataset-level entries use descriptive string keys
    byte_order = meta["ByteOrder"]              # "LittleEndian"
    num_dirs = meta["NumberOfDirectories"]       # 3

    # Prefix filtering is a plain string match on the key, not a numeric range
    tags_prefix_3 = dataset.metadata.entries("3")
    # Returns "322" (TileWidth), "323" (TileLength), "339" (SampleFormat),
    # "33550" (ModelPixelScale), "34735" (GeoKeyDirectory), etc.
```

### Using TagNameResolver for Name-Based Access

If you prefer human-readable tag names, wrap the dictionary with
`TagNameResolver`. It translates names like `"ImageWidth"` to the
corresponding numeric key (`"256"`) behind the scenes.

```python
from aws.osml.io import IO
from aws.osml.io.tiff.utils import TagNameResolver

with IO.open(["image.tif"], "r") as dataset:
    meta = dataset.metadata.entries()
    tags = TagNameResolver(meta)

    # Look up by name — same value as meta["256"]
    width = tags["ImageWidth"]
    height = tags["ImageLength"]

    # GeoTIFF tags work the same way
    scale = tags["ModelPixelScale"]
    geokeys = tags["GeoKeyDirectory"]

    # Safe access with a default value
    nodata = tags.get("GDALNoData", "nan")

    # Direct numeric access when you know the tag number
    raw = tags.by_number(34735)

    # Check if a tag is present
    if "Compression" in tags:
        print(f"Compression: {tags['Compression']}")

    # Iterate over all entries
    for key, value in tags:
        print(f"Tag {key}: {value}")
```

The resolver ships with a default mapping covering baseline TIFF 6.0 tags,
GeoTIFF tags, and common GDAL tags. You can extend it with custom mappings
for vendor-specific or application-specific tags:

```python
custom_tags = TagNameResolver(meta, custom_mapping={
    "MyVendorTag": 65000,
    "CloudCover": 65001,
})

vendor_val = custom_tags["MyVendorTag"]
cloud = custom_tags["CloudCover"]

# Custom mappings override defaults if there's a name collision
```

### Writing TIFF Metadata

When writing TIFF files, supply metadata using the same numeric key format.
The writer infers the TIFF field type from the JSON value type for common
cases. For types that can't be inferred, use an explicit type annotation.

```python
from aws.osml.io import IO, BufferedImageAssetProvider, BufferedMetadataProvider, PixelType

metadata = BufferedMetadataProvider()
metadata["259"] = 8                              # Compression: Deflate
metadata["33550"] = [0.5, 0.5, 0.0]             # ModelPixelScale → DOUBLE array
metadata["42113"] = "nan"                        # GDALNoData → ASCII

# For field types that can't be inferred (e.g. UNDEFINED), use an annotation:
metadata["700"] = {"value": [60, 120, 109, 108], "type": 7}  # XMP as UNDEFINED bytes

# Attach metadata to the provider — the writer sources all IFD tags from here
provider = BufferedImageAssetProvider.create(
    key="image:0", num_columns=512, num_rows=512, num_bands=1,
    block_width=256, block_height=256, pixel_type=PixelType.UInt8,
    metadata=metadata,
)
# image_data: pixel array prepared elsewhere, sized to match the provider
provider.set_full_image(image_data)

with IO.open(["output.tif"], "w", "tiff") as writer:
    writer.add_asset("image:0", provider, "Image", "desc", ["data"])
```
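
The exact inference rules are not spelled out on this page, so the following is only one plausible reading of "infer the field type from the value type", written to make the idea concrete. The rule for integers (SHORT when the value fits in 16 bits, otherwise LONG) is an assumption, and `guess_tiff_type` is a hypothetical function rather than the writer's logic. The type codes themselves come from the TIFF 6.0 specification.

```python
# TIFF 6.0 field type codes
TIFF_ASCII, TIFF_SHORT, TIFF_LONG, TIFF_SLONG, TIFF_DOUBLE = 2, 3, 4, 9, 12


def guess_tiff_type(value):
    """Pick a TIFF field type for a metadata value (illustrative rules only)."""
    if isinstance(value, dict):
        return value["type"]          # an explicit annotation always wins
    if isinstance(value, str):
        return TIFF_ASCII
    items = value if isinstance(value, list) else [value]
    if any(isinstance(item, float) for item in items):
        return TIFF_DOUBLE
    if any(item < 0 for item in items):
        return TIFF_SLONG
    # Assumption: unsigned integers use SHORT when every value fits in 16 bits
    return TIFF_SHORT if all(item <= 0xFFFF for item in items) else TIFF_LONG


print(guess_tiff_type([0.5, 0.5, 0.0]))                                # 12
print(guess_tiff_type("nan"))                                          # 2
print(guess_tiff_type({"value": [60, 120, 109, 108], "type": 7}))      # 7
```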

#### Writing with TagNameResolver

`TagNameResolver` is bidirectional — you can use it to build write metadata
with human-readable names instead of numeric tag IDs. Assign values with
`resolver["TagName"] = value` and the resolver stores them under the correct
numeric key in the underlying dictionary.

For tags with well-known enumerated values (Compression, Predictor,
PlanarConfiguration, SampleFormat, PhotometricInterpretation, Orientation),
string values are resolved to their numeric equivalents automatically:

```python
from aws.osml.io import IO, BufferedImageAssetProvider, BufferedMetadataProvider, PixelType
from aws.osml.io.tiff.utils import TagNameResolver

metadata = BufferedMetadataProvider()
tag_dict = metadata.entries()
resolver = TagNameResolver(tag_dict)

# Set tags by name — stored under the correct numeric key
resolver["TileWidth"] = 512
resolver["TileLength"] = 512
resolver["ModelPixelScale"] = [0.5, 0.5, 0.0]

# Enumerated values resolve automatically
resolver["Compression"] = "LZW"           # stored as 5
resolver["Compression"] = "Deflate"       # stored as 8
resolver["Predictor"] = "Horizontal"      # stored as 2
resolver["SampleFormat"] = "Float"        # stored as 3

# Integer values pass through unchanged
resolver["Compression"] = 5               # also works

# Write resolved keys back to the metadata provider
for key, value in tag_dict.items():
    metadata[key] = value

provider = BufferedImageAssetProvider.create(
    key="image:0", num_columns=512, num_rows=512, num_bands=1,
    block_width=512, block_height=512, pixel_type=PixelType.Float32,
    metadata=metadata,
)
provider.set_full_image(image_data)

with IO.open(["output.tif"], "w", "tiff") as writer:
    writer.add_asset("image:0", provider, "Image", "desc", ["data"])
```

The supported enumerated value names (case-insensitive) are:

| Tag | Accepted names |
|-----|---------------|
| Compression (259) | None, CCITTRLE, CCITTFax3, CCITTFax4, LZW, OJPEG, JPEG, Deflate, PackBits |
| PhotometricInterpretation (262) | MinIsWhite, MinIsBlack, RGB, Palette, Mask, YCbCr |
| PlanarConfiguration (284) | Chunky, Planar |
| Predictor (317) | None, Horizontal, FloatingPoint |
| SampleFormat (339) | UInt, Int, Float, Void |
| Orientation (274) | TopLeft, TopRight, BottomRight, BottomLeft, LeftTop, RightTop, RightBottom, LeftBottom |
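
Case-insensitive name resolution of this kind amounts to a lookup on the lowercased name. The sketch below shows the idea for the Compression tag, using the numeric codes from the TIFF specification. `resolve_compression` is a hypothetical stand-in and not the resolver's own code.

```python
# Compression names from the table above, mapped to TIFF compression codes
COMPRESSION_CODES = {
    "none": 1, "ccittrle": 2, "ccittfax3": 3, "ccittfax4": 4, "lzw": 5,
    "ojpeg": 6, "jpeg": 7, "deflate": 8, "packbits": 32773,
}


def resolve_compression(value):
    """Return the numeric code for a compression name; integers pass through."""
    if isinstance(value, int):
        return value
    try:
        return COMPRESSION_CODES[value.lower()]
    except KeyError:
        raise ValueError(f"unknown compression name: {value!r}") from None


print(resolve_compression("LZW"))       # 5
print(resolve_compression("deflate"))   # 8
print(resolve_compression(5))           # 5
```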