# Datasets and the IO Interface

## The Simple Path

For most tasks you don't need to think about datasets or assets at all. The
convenience functions handle file opening, asset selection, and cleanup for you:

```python
from aws.osml.io import imread, imsave, iminfo

# Read → NumPy array
pixels = imread("image.ntf")

# Inspect without reading pixels
info = iminfo("image.ntf")
print(f"{info.width}x{info.height}, {info.bands} bands, {info.dtype}")

# Save — format inferred from extension
imsave("output.tif", pixels)
```

When you need more control — multi-segment files, per-asset metadata, specific
compression parameters, or write workflows that involve multiple assets — the
full dataset API described below gives you direct access to everything in the
file.

## Opening a Dataset

The `IO` class is the entry point for reading and writing imagery files. It auto-detects
the format (NITF, TIFF/GeoTIFF, PNG, etc.) and returns a `DatasetReader` or `DatasetWriter`:

```python
from aws.osml.io import IO

# Read mode — format auto-detected from extension
with IO.open(["image.ntf"], "r") as dataset:
    print(type(dataset))  # DatasetReader

# Write mode — format specified explicitly
with IO.open(["output.tif"], "w", "geotiff") as writer:
    print(type(writer))  # DatasetWriter
```

Use the context manager (`with`) to ensure file handles are released when you're done.

## Input Sources

`IO.open()` and the convenience functions (`imread`, `imsave`, `iminfo`, `tiles`)
accept two kinds of input:

### File paths (recommended for large files)

Pass a string path (or list of paths for multi-file pyramids). The library
memory-maps the file, so only the pages you access are loaded into RAM. This is
the most performant option — the operating system efficiently manages loading
imagery from disk into memory without requiring the entire file to be resident.

```python
from aws.osml.io import imread

pixels = imread("large_image.ntf")
```

### Python file-like objects

Any object with a standard `.read()` / `.write()` interface works — `io.BytesIO`,
fsspec handles, HTTP response bodies, or any duck-typed object with the required
methods. This is convenient when you already have bytes in memory or want to
encode directly to a buffer without touching the filesystem.

```python
import io
from aws.osml.io import IO, imread, imsave
import numpy as np

# Read from an in-memory buffer
png_bytes = download_image_bytes()
pixels = imread(io.BytesIO(png_bytes), format="png")

# Write directly to a buffer
data = np.random.randint(0, 255, (3, 256, 256), dtype=np.uint8)
buffer = io.BytesIO()
imsave(buffer, data, format="jpeg")
```

### Trade-offs

Stream sources are read entirely into memory via a single `.read()` call. For
large files (multi-GB NITF imagery) this can be problematic:

- **Memory pressure** — the full file must fit in RAM, unlike memory-mapped paths
  which load pages on demand.
- **Latency for remote files** — if the stream backs cloud storage (e.g., an
  fsspec S3 handle), the entire file must be downloaded before decoding begins.

For efficient access to large remote imagery without downloading the full file,
use the [VirtualiZarr tile-based access](zarr-codecs.md) path. It issues HTTP
range requests for only the tiles you need:

```python
import zarr
import numpy as np
from aws.osml.io.multi_reference_fs import MultiReferenceFileSystem
from zarr.storage._fsspec import FsspecStore

fs = MultiReferenceFileSystem(
    fo="s3://bucket/image.ntf.tile_index.json",
    template_overrides={"base": "s3://bucket/imagery/"},
    asynchronous=True,
    remote_options={"asynchronous": True},
    skip_instance_cache=True,
)
store = FsspecStore(fs=fs, read_only=True, path="")
root = zarr.open_group(store, mode="r", zarr_format=2)

# Read only the tiles you need — no full-file download
tile = np.asarray(root["0/data"][0:3, 0:256, 0:256])
```

See [Cloud Imagery Access via Zarr](zarr-codecs.md) for the full workflow.

Alternatively, download the remote file to a local path first to get
memory-mapped performance:

```python
import tempfile
from aws.osml.io import imread

with tempfile.NamedTemporaryFile(suffix=".ntf") as tmp:
    tmp.write(remote_file.read())
    tmp.flush()
    pixels = imread(tmp.name)
```

### The `format` parameter

When working with streams, the library cannot infer the image format from a file
extension. The `format` parameter is **required** for all stream operations:

```python
# Raises ValueError — no format specified
imread(io.BytesIO(data))

# Works
imread(io.BytesIO(data), format="png")
```

Supported format strings: `"nitf"`, `"tiff"`, `"png"`, `"j2k"`, `"jpeg"`.

When using file paths, `format` remains optional — the library infers it from the
file extension. Recognized NITF extensions include `.ntf`, `.nitf`, `.nsif`, `.nsf`,
and `.hr1` through `.hr8` (High Resolution Elevation products).

### When streams are a good fit

- The file is small enough to fit in memory (PNG thumbnails, JPEG tiles, small
  NITF chips)
- You already have the bytes in memory (HTTP response bodies, message payloads)
- You want to encode output directly to a buffer without a temporary file (tile
  server responses)
- You are using fsspec handles for moderate-sized files from cloud storage

## Dataset Structure

The dataset model in this library is inspired by the
[SpatioTemporal Asset Catalog (STAC)](https://stacspec.org/en) specification. STAC defines
a common structure for describing and cataloging geospatial assets — any file that
represents information about the Earth captured at a certain place and time. The core
building block in STAC is the **Item**, a GeoJSON feature that groups one or more related
**Assets** (the actual data files) together with shared metadata such as spatial extent,
temporal range, and provenance.

This library adopts the same conceptual model: a single `Dataset` maps to a STAC Item and
may contain multiple named assets. Just as a STAC Item for a satellite scene might include
separate assets for each spectral band, a thumbnail, a metadata sidecar, and ML-derived
annotations, a `Dataset` opened by this library can contain multiple images, structured
data payloads (e.g. SICD/SIDD XML), text reports, and vector graphic overlays — all
accessed through a uniform interface. The key insight is that real-world geospatial
products are rarely a single file; they are bundles of related assets that share a common
spatial and temporal context.

By aligning with the STAC data model, datasets produced or consumed by this library are
straightforward to publish as STAC Items and integrate with the broader STAC ecosystem of
catalogs, search APIs, and tooling. The library does not implement the STAC JSON format
itself, but the structural alignment means the mapping between an in-memory `Dataset` and
a STAC Item is direct: each asset key corresponds to a STAC Asset entry, asset types map
to STAC roles, and dataset-level metadata carries the information needed to populate Item
properties.

Each asset within a dataset has a type and a key that uniquely identifies it:

| Asset Type | Description | Examples |
|------------|-------------|---------|
| `image` | Raster imagery with blocked access | Satellite photos, SAR data |
| `data` | Structured data payloads | SICD/SIDD XML, overflow TREs |
| `text` | Plain text content | Mission reports, annotations |
| `graphics` | Vector graphics | CGM overlays |

### Asset Roles

Every asset also carries one or more semantic roles that describe its purpose. Roles
are aligned with the [STAC asset roles](https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#asset-roles)
convention — short strings that communicate what an asset is for, independent of the
underlying file format.

| Role | Meaning | Assigned To |
|------|---------|-------------|
| `data` | Full-resolution image data | TIFF full-res IFDs, NITF image segments, JPEG, PNG |
| `overview` | Reduced-resolution image | COG overview IFDs, multi-file R-set images |
| `metadata` | Metadata asset | NITF text segments, data extension segments |
| `graphic` | Graphic/annotation overlay | NITF graphic segments |

Roles are the primary way to distinguish between different kinds of assets without
parsing key strings. See [Image Pyramids](#image-pyramids) below for how roles are
used to separate full-resolution images from reduced-resolution overviews.

## Image Pyramids

An image pyramid is a set of representations of the same image at progressively lower
resolutions. Pyramids enable efficient multi-scale access — a viewer can load a
low-resolution overview for navigation and fetch full-resolution tiles only for the
region of interest.

There are three ways multi-resolution data can be represented in geospatial imagery:

1. **Block-level resolution levels** — A single image whose compressed blocks can be
   decoded at multiple resolutions (e.g. JPEG 2000 wavelet decomposition). The block
   grid stays the same; each block just produces fewer pixels at higher level numbers.
   See [Reading Blocks](image-assets.md#reading-blocks) for details.

2. **Embedded overviews** — A single file containing multiple images at different
   resolutions. Cloud Optimized GeoTIFFs (COGs) store reduced-resolution overview
   images as additional IFDs alongside the full-resolution image.

3. **Multi-file pyramids** — Separate files for each resolution level. NITF R-sets
   are a common example: `image.ntf` is the full resolution, `image.ntf.r1` through
   `image.ntf.rN` are progressively reduced overviews.

This library exposes cases 2 and 3 through the same uniform interface: each resolution
level becomes a separate image asset with its own key and role. The full-resolution
image has role `data`, and each overview has role `overview`.

### Overview Asset Keys

Overview keys follow the pattern `image:{parent}:overview:{level}`, where `{parent}`
is the index of the full-resolution image and `{level}` is the overview number:

```python
from aws.osml.io import IO, AssetType

with IO.open(["cog.tif"], "r") as dataset:
    for key in dataset.get_asset_keys(asset_type=AssetType.Image):
        asset = dataset.get_asset(key)
        print(f"{key}: {asset.num_columns}x{asset.num_rows}, roles={asset.roles}")
    # image:0: 4096x4096, roles=['data']
    # image:0:overview:1: 2048x2048, roles=['overview']
    # image:0:overview:2: 1024x1024, roles=['overview']
```

Each overview is a fully functional image asset — you can read blocks, check dimensions,
and access metadata just like a full-resolution image:

```python
    # Use roles to separate full-res from overviews
    data_keys = dataset.get_asset_keys(asset_type=AssetType.Image, roles=["data"])
    overview_keys = dataset.get_asset_keys(asset_type=AssetType.Image, roles=["overview"])

    # Read a block from an overview
    overview = dataset.get_asset("image:0:overview:1")
    block = overview.get_block(0, 0, resolution_level=0)
```

This is different from the `resolution_level` parameter on `get_block()`. Block-level
resolution levels are a decompression feature that produces smaller versions of the
same block. Overview assets are separate images with their own tile grids and
dimensions. The two mechanisms are complementary — an overview image that uses JPEG
2000 compression could itself support multiple block-level resolution levels.

### Multi-File Pyramids

When a dataset spans multiple files at different resolutions, pass all files to
`IO.open()` as a list. The library detects the R-set naming convention (`.rN` suffix)
and exposes each file as an overview asset, producing the same key and role structure
as embedded overviews:

```python
with IO.open(["image.ntf", "image.ntf.r1", "image.ntf.r2"], "r") as dataset:
    for key in dataset.get_asset_keys(asset_type=AssetType.Image):
        asset = dataset.get_asset(key)
        print(f"{key}: {asset.num_columns}x{asset.num_rows}, roles={asset.roles}")
    # image:0: 4096x4096, roles=['data']
    # image:0:overview:1: 2048x2048, roles=['overview']
    # image:0:overview:2: 1024x1024, roles=['overview']
```

The first path is always the full-resolution base image. The overview level is
extracted from the filename, not inferred from list order — these two calls produce
identical results:

```python
IO.open(["image.ntf", "image.ntf.r1", "image.ntf.r2"], "r")
IO.open(["image.ntf", "image.ntf.r2", "image.ntf.r1"], "r")
```

R-set detection is format-agnostic. Each file in the list is opened with its own
auto-detected format reader, so users are free to select other encodings for the
overview files if desired.

:::{note}
R-sets are a de facto industry convention used by some data providers and image
analysis tools. They are not part of the JBP/NITF specification — there is no 
internal metadata linking an R-set file to its parent. The relationship is purely
by filename convention.
:::

Some things to keep in mind with multi-file pyramids:

- The caller must provide the full list of paths explicitly. `IO.open()` does not
  scan the filesystem for sibling `.rN` files.
- When only one path is provided, behavior is identical to the single-file case.
- R-set overviews are associated with `image:0` (the primary image segment). If the
  base file contains multiple image segments, the R-sets apply to the primary image
  only.
- The same multi-path pattern works for writing — see
  [Writing Multi-File R-Set Pyramids](image-assets-writing.md#writing-multi-file-r-set-pyramids).

### Streams and Explicit Roles

When sources are streams rather than file paths, there are no filenames to parse for
`.rN` suffixes. The `roles` parameter tells the library the purpose of each source
explicitly:

```python
import io
from aws.osml.io import IO, AssetType

base_stream = io.BytesIO(base_bytes)
overview1_stream = io.BytesIO(overview1_bytes)
overview2_stream = io.BytesIO(overview2_bytes)

with IO.open(
    [base_stream, overview1_stream, overview2_stream],
    "r",
    format="nitf",
    roles=[["data"], ["overview:1"], ["overview:2"]],
) as dataset:
    # Same asset key structure as file-path R-sets
    for key in dataset.get_asset_keys(asset_type=AssetType.Image):
        asset = dataset.get_asset(key)
        print(f"{key}: {asset.num_columns}x{asset.num_rows}")
    # image:0: 4096x4096
    # image:0:overview:1: 2048x2048
    # image:0:overview:2: 1024x1024
```

The `roles` parameter assigns semantic roles to each source in a multi-source
dataset:

| First argument | `roles` type | Description |
|----------------|--------------|-------------|
| Single source (`str` or stream) | `list[str]` | Roles for the single source |
| List of sources | `list[list[str]]` | One inner list per source (must match list length) |

Role strings:

| Role string | Meaning |
|-------------|---------|
| `"data"` | Base image (full resolution). If omitted, the first source is treated as the base. |
| `"overview:N"` | R-set overview at level N (N ≥ 1). Maps to the `image:0:overview:N` asset key. |

When `roles` is required:

- **List of streams** — always required (no filenames to detect from). Omitting raises `ValueError`.
- **List of file paths with `roles`** — explicit roles override `.rN` filename detection.
- **List of file paths without `roles`** — falls back to `.rN` detection (common convention).

```python
# Paths with explicit roles — bypasses .rN detection
IO.open(["base.ntf", "ovr.ntf"], "r", roles=[["data"], ["overview:1"]])

# Paths without roles — uses .rN detection
IO.open(["image.ntf", "image.ntf.r1"], "r")
```

## Discovering Assets

Use `get_asset_keys()` to list available assets, then `get_asset()` to retrieve
a specific one. You can filter by asset type, by role, or both:

```python
from aws.osml.io import IO, AssetType

with IO.open(["complex_dataset.ntf"], "r") as dataset:
    # List keys by asset type
    image_keys = dataset.get_asset_keys(asset_type=AssetType.Image)
    text_keys = dataset.get_asset_keys(asset_type=AssetType.Text)
    data_keys = dataset.get_asset_keys(asset_type=AssetType.Data)
    graphics_keys = dataset.get_asset_keys(asset_type=AssetType.Graphics)

    print(f"Images: {len(image_keys)}, Text: {len(text_keys)}, "
          f"Data: {len(data_keys)}, Graphics: {len(graphics_keys)}")

    # Retrieve a specific asset
    image = dataset.get_asset("image:0")
```

### Filtering by Role

The `roles` parameter on `get_asset_keys()` lets you filter assets by their semantic
purpose. This is useful when a dataset contains both full-resolution images and
overviews:

```python
with IO.open(["cog.tif"], "r") as dataset:
    # Only full-resolution images
    data_keys = dataset.get_asset_keys(asset_type=AssetType.Image, roles=["data"])

    # Only overview images
    overview_keys = dataset.get_asset_keys(asset_type=AssetType.Image, roles=["overview"])

    # All image assets (no role filter)
    all_keys = dataset.get_asset_keys(asset_type=AssetType.Image)
```

When `roles` is omitted or `None`, all assets matching the `asset_type` filter are
returned. When both `asset_type` and `roles` are provided, both filters apply — only
assets that match the type and have at least one of the requested roles are returned.

NITF files can contain all four asset types. TIFF files contain only image assets —
each IFD (Image File Directory) in the file becomes a separate image asset keyed as
`"image:0"`, `"image:1"`, etc. Cloud Optimized GeoTIFFs additionally expose overview
IFDs as `"image:0:overview:1"`, `"image:0:overview:2"`, etc. PNG files contain a
single image keyed as `"image:0"`. Text, data, and graphics asset queries will return
empty lists for TIFF and PNG datasets.

## Dataset-Level Metadata

Every dataset exposes a `metadata` property with file-level fields. See the
[Metadata](metadata.md) section for details:

```python
with IO.open(["image.ntf"], "r") as dataset:
    file_metadata = dataset.metadata.entries()
```