# Cloud Imagery Access via Zarr ## The Challenge of Working with Large Geospatial Datasets Geospatial imagery archives are massive. Government and commercial satellite programs produce hundreds of petabytes of imagery, growing at tens to hundreds of terabytes per day. The traditional workflow of downloading entire multi-GB files before accessing any pixels is giving way to distributed access patterns. Machine Learning inference, tile-level analytics pulling from global data lakehouses, and interactive visualizations, all benefit from services that can quickly fetch only the compressed bytes for the tiles they care about, decode them, and move on. The challenge is that some well established geospatial formats were designed before cloud object stores existed. Their internal structure — headers, offset tables, interleaved bands, shared compression state — was built for fast local disk access, not HTTP range requests with network delays. At best consumers of these images execute multiple reads to gather header and pixel information from different parts of a file. In some cases entire files must be scanned sequentially to locate the region of interst. ## Where Zarr Fits In [Zarr](https://zarr.dev/) is a modern array storage format designed from the ground up for chunked, compressed, cloud-native data. Each chunk in a Zarr array is independently addressable and independently decodable. Zarr enjoys integration with a wide ecosystem: xarray for labeled array access, Dask for parallel and distributed computation, and fsspec for transparent cloud IO. If we were starting from scratch, we might store our imagery as Zarr arrays and be done. The reality is we have petabytes of imagery governed by well-defined standards and strong communities already stored in the cloud S3. Converting it all to Zarr or something like cloud-optimized GeoTIFFs is expensive, slow, and will increase storage costs. What we need is a way to make the existing data behave like Zarr without actually transcoding it. ## VirtualiZarr: Making Old Data Behave Like New [VirtualiZarr](https://github.com/zarr-developers/VirtualiZarr) is a community project under the `zarr-developers` organization that creates virtual Zarr stores from archival data formats. Instead of converting files, it builds lightweight reference layers that describe where each chunk lives inside an existing file. These references map Zarr chunk coordinates to byte ranges in the original data, making archival files appear as native Zarr stores without copying or modifying a single byte. VirtualiZarr provides a pluggable parser system and a `ManifestStore` abstraction that represents a virtual Zarr store backed by chunk references. The references can be serialized to several formats including [Kerchunk](https://fsspec.github.io/kerchunk/) JSON, Kerchunk Parquet, or committed to an [Icechunk](https://icechunk.io/) transactional store. On the consumer side, fsspec's `ReferenceFileSystem` reads Kerchunk references directly, while Icechunk provides its own zarr-compatible store. Zarr is a great general abstraction for raster data, and VirtualiZarr is the bridge that makes existing data behave like Zarr — without the cost of conversion. We keep our geospatial imagery files exactly where they are in S3. A small reference index (kilobytes to low megabytes) sits alongside each one, and the Zarr ecosystem treats them as native chunked arrays. ## What We Built Bridging archival imagery formats into the Zarr ecosystem required solving three problems that the existing tooling does not handle. Each problem led to a component: 1. **Multi-Resolution Tile Indexing for Geospatial Formats** — Zarr needs to know where each tile lives in the source file. Archival formats store this information in format-specific structures (SOT markers, IFD tags, length-prefixed headers) that VirtualiZarr cannot parse out of the box. 2. **New Support for Non-contiguous Chunk Data** — The Kerchunk reference spec assumes each chunk maps to a single contiguous byte range. JPEG 2000 codestreams with resolution-first progression orders (RLCP, RPCL) interleave tile-parts from different tiles, scattering a single tile's data across multiple non-contiguous locations in the file. Neither Kerchunk nor fsspec's `ReferenceFileSystem` can express or fetch this. 3. **Decoders for Geospatial Formats** — The fetched bytes are still in the source format's native encoding. NITF pixel data uses big-endian byte order and format-specific interleave modes. JPEG 2000 tile-parts are not self-contained codestreams. Compressed TIFF tiles depend on shared IFD tag metadata (predictor settings, JPEG tables) that lives outside the tile data. Standard Zarr codecs cannot decode any of these. The following sections describe each component in pipeline order: the parser produces the tile index, the filesystem fetches the bytes, and the codec decodes them into pixels. ```{image} /_static/images/zarr-decode-flow.png :alt: Figure showing overall flow of Zarr decode path. :width: 700px :align: center ``` ### Tile Indexing with VirtualiZarr Parsers The library provides `OversightMLParser`, a single VirtualiZarr parser that makes it possible to access any imagery format supported by this library as a virtual dataset. The parser scans a file using the library's native reader and walks each image segment, determining the byte offset and length of every tile relative to the start of the file. How tile boundaries are calculated depends on the format and compression: - Uncompressed tiles have fixed sizes. Offsets are computed arithmetically from the image dimensions, pixel type, and interleave mode. Formats that support sparse tile arrays (ex: NITF masked images) may have existing metadata that contains these boundaries. - JPEG tiles are length-prefixed. The parser scans the length headers sequentially to locate each tile boundary. - JPEG 2000 codestreams use SOT (Start of Tile-part) markers that record the byte offset and length of each tile-part. If the codestream contains a TLM (Tile-part Length Marker) in its main header, the full tile index is available immediately without scanning. If no TLM is present, the parser performs a sequential SOT scan. - TIFF files store tile offsets and byte counts in IFD tags (`TileOffsets` and `TileByteCounts`). The parser reads these directly. When a tile's data spans multiple non-contiguous byte ranges (as with interleaved JPEG 2000 tile-parts), the parser detects this and emits a multi-range reference entry instead of a single-range entry. Contiguous multi-part tiles are merged into a single range automatically. The parsers produce the `ManifestStore` — a virtual Zarr store backed by chunk references into the source file. This can be serialized to create a format-agnostic tile index. ### MultiReferenceFileSystem: Scatter-Gather I/O for Non-Contiguous Chunks The standard Kerchunk reference spec supports three forms per chunk key: inline data, whole-file references, and single byte-range references. This covers most formats, but breaks down for JPEG 2000 codestreams with interleaved tile-parts. ```{image} /_static/images/kerchunk-singlerange.png :alt: Figure showing how J2K tile parts grouped by tile are referenced by kerchunk index. :width: 700px :align: center ``` JPEG 2000 supports several progression orders that control how compressed data is organized in the codestream. Two of these — RLCP (Resolution-Layer-Component- Position) and RPCL (Resolution-Position-Component-Layer) — interleave tile-parts from different tiles. Instead of writing all of tile 0's data, then all of tile 1's data, the encoder writes resolution level 0 for every tile, then resolution level 1 for every tile, and so on. A single tile's compressed bytes end up scattered across multiple non-contiguous locations in the file. The standard `ReferenceFileSystem` has no way to express "fetch these six byte ranges and concatenate them" for a single chunk. This is not a theoretical edge case. Satellite imagery from several commercial providers uses RPCL progression order, and the interleaved tile-part layout is common in large multi-resolution JPEG 2000 files. `MultiReferenceFileSystem` is a drop-in subclass of fsspec's `ReferenceFileSystem` that extends the Kerchunk reference spec with a fourth form: | Form | Format | Description | |------|--------|-------------| | Inline | `"base64:..."` or raw bytes | Inline data | | Whole file | `["url"]` | Entire file | | Single range | `["url", offset, length]` | One contiguous byte range | | **Multi-range** | `["url", [[offset, length], ...]]` | **Multiple non-contiguous byte ranges** | A multi-range entry is a 2-element list where the first element is the URL and the second is a list of `[offset, length]` pairs. Each pair identifies one tile-part's location in the file. The filesystem fetches all ranges and concatenates them in order before handing the bytes to the codec. For example, a tile with six tile-parts scattered across a file: ```json { "0/data/0.0.0": [ "s3://bucket/image.ntf", [[66132, 1518], [2534029, 3385], [7216065, 11460], [22566527, 38566], [74210429, 116812], [242293202, 339534]] ] } ``` The URL appears once rather than being repeated for each sub-range. For a file with 1,722 tiles and six tile-parts each, this saves roughly 775 KB of redundant URL strings compared to a flat list of single-range entries. ```{image} /_static/images/kerchunk-multirange.png :alt: Figure showing how J2K tile parts interleaved by resolution level are referenced by kerchunk index. :width: 700px :align: center ``` `MultiReferenceFileSystem` handles all standard reference types by delegating to the parent `ReferenceFileSystem`. When it encounters a multi-range entry, it performs scatter-gather I/O: - **Sync path** (`_cat_common`): fetches each byte range sequentially and concatenates the results. Adequate for local files and testing. - **Async path** (`_cat_file`): issues all byte-range fetches concurrently via `asyncio.gather` and concatenates in the original entry order. This is the path used by Zarr's async store and is critical for minimizing latency when reading from S3. Use `MultiReferenceFileSystem` instead of `ReferenceFileSystem` when your tile index may contain multi-range entries. It is fully backward-compatible — tile indexes with only standard single-range entries work identically. The index generator in this library automatically emits multi-range entries for tiles with interleaved tile-parts and single-range entries for everything else, so using `MultiReferenceFileSystem` as the default is the simplest approach. ```python from aws.osml.io.multi_reference_fs import MultiReferenceFileSystem fs = MultiReferenceFileSystem( fo="s3://bucket/image.tile_index.json", asynchronous=True, remote_options={"asynchronous": True}, skip_instance_cache=True, ) ``` The constructor accepts the same arguments as `ReferenceFileSystem`. Existing code that uses `fsspec.filesystem("reference", ...)` can switch by replacing the filesystem instantiation. ```{note} This multi-range reference format is a novel extension to the Kerchunk reference spec introduced by this project. The standard Kerchunk and Zarr ecosystem does not handle the case of non-contiguous byte ranges for a single chunk. If you encounter JPEG 2000 imagery with interleaved tile-parts elsewhere, `MultiReferenceFileSystem` is the component that makes it work. ``` ### Custom Codecs Once the filesystem delivers the raw bytes for a chunk, those bytes are still in the source format's native encoding. Standard Zarr codecs cannot decode them. This library registers four Zarr v3 codecs that handle the format-specific decoding: `JbpBlockCodec` handles uncompressed NITF tiles. NITF raw pixel data uses big-endian byte order and one of four interleave modes (band-interleaved by pixel, band-interleaved by line, band-interleaved by block, or band-sequential). The codec performs the endian swap and interleave conversion to produce standard NumPy arrays. `JpegCodec` decodes JPEG tiles. It carries format-specific parameters (color space, interleave mode, bits per pixel) in its configuration and passes them to the underlying Rust decoder. `Jpeg2000Codec` decodes JPEG 2000 tile data. JPEG 2000 codestreams support internal tiling, but the tiles are not self-contained. Each tile's compressed data (the "tile-part") contains only the wavelet coefficients. The decoding parameters — tile dimensions, quantization tables, wavelet decomposition levels, component counts — live in the codestream's main header (the SIZ, COD, and QCD markers). A decoder cannot reconstruct pixels from a tile-part alone. Existing Zarr JPEG 2000 codecs assume each chunk is a complete, self-contained codestream. They cannot decode a bare tile-part without the header. We solve this by inlining the shared main header (base64-encoded, typically 100–500 bytes) in the codec configuration stored in `.zarray`. At decode time the codec reconstructs a minimal single-tile codestream on the fly: ```{image} /_static/images/reconstructed-j2k-codestream.png :alt: Figure showing reconstruction of a single tile J2K codestream. :width: 700px :align: center ``` OpenJPEG receives what looks like a normal single-tile codestream and decodes it. This approach has precedent in the JPEG 2000 ecosystem. JPIP (the JPEG 2000 Interactive Protocol) streams individual tile-parts to clients that already hold the main header. Because the codec operates on the J2K codestream directly, it works for standalone `.j2k`/`.jp2` files and for J2K codestreams embedded in container formats like NITF. Note that the codec layer performs no I/O. When `MultiReferenceFileSystem` fetches and concatenates multiple tile-parts for an interleaved codestream, the codec receives the complete concatenated bytes and reconstructs the codestream exactly the same way. The filesystem handles the scatter-gather complexity so codecs remain simple bytes-to-bytes transforms. `TiffTileCodec` decodes compressed TIFF tiles. Individual compressed tiles extracted from a TIFF file cannot be decoded in isolation — the decoder needs IFD tag metadata (compression type, predictor, photometric interpretation, JPEG tables, etc.) that lives in the file header, not in the tile data itself. The codec stores the required IFD tag values in its configuration. At decode time it constructs a minimal single-tile TIFF in memory from the configuration and the compressed tile bytes, then hands it to libtiff via `TIFFClientOpen` + `TIFFReadEncodedTile`. This approach supports LZW, JPEG, Deflate, Adobe Deflate, and PackBits compression, including horizontal differencing predictors and YCbCr-to-RGB conversion for JPEG tiles. Uncompressed TIFF tiles (Compression=1) do not require a codec — Zarr reads the raw tile bytes directly. ```{image} /_static/images/reconstructed-single-tile-tiff.png :alt: Figure showing reconstruction of a single tile TIFF. :width: 700px :align: center ``` `DtedTileCodec` decodes DTED elevation data sections. DTED stores elevations as column-major signed-magnitude big-endian 16-bit integers with per-record headers and checksums interleaved in the data stream. The codec strips the record framing, converts signed-magnitude to native two's complement, and transposes column-major to row-major — producing a standard `(1, rows, cols)` Int16 array. The codec introduces a capability not found in existing Zarr codecs: **overlap-aware edge trimming**. DTED cells share boundary posts with their neighbors (the easternmost column of one cell duplicates the westernmost column of the next). The `trim_*` parameters discard these shared edges during decode, so the output chunks tile seamlessly without data duplication. This enables representing an entire DTED archive as a single contiguous Zarr array — each file becomes one chunk, edges are trimmed at decode time, and consumers see a seamless elevation surface with no preprocessing required. ```json { "name": "https://awslabs.github.io/osml-imagery-io/codecs/dted", "configuration": { "num_lat_points": 1201, "num_lon_lines": 1201, "record_size": 2414, "trim_bottom": 1, "trim_right": 1 } } ``` All five codecs are registered with the Zarr codec registry via Python entry points. They use URI-based names per the Zarr v3 specification to avoid conflicts with existing codecs: - `https://awslabs.github.io/osml-imagery-io/codecs/jpeg2000` - `https://awslabs.github.io/osml-imagery-io/codecs/jpeg` - `https://awslabs.github.io/osml-imagery-io/codecs/jbp-block` - `https://awslabs.github.io/osml-imagery-io/codecs/tiff-tile` - `https://awslabs.github.io/osml-imagery-io/codecs/dted` The URIs resolve to human-readable documentation. Implementations do not fetch them at runtime. ## Zarr Access to Image Pyramids ### The GeoZarr multiscales convention The GeoZarr [multiscales convention](https://github.com/zarr-conventions/multiscales) defines how multi-resolution image pyramids are stored in Zarr groups. Each resolution level is identified by an `asset` path in the `layout` array. The `asset` field uses Zarr path nomenclature: it can be a simple name referencing a direct child group or array (e.g., `"0"`, `"level1"`, `"full"`), or a path with `/` separators for nested resources (e.g., `"0/data"`). All paths are relative to the group containing the multiscales metadata. In our implementation, each asset path points to a numbered subgroup (`"0"`, `"1"`, `"2"`, ...) where level `"0"` is the highest resolution and each subgroup contains a single array named `"data"`. This structure is used consistently for all tile indexes — including single-resolution images, which are represented as a one-level pyramid. This means the access path ``root["0/data"]`` works the same way regardless of whether overviews are present. The root group's `.zattrs` carries metadata that describes the relationship between levels. The convention has three key parts: **Convention identity via `zarr_conventions`.** The root group attributes include a `zarr_conventions` array that declares which conventions the store conforms to. Each entry has a UUID, schema URL, spec URL, name, and description. This makes convention detection unambiguous — a consumer can check for UUID `d35379db-88df-4056-af3a-620245f8e347` to confirm the store uses the multiscales convention. **Layout via `multiscales`.** The `multiscales` attribute is an object (not an array) containing a `layout` array. Each entry in the layout describes one resolution level with an `asset` field (the path to the Zarr subgroup), an optional `derived_from` field (the parent level it was downsampled from), and a `transform` object with `scale` and `translation` arrays. **Relative transforms.** Scale factors are relative between adjacent levels, not absolute from level 0. If level 1 is derived from level 0 at 2× downsample, its `transform.scale` is `[2.0, 2.0]` (Y, X). If level 2 is also 2× from level 1, its scale is also `[2.0, 2.0]`. The scale and translation arrays have exactly two elements — the spatial Y and X axes. The bands axis is not included. ```{note} The GeoZarr effort also defines `proj:` (CRS information) and `spatial:` (affine transforms, bounding boxes) conventions. These are not yet implemented in this library and are planned for a future phase. When added, their entries will appear in the `zarr_conventions` array alongside the multiscales entry. ``` ### How pyramids map to the chunk index The tile index is always a Zarr group hierarchy with GeoZarr multiscales metadata, regardless of how many resolution levels exist. A single-resolution image is a one-level pyramid; adding overviews adds levels without changing how the base image is addressed: ``` image.tile_index.json ├── .zattrs ← GeoZarr multiscales metadata (zarr_conventions, layout) ├── .zgroup ├── 0/ ← Level 0 (full resolution) │ └── data/ │ ├── .zarray ← array metadata (shape, chunks, codecs) │ ├── 0.0.0 ← ["s3://bucket/image.tif", offset, length] │ ├── 0.0.1 │ ├── 0.1.0 │ └── 0.1.1 ├── 1/ ← Level 1 (2× downsampled) │ └── data/ │ ├── .zarray │ └── 0.0.0 └── 2/ ← Level 2 (4× downsampled) └── data/ ├── .zarray └── 0.0.0 ``` Each level has its own `.zarray` with the correct shape and chunk dimensions for that resolution. Chunk keys within a level follow the standard Zarr `bands.row.col` convention. The path prefix (`0/data/`, `1/data/`, `2/data/`) is what distinguishes chunks at different resolution levels. The root `.zattrs` carries the GeoZarr multiscales metadata that records the relationship between levels: ```json { "source": "s3://bucket/image.tif", "zarr_conventions": [ { "uuid": "d35379db-88df-4056-af3a-620245f8e347", "schema_url": "https://raw.githubusercontent.com/zarr-conventions/multiscales/refs/tags/v1/schema.json", "spec_url": "https://github.com/zarr-conventions/multiscales/blob/v1/README.md", "name": "multiscales", "description": "Multiscale layout of zarr datasets" } ], "multiscales": { "layout": [ { "asset": "0", "transform": {"scale": [1.0, 1.0], "translation": [0.0, 0.0]} }, { "asset": "1", "derived_from": "0", "transform": {"scale": [2.0, 2.0], "translation": [0.0, 0.0]} }, { "asset": "2", "derived_from": "1", "transform": {"scale": [2.0, 2.0], "translation": [0.0, 0.0]} } ], "resampling_method": "average" } } ``` The `zarr_conventions` array identifies the store as conforming to the GeoZarr multiscales convention. The `multiscales` object describes the pyramid: each entry in the `layout` array corresponds to one resolution level. Level 0 has no `derived_from` — it is the full-resolution original. Levels 1 and 2 each declare their parent and a relative `transform`. A scale of `[2.0, 2.0]` means each pixel in that level covers 2×2 pixels in the `derived_from` level. For a power-of-2 pyramid, every non-base level has the same relative scale. The optional `resampling_method` records how the overviews were generated. Consumers like `xarray.open_datatree()` read this metadata and present the pyramid as a `DataTree` with one node per level, each containing a lazily-loaded Dataset. ### Format-specific sources of resolution levels The parser produces the same hierarchical Zarr structure regardless of where the resolution levels come from. What differs is how the source format stores them. **Cloud Optimized GeoTIFF (COG):** A single file contains the full-resolution image and its overviews as separate IFDs. The TIFF reader exposes these as `image:0` (full resolution), `image:0:overview:1` (first overview), `image:0:overview:2` (second overview), etc. The parser detects the overview keys and builds the hierarchy automatically. All chunk references point to byte ranges within the same file, so a single URL suffices. **NITF R-sets:** Each resolution level is a separate file. The caller passes all files to the parser, and `IO.open()` detects the `.rN` filename pattern to key the assets correctly (`image:0` for the base, `image:0:overview:N` for each R-set file). Each level's chunk references point to its own file, so the caller provides one URL per file. **JPEG 2000 native resolution levels:** A single J2K codestream supports decoding at multiple resolution levels through wavelet decomposition. In the current implementation, the parser does not auto-expand these into pyramid levels — only explicitly provided assets (COG overview IFDs or R-set files) produce a hierarchy. Auto-expansion of J2K resolution levels is planned for a future phase, where each level would use the same codec with a different `resolution_level` parameter and reference only the tile-part byte ranges needed for that level. In all cases, the parser is a metadata-only operation. It does not decode pixels, generate new overview levels, or resample imagery. It describes how to address tiles that already exist in the source data. ## End-to-End Example: Single-Resolution NITF in S3 The following example shows the complete workflow for indexing a single NITF file and accessing its tiles through Zarr. ### Step 1: Generate the tile index Run this once per file, typically as part of an ingest pipeline. The file must be available locally for indexing. ```{note} Index generation requires the `virtualizarr` optional dependency: `pip install osml-imagery-io[virtualizarr]` ``` When the `url` parameter is omitted, the index stores only filenames with a `{{base}}` template placeholder instead of absolute S3 URLs. This makes the index portable — it can be created before the upload destination is known and resolved at read time. The serialized JSON uses the Kerchunk v1 `templates` feature, which `ReferenceFileSystem` and `MultiReferenceFileSystem` both support natively. ```python from aws.osml.io.virtualizarr_parsers import OversightMLParser, write_tile_index # Generate a portable index — no URL needed at index time parser = OversightMLParser(local_paths="local/image.ntf") store = parser() # Serialize as Kerchunk JSON (or .parquet) with multi-range support write_tile_index(store, "image.ntf.tile_index.json") ``` Upload both the image and the index to S3: ```python import boto3 s3 = boto3.client("s3") s3.upload_file("local/image.ntf", "my-bucket", "imagery/image.ntf") s3.upload_file( "image.ntf.tile_index.json", "my-bucket", "imagery/image.ntf.tile_index.json", ) ``` If you already know the final S3 location at index time, you can pass `url` directly and skip `template_overrides` at read time: ```python store = parser(url="s3://my-bucket/imagery/image.ntf") ``` ### Step 2: Open and access tiles Codec registration happens automatically when the package is installed with the `zarr` extras (`pip install osml-imagery-io[zarr]`). No explicit import is needed. Use `MultiReferenceFileSystem` to open the tile index. It handles both standard single-range entries and multi-range entries for JPEG 2000 images with interleaved tile-parts. When you slice into the dataset, the filesystem issues HTTP range requests for only the bytes backing the requested tiles and the registered codec decodes them into NumPy arrays. For portable indexes, pass `template_overrides` to resolve the `{{base}}` placeholder to the S3 directory containing the image (with a trailing slash). The same index works unchanged if the image moves to a different bucket or path — just update the override. ```python import zarr from aws.osml.io.multi_reference_fs import MultiReferenceFileSystem from zarr.storage._fsspec import FsspecStore fs = MultiReferenceFileSystem( fo="s3://my-bucket/imagery/image.ntf.tile_index.json", template_overrides={"base": "s3://my-bucket/imagery/"}, asynchronous=True, remote_options={"asynchronous": True, "profile": "my-profile"}, skip_instance_cache=True, ) store = FsspecStore(fs=fs, read_only=True, path="") root = zarr.open_group(store, mode="r", zarr_format=2) # Read a single tile region import numpy as np arr = root["0/data"] tile = np.asarray(arr[0:3, 768:1024, 1024:1280]) print(tile.shape) # (3, 256, 256) print(tile.dtype) # uint8 ``` AWS credentials can also be provided through environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_PROFILE`) or any other method supported by boto3 and fsspec. ## End-to-End Example: Multi-Resolution COG Pyramid This example shows how to index a Cloud Optimized GeoTIFF that contains embedded overview images and access tiles at different resolution levels through Zarr. A COG stores its overview images as additional IFDs within the same file. The parser detects these automatically and produces a hierarchical tile index with one subgroup per resolution level. ### Step 1: Generate the hierarchical tile index Portable indexes work with multi-file pyramids too. Each file gets its own `{{base}}filename` reference, and a single `template_overrides` resolves all of them at read time (assuming the files are co-located). ```python from aws.osml.io.virtualizarr_parsers import OversightMLParser, write_tile_index # A COG with embedded overviews — single file, portable index parser = OversightMLParser(local_paths="local/image.tif") store = parser() write_tile_index(store, "image.tif.tile_index.json") ``` For a COG with two overview levels, the resulting JSON contains subgroups `"0"` (full resolution), `"1"` (first overview), and `"2"` (second overview). All chunk references point to byte ranges within the same file. The same workflow works for multi-file NITF R-set pyramids. Pass multiple paths and the index will contain `{{base}}image.ntf`, `{{base}}image.ntf.r1`, etc.: ```python # Multi-file NITF pyramid (R-set convention) parser = OversightMLParser(local_paths=[ "local/image.ntf", "local/image.ntf.r1", "local/image.ntf.r2", ]) store = parser() write_tile_index(store, "image.ntf.tile_index.json") ``` ### Step 2: Open the pyramid and read tiles at different levels ```python import zarr from aws.osml.io.multi_reference_fs import MultiReferenceFileSystem from zarr.storage._fsspec import FsspecStore fs = MultiReferenceFileSystem( fo="s3://my-bucket/imagery/image.tif.tile_index.json", template_overrides={"base": "s3://my-bucket/imagery/"}, asynchronous=True, remote_options={"asynchronous": True}, skip_instance_cache=True, ) store = FsspecStore(fs=fs, read_only=True, path="") root = zarr.open_group(store, mode="r", zarr_format=2) # The root group contains numbered subgroups — one per resolution level import numpy as np # Read a tile from the full-resolution level level_0 = root["0/data"] print(level_0.shape) # e.g. (3, 8192, 8192) tile_full = np.asarray(level_0[0:3, 0:256, 0:256]) # Read the same spatial region from the first overview (2× downsampled) level_1 = root["1/data"] print(level_1.shape) # e.g. (3, 4096, 4096) tile_ovr1 = np.asarray(level_1[0:3, 0:128, 0:128]) # Read from the second overview (4× downsampled) level_2 = root["2/data"] print(level_2.shape) # e.g. (3, 2048, 2048) tile_ovr2 = np.asarray(level_2[0:3, 0:64, 0:64]) ``` You can also open the hierarchical index as an `xarray.DataTree` to get a structured view of all levels: ```python import xarray as xr dt = xr.open_datatree("image.tif.tile_index.json", engine="kerchunk") print(dt) # DataTree('None', parent=None) # ├── DataTree('0') # │ └── Dataset {'data': (bands: 3, y: 8192, x: 8192)} # ├── DataTree('1') # │ └── Dataset {'data': (bands: 3, y: 4096, x: 4096)} # └── DataTree('2') # └── Dataset {'data': (bands: 3, y: 2048, x: 2048)} # Access a specific level's dataset level_1_ds = dt["1"].ds print(level_1_ds["data"].shape) # (3, 4096, 4096) ``` For single-file inputs without overviews, the parser produces a one-level pyramid with a single subgroup `"0"`. The access path `root["0/data"]` is the same regardless of whether overviews are present, so adding overviews later does not change how existing code addresses the full-resolution image.