VirtualiZarr Parsers
VirtualiZarr parser for generating virtual Zarr datasets from imagery files.
OversightMLParser implements the VirtualiZarr Parser protocol and produces
ManifestStore objects that can be serialized to Kerchunk JSON indices. It works
for any format supported by IO.open(): NITF, standalone JPEG 2000, TIFF, and
GeoTIFF.
The parser supports both single-file and multi-file inputs:
- Single file: pass a single path and URL. If the file contains overview assets (e.g. COG overview IFDs), the parser builds a hierarchical store automatically. Otherwise it produces a flat store.
- Multi-file pyramid: pass a list of paths and URLs, one per resolution level. The parser builds a hierarchical store with GeoZarr multiscales metadata describing the pyramid structure.
Note
virtualizarr is an optional dependency. Install with pip install osml-imagery-io[virtualizarr]
to enable parser support.
OversightMLParser
- class aws.osml.io.virtualizarr_parsers.OversightMLParser(local_paths)
Bases: object
VirtualiZarr parser for any imagery format supported by IO.open().
Supports NITF (2.0, 2.1, NSIF 1.0, SICD, SIDD), standalone JPEG 2000 (.j2k, .jp2), TIFF, and GeoTIFF. Format detection is handled by IO.open(); the parser itself is format-agnostic.
- Parameters:
local_paths (str | list[str]) – Path(s) to the local imagery file(s) to scan. A single string is wrapped in a list automatically. For multi-file pyramids, pass one path per resolution level.
Examples
Portable index (no URL needed at index time):
parser = OversightMLParser(local_paths="/data/image.ntf")
manifest_store = parser()  # refs use {{base}}filename.ntf
Absolute URL index:
parser = OversightMLParser(local_paths="/data/image.ntf")
manifest_store = parser(url="s3://bucket/image.ntf")
Multi-file pyramid:
parser = OversightMLParser(local_paths=["/data/image.ntf", "/data/image.ntf.r1"])
manifest_store = parser(url=["s3://bucket/image.ntf", "s3://bucket/image.ntf.r1"])
Constructor
OversightMLParser(local_paths) accepts either a single path string or a list
of paths. A single string is wrapped in a list internally.
# Single file
parser = OversightMLParser(local_paths="/data/image.ntf")
# Multi-file pyramid (one file per resolution level)
parser = OversightMLParser(local_paths=["/data/image.ntf", "/data/image.ntf.r1"])
Calling the parser
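parser(url) accepts a single URL string, a list of URLs, or no URL at all. A
single URL is used for all chunk references. A list must have the same length
as local_paths: each URL corresponds to the local path at the same index. When
url is omitted, the parser produces a portable index whose chunk references use
a {{base}} placeholder in place of an absolute URL.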
# Single URL — used for all assets
store = parser(url="s3://bucket/image.ntf")
# Multiple URLs — one per file in the pyramid
store = parser(url=["s3://bucket/image.ntf", "s3://bucket/image.ntf.r1"])
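The pairing rule above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the parser's actual implementation; the helper name pair_paths_with_urls is hypothetical and is not part of the library.

```python
from __future__ import annotations

from typing import Optional


def pair_paths_with_urls(
    local_paths: str | list[str],
    url: str | list[str] | None = None,
) -> list[tuple[str, Optional[str]]]:
    """Hypothetical helper: pair each local path with the URL for its chunk refs."""
    # A single path string is wrapped in a list, as the constructor does.
    paths = [local_paths] if isinstance(local_paths, str) else list(local_paths)

    if url is None:
        # Portable mode: no URL is known at index time.
        return [(path, None) for path in paths]
    if isinstance(url, str):
        # A single URL is used for all chunk references.
        return [(path, url) for path in paths]
    if len(url) != len(paths):
        raise ValueError(
            f"expected {len(paths)} URL(s), one per local path, got {len(url)}"
        )
    # URL i corresponds to local path i.
    return list(zip(paths, url))
```

Checking the lengths up front means a mismatched pyramid fails before any file is scanned.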
Flat vs hierarchical output
When the parser detects overview assets (keys matching image:N:overview:M),
it produces a hierarchical ManifestStore with one subgroup per resolution
level. Otherwise it produces a flat store with arrays at the root — identical
to the pre-multiscale behavior.
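As a rough illustration of the detection step, the sketch below tests a list of asset keys against the image:N:overview:M pattern. The regular expression, the helper name has_overviews, and the sample keys are assumptions made for this example; the parser's own matching logic may differ.

```python
import re

# Assumed shape of an overview key: image:<N>:overview:<M>, with integer N and M.
OVERVIEW_KEY = re.compile(r"^image:(\d+):overview:(\d+)$")


def has_overviews(asset_keys):
    """Return True if any key names an overview asset (hypothetical helper)."""
    return any(OVERVIEW_KEY.match(key) for key in asset_keys)


# Illustrative keys only: a file with one overview, and a file with none.
print(has_overviews(["image:0", "image:0:overview:1"]))  # hierarchical store
print(has_overviews(["image:0"]))                        # flat store
```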
For hierarchical stores, each subgroup contains a single array named "data",
and the root group’s attributes include GeoZarr multiscales metadata and a
zarr_conventions array declaring convention identity:
ManifestGroup (root)
├── groups:
│ ├── "0" → ManifestGroup(arrays={"data": level_0_array})
│ ├── "1" → ManifestGroup(arrays={"data": level_1_array})
│ └── "2" → ManifestGroup(arrays={"data": level_2_array})
└── attributes:
├── "source": "s3://bucket/image.ntf"
├── "zarr_conventions": [{ ... }]
└── "multiscales": { ... }
multiscales metadata structure
The root group’s multiscales attribute conforms to the
GeoZarr multiscales convention
(UUID d35379db-88df-4056-af3a-620245f8e347). It contains:
- layout: one entry per resolution level, with an asset path matching the subgroup name, an optional derived_from referencing the parent level, and a transform object with relative scale and translation arrays
- resampling_method: optional; recorded when a downsampling_method keyword argument is provided to the parser
Scale transforms use relative factors between adjacent levels (not absolute from
level 0). The scale and translation arrays have two elements: [Y, X].
A zarr_conventions array in the root attributes declares convention identity:
{
"source": "s3://bucket/image.tif",
"zarr_conventions": [
{
"uuid": "d35379db-88df-4056-af3a-620245f8e347",
"schema_url": "https://raw.githubusercontent.com/zarr-conventions/multiscales/refs/tags/v1/schema.json",
"spec_url": "https://github.com/zarr-conventions/multiscales/blob/v1/README.md",
"name": "multiscales",
"description": "Multiscale layout of zarr datasets"
}
],
"multiscales": {
"layout": [
{
"asset": "0",
"transform": {"scale": [1.0, 1.0], "translation": [0.0, 0.0]}
},
{
"asset": "1",
"derived_from": "0",
"transform": {"scale": [2.0, 2.0], "translation": [0.0, 0.0]}
}
],
"resampling_method": "average"
}
}
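Because each scale is relative to the level named in derived_from, a reader that needs the factor relative to level 0 has to multiply along the chain. The sketch below shows one way to do that for layout entries shaped like the example above. The helper name absolute_scales and the three-level sample layout are hypothetical, and the sketch assumes that a parent level always appears before its children in layout.

```python
def absolute_scales(layout):
    """Accumulate relative [Y, X] scale factors into factors relative to the root."""
    absolute = {}
    for entry in layout:
        rel_y, rel_x = entry["transform"]["scale"]
        parent = entry.get("derived_from")
        if parent is None:
            # Root level: its scale is already absolute.
            absolute[entry["asset"]] = [rel_y, rel_x]
        else:
            # Child level: multiply by the parent's accumulated factor.
            base_y, base_x = absolute[parent]
            absolute[entry["asset"]] = [base_y * rel_y, base_x * rel_x]
    return absolute


# Illustrative three-level pyramid, each level half the resolution of its parent.
layout = [
    {"asset": "0", "transform": {"scale": [1.0, 1.0], "translation": [0.0, 0.0]}},
    {"asset": "1", "derived_from": "0",
     "transform": {"scale": [2.0, 2.0], "translation": [0.0, 0.0]}},
    {"asset": "2", "derived_from": "1",
     "transform": {"scale": [2.0, 2.0], "translation": [0.0, 0.0]}},
]
print(absolute_scales(layout))
```

With relative factors, level "2" carries a scale of [2.0, 2.0] in the metadata even though it is four times coarser than level "0".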
write_tile_index
- aws.osml.io.virtualizarr_parsers.write_tile_index(store, output, segments=None)
Write a tile index to JSON or Parquet with multi-range support.
This is the recommended way to serialize a ManifestStore produced by OversightMLParser. It handles the multi-range reference entries that VirtualiZarr’s built-in serialization does not support.
When the store was created with url=None (portable mode), the serialized output includes a Kerchunk v1 "templates" dict with {"base": ""} so that {{base}} placeholders in chunk reference URLs can be resolved at read time via template_overrides.
- Parameters:
store (ManifestStore) – The manifest store returned by OversightMLParser().
output (str) – Output file path. Extension determines format: .json for Kerchunk JSON, .parquet for Kerchunk Parquet.
segments (list[str] | None) – Subgroup keys to include (e.g. ["0", "2"]). If None, all subgroups are included.
- Raises:
ValueError – If the output extension is not .json or .parquet, or if a requested segment is not found.
Examples
Portable index (resolve URL at read time):
parser = OversightMLParser(local_paths="local/image.ntf")
store = parser()  # no url: portable mode
write_tile_index(store, "image.tile_index.json")
Absolute URL index:
parser = OversightMLParser(local_paths="local/image.ntf")
store = parser(url="s3://my-bucket/imagery/image.ntf")
write_tile_index(store, "image.tile_index.json")
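To make the portable-mode placeholder concrete, the sketch below resolves {{base}} by hand in an already loaded index, using only the standard library. It is not how a Kerchunk reader or write_tile_index performs the substitution. The helper name resolve_base, the sample index, and its key names are hypothetical. The sketch also assumes standard Kerchunk entries of the form [url, offset, length]; the layout of multi-range entries is not described on this page, so they are not handled.

```python
def resolve_base(index, base):
    """Return a copy of index["refs"] with {{base}} replaced by `base`."""
    resolved = {}
    for key, ref in index["refs"].items():
        if isinstance(ref, list) and ref and isinstance(ref[0], str):
            # Assumed standard Kerchunk entry: [url, offset, length].
            resolved[key] = [ref[0].replace("{{base}}", base), *ref[1:]]
        else:
            # Inline values (e.g. JSON metadata strings) are left unchanged.
            resolved[key] = ref
    return resolved


# Illustrative portable index; keys, offsets, and lengths are made up.
index = {
    "version": 1,
    "templates": {"base": ""},
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "data/0.0.0": ["{{base}}image.ntf", 1024, 4096],
    },
}
print(resolve_base(index, "s3://my-bucket/imagery/")["data/0.0.0"])
```

Because the placeholder is a prefix, the same index file can be pointed at a local directory, an S3 prefix, or an HTTP mirror without being rewritten.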
write_tile_index() automatically detects whether the store is flat or
hierarchical and serializes accordingly. For hierarchical stores, the output
Kerchunk JSON uses path-prefixed keys (e.g. 0/data/0.0.0, 1/data/0.0.0)
and includes the root multiscales metadata in .zattrs.
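The path-prefixed key layout makes it straightforward to split a hierarchical index by resolution level. The sketch below groups reference keys by their leading path component and optionally keeps only the requested levels, mirroring the documented segments parameter. The helper name select_segments and the sample refs are hypothetical; this is not the library's implementation.

```python
def select_segments(refs, segments=None):
    """Group path-prefixed keys by subgroup and optionally filter by segment."""
    grouped = {}
    for key, value in refs.items():
        prefix, sep, _ = key.partition("/")
        if sep:
            # Keys such as "0/data/0.0.0" belong to subgroup "0".
            grouped.setdefault(prefix, {})[key] = value
        # Root-level keys (e.g. ".zattrs") have no prefix and are skipped here.

    if segments is None:
        return grouped
    missing = [name for name in segments if name not in grouped]
    if missing:
        raise ValueError(f"segment(s) not found: {missing}")
    return {name: grouped[name] for name in segments}


# Illustrative refs only; the values are placeholders.
refs = {
    ".zattrs": "{}",
    "0/data/0.0.0": ["s3://bucket/image.ntf", 0, 100],
    "1/data/0.0.0": ["s3://bucket/image.ntf.r1", 0, 25],
}
print(sorted(select_segments(refs)))
print(sorted(select_segments(refs, ["1"])))
```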