s3torchconnector

Classes

S3Reader

A read-only, file-like representation of a single object stored in S3.

S3Writer

A write-only, file-like representation of a single object stored in S3.

S3IterableDataset

An Iterable-Style dataset created from S3 objects.

S3MapDataset

A Map-Style dataset created from S3 objects.

S3Checkpoint

A checkpoint manager for S3.

S3ClientConfig

A dataclass exposing configurable parameters for the S3 client.

Package Contents

class s3torchconnector.S3Reader(bucket: str, key: str, get_object_info: Callable[[], s3torchconnectorclient._mountpoint_s3_client.ObjectInfo | s3torchconnectorclient._mountpoint_s3_client.HeadObjectResult], get_stream: Callable[[], s3torchconnectorclient._mountpoint_s3_client.GetObjectStream])

Bases: io.BufferedIOBase

A read-only, file-like representation of a single object stored in S3.

property bucket
property key
prefetch() -> None

Start fetching data from S3.

Raises:

S3Exception – An error occurred accessing S3.

readinto(buf) -> int

Read up to len(buf) bytes into a pre-allocated, writable bytes-like object buf. Return the number of bytes read. If no bytes are available, zero is returned.

Parameters:

buf – writable bytes-like object

Returns:

Number of bytes read, or zero if no bytes are available.

Return type:

int
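
As a minimal sketch of the readinto contract described above, the loop below refills one pre-allocated buffer until readinto returns zero. The helper name drain and the chunk size are hypothetical, and io.BytesIO stands in for an S3Reader so the example runs without S3 access.

```python
import io


def drain(stream, chunk_size=4):
    """Read a binary stream to the end through one reusable buffer."""
    buf = bytearray(chunk_size)      # pre-allocated, writable bytes-like object
    view = memoryview(buf)
    chunks = []
    total = 0
    while True:
        n = stream.readinto(view)    # number of bytes placed into buf
        if n == 0:                   # zero: no bytes are available
            break
        chunks.append(bytes(view[:n]))
        total += n
    return total, b"".join(chunks)


# io.BytesIO is only a stand-in; an S3Reader would be passed the same way.
total, data = drain(io.BytesIO(b"hello world"))
```

Reusing one buffer avoids allocating a new bytes object per chunk, which is the usual reason to prefer readinto over read in a tight loop.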

read(size: int | None = None) -> bytes

Read up to size bytes from the object and return them.

If size is zero or positive, read that many bytes from S3, or until the end of the object. If size is None or negative, read the entire file.

Parameters:

size (int | None) – how many bytes to read.

Returns:

Bytes read from S3 Object

Return type:

bytes

Raises:

S3Exception – An error occurred accessing S3.

seek(offset: int, whence: int = SEEK_SET, /) -> int

Change the stream position to the given byte offset, interpreted relative to whence.

When seeking beyond the end of the file, always stay at EOF. Seeking before the start of the file results in a ValueError.

Parameters:
  • offset (int) – How many bytes to seek relative to whence.

  • whence (int) – One of SEEK_SET, SEEK_CUR, and SEEK_END. Default: SEEK_SET

Returns:

Current position of the stream

Return type:

int

Raises:

S3Exception – An error occurred accessing S3.
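
The seek rules above can be modelled with a few lines of arithmetic. This is a sketch of the documented behaviour only, not the library's implementation; resolve_seek, position, and size are hypothetical names introduced for illustration.

```python
from io import SEEK_CUR, SEEK_END, SEEK_SET


def resolve_seek(position, size, offset, whence=SEEK_SET):
    """Return the new stream position under the documented seek rules.

    position is the current stream position and size is the object size,
    both in bytes. This models the rules; it is not S3Reader's own code.
    """
    if whence == SEEK_SET:
        target = offset
    elif whence == SEEK_CUR:
        target = position + offset
    elif whence == SEEK_END:
        target = size + offset
    else:
        raise ValueError(f"unsupported whence: {whence}")
    if target < 0:
        # Seeking before the start of the file results in a ValueError.
        raise ValueError("cannot seek before the start of the file")
    # Seeking beyond the end of the file stays at EOF.
    return min(target, size)


at_eof = resolve_seek(4, 10, 100, SEEK_CUR)    # clamped to the object size
from_end = resolve_seek(0, 10, -3, SEEK_END)   # three bytes before EOF
```

Note that this differs from io.BytesIO, which allows the position to move past the end of its buffer.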

tell() -> int
Returns:

Current stream position.

Return type:

int

readable() -> bool
Returns:

Return whether object was opened for reading.

Return type:

bool

writable() -> bool
Returns:

Return whether object was opened for writing.

Return type:

bool

class s3torchconnector.S3Writer(stream: s3torchconnectorclient._mountpoint_s3_client.PutObjectStream)

Bases: io.BufferedIOBase

A write-only, file-like representation of a single object stored in S3.

stream
write(data: bytes | memoryview) -> int

Write bytes to the S3 object specified by bucket and key.

Parameters:

data (bytes | memoryview) – bytes to write

Returns:

Number of bytes written

Return type:

int

Raises:

S3Exception – An error occurred accessing S3.

close()

Close write-stream to S3. Ensures all bytes are written successfully.

Raises:

S3Exception – An error occurred accessing S3.

flush()

No-op

readable() -> bool
Returns:

Return whether object was opened for reading.

Return type:

bool

writable() -> bool
Returns:

Return whether object was opened for writing.

Return type:

bool

tell() -> int
Returns:

Current stream position.

Return type:

int

class s3torchconnector.S3IterableDataset(region: str, get_dataset_objects: Callable[[s3torchconnector._s3client.S3Client], Iterable[s3torchconnector._s3bucket_key_data.S3BucketKeyData]], endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, enable_sharding: bool = False)

Bases: torch.utils.data.IterableDataset

An Iterable-Style dataset created from S3 objects.

To create an instance of S3IterableDataset, use the from_prefix or from_objects class methods.

property region
property endpoint
classmethod from_objects(object_uris: str | Iterable[str], *, region: str, endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, enable_sharding: bool = False)

Returns an instance of S3IterableDataset using the S3 URI(s) provided.

Parameters:
  • object_uris (str | Iterable[str]) – S3 URI of the object(s) desired.

  • region (str) – AWS region of the S3 bucket where the objects are stored.

  • endpoint (str) – AWS endpoint of the S3 bucket where the objects are stored.

  • transform – Optional callable which is used to transform an S3Reader into the desired type.

  • s3client_config – Optional S3ClientConfig with parameters for S3 client.

  • enable_sharding – If True, shard the dataset across multiple workers for parallel data loading. If False (default), each worker loads the entire dataset independently.

Returns:

An Iterable-Style dataset created from S3 objects.

Return type:

S3IterableDataset

Raises:

S3Exception – An error occurred accessing S3.

classmethod from_prefix(s3_uri: str, *, region: str, endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, enable_sharding: bool = False)

Returns an instance of S3IterableDataset using the S3 URI provided.

Parameters:
  • s3_uri (str) – An S3 URI (prefix) of the object(s) desired. Objects matching the prefix will be included in the returned dataset.

  • region (str) – AWS region of the S3 bucket where the objects are stored.

  • endpoint (str) – AWS endpoint of the S3 bucket where the objects are stored.

  • transform – Optional callable which is used to transform an S3Reader into the desired type.

  • s3client_config – Optional S3ClientConfig with parameters for S3 client.

  • enable_sharding – If True, shard the dataset across multiple workers for parallel data loading. If False (default), each worker loads the entire dataset independently.

Returns:

An Iterable-Style dataset created from S3 objects.

Return type:

S3IterableDataset

Raises:

S3Exception – An error occurred accessing S3.
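
To illustrate the transform parameter, here is a minimal sketch of a callable that turns each reader into a (key, text) pair and is passed to from_prefix. The names to_key_and_text, build_text_dataset, and FakeReader are hypothetical; FakeReader is a local stand-in that mimics only the key property and read() so the transform can be tried without S3 access.

```python
import io


def to_key_and_text(reader):
    """Transform: turn a reader into a (key, decoded text) pair."""
    return reader.key, reader.read().decode("utf-8")


def build_text_dataset(prefix_uri, region):
    """Build a dataset that yields (key, text) pairs for a prefix."""
    # Imported lazily so the transform can be exercised without the package.
    from s3torchconnector import S3IterableDataset

    return S3IterableDataset.from_prefix(
        prefix_uri, region=region, transform=to_key_and_text
    )


class FakeReader(io.BytesIO):
    """Local stand-in exposing the key property that S3Reader documents."""

    def __init__(self, key, payload):
        super().__init__(payload)
        self.key = key


# Try the transform on the stand-in; no network access is involved.
sample = to_key_and_text(FakeReader("notes/a.txt", b"first line\n"))
```

With real data, iterating over build_text_dataset("s3://<BUCKET>/<PREFIX>", region="<REGION>") would be expected to yield one such pair per object under the prefix.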

class s3torchconnector.S3MapDataset(region: str, get_dataset_objects: Callable[[s3torchconnector._s3client.S3Client], Iterable[s3torchconnector._s3bucket_key_data.S3BucketKeyData]], endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None)

Bases: torch.utils.data.Dataset

A Map-Style dataset created from S3 objects.

To create an instance of S3MapDataset, use the from_prefix or from_objects class methods.

property region
property endpoint
classmethod from_objects(object_uris: str | Iterable[str], *, region: str, endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None)

Returns an instance of S3MapDataset using the S3 URI(s) provided.

Parameters:
  • object_uris (str | Iterable[str]) – S3 URI of the object(s) desired.

  • region (str) – AWS region of the S3 bucket where the objects are stored.

  • endpoint (str) – AWS endpoint of the S3 bucket where the objects are stored.

  • transform – Optional callable which is used to transform an S3Reader into the desired type.

  • s3client_config – Optional S3ClientConfig with parameters for S3 client.

Returns:

A Map-Style dataset created from S3 objects.

Return type:

S3MapDataset

Raises:

S3Exception – An error occurred accessing S3.

classmethod from_prefix(s3_uri: str, *, region: str, endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None)

Returns an instance of S3MapDataset using the S3 URI provided.

Parameters:
  • s3_uri (str) – An S3 URI (prefix) of the object(s) desired. Objects matching the prefix will be included in the returned dataset.

  • region (str) – AWS region of the S3 bucket where the objects are stored.

  • endpoint (str) – AWS endpoint of the S3 bucket where the objects are stored.

  • transform – Optional callable which is used to transform an S3Reader into the desired type.

  • s3client_config – Optional S3ClientConfig with parameters for S3 client.

Returns:

A Map-Style dataset created from S3 objects.

Return type:

S3MapDataset

Raises:

S3Exception – An error occurred accessing S3.
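
A map-style dataset is addressed by integer index and has a length, which the following sketch relies on. The helpers sample_keys and build_map_dataset are hypothetical, and a plain list of stand-in objects replaces the dataset so the example runs locally. It assumes the default identity transform, so each item exposes the key property of S3Reader.

```python
from types import SimpleNamespace


def sample_keys(dataset, count):
    """Return the keys of the first `count` items using integer indexing."""
    limit = min(count, len(dataset))
    return [dataset[i].key for i in range(limit)]


def build_map_dataset(object_uris, region):
    """Build a map-style dataset from explicit S3 URIs."""
    # Imported lazily so sample_keys can be tried without the package.
    from s3torchconnector import S3MapDataset

    return S3MapDataset.from_objects(object_uris, region=region)


# Stand-in items with a key attribute; not real S3Reader objects.
stand_in = [SimpleNamespace(key=name) for name in ("a.bin", "b.bin", "c.bin")]
keys = sample_keys(stand_in, 2)
```

Random access by index is what distinguishes this class from S3IterableDataset, which is consumed only by iteration.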

class s3torchconnector.S3Checkpoint(region: str, endpoint: str | None = None, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None)

A checkpoint manager for S3.

To read a checkpoint from S3, create an S3Reader by passing the s3_uri of the stored checkpoint to reader. Similarly, to save a checkpoint to S3, create an S3Writer by passing an s3_uri to writer. S3Reader and S3Writer implement io.BufferedIOBase, so they can be passed directly to torch.load and torch.save.

region
endpoint = None
reader(s3_uri: str) -> s3torchconnector.S3Reader

Creates an S3Reader from a given s3_uri.

Parameters:

s3_uri (str) – A valid s3_uri. (i.e. s3://<BUCKET>/<KEY>)

Returns:

a read-only binary stream of the S3 object’s contents, specified by the s3_uri.

Return type:

S3Reader

Raises:

S3Exception – An error occurred accessing S3.

writer(s3_uri: str) -> s3torchconnector.S3Writer

Creates an S3Writer from a given s3_uri.

Parameters:

s3_uri (str) – A valid s3_uri. (i.e. s3://<BUCKET>/<KEY>)

Returns:

a write-only binary stream. The content is saved to S3 using the specified s3_uri.

Return type:

S3Writer

Raises:

S3Exception – An error occurred accessing S3.
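
The save and load round trip through reader and writer can be sketched as follows. To keep the example runnable without AWS access or PyTorch, it uses pickle as the serializer and a hypothetical InMemoryCheckpoint class that mimics only the reader()/writer() methods; save_checkpoint and load_checkpoint are likewise illustrative names, not part of the package.

```python
import io
import pickle


def save_checkpoint(checkpoint, s3_uri, state, dump=pickle.dump):
    """Serialize state into the stream returned by checkpoint.writer()."""
    # Leaving the with-block calls close() on the writer.
    with checkpoint.writer(s3_uri) as writer:
        dump(state, writer)


def load_checkpoint(checkpoint, s3_uri, load=pickle.load):
    """Deserialize state from the stream returned by checkpoint.reader()."""
    with checkpoint.reader(s3_uri) as reader:
        return load(reader)


class InMemoryCheckpoint:
    """Hypothetical stand-in with the same reader()/writer() methods."""

    def __init__(self):
        self.objects = {}

    def writer(self, s3_uri):
        objects = self.objects

        class _Writer(io.BytesIO):
            def close(self):
                if not self.closed:
                    objects[s3_uri] = self.getvalue()  # keep bytes on close
                super().close()

        return _Writer()

    def reader(self, s3_uri):
        return io.BytesIO(self.objects[s3_uri])


store = InMemoryCheckpoint()
save_checkpoint(store, "s3://<BUCKET>/<KEY>", {"epoch": 3, "loss": 0.25})
restored = load_checkpoint(store, "s3://<BUCKET>/<KEY>")
```

With the real classes, the same helpers would presumably be called with an S3Checkpoint(region=...) instance and with dump=torch.save and load=torch.load, since both accept a file-like object in the same argument position.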

class s3torchconnector.S3ClientConfig

A dataclass exposing configurable parameters for the S3 client.

Parameters:
  • throughput_target_gbps (float) – Throughput target in gigabits per second (Gbps) that the client tries to reach. 10.0 Gbps by default (may change in future).

  • part_size (int) – Size (bytes) of file parts that will be uploaded/downloaded. Note: for saving checkpoints, the inner client will adjust the part size to meet the service limits (the maximum number of parts per upload is 10,000 and the minimum upload part size is 5 MiB). Part size must be between 5 MiB and 5 GiB. 8 MiB by default (may change in future).

  • force_path_style (bool) – Force path-style addressing for the S3 client.

throughput_target_gbps: float = 10.0
part_size: int = 8388608
unsigned: bool = False
force_path_style: bool = False
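
The service limits quoted for part_size imply some simple arithmetic, worked through below. This is an illustration of the stated limits only, not the inner client's actual adjustment logic; smallest_valid_part_size and the constant names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

MAX_PARTS = 10_000            # maximum number of parts per upload
MIN_PART_SIZE = 5 * MIB       # minimum upload part size
DEFAULT_PART_SIZE = 8 * MIB   # 8388608 bytes, the documented default


def smallest_valid_part_size(object_size):
    """Smallest part size (bytes) that fits object_size into MAX_PARTS parts."""
    needed = -(-object_size // MAX_PARTS)   # ceiling division
    return max(needed, MIN_PART_SIZE)


# Largest upload that fits in 10,000 parts at the default part size.
largest_with_default = MAX_PARTS * DEFAULT_PART_SIZE

# A 100 GiB checkpoint needs parts larger than the 8 MiB default.
part_for_100_gib = smallest_valid_part_size(100 * GIB)
```

At the default part size an upload tops out at 83,886,080,000 bytes (78.125 GiB), which is why larger checkpoints need a bigger part size. A value computed this way could be supplied explicitly as S3ClientConfig(part_size=...).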