s3torchconnector

Classes

S3Reader

An abstract base class for read-only, file-like representation of a single object stored in S3.

S3ReaderConstructor

Constructor for creating partial(S3Reader) instances.

S3Writer

A write-only, file-like representation of a single object stored in S3.

S3IterableDataset

An Iterable-Style dataset created from S3 objects.

S3MapDataset

A Map-Style dataset created from S3 objects.

Package Contents

class s3torchconnector.S3Reader

Bases: abc.ABC, io.BufferedIOBase

An abstract base class for read-only, file-like representation of a single object stored in S3.

This class defines the interface for S3 readers. Concrete implementations (SequentialS3Reader or RangedS3Reader) extend this class. S3ReaderConstructor creates partial functions of these implementations, which are then completed by S3Client with the remaining required parameters.

abstract property bucket: str

abstract property key: str

abstract read(size: int | None = None) → bytes

Read and return up to size bytes.

If the argument is omitted, None, or negative, reads and returns all data until EOF.

If the argument is positive, and the underlying raw stream is not ‘interactive’, multiple raw reads may be issued to satisfy the byte count (unless EOF is reached first). But for interactive raw streams (as well as sockets and pipes), at most one raw read will be issued, and a short result does not imply that EOF is imminent.

Returns an empty bytes object on EOF.

Returns None if the underlying raw stream was open in non-blocking mode and no data is available at the moment.

abstract seek(offset: int, whence: int = SEEK_SET, /) → int

Change stream position.

Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:

  • 0 – start of stream (the default); offset should be zero or positive

  • 1 – current stream position; offset may be negative

  • 2 – end of stream; offset is usually negative

Return the new absolute position.
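The three whence values follow the standard io module convention, so they can be tried on any seekable stream. The sketch below uses io.BytesIO purely as a local stand-in for an S3Reader; no S3 access is involved, and the byte string is made-up sample data.

```python
import io

# io.BytesIO stands in for an S3Reader; both follow the io seek/tell contract.
stream = io.BytesIO(b"0123456789")

pos_start = stream.seek(3, io.SEEK_SET)     # whence=0: offset from start of stream
pos_current = stream.seek(2, io.SEEK_CUR)   # whence=1: offset from current position
pos_end = stream.seek(-4, io.SEEK_END)      # whence=2: offset from end of stream

remaining = stream.read()                   # reads from the new position to EOF
```

Each call returns the new absolute position, which is also what tell() reports afterwards.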

abstract tell() → int

Return current stream position.

abstract readinto(buf) → int

readable() → bool
Returns:

Return whether object was opened for reading.

Return type:

bool

writable() → bool
Returns:

Return whether object was opened for writing.

Return type:

bool
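To make the interface above concrete, here is a minimal sketch of a class that exposes the same members (bucket, key, read, readinto, seek, tell, readable, writable) over an in-memory byte string. InMemoryReader and its constructor arguments are hypothetical and exist only for illustration; the library's own implementations are SequentialS3Reader and RangedS3Reader, and they are created by S3Client rather than directly.

```python
import io


class InMemoryReader(io.BufferedIOBase):
    """Hypothetical stand-in exposing the same read-only interface as S3Reader."""

    def __init__(self, bucket: str, key: str, data: bytes):
        self._bucket = bucket
        self._key = key
        self._data = data
        self._position = 0

    @property
    def bucket(self) -> str:
        return self._bucket

    @property
    def key(self) -> str:
        return self._key

    def read(self, size=None) -> bytes:
        if self._position >= len(self._data):
            return b""  # empty bytes object signals EOF
        # Omitted, None, or negative size means "read until EOF".
        if size is None or size < 0:
            end = len(self._data)
        else:
            end = min(self._position + size, len(self._data))
        chunk = self._data[self._position:end]
        self._position = end
        return chunk

    def readinto(self, buf) -> int:
        # Fill the caller's buffer and report how many bytes were copied.
        chunk = self.read(len(buf))
        buf[:len(chunk)] = chunk
        return len(chunk)

    def seek(self, offset: int, whence: int = io.SEEK_SET, /) -> int:
        bases = {
            io.SEEK_SET: 0,
            io.SEEK_CUR: self._position,
            io.SEEK_END: len(self._data),
        }
        self._position = max(0, bases[whence] + offset)
        return self._position

    def tell(self) -> int:
        return self._position

    def readable(self) -> bool:
        return True

    def writable(self) -> bool:
        return False


reader = InMemoryReader("example-bucket", "example-key", b"hello world")
first = reader.read(5)            # first five bytes
reader.seek(-5, io.SEEK_END)      # jump to five bytes before the end
last = reader.read()              # read to EOF
```

A stand-in like this can be handy for unit-testing a transform callable without touching S3.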

class s3torchconnector.S3ReaderConstructor

Constructor for creating partial(S3Reader) instances.

Creates partial S3Reader instances that will be completed by S3Client with the remaining required parameters (e.g. bucket, key, get_object_info, get_stream).

The constructor provides factory methods for different reader types:

  • sequential(): Creates a constructor for sequential readers that buffer the entire object. Best for full reads and repeated access.

  • range_based(): Creates a constructor for range-based readers that fetch specific byte ranges. Suitable for sparse partial reads for large objects.
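The two-step construction described above can be pictured with functools.partial. This is only a sketch of the pattern: SketchReader and its parameters are hypothetical, and the real partials are produced by S3ReaderConstructor and completed by S3Client.

```python
from functools import partial


class SketchReader:
    """Hypothetical reader class used only to illustrate the partial pattern."""

    def __init__(self, bucket: str, key: str, buffer_size: int = 0):
        self.bucket = bucket
        self.key = key
        self.buffer_size = buffer_size


# Step 1: the caller fixes only the reader-specific options.
reader_constructor = partial(SketchReader, buffer_size=1024)

# Step 2: a client later supplies the per-object arguments.
reader = reader_constructor(bucket="example-bucket", key="data/item.bin")
```

Splitting construction this way lets a dataset accept a single reader_constructor argument while the per-object details are filled in later for every object it opens.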

static sequential() → s3torchconnector.s3reader.protocol.S3ReaderConstructorProtocol

Creates a constructor for sequential readers

Returns:

Partial constructor for SequentialS3Reader

Return type:

S3ReaderConstructorProtocol

Example:

reader_constructor = S3ReaderConstructor.sequential()

static range_based(buffer_size: int | None = None) → s3torchconnector.s3reader.protocol.S3ReaderConstructorProtocol

Creates a constructor for range-based readers

Parameters:

buffer_size – Internal buffer size in bytes. If None, uses default 8MB. Set to 0 to disable buffering.

Returns:

Partial constructor for RangedS3Reader

Return type:

S3ReaderConstructorProtocol

Range-based reader performs byte-range requests to read specific portions of S3 objects without downloading the entire file.

Buffer size affects read performance:

  • Small reads (< buffer_size): Loads buffer_size bytes into the buffer to reduce S3 API calls for small, sequential reads

  • Large reads (≥ buffer_size): Bypasses the buffer for direct transfer from S3

  • Forward overlap reads: Reuses buffered data when reading ranges that extend beyond the current buffer, and processes the remaining data according to its size, using the logic above

Configuration Guide:

  • Use larger buffer sizes for workloads with many small, sequential reads of nearby bytes

  • Use smaller buffer sizes or disable buffering for sparse partial reads

  • Buffer can be disabled by setting buffer_size to 0

  • If buffer_size is None, uses default 8MB buffer
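The trade-off behind this guidance can be explored with a small simulation. The model below is a deliberately simplified reading of the buffering rules listed above, not the library's implementation: count_requests, the read patterns, and the 1000-byte buffer are all assumptions chosen to keep the arithmetic easy to follow.

```python
def count_requests(reads, buffer_size):
    """Simulate (start, length) reads; return (range requests, bytes fetched)."""
    buf_start, buf_end = 0, 0   # byte range held in the buffer (empty at first)
    requests = 0
    bytes_fetched = 0
    for start, length in reads:
        end = start + length
        # Forward overlap: reuse buffered bytes at the front of the range.
        if buf_start <= start < buf_end:
            start = min(end, buf_end)
        remaining = end - start
        if remaining == 0:
            continue            # served entirely from the buffer
        requests += 1
        if buffer_size > 0 and remaining < buffer_size:
            # Small read: fetch a full buffer so nearby reads can reuse it.
            buf_start, buf_end = start, start + buffer_size
            bytes_fetched += buffer_size
        else:
            # Large read or buffering disabled: direct transfer, buffer untouched.
            bytes_fetched += remaining
    return requests, bytes_fetched


# 50 reads of 100 bytes each, back to back versus one megabyte apart.
sequential_small = [(i * 100, 100) for i in range(50)]
sparse_small = [(i * 1_000_000, 100) for i in range(50)]

seq_buffered = count_requests(sequential_small, buffer_size=1000)
seq_unbuffered = count_requests(sequential_small, buffer_size=0)
sparse_buffered = count_requests(sparse_small, buffer_size=1000)
sparse_unbuffered = count_requests(sparse_small, buffer_size=0)
```

In this model, buffering cuts the sequential pattern from 50 requests to 5 for the same 5000 bytes, while for the sparse pattern it keeps the request count at 50 and fetches ten times as many bytes, which matches the advice to shrink or disable the buffer for sparse reads.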

Examples:

# Range-based reader with default 8MB buffer
reader_constructor = S3ReaderConstructor.range_based()

# Range-based reader with custom buffer size
reader_constructor = S3ReaderConstructor.range_based(buffer_size=16*1024*1024)

# Range-based reader with buffering disabled
reader_constructor = S3ReaderConstructor.range_based(buffer_size=0)

static default() → s3torchconnector.s3reader.protocol.S3ReaderConstructorProtocol

Creates default reader constructor (sequential)

Returns:

Partial constructor for SequentialS3Reader

Return type:

S3ReaderConstructorProtocol

static get_reader_type_string(constructor: s3torchconnector.s3reader.protocol.S3ReaderConstructorProtocol | None) → str

Returns the reader type string for the given constructor.

class s3torchconnector.S3Writer(stream: s3torchconnectorclient._mountpoint_s3_client.PutObjectStream)

Bases: io.BufferedIOBase

A write-only, file-like representation of a single object stored in S3.

stream

write(data: bytes | memoryview) → int

Write bytes to S3 Object specified by bucket and key

Parameters:

data (bytes | memoryview) – bytes to write

Returns:

Number of bytes written

Return type:

int

Raises:

S3Exception – An error occurred accessing S3.

close()

Close write-stream to S3. Ensures all bytes are written successfully.

Raises:

S3Exception – An error occurred accessing S3.

flush()

No-op

readable() → bool
Returns:

Return whether object was opened for reading.

Return type:

bool

writable() → bool
Returns:

Return whether object was opened for writing.

Return type:

bool

tell() → int
Returns:

Current stream position.

Return type:

int
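Because close() is the step that ensures all bytes are written, it helps to drive a writer through a with block, which any io.BufferedIOBase subclass supports and which calls close() on exit. The sketch below shows that pattern against a hypothetical in-memory stand-in; CountingWriter and write_all are illustrative names and do not send anything to S3.

```python
import io


class CountingWriter(io.BufferedIOBase):
    """Hypothetical write-only stand-in that keeps bytes in memory."""

    def __init__(self):
        self._chunks = []
        self._position = 0

    def write(self, data) -> int:
        payload = bytes(data)          # accepts bytes or memoryview
        self._chunks.append(payload)
        self._position += len(payload)
        return len(payload)

    def tell(self) -> int:
        return self._position

    def readable(self) -> bool:
        return False

    def writable(self) -> bool:
        return True


def write_all(writer, chunks) -> int:
    """Write every chunk, then close the writer; return the bytes written."""
    with writer:                       # close() runs when the block exits
        for chunk in chunks:
            writer.write(chunk)
        return writer.tell()


writer = CountingWriter()
written = write_all(writer, [b"abc", memoryview(b"defg")])
```

The same write_all helper should work unchanged with any writable file-like object, since it relies only on write(), tell(), and the context-manager protocol.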

class s3torchconnector.S3IterableDataset(region: str, get_dataset_objects: Callable[[s3torchconnector._s3client.S3Client], Iterable[s3torchconnector._s3bucket_key_data.S3BucketKeyData]], endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, enable_sharding: bool = False, reader_constructor: s3torchconnector.s3reader.S3ReaderConstructorProtocol | None = None)

Bases: torch.utils.data.IterableDataset

An Iterable-Style dataset created from S3 objects.

To create an instance of S3IterableDataset, use the from_prefix or from_objects class methods.

property region

property endpoint

classmethod from_objects(object_uris: str | Iterable[str], *, region: str, endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, enable_sharding: bool = False, reader_constructor: s3torchconnector.s3reader.S3ReaderConstructorProtocol | None = None)

Returns an instance of S3IterableDataset using the S3 URI(s) provided.

Parameters:
  • object_uris (str | Iterable[str]) – S3 URI of the object(s) desired.

  • region (str) – AWS region of the S3 bucket where the objects are stored.

  • endpoint (str) – AWS endpoint of the S3 bucket where the objects are stored.

  • transform – Optional callable which is used to transform an S3Reader into the desired type.

  • s3client_config – Optional S3ClientConfig with parameters for S3 client.

  • enable_sharding – If True, shard the dataset across multiple workers for parallel data loading. If False (default), each worker loads the entire dataset independently.

  • reader_constructor (Optional[S3ReaderConstructorProtocol]) – Optional partial(S3Reader) created using S3ReaderConstructor e.g. S3ReaderConstructor.sequential() or S3ReaderConstructor.range_based()

Returns:

An Iterable-Style dataset created from S3 objects.

Return type:

S3IterableDataset

Raises:

S3Exception – An error occurred accessing S3.
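The effect of enable_sharding can be pictured with a small sketch. The round-robin split below is an assumption made for illustration; this page does not say how the library actually distributes objects, and objects_for_worker, worker_id, and num_workers are hypothetical names.

```python
from itertools import islice


def objects_for_worker(object_uris, worker_id, num_workers, enable_sharding):
    """Return the URIs one worker would load under an assumed round-robin split."""
    if not enable_sharding:
        # Every worker sees the entire dataset.
        return list(object_uris)
    # Worker k takes items k, k + num_workers, k + 2 * num_workers, ...
    return list(islice(object_uris, worker_id, None, num_workers))


uris = [f"s3://example-bucket/item-{i}" for i in range(6)]

shard_0 = objects_for_worker(uris, worker_id=0, num_workers=2, enable_sharding=True)
shard_1 = objects_for_worker(uris, worker_id=1, num_workers=2, enable_sharding=True)
unsharded = objects_for_worker(uris, worker_id=1, num_workers=2, enable_sharding=False)
```

With sharding off, every worker yields all six objects, so a multi-worker loader would see each object once per worker; with sharding on, the two shards are disjoint and together cover the list once.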

classmethod from_prefix(s3_uri: str, *, region: str, endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, enable_sharding: bool = False, reader_constructor: s3torchconnector.s3reader.S3ReaderConstructorProtocol | None = None)

Returns an instance of S3IterableDataset using the S3 URI provided.

Parameters:
  • s3_uri (str) – An S3 URI (prefix) of the object(s) desired. Objects matching the prefix will be included in the returned dataset.

  • region (str) – AWS region of the S3 bucket where the objects are stored.

  • endpoint (str) – AWS endpoint of the S3 bucket where the objects are stored.

  • transform – Optional callable which is used to transform an S3Reader into the desired type.

  • s3client_config – Optional S3ClientConfig with parameters for S3 client.

  • enable_sharding – If True, shard the dataset across multiple workers for parallel data loading. If False (default), each worker loads the entire dataset independently.

  • reader_constructor (Optional[S3ReaderConstructorProtocol]) – Optional partial(S3Reader) created using S3ReaderConstructor e.g. S3ReaderConstructor.sequential() or S3ReaderConstructor.range_based()

Returns:

An Iterable-Style dataset created from S3 objects.

Return type:

S3IterableDataset

Raises:

S3Exception – An error occurred accessing S3.
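Putting the documented parameters together, a call to from_prefix might look like the sketch below. The helper name build_iterable_dataset is hypothetical, and the import sits inside the function so the sketch can be defined without the package installed; calling it requires s3torchconnector and access to the bucket.

```python
def build_iterable_dataset(prefix_uri: str, region: str):
    """Build a sharded S3IterableDataset whose objects are read sequentially."""
    # Imported lazily; needs the s3torchconnector package at call time.
    from s3torchconnector import S3IterableDataset, S3ReaderConstructor

    return S3IterableDataset.from_prefix(
        prefix_uri,
        region=region,
        enable_sharding=True,  # split objects across workers instead of repeating them
        reader_constructor=S3ReaderConstructor.sequential(),
    )
```

A call such as build_iterable_dataset("s3://example-bucket/train/", "us-east-1") uses placeholder values; substitute a real prefix and region.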

class s3torchconnector.S3MapDataset(region: str, get_dataset_objects: Callable[[s3torchconnector._s3client.S3Client], Iterable[s3torchconnector._s3bucket_key_data.S3BucketKeyData]], endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, reader_constructor: s3torchconnector.s3reader.S3ReaderConstructorProtocol | None = None)

Bases: torch.utils.data.Dataset

A Map-Style dataset created from S3 objects.

To create an instance of S3MapDataset, use the from_prefix or from_objects class methods.

property region

property endpoint

classmethod from_objects(object_uris: str | Iterable[str], *, region: str, endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, reader_constructor: s3torchconnector.s3reader.S3ReaderConstructorProtocol | None = None)

Returns an instance of S3MapDataset using the S3 URI(s) provided.

Parameters:
  • object_uris (str | Iterable[str]) – S3 URI of the object(s) desired.

  • region (str) – AWS region of the S3 bucket where the objects are stored.

  • endpoint (str) – AWS endpoint of the S3 bucket where the objects are stored.

  • transform – Optional callable which is used to transform an S3Reader into the desired type.

  • s3client_config – Optional S3ClientConfig with parameters for S3 client.

  • reader_constructor (Optional[S3ReaderConstructorProtocol]) – Optional partial(S3Reader) created using S3ReaderConstructor e.g. S3ReaderConstructor.sequential() or S3ReaderConstructor.range_based()

Returns:

A Map-Style dataset created from S3 objects.

Return type:

S3MapDataset

Raises:

S3Exception – An error occurred accessing S3.
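The transform parameter receives an S3Reader and returns whatever type the training code expects. As a sketch, the function below decodes each object as JSON; load_json is a hypothetical name, and io.BytesIO stands in for the reader so the transform can be checked locally on made-up sample data.

```python
import io
import json


def load_json(reader):
    """Transform: read the whole object and decode it as UTF-8 JSON."""
    return json.loads(reader.read().decode("utf-8"))


# Local check with an in-memory stand-in for an S3Reader.
fake_object = io.BytesIO(b'{"label": 3, "pixels": [0, 255]}')
sample = load_json(fake_object)
```

Since the transform relies only on read(), the same function can be passed as transform=load_json to from_objects or from_prefix.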

classmethod from_prefix(s3_uri: str, *, region: str, endpoint: str | None = None, transform: Callable[[s3torchconnector.S3Reader], Any] = identity, s3client_config: s3torchconnector._s3client.S3ClientConfig | None = None, reader_constructor: s3torchconnector.s3reader.S3ReaderConstructorProtocol | None = None)

Returns an instance of S3MapDataset using the S3 URI provided.

Parameters:
  • s3_uri (str) – An S3 URI (prefix) of the object(s) desired. Objects matching the prefix will be included in the returned dataset.

  • region (str) – AWS region of the S3 bucket where the objects are stored.

  • endpoint (str) – AWS endpoint of the S3 bucket where the objects are stored.

  • transform – Optional callable which is used to transform an S3Reader into the desired type.

  • s3client_config – Optional S3ClientConfig with parameters for S3 client.

  • reader_constructor (Optional[S3ReaderConstructorProtocol]) – Optional partial(S3Reader) created using S3ReaderConstructor e.g. S3ReaderConstructor.sequential() or S3ReaderConstructor.range_based()

Returns:

A Map-Style dataset created from S3 objects.

Return type:

S3MapDataset

Raises:

S3Exception – An error occurred accessing S3.
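For a map-style dataset built from explicit URIs, the documented parameters could be combined as in the sketch below. build_map_dataset is a hypothetical helper, and choosing a range-based reader with buffering disabled is just one plausible configuration for sparse partial reads. The import is deferred so the function can be defined without the package installed.

```python
def build_map_dataset(object_uris, region: str, transform):
    """Build an S3MapDataset that reads objects with unbuffered range requests."""
    # Imported lazily; needs the s3torchconnector package at call time.
    from s3torchconnector import S3MapDataset, S3ReaderConstructor

    return S3MapDataset.from_objects(
        object_uris,
        region=region,
        transform=transform,  # callable that turns an S3Reader into a sample
        reader_constructor=S3ReaderConstructor.range_based(buffer_size=0),
    )
```

Because S3MapDataset extends torch.utils.data.Dataset, the result can be passed to a torch.utils.data.DataLoader like any other map-style dataset.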