mlsimkit.learn.manifest – MLSimKit Learn Manifest

This part of the documentation covers the manifest module, which code uses to interface with custom datasets.

mlsimkit.learn.manifest.manifest.create_manifest_entry(run_folder, data_files, file_lists)

Create one manifest entry for a simulation “run” folder.

Parameters:
  • run_folder (str) – The path to the run folder.

  • data_files (list) – A list of DataFile objects representing data files.

  • file_lists (list) – A list of FileList objects representing file lists.

Returns:

A manifest entry dictionary for the run folder.

Return type:

dict

Raises:

RuntimeError – If there are issues with the data files or file lists.

mlsimkit.learn.manifest.manifest.generate_manifest_entries(run_folders, data_files, file_lists, skip_on_error)

Generate manifest entries from a list of simulation “run” folders.

Parameters:
  • run_folders (list) – A list of run folder paths.

  • data_files (list) – A list of DataFile objects representing data files.

  • file_lists (list) – A list of FileList objects representing file lists.

  • skip_on_error (bool) – If True, skip over a run folder on error and continue.

Yields:

dict – A manifest entry dictionary for each run folder.
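
The effect of skip_on_error can be illustrated with a small generator. This is a minimal sketch rather than the library's implementation: iter_entries and demo_entry are hypothetical names, the "run_folder" key is invented for illustration, and the sketch assumes that a failing folder raises RuntimeError, as documented for create_manifest_entry.

```python
import logging
from typing import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)


def iter_entries(
    run_folders: Iterable[str],
    make_entry: Callable[[str], dict],
    skip_on_error: bool = False,
) -> Iterator[dict]:
    """Yield one entry per run folder, optionally skipping folders that fail."""
    for folder in run_folders:
        try:
            entry = make_entry(folder)
        except RuntimeError as err:
            if not skip_on_error:
                raise  # stop at the first bad folder
            logger.warning("Skipping %s: %s", folder, err)
            continue
        yield entry


def demo_entry(folder: str) -> dict:
    # Hypothetical stand-in for create_manifest_entry; the key is illustrative.
    if "bad" in folder:
        raise RuntimeError(f"no data files found in {folder}")
    return {"run_folder": folder}


folders = ["run_001", "run_bad", "run_002"]
entries = list(iter_entries(folders, demo_entry, skip_on_error=True))
```

Because the entries are produced lazily, an error in a later folder only surfaces once iteration reaches it.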

mlsimkit.learn.manifest.manifest.get_array_list(manifest, key)
mlsimkit.learn.manifest.manifest.get_base_path(base_enum: RelativePathBase, manifest_path)
mlsimkit.learn.manifest.manifest.get_path_list(manifest, key, default_dir=None)
mlsimkit.learn.manifest.manifest.make_manifest(geometry_files)
mlsimkit.learn.manifest.manifest.make_working_manifest(manifest_path, output_dir)
mlsimkit.learn.manifest.manifest.read_manifest_file(manifest_filepath)

Read a manifest file into a DataFrame, where each line of the file is one record.

mlsimkit.learn.manifest.manifest.resolve_file_path(file_path, default_dir=None, missing_ok=True)

Resolve a file path to an absolute path or URI.

mlsimkit.learn.manifest.manifest.split_manifest(manifest_file: str, settings: SplitSettings, output_dir: str | None = None) → Dict[str, str | None]

Split a JSON lines manifest file into train, validation, and test sets.

Parameters:
  • manifest_file (str) – The path to the input JSON lines manifest file.

  • settings (SplitSettings) – The settings for splitting the data.

  • output_dir (str, optional) – The path to the output directory for the split files. Defaults to the same directory as the input manifest file.

Returns:

A dictionary containing the file paths for the train, validation, and test sets. If a set is empty, its value in the dictionary will be None.

Return type:

Dict[str, Optional[str]]

The function handles the following cases:
  • If settings.test_size is 1.0 (100%), the entire dataset is assigned to the test set, and the train and validation sets are set to None.

  • If settings.test_size is 0.0, the dataset is split into train and validation sets according to settings.train_size, and the test set is set to None.

  • If settings.test_size is between 0.0 and 1.0 (exclusive), the dataset is split into train, validation, and test sets according to the provided percentages.

The split manifest files are written to output_dir, or to the directory of the input manifest file if output_dir is not provided. Their filenames are derived from the input manifest file’s name with “-train”, “-valid”, and “-test” suffixes.
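
The splitting cases and the file naming described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: split_records and split_paths are hypothetical helpers, the rounding of set sizes is a guess, and placing the suffix before the file extension is an assumption about how the names are derived.

```python
import random
from pathlib import Path
from typing import Dict, List, Optional


def split_records(
    records: List[dict],
    train_size: float,
    valid_size: float,
    test_size: float,
    random_seed: Optional[int] = None,
) -> Dict[str, List[dict]]:
    """Shuffle records and cut them into train/valid/test by fraction."""
    shuffled = list(records)
    random.Random(random_seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = round(n * test_size)
    n_train = min(round(n * train_size), n - n_test)
    # valid_size is implied: validation takes whatever lies between train and test.
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n - n_test],
        "test": shuffled[n - n_test:],
    }


def split_paths(
    manifest_file: str,
    splits: Dict[str, List[dict]],
    output_dir: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Map each non-empty split to an output path; empty splits map to None."""
    src = Path(manifest_file)
    out = Path(output_dir) if output_dir is not None else src.parent
    return {
        name: str(out / f"{src.stem}-{name}{src.suffix}") if rows else None
        for name, rows in splits.items()
    }


rows = [{"id": i} for i in range(10)]  # illustrative records

# Default-style split: 6 train, 2 validation, 2 test records.
splits = split_records(rows, 0.6, 0.2, 0.2, random_seed=0)

# test_size of 1.0: everything goes to the test set.
all_test = split_records(rows, 0.0, 0.0, 1.0, random_seed=0)
paths = split_paths("data/manifest.jsonl", all_test)
```

In the second case, paths["train"] and paths["valid"] are None, mirroring the behavior the documentation describes for empty sets.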

mlsimkit.learn.manifest.manifest.write_manifest_file(records: DataFrame, filename: str) → None

Write manifest records as a JSON lines file.

Parameters:
  • records (pd.DataFrame) – A pandas DataFrame containing the records to be written.

  • filename (str) – The path to the output JSON lines file.
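
For readers unfamiliar with the JSON lines format, the sketch below writes and reads one record per line using only the standard library. It is not the library's implementation, which works with pandas DataFrames; write_jsonl, read_jsonl, and the record keys are hypothetical. With pandas, the comparable calls would be DataFrame.to_json(path, orient="records", lines=True) and pandas.read_json(path, lines=True).

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_jsonl(records: Iterable[dict], filename) -> None:
    """Write each record as one JSON object per line."""
    with open(filename, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(filename) -> List[dict]:
    """Read a JSON lines file back into a list of records."""
    with open(filename, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records; the keys are invented for this example.
records = [
    {"run": "run_001", "files": ["run_001/body.stl"]},
    {"run": "run_002", "files": ["run_002/body.stl"]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "manifest.jsonl"
    write_jsonl(records, path)
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    loaded = read_jsonl(path)
```

Each record occupies exactly one line, so the file can be streamed or appended to without parsing it as a whole.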

mlsimkit.learn.manifest.manifest.write_manifest_files(manifest_entries, manifest_file, run_folders)

Write the manifest and description files using the provided manifest entries iterator.

class mlsimkit.learn.manifest.schema.DataFile(*, name: Annotated[str, MinLen(min_length=1)], file_glob: str | None = None, file_regex: str | None = None, columns: Annotated[list, MinLen(min_length=1)], delimiter: Annotated[str, MinLen(min_length=1), MaxLen(max_length=1)] = ',')

A parameter to extract values from a .dat file.

classmethod check_file_glob_or_regex(values)
columns: list
delimiter: str
file_glob: str | None
file_regex: str | None
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model; this should be a dictionary conforming to pydantic.config.ConfigDict.

name: str

class mlsimkit.learn.manifest.schema.FileList(*, name: str, file_glob: str | None = None)

A parameter to list files.

file_glob: str | None
model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model; this should be a dictionary conforming to pydantic.config.ConfigDict.

name: str

class mlsimkit.learn.manifest.schema.RelativePathBase(value)

An enumeration of the base locations from which relative paths are resolved.

CWD = 'CWD'
ManifestRoot = 'ManifestRoot'
PackageRoot = 'PackageRoot'

class mlsimkit.learn.manifest.schema.SplitSettings(*, train_size: Annotated[float, Ge(ge=0), Le(le=1)] = 0.6, valid_size: Annotated[float, Ge(ge=0), Le(le=1)] = 0.2, test_size: Annotated[float, Ge(ge=0), Le(le=1)] = 0.2, random_seed: int | None = None)

Settings for splitting a dataset into train, validation, and test sets.

check_total_percentage() → SplitSettings

Validate that the train, validation, and test sizes sum to 1.0 (100%).

Parameters:

values (dict) – The values of the model’s fields.

Raises:

ValueError – If the sum of train, validation, and test percentages is not 1.0.

Returns:

The validated instance of the SplitSettings model.

Return type:

SplitSettings

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; this should be a dictionary conforming to pydantic.config.ConfigDict.

random_seed: int | None
test_size: float
train_size: float
valid_size: float
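
The constraints on SplitSettings (each size between 0 and 1, and the three sizes summing to 1.0) can be sketched with a plain dataclass. This is only an illustration: the real class is a pydantic model, SplitSettingsSketch is a hypothetical name, and the floating-point tolerance used here is an assumption, since the documentation does not say how the sum is compared.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitSettingsSketch:
    """Stand-in mirroring the documented fields and defaults."""

    train_size: float = 0.6
    valid_size: float = 0.2
    test_size: float = 0.2
    random_seed: Optional[int] = None

    def __post_init__(self) -> None:
        # Each fraction must lie in the closed interval [0, 1].
        for name in ("train_size", "valid_size", "test_size"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        # The fractions must add up to 1.0 (tolerance is an assumption).
        total = self.train_size + self.valid_size + self.test_size
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"split sizes must sum to 1.0, got {total}")


defaults = SplitSettingsSketch()
test_only = SplitSettingsSketch(train_size=0.0, valid_size=0.0, test_size=1.0)

try:
    SplitSettingsSketch(train_size=0.5)  # 0.5 + 0.2 + 0.2 = 0.9
    error_message = None
except ValueError as err:
    error_message = str(err)
```

Validating at construction time means an inconsistent split is rejected before any manifest file is read or written.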