mlsimkit.learn.manifest – MLSimKit Learn Manifest¶
This part of the documentation covers the manifest module, which code uses to interface with custom datasets.
- mlsimkit.learn.manifest.manifest.create_manifest_entry(run_folder, data_files, file_lists)¶
Create one manifest entry for a simulation “run” folder.
- Parameters:
run_folder (str) – The path to the run folder.
data_files (list) – A list of DataFile objects representing data files.
file_lists (list) – A list of FileList objects representing file lists.
- Returns:
A manifest entry dictionary for the run folder.
- Return type:
dict
- Raises:
RuntimeError – If there are issues with the data files or file lists.
- mlsimkit.learn.manifest.manifest.generate_manifest_entries(run_folders, data_files, file_lists, skip_on_error)¶
Generate manifest entries from a list of simulation “run” folders.
- Parameters:
run_folders (list) – A list of run folder paths.
data_files (list) – A list of DataFile objects representing data files.
file_lists (list) – A list of FileList objects representing file lists.
skip_on_error (bool) – If True, skip over a run folder on error and continue.
- Yields:
dict – A manifest entry dictionary for each run folder.
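The skip_on_error behavior described above can be illustrated with a small generator. This is a minimal sketch, not the library's implementation: `iter_entries`, `demo_build_entry`, and the `run_folder` key are hypothetical names, and `build_entry` stands in for create_manifest_entry, which is documented to raise RuntimeError on bad data files or file lists.

```python
import logging
from typing import Callable, Dict, Iterable, Iterator

logger = logging.getLogger(__name__)


def iter_entries(
    run_folders: Iterable[str],
    build_entry: Callable[[str], Dict],
    skip_on_error: bool = False,
) -> Iterator[Dict]:
    """Yield one entry per run folder, optionally skipping folders that fail.

    `build_entry` is a hypothetical stand-in for create_manifest_entry.
    """
    for run_folder in run_folders:
        try:
            entry = build_entry(run_folder)
        except RuntimeError as err:
            if not skip_on_error:
                raise  # propagate the first failure to the caller
            logger.warning("Skipping %s: %s", run_folder, err)
            continue
        yield entry


def demo_build_entry(run_folder: str) -> Dict:
    """Hypothetical builder: fails for any folder whose name contains 'bad'."""
    if "bad" in run_folder:
        raise RuntimeError(f"no data files found in {run_folder}")
    return {"run_folder": run_folder}
```

Because the entries are yielded lazily, a failure in a late run folder only surfaces once the consumer reaches it; with skip_on_error=True the iteration simply continues with the next folder.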
- mlsimkit.learn.manifest.manifest.get_array_list(manifest, key)¶
- mlsimkit.learn.manifest.manifest.get_base_path(base_enum: RelativePathBase, manifest_path)¶
- mlsimkit.learn.manifest.manifest.get_path_list(manifest, key, default_dir=None)¶
- mlsimkit.learn.manifest.manifest.make_manifest(geometry_files)¶
- mlsimkit.learn.manifest.manifest.make_working_manifest(manifest_path, output_dir)¶
- mlsimkit.learn.manifest.manifest.read_manifest_file(manifest_filepath)¶
Read a manifest file into a DataFrame, where each line of the file is one record.
- mlsimkit.learn.manifest.manifest.resolve_file_path(file_path, default_dir=None, missing_ok=True)¶
Resolve a file path to an absolute path or URI.
- mlsimkit.learn.manifest.manifest.split_manifest(manifest_file: str, settings: SplitSettings, output_dir: str | None = None) → Dict[str, str | None]¶
Split a JSON lines manifest file into train, validation, and test sets.
- Parameters:
manifest_file (str) – The path to the input JSON lines manifest file.
settings (SplitSettings) – The settings for splitting the data.
output_dir (str, optional) – The path to the output directory for the split files. Defaults to the same directory as the input manifest file.
- Returns:
A dictionary containing the file paths for the train, validation, and test sets. If a set is empty, its value in the dictionary will be None.
- Return type:
Dict[str, Optional[str]]
The function handles the following cases:
  - If settings.test_size is 1.0 (100%), the entire dataset is assigned to the test set, and the train and validation sets are set to None.
  - If settings.test_size is 0.0, the dataset is split into train and validation sets according to settings.train_size, and the test set is set to None.
  - If settings.test_size is between 0.0 and 1.0 (exclusive), the dataset is split into train, validation, and test sets according to the provided percentages.
The function writes the split manifest files to output_dir, or to the directory containing the input manifest file if output_dir is not provided. The filenames of the split manifest files are derived from the input manifest file’s name with “-train”, “-valid”, and “-test” suffixes.
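The three cases and the output naming can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `split_records`, `split_filename`, and `plan_outputs` are hypothetical helpers, the rounding of set sizes is a guess, and placing the suffix between the file stem and its extension is an assumption about how the names are derived.

```python
import random
from pathlib import Path
from typing import Dict, List, Optional, Sequence


def split_records(
    records: Sequence[dict],
    train_size: float,
    test_size: float,
    seed: Optional[int] = None,
) -> Dict[str, List[dict]]:
    """Shuffle records and partition them into train, valid, and test lists.

    The validation set takes whatever remains after train and test,
    which matches sizes that sum to 1.0.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = round(n * test_size)                    # assumption: round to nearest
    n_train = min(round(n * train_size), n - n_test)
    return {
        "train": shuffled[n_test:n_test + n_train],
        "valid": shuffled[n_test + n_train:],
        "test": shuffled[:n_test],
    }


def split_filename(manifest_file: str, split_name: str,
                   output_dir: Optional[str] = None) -> Path:
    """Derive an output path such as manifest-train.jsonl (assumed naming)."""
    src = Path(manifest_file)
    out_dir = Path(output_dir) if output_dir is not None else src.parent
    return out_dir / f"{src.stem}-{split_name}{src.suffix}"


def plan_outputs(manifest_file: str, splits: Dict[str, List[dict]],
                 output_dir: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Map each split to its output path, or to None when the split is empty."""
    return {
        name: str(split_filename(manifest_file, name, output_dir)) if rows else None
        for name, rows in splits.items()
    }
```

With test_size=1.0 every record lands in the test list and `plan_outputs` reports None for train and valid; with test_size=0.0 the test entry is None instead.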
- mlsimkit.learn.manifest.manifest.write_manifest_file(records: DataFrame, filename: str) → None¶
Write manifest records as a JSON lines file.
- Parameters:
records (pd.DataFrame) – A pandas DataFrame containing the records to be written.
filename (str) – The path to the output JSON lines file.
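For readers unfamiliar with the JSON lines format used by manifest files, the sketch below shows the round trip on plain dictionaries: one JSON object per line when writing, one record per line when reading. It uses only the standard library rather than pandas, and `dumps_jsonl`, `loads_jsonl`, and the record keys in the usage note are hypothetical.

```python
import json
from typing import Dict, Iterable, List


def dumps_jsonl(records: Iterable[Dict]) -> str:
    """Serialize records as JSON lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def loads_jsonl(text: str) -> List[Dict]:
    """Parse JSON lines text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

For example, `loads_jsonl(dumps_jsonl([{"id": 0}, {"id": 1}]))` returns the original two records. With a pandas DataFrame, the same format can be produced by `DataFrame.to_json(path, orient="records", lines=True)` and read back with `pandas.read_json(path, lines=True)`; whether the library uses exactly these calls is not stated here.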
- mlsimkit.learn.manifest.manifest.write_manifest_files(manifest_entries, manifest_file, run_folders)¶
Write the manifest and description files using the provided manifest entries iterator.
- class mlsimkit.learn.manifest.schema.DataFile(*, name: Annotated[str, MinLen(min_length=1)], file_glob: str | None = None, file_regex: str | None = None, columns: Annotated[list, MinLen(min_length=1)], delimiter: Annotated[str, MinLen(min_length=1), MaxLen(max_length=1)] = ',')¶
A parameter to extract values from a .dat file
- classmethod check_file_glob_or_regex(values)¶
- columns: list¶
- delimiter: str¶
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- name: str¶
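The constraints in the DataFile signature (non-empty name, non-empty columns, single-character delimiter) can be mirrored in a plain dataclass. This is a sketch, not the pydantic model itself: `DataFileSketch` is a hypothetical stand-in, and the rule that at least one of file_glob or file_regex must be set is an assumption inferred from the validator name check_file_glob_or_regex.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFileSketch:
    """Hypothetical stand-in for DataFile that mirrors its documented constraints."""

    name: str
    columns: List
    file_glob: Optional[str] = None
    file_regex: Optional[str] = None
    delimiter: str = ","

    def __post_init__(self) -> None:
        if len(self.name) < 1:
            raise ValueError("name must not be empty")
        if len(self.columns) < 1:
            raise ValueError("columns must contain at least one item")
        if len(self.delimiter) != 1:
            raise ValueError("delimiter must be exactly one character")
        # Assumption: at least one of the two file patterns is required.
        if self.file_glob is None and self.file_regex is None:
            raise ValueError("set file_glob or file_regex")
```

A valid instance would look like `DataFileSketch(name="forces", columns=[1, 2], file_glob="*.dat")`; the name, column indices, and glob here are illustrative values only.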
- class mlsimkit.learn.manifest.schema.FileList(*, name: str, file_glob: str | None = None)¶
A parameter to list files
- model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- name: str¶
- class mlsimkit.learn.manifest.schema.RelativePathBase(value)¶
An enumeration.
- CWD = 'CWD'¶
- ManifestRoot = 'ManifestRoot'¶
- PackageRoot = 'PackageRoot'¶
- class mlsimkit.learn.manifest.schema.SplitSettings(*, train_size: Annotated[float, Ge(ge=0), Le(le=1)] = 0.6, valid_size: Annotated[float, Ge(ge=0), Le(le=1)] = 0.2, test_size: Annotated[float, Ge(ge=0), Le(le=1)] = 0.2, random_seed: int | None = None)¶
Settings for splitting a dataset into train, validation, and test sets.
- check_total_percentage() → SplitSettings¶
Validate that the sum of train, validation, and test percentages is 100%.
- Parameters:
values (dict) – The values of the model’s fields.
- Raises:
ValueError – If the sum of train, validation, and test percentages is not 1.0.
- Returns:
The validated instance of the SplitSettings model.
- Return type:
SplitSettings
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- test_size: float¶
- train_size: float¶
- valid_size: float¶
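The range and sum checks on SplitSettings can be sketched without pydantic. `SplitSettingsSketch` below is a hypothetical dataclass that copies the documented defaults (0.6, 0.2, 0.2) and bounds (0 to 1); comparing the sum to 1.0 with a small floating-point tolerance is an assumption, since the exact comparison used by check_total_percentage is not described.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitSettingsSketch:
    """Hypothetical stand-in for SplitSettings with the documented defaults."""

    train_size: float = 0.6
    valid_size: float = 0.2
    test_size: float = 0.2
    random_seed: Optional[int] = None

    def __post_init__(self) -> None:
        for field_name in ("train_size", "valid_size", "test_size"):
            value = getattr(self, field_name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{field_name} must be between 0 and 1, got {value}")
        total = self.train_size + self.valid_size + self.test_size
        # Assumption: allow a small tolerance for floating-point rounding.
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"split sizes must sum to 1.0, got {total}")
```

Under this sketch, `SplitSettingsSketch(train_size=0.0, valid_size=0.0, test_size=1.0)` is accepted (the all-test case), while `SplitSettingsSketch(train_size=0.5)` is rejected because the sizes sum to 0.9.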