Experiments
experiments
Higher-level experiments (generally combining multiple Runs)
This module provides utilities to run more complex "experiments" that go beyond the scope of a single Run.
LatencyHeatmap
dataclass
LatencyHeatmap(endpoint, source_file, clients=4, output_path=None, input_lengths=[10, 50, 200, 500], output_lengths=[128, 256, 512, 1024], requests_per_combination=1, create_payload_fn=None, create_payload_kwargs={}, tokenizer=None)
Experiment to measure how latency varies by input and output token count
This experiment uses a source text file to generate input prompts/payloads of different lengths, and measures how response time varies with both the input lengths and output/response lengths.
Attributes:
| Name | Type | Description |
|---|---|---|
| `endpoint` | `Endpoint` | The LLM endpoint to test. |
| `source_file` | `UPath \| str` | The source file from which prompts of different lengths will be sampled (see |
| `clients` | `int` | The number of concurrent clients (requests) to use for the experiment. Note that using a high number of concurrent clients could impact observed latency. |
| `output_path` | `UPath \| str \| None` | The (local or Cloud e.g. |
| `input_lengths` | `Sequence[int]` | The approximate input/prompt lengths to test. Since the locally-available |
| `output_lengths` | `Sequence[int]` | The target output lengths to test. Since generation may stop early for certain prompts, and some endpoints may not report exact token counts in their responses, the results may not correspond exactly to these targets. |
| `requests_per_combination` | `int` | The number of requests to make for each combination of input and output lengths. |
| `create_payload_fn` | `Callable \| None` | A function to create the actual endpoint payload for each invocation, from the sampled text prompt. Typically, you'll want to specify a prefix for your prompt in either this or the |
| `create_payload_kwargs` | `Dict` | Keyword arguments to pass to the |
| `tokenizer` | `Tokenizer \| None` | A tokenizer to be used for sampling prompts of the specified lengths, and also for estimating the generated output lengths if necessary for your endpoint. If not set, the |
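Taken together, `input_lengths`, `output_lengths`, and `requests_per_combination` define the grid of requests the heatmap sweeps. A minimal sketch of that grid in plain Python (independent of llmeter; the values below are the defaults from the signature above):

```python
from itertools import product

# Defaults from the LatencyHeatmap signature above
input_lengths = [10, 50, 200, 500]      # approximate prompt token counts
output_lengths = [128, 256, 512, 1024]  # target generation token counts
requests_per_combination = 1

# Each cell of the heatmap is one (input, output) combination,
# measured requests_per_combination times.
grid = [
    (n_in, n_out)
    for n_in, n_out in product(input_lengths, output_lengths)
    for _ in range(requests_per_combination)
]

print(len(grid))  # 4 x 4 x 1 = 16 requests in total
```

Raising `requests_per_combination` multiplies the total request count accordingly, which trades experiment runtime for less noisy per-cell latency estimates.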
LoadTest
dataclass
LoadTest(endpoint, payload, sequence_of_clients, min_requests_per_client=1, min_requests_per_run=10, output_path=None, tokenizer=None, test_name=None, callbacks=None)
Experiment to explore how performance changes at different concurrency levels
This experiment creates a series of Runs at different levels of concurrency, defined by `sequence_of_clients`, and executes them one after the other.
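For illustration, here is a hypothetical schedule for such a test in plain Python. The sizing rule shown, `max(min_requests_per_run, clients * min_requests_per_client)`, is an assumption about how the two minimums might interact, not confirmed llmeter behavior:

```python
# Hypothetical sketch of how a LoadTest schedule might be sized.
# ASSUMPTION: each run issues at least min_requests_per_run requests
# overall and at least min_requests_per_client per client; the actual
# llmeter rule may differ.
sequence_of_clients = [1, 5, 20, 50]
min_requests_per_client = 1
min_requests_per_run = 10

schedule = {
    clients: max(min_requests_per_run, clients * min_requests_per_client)
    for clients in sequence_of_clients
}

for clients, n_requests in schedule.items():
    print(f"{clients:>3} clients -> {n_requests} requests")
```

Under this assumption, low-concurrency runs are padded up to `min_requests_per_run` so that every concurrency level yields enough samples for stable statistics.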
LoadTestResult
dataclass
LoadTestResult(results, test_name, output_path=None)
load
classmethod
load(load_path, test_name=None, load_responses=True)
Load test results from a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `load_path` | `UPath \| str \| None` | Directory path containing the load test results subdirectories | *required* |
| `test_name` | `str \| None` | Optional name for the test. If not provided, will use the directory name | `None` |
| `load_responses` | `bool` | Whether to load individual invocation responses. Defaults to True. When False, only summaries and pre-computed stats are loaded | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `LoadTestResult` | `LoadTestResult` | A LoadTestResult object containing the loaded results |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If `load_path` does not exist or is None/empty |
| `ValueError` | If no results are found in the directory |
Source code in llmeter/experiments.py