
# experiments

Higher-level experiments (generally combining multiple `Run`s)

This module provides utilities to run more complex "experiments" that go beyond the scope of a single Run.

## LatencyHeatmap dataclass

```python
LatencyHeatmap(
    endpoint,
    source_file,
    clients=4,
    output_path=None,
    input_lengths=[10, 50, 200, 500],
    output_lengths=[128, 256, 512, 1024],
    requests_per_combination=1,
    create_payload_fn=None,
    create_payload_kwargs={},
    tokenizer=None,
)
```

Experiment to measure how latency varies by input and output token count.

This experiment uses a source text file to generate input prompts/payloads of different lengths, and measures how response time varies with both the input lengths and output/response lengths.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `endpoint` | `Endpoint` | The LLM endpoint to test. |
| `source_file` | `UPath \| str` | The source file from which prompts of different lengths will be sampled (see `llmeter.prompt_utils.CreatePromptCollection` for details). |
| `clients` | `int` | The number of concurrent clients (requests) to use for the experiment. Note that a high number of concurrent clients can itself inflate observed latency. |
| `output_path` | `UPath \| str \| None` | The (local or cloud, e.g. `s3://...`) path where results are saved. |
| `input_lengths` | `Sequence[int]` | The approximate input/prompt lengths to test. Because the locally available tokenizer often differs from the endpoint's own token counting, it's typically not possible to generate prompts with exactly the specified token counts. |
| `output_lengths` | `Sequence[int]` | The target output lengths to test. Because generation may stop early for some prompts, and some endpoints don't report exact token counts in their responses, results may not match these targets exactly. |
| `requests_per_combination` | `int` | The number of requests to make for each combination of input and output lengths. |
| `create_payload_fn` | `Callable \| None` | A function that builds the actual endpoint payload for each invocation from the sampled text prompt. Typically you'll want to specify a prompt prefix either here or in `create_payload_kwargs`. If not set, the endpoint's default `create_payload` method is used. |
| `create_payload_kwargs` | `Dict` | Keyword arguments passed to `create_payload_fn`. |
| `tokenizer` | `Tokenizer \| None` | A tokenizer used to sample prompts of the specified lengths, and to estimate generated output lengths if necessary for your endpoint. If not set, `llmeter.tokenizers.DummyTokenizer` is used. |
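Since every input length is crossed with every output length, the total number of requests is `len(input_lengths) * len(output_lengths) * requests_per_combination`. The sketch below illustrates that arithmetic in plain Python, plus a hypothetical `create_payload_fn`; the payload shape shown is an assumption for illustration, not llmeter's actual format:

```python
from itertools import product

# Default grids from the signature above
input_lengths = [10, 50, 200, 500]
output_lengths = [128, 256, 512, 1024]
requests_per_combination = 1

# Every (input, output) pair is tested, so the request count grows
# multiplicatively with both grids.
combinations = list(product(input_lengths, output_lengths))
total_requests = len(combinations) * requests_per_combination
print(total_requests)  # 4 * 4 * 1 = 16

# A hypothetical create_payload_fn: prepend an instruction prefix to the
# sampled source text. The dict shape here is illustrative only -- the real
# payload structure depends on your endpoint.
def create_payload(prompt: str, prefix: str = "Summarize the following:\n") -> dict:
    return {"prompt": prefix + prompt}

payload = create_payload("Some sampled source text...")
```

Budgeting requests this way matters because the grid grows quadratically: doubling both length lists quadruples the experiment's run time.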

## LoadTest dataclass

```python
LoadTest(
    endpoint,
    payload,
    sequence_of_clients,
    min_requests_per_client=1,
    min_requests_per_run=10,
    output_path=None,
    tokenizer=None,
    test_name=None,
    callbacks=None,
)
```

Experiment to explore how performance changes at different concurrency levels.

This experiment creates a series of Runs with different levels of concurrency, defined by `sequence_of_clients`, and runs them one after the other.
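One plausible reading of the two minimums is that each concurrency level becomes its own Run, sized so that both floors are satisfied. The sizing rule below is an illustration of that interpretation only, not llmeter's actual scheduling logic, and the client ramp is an arbitrary example rather than a library default:

```python
# Hypothetical sizing rule: each run must satisfy both request-count floors.
# (Illustrative assumption only -- consult llmeter's source for the real rule.)
def requests_for_run(clients: int,
                     min_requests_per_client: int = 1,
                     min_requests_per_run: int = 10) -> int:
    return max(min_requests_per_run, clients * min_requests_per_client)

sequence_of_clients = [1, 5, 20, 50]  # example ramp, not a library default
plan = {c: requests_for_run(c) for c in sequence_of_clients}
print(plan)  # {1: 10, 5: 10, 20: 20, 50: 50}
```

Under this reading, `min_requests_per_run` dominates at low concurrency (so small runs still collect enough samples), while `min_requests_per_client` dominates once the client count exceeds the per-run floor.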

## LoadTestResult dataclass

```python
LoadTestResult(results, test_name, output_path=None)
```

### load classmethod

```python
load(load_path, test_name=None, load_responses=True)
```

Load test results from a directory.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `load_path` | `UPath \| str \| None` | Directory path containing the load test result subdirectories. | *required* |
| `test_name` | `str \| None` | Optional name for the test. If not provided, the directory name is used. | `None` |
| `load_responses` | `bool` | Whether to load individual invocation responses. When `False`, only summaries and pre-computed stats are loaded. | `True` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `LoadTestResult` | `LoadTestResult` | A `LoadTestResult` object containing the loaded results. |

Raises:

| Type | Description |
| --- | --- |
| `FileNotFoundError` | If `load_path` does not exist or is `None`/empty. |
| `ValueError` | If no results are found in the directory. |

Source code in `llmeter/experiments.py`:

```python
@classmethod
def load(
    cls,
    load_path: Path | str | None,
    test_name: str | None = None,
    load_responses: bool = True,
) -> "LoadTestResult":
    """Load test results from a directory.

    Args:
        load_path: Directory path containing the load test results subdirectories
        test_name: Optional name for the test. If not provided, will use the directory name
        load_responses: Whether to load individual invocation responses. Defaults to True.
            When False, only summaries and pre-computed stats are loaded.

    Returns:
        LoadTestResult: A LoadTestResult object containing the loaded results

    Raises:
        FileNotFoundError: If load_path does not exist or is None/empty
        ValueError: If no results are found in the directory
    """
    if not load_path:
        raise FileNotFoundError("Load path cannot be None or empty")

    if isinstance(load_path, str):
        load_path = Path(load_path)

    if not load_path.exists():
        raise FileNotFoundError(f"Load path {load_path} does not exist")

    results = [
        Result.load(x, load_responses=load_responses)
        for x in load_path.iterdir()
        if x.is_dir()
    ]

    if not results:
        raise ValueError(f"No results found in {load_path}")

    return LoadTestResult(
        results={r.clients: r for r in results},
        test_name=test_name or load_path.name,
        output_path=load_path.parent,
    )
```