Experiments

experiments

Higher-level experiments (generally combining multiple Runs)

This module provides utilities to run more complex "experiments" that go beyond the scope of a single Run.

LatencyHeatmap dataclass

LatencyHeatmap(endpoint, source_file, clients=4, output_path=None, input_lengths=[10, 50, 200, 500], output_lengths=[128, 256, 512, 1024], requests_per_combination=1, create_payload_fn=None, create_payload_kwargs={}, tokenizer=None)

Experiment to measure how latency varies by input and output token count

This experiment uses a source text file to generate input prompts/payloads of different lengths, and measures how response time varies with both the input lengths and output/response lengths.

Attributes:

    endpoint (Endpoint):
        The LLM endpoint to test.

    source_file (UPath | str):
        The source file from which prompts of different lengths will be sampled
        (see llmeter.prompt_utils.CreatePromptCollection for details).

    clients (int):
        The number of concurrent clients (requests) to use for the experiment.
        Note that using a high number of concurrent clients can itself impact
        observed latency.

    output_path (UPath | str | None):
        The path (local, or cloud storage such as s3://...) where results are saved.

    input_lengths (Sequence[int]):
        The approximate input/prompt lengths to test. Since the locally-available
        tokenizer will often differ from the endpoint's own token counting, it's
        typically not possible to generate prompts with the exact specified token
        counts.

    output_lengths (Sequence[int]):
        The target output lengths to test. Since generation may stop early for
        certain prompts, and some endpoints may not report exact token counts in
        their responses, the results may not correspond exactly to these targets.

    requests_per_combination (int):
        The number of requests to make for each combination of input and output
        lengths.

    create_payload_fn (Callable | None):
        A function that builds the actual endpoint payload for each invocation from
        the sampled text prompt. Typically, you'll want to specify a prompt prefix
        either here or in create_payload_kwargs. If not set, the endpoint's default
        create_payload method is used.

    create_payload_kwargs (Dict):
        Keyword arguments passed to create_payload_fn.

    tokenizer (Tokenizer | None):
        A tokenizer used to sample prompts of the specified lengths, and to estimate
        generated output lengths if your endpoint doesn't report them. If not set,
        llmeter.tokenizers.DummyTokenizer is used.
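
For comparison with the LoadTest examples below, here is a minimal sketch of wiring these attributes together. my_endpoint and the file path are placeholders, run() is assumed to be an async coroutine as it is for LoadTest, and the create_payload_kwargs shown are illustrative, since accepted keywords depend on your endpoint's payload builder:

Example::

    from llmeter.experiments import LatencyHeatmap

    heatmap = LatencyHeatmap(
        endpoint=my_endpoint,
        source_file="prompt_source.txt",  # any long text to sample prompts from
        clients=4,
        input_lengths=[10, 50, 200, 500],
        output_lengths=[128, 256, 512, 1024],
        requests_per_combination=2,
        # Illustrative only: forwarded to the payload builder; accepted
        # keywords vary by endpoint implementation
        create_payload_kwargs={"max_tokens": 1024},
        output_path="outputs/latency_heatmap",
    )
    result = await heatmap.run()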

LoadTest dataclass

LoadTest(endpoint, payload, sequence_of_clients, min_requests_per_client=1, min_requests_per_run=10, run_duration=None, low_memory=False, progress_bar_stats=None, output_path=None, tokenizer=None, test_name=None, callbacks=None)

Experiment to explore how performance changes at different concurrency levels.

This experiment creates a series of Runs with different levels of concurrency, defined by sequence_of_clients, and runs them one after the other.

By default, each run sends a fixed number of requests (count-bound). Set run_duration to run each concurrency level for a fixed number of seconds instead (time-bound), which gives a more realistic picture of sustained throughput.

Attributes:

    endpoint (Endpoint):
        The LLM endpoint to test.

    payload (dict | list[dict]):
        The request payload(s) to send.

    sequence_of_clients (list[int]):
        Concurrency levels to test.

    min_requests_per_client (int):
        Minimum requests per client in count-bound mode.

    min_requests_per_run (int):
        Minimum total requests per run in count-bound mode (see the sketch after
        this list for how the two minimums combine).

    run_duration (int | float | None):
        When set, each concurrency level runs for this many seconds instead of a
        fixed request count. Mutually exclusive with min_requests_per_client /
        min_requests_per_run.

    low_memory (bool):
        When True, responses are written to disk but not kept in memory. Requires
        output_path. Defaults to False.

    progress_bar_stats (dict | None):
        Controls which live stats appear on the progress bar. See
        DEFAULT_DISPLAY_STATS in llmeter.live_display for the default.

    output_path (PathLike | str | None):
        Where to save results.

    tokenizer (Tokenizer | None):
        Optional tokenizer for token counting.

    test_name (str | None):
        Name for this test. Defaults to the current date/time.

    callbacks (list[Callback] | None):
        Optional callbacks.
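
How the two request minimums interact isn't spelled out above; the actual rule is implemented by LoadTest._get_n_requests, which run() calls in count-bound mode (see the source further down). A plausible illustrative sketch, assuming the run simply satisfies both constraints:

Example::

    import math

    def n_requests_per_client(clients, min_per_client=1, min_per_run=10):
        # Illustrative guess, not the library's actual implementation:
        # each client sends enough requests that both the per-client
        # minimum and the total per-run minimum are met.
        return max(min_per_client, math.ceil(min_per_run / clients))

    n_requests_per_client(clients=1)   # -> 10 (total minimum dominates)
    n_requests_per_client(clients=20)  # -> 1  (per-client minimum dominates)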

Example::

# Count-bound: 10 requests per client at each concurrency level
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10, 20],
    min_requests_per_client=10,
    output_path="outputs/load_test",
)
result = await load_test.run()
result.plot_results()

# Time-bound: 60 seconds per concurrency level
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10, 20],
    run_duration=60,
    output_path="outputs/load_test",
)
result = await load_test.run()

# Time-bound with low-memory mode for large-scale tests
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10, 20, 50],
    run_duration=120,
    low_memory=True,
    output_path="outputs/large_load_test",
)
result = await load_test.run()

run async

run(output_path=None)

Run the load test across all configured concurrency levels.

Creates a llmeter.runner.Runner and iterates through sequence_of_clients, running one test per concurrency level. In time-bound mode (run_duration is set), each level runs for a fixed duration. In count-bound mode, each level sends a fixed number of requests per client.

Parameters:

    output_path (WritablePathLike | None):
        Optional (local or remote) folder to save results. If provided, individual
        Run results will be written to {output_path}/{test_name}/{NNNNN-clients}
        subfolders. Defaults to self.output_path if set; otherwise no files are
        saved.

Returns:

    LoadTestResult:
        A result object containing one llmeter.results.Result per concurrency
        level, keyed by client count.

Example::

load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10],
    run_duration=30,
)
result = await load_test.run(output_path="outputs/my_test")

# Access individual results by client count
result.results[5].stats["requests_per_minute"]

# Plot all standard charts
result.plot_results()
Source code in llmeter/experiments.py
async def run(self, output_path: WritablePathLike | None = None):
    """Run the load test across all configured concurrency levels.

    Creates a :class:`~llmeter.runner.Runner` and iterates through
    ``sequence_of_clients``, running one test per concurrency level. In
    time-bound mode (``run_duration`` is set), each level runs for a fixed
    duration. In count-bound mode, each level sends a fixed number of
    requests per client.

    Args:
        output_path: Optional (local or remote) folder to save results. If provided,
            individual Run results will be written to
            `{output_path}/{test_name}/{NNNNN-clients}` subfolders.
            Default: `self.output_path` if set, else no files will be saved.

    Returns:
        LoadTestResult: A result object containing one
        :class:`~llmeter.results.Result` per concurrency level, keyed by
        client count.

    Example::

        load_test = LoadTest(
            endpoint=my_endpoint,
            payload=sample_payload,
            sequence_of_clients=[1, 5, 10],
            run_duration=30,
        )
        result = await load_test.run(output_path="outputs/my_test")

        # Access individual results by client count
        result.results[5].stats["requests_per_minute"]

        # Plot all standard charts
        result.plot_results()
    """
    output_path = ensure_path(output_path or self.output_path)
    if output_path:
        test_output_path = output_path / self._test_name
    else:
        test_output_path = None
    _runner = Runner(
        endpoint=self.endpoint,
        tokenizer=self.tokenizer,
        output_path=test_output_path,
    )

    self._results = []
    for c in tqdm(
        self.sequence_of_clients, desc="Configurations", disable=_disable_tqdm
    ):
        if self.run_duration is not None:
            result = await _runner.run(
                payload=self.payload,
                clients=c,
                run_duration=self.run_duration,
                run_name=f"{c:05.0f}-clients",
                callbacks=self.callbacks,
                low_memory=self.low_memory,
                progress_bar_stats=self.progress_bar_stats,
                output_path=test_output_path,
            )
        else:
            result = await _runner.run(
                payload=self.payload,
                clients=c,
                n_requests=self._get_n_requests(c),
                run_name=f"{c:05.0f}-clients",
                callbacks=self.callbacks,
                low_memory=self.low_memory,
                progress_bar_stats=self.progress_bar_stats,
                output_path=test_output_path,
            )
        self._results.append(result)

    return LoadTestResult(
        results={r.clients: r for r in self._results},
        test_name=self._test_name,
        output_path=test_output_path,
    )

LoadTestResult dataclass

LoadTestResult(results, test_name, output_path=None)

load classmethod

load(load_path, test_name=None, load_responses=True)

Load test results from a directory.

Parameters:

    load_path (ReadablePathLike | None):
        Directory path containing the load test result subdirectories. Required.

    test_name (str | None):
        Optional name for the test. If not provided, the directory name is used.
        Defaults to None.

    load_responses (bool):
        Whether to load individual invocation responses. Defaults to True. When
        False, only summaries and pre-computed stats are loaded.

Returns:

    LoadTestResult:
        A LoadTestResult object containing the loaded results.

Raises:

    FileNotFoundError:
        If load_path does not exist or is None/empty.

    ValueError:
        If no results are found in the directory.
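
A short usage sketch; the path below is a placeholder for wherever a previous LoadTest wrote its results:

Example::

    # Reload a previous test for offline analysis, skipping raw responses
    result = LoadTestResult.load("outputs/load_test", load_responses=False)

    # Loaded results are keyed by client count, as with LoadTest.run()
    result.results[5].stats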

Source code in llmeter/experiments.py
@classmethod
def load(
    cls,
    load_path: ReadablePathLike | None,
    test_name: str | None = None,
    load_responses: bool = True,
) -> "LoadTestResult":
    """Load test results from a directory.

    Args:
        load_path: Directory path containing the load test results subdirectories
        test_name: Optional name for the test. If not provided, will use the directory name
        load_responses: Whether to load individual invocation responses. Defaults to True.
            When False, only summaries and pre-computed stats are loaded.

    Returns:
        LoadTestResult: A LoadTestResult object containing the loaded results

    Raises:
        FileNotFoundError: If load_path does not exist or is None/empty
        ValueError: If no results are found in the directory
    """
    if not load_path:
        raise FileNotFoundError("Load path cannot be None or empty")

    if not isinstance(load_path, Path):
        load_path = ensure_path(load_path)

    if not load_path.exists():
        raise FileNotFoundError(f"Load path {load_path} does not exist")

    results = [
        Result.load(x, load_responses=load_responses)
        for x in load_path.iterdir()
        if x.is_dir()
    ]

    if not results:
        raise ValueError(f"No results found in {load_path}")

    return LoadTestResult(
        results={r.clients: r for r in results},
        test_name=test_name or load_path.name,
        output_path=load_path.parent,
    )