Experiments
experiments
Higher-level experiments (generally combining multiple Runs)
This module provides utilities to run more complex "experiments" that go beyond the scope of a single Run.
LatencyHeatmap
dataclass
LatencyHeatmap(endpoint, source_file, clients=4, output_path=None, input_lengths=(lambda: [10, 50, 200, 500])(), output_lengths=(lambda: [128, 256, 512, 1024])(), requests_per_combination=1, create_payload_fn=None, create_payload_kwargs=dict(), tokenizer=None)
Experiment to measure how latency varies by input and output token count
This experiment uses a source text file to generate input prompts/payloads of different lengths, and measures how response time varies with both input (prompt) length and output (response) length.
Attributes:
| Name | Type | Description |
|---|---|---|
| endpoint | Endpoint | The LLM endpoint to test. |
| source_file | UPath \| str | The source file from which prompts of different lengths will be sampled (see input_lengths). |
| clients | int | The number of concurrent clients (requests) to use for the experiment. Note that using a high number of concurrent clients could impact observed latency. |
| output_path | UPath \| str \| None | The (local or Cloud) path where results will be saved, if provided. |
| input_lengths | Sequence[int] | The approximate input/prompt lengths to test. Since the locally-available tokenizer may differ from the one used by the endpoint, actual prompt token counts are approximate. |
| output_lengths | Sequence[int] | The target output lengths to test. Since generation may stop early for certain prompts, and some endpoints may not report exact token counts in their responses, the results may not correspond exactly to these targets. |
| requests_per_combination | int | The number of requests to make for each combination of input and output lengths. |
| create_payload_fn | Callable \| None | A function to create the actual endpoint payload for each invocation, from the sampled text prompt. Typically, you'll want to specify a prefix for your prompt in either this or create_payload_kwargs. |
| create_payload_kwargs | Dict | Keyword arguments to pass to create_payload_fn. |
| tokenizer | Tokenizer \| None | A tokenizer to be used for sampling prompts of the specified lengths, and also estimating the generated output lengths if necessary for your endpoint. If not set, a default tokenizer is used. |
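A minimal usage sketch follows. It is illustrative only: the awaitable run() call is assumed by analogy with LoadTest, and my_endpoint, the source file path, and the create_payload helper are placeholders rather than confirmed API.
Example (illustrative)::
# Sample prompts of roughly 50-500 tokens from a local text file and
# measure latency for each input/output length combination.
def create_payload(prompt, **kwargs):
    # Hypothetical helper: wrap the sampled text in whatever request
    # format your endpoint expects.
    return {"prompt": prompt, **kwargs}

heatmap = LatencyHeatmap(
    endpoint=my_endpoint,
    source_file="corpus.txt",
    clients=4,
    input_lengths=[50, 200, 500],
    output_lengths=[128, 512],
    requests_per_combination=2,
    create_payload_fn=create_payload,
    output_path="outputs/latency_heatmap",
)
result = await heatmap.run()  # assumed to be async, like LoadTest.run()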
LoadTest
dataclass
LoadTest(endpoint, payload, sequence_of_clients, min_requests_per_client=1, min_requests_per_run=10, run_duration=None, low_memory=False, progress_bar_stats=None, output_path=None, tokenizer=None, test_name=None, callbacks=None)
Experiment to explore how performance changes at different concurrency levels.
This experiment creates a series of Runs with different levels of concurrency, defined by
sequence_of_clients, and runs them one after the other.
By default, each run sends a fixed number of requests (count-bound). Set run_duration
to run each concurrency level for a fixed number of seconds instead (time-bound), which
gives a more realistic picture of sustained throughput.
Attributes:
| Name | Type | Description |
|---|---|---|
| endpoint | Endpoint | The LLM endpoint to test. |
| payload | dict \| list[dict] | The request payload(s) to send. |
| sequence_of_clients | list[int] | Concurrency levels to test. |
| min_requests_per_client | int | Minimum requests per client in count-bound mode. |
| min_requests_per_run | int | Minimum total requests per run in count-bound mode. |
| run_duration | int \| float \| None | When set, each concurrency level runs for this many seconds instead of a fixed request count. Mutually exclusive with the count-bound settings (min_requests_per_client, min_requests_per_run). |
| low_memory | bool | When True, runs in low-memory mode, reducing memory usage for large-scale tests. |
| progress_bar_stats | dict \| None | Controls which live stats appear on the progress bar. |
| output_path | PathLike \| str \| None | Where to save results. |
| tokenizer | Tokenizer \| None | Optional tokenizer for token counting. |
| test_name | str \| None | Name for this test. Defaults to current date/time. |
| callbacks | list[Callback] \| None | Optional callbacks. |
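The payload format depends on the target endpoint. As a rough illustration only (the exact keys your Endpoint expects may differ), the sample_payload used in the examples below might look like:
# Hypothetical payload for a chat-style endpoint; adjust the keys to
# match what your Endpoint implementation expects.
sample_payload = {
    "messages": [{"role": "user", "content": "Summarize this document."}],
    "max_tokens": 256,
}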
Example::
# Count-bound: 10 requests per client at each concurrency level
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10, 20],
    min_requests_per_client=10,
    output_path="outputs/load_test",
)
result = await load_test.run()
result.plot_results()

# Time-bound: 60 seconds per concurrency level
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10, 20],
    run_duration=60,
    output_path="outputs/load_test",
)
result = await load_test.run()

# Time-bound with low-memory mode for large-scale tests
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10, 20, 50],
    run_duration=120,
    low_memory=True,
    output_path="outputs/large_load_test",
)
result = await load_test.run()
run
async
run(output_path=None)
Run the load test across all configured concurrency levels.
Creates a Runner (llmeter.runner.Runner) and iterates through
sequence_of_clients, running one test per concurrency level. In
time-bound mode (run_duration is set), each level runs for a fixed
duration. In count-bound mode, each level sends a fixed number of
requests per client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | | Optional (local or remote) folder to save results. If provided, individual run results are saved in subdirectories under it. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| LoadTestResult | LoadTestResult | A result object containing one Result per tested client count. |
Example::
load_test = LoadTest(
    endpoint=my_endpoint,
    payload=sample_payload,
    sequence_of_clients=[1, 5, 10],
    run_duration=30,
)
result = await load_test.run(output_path="outputs/my_test")

# Access individual results by client count
result.results[5].stats["requests_per_minute"]

# Plot all standard charts
result.plot_results()
Source code in llmeter/experiments.py
LoadTestResult
dataclass
LoadTestResult(results, test_name, output_path=None)
load
classmethod
load(load_path, test_name=None, load_responses=True)
Load test results from a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| load_path | ReadablePathLike \| None | Directory path containing the load test results subdirectories. | required |
| test_name | str \| None | Optional name for the test. If not provided, will use the directory name. | None |
| load_responses | bool | Whether to load individual invocation responses. Defaults to True. When False, only summaries and pre-computed stats are loaded. | True |
Returns:
| Name | Type | Description |
|---|---|---|
| LoadTestResult | LoadTestResult | A LoadTestResult object containing the loaded results. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If load_path does not exist or is None/empty. |
| ValueError | If no results are found in the directory. |
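A short sketch of reloading previously saved results is shown below. The folder path is a placeholder; pass whatever directory the original LoadTest wrote its results to.
Example (illustrative)::
# Reload a saved load test without individual responses, which is
# faster and uses less memory than a full load.
result = LoadTestResult.load(
    "outputs/load_test",   # folder written by LoadTest(output_path=...)
    load_responses=False,
)
result.results           # per-concurrency results keyed by client count
result.plot_results()    # same plotting helper as a freshly-run test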
Source code in llmeter/experiments.py