Experiments
experiments
Higher-level experiments (generally combining multiple Runs)
This module provides utilities to run more complex "experiments" that go beyond the scope of a single Run.
LatencyHeatmap
dataclass
LatencyHeatmap(endpoint, source_file, clients=4, output_path=None, input_lengths=[10, 50, 200, 500], output_lengths=[128, 256, 512, 1024], requests_per_combination=1, create_payload_fn=None, create_payload_kwargs={}, tokenizer=None)
Experiment to measure how latency varies by input and output token count
This experiment uses a source text file to generate input prompts/payloads of different lengths, and measures how response time varies with both the input lengths and output/response lengths.
Attributes:
| Name | Type | Description |
|---|---|---|
| `endpoint` | `Endpoint` | The LLM endpoint to test. |
| `source_file` | `UPath \| str` | The source file from which prompts of different lengths will be sampled (see |
| `clients` | `int` | The number of concurrent clients (requests) to use for the experiment. Note that using a high number of concurrent clients could impact observed latency. |
| `output_path` | `UPath \| str \| None` | The (local or Cloud e.g. |
| `input_lengths` | `Sequence[int]` | The approximate input/prompt lengths to test. Since the locally-available |
| `output_lengths` | `Sequence[int]` | The target output lengths to test. Since generation may stop early for certain prompts, and some endpoints may not report exact token counts in their responses, the results may not correspond exactly to these targets. |
| `requests_per_combination` | `int` | The number of requests to make for each combination of input and output lengths. |
| `create_payload_fn` | `Callable \| None` | A function to create the actual endpoint payload for each invocation, from the sampled text prompt. Typically, you'll want to specify a prefix for your prompt in either this or the |
| `create_payload_kwargs` | `Dict` | Keyword arguments to pass to the |
| `tokenizer` | `Tokenizer \| None` | A tokenizer to be used for sampling prompts of the specified lengths, and also for estimating the generated output lengths if necessary for your endpoint. If not set, the |
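Taken together, `input_lengths`, `output_lengths`, and `requests_per_combination` define the grid of requests the heatmap sweeps. A minimal sketch of that grid in plain Python (independent of llmeter; the values below are the defaults from the signature above):

```python
from itertools import product

# Defaults from the LatencyHeatmap signature above
input_lengths = [10, 50, 200, 500]      # approximate prompt token counts
output_lengths = [128, 256, 512, 1024]  # target generation token counts
requests_per_combination = 1

# Each cell of the heatmap is one (input, output) combination,
# measured requests_per_combination times.
grid = [
    (n_in, n_out)
    for n_in, n_out in product(input_lengths, output_lengths)
    for _ in range(requests_per_combination)
]

print(len(grid))  # 4 x 4 x 1 = 16 requests in total
```

Raising `requests_per_combination` multiplies the total request count accordingly, which trades experiment runtime for less noisy per-cell latency estimates.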
LoadTest
dataclass
LoadTest(endpoint, payload, sequence_of_clients, min_requests_per_client=1, min_requests_per_run=10, output_path=None, tokenizer=None, test_name=None, callbacks=None)
Experiment to explore how performance changes at different concurrency levels
This experiment creates a series of Runs at different levels of concurrency, defined by `sequence_of_clients`, and executes them one after the other.
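For illustration, here is a hypothetical schedule for such a test in plain Python. The sizing rule shown, `max(min_requests_per_run, clients * min_requests_per_client)`, is an assumption about how the two minimums might interact, not confirmed llmeter behavior:

```python
# Hypothetical sketch of how a LoadTest schedule might be sized.
# ASSUMPTION: each run issues at least min_requests_per_run requests
# overall and at least min_requests_per_client per client; the actual
# llmeter rule may differ.
sequence_of_clients = [1, 5, 20, 50]
min_requests_per_client = 1
min_requests_per_run = 10

schedule = {
    clients: max(min_requests_per_run, clients * min_requests_per_client)
    for clients in sequence_of_clients
}

for clients, n_requests in schedule.items():
    print(f"{clients:>3} clients -> {n_requests} requests")
```

Under this assumption, low-concurrency runs are padded up to `min_requests_per_run` so that every concurrency level yields enough samples for stable statistics.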
LoadTestResult
dataclass
LoadTestResult(results, test_name, output_path=None)
load
classmethod
load(load_path, test_name=None, load_responses=True)
Load test results from a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `load_path` | `UPath \| str \| None` | Directory path containing the load test results subdirectories | *required* |
| `test_name` | `str \| None` | Optional name for the test. If not provided, will use the directory name | `None` |
| `load_responses` | `bool` | Whether to load individual invocation responses. Defaults to True. When False, only summaries and pre-computed stats are loaded | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `LoadTestResult` | `LoadTestResult` | A LoadTestResult object containing the loaded results |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If `load_path` does not exist or is None/empty |
| `ValueError` | If no results are found in the directory |
Source code in llmeter/experiments.py