Extend LLMeter with Callbacks
There's a lot more to performance testing than calling your LLM many times and recording how fast it responds. To avoid bloating the performance-sensitive core of the library, LLMeter handles many of these more advanced or optional concerns through a callback mechanism.
The dedicated sections in this guide show how built-in callbacks and standard patterns can help you, for example:
- Avoid prompt caching on your target Endpoint
- Model and compare the costs of different Endpoints
- Track your experiments with MLflow
Build your own custom callbacks
You can implement custom callbacks to add your own functionality to LLMeter by extending the Callback base class and implementing one or more of its standard methods:
- before_run is called before each test Run starts, and has the opportunity to inspect or modify the Run configuration.
- before_invoke is called before each individual model invocation, and can inspect or modify the request payload.
- after_invoke is called after each model invocation, and can inspect or modify the InvocationResponse.
- after_run is called after each test Run completes, and can inspect or modify the RunResult.
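To make the four hooks concrete, here is a minimal, self-contained sketch. It does not import LLMeter: the `Callback` class below is a simplified stand-in, and `fake_run`, `InvocationCounter`, the dict-based payloads and results, and the choice of `async` hooks that modify objects in place are all assumptions made for illustration. Check the library's own Callback base class for the actual method signatures before adapting it.

```python
import asyncio


class Callback:
    """Simplified stand-in for a callback base class (hypothetical, not LLMeter's own)."""

    async def before_run(self, run_config): ...
    async def before_invoke(self, payload): ...
    async def after_invoke(self, response): ...
    async def after_run(self, result): ...


class InvocationCounter(Callback):
    """Counts invocations and tags each payload with a sequence number."""

    def __init__(self):
        self.started = 0
        self.finished = 0

    async def before_invoke(self, payload):
        self.started += 1
        # Assumption: the hook may modify the request payload in place.
        payload["request_index"] = self.started

    async def after_invoke(self, response):
        self.finished += 1

    async def after_run(self, result):
        # Attach a summary value to the (dict-based, hypothetical) run result.
        result["invocations_seen"] = self.finished


async def fake_run(callback, payloads):
    """Hypothetical driver standing in for a Runner: calls the hooks in order."""
    result = {}
    await callback.before_run({"n_requests": len(payloads)})
    for payload in payloads:
        await callback.before_invoke(payload)
        response = {"echo": dict(payload)}  # pretend model call
        await callback.after_invoke(response)
    await callback.after_run(result)
    return result


payloads = [{"prompt": "hi"}, {"prompt": "there"}]
counter = InvocationCounter()
result = asyncio.run(fake_run(counter, payloads))
```

Because the subclass only overrides the hooks it needs, `before_run` falls through to the no-op default, which mirrors the "one or more of its standard methods" pattern described above.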
Callbacks are processed outside of the timing of invocations, and before_run and after_run callbacks are processed outside of the timing of the overall Run. However, slow callbacks can still limit the maximum volume of traffic that LLMeter is able to drive, and the backlog of after_invoke callbacks must be cleared before a Run is considered ended.
You can specify one or more callbacks when setting up your Runner, as below:
runner = Runner(
    endpoint,
    output_path=f"outputs/{endpoint.model_id}",
    callbacks=[MlflowCallback(), MyCoolCallback()],
)
results = await runner.run(
    payload=sample_payloads,
    n_requests=10,
    clients=10,
)
Callbacks are processed in the order you provide them to the runner. This matters when you stack multiple callbacks that access the same data: for example, if one callback transforms an invocation response and another logs or exports it, both using after_invoke, the transforming callback must come first for the logged data to include the transformation.
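The effect of ordering can be illustrated with a small, self-contained sketch. Nothing here is LLMeter code: `RedactCallback`, `LogCallback`, `dispatch_after_invoke`, and the dict-based response are hypothetical names, and the dispatcher simply assumes the in-order behaviour described above.

```python
import asyncio


class RedactCallback:
    """Hypothetical callback: transforms the response text in after_invoke."""

    async def after_invoke(self, response):
        response["text"] = response["text"].replace("secret", "[redacted]")


class LogCallback:
    """Hypothetical callback: records whatever response text it sees."""

    def __init__(self):
        self.seen = []

    async def after_invoke(self, response):
        self.seen.append(response["text"])


async def dispatch_after_invoke(callbacks, response):
    # Assumed behaviour, per the text: callbacks run in list order.
    for cb in callbacks:
        await cb.after_invoke(response)


def logged_text(order):
    """Run one fake response through both callbacks in the requested order."""
    logger = LogCallback()
    if order == "redact_first":
        stack = [RedactCallback(), logger]
    else:
        stack = [logger, RedactCallback()]
    asyncio.run(dispatch_after_invoke(stack, {"text": "my secret plan"}))
    return logger.seen[0]


redact_then_log = logged_text("redact_first")
log_then_redact = logged_text("log_first")
```

With the transforming callback first, the logger records the redacted text; with the order reversed, it records the original text, even though the response object is still redacted afterwards.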