
Metrics and Statistics

LLM latency metrics

When evaluating the performance of a large language model, latency is typically broken down into a few key metrics. These are general concepts, independent of any specific tool.

sequenceDiagram
    participant Client
    participant LLM as LLM Endpoint

    Client->>LLM: Send request
    Note right of Client: ⏱ Clock starts
    LLM-->>Client: First token arrives
    Note right of Client: ⏱ TTFT
    LLM-->>Client: ...tokens streaming...
    LLM-->>Client: Last token arrives
    Note right of Client: ⏱ TTLT

Time to First Token (TTFT)

The time a consumer waits before receiving any response. For streaming endpoints, this is the delay before the first chunk arrives. TTFT is primarily influenced by the model's prefill phase (processing the input prompt) and network latency.

When comparing two endpoints, TTFT is most meaningful when the input token count is the same (or close enough that the difference is smaller than the precision you want to measure).

Time to Last Token (TTLT)

The total end-to-end latency from sending the request to receiving the complete response. This is the metric most directly tied to user-perceived response time for non-streaming use cases.

Time Per Output Token (TPOT)

The average generation speed per token, computed as:

TPOT = (TTLT - TTFT) / (output_tokens - 1)

Ideally (excluding non-linear approaches like speculative decoding), TPOT should be a property of the underlying accelerated compute and largely independent of the number of input and generated tokens. This makes TPOT a useful metric for comparing endpoints even when the workloads differ.
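
As a quick illustration, the formula can be applied directly to measured values (the numbers below are hypothetical, not from a real run):

# Hypothetical measurements for a single streaming request
ttft = 0.42           # time to first token, seconds
ttlt = 3.10           # time to last token, seconds
output_tokens = 128   # number of generated tokens

# Average per-token generation time, excluding the prefill phase
tpot = (ttlt - ttft) / (output_tokens - 1)
print(f"TPOT: {tpot * 1000:.1f} ms/token")  # ~21.1 ms/token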

Streaming vs non-streaming

TTFT and TPOT, and hence their distinction from TTLT, only apply to streaming endpoints. Non-streaming endpoints report only TTLT (the total round-trip time), since the entire response arrives at once.

Note

TTLT may differ between streaming and non-streaming invocations of the same model, so always test the invocation method you'll actually be using. Don't choose to test streaming only because it provides additional metrics.

Token counts

Input and output token counts are fundamental to understanding latency behavior. TTFT tends to scale with input length, while TTLT scales with output length. Token counts are also the primary cost driver for most pay-per-use LLM APIs.

Why distributions matter

Latency metrics are best understood as distributions, not single numbers. While summary statistics (median, p90, average) are convenient for comparison, the underlying distribution reveals important behavior: bimodal patterns, long tails, or outliers that a single percentile would hide. Highly skewed data is common in LLM latency measurements.
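
As a hypothetical illustration with synthetic data, a healthy-looking median can completely hide a slow second mode that the p90 exposes:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic bimodal latencies: 85% of requests take ~1 s, 15% hit a slow path at ~4 s
fast = rng.normal(1.0, 0.1, size=850)
slow = rng.normal(4.0, 0.3, size=150)
latencies = np.concatenate([fast, slow])

print(f"p50: {np.percentile(latencies, 50):.2f} s")  # close to 1 s, looks fine
print(f"p90: {np.percentile(latencies, 90):.2f} s")  # roughly 4 s, exposes the slow mode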

Percentiles and sample size

High percentiles (p90, p95, p99) are widely used to characterize tail latency, but their reliability depends directly on how many data points you have. A percentile estimate is only as good as the number of observations that actually fall in the tail beyond it.

The expected number of tail observations for a given percentile is n × (1 - p):

| Percentile | Tail fraction | n for 1 tail observation | n for 5 tail observations |
|------------|---------------|--------------------------|---------------------------|
| p90        | 10%           | 10                       | 50                        |
| p95        | 5%            | 20                       | 100                       |
| p99        | 1%            | 100                      | 500                       |

With only 1 tail observation, the percentile estimate equals a single data point — any outlier (network glitch, cold start, GC pause) dominates the value. With 5 or more tail observations, the estimate is based on multiple independent measurements and becomes meaningfully stable.

As a practical guideline:

  • Below 1 / (1 - p) samples (e.g. fewer than 100 for p99), the percentile is purely extrapolated and should not be reported.
  • Between 1 / (1 - p) and 5 / (1 - p) samples, the percentile exists but is unreliable — treat it as a rough approximation.
  • Above 5 / (1 - p) samples (e.g. 500+ for p99), the estimate is based on enough tail observations to be trustworthy for decision-making.

Sizing your test runs

When planning how many requests to include in a test run, consider which percentiles you need to report. If p99 latency is important for your SLOs, aim for at least 500 successful requests per run. For p90, 50 requests is a reasonable minimum.
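
As a quick planning check, the "at least 5 expected tail observations" rule of thumb from the previous section can be computed directly (a minimal sketch, not part of LLMeter):

# Roughly 5 / (1 - p) successful requests give ~5 expected tail observations
for p in (0.90, 0.95, 0.99):
    n_min = round(5 / (1 - p))
    print(f"p{round(p * 100)}: plan for at least {n_min} successful requests")

# p90: plan for at least 50 successful requests
# p95: plan for at least 100 successful requests
# p99: plan for at least 500 successful requests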

This is not specific to LLM testing — it applies to any latency measurement.

How LLMeter captures these metrics

LLMeter measures all timings as wall-clock values from the client side using Python's time.perf_counter. They include network latency between LLMeter and the endpoint under test.
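
For intuition, client-side streaming timings can be captured along the following lines. This is a simplified sketch of the general approach, not LLMeter's internal code; `stream_response` stands in for any iterator of response chunks from the endpoint:

import time

def measure_streaming_latency(stream_response):
    """Record TTFT and TTLT around a chunk iterator using wall-clock timing."""
    start = time.perf_counter()
    time_to_first_token = None
    num_chunks = 0

    for _ in stream_response:
        if time_to_first_token is None:
            # First chunk arrived: this delay is the TTFT
            time_to_first_token = time.perf_counter() - start
        num_chunks += 1

    # Stream exhausted: total elapsed time is the TTLT
    time_to_last_token = time.perf_counter() - start
    return time_to_first_token, time_to_last_token, num_chunks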

Per-request fields

Each request produces an InvocationResponse with:

| Field                 | Unit    | Description |
|-----------------------|---------|-------------|
| time_to_first_token   | seconds | TTFT. Only populated for streaming endpoints. |
| time_to_last_token    | seconds | TTLT. Always populated on successful requests. |
| time_per_output_token | seconds | TPOT. Only available when num_tokens_output > 1. |
| num_tokens_input      | count   | Input token count. Reported by the endpoint or estimated by a tokenizer configured on the Runner. |
| num_tokens_output     | count   | Output token count. Reported by the endpoint or estimated by a tokenizer configured on the Runner. |
| error                 | string  | Error message if the request failed, None otherwise. |

Run-level statistics

After a batch of requests completes, the Runner computes aggregate statistics available via Result.stats.

Throughput and error metrics

| Statistic | Description |
|-----------|-------------|
| requests_per_minute | Total requests divided by total test time, scaled to a per-minute rate. |
| failed_requests | Count of requests that returned an error. |
| failed_requests_rate | Ratio of failed requests to total requests (0.0 to 1.0). |
| total_input_tokens | Sum of num_tokens_input across all requests. |
| total_output_tokens | Sum of num_tokens_output across all requests. |
| average_input_tokens_per_minute | Total input tokens divided by test time, scaled to a per-minute rate. |
| average_output_tokens_per_minute | Total output tokens divided by test time, scaled to a per-minute rate. |

Distribution statistics

For each of the four core per-request metrics (time_to_last_token, time_to_first_token, num_tokens_output, num_tokens_input), LLMeter computes distributional aggregates across all successful responses. Each metric gets the following aggregations, accessible as {metric}-{aggregation} keys in Result.stats:

| Aggregation  | Key suffix | Description |
|--------------|------------|-------------|
| Mean         | -average   | Arithmetic mean of all values. |
| Median (p50) | -p50       | 50th percentile. |
| p90          | -p90       | 90th percentile. |
| p99          | -p99       | 99th percentile. |

For example, Result.stats["time_to_first_token-p90"] gives the 90th percentile TTFT across all requests in the run.

Tip

NaN values (from failed requests) are automatically excluded from all aggregation calculations.
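
This is equivalent to NaN-aware aggregation, illustrated below with NumPy as a generic example (not LLMeter's implementation):

import numpy as np

# TTFT values for five requests; the failed request contributes NaN
ttft_values = np.array([0.41, 0.39, np.nan, 0.44, 0.40])

print(np.nanmean(ttft_values))            # mean over the four successful requests
print(np.nanpercentile(ttft_values, 90))  # p90, ignoring the NaN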

Accessing statistics

Statistics are available through the Result.stats property, or from the stats.json file saved in the output folder when an output_path is configured:

result = await runner.run(payload=payload, n_requests=30, clients=5)

# Access stats programmatically
print(result.stats["time_to_first_token-p50"])
print(result.stats["requests_per_minute"])

# Or filter for a specific metric
ttft_stats = {k: v for k, v in result.stats.items() if "time_to_first_token" in k}

Cost metrics

The CostModel callback extends Result.stats with cost estimates. When attached to a Runner or Experiment, it adds:

| Statistic | Description |
|-----------|-------------|
| cost_total | Total estimated cost for the entire run (all dimensions combined). |
| cost_{DimensionName} | Total cost for a specific dimension (e.g. cost_InputTokens, cost_OutputTokens). |
| cost_per_request-average | Mean cost per request (request-level dimensions only). |
| cost_per_request-p50 | Median cost per request. |
| cost_per_request-p90 | 90th percentile cost per request. |
| cost_{DimensionName}_per_request-{aggregation} | Per-dimension, per-request statistics. |

Warning

When using a cost model with both request-level and run-level dimensions, cost_per_request-average only includes request-level costs. For the true average cost per request including infrastructure costs, use result.cost_total / result.total_requests.
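
For example, assuming result comes from a run with a CostModel attached as described above:

# Request-level average only (excludes run-level dimensions such as infrastructure)
print(result.stats["cost_per_request-average"])

# Average per request across all cost dimensions, per the warning above
print(result.cost_total / result.total_requests)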

See the Model Costs example notebook for a walkthrough of request-based pricing, infrastructure-based pricing, and custom cost dimensions.

Experiment-level metrics

Load test

The LoadTest experiment runs multiple batches at different concurrency levels (sequence_of_clients) and collects per-run statistics for each. This lets you observe how latency, throughput, and error rate change as concurrent request volume increases. The LoadTestResult.plot_results() method generates standard charts covering:

  • Average input/output tokens vs. number of clients
  • Error rate vs. number of clients
  • Requests per minute vs. number of clients
  • Time to first token vs. number of clients
  • Time to last token vs. number of clients
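
As a rough sketch of how such an experiment might be launched (the import path and constructor arguments other than sequence_of_clients are assumptions; check the LoadTest API for the exact signature):

from llmeter.experiments import LoadTest

# Hypothetical setup: `runner` and `payload` are assumed to be configured
# as in the earlier examples; exact argument names may differ.
load_test = LoadTest(
    runner=runner,
    payload=payload,
    sequence_of_clients=[1, 5, 20, 50],
)

load_test_result = await load_test.run()
load_test_result.plot_results()  # generates the standard charts listed above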

Latency heatmap

The LatencyHeatmap experiment explores how latency varies as a function of input prompt length and output completion length, producing a 2D heatmap of response times.

Visualizing distributions

LLMeter provides Plotly-based plotting functions (requires the plotting extra):

from llmeter.plotting import boxplot_by_dimension, histogram_by_dimension, scatter_histogram_2d
import plotly.graph_objects as go

# Boxplot comparison of TTFT across two runs
fig = go.Figure()
fig.add_traces([
    boxplot_by_dimension(result=result_1, dimension="time_to_first_token"),
    boxplot_by_dimension(result=result_2, dimension="time_to_first_token"),
])

# Histogram of TTFT with 20ms bins
fig = go.Figure()
fig.add_trace(
    histogram_by_dimension(result, dimension="time_to_first_token", xbins={"size": 0.02})
)

# 2D scatter + histogram of TPOT vs output token count
fig = scatter_histogram_2d(result, "num_tokens_output", "time_per_output_token", 20, 20)

For complete comparison workflows, see the example notebooks:

  • TTFT comparison — comparing time to first token distributions with bootstrapped confidence intervals
  • TPOT comparison — comparing time per output token, including TPOT vs. output length analysis
  • Compare load tests — overlaying load test results side by side