
vLLM Environment Variables

LISA Serve supports configuring vLLM model serving through environment variables. These variables allow you to control performance, memory usage, parallelization, and advanced features when deploying models with vLLM.

  • NOTE: Standard vLLM environment variables are supported and passed directly into the vLLM container. See vLLM's documentation for the full list of supported variables.

Core Performance & Memory

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| VLLM_GPU_MEMORY_UTILIZATION | Fraction of GPU memory to use (0.0-1.0) | 0.9 | 0.85 |
| VLLM_MAX_MODEL_LEN | Maximum context length override | Auto | 4096 |
| MAX_TOTAL_TOKENS | Legacy alias for VLLM_MAX_MODEL_LEN | Auto | 4096 |
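
As a minimal sketch, the two memory-related variables might be set together when a model otherwise runs out of GPU memory; the values shown are the examples from the table above, not recommendations:

```bash
# Leave 15% of GPU memory as headroom (default is 0.9) and cap the
# context window at 4096 tokens to shrink the KV cache allocation.
VLLM_GPU_MEMORY_UTILIZATION=0.85
VLLM_MAX_MODEL_LEN=4096
```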

Model Format & Loading

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| VLLM_DTYPE | Model precision | auto | half, float16, bfloat16, float32 |
| VLLM_QUANTIZATION | Quantization method | - | awq, gptq, squeezellm, fp8 |
| VLLM_TRUST_REMOTE_CODE | Allow custom model code execution | false | true |
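
For example, serving an AWQ-quantized model in half precision could look like the following sketch (whether a given checkpoint supports awq depends on how it was exported):

```bash
# Run in 16-bit precision with AWQ quantization.
# VLLM_TRUST_REMOTE_CODE is only needed for models that ship custom
# modeling code; leave it false otherwise.
VLLM_DTYPE=half
VLLM_QUANTIZATION=awq
VLLM_TRUST_REMOTE_CODE=true
```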

Performance Tuning

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| VLLM_MAX_NUM_BATCHED_TOKENS | Maximum tokens per batch | Auto | 8192 |
| VLLM_MAX_NUM_SEQS | Maximum concurrent sequences | 256 | 128, 512 |
| VLLM_ENABLE_PREFIX_CACHING | Enable prefix caching for repeated prompts | false | true |
| VLLM_ENABLE_CHUNKED_PREFILL | Enable chunked prefill | false | true |
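
A hypothetical throughput-oriented combination of these settings, using the example values from the table (real values should come from load testing against your workload):

```bash
# Larger batches plus prefix caching help workloads with repeated prompt
# prefixes; chunked prefill interleaves long prompts with ongoing decodes.
VLLM_MAX_NUM_BATCHED_TOKENS=8192
VLLM_MAX_NUM_SEQS=512
VLLM_ENABLE_PREFIX_CACHING=true
VLLM_ENABLE_CHUNKED_PREFILL=true
```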

Parallel Processing

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| VLLM_TENSOR_PARALLEL_SIZE | Split model across N GPUs | 1 | 2, 4, 8 |
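
For instance, to shard a large model across four GPUs (assuming the container actually has four GPUs visible):

```bash
# Shard the model's weights across 4 GPUs via tensor parallelism;
# the value should match the number of GPUs available to the container.
VLLM_TENSOR_PARALLEL_SIZE=4
```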

Tool Calling / Function Calling

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| VLLM_ENABLE_AUTO_TOOL_CHOICE | Enable automatic tool choice routing | false | true |
| VLLM_TOOL_CALL_PARSER | Tool call parser implementation | - | hermes, mistral, llama3_json, qwen |

Note: Tool calling requires both setting VLLM_ENABLE_AUTO_TOOL_CHOICE=true and specifying a VLLM_TOOL_CALL_PARSER appropriate for your model. See vLLM Tool Calling Documentation for details.
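
Putting the two together, a sketch for a Hermes-format model (the parser name must match your model family; see the examples in the table above):

```bash
# Both variables are required for tool calling to work end to end.
VLLM_ENABLE_AUTO_TOOL_CHOICE=true
VLLM_TOOL_CALL_PARSER=hermes
```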

Reference

For more details on vLLM configuration, see the official vLLM documentation.