vLLM Environment Variables

LISA Serve supports configuring vLLM model serving through environment variables. These variables allow you to control performance, memory usage, parallelization, and advanced features when deploying models with vLLM.

  • NOTE: Standard vLLM environment variables are supported and passed directly into the vLLM container. See vLLM's documentation.
  • Review your ECS instance type's specifications to confirm it has enough VRAM and RAM for the model you want LISA Serve to host. On instances with multiple GPUs, you may need to set the VLLM_TENSOR_PARALLEL_SIZE environment variable to use all of them.
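As a rough aid for the capacity check above, the following sketch estimates whether a model's weights alone fit in the per-GPU memory budget. It is back-of-the-envelope arithmetic, not part of LISA Serve or vLLM: the function name weights_fit and its parameters are hypothetical, and the estimate ignores the KV cache, activations, and runtime overhead, so a "fits" result is necessary but not sufficient.

```python
# Hypothetical helper: rough check that model weights fit in GPU memory.
# Ignores KV cache, activations, and framework overhead.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "half": 2, "bfloat16": 2}
GIB = 1024 ** 3


def weights_fit(num_params, dtype, gpu_vram_gib,
                tensor_parallel_size=1, gpu_memory_utilization=0.9):
    """Return True if the per-GPU share of the weights is under the budget."""
    total_weight_gib = num_params * BYTES_PER_PARAM[dtype] / GIB
    # Assumption: tensor parallelism splits the weights evenly across GPUs.
    per_gpu_gib = total_weight_gib / tensor_parallel_size
    # Only the configured fraction of each GPU's memory is available.
    budget_gib = gpu_vram_gib * gpu_memory_utilization
    return per_gpu_gib < budget_gib
```

For example, weights_fit(70e9, "bfloat16", 24, tensor_parallel_size=4) reports that a 70-billion-parameter model does not fit on four 24 GiB GPUs, which suggests either a larger instance or a higher VLLM_TENSOR_PARALLEL_SIZE.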

Core Performance & Memory

| Variable | Description | Default | Example |
|---|---|---|---|
| VLLM_GPU_MEMORY_UTILIZATION | Fraction of GPU memory to use (0.0-1.0) | 0.9 | 0.85 |
| VLLM_MAX_MODEL_LEN | Maximum context length override | Auto | 4096 |
| MAX_TOTAL_TOKENS | Legacy alias for VLLM_MAX_MODEL_LEN | Auto | 4096 |
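The sketch below illustrates how these three variables might be read and validated before a deployment. It is an assumption-laden example, not LISA Serve's actual logic: the function names are hypothetical, the rule that VLLM_MAX_MODEL_LEN wins over the legacy MAX_TOTAL_TOKENS alias when both are set is assumed, and rejecting a utilization of exactly 0 is a choice made here for illustration.

```python
# Hypothetical helpers for reading the core memory settings from a mapping
# such as os.environ. Precedence between the two names is an assumption.


def resolve_max_model_len(env):
    """Return the context length override as an int, or None for 'Auto'."""
    value = env.get("VLLM_MAX_MODEL_LEN") or env.get("MAX_TOTAL_TOKENS")
    return int(value) if value else None


def parse_gpu_memory_utilization(env):
    """Return the GPU memory fraction, falling back to the 0.9 default."""
    value = float(env.get("VLLM_GPU_MEMORY_UTILIZATION", "0.9"))
    # The documented range is 0.0-1.0; 0 is rejected here because it would
    # leave no memory for the model.
    if not 0.0 < value <= 1.0:
        raise ValueError(
            f"VLLM_GPU_MEMORY_UTILIZATION must be in (0.0, 1.0], got {value}"
        )
    return value
```

Passing os.environ to either function checks the current process environment; passing a plain dict makes the same checks easy to run in a unit test.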

Model Format & Loading

| Variable | Description | Default | Example |
|---|---|---|---|
| VLLM_DTYPE | Model precision | auto | half, float16, bfloat16, float32 |
| VLLM_QUANTIZATION | Quantization method | - | awq, gptq, squeezellm, fp8 |
| VLLM_TRUST_REMOTE_CODE | Allow custom model code execution | false | true |

Performance Tuning

| Variable | Description | Default | Example |
|---|---|---|---|
| VLLM_MAX_NUM_BATCHED_TOKENS | Maximum tokens per batch | Auto | 8192 |
| VLLM_MAX_NUM_SEQS | Maximum concurrent sequences | 256 | 128, 512 |
| VLLM_ENABLE_PREFIX_CACHING | Enable prefix caching for repeated prompts | false | true |
| VLLM_ENABLE_CHUNKED_PREFILL | Enable chunked prefill | false | true |
| VLLM_ASYNC_SCHEDULING | Adds --async-scheduling for higher performance where the hardware supports it | false | true |
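To make the relationship between these variables and vLLM's command line concrete, here is a minimal sketch of one way such a translation could work. Only the --async-scheduling flag is named in this page; the other flag names are the corresponding vLLM server options as commonly documented, and the function build_tuning_args and its mapping tables are hypothetical. How LISA Serve performs this translation internally is not described here, so treat this as an illustration only.

```python
# Hypothetical translation of performance-tuning variables into CLI arguments.
# The mapping tables are assumptions for illustration.

VALUE_FLAGS = {
    "VLLM_MAX_NUM_BATCHED_TOKENS": "--max-num-batched-tokens",
    "VLLM_MAX_NUM_SEQS": "--max-num-seqs",
}

BOOLEAN_FLAGS = {
    "VLLM_ENABLE_PREFIX_CACHING": "--enable-prefix-caching",
    "VLLM_ENABLE_CHUNKED_PREFILL": "--enable-chunked-prefill",
    "VLLM_ASYNC_SCHEDULING": "--async-scheduling",
}


def build_tuning_args(env):
    """Build a list of CLI arguments from a mapping of environment variables."""
    args = []
    for name, flag in VALUE_FLAGS.items():
        if env.get(name):
            # int() rejects non-numeric values early with a ValueError.
            args += [flag, str(int(env[name]))]
    for name, flag in BOOLEAN_FLAGS.items():
        # Unset variables behave like the documented default of "false".
        if env.get(name, "false").strip().lower() == "true":
            args.append(flag)
    return args
```

Variables left unset contribute no arguments, which mirrors the "Auto" and "false" defaults in the table.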

Parallel Processing

| Variable | Description | Default | Example |
|---|---|---|---|
| VLLM_TENSOR_PARALLEL_SIZE | Split model across N GPUs | 1 | 2, 4, 8 |

Tool Calling / Function Calling

| Variable | Description | Default | Example |
|---|---|---|---|
| VLLM_ENABLE_AUTO_TOOL_CHOICE | Enable automatic tool choice routing | false | true |
| VLLM_TOOL_CALL_PARSER | Tool call parser implementation | - | hermes, mistral, llama3_json, qwen |

Note: Tool calling requires both VLLM_ENABLE_AUTO_TOOL_CHOICE=true and a VLLM_TOOL_CALL_PARSER value appropriate for your model. See the vLLM Tool Calling documentation for details.
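Because the two tool-calling variables only work as a pair, a pre-deployment check along the following lines can catch a half-configured setup. This is a sketch under stated assumptions: validate_tool_calling is a hypothetical name, and treating a parser without auto tool choice as an error (rather than silently ignoring it) is a choice made here, not documented behavior.

```python
# Hypothetical pre-deployment check for the tool-calling variable pair.


def validate_tool_calling(env):
    """Return True if tool calling is fully configured, False if disabled.

    Raises ValueError when only one of the two variables is set.
    """
    auto = env.get("VLLM_ENABLE_AUTO_TOOL_CHOICE", "false").strip().lower() == "true"
    parser = env.get("VLLM_TOOL_CALL_PARSER", "").strip()
    if auto and not parser:
        raise ValueError(
            "VLLM_ENABLE_AUTO_TOOL_CHOICE=true requires VLLM_TOOL_CALL_PARSER"
        )
    if parser and not auto:
        raise ValueError(
            "VLLM_TOOL_CALL_PARSER is set but VLLM_ENABLE_AUTO_TOOL_CHOICE is not true"
        )
    return auto
```

The check does not verify that the parser name matches the model; that still has to be confirmed against the vLLM documentation for the model in question.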

Reference

For more details on vLLM configuration, see the official vLLM documentation.