Optimization Levels¶
Overview¶
vLLM now supports optimization levels (-O0, -O1, -O2, -O3). Optimization levels provide an intuitive mechnaism for users to trade startup time for performance. Higher levels have better performance but worse startup time. These optimization levels have associated defaults to help users get desired out of the box performance. Importantly, defaults set by optimization levels are purely defaults; explicit user settings will not be overwritten.
Level Summaries and Usage Examples¶
# CLI usage
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O0
# Python API usage
from vllm.entrypoints.llm import LLM
llm = LLM(
model="RedHatAI/Llama-3.2-1B-FP8",
optimization_level=0
)
-O1: Quick Optimizations¶
- Startup: Moderate startup time
- Performance: Inductor compilation, CUDAGraphMode.PIECEWISE
- Use case: Balance for most development scenarios
# CLI usage
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1
# Python API usage
from vllm.entrypoints.llm import LLM
llm = LLM(
model="RedHatAI/Llama-3.2-1B-FP8",
optimization_level=1
)
-O2: Full Optimizations (Default)¶
- Startup: Longer startup time
- Performance:
-O1+ CUDAGraphMode.FULL_AND_PIECEWISE - Use case: Production workloads where performance is important. This is the default use case. It is also very similar to the previous default. The primary difference is that noop & fusion flags are enabled.
# CLI usage (default, so optional)
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O2
# Python API usage
from vllm.entrypoints.llm import LLM
llm = LLM(
model="RedHatAI/Llama-3.2-1B-FP8",
optimization_level=2 # This is the default
)
-O3: Full Optimization¶
Still in development. Added infrastructure to prevent changing API in future release. Currently behaves the same O2.
Troubleshooting¶
Common Issues¶
- Startup Time Too Long: Use
-O0or-O1for faster startup - Compilation Errors: Use
debug_dump_pathfor additional debugging information - Performance Issues: Ensure using
-O2for production