vLLM Forked Inference Server with Intel Gaudi

Note

The v1.22.2 release introduces the production-ready vLLM Hardware Plugin for Intel® Gaudi®, a community-driven integration layer based on the vLLM v1 architecture. The plugin is an alternative to the vLLM fork described in this documentation, which remains functional for legacy use cases until its planned deprecation in v1.24.0. We strongly encourage all fork users to begin planning their migration to the plugin. For more information, see the vLLM Hardware Plugin for Intel Gaudi documentation.

The following sections provide instructions for setting up and using the vLLM fork with the Intel® Gaudi® AI accelerator. They include a quick start guide, troubleshooting tips, performance tuning guidelines, warmup time optimization strategies, instructions for enabling FP8 calibration and inference, deploying vLLM containers, and profiling methods. Answers to common questions are also provided to assist with getting started.
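As a quick orientation before the detailed sections, the following minimal sketch shows offline inference through vLLM's standard Python API, which the Gaudi fork exposes unchanged. It assumes the fork is already installed on a machine with Gaudi accelerators; the model name is an illustrative placeholder, not a requirement of the fork.

```python
# Minimal offline-inference sketch using vLLM's standard Python API.
# Assumes the Gaudi vLLM fork is installed on a Gaudi machine; the
# model name below is an example placeholder.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Intel Gaudi accelerators are designed for",
]

# Deterministic sampling with a short generation budget.
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

# The fork selects the Gaudi (HPU) backend when run on Gaudi hardware.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same workload can instead be served through the fork's OpenAI-compatible server entry point, which the quick start guide below covers in detail.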