vLLM Forked Inference Server with Intel Gaudi

Note

The v1.22.2 release introduces the production-ready vLLM Hardware Plugin for Intel® Gaudi®, a community-driven integration layer based on the vLLM v1 architecture. The plugin is an alternative to the vLLM fork described in this documentation, which remains functional for legacy use cases until its planned deprecation in v1.24.0. We strongly encourage all fork users to begin planning their migration to the plugin. For more information, see the vLLM Hardware Plugin for Intel Gaudi documentation.

The following sections provide instructions for setting up and using the vLLM fork with the Intel® Gaudi® AI accelerator. They include a quick start guide, troubleshooting tips, performance tuning guidelines, warmup time optimization strategies, instructions for enabling FP8 calibration and inference, deploying vLLM containers, and profiling methods. Answers to common questions are also provided to assist with getting started.
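As a quick orientation before the detailed sections, the following minimal sketch shows offline inference through vLLM's standard Python API, which the Gaudi fork exposes unchanged. It assumes the fork is already installed on a machine with Gaudi accelerators; the model name is an illustrative placeholder, not a requirement of the fork.

```python
# Minimal offline-inference sketch using vLLM's standard Python API.
# Assumes the Gaudi vLLM fork is installed on a Gaudi machine; the
# model name below is an example placeholder.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Intel Gaudi accelerators are designed for",
]

# Deterministic sampling with a short generation budget.
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

# The fork selects the Gaudi (HPU) backend when run on Gaudi hardware.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same workload can instead be served through the fork's OpenAI-compatible server entry point, which the quick start guide below covers in detail.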