Inference Using DeepSpeed¶

The purpose of this document is to guide Data Scientists to run inference on pre-trained PyTorch models using DeepSpeed with Intel® Gaudi® AI accelerator.

DeepSpeed Validated Configurations¶

The following configurations have been validated to be fully functioning for DeepSpeed inference on Gaudi:

Configuration	Description	Example
Inference on multi-server	Run the model on two or more independent servers.	N/A
Tensor Parallel / Auto Tensor Parallel	Splits the model tensors into chunks so that each tensor resides on its designated HPU. DeepSpeed supports Auto tensor parallel enabled when using Transformer models.	LLaMA
BF16 precision	Reduces model memory consumption and improves performance.	LLaMA
Weights sharding using meta device	Avoids loading all weights to RAM. Instead, it loads the weights of each device to RAM separately and copies to the device to be able to fit into host memory.	LLaMA
CUDA graph (with HPU Graph implementation)	DeepSpeed provides a flag for capturing the CUDA-Graph of the inference ops. Using the graph replay, the graphs run faster To enable this flag, see Using HPU Graphs.	LLaMA
`torch.compile`	Wraps parts of a model into a graph for improved performance. Model parts are compiled once at the start, allowing the compiled part to be called throughout execution. For further details, refer to Using torch.compile section.	TBD

Note

DeepSpeed’s multi-server training uses pdsh for invoking the processes on remote hosts. Make sure it is installed on your machine before using it.
Not all models support Tensor Parallel/Auto Tensor Parallel (auto TP). For example, auto TP is not supported in tiiuae / falcon-7b with 71 num_attention_heads since that cannot be divided across the available 8 ranks.

Installing DeepSpeed Library¶

Intel Gaudi provides a DeepSpeed fork which includes changes to add support for the Intel Gaudi software. To use DeepSpeed with Gaudi, you must install Intel Gaudi’s DeepSpeed fork. Intel Gaudi’s DeepSpeed fork is based on DeepSpeed v0.14.4:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0

Integrating DeepSpeed with Gaudi¶

Prepare your PyTorch model to run on Gaudi by following the steps detailed in the PyTorch Model Porting section. If you have an existing training script that runs on Gaudi, migrating your model is not required.

Follow DeepSpeed’s instructions at https://www.deepspeed.ai/tutorials/inference-tutorial/ with the below modification:

Pass an additional parameter args.device = 'hpu' to deepspeed.init_inference(). For example:

world_size = int(os.getenv('WORLD_SIZE', '0'))
parser = argparse.ArgumentParser()
args = parser.parse_args(args='')
args.device = "hpu"
model = deepspeed.init_inference(model,
                                mp_size=world_size,
                                dtype=dtype,
                                args=args,
                                **kwargs)

Run the DeepSpeed runner:

deepspeed --num_gpus <amount of gaudi cards> model.py <model args>

Note

Multi-server is also supported. Refer to DeepSpeed documentation for more details.
Hugging Face models optimized with the Optimum for Intel Gaudi library also support DeepSpeed inference with minor modifications. Refer to the list of validated models for more information.

Using HPU Graphs¶

Intel Gaudi’s DeepSpeed fork supports HPU Graphs natively. It is enabled by setting enable_cuda_graph=True when the device is set to 'hpu'. To enable HPU Graphs:

Set the following environment variable: export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
Pass the following parameter to deepspeed initialization: deepspeed.init_inference(enable_cuda_graph=True,...)

For more details on HPU Graphs, refer to Run Inference Using HPU Graphs.

Using `torch.compile`¶

When compiling your model with DeepSpeed, call the dedicated compile() API of the DeepSpeedEngine (object returned by deepspeed.initialize()):

def compile(self,
         backend=get_accelerator().get_compile_backend(),
         compile_kwargs={} -> None:

Note

Calling the DeepSpeedEngine compile() with the default arguments is sufficient. However, specifying a non-default backend or passing other compile_kwargs to the torch.compile API is also supported.
Using HPU Graphs with torch.compile is not supported.

Workaround¶

When running DeepSpeed inference with keep_module_on_host = True flag, make sure to set DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API=1. This prevents OOM issues on the host when loading the model’s parameters.

Reference Model¶

For a full example of a DeepSpeed inference model with Gaudi, refer to LLaMA.

Gaudi Documentation 1.21.1 documentation

Inference Using DeepSpeed

On this Page

Inference Using DeepSpeed¶

DeepSpeed Validated Configurations¶

Installing DeepSpeed Library¶

Integrating DeepSpeed with Gaudi¶

Using HPU Graphs¶

Using `torch.compile`¶

Workaround¶

Reference Model¶

Gaudi Documentation 1.21.1 documentation

Inference Using DeepSpeed

On this Page

Inference Using DeepSpeed¶

DeepSpeed Validated Configurations¶

Installing DeepSpeed Library¶

Integrating DeepSpeed with Gaudi¶

Using HPU Graphs¶

Using torch.compile¶

Workaround¶

Reference Model¶

Using `torch.compile`¶