Inference Using DeepSpeed

This document guides data scientists through running inference on pre-trained PyTorch models using the DeepSpeed interface integrated with the Intel® Gaudi® AI accelerator.

DeepSpeed Validated Configurations

The following DeepSpeed configurations have been validated as fully functional on HPU:

Configuration | Description | Example

Inference on multi-node | Run the model on two or more independent servers. | Bloom

Tensor Parallel / Auto Tensor Parallel | Splits the model tensors into chunks so that each chunk resides on its designated HPU. DeepSpeed supports auto tensor parallel when using transformer models. | Bloom

BF16 precision | Reduces model memory consumption and improves performance. | Bloom

Weights sharding using meta device | Avoids loading all weights into host RAM at once. Instead, the weights for each device are loaded into RAM separately and then copied to that device, so the model fits into host memory. | Bloom

CUDA graph (with HPU Graph implementation) | DeepSpeed provides a flag for capturing the CUDA graph of the inference ops. Replaying the captured graph makes inference run faster. To enable this flag, see Using HPU Graphs. | Bloom
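To make the memory effect of BF16 precision and tensor parallelism concrete, the following is a rough back-of-the-envelope sketch. It counts weight storage only (no activations or cache), assumes the tensors split evenly across devices, and uses a 176-billion-parameter model size purely as an illustrative assumption. The helper name weight_memory_gb is hypothetical and not part of DeepSpeed.

```python
# Rough estimate of weight memory only; activations and caches are ignored.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_memory_gb(num_params: int, dtype: str, tp_degree: int = 1) -> float:
    """Approximate per-device weight memory in GB (1 GB = 1e9 bytes).

    Assumes every tensor is split evenly across `tp_degree` devices,
    which is a simplification of real tensor-parallel sharding.
    """
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / tp_degree / 1e9


NUM_PARAMS = 176_000_000_000  # assumed model size, for illustration only

print(weight_memory_gb(NUM_PARAMS, "fp32"))     # one full FP32 copy
print(weight_memory_gb(NUM_PARAMS, "bf16"))     # one full BF16 copy
print(weight_memory_gb(NUM_PARAMS, "bf16", 8))  # BF16 shard per device, 8-way split
```

Under these assumptions, BF16 halves the weight footprint relative to FP32, and an 8-way split divides it by a further factor of eight per device.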

Note

  • DeepSpeed’s multi-node launcher uses pdsh to invoke the processes on remote hosts. Make sure pdsh is installed on your machine before running on multiple nodes.

  • Not all models support Tensor Parallel/Auto Tensor Parallel (auto TP). For example, auto TP is not supported for tiiuae/falcon-7b, which has 71 attention heads (num_attention_heads), because 71 cannot be divided evenly across the available 8 ranks.
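The divisibility constraint in the note above can be checked before launching. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and DeepSpeed's own auto TP logic may apply additional checks beyond the attention head count.

```python
from typing import List


def heads_divisible(num_attention_heads: int, world_size: int) -> bool:
    """Return True if the attention heads split evenly across the ranks."""
    return num_attention_heads % world_size == 0


def valid_tp_degrees(num_attention_heads: int, max_ranks: int) -> List[int]:
    """List every rank count up to max_ranks that divides the head count."""
    return [r for r in range(1, max_ranks + 1)
            if heads_divisible(num_attention_heads, r)]


# 71 heads across 8 ranks, as in the falcon-7b example from the note.
print(heads_divisible(71, 8))
print(valid_tp_degrees(71, 8))
```

Because 71 is prime, the only rank count up to 8 that divides it evenly is 1, which is why this model cannot be sharded with auto TP on 8 devices.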

Installing DeepSpeed Library

The Intel Gaudi GitHub hosts a fork of the DeepSpeed library that adds support for Intel Gaudi software. To use DeepSpeed with Gaudi, you must install this fork directly from its repository:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.1

DeepSpeed inference was tested on this fork, which is based on DeepSpeed v0.12.4.
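After installation, it can be useful to confirm which DeepSpeed build is active in the environment. The sketch below uses only the Python standard library; the helper name installed_version is hypothetical, and the exact version string reported by the fork is not stated here, so compare the output against the base version mentioned above rather than expecting a specific value.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("deepspeed")
if version is None:
    print("deepspeed is not installed in this environment")
else:
    print(f"deepspeed {version} is installed")
```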

Integrating DeepSpeed with Gaudi

Follow DeepSpeed’s instructions at https://www.deepspeed.ai/tutorials/inference-tutorial/ with the following modifications:

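  • Pass an additional parameter, args.device = "hpu", to deepspeed.init_inference(). For example:

    import argparse
    import os

    import deepspeed

    # WORLD_SIZE is set by the DeepSpeed runner; fall back to a single device.
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    # init_inference() receives the target device through args.device.
    parser = argparse.ArgumentParser()
    args = parser.parse_args(args='')
    args.device = "hpu"
    # model, dtype, and kwargs are defined earlier in your script.
    model = deepspeed.init_inference(model,
                                     mp_size=world_size,
                                     dtype=dtype,
                                     args=args,
                                     **kwargs)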
    
  • Run the DeepSpeed runner:

    deepspeed --num_gpus <number of Gaudi cards> model.py <model args>
    

Multi-node is also supported. Refer to DeepSpeed documentation for more details.
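As a minimal sketch of a multi-node setup, the DeepSpeed runner accepts a hostfile in which each line names a host and its number of device slots. The helper below only renders that file format; the host names and slot counts are hypothetical placeholders, and a Gaudi cluster may need further environment setup that is not shown here.

```python
from typing import Dict


def render_hostfile(hosts: Dict[str, int]) -> str:
    """Render a DeepSpeed hostfile: one '<hostname> slots=<n>' line per host."""
    return "".join(f"{name} slots={slots}\n" for name, slots in hosts.items())


# Hypothetical host names; replace with hosts reachable over passwordless SSH.
hostfile_text = render_hostfile({"gaudi-node-1": 8, "gaudi-node-2": 8})
print(hostfile_text, end="")

# The file would then be passed to the runner, for example:
#   deepspeed --hostfile hostfile model.py <model args>
```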

Using HPU Graphs

The Intel Gaudi fork of DeepSpeed supports HPU Graphs natively. They are enabled by setting enable_cuda_graph=True when the device is set to "hpu".

To enable HPU Graphs:

  • Set the following environment variable: export PT_HPU_ENABLE_LAZY_COLLECTIVES=true

  • Pass the following parameter when initializing DeepSpeed: deepspeed.init_inference(enable_cuda_graph=True,...)

For more details on HPU Graphs, refer to Run Inference Using HPU Graphs.
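The two steps for enabling HPU Graphs can be combined in a script roughly as follows. This is a sketch under stated assumptions: it assumes that setting the environment variable in-process, before the Gaudi and DeepSpeed libraries initialize, has the same effect as exporting it in the shell (exporting it in the shell, as described above, is the safer route), and the helper names build_inference_kwargs and init_with_hpu_graphs are hypothetical.

```python
import argparse
import os

# Assumption: this must run before the Gaudi/DeepSpeed libraries initialize.
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"


def build_inference_kwargs(world_size: int, dtype) -> dict:
    """Collect the init_inference() keyword arguments, with graphs enabled."""
    return {
        "mp_size": world_size,
        "dtype": dtype,
        "enable_cuda_graph": True,  # uses HPU Graphs when the device is "hpu"
    }


def init_with_hpu_graphs(model, world_size: int, dtype):
    """Wrap a pre-trained model for inference on HPU with graphs enabled."""
    import deepspeed  # imported lazily; requires the Intel Gaudi fork

    args = argparse.Namespace(device="hpu")
    return deepspeed.init_inference(
        model, args=args, **build_inference_kwargs(world_size, dtype)
    )
```

A script would call init_with_hpu_graphs(model, world_size, dtype) with its own model, rank count, and data type in place of a plain deepspeed.init_inference() call.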

Reference Model

For a full example of a DeepSpeed inference model with Gaudi, refer to BLOOM Model Reference.