PyTorch Support Matrix

The following table shows the functionalities supported by the Intel® Gaudi® PyTorch integration. For more details on Intel Gaudi’s PyTorch integration and the supported execution modes, see PyTorch Gaudi Theory of Operations.

Note

Support for Eager mode and Eager mode with torch.compile is in early-stage development. Lazy mode is the default mode.
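As a minimal sketch of the two execution modes (assuming an Intel Gaudi PyTorch build that provides the habana_frameworks package), the snippet below runs a forward pass on the HPU device; PT_HPU_LAZY_MODE is read at process start, so it is set before the Habana modules are imported.

```python
import os
os.environ.setdefault("PT_HPU_LAZY_MODE", "1")  # 1 = Lazy (default), 0 = Eager / torch.compile

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

device = torch.device("hpu")
model = torch.nn.Linear(128, 64).to(device)
x = torch.randn(8, 128, device=device)

out = model(x)
if os.environ["PT_HPU_LAZY_MODE"] == "1":
    htcore.mark_step()  # in Lazy mode, flush the accumulated graph for execution
print(out.shape)
```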

Workload Type

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| Training | Yes | Yes | |
| Inference | Yes | Yes | |

Device Name in PyTorch

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| HPU | Yes | Yes | Native Gaudi device name |
| CPU | Yes | Yes | |
| CUDA | No | No | Automatically converted to HPU using the GPU Migration Toolkit |

Programming Language

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| C++ | Yes* | Yes | *Limited - no support for graphs |
| Python | Yes | Yes | |

Modes

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| Eager | Yes | No | |
| Graph | Yes | Yes | |

Graph Solutions

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| torch.compile(backend="eager") | Yes | No | |
| torch.compile(backend="inductor") | No | No | Automatically converted using the GPU Migration Toolkit |
| torch.compile(backend="hpu_backend") | Yes | No | |
| Lazy graph | No | Yes | |
| FX tracing | Yes | Yes | |
| HPU Graphs | No | Yes | |
| torch.jit / TorchScript | No | No | |
| ONNX | No | No | |
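The torch.compile(backend="hpu_backend") row above is the torch.compile path on Gaudi. A hedged sketch follows; the backend string comes from the table, importing habana_frameworks.torch.core is assumed to register it, and PT_HPU_LAZY_MODE=0 must be set in the environment.

```python
import torch
import habana_frameworks.torch.core  # assumed to register the "hpu_backend" compile backend

def mlp(x, w1, w2):
    return torch.nn.functional.gelu(x @ w1) @ w2

compiled_mlp = torch.compile(mlp, backend="hpu_backend")  # requires PT_HPU_LAZY_MODE=0

x = torch.randn(32, 256, device="hpu")
w1 = torch.randn(256, 1024, device="hpu")
w2 = torch.randn(1024, 256, device="hpu")
y = compiled_mlp(x, w1, w2)  # first call traces and compiles; later calls reuse the graph
```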

Dynamic Shapes

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| Eager | Yes | No | |
| torch.compile | Yes | No | |
| Lazy | No | Yes | |
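For the torch.compile row above, a sketch of the dynamic-shape path using the upstream dynamic=True flag; the hpu_backend name and the habana_frameworks import are assumptions carried over from the previous example.

```python
import torch
import habana_frameworks.torch.core  # assumed HPU integration import

@torch.compile(backend="hpu_backend", dynamic=True)  # ask Dynamo to keep shapes symbolic
def scale_and_sum(x):
    return (x * 2.0).sum(dim=-1)

# Varying batch sizes exercise the dynamic-shape path instead of recompiling per shape.
for n in (16, 33, 57):
    print(scale_and_sum(torch.randn(n, 64, device="hpu")).shape)
```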

Export

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| torch.export | Yes | No | |
| torch.jit / TorchScript | No | No | |
| ONNX | No | No | |
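A hedged sketch of the torch.export row, capturing a small module that lives on the HPU; the module and shapes here are illustrative only.

```python
import torch
from torch.export import export
import habana_frameworks.torch.core  # assumed HPU integration import

class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = SmallNet().to("hpu")
example_inputs = (torch.randn(2, 16, device="hpu"),)
exported = export(model, example_inputs)  # produces an ExportedProgram
print(exported)                           # inspect the captured graph and signature
```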

Data Types

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| Int8 | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Int16 | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Int32 | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Int64 | Yes | Yes | |
| Float8 | Yes | Yes | Supported on Gaudi 2 only |
| Float16 | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Float32 | Yes | Yes | |
| BFloat16 | Yes | Yes | |
| Boolean | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Float64 | No | No | |
| Complex32 | No | No | |
| Complex64 | No | No | |
| QInt8 | No | No | |
| QInt16 | No | No | |
| QInt32 | No | No | |
| QInt64 | No | No | |

Tensor Types

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| Native PyTorch tensor support | Yes | No | |
| Dense tensors | Yes | Yes | |
| Views | Yes | Yes | |
| Channels last | Yes | Yes | |
| Strided tensors | Yes | Yes | Output strides can differ from those on CUDA/CPU |
| User tensor subclass | Yes | Yes | |
| Sparse tensors | No | No | |
| Masked tensors | No | No | |
| Nested tensors | No | No | |
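To illustrate the views, channels-last, and strided-tensor rows (and the comment that output strides can differ from CUDA/CPU), a small sketch assuming the habana_frameworks package is installed:

```python
import torch
import habana_frameworks.torch.core  # assumed HPU integration import

x = torch.randn(8, 3, 224, 224, device="hpu")

nhwc = x.to(memory_format=torch.channels_last)  # channels-last layout
flat = x.view(8, 3, -1)                         # views are supported

# Strides reported for HPU tensors may differ from what CPU/CUDA would report,
# so portable code should not hard-code stride values.
print(nhwc.stride(), flat.shape)
```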

Mixed Precision

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| torch.autocast | Yes | Yes | |
| Intel Gaudi Transformer Engine (FP8) | Yes | Yes | |
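The torch.autocast row maps to the standard autocast context manager with device_type="hpu". A minimal sketch; BF16 is used here as the common autocast dtype on Gaudi.

```python
import torch
import habana_frameworks.torch.core  # assumed HPU integration import

model = torch.nn.Linear(512, 512).to("hpu")
x = torch.randn(16, 512, device="hpu")

# Ops on the autocast list run in BF16; the rest stay in FP32.
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # expected: torch.bfloat16
```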

Distributed

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| DeepSpeed | Yes* | Yes | See our DeepSpeed documentation for more details. *DeepSpeed support with torch.compile is still a work in progress by DeepSpeed. |
| PyTorch DDP | Yes | Yes | |
| PyTorch FSDP | Yes | No | See Using Fully Sharded Data Parallel (FSDP) with Intel Gaudi |
| PyTorch DTensor | Yes | No | See Using DistributedTensor with Intel Gaudi |
| PyTorch Tensor Parallel | Yes | No | See Using DistributedTensor with Intel Gaudi |
| PyTorch Pipeline Parallel | No | No | |
| PyTorch Distributed Elastic | No | No | |

Distributed Backend

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| HCCL | Yes | Yes | Gaudi’s equivalent of NCCL |
| MPI | Yes | Yes | |
| Gloo | No | No | |
| NCCL | No | No | NCCL calls are automatically converted to HCCL using the GPU Migration Toolkit |
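A hedged sketch of PyTorch DDP over the HCCL backend; the habana_frameworks module paths follow the Intel Gaudi distributed documentation and should be treated as assumptions. Launch with torchrun (or mpirun) so that the usual rank and world-size environment variables are populated.

```python
import torch
import torch.distributed as dist
import habana_frameworks.torch.core              # HPU integration (assumed)
import habana_frameworks.torch.distributed.hccl  # assumed to register the "hccl" backend

dist.init_process_group(backend="hccl")          # HCCL in place of NCCL
device = torch.device("hpu")

model = torch.nn.Linear(1024, 1024).to(device)
ddp_model = torch.nn.parallel.DistributedDataParallel(model)

x = torch.randn(32, 1024, device=device)
ddp_model(x).sum().backward()                    # gradients are all-reduced via HCCL
dist.destroy_process_group()
```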

Device Management

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| Single device in a single process | Yes | Yes | |
| Multiple devices in a single process | No | No | |
| Sharing one device between multiple processes | No | No | |

Custom Ops

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| Writing custom ops in TPC-C | Yes | Yes | |
| Writing custom ops in CUDA | No | No | |
| Writing custom ops in Triton | No | No | |

Quantization

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| FP8 quantization | Yes | Yes | |
| Int8/16/32 quantization | No | No | |
| Int4 quantization | No | No | |

Data Loader

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| Native PyTorch | Yes | Yes | |
| Gaudi Media Loader | Yes | Yes | Exposes Gaudi’s HW acceleration |
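The Native PyTorch row is the standard torch.utils.data.DataLoader feeding HPU tensors; the Gaudi Media Loader has its own API and is not shown here. A minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import habana_frameworks.torch.core  # assumed HPU integration import

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)

device = torch.device("hpu")
for features, labels in loader:
    features = features.to(device)  # host-to-HPU copy per batch
    labels = labels.to(device)
    break
```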

Serving Solution

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| TGI | Yes | Yes | |
| Triton | Yes | Yes | See Triton Inference Server with Gaudi |
| TorchServe | Yes | No | See TorchServe Inference Server with Gaudi |
| vLLM | Yes | Yes | |

Operators

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| ATen ops | Yes | Yes | See PyTorch Operators |
| Fused operators | Yes | Yes | See Fused Optimizers and Custom Ops for Intel Gaudi |

Other

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| torch.profiler | Yes | Yes | |
| TensorBoard | Yes | Yes | |
| Checkpoints | Yes | Yes | |
| Weights Sharing | Yes | Yes* | *Limited support in Lazy mode. See Weight Sharing. |
| HPU stream and event support | Yes | Yes | |
| Native PyTorch SDPA | Yes | Yes | Limited support. See Using Fused Scaled Dot Product Attention (FusedSDPA). |
| Gaudi Optimized Flash Attention | Yes | Yes | Flash attention algorithm plus additional Intel Gaudi optimizations. See Using Fused Scaled Dot Product Attention (FusedSDPA). |
| FFT | No | No | |
| torch.cond | Yes | No | |
| torch.signal | No | No | |
| torch.special | No | No | |
| torch.func | No | No | |
| torch.hub | No | No | |
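For the Native PyTorch SDPA row, the upstream torch.nn.functional.scaled_dot_product_attention API can be called on HPU tensors (support is limited, as noted; the FusedSDPA guide referenced above covers the Gaudi-optimized flash-attention path). A hedged sketch:

```python
import torch
import torch.nn.functional as F
import habana_frameworks.torch.core  # assumed HPU integration import

# (batch, heads, sequence, head_dim) in BF16 on the HPU.
q = torch.randn(2, 8, 128, 64, device="hpu", dtype=torch.bfloat16)
k = torch.randn(2, 8, 128, 64, device="hpu", dtype=torch.bfloat16)
v = torch.randn(2, 8, 128, 64, device="hpu", dtype=torch.bfloat16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```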

PyTorch Libraries

| Item | Eager + torch.compile (PT_HPU_LAZY_MODE=0) | Lazy Mode (PT_HPU_LAZY_MODE=1) | Comments |
|---|---|---|---|
| TorchVision | Yes | Yes | |
| TorchAudio | Yes | Yes | |
| TorchText | Yes | Yes | |
| TorchData | Yes | Yes | |
| TorchRec | No | No | |
| TorchArrow | No | No | |
| TorchX | No | No | |
| ExecuTorch | No | No | |