PyTorch Support Matrix
The following tables, grouped by category, show the functionality supported by the Intel® Gaudi® PyTorch integration. For more details on Intel Gaudi's PyTorch integration and the supported execution modes, see PyTorch Gaudi Theory of Operations.
Note
Support for Eager mode and for Eager mode with torch.compile (PT_HPU_LAZY_MODE=0) is in early-stage development. Lazy mode (PT_HPU_LAZY_MODE=1) is the default mode.
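Before the matrix itself, a minimal sketch of selecting the execution mode via the PT_HPU_LAZY_MODE variable referenced above, assuming the Intel Gaudi PyTorch bridge (habana_frameworks.torch) is installed; verify the import path and `mark_step()` call against your release:

```python
import os

# Select the execution mode before importing the Gaudi bridge:
# "1" = Lazy mode (the default), "0" = Eager / torch.compile.
os.environ.setdefault("PT_HPU_LAZY_MODE", "1")

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

x = torch.randn(8, 8).to("hpu")
y = x @ x
htcore.mark_step()  # Lazy mode only: flush and execute the accumulated graph
print(y.cpu())
```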
**Workload Type**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| Training | Yes | Yes | |
| Inference | Yes | Yes | |
**Device Name in PyTorch**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| HPU | Yes | Yes | Native Gaudi device name |
| CPU | Yes | Yes | |
| CUDA | No | No | Automatically converted to HPU using the GPU Migration Toolkit |
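As a rough sketch of the CUDA-to-HPU conversion noted above, the GPU Migration Toolkit is typically enabled with a single import; treat the exact module path below as an assumption to check against your installed release:

```python
# Enabling the GPU Migration Toolkit redirects "cuda" calls to "hpu",
# so unmodified CUDA scripts can run on Gaudi.
import habana_frameworks.torch.gpu_migration  # noqa: F401
import torch

t = torch.ones(4, device="cuda")  # transparently placed on the HPU device
print(t.device)
```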
**Programming Language**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| C++ | Yes* | Yes | *Limited: no support for graphs |
| Python | Yes | Yes | |

**Modes**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| Eager | Yes | No | |
| Graph | Yes | Yes | |
**Graph Solutions**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| torch.compile(backend="eager") | Yes | No | |
| torch.compile(backend="inductor") | No | No | Automatically converted using the GPU Migration Toolkit |
| torch.compile(backend="hpu_backend") | Yes | No | |
| Lazy graph | No | Yes | |
| FX tracing | Yes | Yes | |
| HPU Graphs | No | Yes | |
| torch.jit / TorchScript | No | No | |
| ONNX | No | No | |
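A short, hedged example of the supported `hpu_backend` path for `torch.compile`; run it with PT_HPU_LAZY_MODE=0, since the matrix marks torch.compile as unsupported in Lazy mode, and assume the Gaudi bridge import registers the backend:

```python
import torch
import habana_frameworks.torch.core  # noqa: F401  # registers "hpu_backend"

def fn(a, b):
    return torch.nn.functional.relu(a + b)

# "hpu_backend" is the Gaudi compile backend listed in the matrix;
# "inductor" is not supported on HPU.
compiled = torch.compile(fn, backend="hpu_backend")
out = compiled(torch.randn(16, device="hpu"), torch.randn(16, device="hpu"))
```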
**Dynamic Shapes**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| Eager | Yes | No | |
| torch.compile | Yes | No | |
| Lazy | No | Yes | |

**Export**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| torch.export | Yes | No | |
| torch.jit / TorchScript | No | No | |
| ONNX | No | No | |
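For the `torch.export` row, a minimal sketch, again under PT_HPU_LAZY_MODE=0 since export is only supported with Eager/torch.compile:

```python
import torch
from torch.export import export
import habana_frameworks.torch.core  # noqa: F401

class Net(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x + 1)

# torch.export captures a standalone graph of the module; TorchScript and
# ONNX export are not supported, per the rows above.
ep = export(Net().to("hpu"), (torch.randn(8, device="hpu"),))
print(ep)
```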
**Data Types**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| Int8 | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Int16 | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Int32 | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Int64 | Yes | Yes | |
| Float8 | Yes | Yes | Supported on Gaudi 2 only |
| Float16 | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Float32 | Yes | Yes | |
| BFloat16 | Yes | Yes | |
| Boolean | Yes | Yes | Limited ops support as listed in PyTorch Operators |
| Float64 | No | No | |
| Complex32 | No | No | |
| Complex64 | No | No | |
| QInt8 | No | No | |
| QInt16 | No | No | |
| QInt32 | No | No | |
| QInt64 | No | No | |
**Tensor Types**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| Native PyTorch tensor support | Yes | No | |
| Dense tensors | Yes | Yes | |
| Views | Yes | Yes | |
| Channels last | Yes | Yes | |
| Strided tensors | Yes | Yes | Output strides can differ from CUDA/CPU |
| User tensor subclass | Yes | Yes | |
| Sparse tensors | No | No | |
| Masked tensors | No | No | |
| Nested tensors | No | No | |
**Mixed Precision**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| torch.autocast | Yes | Yes | |
| Intel Gaudi Transformer Engine (FP8) | Yes | Yes | |
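A brief sketch of `torch.autocast` on the HPU device using BFloat16, which is natively supported per the data-type rows above; FP8 goes through the Intel Gaudi Transformer Engine instead:

```python
import torch
import habana_frameworks.torch.core  # noqa: F401

model = torch.nn.Linear(32, 32).to("hpu")
x = torch.randn(4, 32, device="hpu")

# Ops inside the autocast region run in BFloat16 where it is safe to do so.
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    out = model(x)
```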
**Distributed**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| DeepSpeed | Yes* | Yes | See our DeepSpeed documentation for more details. *DeepSpeed support with torch.compile is still a work in progress. |
| PyTorch DDP | Yes | Yes | |
| PyTorch FSDP | Yes | No | See Using Fully Sharded Data Parallel (FSDP) with Intel Gaudi |
| PyTorch DTensor | Yes | No | |
| PyTorch Tensor Parallel | Yes | No | |
| PyTorch Pipeline Parallel | No | No | |
| PyTorch Distributed Elastic | No | No | |
**Distributed Backend**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| HCCL | Yes | Yes | Gaudi's equivalent of NCCL |
| MPI | Yes | Yes | |
| Gloo | No | No | |
| NCCL | No | No | NCCL calls are automatically converted to HCCL using the GPU Migration Toolkit |
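A minimal HCCL initialization sketch, assuming a torchrun-style launcher provides the rendezvous environment variables; the `habana_frameworks.torch.distributed.hccl` import follows the Gaudi documentation, but confirm it against your release:

```python
import torch
import torch.distributed as dist

# Registers the "hccl" process-group backend (Gaudi's NCCL equivalent).
import habana_frameworks.torch.distributed.hccl  # noqa: F401

dist.init_process_group(backend="hccl")

t = torch.ones(1, device="hpu") * dist.get_rank()
dist.all_reduce(t)  # sums the tensor across all ranks over HCCL
```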
**Device Management**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| Single device in a single process | Yes | Yes | |
| Multiple devices in a single process | No | No | |
| Sharing one device between multiple processes | No | No | |

**Custom Ops**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| Writing custom ops in TPC-C | Yes | Yes | |
| Writing custom ops in CUDA | No | No | |
| Writing custom ops in Triton | No | No | |

**Quantization**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| FP8 quantization | Yes | Yes | |
| Int8/16/32 quantization | No | No | |
| Int4 quantization | No | No | |

**Data Loader**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| Native PyTorch | Yes | Yes | |
| Gaudi Media Loader | Yes | Yes | Exposes Gaudi's hardware acceleration |

**Serving Solution**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| TGI | Yes | Yes | |
| Triton | Yes | Yes | |
| TorchServe | Yes | No | |
| vLLM | Yes | Yes | |

**Operators**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| ATen ops | Yes | Yes | |
| Fused operators | Yes | Yes | |
**Other**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| torch.profiler | Yes | Yes | |
| TensorBoard | Yes | Yes | |
| Checkpoints | Yes | Yes | |
| Weights Sharing | Yes | Yes* | *Limited support in Lazy mode. See Weight Sharing. |
| HPU stream and event support | Yes | Yes | |
| Native PyTorch SDPA | Yes | Yes | Limited support. See Using Fused Scaled Dot Product Attention (FusedSDPA). |
| Gaudi Optimized Flash Attention | Yes | Yes | Flash attention algorithm plus additional Intel Gaudi optimizations. See Using Fused Scaled Dot Product Attention (FusedSDPA). |
| FFT | No | No | |
| torch.cond | Yes | No | |
| torch.signal | No | No | |
| torch.special | No | No | |
| torch.func | No | No | |
| torch.hub | No | No | |
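Finally, a hedged `torch.profiler` sketch; `ProfilerActivity.HPU` is assumed to be available in the Gaudi-enabled PyTorch build, and the resulting trace can be viewed in TensorBoard (also listed as supported above):

```python
import torch
from torch.profiler import profile, ProfilerActivity
import habana_frameworks.torch.core as htcore

x = torch.randn(64, 64, device="hpu")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.HPU]) as prof:
    y = x @ x
    htcore.mark_step()  # Lazy mode: ensure the captured work actually executes

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```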
**PyTorch Libraries**

| Item | Eager + torch.compile | Lazy Mode | Comments |
|---|---|---|---|
| TorchVision | Yes | Yes | |
| TorchAudio | Yes | Yes | |
| TorchText | Yes | Yes | |
| TorchData | Yes | Yes | |
| TorchRec | No | No | |
| TorchArrow | No | No | |
| TorchX | No | No | |
| ExecuTorch | No | No | |