Release Notes¶
Note
For previous versions of the Release Notes, please refer to Previous Release Notes.
New Features - v1.22.1¶
The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.22.1-6. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
The Intel Gaudi Software Suite version v1.22.1 may not include all the latest functional and security updates. A new version, v1.23.0, is targeted to be released in October 2025 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.
Validated Mellanox OFED and DOCA OFED on the following operating systems. See Mellanox OFED & DOCA OFED Support Matrix:
MOFED 24.10-3.2.5.0-LTS on Ubuntu 24.04.2 LTS
MOFED 24.10-3.2.5.0-LTS on RHEL 9.4
DOCA OFED 3.0.0-058000_25.04 on Ubuntu 24.04.2 LTS and RHEL 9.4
Known Issues and Limitations - v1.22.1¶
When running any command with OpenMPI v4.1.6 inside a Docker container on a RHEL 9.4 host, you may experience a delay of about 3 minutes before receiving a response from a remote host. For example, the command mpirun --allow-run-as-root -N 1 -H HOST1,HOST2 date returns a delayed response.
New Features - v1.22.0¶
The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.22.0-740. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
The Intel Gaudi Software Suite version v1.22.0 may not include all the latest functional and security updates. A new version, v1.23.0, is targeted to be released in October 2025 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.
General¶
Added support for the following operating systems. For further details, see the Support Matrix:
OpenCloudOS 9.2 on Gaudi 3 and Gaudi 2
RHEL 9.6 on Gaudi 3 and Gaudi 2
Navix 9.4 on Gaudi 2 only
Starting from v1.22.0, Docker files for all operating systems, except Ubuntu 24 and Ubuntu 22, will no longer be available for download from the vault. Instead, they can be obtained by building custom Docker images. See the Use Intel Gaudi Containers section.
Firmware¶
For Gaudi 3:
Upgraded SVN version to 3.
Added support for sensor read rate.
Added Auxiliary name PDRs.
vLLM¶
Starting from v1.22.0, an early developer preview of the vllm-gaudi plugin is available at https://github.com/vllm-project/vllm-gaudi. It integrates Intel Gaudi with vLLM for optimized LLM inference. This early-stage plugin is intended for development and is not yet suitable for general use. The plugin will become the default in v1.23.0. The vLLM fork remains functional for legacy use cases until it is deprecated in v1.24.0.
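For illustration, a minimal offline-inference sketch using the vLLM Python API; it assumes the vllm-gaudi plugin (or the HabanaAI vLLM fork) is installed on a machine with an HPU, and the model name is only an example:

    from vllm import LLM, SamplingParams

    # Hypothetical model name; any model supported on Gaudi can be used.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

    for output in llm.generate(["What is the capital of France?"], sampling_params):
        print(output.outputs[0].text)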
Added the vLLM container image. See https://github.com/HabanaAI/vllm-fork/tree/v1.22.0/.cd
Added support for the following features to the Intel Gaudi vLLM fork:
BlockSoftmaxAdjustment
Initial support for BlockSoftmax
Tensor-parallel inference with torchrun
For Gaudi 3 and Gaudi 2, added support for new models in Intel Gaudi vLLM fork:
Llama-4-Scout-17B-16E BF16/FP8 on 4/8 cards
Llama-4-Maverick-17Bx128E BF16/FP8 on 8 cards
Qwen2-72B-Instruct BF16/FP8 on 8 cards
Qwen2.5-72B-Instruct BF16/FP8 on 8 cards
Qwen2.5-VL-7B-Instruct BF16/FP8 on 4 cards
Qwen/Qwen2.5-VL-72B-Instruct BF16/FP8 on 4/8 cards
For Gaudi 3, added support for new models in Intel Gaudi vLLM fork:
Qwen3-32B BF16 on 8 cards
Qwen3-30B-A3B BF16 on 8 cards
The TGI-gaudi fork has been deprecated in favor of the upstreamed version. The last supported release was with Optimum Habana v1.16, tested with v1.20.x. Starting from v1.22.0, please refer to the official TGI repository at: https://github.com/huggingface/text-generation-inference/tree/main/backends/gaudi.
Validated the Intel Gaudi 1.22.0 software release with vLLM v0.9.0.1. See the Support Matrix for a full list of version support.
PyTorch¶
Intel Gaudi offers a wide range of models using Eager mode and torch.compile, which is the default mode. However, any model requiring Lazy mode needs the PT_HPU_LAZY_MODE=1 flag, as it is set to 0 by default. Using torch.compile with Eager mode is recommended, as Eager mode alone can be slower due to its limited optimization of computation graphs (see the example after the list below). Note that Lazy mode will be deprecated in future releases.
Added support for the following:
NF4 data type for inference. See Run Inference Using NF4 section.
QLoRA fine-tuning LLMs. See QLoRA Fine-Tuning on Intel Gaudi.
SGLang inference. See SGLang Inference Server with Intel Gaudi.
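For illustration, a minimal sketch of the mode selection described above. It assumes an HPU-enabled PyTorch installation; the environment variable is read when the Gaudi modules are imported, and hpu_backend is the torch.compile backend name used in the Intel Gaudi PyTorch documentation:

    import os
    # Eager + torch.compile is the default (PT_HPU_LAZY_MODE=0); set "1" only if the model requires Lazy mode.
    os.environ.setdefault("PT_HPU_LAZY_MODE", "0")

    import torch
    import habana_frameworks.torch.core as htcore  # registers the HPU device

    model = torch.nn.Linear(16, 16).to("hpu")
    compiled_model = torch.compile(model, backend="hpu_backend")
    out = compiled_model(torch.randn(4, 16, device="hpu"))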
Deprecated the following MLPerf 4.0 configurations for training:
GPT3
Llama 70B LoRA
Deprecated the following MLPerf 4.0 configurations for inference:
Llama 70B
Stable Diffusion XL
The v1.22.0 release has been validated with the Hugging Face Optimum for Intel Gaudi library and models, version v1.19.0. Future releases of the Optimum for Intel Gaudi library may be validated with this release. See the Support Matrix for a full list of version support.
PyTorch Lightning support has been deprecated. The last supported PyTorch Lightning version was 2.5.1, tested with v1.21.x. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/.
Upgraded to Intel Neural Compressor (INC) v3.5.
Enhancements - v1.22.0¶
Firmware¶
For Gaudi 3:
Applied sensor thresholds during runtime.
Ensured that the meta-state of sensors and effecters is persistent across boot and runtime.
Improved I2C bus handling.
vLLM¶
Improved vLLM Gaudi backend support:
Inference with torch.compile fully supporting FP8 and BF16 precisions.
FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC).
Included automatic prefix caching (APC) for more efficient prefills, configurable via the standard --enable-prefix-caching parameter.
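For example, a hedged sketch of enabling APC through the vLLM Python API; the constructor flag mirrors the --enable-prefix-caching server option, and the model name is illustrative:

    from vllm import LLM, SamplingParams

    # enable_prefix_caching corresponds to the --enable-prefix-caching CLI parameter.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
    outputs = llm.generate(["Summarize the v1.22.0 release notes."], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)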
PyTorch¶
Enabled the PT_HPU_HUGE_PAGES_LIMIT_MB flag. The flag sets the huge page limit (in MB) for the current worker. See Runtime Environment Variables.
The Intel Gaudi PyTorch fork and the public PyTorch now use the same setting of the _GLIBCXX_USE_CXX11_ABI flag, which is set to 1. Any libraries or packages previously compiled with _GLIBCXX_USE_CXX11_ABI=0 must be recompiled with _GLIBCXX_USE_CXX11_ABI=1 to ensure compatibility. See PyTorch Gaudi Theory of Operations.
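For illustration, a small sketch that sets the huge-page limit and checks which ABI the installed PyTorch build was compiled with; the limit value is only an example:

    import os
    # Example value in MB; see Runtime Environment Variables for guidance.
    os.environ["PT_HPU_HUGE_PAGES_LIMIT_MB"] = "2048"

    import torch
    # True indicates the build uses _GLIBCXX_USE_CXX11_ABI=1, matching the Intel Gaudi fork.
    print(torch.compiled_with_cxx11_abi())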
Bug Fixes and Resolved Issues - v1.22.0¶
Firmware¶
For Gaudi 3, fixed the following:
I2C driver semaphore handling.
NIC register access during reset.
I2C controllers’ clock configuration.
PLDM health indication sensor.
For Gaudi 2, fixed the following:
Process name display when running DeepSeek training.
Known Issues and Limitations - v1.22.0¶
General¶
Enabling IOMMU passthrough is required only for Ubuntu 24.04.2/22.04.5 with Linux kernel 6.8. For more details, see Enable IOMMU Passthrough.
Running the functional test in high power mode results in a performance failure with an 8.5% degradation, whereas the functional test in extreme power mode is fully operational.
Intel Gaudi Media Loader is not supported on RHEL 9.4 with Python 3.11.
Firmware¶
For Gaudi 2 only, firmware SPI versions v1.21.2 and later are not compatible with Boot FIT v1.20.1 and earlier.
PyTorch¶
There can be compatibility issues with Optimum Habana on RHEL 8.6 and RHEL 9.4 due to the use of NumPy > 2.0 with Python 3.11, which is the default Python version on these operating systems. To avoid this, upgrade Numba to version 0.61.0 on these operating systems when using optimum-habana.
In torch.compile, view.dtype is not supported with the public PyTorch. To mitigate that, make sure to use it with the Intel Gaudi PyTorch fork only.
Multiprocess worker creation using the fork start method is not supported with the PyTorch dataloader. Instead, use the spawn or forkserver start method (see the sketch below). See Torch Multiprocessing for DataLoaders.
In certain models, there is performance degradation when using HPU graphs with Lazy collectives. To mitigate this, set the PT_HPU_LAZY_COLLECTIVES_HOLD_TENSORS=1 flag, though it may lead to increased memory consumption on the device side. This will be fixed in the subsequent release.
Sporadic numerical instability may occur when training with FP8 precision.
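For illustration, a minimal sketch of the spawn start method for DataLoader workers mentioned above, using a small in-memory dataset as a placeholder:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def main():
        dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
        # Use "spawn" (or "forkserver") instead of the unsupported "fork" start method.
        loader = DataLoader(dataset, batch_size=32, num_workers=2,
                            multiprocessing_context="spawn")
        for features, labels in loader:
            pass  # move batches to "hpu" and run the training step here

    if __name__ == "__main__":
        main()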
Starting from PyTorch 2.5.1, torch.nn.functional.scaled_dot_product_attention automatically casts inputs to FP32, which can lead to performance degradation or GC failures. To avoid this, set torch._C._set_math_sdp_allow_fp16_bf16_reduction(True) to enable torch.nn.functional.scaled_dot_product_attention to operate on BF16 (see the sketch below). Without this setting, the inputs will always be cast to FP32.
DeepSpeed is not compatible with PyTorch Lightning in SW releases greater than 1.17.X. However, it is supported in older releases that include lightning-habana 1.6.
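For illustration, a sketch of the workaround above that keeps scaled_dot_product_attention inputs in BF16 on the HPU:

    import torch
    import habana_frameworks.torch.core as htcore
    import torch.nn.functional as F

    # Allow SDPA to run the math reduction in BF16 instead of casting inputs to FP32.
    torch._C._set_math_sdp_allow_fp16_bf16_reduction(True)

    q = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
    k = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
    v = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
    out = F.scaled_dot_product_attention(q, k, v)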
To bypass a performance issue in Linux kernel version >= 5.9 (e.g. Ubuntu 22.04.5), the intel_idle driver must be disabled by adding intel_idle.max_cstate=0 to the kernel command line.
Support for torch.compile is in an early stage. Models may not work (due to missing OPs implementation) or performance may be affected.
Support for Eager mode is in an early stage. Models may not work (due to missing OPs implementation) or performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1. See Eager Mode.
Model checkpointing for ResNet50 in torch.compile mode is broken. This will be fixed in the next release.
Timing events where enable_timing=True may not provide accurate timing information.
Handling dynamic shapes can be initiated by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag. The flag is disabled by default but enabled selectively for several models. For best performance, follow the guidance on how to work with dynamic shapes in the Handling Dynamic Shapes document.
Graphs displayed in TensorBoard have some minor limitations, e.g. an operator's assigned device is displayed as "unknown device" when it is scheduled to HPU.
HPU tensor strides might not match those of CPU tensors because tensor storage is managed differently. References to tensor storage (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.
Weights sharing:
Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module (see the sketch after this list). For more details, refer to Weight Sharing.
Weights are not shared with operators outside of the PyTorch library (i.e. PyBind11 functions).
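For illustration, a hedged sketch of the supported pattern, where the shared weight is created and tied inside the module before it is moved to the HPU:

    import torch.nn as nn

    class TiedLinear(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.encoder = nn.Linear(dim, dim, bias=False)
            self.decoder = nn.Linear(dim, dim, bias=False)
            # Tie the weights inside the module; sharing set up this way is preserved on Gaudi.
            self.decoder.weight = self.encoder.weight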
User-defined attributes in an HPU torch.nn.Parameter are not preserved after the torch.nn.Parameter is assigned with a CPU tensor.
The Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
For a torch.nn.Parameter which is not created inside torch.nn.Module:
When two torch.nn.Parameter objects are on CPU storage and reference the same parameter, the connection will be lost if one of them is moved to HPU.
Assigning a CPU tensor to an HPU torch.nn.Parameter is not supported.
Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.
Using torch.device(hpu:x) (for example, in model.to), where x is a rank > 0, may lead to memory leaks. Instead, always use torch.device(hpu) to access the current rank.
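For example (the model is a placeholder):

    import torch
    import habana_frameworks.torch.core as htcore

    device = torch.device("hpu")              # resolves to the current rank's device
    model = torch.nn.Linear(8, 8).to(device)
    # Avoid explicit indices such as torch.device("hpu:1") on ranks > 0.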
Added the capability to serialize constant tensors, enabling recipe caching to disk for inference scenarios. However, due to a technical limitation, sharing recipes between cards on a single server is not possible. Recipes from each card are stored in separate directories, leading to increased usage of disk space.
Performing view-related operations on tensors with the INT64 data type (torch.long) in Lazy mode can lead to incorrect results. If this data type is not required, the script should work with INT32 tensors (torch.int). By default, PyTorch creates integer tensors with the torch.long data type, so make sure to explicitly create INT32 tensors. This limitation does not apply to Eager + torch.compile mode (PT_HPU_LAZY_MODE=0).
Launching ops with tensor inputs from mixed devices (hpu and cpu) is not supported in Eager + torch.compile mode (PT_HPU_LAZY_MODE=0). All tensors need to reside on hpu. Launching ops with tensor inputs from mixed devices is supported in Lazy mode, in which internal transfers to hpu are performed.
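For illustration, a short sketch combining both notes above: integer tensors are created explicitly as INT32, and all op inputs stay on the HPU in Eager + torch.compile mode:

    import os
    os.environ.setdefault("PT_HPU_LAZY_MODE", "0")  # Eager + torch.compile mode

    import torch
    import habana_frameworks.torch.core as htcore

    # Explicit INT32 instead of the default torch.long (INT64).
    idx = torch.arange(4, dtype=torch.int32, device="hpu")
    values = torch.randn(4, 8, device="hpu")

    # All inputs reside on "hpu"; mixing cpu and hpu tensors is not supported in this mode.
    selected = torch.index_select(values, 0, idx)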