Release Notes¶
Note
For previous versions of the Release Notes, please refer to Previous Release Notes.
New Features - v1.23.0¶
The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.23.0-695. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
The Intel Gaudi Software Suite v1.23.0 refreshes multiple software-side components with the latest third-party security fixes, removing exposure to recent CVEs and improving overall runtime stability. A new version, v1.24.0, is targeted to be released in early 2026 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.
General¶
Added Rack Scale script for Diagnostic tool. See Rack Scale Script.
Added ReportNCheck tool. See Intel Gaudi ReportNCheck Tool.
Upgraded Open MPI to v5.0.8. In this version, the --rank-by core option has been deprecated and replaced with --rank-by slot. If your existing mpirun scripts use --rank-by core for round-robin ranking behavior, update them to use --rank-by slot to maintain the same behavior.
vLLM¶
Introduced vLLM Hardware Plugin for Intel Gaudi 0.13.0 based on vLLM 0.13.0 and compatible with Intel Gaudi v1.23.0. It offers the following improvements:
Experimental dynamic quantization for MatMul and KV‑cache operations. This feature improves performance, with minimal expected impact on accuracy. For more information, see the Dynamic Quantization for MatMul and KV‑cache Operations section in the documentation.
Support for the following new models on Intel Gaudi 3:
Full validation of the following models:
For the list of all supported models, see Validated Models.
The vLLM fork is deprecated and scheduled for end of life in v1.24.0, remaining functional only for legacy use cases. All fork users are strongly encouraged to migrate to the vLLM Hardware Plugin for Intel Gaudi.
The TGI-gaudi fork has been deprecated in favor of the upstreamed version. The last supported release was with Optimum Habana v1.16, tested with v1.20.x. Starting from v1.22.0, please refer to the official TGI repository at: https://github.com/huggingface/text-generation-inference/tree/main/backends/gaudi.
PyTorch¶
Added support for MoE using Runtime Scale Patching:
LLaMA model with an estimated 5-20% performance degradation.
DeepSeek model with no performance degradation and in some cases improvements.
Qwen model - Partial support only with approximately 70% performance degradation. Using BF16 remains faster for this model.
Parallel Compilation is now supported on both Vision and non-Vision based models. See Compile Mode.
For MLM:
Rebased code to upstream core_r0.13.0 release.
Added torch.compile support for Mixtral.
Added distributed checkpoint torch_dist support for LLaMA and Mixtral.
Intel Gaudi offers a wide range of models using Eager mode and torch.compile, which is the default mode. However, any model requiring Lazy mode will need to set the PT_HPU_LAZY_MODE=1 flag, as it is set to 0 by default. Using torch.compile with Eager mode is recommended, as Eager mode alone can be slower due to its limited optimization of computation graphs (see the sketch below). Note that Lazy mode will be deprecated in future releases.
The v1.23.0 release has been validated with the Hugging Face Optimum for Intel Gaudi library and model version v1.19.0. Future releases of the Optimum for Intel Gaudi library may be validated with this release. See the Support Matrix for a full list of version support.
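Below is a short, hedged sketch of selecting the execution mode and compiling a small model on an HPU device. It assumes the habana_frameworks PyTorch bridge from this release is installed; the hpu_backend backend name and import path follow the public Gaudi documentation, but verify them against your installed version.

```python
import os

# Eager + torch.compile is the default in v1.23.0 (PT_HPU_LAZY_MODE=0).
# Set PT_HPU_LAZY_MODE=1 only for models that still require Lazy mode,
# and set it before the HPU bridge is imported.
os.environ.setdefault("PT_HPU_LAZY_MODE", "0")

import torch
import habana_frameworks.torch.core as htcore  # enables the HPU device

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU()).to("hpu")
compiled_model = torch.compile(model, backend="hpu_backend")

x = torch.randn(8, 32, device="hpu")
y = compiled_model(x)
print(y.shape)
```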
PyTorch Lightning has been deprecated. The last supported PyTorch Lightning version was 2.5.1, tested with v1.21.x. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/.
Upgraded to Intel Neural Compressor (INC) vTBD.
In MediaPipe, FFmpeg is an open source project licensed under LGPL and GPL. See https://www.ffmpeg.org/legal.html. You are solely responsible for determining if your use of FFmpeg requires any additional licenses. Intel is not responsible for obtaining any such licenses, nor liable for any licensing fees due, in connection with your use of FFmpeg.
Bug Fixes and Resolved Issues - v1.23.0¶
Firmware¶
Incorrect 12V PSU Power Draw Reporting - Previously, the power draw of the 12V PSU was incorrectly displayed due to a failure in reading its value. When this occurred, the firmware printed an invalid value (4294967295, equivalent to -1 or 0xFFFFFFFF) instead of showing N/A, which indicates a sensor read error. The issue has been fixed so that failed sensor reads are now correctly reported as N/A.
Dependency on 54V PSU Read Causing Incorrect 12V PSU Value - A failure to read the 54V PSU previously prevented the 12V PSU from being read or updated, resulting in incorrect or unexpected power draw values being displayed. This has been fixed where the 12V PSU readings are now updated independently, even if the 54V PSU read fails.
HW Semaphore Handling - Fixed an issue in the hardware semaphore usage where the semaphore mechanism did not properly prevent simultaneous access to a critical resource. The fix adds a write barrier after writing to the hardware semaphore register before reading its value, ensuring correct synchronization behavior. This change applies to Gaudi 3.
Known Issues and Limitations - v1.23.0¶
General¶
Enabling IOMMU passthrough is required only for Ubuntu 24.04.2/22.04.5 with Linux kernel 6.8. For more details, see Enable IOMMU Passthrough.
Running the functional test in high power mode results in a performance failure with an 8.5% degradation, whereas the functional test in extreme power mode is fully operational.
Intel Gaudi Media Loader is not supported on RHEL 9.4 with Python 3.11.
Firmware¶
For Gaudi 2 only, firmware SPI version v1.21.2 and later are not compatible with Boot FIT v1.20.1 and earlier.
PyTorch¶
There can be compatibility issues with Optimum-Habana on RHEL 8.6 and RHEL 9.4 due to the use of numpy > 2.0 with Python 3.11, which is the default Python version for these operating systems. To avoid this, upgrade the Numba version to 0.61.0 on these operating systems when using optimum-habana.
Multiprocess worker creation using the fork start method is not supported with the PyTorch dataloader. Instead, use the spawn or forkserver start method. See Torch Multiprocessing for DataLoaders.
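An illustrative snippet of a DataLoader configured with the spawn start method; the multiprocessing_context argument is standard PyTorch, and the dataset here is only a stand-in.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def main():
    dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

    # Use "spawn" (or "forkserver") instead of the default "fork" start
    # method whenever num_workers > 0.
    loader = DataLoader(
        dataset,
        batch_size=32,
        num_workers=2,
        multiprocessing_context="spawn",
    )

    for features, labels in loader:
        pass  # training / inference step goes here


if __name__ == "__main__":
    main()
```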
In certain models, there is performance degradation when using HPU graphs with Lazy collectives. To mitigate this, set the PT_HPU_LAZY_COLLECTIVES_HOLD_TENSORS=1 flag, though it may lead to increased memory consumption on the device side. This will be fixed in the subsequent release.
Using Lazy mode with LLMs that have a very large number of parameters (for example, Llama3.1-405B) may cause the system to become unresponsive. It is recommended to use DeepSpeed v1.22.0 when working with LLMs of this scale. This issue does not affect torch.compile or Eager mode scenarios.
Sporadic numerical instability may occur when training with FP8 precision.
Starting from PyTorch 2.5.1, torch.nn.functional.scaled_dot_product_attention automatically casts inputs to FP32, which can lead to performance degradation or GC failures. To avoid this, set torch._C._set_math_sdp_allow_fp16_bf16_reduction(True) to enable torch.nn.functional.scaled_dot_product_attention to operate on BF16. Without this setting, the inputs will always be cast to FP32.
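A hedged sketch of enabling BF16 inputs for scaled_dot_product_attention. torch._C._set_math_sdp_allow_fp16_bf16_reduction is the private PyTorch hook named above, so treat its availability as version dependent; tensor shapes are illustrative only.

```python
import torch
import habana_frameworks.torch.core as htcore  # HPU bridge

# Allow the math SDP path to keep BF16 inputs instead of upcasting to FP32.
torch._C._set_math_sdp_allow_fp16_bf16_reduction(True)

q = torch.randn(2, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
k = torch.randn(2, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
v = torch.randn(2, 8, 128, 64, dtype=torch.bfloat16, device="hpu")

out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(out.dtype)  # expected: torch.bfloat16
```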
DeepSpeed is not compatible with PyTorch Lightning in SW releases greater than 1.17.X. However, it is supported in older releases that include lightning-habana 1.6.
To bypass a performance issue in Linux kernel version >= 5.9 (e.g. Ubuntu 22.04.5), the intel_idle driver must be disabled by adding intel_idle.max_cstate=0 to the kernel command line.
Support for torch.compile is at an early stage. Models may not work (due to missing OPs implementation) or performance may be affected.
Support for Eager mode is at an early stage. Models may not work (due to missing OPs implementation) or performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1. See Eager Mode.
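A minimal sketch of the emulation described above, assuming the environment variables named in this section behave as documented and are set before the HPU bridge is imported; htcore.mark_step() is the standard Lazy-mode flush call.

```python
import os

# Keep Lazy mode active, but limit every compiled cluster to a single op,
# which approximates Eager mode behavior for debugging.
os.environ["PT_HPU_LAZY_MODE"] = "1"
os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "1"

import torch
import habana_frameworks.torch.core as htcore

a = torch.randn(4, 4, device="hpu")
b = torch.randn(4, 4, device="hpu")
c = (a @ b).relu()
htcore.mark_step()  # flush the accumulated (single-op) clusters
print(c.to("cpu"))
```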
Model checkpointing for ResNet50 in torch.compile mode is broken. This will be fixed in the next release.
Timing events where enable_timing=True may not provide accurate timing information.
Handling dynamic shapes can be initiated by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag. The flag is disabled by default but enabled selectively for several models. For best performance, follow the instructions on how to work with dynamic shapes in the Handling Dynamic Shapes document.
Graphs displayed in TensorBoard have some minor limitations, e.g. an operator's assigned device is displayed as "unknown device" when it is scheduled to HPU.
HPU tensor strides might not match those of CPU tensors, as tensor storage is managed differently. References to tensor storage (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.
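An illustrative snippet contrasting torch.as_strided with the recommended view functions on an HPU tensor; shapes and values are placeholders.

```python
import torch
import habana_frameworks.torch.core as htcore

t = torch.arange(12, device="hpu")

# Avoid: torch.as_strided assumes CPU-style strides, which the HPU storage
# layout does not guarantee.
# strided = torch.as_strided(t, size=(3, 4), stride=(4, 1))

# Prefer explicit view functions, which do not depend on raw strides.
reshaped = t.view(3, 4)
transposed = reshaped.transpose(0, 1)
print(transposed.to("cpu"))
```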
Weights sharing:
Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module (see the sketch below). For more details, refer to Weight Sharing.
Weights are not shared with operators outside of the PyTorch library (i.e. PyBind11 functions).
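A small, hedged example of weight sharing that satisfies the constraint above: both layers are created, and tied, inside the module before the module is moved to the HPU. The model and sizes are purely illustrative.

```python
import torch
import torch.nn as nn
import habana_frameworks.torch.core as htcore


class TiedEmbeddingLM(nn.Module):
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size, bias=False)
        # Tie the weights inside the module, before moving it to the device.
        self.head.weight = self.embed.weight

    def forward(self, tokens):
        return self.head(self.embed(tokens))


model = TiedEmbeddingLM().to("hpu")
tokens = torch.randint(0, 100, (4, 10), device="hpu")
print(model(tokens).shape)
```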
User-defined attributes in HPU torch.nn.Parameter are not preserved after torch.nn.Parameter is assigned with a CPU tensor.
Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
For torch.nn.Parameter which is not created inside torch.nn.Module:
When two torch.nn.Parameter are on CPU storage and referencing the same parameter, the connection will be lost if one of them is moved to HPU.
Assigning a CPU tensor to HPU torch.nn.Parameter is not supported.
Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.
Using torch.device(hpu:x) - for example, as model.to - where x is rank > 0 may lead to memory leaks. Instead, always use torch.device(hpu) to access the current rank.
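A minimal sketch of the recommended device handling in a multi-rank run; the rank variable is a stand-in for whatever value your launcher provides.

```python
import torch
import habana_frameworks.torch.core as htcore

rank = 1  # stand-in for the rank provided by the launcher

# Avoid: torch.device(f"hpu:{rank}") for rank > 0 may leak memory.
# Prefer the plain "hpu" device, which resolves to the current rank's card.
device = torch.device("hpu")

model = torch.nn.Linear(8, 8).to(device)
```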
Added the capability to serialize constant tensors, enabling recipe caching to disk for inference scenarios. However, due to a technical limitation, sharing recipes between cards on a single server is not possible. Recipes from each card are stored in separate directories, leading to increased usage of disk space.
Performing view-related operations on tensors with INT64 data type (torch.long) in Lazy mode can lead to incorrect results. If this data type is not required, the script should work with INT32 tensors (torch.int). By default, PyTorch creates integer tensors with torch.long data type, so make sure to explicitly create INT32 tensors. This limitation does not apply to Eager + torch.compile mode (PT_HPU_LAZY_MODE=0).
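A brief example of creating INT32 index tensors explicitly, since PyTorch defaults integer tensors to torch.long; values are illustrative.

```python
import os
os.environ["PT_HPU_LAZY_MODE"] = "1"  # the limitation applies to Lazy mode

import torch
import habana_frameworks.torch.core as htcore

# Default: torch.tensor([...]) creates torch.long (INT64) tensors.
idx_int64 = torch.tensor([0, 2, 4], device="hpu")

# Prefer explicit INT32 when the value range allows it.
idx_int32 = torch.tensor([0, 2, 4], dtype=torch.int32, device="hpu")
print(idx_int64.dtype, idx_int32.dtype)
```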
Launching ops with tensor inputs from mixed devices (hpu and cpu) is not supported in Eager + torch.compile mode (PT_HPU_LAZY_MODE=0). All tensors need to reside on hpu. Launching ops with tensor inputs from mixed devices is supported in Lazy mode, in which internal transfers to hpu are performed.
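A short sketch of keeping all operands on the HPU before launching an op in Eager / torch.compile mode (PT_HPU_LAZY_MODE=0); tensor contents are placeholders.

```python
import torch
import habana_frameworks.torch.core as htcore

a = torch.randn(4, 4, device="hpu")
b_cpu = torch.randn(4, 4)  # lives on the CPU

# Not supported in Eager / torch.compile mode: a + b_cpu (mixed devices).
# Move the CPU tensor to the HPU explicitly first.
b = b_cpu.to("hpu")
c = a + b
print(c.device)
```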