Release Notes¶

Note

For previous versions of the Release Notes, please refer to Previous Release Notes.

New Features - v1.21.3¶

The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.21.3-57 We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

The Intel Gaudi Software Suite version v1.21.3 may not include all the latest functional and security updates. A new version, v1.22.0, is targeted to be released in August 2025 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.

General¶

First-gen Gaudi is no longer supported in software releases starting from v1.21.3.
Upgraded Gaudi 3 SVN to sec-3.
Improved HBM bandwidth of Gaudi 2E.

Bug Fixes and Resolved Issues - v1.21.3¶

Firmware¶

For Gaudi 3, fixed NICs port disabling mechanism.

Known Issues and Limitations - v1.21.3¶

Optimum for Intel Gaudi 1.18.0¶

The datasets package is not pinned, leading to an automatic upgrade to version 4.0.0, which breaks some examples (e.g., speech recognition, audio classification or lm_eval in text generation). To mitigate this, manually pin datasets to version <=3.6.0.
The sentencepiece dependency may trigger a numpy upgrade, which can cause issues in the Stable Diffusion (text-to-image generation) example. To mitigate this, add a dependency on numpy==1.26.4.

New Features - v1.21.2¶

The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.21.2-76 We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

The Intel Gaudi Software Suite version v1.21.2 may not include all the latest functional and security updates. A new version, v1.22.0, is targeted to be released in August 2025 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.

General¶

Gaudi 2 support in v1.21.2 - This release supports Gaudi 2 PCIe only. Gaudi 2C and Gaudi 2D are not supported.
This is the final release to support first-gen Gaudi, and it will no longer be supported in future software releases.

PyTorch¶

Added an option to store compiled recipes on NFS location. See <RECIPE_CACHE_ON_NFS> in Runtime Environment Variables.
Validated the Intel Gaudi 1.21.2 software release with vLLM v0.8.5.post1. See the Support Matrix for a full list of version support.
Added support for the following features to the Intel Gaudi vLLM fork:
- 3D-warmup
- torchrun for offline inference

Bug Fixes and Resolved Issues - v1.21.2¶

Firmware¶

For Gaudi 2 only, firmware SPI version v1.21.2 and later are not compatible with Boot FIT v1.20.1 and earlier.

New Features - v1.21.1¶

The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.21.1-16 We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

The Intel Gaudi Software Suite version v1.21.1 may not include all the latest functional and security updates. A new version, v1.22.0, is targeted to be released in July 2025 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.

General¶

In Qual Package Installation Validator, added support for the following. See Qual Package Installation Validator:
- Open MPI library installation verification
- Dynamically linked libraries load path verification
- Gaudi 3 PCIe (HL-338) device top board installation verification
Introduced the following to the Test Plan Automation feature in the Diagnostic Tool. See Test Plan Automation:
- Added example YAML configuration files for test plans targeting Gaudi 3 and Gaudi 2.
- Introduced Sensor Logs collection.
In hl_qual, reduced F2 pass/fail criteria for HL-325 and HL-325L.

Enhancements - v1.21.1¶

General¶

Enhanced the Test Plan Automation feature in the Diagnostic Tool. See Test Plan Automation:
- Improved virtual UART data collection for increased reliability and accuracy.
- Enhanced dmesg log collection to provide more comprehensive diagnostics.

Bug Fixes and Resolved Issues - v1.21.1¶

Firmware¶

Fixed power limit display in hl-smi.
Adjusted composite temperature (cTemp) calculation formula.

New Features - v1.21.0¶

The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.21.0-555 We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

The Intel Gaudi Software Suite version v1.21.0 may not include all the latest functional and security updates. A new version, v1.22.0, is targeted to be released in July 2025 and will include additional functional and security updates. Customers should update to the latest version as it becomes available.

General¶

Added support for Gaudi 3 with the Intel® Tiber™ AI Cloud. See Intel Tiber AI Cloud Quick Start Guide.
In PerfTest tool, added support for Gaudi 2 and the new --basic_check switch. The switch enables testing between the corresponding ports across two systems.
OFI Wrapper and libfabric are no longer installed by default. When using Host NIC with OFI in Intel Gaudi containers, make sure to follow the instructions in Using Host NIC over OFI section.

Firmware¶

For Gaudi 3:

Added support for GetSensorThresholds and SetSensorThresholds sensors.
Added support for I2C direct and IPMI protocols in out-of-band.
Added GET_UID MCTP control command.

For Gaudi 3 and Gaudi 2:

Added hlml_device_get_power_management_limit_constraints API to the Habana Labs Management Library and Habana Labs Python Management Library. See HLML API Reference and PYHLML API Reference.

For HL-225D:

Enabled FP32 workloads.

PyTorch¶

Intel Gaudi offers a wide range of models using Eager mode and torch.compile, which is the default mode. However, any model requiring Lazy mode will need to set the PT_HPU_LAZY_MODE=1 flag, as it is set to 0 by default. Using torch.compile with Eager mode is recommended, as Eager mode alone can be slower due to its limited optimization of computation graphs. Note that Lazy mode will be deprecated in future releases.
Enabled multi-threaded graph compilation (only with PT_HPU_LAZY_MODE=0) to improve performance in Eager mode and reduce time-to-train (TTT) in Compile mode. See Parallel Compilation section.
Added vLLM profiling methods. See Profiling with vLLM.
Added support for Reducing vLLM FP8 Warmup Time feature to enhance performance. See Reducing vLLM FP8 Warmup Time.
Added support for the following features to the Intel Gaudi vLLM fork:
- Automatic Prefix Caching
- Pipeline Parallelism - see Pipeline Parallelism section.
- Guided Decoding
- V1 Support - initial support
- Multimodality
- Exponential Bucketing - initial support
- Delayed Sampling - initial support
- FP16 support - limited models (for further details, see Intel Gaudi vLLM fork)
- Multi node
- INT4 support (AWQ/GPTQ) - limited models (for further details, see Intel Gaudi vLLM fork)
- Support for RedHat Openshift AI
- Split QKV optimizations for BF16
Added support for new models in Intel Gaudi vLLM fork:
- DeepSeek-R1
- Codellama-34b-instruct-hf
- Llama 3.3-70B-instruct-hf
- mistralai/Mistral-Small-Instruct-2407
Enabled Context Parallelism in training with BF16 precision for MLM.
Added support for NumPy 2.x. Intel Gaudi Docker image includes NumPy 1.26.4 by default but can be upgraded to the latest NumPy 2.x version as needed.
The mixture_of_experts (MoE) custom op has been extended to support the following new flows. For further details, refer to Mixture of Experts Forward (MoE):
- Dynamic Per-Token Quantization: This flavor runs the operation in FP8 precision. Downscale GEMM scales are computed internally, while the scales for the upscale GEMM and weights are provided by the caller. It supports per-token scales, which can be either squeezed 1D or unsqueezed 2D tensors.
- Blockwise Quantization: In this flavor, the operation runs in BF16 precision, with weights provided using FP8 blockwise quantization. Weight dequantization is performed internally within the operation. Only square blocks are supported.
In MediaPipe, added support for HEVC video decode. See MediaPipe.
v1.21.0 release has been validated with Hugging Face Optimum for Intel Gaudi library and model version v1.18.0. Future releases of the Optimum for Intel Gaudi library may be validated with this release. See the Support Matrix for a full list of version support.
The TGI-gaudi fork has been deprecated. The last supported release was with Optimum Habana v1.16, tested with v1.20.x. Starting from v1.21.0, please refer to the official TGI repository at: https://github.com/huggingface/text-generation-inference/tree/main/backends/gaudi.
Validated the Intel Gaudi 1.21.0 software release with vLLM v0.7.2. See the Support Matrix for a full list of version support.
Validated the Intel Gaudi 1.21.0 software release on PyTorch Lightning version 2.5.1. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/. PyTorch Lightning will be deprecated in the subsequent release.
Upgraded to Intel Neural Compressor (INC) v3.4.

Enhancements - v1.21.0¶

Firmware¶

For Gaudi 3:

Improved I2C busses stability for sensors readings and PLDM transmissions.
Fine-tuned ADC power sensor to achieve more accurate readings.
Improved firmware update transfer mechanism over PLDM/MCTP.
For HL338 PCIe card, fine-tuned idle power consumption as a base for utilization calculation.
Improved OAM health reports over PLDM.

HuggingFace¶

Integrated HPU as the native backend for Transformers and Accelerate, successfully validated Llama 70B and Mixtral PEFT across 8 cards.

Transformer Engine¶

Improved FP8 training performance by introducing a smoother Swiglu implementation.

Intel Gaudi, Kernels, `tpc_guid` and HCCL¶

Stability and performance of the compiler in Eager mode for transpose operations.
Stability of the compiler in Eager mode for FP8 data type (Llama inference).

Embedded System Tools¶

hl_qual test (F2 extreme) support to prevent interrupt storms on certain cards.
hl-smi-async tool to provide valid Ethernet status even when NICs are disabled.
Card recognition when it appears on a non-zero PCIe domain, resolving unexpected behavior during maintenance.
Extended sensor tool configuration to support new variants of Gaudi 3, such as HL-328.

Bug Fixes and Resolved Issues - v1.21.0¶

Firmware¶

For Gaudi 3:

Fixed the values of the engines utilization.
Fixed PLDM package component classification values to use 10 (firmware) instead of 0 (unknown).
Fixed onboard temperature sensors readings on PCB revision R0C.
Aligned PCIe HDR sensors to the PLDM specification.

For Gaudi 2:

Fixed the reporting of ASIC serial number out-of-band when LKD is loaded.
Fixed Ethernet error injection to conform the out-of-band Ethernet opcode specification.

DeepSpeed¶

Fixed graph breaks caused by the fetch_sub_module routine, which improved DeepSpeed (Zero3) performance when using PyTorch compile.
Using torch.use_deterministic_algorithms no longer breaks memory mapping when keep_module_on_host flag is set while running inference on large models.
Added missing components to support DeepSeek V3 AutoTP.

Transformer Engine¶

Fixed the following:
- An accuracy issue in Fused Attention during training with FP8 precision on Gaudi 3.
- A bug in full recompute mode for HL_PP > 1 configurations during FP8 precision training.
- An accuracy bug in configurations using Sequence Parallel mode during FP8 precision training on Gaudi 2.

Intel Gaudi, Kernels, `tpc_guid` and HCCL¶

Reduced compilation (warm-up) time through various backend software improvements, resulting in overall device time improvements on Gaudi 3 through optimizations.
Fixed the following
- Compilation error: “Message: getDramOffset: DRAM offset is not set!” that appeared in the Schwarz model.
- Oversubscription of the MME Suspension Buffer, which caused Compute or DMA timeout errors.
- Long and unbalanced MME computation caused by oversubscription of the Dcore cache.
- Support for the torch.block_diag operation.
- Segmentation fault and endless loop compilation error: “Message: calculator stuck in endless loop” on Gaudi 2.
For mixture_of_experts (MoE):
- Removed limitation on the number of operands for the MoE operation.
- Fixed support for updating tensors in scatter-add operations for partial weight gradient accumulation.

HL_QUAL¶

Removed nic_base dir_bw test variant as the calculated bandwidth was fluctuating. Instead, the bandwidth test variant can be used, which provides stable and predictable results. See Serdes Base Test Testing Modes.

Known Issues and Limitations - v1.21.0¶

General¶

Enabling IOMMU passthrough is required only for Ubuntu 24.04.2/22.04.5 with Linux kernel 6.8. For more details, see Enable IOMMU Passthrough.
Running functional test in high power mode experiences a performance failure with an 8.5% degradation, whereas the functional test in extreme power mode is fully operational.
Intel Gaudi Media Loader is not supported in RHEL9.4 OS with Python 3.11.

PyTorch¶

Multiprocess worker creation using the fork start method is not supported with PyTorch dataloader. Instead, use the spawn or forkserver start method. See Torch Multiprocessing for DataLoaders.
In certain models, there is performance degradation when using HPU graphs with Lazy collectives. To mitigate this, set PT_HPU_LAZY_COLLECTIVES_HOLD_TENSORS=1 flag, though it may lead to increased memory consumption on the device side. This will be fixed in the subsequent release.
When running DeepSpeed inference with keep_module_on_host = True flag, make sure to set DEEPSPEED_USE_HABANA_FRAMEWORKS_DETERMINISTIC_API=1. This prevents OOM issues on the host when loading the model’s parameters.
Sporadic numerical instability may occur when training with FP8 precision.
Starting from PyTorch 2.5.1, torch.nn.functional.scaled_dot_product_attention automatically casts inputs to FP32, which can lead to performance degradation or GC failures. To avoid this, set torch._C._set_math_sdp_allow_fp16_bf16_reduction(True) to enable torch.nn.functional.scaled_dot_product_attention to operate on BF16. Without this setting, the inputs will always be cast to FP32.
DeepSpeed is not compatible with PyTorch Lightning in SW releases greater than 1.17.X. However, it is supported in older releases that include lightning-habana 1.6.
To bypass a performance issue in Linux kernel version >= 5.9 (e.g. Ubuntu 22.04.5), the intel_idle driver must be disabled by adding intel_idle.max_cstate=0 to the kernel command line.
Support for torch.compile is in early stage. Models may not work (due to missing OPs implementation) or performance may be affected.
Support for Eager mode is in early stages. Models may not work (due to missing OPs implementation) or performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1. See Eager Mode.
Model checkpointing for ResNet50 in torch.compile mode is broken. This will be fixed in the next release.
Timing events where enable_timing=True may not provide accurate timing information.
Handling Dynamic shapes can be initiated by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag. Flag is disabled by default but enabled selectively for several models. For best performance, users should follow the guidance on how to work with Dynamic Shapes in the Handling Dynamic Shapes document.
Graphs displayed in TensorBoard have some minor limitations, eg. operator’s assigned device is displayed as “unknown device” when it is scheduled to HPU.
HPU tensor strides might not match that of CPU as tensor storage is managed differently. Reference to tensor storage (such as torch.as_strided) should take into account the input tensor strides explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.
Weights sharing:
- Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module. For more details, refer to Weight Sharing.
- Weights are not shared with operators outside of the PyTorch library (i.e. PyBind11 functions).
User-defined attributes in HPU torch.nn.Parameter are not preserved after torch.nn.Parameter is assigned with a CPU tensor.
Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
For torch.nn.Parameter which is not created inside torch.nn.Module:
- When two torch.nn.Parameter are on CPU storage and referencing the same parameter, the connection will be lost if one of them is moved to HPU.
- Assigning a CPU tensor to HPU torch.nn.Parameter is not supported.
Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.
Using torch.device(hpu:x) - (for example, as model.to) - where x is rank > 0 may lead to memory leaks. Instead, always use torch.device(hpu) to access the current rank.
Added the capability to serialize constant tensors, enabling recipe caching to disk for inference scenarios. However, due to a technical limitation, sharing recipes between cards on a single server is not possible. Recipes from each card are stored in separate directories, leading to increased usage of disk space.
Performing view-related operations on tensors with INT64 data type (torch.long) in Lazy mode can lead to incorrect results. If this data type is not required, the script should work with INT32 tensors (torch.int). By default, PyTorch creates integer tensors with torch.long data type, so make sure to explicitly create INT32 tensors. This limitation does not apply to Eager + torch.compile mode (PT_HPU_LAZY_MODE=0).
Launching ops with tensor inputs from mixed devices (hpu and cpu) is not supported in Eager + torch.compile mode (PT_HPU_LAZY_MODE=0). All tensors need to reside on hpu. Launching ops with tensor inputs from mixed devices is supported in Lazy mode in which internal transfers to hpu are performed.

Gaudi Documentation 1.21.1 documentation

Release Notes

On this Page

Release Notes¶

New Features - v1.21.3¶

General¶

Bug Fixes and Resolved Issues - v1.21.3¶

Firmware¶

Known Issues and Limitations - v1.21.3¶

Optimum for Intel Gaudi 1.18.0¶

New Features - v1.21.2¶

General¶

PyTorch¶

Bug Fixes and Resolved Issues - v1.21.2¶

Firmware¶

New Features - v1.21.1¶

General¶

Enhancements - v1.21.1¶

General¶

Bug Fixes and Resolved Issues - v1.21.1¶

Firmware¶

New Features - v1.21.0¶

General¶

Firmware¶

PyTorch¶

Enhancements - v1.21.0¶

Firmware¶

HuggingFace¶

Transformer Engine¶

Intel Gaudi, Kernels, `tpc_guid` and HCCL¶

Embedded System Tools¶

Bug Fixes and Resolved Issues - v1.21.0¶

Firmware¶

DeepSpeed¶

Transformer Engine¶

Intel Gaudi, Kernels, `tpc_guid` and HCCL¶

HL_QUAL¶

Known Issues and Limitations - v1.21.0¶

General¶

PyTorch¶

Gaudi Documentation 1.21.1 documentation

Release Notes

On this Page

Release Notes¶

New Features - v1.21.3¶

General¶

Bug Fixes and Resolved Issues - v1.21.3¶

Firmware¶

Known Issues and Limitations - v1.21.3¶

Optimum for Intel Gaudi 1.18.0¶

New Features - v1.21.2¶

General¶

PyTorch¶

Bug Fixes and Resolved Issues - v1.21.2¶

Firmware¶

New Features - v1.21.1¶

General¶

Enhancements - v1.21.1¶

General¶

Bug Fixes and Resolved Issues - v1.21.1¶

Firmware¶

New Features - v1.21.0¶

General¶

Firmware¶

PyTorch¶

Enhancements - v1.21.0¶

Firmware¶

HuggingFace¶

Transformer Engine¶

Intel Gaudi, Kernels, tpc_guid and HCCL¶

Embedded System Tools¶

Bug Fixes and Resolved Issues - v1.21.0¶

Firmware¶

DeepSpeed¶

Transformer Engine¶

Intel Gaudi, Kernels, tpc_guid and HCCL¶

HL_QUAL¶

Known Issues and Limitations - v1.21.0¶

General¶

PyTorch¶

Intel Gaudi, Kernels, `tpc_guid` and HCCL¶

Intel Gaudi, Kernels, `tpc_guid` and HCCL¶