Release Notes¶
Note
For previous versions of the Release Notes, please refer to Previous Release Notes.
New Features and Enhancements - 1.19.1¶
The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.19.1-26. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
This release includes various bug fixes, as well as updated firmware for Gaudi 3.
Added a new run option for the hl-smi-async tool. See hl_smi_async Tool.
Updating OAM CPLD using hl-fw-loader is now supported.
Updated SUSE 15.5 kernel version to 5.14.21-150500.55.88.1.
New Features and Enhancements - 1.19.0¶
The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.19.0-561. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
General¶
Added support for the following:
Kubernetes versions 1.30, 1.31, 1.32
TencentOS on Gaudi 3
Ubuntu 24.04 on Gaudi 2
RHEL 8.6 with Python 3.11
Ubuntu 22.04 PyTorch Dockers with Python version 3.11. See PyTorch Docker.
BMC exporter on Gaudi 3
Added the RDMA PerfTest tool, which tests low-level, high-performance connectivity through ping-pong and bandwidth communication tests. See Intel Gaudi RDMA PerfTest Tool.
Deprecated Amazon Linux 2.
Firmware¶
Added Hypervisor tools package which includes Memory Scrubbing Verification (MSV) tool and hl-smi-async tool. See Hypervisor Tools Installation and Usage.
Added support for disabling and enabling external NIC ports on Gaudi 3. See the Disable/Enable NICs section.
Added Firmware Update Lock feature for Gaudi 3. See Firmware Update Lock.
Added the following APIs to the Habana Labs Management Library and Habana Labs Python Management Library. See HLML API Reference and PYHLML API Reference:
hlml_device_get_process_utilization
hlml_device_set_power_management_limit
hlml_get_nic_driver_version
hlml_device_get_supported_performance_states
PyTorch¶
The v1.19.0 release has been validated with the Hugging Face Optimum for Intel Gaudi library and model version 1.15.0. Future releases of the Optimum for Intel Gaudi library may be validated with this release. See the Support Matrix for a full list of version support.
The Hugging Face Optimum for Intel Gaudi library now supports the updated TGI-gaudi version 2.3.1. See https://github.com/huggingface/tgi-gaudi.
Validated the Intel Gaudi 1.19.0 software release with vLLM v0.6.4.post2. See the Support Matrix for a full list of version support.
Added support for the following features to the Intel Gaudi vLLM fork:
Multi-step scheduling on HPU with tensor parallelism
Asynchronous Output Processing
Long context with LoRA (up to 128k)
Automatic Prefix Caching
Repetition penalty
Structured Output (guided JSON)
FusedMoE
Non-invasive model graph splitting
Upgraded to PyTorch version 2.5.1. See PyTorch Support Matrix.
Validated the Intel Gaudi 1.19.0 software release on PyTorch Lightning version 2.3.3. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/.
Gaudi PyTorch Bridge source code and the associated Gaudi PyTorch Fork code are publicly available at Gaudi PyTorch bridge and Gaudi PyTorch Fork.
The Gaudi PyTorch Bridge now supports stock PyTorch version 2.5.1, along with the Gaudi PyTorch fork. Stock PyTorch operates in Eager mode with torch.compile only. This feature is currently experimental. See Public PyTorch Support.
Enabled updating CPU settings on Gaudi 3 with Sapphire Rapids and Granite Rapids processors to optimize performance. See the Set CPU Setting to Performance section.
Added the following training reference models for Gaudi 3 and Gaudi 2. See Intel Gaudi Megatron-DeepSpeed and Intel Gaudi Megatron-LM:
LLaMA 3.1 8B on 8 cards
LLaMA 3.1 70B on 64 cards
Added Mixtral 8x7B BF16 on 32 cards training reference model for Gaudi 2. See Intel Gaudi Megatron-LM.
The FP8 scaling method is set to scalar by default. See Compile Time and Throughput Optimization.
Intel Gaudi Megatron-DeepSpeed will be deprecated and replaced with Megatron-LM in version 1.20.0.
Intel Gaudi offers a wide range of models using Eager mode and torch.compile. In subsequent releases, Lazy mode will be deprecated, and Eager mode with torch.compile will be the default.
Known Issues and Limitations - 1.19.0¶
General¶
The EDP test is not functional on RHEL 9.2/9.4 operating systems.
Intel Gaudi Media Loader is not supported on RHEL 9.4 with Python 3.11.
Firmware¶
After upgrading the FW to v1.19.0, make sure to reboot or power cycle the host machine. Otherwise, the Telemetry tool stops receiving data, and a RAZWI error could appear in the LKD driver debug log.
When using the hl-smi tool on Gaudi 3 while the following test plugins are running, the tool outputs 0% OAM utilization:
HBM_DMA_STRESS
HBM_TPC_STRESS
HBM_FULL_DATA_CHECK
E2E concurrency
SER
These low-level tests use a special mode that ensures low latency and fast execution. This mode leaves no trace in the utilization calculation.
Updating OAM CPLD using hl-fw-loader is not supported in v1.19.0. This will be fixed in the next release.
PyTorch¶
Sporadic numerical instability may occur when training with FP8 precision.
Starting from PyTorch 2.5.1, torch.nn.functional.scaled_dot_product_attention automatically casts inputs to FP32, which can lead to performance degradation or GC failures. To avoid this, set torch._C._set_math_sdp_allow_fp16_bf16_reduction(True) to enable torch.nn.functional.scaled_dot_product_attention to operate on BF16. Without this setting, the inputs are always cast to FP32. A minimal sketch is shown below.
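The following sketch demonstrates the workaround; it assumes the Gaudi PyTorch bridge is installed, and the tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F
import habana_frameworks.torch.core as htcore  # assumes the Gaudi PyTorch bridge is installed

# Allow the math SDP backend to reduce in BF16 instead of casting inputs to FP32
torch._C._set_math_sdp_allow_fp16_bf16_reduction(True)

# Illustrative BF16 attention inputs on HPU
q = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
k = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
v = torch.randn(1, 8, 128, 64, dtype=torch.bfloat16, device="hpu")

out = F.scaled_dot_product_attention(q, k, v)  # operates on BF16 with the setting above
```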
DeepSpeed is not compatible with PyTorch Lightning in SW releases later than 1.17.x. However, it is supported in older releases that include lightning-habana 1.6.
To bypass a performance issue in Linux kernel versions >= 5.9 (e.g. Ubuntu 22.04), the intel_idle driver must be disabled by adding intel_idle.max_cstate=0 to the kernel command line, as sketched below.
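For example, on Ubuntu this is typically done by editing the GRUB configuration; this is an illustrative sketch, and the exact procedure varies by distribution and boot loader:

```
# /etc/default/grub (illustrative snippet)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=0"
```

After editing, apply the change (e.g. sudo update-grub on Ubuntu) and reboot the host.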
Support for torch.compile is in an early stage. Models may not work (due to missing op implementations) or performance may be affected.
Support for Eager mode is in an early stage. Models may not work (due to missing op implementations) or performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1, as sketched below. See Eager Mode.
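A minimal sketch of this emulation, assuming the variable must be set before the Gaudi bridge is loaded:

```python
import os

# Limit compound op cluster size to 1 to emulate Eager mode under Lazy mode
# (set before importing the Gaudi PyTorch bridge)
os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "1"

import torch
import habana_frameworks.torch.core as htcore
```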
Model checkpointing for ResNet50 in torch.compile mode is broken. This will be fixed in the next release.
Timing events with enable_timing=True may not provide accurate timing information.
Handling dynamic shapes can be initiated by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag, as sketched below. The flag is disabled by default but is enabled selectively for several models. For best performance, follow the guidance on how to work with dynamic shapes in the Handling Dynamic Shapes document.
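A minimal sketch of enabling the flag, assuming it is set before the workload starts:

```python
import os

# Enable dynamic-shape refinement (disabled by default)
os.environ["PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES"] = "1"

import torch
import habana_frameworks.torch.core as htcore
```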
Graphs displayed in TensorBoard have some minor limitations, e.g. an operator's assigned device is displayed as "unknown device" when it is scheduled to HPU.
HPU tensor strides might not match those of the CPU, as tensor storage is managed differently. References to tensor storage (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided, as sketched below. For further details, see Tensor Views and TORCH.AS_STRIDED.
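A minimal sketch of preferring a named view op over torch.as_strided; the shapes are illustrative:

```python
import torch
import habana_frameworks.torch.core as htcore  # assumes the Gaudi PyTorch bridge

x = torch.arange(12, device="hpu").reshape(3, 4)

# Avoid: torch.as_strided(x, (4, 3), (1, 4)) -- it assumes CPU-style strides,
# which may not match how HPU tensor storage is managed.
y = x.t()  # named view op; strides are handled by the framework
```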
Weights sharing:
Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module. For more details, refer to Weight Sharing.
Weights are not shared with operators outside of the PyTorch library (e.g. PyBind11 functions).
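A minimal sketch of the supported pattern, using a hypothetical TiedLinear module whose weights are tied at construction:

```python
import torch
import torch.nn as nn
import habana_frameworks.torch.core as htcore  # assumes the Gaudi PyTorch bridge

class TiedLinear(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.enc = nn.Linear(dim, dim, bias=False)
        self.dec = nn.Linear(dim, dim, bias=False)
        self.dec.weight = self.enc.weight  # sharing is created inside the module

model = TiedLinear(16).to("hpu")  # supported: weights were tied at construction
```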
User-defined attributes in an HPU torch.nn.Parameter are not preserved after the torch.nn.Parameter is assigned with a CPU tensor.
The Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
For a torch.nn.Parameter which is not created inside a torch.nn.Module (see the sketch after this list):
When two torch.nn.Parameter objects are on CPU storage and reference the same parameter, the connection is lost if one of them is moved to HPU.
Assigning a CPU tensor to an HPU torch.nn.Parameter is not supported.
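An illustrative sketch of the first limitation, using hypothetical tensors:

```python
import torch
import habana_frameworks.torch.core as htcore  # assumes the Gaudi PyTorch bridge

base = torch.ones(4)
p1 = torch.nn.Parameter(base)  # created outside any torch.nn.Module
p2 = torch.nn.Parameter(base)  # references the same CPU storage as p1

p1_hpu = torch.nn.Parameter(p1.to("hpu"))
# p1_hpu and p2 no longer share storage: updates to one are not seen by the other
```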
Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.
Using torch.device("hpu:x") (for example, in model.to) where x is a rank > 0 may lead to memory leaks. Instead, always use torch.device("hpu") to access the current rank, as sketched below.
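A minimal sketch, assuming a multi-rank launch where each process has already acquired its device:

```python
import torch
import torch.nn as nn
import habana_frameworks.torch.core as htcore  # assumes the Gaudi PyTorch bridge

model = nn.Linear(8, 8)

device = torch.device("hpu")  # recommended: resolves to the current rank's device
model = model.to(device)

# Avoid torch.device("hpu:1") and similar explicit indices for ranks > 0,
# as this may lead to memory leaks.
```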
Added the capability to serialize constant tensors, enabling recipe caching to disk for inference scenarios. However, due to a technical limitation, sharing recipes between cards on a single server is not possible. Recipes from each card are stored in separate directories, leading to increased disk space usage.
Performing view-related operations on tensors with the INT64 data type (torch.long) in Lazy mode can lead to incorrect results. If this data type is not required, the script should work with INT32 tensors (torch.int). By default, PyTorch creates integer tensors with the torch.long data type, so make sure to explicitly create INT32 tensors, as sketched below. This limitation does not apply to Eager + torch.compile mode (PT_HPU_LAZY_MODE=0).
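A minimal sketch of creating INT32 tensors explicitly; the values are illustrative:

```python
import torch
import habana_frameworks.torch.core as htcore  # assumes the Gaudi PyTorch bridge

# torch.tensor([...]) would default to torch.long (INT64); request INT32 explicitly
idx = torch.tensor([0, 1, 2, 3], dtype=torch.int32, device="hpu")
view = idx.view(2, 2)  # view-related ops on INT32 are safe in Lazy mode
```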
Launching ops with tensor inputs from mixed devices (hpu and cpu) is not supported in Eager + torch.compile mode (PT_HPU_LAZY_MODE=0). All tensors need to reside on hpu; see the sketch below. Launching ops with tensor inputs from mixed devices is supported in Lazy mode, in which internal transfers to hpu are performed.
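A minimal sketch of moving all inputs to hpu before launching an op in Eager + torch.compile mode:

```python
import torch
import habana_frameworks.torch.core as htcore  # assumes the Gaudi PyTorch bridge

a = torch.randn(4, device="hpu")
b = torch.randn(4)  # created on cpu

# Not supported with PT_HPU_LAZY_MODE=0: a + b (mixed hpu/cpu inputs)
c = a + b.to("hpu")  # move the CPU tensor explicitly before launching the op
```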