Release Notes

Note

For previous versions of the Release Notes, please refer to Previous Release Notes.

New Features and Enhancements - 1.18.0

The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.18.0-524. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

General

  • Added Ubuntu 24.04 support for Gaudi 3.

  • The Ubuntu 22.04 Python version will be updated to 3.11 in version 1.19.0.

  • Amazon Linux 2 will be deprecated in version 1.19.0.

  • Added video (H.264 format) support to the MediaPipe interface.

  • Removed ssh keys from Intel Gaudi Dockers. To add ssh keys, see Installation Guide.

  • The habanalabs-firmware-odm package is installed automatically by the habanalabs-installer.

  • Added support for the following features to the Intel Gaudi vLLM fork:

    • Attention with Linear Biases (ALiBi)

    • Quantization with Intel Neural Compressor (INC)

    • LoRA adapters

    • Long context (up to 32k)

    • Probs and LogProbs

    • Initial support for torch.compile

Firmware

Added the following APIs to the Habana Labs Management Library and Habana Labs Python Management Library. See HLML API Reference and PYHLML API Reference:

  • hlml_device_get_clock_limit_info

  • hlml_device_get_power_management_limit

  • hlml_device_get_power_management_mode

PyTorch

  • The 1.18.0 release has been validated with Hugging Face Optimum for Intel Gaudi library and model version 1.13.1. Future releases of the Optimum for Intel Gaudi library may be validated with this release. Please check the Support Matrix for a full list of version support.

  • The Hugging Face Optimum for Intel Gaudi library now supports the updated TGI-gaudi version 2.0.4. See https://github.com/huggingface/tgi-gaudi.

  • Upgraded to PyTorch version 2.4.0. See PyTorch Support Matrix.

  • Quantizing PyTorch models to UINT4 using the Intel® Neural Compressor (INC) is now supported. See Run Inference Using UINT4.

  • The Intel Gaudi PyTorch bridge is loaded automatically, so importing habana_frameworks.torch.hpu is no longer mandatory. See PyTorch Autoloading.

  • Intel Gaudi Megatron-DeepSpeed will be deprecated and replaced with Megatron-LM in version 1.19.0.

  • Added the following training reference models for Gaudi 3. See Intel Gaudi Megatron-DeepSpeed:

    • LLaMA 3.1 8B FP8/BF16 on 8 cards

    • LLaMA 3.1 70B FP8/BF16 on 32/64 cards

  • Added the following training reference models for Gaudi 2. See Intel Gaudi Megatron-DeepSpeed:

    • LLaMA 3.1 8B FP8/BF16 on 8 cards

    • LLaMA 3.1 70B FP8/BF16 on 64 cards

  • Validated the Intel Gaudi 1.18.0 software release on PyTorch Lightning version 2.3.3. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/.

  • Intel Gaudi offers a wide range of models using Eager mode and torch.compile. In subsequent releases, Lazy mode will be deprecated. Eager mode and torch.compile will be the default.

  • Upgraded DeepSpeed to version 0.14.4.

  • Added support for CustomOP API using Eager mode and torch.compile. See PyTorch CustomOp API.

  • Added support for custom operators and Fused optimizers in Eager mode. See Fused Optimizers and Custom Ops for Intel Gaudi.

  • Improved FusedSDPA performance when using a triangular Softmax mask in training and inference. The performance gain can vary depending on the query tensor sequence length. See Using Fused Scaled Dot Product Attention (FusedSDPA).

  • Added a Python hinted SDPA op, which allows users to pass a hint to the Intel Gaudi software backend for refining optimizations.

  • Added torch.cond support in torch.compile. See PyTorch Support Matrix.

  • Added support for Mixture of Experts (MoE) custom ops. See Fused Optimizers and Custom Ops for Intel Gaudi.
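The autoloading item above can be illustrated with a minimal sketch of device selection that does not import habana_frameworks.torch.hpu. It assumes that, with autoloading active, importing torch is enough to register the torch.hpu module; see PyTorch Autoloading for the authoritative behavior. The helper name select_device and the CPU fallback are illustrative and not part of the Intel Gaudi software.

```python
def select_device() -> str:
    """Return "hpu" when torch exposes an available HPU backend, else "cpu"."""
    try:
        import torch  # assumption: autoloading registers torch.hpu on import
    except ImportError:
        # torch is not installed at all, so only the CPU fallback applies.
        return "cpu"

    # torch.hpu exists only once the Intel Gaudi bridge has been loaded.
    hpu = getattr(torch, "hpu", None)
    if hpu is not None and hpu.is_available():
        return "hpu"
    return "cpu"


device_name = select_device()
print(f"Selected device: {device_name}")
```

On a machine without the Intel Gaudi software installed, the helper falls back to "cpu", so the same script can be used for local debugging.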

Known Issues and Limitations - 1.18.0

General

  • The EDP test is not functional on RHEL9.2/9.4 operating systems.

  • Intel Gaudi Media Loader is not supported on RHEL9.4 OS with Python 3.11.

Firmware

When using the hl-smi tool on Gaudi 3 while the following test plugins are running, the tool outputs 0% OAM utilization:

  • HBM_DMA_STRESS

  • HBM_TPC_STRESS

  • HBM_FULL_DATA_CHECK

  • E2E concurrency

  • SER

These low-level tests use a special mode that ensures low latency and fast execution. This mode leaves no trace on the utilization calculation.

PyTorch

  • To bypass a performance issue in Linux kernel version >= 5.9 (e.g. Ubuntu 22.04), the intel_idle driver must be disabled by adding intel_idle.max_cstate=0 to the kernel command line.

  • Support for torch.compile is in an early stage. Models may not work (due to missing OP implementations) or performance may be affected.

  • Support for Eager mode is in an early stage. Models may not work (due to missing OP implementations) or performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1. See Eager Mode.

  • Model checkpointing for ResNet50 in torch.compile mode is broken. This will be fixed in the next release.

  • Timing events where enable_timing=True may not provide accurate timing information.

  • Handling of dynamic shapes can be enabled by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag. The flag is disabled by default but enabled selectively for several models. For best performance, users should follow the guidance on how to work with dynamic shapes in the Handling Dynamic Shapes document.

  • Graphs displayed in TensorBoard have some minor limitations, e.g., an operator’s assigned device is displayed as “unknown device” when it is scheduled to HPU.

  • HPU tensor strides might not match those of CPU as tensor storage is managed differently. References to tensor storage (such as torch.as_strided) should take into account the input tensor strides explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.

  • Weights sharing:

    • Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module. For more details, refer to Weight Sharing.

    • Weights are not shared with operators outside of the PyTorch library (i.e. PyBind11 functions).

  • User-defined attributes in HPU torch.nn.Parameter are not preserved after torch.nn.Parameter is assigned with a CPU tensor.

  • Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.

  • For torch.nn.Parameter which is not created inside torch.nn.Module:

    • When two torch.nn.Parameter are on CPU storage and referencing the same parameter, the connection will be lost if one of them is moved to HPU.

    • Assigning a CPU tensor to HPU torch.nn.Parameter is not supported.

  • Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.

  • Using torch.device("hpu:x") (for example, in model.to), where x is a rank greater than 0, may lead to memory leaks. Instead, always use torch.device("hpu") to access the current rank.

  • Added the capability to serialize constant tensors, enabling recipe caching to disk for inference scenarios. However, due to a technical limitation, sharing recipes between cards on a single server is not possible. Recipes from each card are stored in separate directories, leading to increased usage of disk space.

  • Performing view-related operations on tensors with INT64 data type (torch.long) in Lazy mode can lead to incorrect results. If this data type is not required, the script should work with INT32 tensors (torch.int). By default, PyTorch creates integer tensors with torch.long data type, so make sure to explicitly create INT32 tensors. This limitation does not apply to Eager + torch.compile mode (PT_HPU_LAZY_MODE=0).

  • Launching ops with tensor inputs from mixed devices (hpu and cpu) is not supported in Eager + torch.compile mode (PT_HPU_LAZY_MODE=0). All tensors need to reside on hpu. Launching ops with tensor inputs from mixed devices is supported in Lazy mode in which internal transfers to hpu are performed.
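The intel_idle workaround in the first item of this list can be verified on a running system. The following is a minimal sketch, assuming a Linux host that exposes the boot parameters in /proc/cmdline; the function names are illustrative and not part of any Intel Gaudi tool.

```python
from pathlib import Path

# Kernel parameter named in the workaround above.
REQUIRED_PARAM = "intel_idle.max_cstate=0"


def intel_idle_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line contains intel_idle.max_cstate=0."""
    # Parameters are whitespace-separated, so compare whole tokens
    # to avoid matching values such as intel_idle.max_cstate=01.
    return REQUIRED_PARAM in cmdline.split()


def read_kernel_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text().strip()
```

Calling intel_idle_disabled(read_kernel_cmdline()) on the target host returns False when the parameter still needs to be added to the kernel command line.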
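Several items in this list depend on environment variables being set before a workload starts. The sketch below builds the environment for a child process using only the variables and values named above (PT_HPU_LAZY_MODE=0 and PT_HPU_MAX_COMPOUND_OP_SIZE=1). The helper names and the script path in the usage note are hypothetical.

```python
import os
import subprocess
import sys


def eager_mode_env(base_env=None):
    """Copy of the environment selecting Eager + torch.compile mode."""
    env = dict(os.environ if base_env is None else base_env)
    env["PT_HPU_LAZY_MODE"] = "0"
    return env


def emulated_eager_env(base_env=None):
    """Copy of the environment limiting Lazy mode cluster sizes to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "1"
    return env


def run_script(script, env):
    """Run a Python script in a child process with the given environment."""
    return subprocess.run([sys.executable, script], env=env, check=True)
```

For example, run_script("train.py", eager_mode_env()) would launch a hypothetical train.py in Eager + torch.compile mode without changing the environment of the parent process.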