Release Notes

Note

For previous versions of the Release Notes, please refer to Previous Release Notes.

New Features and Enhancements - 1.16.2

The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.16.2-2. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

This release includes bug fixes for scaling large workloads.

New Features and Enhancements - 1.16.1

The following documentation and packages correspond to Intel® Gaudi® software release version 1.16.1-7. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

This release includes bug fixes, mainly focusing on improvements to stability for long runs of large language model pre-training.

New Features and Enhancements - 1.16.0

The following documentation and packages correspond to Intel® Gaudi® software release version 1.16.0-526. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

This release supports the Intel Gaudi 3 AI accelerator for data center OEM and ODM manufacturing and testing. Public access to the Intel Gaudi 3 accelerator will be available soon.

General

  • Added vLLM support for Gaudi 2. Performance optimizations will be added in subsequent releases. See the Intel Gaudi vLLM fork.

  • Added Ray.io support for Gaudi 2 and Gaudi 3.

  • Added a new tutorial on how to use and optimize TGI-gaudi with Hugging Face-based models.

  • Added support for the Slurm Workload Manager with Gaudi. See Using Slurm Workload Manager with Intel Gaudi.

  • Intel Gaudi now provides a Remote Trace Viewer tool that allows users to analyze multiple remote trace profiles. See Remote Trace Viewer Tool.

  • Added RHEL 8.6 support for Gaudi 2.

  • Added support for Float16 data type to the Habana Collective Communications Library (HCCL).

  • Debian 10.10 will be deprecated in the next release.

PyTorch

  • Published the scripts to reproduce MLPerf 4.0 results with LLaMA 70B and Stable Diffusion XL inference benchmarks on Gaudi 2 with the latest Intel Gaudi software. See Model References GitHub repository.

  • The Hugging Face Optimum-Habana library now supports the updated TGI-gaudi version 2.0.0. See https://github.com/huggingface/tgi-gaudi.

  • Upgraded to PyTorch version 2.2.2. See PyTorch Support Matrix.

  • Upgraded to DeepSpeed version 0.14.0.

  • Validated the Intel Gaudi 1.16.0 software release on PyTorch Lightning version 2.2.4. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/.

  • The 1.16.0 release has been validated with the Hugging Face Optimum-Habana library and model version 1.11.1. Future releases of the Optimum-Habana library may also be validated with this release. Please check the Support Matrix for the full list of supported versions.

  • Intel Gaudi offers a wide range of models using Eager mode and torch.compile. Lazy mode will be deprecated in a subsequent release, after which Eager mode and torch.compile will be the default.

  • Added the following Megatron-DeepSpeed models:

    • LLaMA 2 70B FP8

    • Mixtral 4x7B

    • Mixtral 8x7B

  • The Quantization Toolkit (HQT) now supports torch.nn.functional.scaled_dot_product_attention(). See Run Inference Using FP8.

  • For models using torch.compile, aot_hpu_training_backend and aot_hpu_inference_backend are no longer available. Make sure to use hpu_backend instead, as shown in the sketch below.
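
For reference, below is a minimal sketch of compiling a model with the hpu_backend backend. The model and input are placeholders, habana_frameworks is assumed to be installed, and compile/Eager mode is typically selected by running with PT_HPU_LAZY_MODE=0:

    import torch
    import habana_frameworks.torch.core as htcore  # registers the HPU device

    model = torch.nn.Linear(8, 8).to("hpu")

    # Use the unified "hpu_backend"; the former aot_hpu_training_backend and
    # aot_hpu_inference_backend names are no longer available.
    compiled_model = torch.compile(model, backend="hpu_backend")
    output = compiled_model(torch.randn(4, 8).to("hpu"))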

Known Issues and Limitations - 1.16.0

General

To ensure proper registration of the Gaudi external ports by the IBVerbs driver, after loading the drivers, bring down the ports by running the ./manage_network_ifs.sh --down command, and then bring them up by running the ./manage_network_ifs.sh --up command.

PyTorch

  • To bypass a performance issue in Linux kernel versions >= 5.9 (e.g., Ubuntu 22.04), the intel_idle driver must be disabled by adding intel_idle.max_cstate=0 to the kernel command line.

  • Support for torch.compile is at an early stage. Models may not work (due to missing op implementations) or their performance may be affected.

  • Support for Eager mode is at an early stage. Models may not work (due to missing op implementations) or their performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1 (a sketch of this emulation follows this list). See Eager Mode.

  • Model checkpointing for ResNet50 in torch.compile mode is broken. This will be fixed in the next release.

  • Timing events with enable_timing=True may not provide accurate timing information.

  • Dynamic shape handling can be initiated by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag (a sketch of enabling this flag follows this list). The flag is disabled by default but is enabled selectively for several models. For best performance, follow the guidance on how to work with dynamic shapes in the Handling Dynamic Shapes document.

  • Graphs displayed in TensorBoard have some minor limitations, e.g., an operator’s assigned device is displayed as “unknown device” when it is scheduled to the HPU.

  • HPU tensor strides might not match those of CPU tensors, as tensor storage is managed differently. References to tensor storage (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided (a sketch follows this list). For further details, see Tensor Views and TORCH.AS_STRIDED.

  • Weight sharing:

    • Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module (a sketch of this pattern follows this list). For more details, refer to Weight Sharing.

    • Weights are not shared with operators outside of the PyTorch library (e.g., PyBind11 functions).

  • User-defined attributes on an HPU torch.nn.Parameter are not preserved after the torch.nn.Parameter is assigned a CPU tensor.

  • The Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.

  • For a torch.nn.Parameter that is not created inside a torch.nn.Module (a sketch follows this list):

    • When two torch.nn.Parameter objects are on CPU storage and reference the same parameter, the connection between them is lost if one of them is moved to the HPU.

    • Assigning a CPU tensor to an HPU torch.nn.Parameter is not supported.

  • Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.

  • Using torch.device("hpu:x") (for example, in model.to) where x is a rank > 0 may lead to memory leaks. Instead, always use torch.device("hpu") to access the device of the current rank (a sketch follows this list).

  • Added the capability to serialize constant tensors, enabling recipe caching to disk for inference scenarios. However, due to a technical limitation, recipes cannot be shared between cards on a single server. Recipes from each card are stored in separate directories, leading to increased disk space usage.
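
As mentioned in the Eager mode item above, below is a minimal sketch of emulating Eager mode as a subset of Lazy mode by limiting op clusters to a single op. The environment variable must be set before the Habana PyTorch modules are imported:

    import os

    # Limit Lazy-mode op accumulation to single-op clusters, emulating
    # Eager-mode behavior. Must be set before the HPU bridge is loaded.
    os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "1"

    import torch
    import habana_frameworks.torch.core as htcore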
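
Similarly, a minimal sketch of opting in to dynamic shape handling via the flag named above; it is likewise set before the Habana PyTorch modules are imported:

    import os

    # Enable automatic refinement of dynamic shapes (disabled by default).
    os.environ["PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES"] = "1"

    import torch
    import habana_frameworks.torch.core as htcore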
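
A minimal illustration of the tensor view recommendation above; the shapes are arbitrary:

    import torch
    import habana_frameworks.torch.core as htcore

    t = torch.arange(16, device="hpu").reshape(4, 4)

    # Preferred: dedicated view functions, which do not hard-code strides.
    first_col = t.transpose(0, 1)[0]   # first column as a view
    windows = t.unfold(0, 2, 2)        # sliding windows along dim 0

    # Discouraged: torch.as_strided bakes in explicit stride values, which
    # may not match how HPU tensor storage is actually laid out.
    # strided = torch.as_strided(t, size=(2, 2), stride=(4, 1))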
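
A minimal sketch of the supported weight sharing pattern: the shared weight is created inside the module, so the sharing survives the move to the HPU (the module and names here are illustrative):

    import torch
    import torch.nn as nn
    import habana_frameworks.torch.core as htcore

    class TiedLinear(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.fc1 = nn.Linear(dim, dim, bias=False)
            self.fc2 = nn.Linear(dim, dim, bias=False)
            # Create the sharing inside the module: both layers now hold
            # the same parameter object.
            self.fc2.weight = self.fc1.weight

        def forward(self, x):
            return self.fc2(self.fc1(x))

    model = TiedLinear(8).to("hpu")
    assert model.fc1.weight is model.fc2.weight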
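
Conversely, a sketch of the unsupported torch.nn.Parameter patterns described above:

    import torch
    import habana_frameworks.torch.core as htcore

    storage = torch.ones(4)
    p1 = torch.nn.Parameter(storage)  # both parameters are created outside a
    p2 = torch.nn.Parameter(storage)  # module and wrap the same CPU storage
    assert p1.data_ptr() == p2.data_ptr()

    # Moving one parameter to the HPU allocates new device storage, so p2
    # keeps pointing at the old CPU storage and the connection is lost.
    p1 = torch.nn.Parameter(p1.to("hpu"))

    # Also unsupported: assigning a CPU tensor to an HPU parameter.
    # p1.data = torch.zeros(4)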
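
Finally, a minimal sketch of the device selection recommendation, as it would appear in a per-rank worker:

    import torch
    import habana_frameworks.torch.core as htcore

    model = torch.nn.Linear(8, 8)

    # Recommended: the bare "hpu" device resolves to the device assigned to
    # the current rank.
    model.to(torch.device("hpu"))

    # Avoid: an explicit index such as "hpu:1" on a rank > 0 may leak memory.
    # model.to(torch.device("hpu:1"))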