Release Notes v1.14.0¶
New Features and Enhancements - 1.14.0¶
The following documentation and packages correspond to the latest software release version from Intel Gaudi: 1.14.0-493. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
Important Note: With the acquisition of Habana® Labs by Intel®, the Habana and Gaudi brands have changed to Intel and Intel Gaudi® accelerator or Intel Gaudi AI accelerator. To make this transition as easy as possible for developers and customers, many of the brand references in developer literature remain unchanged. Specifically, no code has changed. You will see continued reference to Habana and HPU in the code and code examples.
General¶
Removed Ubuntu 20.04 support.
Added Ubuntu 22.04 and TencentOS support for Gaudi 2.
PyTorch¶
Improved performance for the following Gaudi 2 models. See the Models Performance page.
LLaMA v2 70B Pre-training
LLaMA v2 70B Fine-tuning
LLaMA v2 70B FP8 for inference
LLaMA v2 70B BF16 for inference
Added DeepSpeed-Chat LLaMA-7B model for Gaudi 2 on 8 cards. See DeepSpeedExamples GitHub repository.
Intel Gaudi now provides a Quantization Toolkit (HQT) which enables model measurement and quantization capabilities for quantizing models to FP8 where possible. See Inference Using FP8; a usage sketch appears at the end of this section. HQT is enabled on LLaMA v2 70B and 7B. Additional models will be supported in a future release.
DeepSpeed:
Upgraded Intel Gaudi’s DeepSpeed fork to version 0.12.4.
Enabled ZeRO++ hpZ configuration for training. See DeepSpeed User Guide for Training.
Moved DeepSpeed-Chat model to DeepSpeedExamples GitHub repository.
For profiling, you can now view .hltv files in https://perfetto.habana.ai.
For models using torch.compile, aot_hpu_training_backend is now deprecated and will be removed in a future release. Replace aot_hpu_training_backend with hpu_backend; a migration sketch appears at the end of this section.
Upgraded to PyTorch v2.1.1.
Validated the Intel Gaudi software 1.14.0 release on PyTorch Lightning v2.1.2. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/advanced.html.
All models using mixed precision should now use Autocast only.
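The following is a minimal sketch of the HQT measurement/quantization flow described in Inference Using FP8. The habana_quantization_toolkit module, its prep_model and finish_measurements calls, and the QUANT_CONFIG file path are assumptions taken from that guide and may differ in your installation; the Linear layer stands in for a real model such as LLaMA v2 7B:

    import os
    # Assumption: QUANT_CONFIG points to a JSON file selecting measurement or
    # quantization mode, as described in the Inference Using FP8 guide.
    os.environ.setdefault("QUANT_CONFIG", "./quant_config.json")  # hypothetical path

    import torch
    import habana_frameworks.torch.core as htcore   # Habana PyTorch bridge
    import habana_quantization_toolkit              # HQT; module/function names assumed

    model = torch.nn.Linear(16, 16).eval().to("hpu")       # stand-in for a real model
    habana_quantization_toolkit.prep_model(model)           # instrument the model

    with torch.no_grad():
        out = model(torch.randn(4, 16).to("hpu"))           # calibration or FP8 inference pass
        htcore.mark_step()                                   # flush the lazy-mode graph

    habana_quantization_toolkit.finish_measurements(model)  # dump stats in measurement mode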
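And a minimal migration sketch for the torch.compile backend change. Only the hpu_backend string comes from this release note; the model, input, and the PT_HPU_LAZY_MODE setting are illustrative assumptions:

    import os
    # Assumption: torch.compile is run in non-lazy mode; this variable is usually
    # exported in the launching shell rather than set in the script.
    os.environ.setdefault("PT_HPU_LAZY_MODE", "0")

    import torch
    import habana_frameworks.torch.core  # registers the HPU device and compile backend

    model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU()).to("hpu")

    # Previously: torch.compile(model, backend="aot_hpu_training_backend") -- deprecated
    compiled = torch.compile(model, backend="hpu_backend")

    out = compiled(torch.randn(8, 32).to("hpu"))
    print(out.cpu().shape)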
TensorFlow¶
Upgraded to TensorFlow v2.15.0.
tensorflow-io is no longer installed on Intel Gaudi dockers.
Starting with v1.15.0, support for TensorFlow will no longer be available.
Known Issues and Limitations - 1.14.0¶
PyTorch¶
To bypass a performance issue in Linux kernel versions >= 5.9 (e.g. Ubuntu 22.04), the intel_idle driver must be disabled by adding intel_idle.max_cstate=0 to the kernel command line.
Support for torch.compile is in early stages. Models may not work (due to missing op implementations) or performance may be affected.
Support for Eager mode is in early stages. Models may not work (due to missing op implementations) or performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1. See Eager Mode and the sketch below.
Model checkpointing for ResNet50 in torch.compile mode is broken. This will be fixed in the next release.
Timing events with enable_timing=True may not provide accurate timing information.
Handling of dynamic shapes can be initiated by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag. This is disabled by default. For best performance, follow the guidance on working with dynamic shapes in the Handling Dynamic Shapes document.
Graphs displayed in TensorBoard have some minor limitations, e.g. an operator's assigned device is displayed as "unknown device" when it is scheduled to HPU.
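A short sketch of the Eager-mode emulation mentioned above, limiting lazy-mode op clusters to size 1. Setting the variable from Python before the Habana bridge is imported is an assumption; exporting it in the launching shell is the usual approach:

    import os
    # Limit lazy-mode op clusters to 1 to approximate Eager mode behavior.
    os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "1"

    import torch
    import habana_frameworks.torch.core as htcore

    x = torch.randn(4, 4).to("hpu")
    y = (x @ x).relu()
    htcore.mark_step()   # each step now executes very small graphs
    print(y.cpu())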
HPU tensor strides might not match those of CPU tensors, as tensor storage is managed differently. Code that references tensor storage directly (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided, as illustrated below. For further details, see Tensor Views and TORCH.AS_STRIDED.
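A small illustration of the recommendation above, using standard view functions instead of torch.as_strided (all calls here are generic PyTorch; no release-specific API is assumed):

    import torch
    import habana_frameworks.torch.core  # enables the HPU device

    t = torch.arange(12.0).reshape(3, 4).to("hpu")

    # Avoid: torch.as_strided(t, size=(4, 3), stride=(1, 4)) -- relies on CPU-style strides.
    # Prefer explicit view functions, which do not depend on the underlying layout:
    transposed = t.transpose(0, 1)   # shape (4, 3)
    flat = t.reshape(-1)             # shape (12,)
    window = t[:, 1:3]               # basic slicing instead of manual strides

    print(transposed.cpu(), flat.cpu(), window.cpu())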
Weight sharing:
Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module; see the sketch below. For more details, refer to Weight Sharing.
Weights are not shared with operators outside of the PyTorch library (i.e. PyBind11 functions).
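A minimal sketch of the supported weight-sharing pattern: the shared parameter is created and tied inside the module before the module is moved to the HPU (the module itself is illustrative):

    import torch
    import torch.nn as nn
    import habana_frameworks.torch.core  # enables the HPU device

    class TiedMLP(nn.Module):
        def __init__(self, dim: int = 16):
            super().__init__()
            self.fc_in = nn.Linear(dim, dim, bias=False)
            self.fc_out = nn.Linear(dim, dim, bias=False)
            # Tie the weights inside the module so the sharing survives the move to HPU.
            self.fc_out.weight = self.fc_in.weight

        def forward(self, x):
            return self.fc_out(torch.relu(self.fc_in(x)))

    model = TiedMLP().to("hpu")
    assert model.fc_in.weight is model.fc_out.weight  # still the same parameter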
User-defined attributes in an HPU torch.nn.Parameter are not preserved after the torch.nn.Parameter is assigned with a CPU tensor.
The Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
For a torch.nn.Parameter which is not created inside torch.nn.Module:
When two torch.nn.Parameter objects are on CPU storage and reference the same parameter, the connection is lost if one of them is moved to HPU.
Assigning a CPU tensor to an HPU torch.nn.Parameter is not supported.
Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.
Using torch.device(hpu:x) (for example, in model.to) where x is a rank > 0 may lead to memory leaks. Instead, always use torch.device(hpu) to access the current rank, as shown in the sketch below.
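A minimal sketch of the recommended device usage from the last note above, addressing the current rank's device without an explicit index:

    import torch
    import habana_frameworks.torch.core as htcore  # enables the HPU device

    # Avoid torch.device("hpu:1") on ranks > 0; the unindexed form resolves to
    # the current rank's device.
    device = torch.device("hpu")

    model = torch.nn.Linear(8, 8).to(device)
    out = model(torch.randn(2, 8).to(device))
    htcore.mark_step()
    print(out.cpu())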
TensorFlow¶
When using the TF dataset cache feature with a large dataset, setting hugepages for host memory may be required. Refer to the SSD_ResNet34 Model Reference for instructions on setting hugepages.
Users need to convert models to TensorFlow 2 if they are currently based on TensorFlow 1. The TF XLA compiler option is currently not supported.
Control flow ops such as tf.cond and tf.while_loop are currently not supported on Gaudi and will fall back on CPU for execution.
The Eager mode feature in TensorFlow 2 is not supported and must be disabled to run TensorFlow models on Gaudi. To disable Eager mode, see Creating a TensorFlow Example; a short setup sketch appears at the end of this section.
Distributed training with tf.distribute is enabled only with HPUStrategy. Other TensorFlow built-in distribution strategies such as MirroredStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy, ParameterServerStrategy are not supported.
(Gaudi 2) In rare cases, when a hardware-accelerated media loader is used, a segmentation fault occurs when closing TensorFlow after training is completed. This may happen due to an error in the order in which the Python interpreter unloads modules. This issue does not affect training results.
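A brief setup sketch combining the Eager mode and HPUStrategy notes above. The habana_frameworks.tensorflow import paths and the load_habana_module call are assumptions based on the Intel Gaudi TensorFlow guides; verify them against your installed package:

    import tensorflow as tf

    # Module and function names below are assumptions taken from the Intel Gaudi
    # TensorFlow guides.
    from habana_frameworks.tensorflow import load_habana_module
    from habana_frameworks.tensorflow.distribute import HPUStrategy

    load_habana_module()                     # register the HPU device with TensorFlow
    tf.compat.v1.disable_eager_execution()   # Eager mode is not supported on Gaudi

    strategy = HPUStrategy()                 # the only supported tf.distribute strategy
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])
        model.compile(optimizer="sgd", loss="mse")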