Release Notes v1.17¶
New Features and Enhancements - 1.17.1¶
The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.17.1-40. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
This release includes specific bug fixes for hl-qual and hl-smi related to Gaudi 3 functionality, as well as updated firmware for Gaudi 3.
Added SUSE 15.5 support for Gaudi 3.
New Features and Enhancements - 1.17.0¶
The following documentation and packages correspond to the latest software release version from Intel® Gaudi®: 1.17.0-495. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
General¶
Added the Intel Gaudi Base Operator for Kubernetes, allowing users to automate the management of all Intel Gaudi software components in Kubernetes. See Intel Gaudi Base Operator for Kubernetes.
Added the Offline Trace Parser tool, which parses a previously captured trace while the workload is offline. See Offline Trace Parser Tool.
Intel Gaudi now provides the HCL source code.
Added software backward/forward compatibility support for Gaudi 2, allowing users to run 1.16 dockers with Intel Gaudi software version 1.17.0.
Added RHEL 9.4 support for Gaudi 2 and Gaudi 3.
Added OpenShift 4.16 support on RHEL 9.4 for Gaudi 2 and Gaudi 3.
Removed support for Debian 10.10.
Renamed HabanaAI Operator for OpenShift to Intel Gaudi Base Operator for OpenShift.
PyTorch¶
The Hugging Face Optimum for Intel Gaudi library now supports the updated TGI-gaudi version 2.0.1. See https://github.com/huggingface/tgi-gaudi.
Upgraded to PyTorch version 2.3.1. See PyTorch Support Matrix.
Validated the Intel Gaudi 1.17.0 software release on PyTorch Lightning version 2.3.3. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/.
The 1.17.0 release has been validated with Hugging Face Optimum for Intel Gaudi library and model version 1.12.1. Future releases of the Optimum for Intel Gaudi library may be validated with this release. Please check the Support Matrix for a full list of version support.
Intel Gaudi offers a wide range of models using Eager mode and torch.compile. In subsequent releases, Lazy mode will be deprecated. Eager mode and torch.compile will be the default.
Added the following inference reference models for Gaudi 3. See Hugging Face Optimum for Intel Gaudi:
LLaMA 2 7B on 1 card
LLaMA 2 70B up to 2 cards
LLaMA 3 8B on 1 card
Added LLaMA 3 8B FP8 inference model for Gaudi 2. See Hugging Face Optimum for Intel Gaudi.
Inference on FP8 is now achieved using the Intel® Neural Compressor (INC). INC replaces Intel Gaudi’s Quantization Toolkit (HQT). References to HQT in the code will be removed in future releases. See Run Inference Using FP8.
Added support for inference on UINT4 data type. Running inference in UINT4 halves the required memory bandwidth compared to running inference in FP8. See Run Inference Using UINT4.
The GPU Migration Toolkit can now be enabled using the PT_HPU_GPU_MIGRATION=1 environment variable. Using the habana_frameworks.torch.gpu_migration package to enable GPU Migration will be deprecated in a future release. See GPU Migration Toolkit.
Added support for PyTorch Distributed Tensor (DTensor) and Tensor Parallel. DTensor allows sharding tensors across multiple devices and performs operations on those tensors in a distributed manner. See Using DistributedTensor with Intel Gaudi.
Added support for deploying PyTorch models using TorchServe. See TorchServe Inference Server with Gaudi.
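The environment-variable route to enabling the GPU Migration Toolkit, mentioned above, can be sketched as a small launcher. This is only an illustration under stated assumptions: it assumes the variable needs to be in the process environment when the training script starts, and the helper names gpu_migration_env and launch are hypothetical. Only the PT_HPU_GPU_MIGRATION=1 setting itself comes from these notes.

```python
import os
import subprocess
import sys


def gpu_migration_env(base_env=None):
    """Return a copy of the environment with the GPU Migration Toolkit enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable name and value as given in the release notes.
    env["PT_HPU_GPU_MIGRATION"] = "1"
    return env


def launch(script, *args, base_env=None):
    """Run a script (path supplied by the caller) in a child process.

    The child inherits the current environment plus PT_HPU_GPU_MIGRATION=1.
    """
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=gpu_migration_env(base_env), check=True)
```

For example, launch("train.py", "--epochs", "1") would start a hypothetical train.py with the toolkit enabled, which is roughly equivalent to prefixing the command with PT_HPU_GPU_MIGRATION=1 in a shell.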
Known Issues and Limitations - 1.17.0¶
General¶
To ensure proper registration of Gaudi external ports by the IBVerbs driver, after loading the drivers, bring down the ports by running the ./manage_network_ifs.sh --down command, and then bring up the ports by running the ./manage_network_ifs.sh --up command.
EDP test is not functional on RHEL 9.2/9.4 operating systems.
Intel Gaudi Media Loader is not supported in RHEL9.4 OS with Python 3.11.
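The port down/up sequence described in the first item of this section could be automated along the following lines. This is a rough sketch, not an official tool: the helper names are hypothetical, the script path is copied as written in these notes and may need adjusting to the actual install location, and the commands typically require elevated privileges.

```python
import subprocess

# Path as written in the release notes; adjust to where the script is installed.
SCRIPT = "./manage_network_ifs.sh"


def port_cycle_commands(script=SCRIPT):
    """Return the two commands in the required order: down first, then up."""
    return [[script, "--down"], [script, "--up"]]


def cycle_ports(script=SCRIPT, run=subprocess.run):
    """Bring the Gaudi external ports down and back up.

    check=True makes the sequence stop if the first command fails,
    so the ports are never brought up after a failed --down.
    """
    for cmd in port_cycle_commands(script):
        run(cmd, check=True)
```

Calling cycle_ports() after the drivers are loaded runs both steps in order. The run parameter exists only so the sequence can be exercised without touching the network interfaces, for example by passing a function that records the commands.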
PyTorch¶
To bypass a performance issue in Linux kernel version >= 5.9 (e.g. Ubuntu 22.04), the intel_idle driver must be disabled by adding intel_idle.max_cstate=0 to the kernel command line.
Support for torch.compile is in an early stage. Models may not work (due to missing OPs implementation) or performance may be affected.
Support for Eager mode is in an early stage. Models may not work (due to missing OPs implementation) or performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1. See Eager Mode.
Model checkpointing for ResNet50 in torch.compile mode is broken. This will be fixed in the next release.
Timing events where enable_timing=True may not provide accurate timing information.
Handling Dynamic shapes can be initiated by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag. The flag is disabled by default but enabled selectively for several models. For best performance, users should follow the guidance on how to work with Dynamic Shapes in the Handling Dynamic Shapes document.
Graphs displayed in TensorBoard have some minor limitations, e.g. an operator's assigned device is displayed as "unknown device" when it is scheduled to HPU.
HPU tensor strides might not match those of CPU tensors, as tensor storage is managed differently. Code that references tensor storage (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.
Weights sharing:
Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module. For more details, refer to Weight Sharing.
Weights are not shared with operators outside of the PyTorch library (i.e. PyBind11 functions).
User-defined attributes in HPU torch.nn.Parameter are not preserved after torch.nn.Parameter is assigned with a CPU tensor.
Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
For torch.nn.Parameter which is not created inside torch.nn.Module:
When two torch.nn.Parameter are on CPU storage and referencing the same parameter, the connection will be lost if one of them is moved to HPU.
Assigning a CPU tensor to HPU torch.nn.Parameter is not supported.
Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.
Using torch.device("hpu:x") (for example, in model.to), where x is a rank > 0, may lead to memory leaks. Instead, always use torch.device("hpu") to access the current rank.
Added the capability to serialize constant tensors, enabling recipe caching to disk for inference scenarios. However, due to a technical limitation, sharing recipes between cards on a single server is not possible. Recipes from each card are stored in separate directories, leading to increased usage of disk space.
Performing view-related operations on tensors with INT64 data type (torch.long) in Lazy mode can lead to incorrect results. If this data type is not required, the script should work with INT32 tensors (torch.int). By default, PyTorch creates integer tensors with torch.long data type, so make sure to explicitly create INT32 tensors. This limitation does not apply to Eager + torch.compile mode (PT_HPU_LAZY_MODE=0).
Launching ops with tensor inputs from mixed devices (hpu and cpu) is not supported in Eager + torch.compile mode (PT_HPU_LAZY_MODE=0). All tensors need to reside on hpu. Launching ops with tensor inputs from mixed devices is supported in Lazy mode in which internal transfers to hpu are performed.
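Returning to the intel_idle workaround listed first in this section, the following sketch shows one way a host could be checked for it. The function names are hypothetical and the parsing is deliberately simple: it assumes the kernel release string starts with major.minor and that the parameter appears as its own whitespace-separated token on the kernel command line.

```python
import re


def kernel_at_least(release, major=5, minor=9):
    """Return True if a release string such as '5.15.0-91-generic' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def intel_idle_disabled(cmdline):
    """Return True if the kernel command line contains intel_idle.max_cstate=0."""
    return "intel_idle.max_cstate=0" in cmdline.split()


def needs_workaround(release, cmdline):
    """True when the kernel version is affected and the parameter is not set."""
    return kernel_at_least(release) and not intel_idle_disabled(cmdline)
```

On a Linux host, the inputs could come from platform.release() and the contents of /proc/cmdline. A True result from needs_workaround suggests adding intel_idle.max_cstate=0 to the kernel command line and rebooting.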