Release Notes v1.13.0
New Features and Enhancements - 1.13.0¶
The following documentation and packages correspond to the latest software release version from Habana: 1.17.1-40. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
General¶
Ubuntu 20.04 will be deprecated and replaced with Ubuntu 22.04 starting from SynapseAI 1.14.0.
PyTorch¶
Improved LLaMA v2 7B FP8 inference performance on Gaudi2. See Habana Models Performance page.
Added the following new models for Gaudi2. See Model References GitHub repository:
LLaMA V2 70B on 256 cards for training
Bloom 176B with FP8 on 8 cards for inference
Added FP8 data type support for Gaudi2 training. See FP8 Training with Habana Transformer Engine.
Published the scripts to reproduce MLPerf 3.1 benchmark results on Gaudi2 with the latest SynapseAI release for the following. See Model References GitHub repository:
GPT3 FP8 training on 256/384 cards
Stable Diffusion training on 64 cards
PyTorch Eager mode and Eager mode with torch.compile are available for early preview. See PyTorch Gaudi Theory of Operations for more details. The following models are supported and can be found in our Model References GitHub repository:
PyTorch ResNet50 for Gaudi2 on 1 and 8 cards: Eager mode and Eager mode with torch.compile
PyTorch Lightning ResNet50 for Gaudi2 on 1 and 8 cards: Eager mode with torch.compile
BERT Pretraining phase1 for Gaudi2 on 1 and 8 cards: Eager mode with torch.compile
BERT Nvidia FineTuning for Gaudi2 on 1 and 8 cards: Eager mode with torch.compile
DeepSpeed:
Upgraded Habana’s DeepSpeed fork to version 0.10.3.
Enabled Model Sequence Parallelism DeepSpeed configuration for training. See DeepSpeed User Guide for Training.
Upgraded to PyTorch v2.1.0.
Validated this release of SynapseAI on PyTorch Lightning v2.1.0. See https://lightning.ai/docs/pytorch/stable/integrations/hpu/advanced.html.
TensorFlow¶
Upgraded to TensorFlow v2.13.1.
Python 3.10 is the supported version for TensorFlow.
Support for MPI based Host NIC scale-out using HOROVOD_HIERARCHICAL_ALLREDUCE is deprecated and will be removed in the 1.14.0 release. Use libfabric instead, as further detailed in Distributed Training with TensorFlow.
Known Issues and Limitations - 1.13.0¶
PyTorch¶
Support for torch.compile is in early stage. Models may not work (due to missing OPs implementation) or performance may be affected.
Support for Eager mode is in early stages. Models may not work (due to missing OPs implementation) or performance may be affected. The functionality of Eager mode as a subset of Lazy mode can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable and limiting cluster sizes to 1. See Eager Mode.
Model checkpointing for ResNet50 and BERT pretraining in torch.compile mode is broken. This will be fixed in the next release.
Timing events where enable_timing=True may not provide accurate timing information.
With Hugging Face Optimum-Habana 1.8.1, Falcon-40B and Stable-Diffusion v2.1 models for inference are non-functional; users should wait to run these with the next release of Optimum Habana. Additionally, the Wav2Vec2 Automatic Speech Recognition model is showing a reduction in model accuracy during training.
Handling Dynamic shapes can be initiated by setting the PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES flag. This is disabled by default. For best performance, users should follow the guidance on how to work with Dynamic Shapes in the Handling Dynamic Shapes document.
Graphs displayed in TensorBoard have some minor limitations, e.g. operator’s assigned device is displayed as “unknown device” when it is scheduled to HPU.
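The two environment variables named above (PT_HPU_MAX_COMPOUND_OP_SIZE for emulating Eager mode, PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES for dynamic shape handling) can be set from Python before the Habana bridge is loaded. The following is a minimal sketch only: the helper name apply_hpu_flags is hypothetical, the value "1" for the dynamic shapes flag is an assumption (the notes only say the flag is disabled by default), and it is assumed that the flags are read when the bridge is imported. The two flags are independent, so either one can be set on its own.

```python
import os

# Values are assumptions: "1" limits compound op (cluster) size to 1 as the
# notes describe, and "1" is assumed to switch the dynamic shapes flag on.
HPU_FLAGS = {
    "PT_HPU_MAX_COMPOUND_OP_SIZE": "1",
    "PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES": "1",
}


def apply_hpu_flags(flags, env=None):
    """Write each flag into env (defaults to os.environ) and return env."""
    env = os.environ if env is None else env
    for name, value in flags.items():
        env[name] = value
    return env


apply_hpu_flags(HPU_FLAGS)

# Import the Habana PyTorch bridge only after the flags are in place, e.g.:
# import habana_frameworks.torch.core as htcore
```

Exporting the same variables in the shell before launching the training script has the same effect and avoids any dependence on import order.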
HPU tensor strides might not match that of CPU as tensor storage is managed differently. Reference to tensor storage (such as torch.as_strided) should take into account the input tensor strides explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.
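To see why hard-coded strides are fragile, the sketch below models strided indexing in pure Python; it does not use PyTorch or reflect the actual HPU storage layout. The helper names strided_read and transpose_from_input and the column-major layout are illustrative assumptions. The same logical 2x3 tensor is stored in two layouts; a transpose written with fixed strides is correct for only one of them, while one derived from the input's own strides is correct for both.

```python
def strided_read(storage, size, stride, offset=0):
    """Materialize a 2-D strided view of a flat storage list.

    Element [i][j] is storage[offset + i * stride[0] + j * stride[1]].
    """
    return [
        [storage[offset + i * stride[0] + j * stride[1]] for j in range(size[1])]
        for i in range(size[0])
    ]


def transpose_from_input(storage, size, stride):
    """Transpose by swapping the input's own sizes and strides."""
    return strided_read(storage, (size[1], size[0]), (stride[1], stride[0]))


# The same logical 2x3 tensor [[0, 1, 2], [3, 4, 5]] in two storage layouts.
row_major = {"storage": [0, 1, 2, 3, 4, 5], "size": (2, 3), "stride": (3, 1)}
col_major = {"storage": [0, 3, 1, 4, 2, 5], "size": (2, 3), "stride": (1, 2)}

# Hard-coded strides (1, 3) are a transpose only for the row-major layout.
hardcoded_row = strided_read(row_major["storage"], (3, 2), (1, 3))
hardcoded_col = strided_read(col_major["storage"], (3, 2), (1, 3))

# Deriving the strides from the input gives the transpose in both layouts.
derived_row = transpose_from_input(**row_major)
derived_col = transpose_from_input(**col_major)
```

View functions such as transpose or permute play the role of transpose_from_input here: they work from the tensor's actual strides instead of assuming a layout.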
Weights sharing:
Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module. For more details, refer to Weight Sharing.
Weights are not shared with operators outside of the PyTorch library (i.e. PyBind11 functions).
User-defined attributes in HPU torch.nn.Parameter are not preserved after torch.nn.Parameter is assigned with a CPU tensor.
EFA installation on Habana’s containers includes OpenMPI 4.1.2 which does not recognize the CPU cores and threads properly in a KVM virtualized environment. To enable identifying CPU/Threads configuration, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation is not applicable for AWS DL1 instances.
Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
For torch.nn.Parameter which is not created inside torch.nn.Module:
When two torch.nn.Parameter are on CPU storage and referencing the same parameter, the connection will be lost if one of them is moved to HPU.
Assigning a CPU tensor to HPU torch.nn.Parameter is not supported.
Training/Inference using HPU Graphs: HPU Graphs offer the best performance with minimal host overhead. However, their functionality is currently limited:
Only models that run completely on HPU have been tested. Models that contain CPU Ops are not supported. During HPU Graphs capturing, in case the Op is not supported, the following message will appear: “… is not supported during HPU Graph capturing”.
HPU Graphs can be used only to capture and replay static graphs. Dynamic shapes are not supported.
Data Dependent dynamic flow is not supported with HPU Graphs.
Capturing HPU Graphs on models containing in-place view updates is not supported.
Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.
Using torch.device("hpu:x") (for example, as the argument to model.to), where x is a rank > 0, may lead to memory leaks. Instead, always use torch.device("hpu") to access the current rank.
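Code that builds device strings from a rank can be guarded against the indexed HPU form. The helper below is a hypothetical sketch, not part of the Habana API: it drops the index from any "hpu:x" string and leaves other device strings untouched.

```python
def normalize_hpu_device(spec: str) -> str:
    """Return "hpu" for any HPU device string; pass other strings through."""
    kind, _, _index = spec.partition(":")
    return "hpu" if kind == "hpu" else spec


# Typical use (requires torch and the Habana bridge, so shown as a comment):
# device = torch.device(normalize_hpu_device(f"hpu:{rank}"))
# model = model.to(device)
```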
TensorFlow¶
When using TF dataset cache feature where the dataset size is large, setting hugepage for host memory may be required. Refer to SSD_ResNet34 Model Reference for instructions on setting hugepage.
Users need to convert models to TensorFlow2 if they are currently based on TensorFlow V1. TF XLA compiler option is currently not supported.
Control flow ops such as tf.cond and tf.while_loop are currently not supported on Gaudi and will fall back on CPU for execution.
Eager mode feature in TensorFlow2 is not supported and must be disabled to run TensorFlow models on Gaudi. To disable Eager mode, see Creating a TensorFlow Example.
Distributed training with tf.distribute is enabled only with HPUStrategy. Other TensorFlow built-in distribution strategies such as MirroredStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy, ParameterServerStrategy are not supported.
EFA installation on Habana’s containers includes OpenMPI 4.1.2 which does not recognize the CPU cores and threads properly in a KVM virtualized environment. To enable identifying CPU/Threads configuration, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation is not applicable for AWS DL1 instances.
(Gaudi2) In rare cases, when a hardware accelerated media loader is used, a segmentation fault occurs when closing TensorFlow after training is completed. This may happen due to an error in the order of which the Python interpreter unloads the modules. This issue does not affect training results.
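For launch scripts that assemble the mpirun command programmatically, the replacement described for KVM virtualized environments can be applied as in the following sketch. The helper name add_hwthread_binding and the sample command (-np 8 python train.py) are hypothetical; only the binding flags themselves come from these notes.

```python
import shlex

# Flags taken from the workaround in these notes.
HWTHREAD_FLAGS = ["--bind-to", "hwthread", "--map-by", "hwthread:PE=3"]


def add_hwthread_binding(argv):
    """Return a copy of argv with the binding flags inserted after mpirun.

    Commands that do not start with mpirun are returned unchanged.
    """
    if not argv or argv[0] != "mpirun":
        return list(argv)
    return [argv[0], *HWTHREAD_FLAGS, *argv[1:]]


# Hypothetical launch command, used only to illustrate the rewrite.
original = shlex.split("mpirun -np 8 python train.py")
patched = add_hwthread_binding(original)
print(shlex.join(patched))
```

The patched list can be passed directly to subprocess.run; as noted above, the flags are not needed on AWS DL1 instances.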