Release Notes v1.7.0
New Features and Enhancements - 1.7.0
The following documentation and packages correspond to the latest software release version from Habana: 1.7.0-665. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
General Features
Upgraded Libfabric version to 1.16.1.
Added EKS 1.23 support.
Removed support for EKS 1.20.
Upgraded OpenShift version to 4.11.
Habana now provides a utility that automatically installs all packages for both the SynapseAI SW stack and TensorFlow/PyTorch. See the Installation Guide.
TensorFlow
Added support for TensorFlow 2.10.0.
Upgraded supported TensorFlow from 2.8.2 to 2.8.3.
Removed support for TensorFlow 2.9.1.
Upgraded Habana Horovod version to v0.25.0.
Fixed stability issues in DenseNet.
Enabled the following for Gaudi2. See Model References GitHub page.
SSD on 1 and 8 cards
UNet2D on 1 and 8 cards
UNet3D on 1 and 8 cards
Transformer on 1 and 8 cards
PyTorch
This release of SynapseAI was validated with PyTorch Lightning v1.7.7.
Enabled Medical Segmentation Decathlon (BraTS) for Habana Media Loader. See Using Media Loader with PyTorch.
Added a simple DeepSpeed model example to illustrate basic steps for migration and model execution. See Getting Started with DeepSpeed.
Added support for user Events. See Stream APIs and Event APIs.
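A minimal sketch of recording and synchronizing a user Event is shown below; it assumes the Stream and Event classes under habana_frameworks.torch.hpu mirror their torch.cuda counterparts, as described in Stream APIs and Event APIs:

    import torch
    import habana_frameworks.torch.hpu as htpu  # assumed module path

    stream = htpu.Stream()   # user-created stream
    event = htpu.Event()     # user-created event

    with htpu.stream(stream):                    # enqueue work on the stream
        x = torch.ones(1024, device="hpu") * 2.0
        event.record(stream)                     # mark a point in the stream

    event.synchronize()      # block until the recorded work has completed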
Deprecated configuration options -o1 and -o2. The default behavior is now the same as the “-o1” mode in previous releases and includes an expanded list of BF16 and FP32 Ops. The option to provide custom Op lists is still available, as sketched below. See PyTorch Mixed Precision Training on Gaudi.
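As an illustration of the custom Op list option, the sketch below assumes the hmp.convert() helper and its file-path arguments described in PyTorch Mixed Precision Training on Gaudi; the op-list file names are hypothetical (one op name per line):

    from habana_frameworks.torch.hpex import hmp

    # Hypothetical file paths; omitting both keeps the default expanded
    # BF16/FP32 Op lists.
    hmp.convert(bf16_file_path="./ops_bf16.txt",
                fp32_file_path="./ops_fp32.txt",
                isVerbose=False)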
Enabled ResNet50 with LARS optimizer on 8 cards for first-gen Gaudi. See Model References GitHub page.
Enabled the following for Gaudi2. See Model References GitHub page.
SSD on 8 cards
Transformer on 8 cards
Enabled the following for first-gen Gaudi and Gaudi2. See Model References GitHub page.
BLOOM 7B on 1 card for inference
Stable Diffusion on 1 card for inference
Known Issues and Limitations - 1.7.0
TensorFlow
When using the TF dataset cache feature with a large dataset, setting hugepages for host memory may be required. Refer to the SSD_ResNet34 Model Reference for instructions on setting hugepages.
Models currently based on TensorFlow 1.x must be converted to TensorFlow 2.x. The TF XLA compiler option is currently not supported.
Control flow ops such as tf.cond and tf.while_loop are currently not supported on Gaudi and will fall back to the CPU for execution.
Eager mode in TensorFlow 2 is not supported and must be disabled to run TensorFlow models on Gaudi. To disable Eager mode, see Creating a TensorFlow Example.
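For example, Eager mode can be disabled at the top of a training script with the standard TensorFlow 2 API (a minimal sketch; see Creating a TensorFlow Example for the full setup):

    import tensorflow as tf

    # Run TF2 models in graph mode on Gaudi by disabling eager execution.
    tf.compat.v1.disable_eager_execution()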
Distributed training with tf.distribute is enabled only with HPUStrategy. Other TensorFlow built-in distribution strategies, such as MirroredStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy, and ParameterServerStrategy, are not supported.
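A minimal sketch of enabling HPUStrategy, assuming the import paths used in Habana's TensorFlow distributed training documentation:

    import tensorflow as tf
    from habana_frameworks.tensorflow import load_habana_module
    from habana_frameworks.tensorflow.distribute import HPUStrategy

    load_habana_module()      # register the Habana device with TensorFlow
    strategy = HPUStrategy()

    with strategy.scope():    # variables created here are placed per worker
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        model.compile(optimizer="sgd", loss="mse")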
EFA installation on Habana’s containers includes OpenMPI 4.1.2, which does not recognize the CPU cores and threads properly in a KVM virtualized environment. To enable identifying the CPU/threads configuration, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation is not applicable to AWS DL1 instances.
PyTorch
PyTorch Lightning currently uses the framework's default dataloader only.
A known issue with the Zero2 implementation results in a decrease in per-card performance under a fixed global batch size when the number of gradient accumulation steps is increased. This will be fixed in a subsequent release.
Support for dynamic shapes is limited. Guidance on working with dynamic shapes is included in the Model Performance Optimization Guide for PyTorch.
For Transformer models, time to train is high due to the evaluation phase.
Graphs displayed in TensorBoard have some minor limitations, e.g., an operator's assigned device is displayed as “unknown device” when it is scheduled to the HPU.
HPU tensor strides might not match CPU tensor strides, as tensor storage is managed differently. Code that references tensor storage directly (such as torch.as_strided) should take the input tensor's strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided, as shown below. For further details, see Tensor Views and TORCH.AS_STRIDED.
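For instance, a transpose can be expressed with a dedicated view function rather than torch.as_strided (a generic PyTorch sketch, not HPU-specific code):

    import torch

    x = torch.arange(6).reshape(2, 3)

    # Fragile: hard-codes strides that may not match HPU tensor storage.
    t_strided = torch.as_strided(x, size=(3, 2), stride=(1, 3))

    # Preferred: view functions derive the strides from the input tensor.
    t_view = x.t()

    assert torch.equal(t_strided, t_view)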
When module weights are shared among two or more layers, using PyTorch with Gaudi requires these weights to be shared after moving the model to the HPU device. For more details, refer to Weight Sharing.
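A minimal sketch of the required ordering, using a hypothetical two-layer module (see Weight Sharing for details):

    import torch
    import habana_frameworks.torch.core  # assumed import that registers the HPU device

    class TiedModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = torch.nn.Linear(16, 16)
            self.decoder = torch.nn.Linear(16, 16)

    model = TiedModel().to("hpu")
    # Share the weights only after the model has been moved to the HPU.
    model.decoder.weight = model.encoder.weight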
EFA installation on Habana’s containers includes OpenMPI 4.1.2, which does not recognize the CPU cores and threads properly in a KVM virtualized environment. To enable identifying the CPU/threads configuration, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation is not applicable to AWS DL1 instances.
The Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
The default DeepSpeed configuration was changed from WARN to IGNORE tag validation to avoid accuracy issues when saving a checkpoint. This will be fixed in a future release.
HPU Graph APIs are currently not supported for training and will be enabled in a future release.
Performance issues may occur with Wav2vec 2.0 on 32 cards. This will be fixed in a future release.
For torch.nn.Parameter functionality:
When two torch.nn.Parameter objects reside on CPU storage and reference the same parameter, the connection between them is lost if one of them is moved to the HPU, as illustrated below.
Assigning a CPU tensor to an HPU torch.nn.Parameter is not supported.
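The first limitation can be illustrated with a short sketch (hypothetical layer names):

    import torch
    import habana_frameworks.torch.core  # assumed import that registers the HPU device

    lin_a = torch.nn.Linear(8, 8)
    lin_b = torch.nn.Linear(8, 8)
    lin_b.weight = lin_a.weight   # both parameters reference the same CPU storage

    lin_a = lin_a.to("hpu")
    # lin_a.weight now lives on the HPU while lin_b.weight stays on the CPU,
    # so the two parameters no longer share storage.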
Inference on Gaudi:
Specifying torch.device('hpu') as the map_location in torch.jit.load is currently not supported. This will be fixed in a future release.
Inference using HPU Graphs has been validated only on single cards.
HPU Graphs offer the best performance with minimal host overhead. However, their functionality is currently limited:
Only models that run completely on HPU have been tested. Models that contain CPU Ops are not supported. The log will indicate when this is the case.
HPU Graphs can be used only to capture and replay static graphs. Dynamic shapes are not supported.
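A minimal single-card inference sketch, assuming the wrap_in_hpu_graph() helper referenced in the HPU Graphs documentation:

    import torch
    import habana_frameworks.torch as ht  # assumed top-level package alias

    model = torch.nn.Linear(32, 32).to("hpu").eval()
    model = ht.hpu.wrap_in_hpu_graph(model)   # capture once, replay on later calls

    with torch.no_grad():
        # Inputs must keep a static shape across invocations.
        out = model(torch.randn(8, 32, device="hpu"))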
For DeepSpeed BERT-5B, the possible micro batch size is limited due to higher than expected memory consumption. This will be improved in future releases.
When using CTRL+C to stop the training, some of the devices may not be released. Therefore, starting a new training session will fail on device acquire. To release the devices:
Run echo 1 | sudo tee /sys/devices/virtual/habanalabs/hl?/hard_reset
If the hard reset does not work, set PT_HPU_ERROR_HANDLER=False and then run echo 1 | sudo tee /sys/devices/virtual/habanalabs/hl?/hard_reset
This will be fixed in a future release.
Habana Communication Library
Single Process Multiple Device support in HCCL: Since multiple processes are required for multi-node (cross-chassis) scaling, HCCL supports only the one-device-per-process mode, so users do not need to differentiate between inter-node and intra-node use cases.
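A minimal sketch of the one-device-per-process model, assuming the hccl backend registration used in Habana's PyTorch distributed training documentation:

    import os
    import torch.distributed as dist
    import habana_frameworks.torch.distributed.hccl  # assumed import that registers the hccl backend

    # Each process drives exactly one Gaudi device.
    dist.init_process_group(backend="hccl",
                            rank=int(os.environ["RANK"]),
                            world_size=int(os.environ["WORLD_SIZE"]))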