Release Notes v1.6.0
New Features and Enhancements - 1.6.0
The following documentation and packages correspond to the latest software release version from Habana: 1.6.0-439. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.
General Features
Added support for RHEL8.6.
Incorporated updates to SynapseAI Profiling Configuration Tool. For more information, refer to Configuration.
Deprecated the synprof_configuration.json file. prof_config.json is the new configuration file, generated by hl-prof-config.
Added support for enabling and disabling all siblings simultaneously. For further information, refer to Multiple Siblings Selection Screen.
Enabled the --invoc arg and --merged arg trace file output options to replace --json arg, --json-compressed arg, --hltv arg and --csv arg. See CLI Configuration Tool - hl-prof-config <args>.
Removed support for scale-out using HCCL-based Host-NIC over TCP. Only scale-out via Host-NIC over OFI is supported.
The Habana EKS AMI does not support Docker as a runtime. Only containerd is supported.
Updated Orchestration versions:
Removed Kubernetes 1.19 support.
Upgraded OpenShift to 4.9.
TensorFlow
Upgraded Habana Horovod version to v0.24.3.
Improved ResNeXt101 Media Loading HW Acceleration for Gaudi2.
PyTorch
Upgraded PyTorch to v1.12.0.
Habana PyTorch container uses port 3022 for SSH. See Step 5 in Running Distributed Training over Multiple DL1 Instances.
Added support for ZeRO-2 and Activation Checkpointing DeepSpeed configurations. See DeepSpeed Validated Configurations.
Enabled COCO dataset for Habana Media Loader. See Using Media Loader with PyTorch.
Introduced preliminary inference capabilities on Gaudi. For further details, refer to Inference on Gaudi.
Added the following APIs. See Habana PyTorch Python API (habana_frameworks.torch).
Stream APIs that can be used to improve training and inference performance by allowing more parallelism in execution of HPU operations. See Stream APIs.
HPU Graphs APIs that can be used to capture and replay static graphs to improve inference performance through minimizing host operations. See HPU Graph APIs.
Added initial support for TensorBoard profiling. See Profiling with PyTorch.
Enabled the following for first-gen Gaudi. See Model References GitHub page.
YOLOX on 1 and 8 cards
Wav2vec 2.0 on 1, 8, 16 and 32 cards
DeepSpeed BERT-5B with LANS optimizer
V-Diffusion on 1 card for inference
BERT-1.2B Parameter with native PyTorch ZeroRedundancyOptimizer
Published the following pretrained models for first-gen Gaudi. See Model References GitHub page.
BERT-L on 32 cards
ResNet-50 on 8 cards
Enabled the following for Gaudi2. See Model References GitHub page.
SSD on 1 card
UNet2D on 1 and 8 cards
UNet3D on 1 and 8 cards
Transformer on 1 card
Habana Qualification Library
The Serdes Loopback test is no longer supported. Use the Serdes Base test instead, as it supports the same functionality.
Known Issues and Limitations - 1.6.0
TensorFlow
When using the TF dataset cache feature with a large dataset, setting hugepages for host memory may be required. Refer to the SSD_ResNet34 Model Reference for instructions on setting hugepages.
Users need to convert models to TensorFlow2 if they are currently based on TensorFlow V1. TF XLA compiler option is currently not supported.
Control flow ops such as tf.cond and tf.while_loop are currently not supported on Gaudi and will fall back on CPU for execution.
DenseNet: Sporadic issues with training on 8 Gaudis may occur.
Eager Mode feature in TensorFlow2 is not supported and must be disabled to run TensorFlow models on Gaudi. To disable Eager mode, see Creating a TensorFlow Example.
Distributed training with tf.distribute is enabled only with HPUStrategy. Other TensorFlow built-in distribution strategies such as MirroredStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy, ParameterServerStrategy are not supported.
EFA installation on Habana's containers includes OpenMPI 4.1.2, which does not recognize CPU cores and threads properly in a KVM virtualized environment. To enable identifying the CPU/thread configuration, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation is not applicable to AWS DL1 instances.
PyTorch
Incorporated updates to PyTorch models for ease of use. If you have models migrated from previous SynapseAI releases (v1.4.1 and below), follow the steps below. See Porting a Simple PyTorch Model to Gaudi for more information.
You must remove permute_params and permute_momentum from your training models, as weight permutation in vision models is no longer required.
You must add mark_step() right after loss.backward() and optimizer.step().
Habana recommends removing load_habana_module for both single card and distributed training, since importing load_habana_module is no longer required.
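The mark_step() placement described above can be sketched as follows. habana_frameworks is only present on Gaudi hosts, so this sketch falls back to a no-op on other machines; the fallback and the toy model are illustrative assumptions, not part of the documented API.

```python
import torch

# habana_frameworks is only available on Gaudi machines; fall back to a
# no-op mark_step and CPU device so the sketch runs anywhere (assumption).
try:
    import habana_frameworks.torch.core as htcore
    mark_step = htcore.mark_step
    device = torch.device("hpu")
except ImportError:
    mark_step = lambda: None
    device = torch.device("cpu")

model = torch.nn.Linear(4, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4, device=device)
target = torch.randn(8, 2, device=device)

loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()
mark_step()       # trigger graph execution right after the backward pass
optimizer.step()
mark_step()       # and again right after the optimizer step
```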
channels_last mode is not supported and will be enabled in a future release.
The cross-entropy loss function (torch.nn.functional.cross_entropy or torch.nn.CrossEntropyLoss) has an accuracy issue in BF16 that will be fixed in a future release. The FP32 version should be used.
Support for dynamic shapes is limited. This will be improved in future releases.
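A minimal sketch of the FP32 cross-entropy workaround; the toy shapes are illustrative assumptions:

```python
import torch

logits = torch.randn(8, 10, dtype=torch.bfloat16)  # e.g. BF16 model output
target = torch.randint(0, 10, (8,))

# Cast logits to FP32 before computing the loss to avoid the BF16
# accuracy issue noted above.
loss = torch.nn.functional.cross_entropy(logits.float(), target)
```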
For Transformer models, time to train is high due to the evaluation phase.
Graphs displayed in TensorBoard have some minor limitations, e.g. an operator's assigned device is displayed as "unknown device" when it is scheduled to HPU.
EFA installation on Habana's containers includes OpenMPI 4.1.2, which does not recognize CPU cores and threads properly in a KVM virtualized environment. To enable identifying the CPU/thread configuration, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation is not applicable to AWS DL1 instances.
HPU tensor strides might not match those of CPU tensors, as tensor storage is managed differently. References to tensor storage (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.
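To illustrate the recommendation, here is a small sketch (on CPU, with toy data as an assumption) of view functions that cover many torch.as_strided use cases without depending on raw storage strides:

```python
import torch

t = torch.arange(12).reshape(3, 4)

# Explicit view functions express intent without assuming a particular
# storage layout, so they behave the same on CPU and HPU:
col = t[:, 1]                 # basic indexing view of one column
transposed = t.transpose(0, 1)
windows = t.unfold(1, 2, 2)   # sliding windows over dim 1, size 2, step 2
```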
When module weights are shared among two or more layers, using PyTorch with Gaudi requires these weights to be shared after moving the model to the HPU device. For more details, refer to Weight Sharing.
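A minimal sketch of the required ordering; the toy module and the "cpu" stand-in for "hpu" are assumptions so the snippet runs anywhere:

```python
import torch

class TiedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(8, 8, bias=False)
        self.decoder = torch.nn.Linear(8, 8, bias=False)

device = torch.device("cpu")  # use "hpu" on a Gaudi host
model = TiedModel().to(device)

# Share the weights only AFTER moving the model to the device, so both
# layers reference the same device-resident tensor:
model.decoder.weight = model.encoder.weight
```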
The Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.
The default DeepSpeed configuration was changed from WARN to IGNORE tag validation to avoid accuracy issues when saving a checkpoint. This will be fixed in a future release.
HPU user streams created via the Stream APIs do not support Events. This will be enabled in a future release.
HPU Graph APIs are currently not supported for training and will be enabled in a future release.
Performance issues may occur with Wav2vec 2.0 on 32 cards. This will be fixed in a future release.
Eager mode is not supported in YOLOX.
For torch.nn.Parameter functionality:
When two torch.nn.Parameter objects are on CPU storage and reference the same parameter, the connection is lost if one of them is moved to HPU.
Assigning a CPU tensor to an HPU torch.nn.Parameter is not supported.
Inference on Gaudi:
Specifying torch.device('hpu') as the map_location in torch.jit.load is currently not supported. This will be fixed in a future release.
Inference using HPU Graphs has been validated only on single cards.
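Until HPU map_location is supported, a common pattern is to load the TorchScript checkpoint onto the CPU and then move it to the device. A minimal self-contained sketch (the temporary file and the "cpu" stand-in for "hpu" are assumptions):

```python
import os
import tempfile

import torch

# Save a small TorchScript model so the sketch is self-contained
scripted = torch.jit.script(torch.nn.Linear(4, 2))
path = os.path.join(tempfile.mkdtemp(), "model.pt")
scripted.save(path)

# Load with a CPU map_location, then move the module to the target device
model = torch.jit.load(path, map_location=torch.device("cpu"))
model = model.to(torch.device("cpu"))  # use "hpu" on a Gaudi host
```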
HPU Graphs offer the best performance with minimal host overhead. However, their functionality is currently limited. Only models that run completely on HPU have been tested. Models that run partially on CPU may not work.
HPU Graphs can be only used to capture and replay static graphs.
Multi-graph and multi-model support is currently experimental.
There is a 15% degradation in Host NIC performance vs. the previous release. This will be fixed in future releases.
Habana Communication Library
Single Process Multiple Device Support in HCCL: Since multiple processes are required for multi-node (cross-chassis) scaling, HCCL supports only one device per process, so users do not need to differentiate between inter-node and intra-node use cases.