Release Notes v1.10.0

New Features and Enhancements - 1.10.0

The following documentation and packages correspond to the latest software release version from Habana: 1.10.0-494. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

General Features

  • Added support for EKS 1.25. Dropped support for EKS 1.22.

  • Upgraded OpenShift version to 4.12.

  • Added habanalabs_ib driver support for Gaudi2 on Ubuntu 20.04, RHEL 8.6 and Debian 10.10. The driver allows users to utilize NICs via IBVerbs. In addition, Habana provides a habanalabs-rdma-core package which installs Habana’s libhlib along with libibverbs. See Installation Guide.

PyTorch

  • Upgraded to PyTorch v2.0.1:

    • The default operating mode is Lazy mode (as in previous releases) with an option to use HPU Graphs.

    • torch.compile will be enabled in a future release.

  • Enabled ZeRO-Infinity support with ZeRO-3 for DeepSpeed training. See DeepSpeed Validated Configurations.

  • DeepSpeed activation checkpointing validated with cpu_checkpointing and synchronize_checkpoint_boundary flags. See DeepSpeed Validated Configurations.

  • This release of SynapseAI was validated with PyTorch Lightning v2.0.0.

  • Improved logging mechanism by providing more granular control over the logging level of each module separately. See Runtime Environment Variables.

  • The majority of Habana’s reference models are now migrated to autocast. Support for Habana Mixed Precision (HMP) will be dropped in a subsequent release.

  • Habana now provides a detect_recompilation tool to automatically detect dynamic inputs and dynamic ops in a model. See Handling Dynamic Shapes.

  • Added Gaudi specific debug information in TensorBoard. See Profiling with PyTorch.

  • Moved Transformers training model from Model References to Habana Fairseq GitHub repository.

TensorFlow

Known Issues and Limitations - 1.10.0

PyTorch

  • Support for dynamic shapes is limited. For guidance on working with dynamic shapes, see Handling Dynamic Shapes.

  • Graphs displayed in TensorBoard have some minor limitations, e.g., an operator’s assigned device is displayed as “unknown device” when the operator is scheduled to HPU.

  • HPU tensor strides might not match those of CPU because tensor storage is managed differently. Code that references tensor storage (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.

  • Weights sharing:

    • Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module. For more details, refer to Weight Sharing.

    • Weights are not shared with operators outside of the PyTorch library (i.e. PyBind11 functions).

  • User-defined attributes in HPU torch.nn.Parameter are not preserved after torch.nn.Parameter is assigned with a CPU tensor.

  • EFA installation on Habana’s containers includes OpenMPI 4.1.2, which does not recognize the CPU cores and threads properly in a KVM virtualized environment. So that the CPU core and thread configuration is identified correctly, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation does not apply to AWS DL1 instances.

  • Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.

  • For torch.nn.Parameter which is not created inside torch.nn.Module:

    • When two torch.nn.Parameter objects reside in CPU storage and reference the same parameter, the connection is lost if one of them is moved to HPU.

    • Assigning a CPU tensor to HPU torch.nn.Parameter is not supported.

  • Training/Inference using HPU Graphs: HPU Graphs offer the best performance with minimal host overhead. However, their functionality is currently limited:

    • Only models that run completely on HPU have been tested. Models that contain CPU Ops are not supported. If an Op is not supported during HPU Graphs capturing, the following message appears: “… is not supported during HPU Graph capturing”.

    • HPU Graphs can be used only to capture and replay static graphs. Dynamic shapes are not supported.

    • Data-dependent dynamic flow is not supported with HPU Graphs.

    • Capturing HPU Graphs on models containing in-place view updates is not supported.

  • Saving metrics to a file configured using Runtime Environment Variables is not supported for workloads spawned via torch.multiprocessing.

  • Using torch.compile is not supported. An error message will appear in the error logs.
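The tensor-stride limitation listed above can be illustrated without any device. The following is a minimal pure-Python sketch, not Habana's actual storage layout: it models a strided view as index arithmetic over a flat buffer, and the column-major buffer is only an assumed example of storage that is managed differently. The helper names contiguous_strides and as_strided are hypothetical and defined here for illustration only.

```python
def contiguous_strides(shape):
    """Return row-major (C-contiguous) strides, in elements, for a shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def as_strided(storage, size, stride, offset=0):
    """Read a 2-D window from flat storage the way a strided view indexes it."""
    rows, cols = size
    return [
        [storage[offset + r * stride[0] + c * stride[1]] for c in range(cols)]
        for r in range(rows)
    ]


# The same logical 2x3 tensor [[0, 1, 2], [3, 4, 5]] in two storage layouts.
row_major = [0, 1, 2, 3, 4, 5]  # strides (3, 1)
col_major = [0, 3, 1, 4, 2, 5]  # strides (1, 2), an assumed alternative layout

# Strides hard-coded for a row-major layout only work on row-major storage.
assumed = contiguous_strides((2, 3))
print(as_strided(row_major, (2, 3), assumed))  # [[0, 1, 2], [3, 4, 5]]
print(as_strided(col_major, (2, 3), assumed))  # [[0, 3, 1], [4, 2, 5]]

# Using the strides that match the actual storage recovers the tensor.
print(as_strided(col_major, (2, 3), (1, 2)))   # [[0, 1, 2], [3, 4, 5]]
```

The second call shows why strides copied from a CPU tensor can silently reorder elements when the underlying storage differs, which is the motivation for preferring other view functions over torch.as_strided.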

Habana Communication Library

  • Single Process Multiple Device support in HCCL: Because multi-node (cross-chassis) scaling requires multiple processes, HCCL supports only the one-device-per-process mode, so users do not need to distinguish between inter-node and intra-node use cases.

  • Gaudi2 only: When using the hcclAll2All command with more than one instance (scale-out), unexpected behavior or a hang may be observed.

Qualification Tool Library

Before running the following plugin tests, make sure to set the __python_cmd environment variable with export __python_cmd=python3:

  • ResNet-50 Training Stress Test

  • Memory Bandwidth Test

  • PCI Bandwidth Test

  • Gaudi2 only: When running BER test, use the switch setting -waitTime200.

  • First-gen Gaudi only: The Functional Test 2 plugin fails when the -serdes switch is used.
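When the plugin tests are started from a Python launcher rather than an interactive shell, the __python_cmd setting above can be passed through the child process environment. This is a minimal sketch under that assumption; qual_env is a hypothetical helper, and the actual qualification command line is not shown here.

```python
import os


def qual_env(base_env=None):
    """Return a copy of the environment with __python_cmd set to python3."""
    env = dict(os.environ if base_env is None else base_env)
    env["__python_cmd"] = "python3"
    return env


# Example with a small explicit environment instead of the real one.
env = qual_env({"PATH": "/usr/bin"})
print(env["__python_cmd"])  # python3
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the test, which has the same effect as running export __python_cmd=python3 in the shell beforehand.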

TensorFlow

  • When using TF dataset cache feature where the dataset size is large, setting hugepage for host memory may be required. Refer to SSD_ResNet34 Model Reference for instructions on setting hugepage.

  • Users need to convert models to TensorFlow2 if they are currently based on TensorFlow V1. TF XLA compiler option is currently not supported.

  • Control flow ops such as tf.cond and tf.while_loop are currently not supported on Gaudi and will fall back to CPU for execution.

  • Eager Mode feature in TensorFlow2 is not supported and must be disabled to run TensorFlow models on Gaudi. To disable Eager mode, see Creating a TensorFlow Example.

  • Distributed training with tf.distribute is enabled only with HPUStrategy. Other TensorFlow built-in distribution strategies such as MirroredStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy, ParameterServerStrategy are not supported.

  • EFA installation on Habana’s containers includes OpenMPI 4.1.2, which does not recognize the CPU cores and threads properly in a KVM virtualized environment. So that the CPU core and thread configuration is identified correctly, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation does not apply to AWS DL1 instances.
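The mpirun replacement described for the EFA limitation can be applied programmatically when launch commands are generated by a script. The following is a minimal sketch: add_hwthread_binding is a hypothetical helper, and the sample command (-np 8 python3 train.py) is an assumed placeholder, not a command from this release.

```python
import shlex

# Flags taken from the workaround in the release notes.
BINDING_FLAGS = ["--bind-to", "hwthread", "--map-by", "hwthread:PE=3"]


def add_hwthread_binding(command):
    """Insert the hwthread binding flags right after a leading mpirun."""
    argv = shlex.split(command)
    if not argv or argv[0] != "mpirun":
        return command  # not an mpirun launch, leave untouched
    if "--bind-to" in argv or "--map-by" in argv:
        return command  # binding already chosen by the caller
    return shlex.join([argv[0], *BINDING_FLAGS, *argv[1:]])


print(add_hwthread_binding("mpirun -np 8 python3 train.py"))
# mpirun --bind-to hwthread --map-by hwthread:PE=3 -np 8 python3 train.py
```

Commands that already set a binding policy are returned unchanged so that an explicit user choice is not overridden.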