Release Notes v1.8.0

New Features and Enhancements - 1.8.0

The following documentation and packages correspond to the latest software release version from Habana: 1.8.0-690. We recommend using the latest release where possible to stay aligned with performance improvements and updated model coverage. Please refer to the Installation Guide for further details.

General Features

  • Added Debian 10.10 support for Gaudi2 Bare Metal Fresh OS.

  • Added Ubuntu 22.04 support for first-gen Gaudi. Support for Ubuntu 18.04 will be deprecated and removed in the next release.

  • Increased the default number of hugepages on AWS DL1 to 21k.

  • For AWS DL1 instances, added EFA peer direct support to improve scale-out performance.

  • Added an internal HCCL auto-detection mechanism that selects and configures either the Gaudi NIC or the external Host NIC, whichever is more efficient. See Scale-Out via Host-NIC.

  • For Gaudi2, the Media Pipe interface is now available. It allows users to prepare batches of processed and augmented images and labels to be fed into training or inference. See Media Pipeline.

  • Added EKS 1.24 support.

PyTorch

  • Upgraded PyTorch to v1.13.1.

  • This release of SynapseAI was validated with PyTorch Lightning v1.8.6.

  • Added HPU Graph APIs for training. See HPU Graphs for Training.

  • Enabled Model Pipeline Parallelism, Model Tensor Parallelism, and BF16Optimizer DeepSpeed configurations for training. See DeepSpeed Validated Configurations.

  • Enabled multiple tenants on PyTorch, allowing users to run multiple independent workloads inside a Docker container or to run multiple Docker images on a single card. See Enabling Multiple Tenants.

  • Added inference support for DeepSpeed on Gaudi. See Inference Using DeepSpeed.

  • Added new checkpoints for PyTorch to the Habana vault: BERT, ResNet, ResNext, UNet2D.

  • The following Hugging Face models were removed from the Model References GitHub page. Users are encouraged to run them directly from Hugging Face Optimum Habana:

    • ALBERT Large, ALBERT XXLarge, DistilBERT, RoBERTa, RoBERTa Large, ELECTRA, Swin-Transformer

  • Enabled the following for Gaudi2. See Model References GitHub page:

    • Wav2Vec on 8 cards for training

    • BLOOM 13B (based on Megatron-DeepSpeed) on 64 cards for training

    • BLOOM 176B (DeepSpeed) on 8 cards for inference

    • ResNet-50 on 1 card for inference

    • ResNext101 on 1 card for inference

    • UNet2D on 1 card for inference

  • Enabled the following for first-gen Gaudi. See Model References GitHub page:

    • Stable Diffusion on 8 cards for training

    • Stable Diffusion 2.1 on 1 card for inference

  • Enabled the following for first-gen Gaudi and Gaudi2. See Model References GitHub page:

    • Stable Diffusion 1.5 on 1 card for inference

    • Wav2Vec on 1 card for inference

TensorFlow

  • Enabled support for TensorFlow version 2.11.0.

  • Dropped support for version 2.10.1. Long term support remains on version 2.8.4.

  • In the next release, version 2.12.0 will be the only version supported and will be used as Long Term Support.

  • Removed the T5-base Hugging Face model from the Model References GitHub page. Users are encouraged to run this model directly from Hugging Face Optimum Habana.

Known Issues and Limitations - 1.8.0

PyTorch

  • PyTorch Lightning currently uses framework default dataloader only.

  • Support for dynamic shapes is limited. Guidance on how to work with dynamic shapes is included in the Model Performance Optimization Guide for PyTorch.

  • For Transformer models, time to train is high due to the evaluation phase.

  • Graphs displayed in TensorBoard have some minor limitations, e.g., an operator’s assigned device is displayed as “unknown device” when it is scheduled to HPU.

  • HPU tensor strides might not match those of CPU, as tensor storage is managed differently. Code that references tensor storage directly (such as torch.as_strided) should take the input tensor strides into account explicitly. It is recommended to use other view functions instead of torch.as_strided. For further details, see Tensor Views and TORCH.AS_STRIDED.

  • Weights can be shared among two or more layers using PyTorch with Gaudi only if they are created inside the module. For more details, refer to Weights Sharing.

  • EFA installation on Habana’s containers includes OpenMPI 4.1.2, which does not recognize the CPU cores and threads properly in a KVM virtualized environment. To have the CPU core and thread configuration identified correctly, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation does not apply to AWS DL1 instances.

  • Python API habana_frameworks.torch.hpu.current_device() returns 0 regardless of the actual device being used.

  • For a torch.nn.Parameter that is not created inside a torch.nn.Module:

    • When two torch.nn.Parameter objects are in CPU storage and reference the same parameter, the connection is lost if one of them is moved to HPU.

    • Assigning a CPU tensor to HPU torch.nn.Parameter is not supported.

  • Inference on Gaudi:

    • HPU Graphs offer the best performance with minimal host overhead. However, their functionality is currently limited:

      • Only models that run completely on HPU have been tested. Models that contain CPU Ops are not supported. If an Op is not supported during HPU Graph capture, the following message appears: “… is not supported during HPU Graph capturing”.

      • HPU Graphs can be used only to capture and replay static graphs. Dynamic shapes are not supported.

      • Data-dependent dynamic flow is not supported with HPU Graphs.

      • Capturing HPU Graphs on models containing in-place view updates is not supported.
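
The tensor-stride limitation listed above can be pictured with a small pure-Python model of strided storage. This is only an illustrative sketch: it does not use PyTorch or reproduce how HPU storage is actually laid out, and the two layouts, the helper names contiguous_strides and read, and the sample values are all assumptions made for the example. It shows why strides hard-coded for one layout give wrong results on another, while strides taken from the input itself stay correct.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def read(storage, shape, strides, offset=0):
    """Materialise a 2-D view of flat storage as nested lists."""
    rows, cols = shape
    return [
        [storage[offset + r * strides[0] + c * strides[1]] for c in range(cols)]
        for r in range(rows)
    ]


# Two hypothetical layouts holding the same logical 2x3 matrix [[0, 1, 2], [3, 4, 5]].
row_major = {"storage": [0, 1, 2, 3, 4, 5], "strides": (3, 1)}
col_major = {"storage": [0, 3, 1, 4, 2, 5], "strides": (1, 2)}

shape = (2, 3)
for layout in (row_major, col_major):
    # Transpose by swapping the strides that the layout itself reports.
    explicit = read(layout["storage"], (3, 2), layout["strides"][::-1])
    # Transpose with strides hard-coded for the row-major case.
    assumed = read(layout["storage"], (3, 2), contiguous_strides(shape)[::-1])
    print("explicit:", explicit, "assumed:", assumed)
```

For the row-major layout both reads agree. For the second layout only the read that uses the layout's own strides returns the transpose, which is the reason view functions that track strides internally are the safer choice.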

Habana Communication Library

  • hcclMin and hcclMax Ops are not supported on Gaudi2.

  • Single Process Multiple Device support in HCCL: Because multiple processes are required for multi-node (cross-chassis) scaling, HCCL supports only the one-device-per-process mode, so users do not need to differentiate between inter-node and intra-node use cases.

Qualification Tool Library

  • Before running the following plugin tests, set the __python_cmd environment variable by running export __python_cmd=python3:

    • ResNet-50 Training Stress Test

    • Memory Bandwidth Test

    • PCI Bandwidth Test
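
When the plugin tests above are launched from a Python wrapper rather than an interactive shell, the same variable can be passed through the child process environment. The sketch below is a minimal illustration under that assumption. The helper name qual_env is hypothetical, and the child process is a stand-in that only echoes the variable, since the actual qualification command line is not given in this note.

```python
import os
import subprocess
import sys


def qual_env(base=None):
    """Copy an environment mapping and add the variable the plugin tests expect."""
    env = dict(os.environ if base is None else base)
    env["__python_cmd"] = "python3"
    return env


# Stand-in child process that only echoes the variable. Replace it with the
# real qualification command, which is not shown in this note.
child = [sys.executable, "-c", "import os; print(os.environ['__python_cmd'])"]
result = subprocess.run(child, env=qual_env(), capture_output=True, text=True, check=True)
print(result.stdout.strip())
```

Copying the environment instead of mutating os.environ keeps the setting scoped to the launched test.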

TensorFlow

  • When using the TF dataset cache feature with a large dataset, setting hugepages for host memory may be required. Refer to the SSD_ResNet34 Model Reference for instructions on setting hugepages.

  • Users need to convert models to TensorFlow2 if they are currently based on TensorFlow V1. TF XLA compiler option is currently not supported.

  • Control flow ops such as tf.cond and tf.while_loop are currently not supported on Gaudi and will fall back to CPU for execution.

  • The Eager mode feature in TensorFlow2 is not supported and must be disabled to run TensorFlow models on Gaudi. To disable Eager mode, see Creating a TensorFlow Example.

  • Distributed training with tf.distribute is enabled only with HPUStrategy. Other TensorFlow built-in distribution strategies such as MirroredStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy, ParameterServerStrategy are not supported.

  • EFA installation on Habana’s containers includes OpenMPI 4.1.2, which does not recognize the CPU cores and threads properly in a KVM virtualized environment. To have the CPU core and thread configuration identified correctly, replace mpirun with mpirun --bind-to hwthread --map-by hwthread:PE=3. This limitation does not apply to AWS DL1 instances.
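
For launch scripts that assemble the mpirun command programmatically, the substitution described for KVM virtualized environments can be applied with a small helper. This is a sketch under stated assumptions: the helper name with_hwthread_binding, the process count, and the script name train.py are placeholders, and only the two binding flags come from the note above.

```python
import shlex

# Binding flags quoted from the known-issue note.
BINDING_FLAGS = ["--bind-to", "hwthread", "--map-by", "hwthread:PE=3"]


def with_hwthread_binding(command):
    """Return the command as an argument list with the binding flags placed after mpirun."""
    args = shlex.split(command)
    if not args or args[0] != "mpirun":
        raise ValueError("expected a command line that starts with mpirun")
    if "--bind-to" in args or "--map-by" in args:
        return args  # the caller already chose a binding policy; leave it alone
    return [args[0], *BINDING_FLAGS, *args[1:]]


# Hypothetical launch line; the process count and script name are placeholders.
launch = with_hwthread_binding("mpirun -np 8 python3 train.py")
print(shlex.join(launch))
```

Returning an argument list rather than a string lets the result be passed straight to subprocess.run without further quoting.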