2. Release Notes
2.1. New Features and Enhancements
2.1.1. General Features
Python 3.7 and Python 3.8 are supported for both TensorFlow and PyTorch. Python 3.6 is no longer supported.
Added Trace Analyzer capabilities to the Habana Labs Trace Viewer. See the Profiler User Guide for further details.
Added support for TensorFlow 2.4.1 and 2.5.0. TensorFlow 2.2.2 is no longer supported. In general, we plan to upgrade support to the latest two minor versions of the framework with each release.
Added graph visualization support in TensorBoard. See the Debugging Guide for usage, and review Known Issues and Limitations.
Added a new feature that attempts to delegate computations to the CPU when failures occur at runtime. See the Delegating Computations to CPU section for more information.
[Beta] Added support for multi-worker distributed training using tf.distribute with the HPUStrategy class. See Distributed Training with TensorFlow for details.
Enabled the option to provide user-specified configuration files for mixed precision training. See the TensorFlow Mixed Precision Training on Gaudi section for details.
Currently in beta.
Added support for PyTorch v1.7.1. PyTorch v1.5 is no longer supported.
Only Eager mode and Lazy evaluation mode will be supported going forward. TorchScript graph mode will be deprecated in the next release.
Enabled lazy mode support for the ResNet50, BERT and DLRM reference models. Enabled support for the ResNeXt101 topology in the reference models.
Enabled support for mixed precision training with the Habana Mixed Precision (HMP) package. See PyTorch Mixed Precision Training on Gaudi for details.
Added support for Kubernetes with the Gaudi device plugin, MPI Operator and Helm chart for ease of deployment. See the Kubernetes User Guide for more information.
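The version support listed above can be sketched as a small environment check. This is a hypothetical helper, not part of the Habana software: the function names and the table layout are assumptions made for illustration, and the version numbers are copied from these notes.

```python
import sys

# Versions copied from the release notes above (illustrative table, not an official API).
SUPPORTED_PYTHON = {(3, 7), (3, 8)}
SUPPORTED_FRAMEWORKS = {
    "tensorflow": {"2.4.1", "2.5.0"},
    "pytorch": {"1.7.1"},
}


def is_supported_python(version_info=None):
    """Return True if the (major, minor) Python version is listed as supported."""
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def is_supported_framework(name, version):
    """Return True if the framework version is listed as supported.

    A local build suffix such as "+cpu" is ignored before comparing.
    """
    base_version = version.split("+", 1)[0]
    return base_version in SUPPORTED_FRAMEWORKS.get(name.lower(), set())


if __name__ == "__main__":
    print("Python supported:", is_supported_python())
    print("TensorFlow 2.2.2 supported:", is_supported_framework("tensorflow", "2.2.2"))
```

A check like this could run at the start of a training script so that an unsupported interpreter or framework version is reported before any device work begins.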
2.2. Known Issues and Limitations
Models based on TensorFlow 1.x must be converted to TensorFlow 2. The TF XLA compiler option is currently not supported.
Control flow ops such as tf.cond and tf.while_loop are currently not supported on Gaudi and will fall back to the CPU for execution.
Eager mode in TensorFlow 2 is not supported and must be disabled to run TensorFlow models on Gaudi.
Dynamic shapes are supported with limited performance. This will be addressed in future releases.
Distributed Training with TensorFlow: Only HPUStrategy is supported for Gaudi. Other TensorFlow built-in distribution strategies such as MirroredStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy, ParameterServerStrategy are not supported. HPUStrategy is not supported on TensorFlow 2.5.
TensorBoard graph visualization is not supported for models trained with Keras or Estimator. Only TFv2 native APIs (trace_on/trace_export) are supported.
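The TensorFlow limitations above can be collected into a pre-flight check that runs before a job is launched. The sketch below is pure Python and does not touch TensorFlow or the Habana software. The function name, its parameters and the message wording are assumptions made for illustration.

```python
# Strategy names listed as unsupported in the notes above.
UNSUPPORTED_STRATEGIES = {
    "MirroredStrategy",
    "MultiWorkerMirroredStrategy",
    "CentralStorageStrategy",
    "ParameterServerStrategy",
}


def check_tf_config(tf_version, strategy=None, eager=False, uses_xla=False):
    """Return a list of limitation messages for a planned TensorFlow run.

    An empty list means none of the limitations encoded here apply.
    """
    major_minor = tuple(tf_version.split(".")[:2])
    problems = []
    if major_minor[0] == "1":
        problems.append("TensorFlow 1.x models must be converted to TensorFlow 2")
    if eager:
        problems.append("Eager mode must be disabled")
    if uses_xla:
        problems.append("The TF XLA compiler option is not supported")
    if strategy in UNSUPPORTED_STRATEGIES:
        problems.append(f"{strategy} is not supported; only HPUStrategy is")
    if strategy == "HPUStrategy" and major_minor == ("2", "5"):
        problems.append("HPUStrategy is not supported on TensorFlow 2.5")
    return problems


if __name__ == "__main__":
    for message in check_tf_config("2.5.0", strategy="HPUStrategy", eager=True):
        print("WARNING:", message)
```

These notes do not name the mechanism for disabling eager mode. One standard TensorFlow call for this is `tf.compat.v1.disable_eager_execution()`, invoked before any graph or model is built.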
PyTorch support is under active development and available in beta.
PyTorch dataloader may consume a significant portion of the training time, impacting overall model performance.
Convolution weight ordering for vision models: Users will need to manually handle this by following the guidelines in the Gaudi Migration Guide (see Convolution Weight Ordering in PyTorch Habana Vision Topologies). This will be improved in subsequent releases.
Dynamic shapes are not supported and will be enabled in future releases.
2.2.3. Habana Communication Library
Single-process multiple-device support in HCCL: Because multi-node (cross-chassis) scaling requires multiple processes, HCCL supports only the one-device-per-process mode, so users do not need to differentiate between inter-node and intra-node use cases.
COMM group support in HCCL: Each worker can be assigned, at most, to a single comm group.
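The comm group limitation above (each worker belongs to at most one comm group) can be checked on a planned assignment before launching a job. The sketch below is plain Python and does not call HCCL. The function name and the dictionary layout (group name mapped to worker ranks) are assumptions made for illustration.

```python
from collections import defaultdict


def find_multi_group_workers(comm_groups):
    """Return workers that appear in more than one comm group.

    comm_groups maps a group name to an iterable of worker ranks.
    The result maps each offending worker to the sorted list of its groups.
    """
    membership = defaultdict(set)
    for group, workers in comm_groups.items():
        for worker in workers:
            membership[worker].add(group)
    return {
        worker: sorted(groups)
        for worker, groups in membership.items()
        if len(groups) > 1
    }


if __name__ == "__main__":
    plan = {"group_a": [0, 1], "group_b": [1, 2]}
    for worker, groups in find_multi_group_workers(plan).items():
        print(f"worker {worker} is assigned to multiple comm groups: {groups}")
```

An empty result means the plan respects the limitation. Any worker that is reported needs to be moved so that it belongs to a single group.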
The ResNet-50 training stress test plugin can be run only on Ubuntu 18.04 or on Ubuntu 18.04 with Docker.