2. Best Practices for Model Training with Gaudi

2.1. Introduction

Habana Gaudi provides well-balanced compute, memory, and networking for training medium- to large-scale models. Gaudi excels at performing dense matrix and vector computations. The Habana SynapseAI software stack integrates with popular deep learning frameworks (TensorFlow, PyTorch), fetches the computational graph, and compiles it into an executable suitable for deployment on Habana Gaudi.

2.2. Best Practices

To get the most benefit out of Habana Gaudi:

  1. The input data pipeline should not be a bottleneck.

  2. The model should be mostly static.

    1. The shapes of input and intermediate tensors should not change between invocations. Support for more dynamic shapes is under active development, but at this point it is strongly advised to modify the model code to avoid such dynamicity.

  3. Ensure that the convolution operations in your model are channel-major (that is, use the NHWC data layout).

  4. If your model contains matrix multiplications, ensure that M, N, and K are evenly divisible by 64 when the op is performed in the FP32 data type, or by 128 when the op is performed in the BF16 data type.

  5. For operators other than convolution and matrix multiplication, especially operators that are not element-wise, ensure that the fastest-changing dimension (typically the channel/feature dimension) is evenly divisible by 64 when the op is performed in FP32, or by 128 when it is performed in BF16.
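The layout and divisibility guidance above can be sketched with a couple of small helpers. The names `nchw_to_nhwc` and `pad_to_multiple` are illustrative only and are not part of the SynapseAI API:

```python
# Illustrative helpers for the layout and divisibility rules above.
# These are hypothetical names, not part of SynapseAI.

def nchw_to_nhwc(shape):
    """Permute a channel-minor (NCHW) shape to channel-major (NHWC)."""
    n, c, h, w = shape
    return (n, h, w, c)

def pad_to_multiple(dim, dtype):
    """Round a matmul dimension (M, N, or K) up to the recommended
    multiple: 64 for FP32, 128 for BF16."""
    multiple = 64 if dtype == "fp32" else 128
    return ((dim + multiple - 1) // multiple) * multiple

# Example: a 4-D activation shape and padded matmul dimensions.
print(nchw_to_nhwc((8, 3, 224, 224)))  # -> (8, 224, 224, 3)
print(pad_to_multiple(100, "fp32"))    # -> 128
print(pad_to_multiple(200, "bf16"))    # -> 256
```

In practice, padding a dimension up to the next multiple trades a small amount of extra memory and compute for much better hardware utilization on the padded op.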

2.3. Profiling

Profiling allows Habana Gaudi users to spot performance bottlenecks and resolve them at the model-script level. Refer to the Profiler User Guide for information on how to use the SynapseAI profiler.
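Independent of the SynapseAI profiler, a coarse first check for an input-pipeline bottleneck is to time data loading against compute per step. The sketch below is generic; the `timed` helper and the stand-in workloads are illustrative, not part of any Habana tooling:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-ins for one step of data loading and one step of compute.
load_batch = lambda: list(range(10_000))
train_step = lambda batch: sum(batch)

batch, load_s = timed(load_batch)
loss, step_s = timed(train_step, batch)

# If load_s routinely dominates step_s, the input pipeline is the
# bottleneck and should be optimized before tuning the model itself.
print(f"load: {load_s:.6f}s  step: {step_s:.6f}s")
```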

2.4. Data Parallelism

Habana Gaudi contains high-throughput, low-latency RDMA NICs that allow efficient scaling from a small to a large number of devices cooperating on training the same network simultaneously. The software is currently optimized for data parallelism, where each Habana Gaudi device holds its own copy of the DL model parameters and performs computation on its slice of the global batch. Gradients are reduced across devices before the optimizer is applied to update the model.
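The gradient reduction described above amounts to an all-reduce that averages each parameter's gradient across devices. A minimal pure-Python sketch follows; real training uses the framework's distributed collectives rather than a helper like this, and `allreduce_mean` is a hypothetical name:

```python
def allreduce_mean(grads_per_device):
    """Average per-parameter gradients across devices.

    grads_per_device[d][p] is the gradient of parameter p on device d.
    In data parallelism every device receives the same averaged result
    and then applies an identical optimizer step.
    """
    n_devices = len(grads_per_device)
    return [sum(per_param) / n_devices
            for per_param in zip(*grads_per_device)]

# Two devices, each holding gradients for two parameters.
local_grads = [[1.0, 2.0], [3.0, 4.0]]
print(allreduce_mean(local_grads))  # -> [2.0, 3.0]
```

Because every device applies the same averaged gradients, the model replicas stay in sync without ever exchanging parameters directly.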

2.5. Model Examples

Examples of models that adhere to the above criteria:

  1. Vision models, such as

    1. Object Classification: ResNets, ResNexts

    2. Object Detection and Segmentation: UNet2D, UNet3D, SegNet, SSD

  2. Language models: most Transformer-based models, including BERT.

For a full list of published models and their performance, refer to the Habana Model References repository.