Best Practices for Model Training with Gaudi
Habana Gaudi combines well-balanced compute, memory, and networking to train medium to large-scale models. Gaudi excels at dense matrix and vector computations. The Habana SynapseAI software stack integrates with popular deep learning frameworks (TensorFlow, PyTorch), fetches the computational graph, and compiles it into an executable suitable for deployment on Habana Gaudi.
To get the most benefit out of Habana Gaudi:
The input data pipeline should not be a bottleneck.
The majority of the computation should be performed on the device. To that end, refer to the lists of supported ops for TensorFlow and PyTorch.
If an important op within the inner loop of the computation is not supported, it can be implemented in TPC-C. Otherwise the op executes on the host, and large chunks of data are marshalled between the host and Habana Gaudi.
The following constructs and data types are not supported or have very limited support:
Double precision and complex data types
Int64 computations and indexes
The model should be mostly static: the shapes of input and intermediate tensors should not change between invocations. Support for more dynamic models is under active development, but at this point it is strongly advised to alter the model code to avoid dynamic shapes.
Ensure that the convolution operations in your model use the channel-major data layout (NHWC), in which the channel dimension changes fastest.
If your model contains convolution operators, ensure that:
B*H*W is evenly divisible by 64 when the op is performed in FP32, or by 128 when it is performed in BF16.
The same applies to the number of channels.
The product input_channels * filter_width is 128 or higher.
If your model contains matrix multiplications, ensure that M, N, and K are evenly divisible by 64 when the op is performed in FP32, or by 128 when it is performed in BF16.
For operators other than convolution and matrix multiplication, especially those that are not element-wise, ensure that the fastest changing dimension (typically the channel/feature dimension) is evenly divisible by 64 when the op is performed in FP32, or by 128 when it is performed in BF16.
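The divisibility guidance above can be checked before training starts. The following is a minimal sketch in plain Python, not part of SynapseAI: the helper names (round_up, misaligned_dims, check_conv) are hypothetical, and it assumes that "number of channels" covers both input and output channels and that H and W refer to the tensor the convolution is applied to.

```python
# Required multiples taken from the guidance above: 64 for FP32, 128 for BF16.
REQUIRED_MULTIPLE = {"fp32": 64, "bf16": 128}


def round_up(value, multiple):
    """Return the smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def misaligned_dims(dims, dtype):
    """Map each dimension that violates the guidance to the next aligned size."""
    multiple = REQUIRED_MULTIPLE[dtype]
    return {
        name: round_up(size, multiple)
        for name, size in dims.items()
        if size % multiple != 0
    }


def check_conv(batch, height, width, in_channels, out_channels,
               filter_width, dtype):
    """Apply the three convolution checks listed above.

    Assumption: both input and output channel counts should be aligned.
    Returns (misaligned dimensions, True if input_channels * filter_width < 128).
    """
    issues = misaligned_dims(
        {
            "B*H*W": batch * height * width,
            "input_channels": in_channels,
            "output_channels": out_channels,
        },
        dtype,
    )
    narrow_filter = in_channels * filter_width < 128
    return issues, narrow_filter


# Matrix multiplication with M=512, N=1000, K=768 performed in BF16.
report = misaligned_dims({"M": 512, "N": 1000, "K": 768}, "bf16")
print(report)  # {'N': 1024}
```

In this example only N fails the check, and padding it from 1000 to 1024 would satisfy the BF16 guidance. Whether padding is acceptable depends on the model, so treat the suggested sizes as a starting point rather than a required change.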
Profiling allows Habana Gaudi users to spot performance bottlenecks and resolve them at the model script level. Refer to the Profiler User Guide for information on how to use the SynapseAI profiler.
Habana Gaudi contains a high-throughput, low-latency RDMA NIC that allows efficient scaling from a small to a large number of devices cooperating on training the same network simultaneously. The software is currently optimized for data parallelism, where each Habana Gaudi device holds its own copy of the DL model parameters and performs computation on its slice of the global batch. Gradients are reduced across devices before the optimizer is applied to update the model.
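The data-parallel scheme described above can be illustrated with a small simulation. This is a conceptual sketch in plain Python, not Habana or framework code: the devices are simulated as list entries, all_reduce_mean is a hypothetical stand-in for the cross-device gradient reduction, and the model is a single weight fitted to y = 2x with plain gradient descent.

```python
def local_gradient(weight, batch_slice):
    """Mean gradient of 0.5 * (weight * x - y)**2 over one device's slice."""
    return sum((weight * x - y) * x for x, y in batch_slice) / len(batch_slice)


def all_reduce_mean(values):
    """Stand-in for gradient reduction: every device receives the average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(weights, global_batch, lr):
    """Run one step; `weights` holds one model copy per simulated device.

    Assumes the global batch size is a multiple of the number of devices.
    """
    num_devices = len(weights)
    slice_size = len(global_batch) // num_devices
    slices = [
        global_batch[i * slice_size:(i + 1) * slice_size]
        for i in range(num_devices)
    ]
    # Each device computes a gradient on its own slice of the global batch.
    grads = [local_gradient(w, s) for w, s in zip(weights, slices)]
    # Gradients are reduced across devices before the optimizer update.
    reduced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, reduced)]


global_batch = [(float(x), 2.0 * x) for x in range(1, 9)]  # 8 samples of y = 2x
weights = [0.0] * 4  # four simulated devices, identical initial parameters
for _ in range(200):
    weights = data_parallel_step(weights, global_batch, lr=0.01)
print(weights)  # every replica holds the same value, close to 2.0
```

Because every replica starts from the same parameters and applies the same reduced gradient, the copies stay identical after each step, which is the property that lets each device work on only its slice of the global batch.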
Examples of models that adhere to the above criteria:
Vision models, such as:
Object Classification: ResNets, ResNeXts
Object Detection and Segmentation: UNet2D, UNet3D, SegNet, SSD
Language models: most Transformer-based models, including BERT.
For the full list of published models and their performance, refer to the Habana Model References repository.