Best Practices for Model Training with Gaudi

The Intel® Gaudi® AI accelerator combines well-balanced compute, memory, and networking, enabling training of medium- to large-scale models. Gaudi excels at performing dense matrix and vector computations. The Intel Gaudi software stack integrates with PyTorch, the popular Deep Learning framework, fetching the computational graph and compiling it into an executable suitable for deployment on Gaudi.

Best Practices

To get the most benefit out of Gaudi:

  1. The input data pipeline should not be a bottleneck.

  2. The majority of the computation should be performed on the device. To that end, refer to the list of supported ops for PyTorch.

    1. If an important op within the inner loop of the computation is not supported, it can be implemented in TPC-C. This avoids executing it on the host, which would otherwise result in large chunks of data being marshalled between the host and Gaudi.

    2. The following constructs and data types are not supported or have very limited support:

      1. Double precision and complex data types

      2. Int64 computations and indexes

      3. Sparse tensors

  3. The model should be mostly static: the shapes of input and intermediate tensors should not change between invocations. Support for more dynamicity is under active development, but at this point it is strongly advised to alter the model code to avoid it.

  4. Ensure that the convolution operations in your model are channel-major (also known as the NHWC data layout).

  5. If your model contains convolution operators, ensure that:

    1. B*H*W is divisible by 64 when the op is performed in the FP32 data type, or by 128 when it is performed in BF16.

    2. The same applies to the number of channels.

    3. input_channels * filter_width is 128 or higher.

  6. If your model contains matrix multiplication, ensure that M, N, and K are divisible by 64 when the op is performed in the FP32 data type, or by 128 when it is performed in BF16.

  7. For non-convolution/matrix-multiplication operators, especially those that cannot be considered element-wise, ensure that the fastest-changing dimension (typically the channel/feature dimension) is divisible by 64 when the op is performed in the FP32 data type, or by 128 when it is performed in BF16.
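A common way to satisfy the static-shape guidance in item 3 is to pad variable-length inputs to a small, fixed set of shapes ("buckets"), so the compiled graph is reused instead of recompiled on every step. The sketch below is illustrative, not a Gaudi API: the bucket sizes and pad value are assumptions you would tune for your model.

```python
# Illustrative bucketing sketch (not a Gaudi API): pad each variable-length
# input up to the smallest of a few fixed lengths, so the device sees only
# a small number of distinct tensor shapes.

BUCKETS = [128, 256, 512]  # assumed fixed sequence lengths; tune per model


def pad_to_bucket(tokens, pad_id=0):
    """Pad a token list up to the smallest bucket that fits it."""
    for size in BUCKETS:
        if len(tokens) <= size:
            return tokens + [pad_id] * (size - len(tokens))
    raise ValueError(f"sequence of length {len(tokens)} exceeds largest bucket")
```

With this approach, a batch of sequences of lengths 100 and 120 both compile to the 128-length graph, so no recompilation occurs between them.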
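The divisibility guidance in items 5 through 7 amounts to rounding the relevant dimension up to the next multiple of 64 (FP32) or 128 (BF16). A minimal helper, written here as a hypothetical pure-Python sketch of that arithmetic:

```python
# Hypothetical helper for the divisibility rules above: 64 for FP32, 128 for
# BF16. Used to decide how much to pad a channel/feature dimension.

def round_up(n, multiple):
    """Smallest value >= n that is divisible by `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_dim(n, dtype="bf16"):
    """Round a dimension up to the multiple recommended for the data type."""
    multiple = 128 if dtype == "bf16" else 64
    return round_up(n, multiple)
```

For example, a 100-channel FP32 layer would be padded to 128 channels, and a 200-channel BF16 layer to 256.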

Profiling

Profiling allows Gaudi users to spot performance bottlenecks and resolve them at the model-script level. Refer to the Profiler User Guide for information on how to use the Profiler.

Data Parallelism

Gaudi contains high-throughput, low-latency RDMA NICs that allow efficient scaling from a small to a large number of devices cooperating on training the same network simultaneously. The software is currently optimized for data parallelism, where each Gaudi device holds its own copy of the DL model parameters and performs computation on its slice of the global batch. Gradient reduction is performed across devices before the optimizer is applied to update the model.
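The reduction step described above can be sketched as an elementwise average of per-worker gradients. This is a minimal pure-Python illustration of the semantics only; in practice the averaging is done by a collective all-reduce over the RDMA NICs, and the worker count and gradient values here are made up:

```python
# Minimal sketch of data-parallel gradient reduction: each worker computes
# gradients on its slice of the global batch, then gradients are averaged
# elementwise across workers before the optimizer step. Illustrative only.

def allreduce_mean(per_worker_grads):
    """Average gradients elementwise across workers."""
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]
```

For two workers with gradients [1.0, 2.0] and [3.0, 4.0], every device ends up applying the same averaged gradient [2.0, 3.0], keeping the model replicas in sync.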

Model Examples

Examples of models that adhere to the above criteria:

  • Vision models, such as:

    • Object Classification: ResNets, ResNeXts

    • Object Detection and Segmentation: UNet2D, UNet3D, SegNet, SSD

  • Language models: most Transformer-based models, including BERT.

For a full list of published models and their performance, refer to the Intel Model References repository.