Mixed Precision Training with PyTorch Autocast

Intel® Gaudi® AI accelerator supports mixed precision training using native PyTorch autocast. Autocast allows running mixed precision training without extensive modifications to existing FP32 model scripts: operations registered to autocast are executed in a lower precision floating point data type. Autocast is provided by the torch.amp package.

For more details on mixed precision training with PyTorch autocast, see https://pytorch.org/docs/stable/amp.html.

Using Autocast on Gaudi

To use autocast on Gaudi, wrap the forward pass (model + loss) of the training script in torch.autocast:

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    # Forward pass and loss computation run under autocast
    output = model(input)
    loss = loss_fn(output, target)
# Run the backward pass outside the autocast context
loss.backward()

For an example model using autocast on HPU, see PyTorch Torchvision.

Registered Operators

There are three types of registration to torch.autocast:

  • Lower precision - These ops run in the lower precision bfloat16 data type.

  • FP32 - These ops run in the higher precision float32 data type.

  • Promote - These ops run in the highest precision data type among their inputs.

The default lists of supported ops for each registration type are internally hard-coded. If an op is present in more than one list, it is handled according to the priority: Lower precision -> FP32 -> Promote. The following are the default registered ops for each type (the effect of each registration type is illustrated in the sketch after these lists):

  • Lower precision: addmm, addbmm, batch_norm, baddbmm, bmm, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, dot, dropout, feature_dropout, group_norm, instance_norm, layer_norm, leaky_relu, linear, matmul, mean, mm, mul, mv, softmax, log_softmax, scaled_dot_product_attention

  • FP32: acos, addcdiv, asin, atan2, bilinear, binary_cross_entropy, binary_cross_entropy_with_logits, cdist, cosh, cosine_embedding_loss, cosine_similarity, cross_entropy_loss, dist, div, divide, embedding, embedding_bag, erfinv, exp, expm1, hinge_embedding_loss, huber_loss, kl_div, l1_loss, log, log10, log1p, log2, logsumexp, margin_ranking_loss, mse_loss, multi_margin_loss, multilabel_margin_loss, nll_loss, pdist, poisson_nll_loss, pow, reciprocal, renorm, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softplus, tan, topk, triplet_margin_loss, truediv, true_divide

  • Promote: add, addcmul, addcdiv, cat, div, exp, mul, pow, sub, iadd, truediv, stack
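
The effect of each registration type can be observed by inspecting output data types inside an autocast region. The following is a minimal sketch, assuming a Gaudi environment with habana_frameworks installed; torch.mm is taken from the lower precision list, torch.log from the FP32 list, and torch.add from the promote list:

import torch
import habana_frameworks.torch.core as htcore

a = torch.randn(4, 4, device="hpu")   # float32 input
b = torch.randn(4, 4, device="hpu")   # float32 input

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    mm_out = torch.mm(a, b)               # lower precision op -> bfloat16
    log_out = torch.log(mm_out)           # FP32 op -> float32
    add_out = torch.add(mm_out, log_out)  # promote -> widest input dtype (float32)

print(mm_out.dtype, log_out.dtype, add_out.dtype)
# Expected: torch.bfloat16 torch.float32 torch.float32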

Note

The float16 data type is not supported. Ensure that bfloat16-specific ops and functions are used in place of float16; for example, use tensor.bfloat16() instead of tensor.half().
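
For example, a conversion that would otherwise use half() can be written as follows (a minimal illustration, assuming torch and habana_frameworks.torch.core are already imported as above):

t = torch.randn(8, 8, device="hpu")
t_bf16 = t.bfloat16()   # supported on Gaudi
# t_fp16 = t.half()     # not supported: float16 is unavailable on Gaudi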

Override Options

In addition to the native PyTorch autocast functionality, the Intel Gaudi software allows overriding the default lower precision and FP32 lists. To override a list, create a file containing a newline-separated list of ops and pass its location using the PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST and PT_HPU_AUTOCAST_FP32_OPS_LIST environment variables. Since registration for autocast occurs while the Gaudi modules are loaded, these variables have to be set before habana_frameworks.torch.core is imported.
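
For illustration, each override file is a plain-text list with one op name per line. The file names and op selections below are hypothetical; a lower_list.txt might contain:

addmm
mm
linear

and an fp32_list.txt might contain:

mse_loss
log_softmax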

The following example sets both lists and runs a simple training loop with autocast enabled:

import os

# The override lists must be set before the Gaudi modules are loaded
os.environ['PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST'] = '/path/to/lower_list.txt'
os.environ['PT_HPU_AUTOCAST_FP32_OPS_LIST'] = '/path/to/fp32_list.txt'

import torch
import habana_frameworks.torch.core as htcore

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device='hpu')
y = torch.randn(N, D_out, device='hpu')

model = torch.nn.Linear(D_in, D_out).to(torch.device('hpu'))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    # Forward pass and loss under autocast; backward and optimizer step outside
    with torch.autocast(device_type='hpu', dtype=torch.bfloat16):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Note

  • Since autocast works at the C++ level, op names may differ between the Python and C++ levels. All ops applicable for autocast can be found in [python-packages-path]/torch/include/ATen/RegistrationDeclarations.h.

  • Custom ops are currently not supported.

  • The previously used variables, LOWER_LIST and FP32_LIST, are deprecated.

Autocast Cache

torch.autocast includes a cache_enabled parameter, which is enabled by default. It controls caching of cast operations so that a cast can be reused when one tensor is an input to more than one operator registered for autocast. In the Gaudi modules, the underlying graph mode handles this optimization. Therefore, autocast caching is permanently disabled on HPU.
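
For reference, the parameter can still be passed explicitly; per the behavior described above, its value has no effect on HPU. A minimal sketch:

with torch.autocast(device_type="hpu", dtype=torch.bfloat16, cache_enabled=True):
    # cache_enabled is ignored on HPU; the graph mode handles reuse of cast results
    output = model(input)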