Mixed Precision Training with PyTorch Autocast

The Intel® Gaudi® AI accelerator supports mixed precision training through native PyTorch autocast. Autocast enables mixed precision training without extensive modifications to an existing FP32 model script: operations registered to autocast run in a lower precision floating-point datatype. Autocast is provided by the torch.amp package.

For more details on mixed precision training with PyTorch autocast, see https://pytorch.org/docs/stable/amp.html.

Using Autocast on HPU

To use autocast on HPU, wrap the forward pass of the training step (the model call and the loss computation) in torch.autocast:

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
   output = model(input)
   loss = loss_fn(output, target)

For an example model using autocast on HPU, see PyTorch Torchvision on GitHub.
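
The snippet above shows only the autocast context. As a minimal sketch of a complete training step, the following places the backward pass and optimizer step outside the context. The layer sizes, the SGD settings, and the CPU fallback (used when habana_frameworks is not installed) are assumptions made for illustration and are not part of the Gaudi documentation.

```python
import torch

# Assumption: fall back to CPU autocast when the Gaudi software stack is
# not installed, so the sketch can still be run for illustration.
try:
    import habana_frameworks.torch.core as htcore  # registers the "hpu" device
    device = "hpu"
except ImportError:
    htcore = None
    device = "cpu"

model = torch.nn.Linear(16, 4).to(device)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

inputs = torch.randn(8, 16, device=device)
target = torch.randn(8, 4, device=device)

# Only the forward pass and the loss computation run under autocast.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    output = model(inputs)
    loss = loss_fn(output, target)

# The backward pass and the optimizer step stay outside the context.
optimizer.zero_grad()
loss.backward()
optimizer.step()
if htcore is not None:
    htcore.mark_step()  # trigger graph execution on HPU

print(output.dtype)        # dtype produced by the autocast linear op
print(model.weight.dtype)  # parameters themselves are not converted
```

The model parameters remain in FP32; autocast only casts the inputs of registered ops while the context is active.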

Registered Operators

There are three types of registration to torch.autocast:

  • Lower precision - These ops run in the lower precision bfloat16 datatype.

  • FP32 - These ops run in the higher precision float32 datatype.

  • Promote - These ops run in the highest precision datatype among their inputs.

The default lists of supported ops for each registration type are hard-coded internally. The default registered ops for each type are:

  • Lower precision: addmm, addbmm, batch_norm, baddbmm, bmm, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, dot, dropout, feature_dropout, group_norm, instance_norm, layer_norm, leaky_relu, linear, matmul, mean, mm, mul, mv, softmax, log_softmax, scaled_dot_product_attention

  • FP32: acos, addcdiv, asin, atan2, bilinear, binary_cross_entropy, binary_cross_entropy_with_logits, cdist, cosh, cosine_embedding_loss, cosine_similarity, cross_entropy_loss, dist, div, divide, embedding, embedding_bag, erfinv, exp, expm1, hinge_embedding_loss, huber_loss, kl_div, l1_loss, log, log10, log1p, log2, logsumexp, margin_ranking_loss, mse_loss, multi_margin_loss, multilabel_margin_loss, nll_loss, pdist, poisson_nll_loss, pow, reciprocal, renorm, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softplus, tan, topk, triplet_margin_loss, truediv, true_divide

  • Promote: add, addcmul, addcdiv, cat, div, exp, mul, pow, sub, iadd, truediv, stack
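
As a rough illustration of the three registration types, the sketch below runs one op from each default list and prints the resulting datatypes. The choice of mm, mse_loss, and cat is ours. The CPU fallback is an assumption for environments without Gaudi software, and CPU autocast keeps its own op lists, so results for other ops may differ there.

```python
import torch

# Assumption: use CPU autocast as a stand-in when Gaudi software is absent.
try:
    import habana_frameworks.torch.core  # noqa: F401  (registers "hpu")
    device = "hpu"
except ImportError:
    device = "cpu"

a = torch.randn(4, 4, device=device)  # float32 input
b = torch.randn(4, 4, device=device)  # float32 input

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    lowered = torch.mm(a, b)                         # mm: lower precision list
    loss = torch.nn.functional.mse_loss(lowered, b)  # mse_loss: FP32 list
    joined = torch.cat([lowered, a])                 # cat: promote list

print(lowered.dtype, loss.dtype, joined.dtype)
```

With these inputs, the mm result should be bfloat16, the loss should be float32, and the concatenation of a bfloat16 and a float32 tensor should be promoted to float32.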


The Float16 datatype is not supported. Use BFloat16-specific ops and functions in place of their Float16 counterparts; for example, use tensor.bfloat16() instead of tensor.half().

Override Options

In addition to the native PyTorch autocast functionality, Intel Gaudi software allows overriding the default lower precision and FP32 lists. To override a list, create a text file containing the op names, one per line, and pass its location through the PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST or PT_HPU_AUTOCAST_FP32_OPS_LIST environment variable. The previously used variables, LOWER_LIST and FP32_LIST, are deprecated. Because registration for autocast occurs when the Gaudi modules are loaded, these variables must be set before habana_frameworks.torch.core is imported.

The following example uses custom lower precision and FP32 lists:

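import os

# Set the override lists before the Gaudi module is imported
os.environ['PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST'] = '/path/to/lower_list.txt'
os.environ['PT_HPU_AUTOCAST_FP32_OPS_LIST'] = '/path/to/fp32_list.txt'

import torch
import habana_frameworks.torch.core as htcore

use_autocast = True  # set to False to run the loop in FP32

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device='hpu')
y = torch.randn(N, D_out, device='hpu')

model = torch.nn.Linear(D_in, D_out).to(torch.device('hpu'))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
   # Forward pass and loss computation run under autocast
   with torch.autocast(device_type='hpu', dtype=torch.bfloat16, enabled=use_autocast):
       y_pred = model(x)
       loss = torch.nn.functional.mse_loss(y_pred, y)

   # Backward pass and optimizer step run outside the autocast context
   optimizer.zero_grad()
   loss.backward()
   htcore.mark_step()
   optimizer.step()
   htcore.mark_step()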
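
The list files referenced by the two environment variables are plain text with one op name per line. A minimal sketch of generating them from Python might look as follows. The helper name write_ops_list, the temporary directory, and the selection of ops are hypothetical and chosen only for illustration; the op names themselves come from the default lists above.

```python
import os
import tempfile
from pathlib import Path


def write_ops_list(path, ops):
    """Write one op name per line (hypothetical helper, not a Gaudi API)."""
    Path(path).write_text("\n".join(ops) + "\n")
    return str(path)


# Assumption: a temporary directory stands in for a real config location.
list_dir = Path(tempfile.mkdtemp(prefix="hpu_autocast_"))

# Illustrative selections taken from the default lists.
lower_ops = ["addmm", "bmm", "conv2d", "linear", "matmul", "mm"]
fp32_ops = ["cross_entropy_loss", "log", "mse_loss", "softplus"]

os.environ["PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST"] = write_ops_list(
    list_dir / "lower_list.txt", lower_ops
)
os.environ["PT_HPU_AUTOCAST_FP32_OPS_LIST"] = write_ops_list(
    list_dir / "fp32_list.txt", fp32_ops
)

# Import the Gaudi module only after the variables are set:
# import habana_frameworks.torch.core as htcore
```

Because these files override the defaults, an op left out of a custom list is presumably no longer registered for that type, so it is safest to start from the full default list and edit it.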


  • Because autocast works at the C++ level, op names may differ between the Python and C++ levels. All ops applicable to autocast are listed in [python-packages-path]/torch/include/ATen/RegistrationDeclarations.h.

  • Custom ops are currently not supported.
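
To check the C++-level name of an op before adding it to a list file, the header mentioned above can be searched. The sketch below assumes that each declaration in the header carries a trailing comment with an "aten::<name>" schema string, and that torch is installed under the interpreter's site-packages directory. Both are assumptions about the local installation, and the helper names are hypothetical.

```python
import re
import sysconfig
from pathlib import Path

# Assumption: declarations are annotated with a JSON-like comment that
# contains "schema": "aten::<op_name>[.overload](...)".
SCHEMA_NAME = re.compile(r'"schema":\s*"aten::([A-Za-z0-9_]+)')


def registered_op_names(header_text):
    """Return the sorted, de-duplicated op names found in the header text."""
    return sorted(set(SCHEMA_NAME.findall(header_text)))


def default_header_path():
    """Guess the header location (assumes torch lives in site-packages)."""
    site_packages = Path(sysconfig.get_paths()["purelib"])
    return site_packages / "torch/include/ATen/RegistrationDeclarations.h"


# Two sample lines in the assumed format, used to exercise the parser.
sample = '''
Tensor add(const Tensor & self, const Tensor & other, const Scalar & alpha); // {"schema": "aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor", "dispatch": "True", "default": "True"}
Tensor mse_loss(const Tensor & self, const Tensor & target, int64_t reduction); // {"schema": "aten::mse_loss(Tensor self, Tensor target, int reduction=Mean) -> Tensor", "dispatch": "True", "default": "True"}
'''
print(registered_op_names(sample))

header = default_header_path()
if header.exists():
    names = registered_op_names(header.read_text())
    print("cross_entropy_loss" in names)
```

Overload suffixes such as ".Tensor" are dropped, so each op name appears once regardless of how many overloads are declared.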

Autocast Cache

torch.autocast includes a cache_enabled parameter, which is enabled by default. It controls whether cast operations are cached for reuse when one tensor is an input to more than one operator registered for autocast. In Gaudi modules, the underlying graph mode handles this optimization, so autocast caching is permanently disabled on HPU.
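
For reference, the flag can be inspected from Python. The sketch below passes cache_enabled=False explicitly and reads the state back with torch.is_autocast_cache_enabled(). On HPU the explicit argument should not be needed, given that caching is disabled regardless; the CPU fallback is an assumption for environments without Gaudi software.

```python
import torch

# Assumption: use CPU autocast as a stand-in when Gaudi software is absent.
try:
    import habana_frameworks.torch.core  # noqa: F401  (registers "hpu")
    device = "hpu"
except ImportError:
    device = "cpu"

with torch.autocast(device_type=device, dtype=torch.bfloat16, cache_enabled=False):
    # Reflects the cache setting active inside this autocast context.
    cache_inside = torch.is_autocast_cache_enabled()

print(cache_inside)
```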