Native PyTorch Autocast

Autocast is a native PyTorch module that allows running mixed precision training without extensive modifications to an existing FP32 model script. It executes operations registered to autocast using a lower precision floating-point datatype. The module is provided by the torch.amp package.

For more details on PyTorch autocast, see https://pytorch.org/docs/stable/amp.html.

Using Autocast on HPU

To use autocast on HPU, wrap the forward pass (model + loss) of the training script in torch.autocast:

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    output = model(input)
    loss = loss_fn(output, target)
loss.backward()

For an example model using autocast on HPU, see PyTorch Torchvision on GitHub.

Registered Operators

There are three types of op registration for torch.autocast:

  • Lower precision - These ops run in the lower precision bfloat16 datatype.

  • FP32 - These ops run in the higher precision float32 datatype.

  • Promote - These ops run in the highest precision datatype among their inputs.

The default lists of supported ops for each registration type are hard-coded internally. The following provides the default registered ops for each type; a short sketch after the list illustrates the effect of these registrations:

  • Lower precision: addmm, addbmm, batch_norm, baddbmm, bmm, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, dot, dropout, feature_dropout, group_norm, instance_norm, layer_norm, leaky_relu, linear, matmul, mean, mm, mul, mv, softmax, log_softmax

  • FP32: acos, addcdiv, asin, atan2, bilinear, binary_cross_entropy, binary_cross_entropy_with_logits, cdist, cosh, cosine_embedding_loss, cosine_similarity, cross_entropy_loss, dist, div, divide, embedding, embedding_bag, erfinv, exp, expm1, hinge_embedding_loss, huber_loss, kl_div, l1_loss, log, log10, log1p, log2, logsumexp, margin_ranking_loss, mse_loss, multi_margin_loss, multilabel_margin_loss, nll_loss, pdist, poisson_nll_loss, pow, reciprocal, renorm, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softplus, tan, topk, triplet_margin_loss, truediv, true_divide

  • Promote: add, addcmul, addcdiv, cat, div, exp, mul, pow, sub, iadd, truediv, stack
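
The effect of these registrations can be observed by checking output dtypes inside an autocast region. The following is a minimal sketch, assuming an HPU device is available; mm comes from the lower precision list and mse_loss from the FP32 list, so their outputs are expected to be bfloat16 and float32 respectively:

import torch
import habana_frameworks.torch.core as htcore

a = torch.randn(8, 8, device='hpu')
b = torch.randn(8, 8, device='hpu')

with torch.autocast(device_type='hpu', dtype=torch.bfloat16):
    c = torch.mm(a, b)                         # mm is in the lower precision list
    loss = torch.nn.functional.mse_loss(c, b)  # mse_loss is in the FP32 list

print(c.dtype)     # expected: torch.bfloat16
print(loss.dtype)  # expected: torch.float32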

Note

The Float16 datatype is not supported. Ensure that BFloat16-specific ops and functions are used in place of their Float16 counterparts; for example, tensor.bfloat16() should be used instead of tensor.half().

Override Options

In addition to the native PyTorch autocast functionality, SynapseAI allows overriding the default lower precision and FP32 lists. To override a list, create a file containing a newline-separated list of ops and pass its location using the LOWER_LIST or FP32_LIST environment variable. Since registration for autocast occurs while the Habana modules are loaded, these variables must be set before habana_frameworks.torch.core is imported.
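
The override files are plain text with one op name per line (the names follow the C++ level op names described in the Note below). For illustration only, a hypothetical lower_list.txt could contain:

addmm
bmm
conv2d
linear
matmul
mm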

The following shows an example that uses custom lower precision and FP32 lists:

import os

# The override list locations must be set before the Habana modules are imported.
os.environ['LOWER_LIST'] = '/path/to/lower_list.txt'
os.environ['FP32_LIST'] = '/path/to/fp32_list.txt'

import torch
import habana_frameworks.torch.core as htcore

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device='hpu')
y = torch.randn(N, D_out, device='hpu')

model = torch.nn.Linear(D_in, D_out).to(torch.device('hpu'))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    # Run the forward pass and loss computation under autocast;
    # enabled can also be driven by a command-line flag (e.g. args.is_autocast).
    with torch.autocast(device_type='hpu', dtype=torch.bfloat16, enabled=True):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Note

  • Since autocast works at the C++ level, op names may differ between the Python and C++ levels. All ops applicable for autocast can be found in [python-packages-path]/torch/include/ATen/RegistrationDeclarations.h.

  • Custom ops are currently not supported.

Autocast Cache

torch.autocast includes a cache_enabled parameter, which is enabled by default. It controls caching of cast operations so that they can be reused when one tensor is an input to more than one operator registered for autocast. In Habana modules, the underlying graph mode handles this optimization; therefore, autocast caching is permanently disabled on HPU.
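
Because the cast cache is always disabled on HPU, passing cache_enabled=False there is optional. The following sketch, reusing model, input, target and loss_fn from the earlier example, only shows where the parameter is set, which can be useful when sharing code with devices where the cache is active:

# Explicitly disable the autocast cast cache; on HPU this is already the effective behavior.
with torch.autocast(device_type="hpu", dtype=torch.bfloat16, cache_enabled=False):
    output = model(input)
    loss = loss_fn(output, target)
loss.backward()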