Mixed Precision Training with PyTorch Autocast¶
Intel® Gaudi® AI accelerator supports mixed precision training using native PyTorch autocast. Autocast allows running mixed precision training without extensive modifications to existing FP32 model scripts. It executes operations registered to autocast in a lower precision floating point data type. The module is provided by the torch.amp package.
For more details on mixed precision training with PyTorch autocast, see https://pytorch.org/docs/stable/amp.html.
Using Autocast on Gaudi¶
To use autocast on Gaudi, wrap the forward pass (model + loss) of the training script with torch.autocast:
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    output = model(input)
    loss = loss_fn(output, target)
loss.backward()
For an example model using autocast on HPU, see PyTorch Torchvision.
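Autocast regions can also be selectively disabled for numerically sensitive code. The following is a minimal sketch, reusing model, loss_fn, input, and target from the example above and assuming standard torch.autocast nesting semantics (a nested region with enabled=False runs in FP32):
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    output = model(input)                       # autocast-registered ops run in bfloat16
    # Nested region with enabled=False forces float32 execution.
    with torch.autocast(device_type="hpu", enabled=False):
        loss = loss_fn(output.float(), target)
loss.backward()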
Registered Operators¶
There are three types of registration to torch.autocast:
Lower precision - These ops run in the lower precision bfloat16 data type.
FP32 - These ops run in the higher precision float32 data type.
Promote - These ops run in the highest precision data type among their inputs.
The default lists of supported ops for each registration type are internally hard-coded. If an op is present in more than one list, it is handled according to the priority: Lower precision -> FP32 -> Promote. The following provides the default list of registered ops for each type; a short sketch after these lists illustrates the resulting output data types:
Lower precision:
addmm, addbmm, batch_norm, baddbmm, bmm, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, dot, dropout, feature_dropout, group_norm, instance_norm, layer_norm, leaky_relu, linear, matmul, mean, mm, mul, mv, softmax, log_softmax, scaled_dot_product_attention
FP32:
acos, addcdiv, asin, atan2, bilinear, binary_cross_entropy, binary_cross_entropy_with_logits, cdist, cosh, cosine_embedding_loss, cosine_similarity, cross_entropy_loss, dist, div, divide, embedding, embedding_bag, erfinv, exp, expm1, hinge_embedding_loss, huber_loss, kl_div, l1_loss, log, log10, log1p, log2, logsumexp, margin_ranking_loss, mse_loss, multi_margin_loss, multilabel_margin_loss, nll_loss, pdist, poisson_nll_loss, pow, reciprocal, renorm, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softplus, tan, topk, triplet_margin_loss, truediv, true_divide
Promote:
add, addcmul, addcdiv, cat, div, exp, mul, pow, sub, iadd, truediv, stack
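The sketch below is illustrative only (shapes and ops chosen arbitrarily from the default lists above); it shows how each registration type determines the output data type inside an autocast region:
import torch
import habana_frameworks.torch.core as htcore

a = torch.randn(8, 8, device="hpu")            # float32 inputs
b = torch.randn(8, 8, device="hpu")

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    print(torch.mm(a, b).dtype)                # lower precision op -> torch.bfloat16
    print(torch.log(a.abs()).dtype)            # FP32 op            -> torch.float32
    print((a + torch.mm(a, b)).dtype)          # promote op         -> torch.float32 (widest input)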
Note
Float16 data type is not supported. Ensure that bfloat16 specific ops and functions are used in place of float16; for example, tensor.bfloat16() should be used instead of tensor.half(), as shown in the sketch below.
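A small sketch of the recommended conversion, assuming an HPU device is available:
import torch
import habana_frameworks.torch.core as htcore

t = torch.randn(4, 4, device="hpu")
t_bf16 = t.bfloat16()      # supported on Gaudi
# t_fp16 = t.half()        # avoid: float16 is not supported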
Override Options¶
In addition to the native PyTorch autocast functionality, the Intel Gaudi software allows overriding the default lower precision and FP32 lists. To override these lists, add a file that contains a newline-separated list of ops and pass its location using the PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST and PT_HPU_AUTOCAST_FP32_OPS_LIST environment variables. Since registration for autocast occurs while the Gaudi modules are loading, these variables must be set before habana_frameworks.torch.core is imported.
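For illustration, a hypothetical lower precision list file contains one op name per line; the entries below are examples only, drawn from the default lists above:
addmm
bmm
conv2d
linear
matmul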
The following shows an example using the lower precision and FP32 list overrides:
import os

# Both environment variables must be set before habana_frameworks.torch.core is imported.
os.environ['PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST'] = '/path/to/lower_list.txt'
os.environ['PT_HPU_AUTOCAST_FP32_OPS_LIST'] = '/path/to/fp32_list.txt'

import torch
import habana_frameworks.torch.core as htcore

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device='hpu')
y = torch.randn(N, D_out, device='hpu')

model = torch.nn.Linear(D_in, D_out).to(torch.device('hpu'))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    # args.is_autocast is a script-level flag toggling mixed precision.
    with torch.autocast(device_type='hpu', dtype=torch.bfloat16, enabled=args.is_autocast):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Note
Since autocast works at the C++ level, op names may differ between the Python and C++ levels. All ops applicable for autocast can be found in [python-packages-path]/torch/include/ATen/RegistrationDeclarations.h. Custom ops are currently not supported.
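As a convenience, the sketch below (an assumption, not part of the Intel Gaudi tooling) locates RegistrationDeclarations.h inside the installed torch package and prints declarations matching a given op name; the searched op name is only an example:
import os
import torch

# Path to the declarations header shipped with the torch package.
decl_path = os.path.join(os.path.dirname(torch.__file__),
                         "include", "ATen", "RegistrationDeclarations.h")

# Print every declaration containing the (example) op name.
with open(decl_path) as f:
    for line in f:
        if "scaled_dot_product_attention" in line:
            print(line.rstrip())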
Autocast Cache¶
torch.autocast includes a cache_enabled parameter, which is enabled by default. It controls the caching of cast operations so that they can be reused when one tensor is an input to more than one operator registered for autocast.
In the Gaudi modules, the underlying graph mode handles this optimization. Therefore, autocast caching is permanently disabled on HPU.
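As a minimal sketch, passing cache_enabled explicitly on HPU is accepted, but autocast caching remains disabled there regardless of the flag:
import torch
import habana_frameworks.torch.core as htcore

# cache_enabled defaults to True, but cast caching stays disabled on HPU;
# the graph mode performs the equivalent optimization instead.
with torch.autocast(device_type="hpu", dtype=torch.bfloat16, cache_enabled=True):
    out = torch.mm(torch.randn(4, 4, device="hpu"), torch.randn(4, 4, device="hpu"))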