Importing PyTorch Models Manually

Importing habana_frameworks.torch.core

  1. Import habana_frameworks.torch.core:

import habana_frameworks.torch.core as htcore

  2. Target the Gaudi HPU device:

device = torch.device("hpu")

  3. Add mark_step(). In Lazy mode, mark_step() must be added in all training scripts right after loss.backward() and right after optimizer.step(), as shown in the sketch following this list.

htcore.mark_step()
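
For reference, a minimal sketch of a Lazy mode training loop with the three changes applied might look like the following. The toy model, optimizer, and random data below are placeholders used only to illustrate where the mark_step() calls go; replace them with your own model and data pipeline.

import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

# Toy model, optimizer, and data used only to illustrate mark_step() placement.
model = torch.nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for _ in range(3):
    data = torch.randn(8, 10, device=device)
    target = torch.randint(0, 2, (8,), device=device)

    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    htcore.mark_step()  # trigger execution of the accumulated graph after loss.backward()
    optimizer.step()
    htcore.mark_step()  # trigger execution again after optimizer.step()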

Note

If the model has dependencies on GPU libraries, refer to the GPU Migration Toolkit.

Enabling Mixed Precision

To run mixed precision training on HPU without extensive modifications to existing FP32 model scripts, Intel Gaudi provides native PyTorch autocast support.

Autocast is a native PyTorch module that enables mixed precision training. It runs operations registered with autocast in a lower-precision floating-point datatype. The module is provided by the torch.amp package.

To use autocast on HPU, wrap the forward pass (model and loss computation) of the training script in torch.autocast:

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    output = model(input)
    loss = loss_fn(output, target)
loss.backward()
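
For context, here is a minimal sketch of how the autocast context fits into a Lazy mode training step together with the mark_step() calls described earlier. Only the forward pass and loss computation run under autocast; the backward pass and optimizer step stay outside it. The toy model and random data are placeholders for illustration.

import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")
model = torch.nn.Linear(10, 2).to(device)         # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

data = torch.randn(8, 10, device=device)          # placeholder batch
target = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    output = model(data)                          # ops registered to autocast run in bfloat16
    loss = loss_fn(output, target)
loss.backward()                                   # backward pass outside the autocast context
htcore.mark_step()
optimizer.step()
htcore.mark_step()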

For further information such as the full default list of registered ops or instructions on creating a custom ops list, see Mixed Precision Training with PyTorch Autocast.

Setting Up Distributed Training

Distributed communication on Intel Gaudi is enabled through the HCCL (Habana Collective Communication Library) backend. The following script changes load HCCL support and initialize the process group communication backend as "hccl":

import habana_frameworks.torch.distributed.hccl
torch.distributed.init_process_group(backend='hccl')

The example above assumes that training was started with either torchrun or mpirun and that all necessary environment variables are set before habana_frameworks.torch.distributed.hccl is imported.
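
As a rough sketch, assuming the script is launched with torchrun (which sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT), the HCCL backend can be combined with standard DistributedDataParallel as follows. The toy model is a placeholder for illustration only.

import torch
import habana_frameworks.torch.core as htcore
import habana_frameworks.torch.distributed.hccl  # loads HCCL backend support

torch.distributed.init_process_group(backend='hccl')
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()

device = torch.device("hpu")
model = torch.nn.Linear(10, 2).to(device)        # placeholder model
model = torch.nn.parallel.DistributedDataParallel(model)

if rank == 0:
    print(f"HCCL process group initialized with {world_size} ranks")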

For further details on distributed training, see Distributed Training with PyTorch.