Python Package (habana_frameworks.torch)

This package provides PyTorch bridge interfaces and modules such as optimizers, mixed precision configuration, and fused kernels for training on HPU.

The modules are organized as shown below:

habana_frameworks.torch
  core
  distributed
     hccl
  hpex
     hmp
     kernels
     normalization
     optimizers
  hpu
  utils

The following sections provide a brief description of each module.

core

The core module provides Python bindings to PyTorch-Habana bridge interfaces, for example mark_step, which triggers execution of the accumulated graph in Lazy mode.
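
A minimal sketch of triggering lazy-mode execution with mark_step; the model and input used here are placeholders:

import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")
model = torch.nn.Linear(10, 10).to(device)     # placeholder model
x = torch.randn(4, 10, device=device)          # placeholder input

out = model(x)          # ops are accumulated lazily
htcore.mark_step()      # trigger execution of the accumulated graph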

distributed/hccl

The distributed/hccl module registers and adds support for the HCCL communication backend.
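
A minimal single-process sketch of enabling the HCCL backend for torch.distributed; the rank, world size, and rendezvous address below are placeholder values that would normally come from the launcher:

import os
import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

os.environ.setdefault("MASTER_ADDR", "localhost")   # placeholder rendezvous settings
os.environ.setdefault("MASTER_PORT", "12355")

dist.init_process_group(backend="hccl", rank=0, world_size=1)  # placeholder single-process values
tensor = torch.ones(4, device=torch.device("hpu"))
dist.all_reduce(tensor)  # collective runs over HCCL
dist.destroy_process_group()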

hpex/hmp

The hpex/hmp module contains the habana_mixed_precision (hmp) tool, which can be used to train a model in mixed precision on HPU. Refer to PyTorch Mixed Precision Training on Gaudi for further details.
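
A minimal sketch, assuming the hmp.convert() entry point and hmp.disable_casts() context manager described in the Mixed Precision guide; the model and optimizer are placeholders:

import torch
from habana_frameworks.torch.hpex import hmp

hmp.convert()  # enable mixed precision with the default op lists (assumed default call)

device = torch.device("hpu")
model = torch.nn.Linear(10, 10).to(device)                # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # placeholder optimizer

out = model(torch.randn(4, 10, device=device))
out.sum().backward()
with hmp.disable_casts():   # assumed context manager for FP32-sensitive regions such as the optimizer step
    optimizer.step()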

hpex/kernels

The hpex/kernels module contains Python interfaces to Habana-only custom operators, such as the EmbeddingBag and EmbeddingBagPreProc operators.

hpex/normalization

The hpex/normalization module contains Python interfaces to the Habana implementation of the normalize-and-clip operations commonly performed on gradients in some models. Using the Habana-provided implementation can give better performance than the equivalent operators provided in torch. Refer to Other Custom OPs for further details.
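
A minimal sketch, assuming the FusedClipNorm interface described in Other Custom OPs; the model and clipping threshold are placeholders:

import torch
from habana_frameworks.torch.hpex.normalization import FusedClipNorm

device = torch.device("hpu")
model = torch.nn.Linear(10, 10).to(device)                  # placeholder model
fused_clip = FusedClipNorm(model.parameters(), 1.0)         # placeholder max-norm threshold

out = model(torch.randn(4, 10, device=device))
out.sum().backward()
fused_clip.clip_norm(model.parameters())  # fused normalize & clip of gradients on HPU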

hpex/optimizers

The hpex/optimizers module contains Python interfaces to Habana implementations of some common optimizers used in DL models. Using the Habana implementations can give better performance than the corresponding optimizer implementations available in torch. Refer to Custom Optimizers for further details.
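
A minimal sketch, assuming FusedAdamW is among the optimizers listed under Custom Optimizers; the model and hyperparameters are placeholders:

import torch
from habana_frameworks.torch.hpex.optimizers import FusedAdamW

device = torch.device("hpu")
model = torch.nn.Linear(10, 10).to(device)                               # placeholder model
optimizer = FusedAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)   # placeholder hyperparameters

out = model(torch.randn(4, 10, device=device))
out.sum().backward()
optimizer.step()
optimizer.zero_grad()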

hpu APIs

This package provides support for HPU tensors. The following APIs provide the same functionality as their CPU counterparts, with HPU used for the underlying implementation. The package can be imported on demand. A usage sketch follows the list.

  • import habana_frameworks.torch.hpu as hthpu - Imports the package.

  • hthpu.is_available() - Returns a boolean indicating if an HPU device is currently available.

  • hthpu.device_count() - Returns the number of compute-capable devices.

  • hthpu.get_device_name() - Returns the name of the HPU device.

  • hthpu.current_device() - Returns the index of the currently selected HPU device.
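
A minimal sketch of querying device information with these APIs:

import habana_frameworks.torch.hpu as hthpu

if hthpu.is_available():
    print("device count   :", hthpu.device_count())
    print("device name    :", hthpu.get_device_name())
    print("current device :", hthpu.current_device())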

utils

The utils module contains general Python utilities required for training on HPU.

Memory Stats APIs

  • torch.hpu.max_memory_allocated - Returns the peak HPU memory allocated by tensors (in bytes). reset_peak_memory_stats() can be used to reset the starting point for tracking this statistic.

  • torch.hpu.memory_allocated - Returns the current HPU memory occupied by tensors.

  • torch.hpu.memory_stats - Returns a list of HPU memory statistics. The following summarizes the fields in a sample memory stats printout:

    • Limit - Amount of total memory on HPU device

    • InUse - Amount of memory currently allocated; this is the starting point after reset_peak_memory_stats()

    • MaxInUse - Amount of total active memory allocated

    • NumAllocs - Number of allocations

    • NumFrees - Number of freed chunks

    • ActiveAllocs - Number of active allocations

    • MaxAllocSize - Maximum allocated size

    • TotalSystemAllocs - Total number of system allocations

    • TotalSystemFrees - Total number of system frees

    • TotalActiveAllocs - Total number of active allocations

  • torch.hpu.memory_summary - Returns a human-readable printout of the current memory stats.

  • torch.hpu.reset_accumulated_memory_stats - Clears the number-of-allocations and number-of-frees counters.

  • torch.hpu.reset_peak_memory_stats - Resets the starting point for tracking the peak memory occupied by tensors.

The following shows a usage example:

import torch
import habana_frameworks.torch as htorch

if __name__ == '__main__':
    hpu = torch.device('hpu')

    # Allocate a tensor on HPU and inspect the memory summary.
    input1 = torch.randn((64, 28, 28, 20), dtype=torch.float, requires_grad=True)
    input1_hpu = input1.contiguous(memory_format=torch.channels_last).to(hpu)
    mem_summary1 = htorch.hpu.memory_summary()
    print('memory_summary1:')
    print(mem_summary1)

    # Reset the peak-memory starting point, then allocate a second tensor.
    htorch.hpu.reset_peak_memory_stats()
    input2 = torch.randn((64, 28, 28, 20), dtype=torch.float, requires_grad=True)
    input2_hpu = input2.contiguous(memory_format=torch.channels_last).to(hpu)
    mem_summary2 = htorch.hpu.memory_summary()
    print('memory_summary2:')
    print(mem_summary2)

    # Query the individual memory statistics.
    mem_allocated = htorch.hpu.memory_allocated()
    print('memory_allocated: ', mem_allocated)
    mem_stats = htorch.hpu.memory_stats()
    print('memory_stats:')
    print(mem_stats)
    max_mem_allocated = htorch.hpu.max_memory_allocated()
    print('max_memory_allocated: ', max_mem_allocated)

Random Number Generator APIs

  • torch.hpu.random.get_rng_state - Returns the random number generator state of the specified HPU as a ByteTensor.

  • torch.hpu.random.get_rng_state_all - Returns a list of ByteTensor representing the random number states of all devices.

  • torch.hpu.random.set_rng_state - Sets the random number generator state of the specified HPU.

  • torch.hpu.random.set_rng_state_all - Sets the random number generator state of all devices.

  • torch.hpu.random.manual_seed - Sets the seed for generating random numbers for the current HPU device.

  • torch.hpu.random.manual_seed_all - Sets the seed for generating random numbers on all HPUs.

  • torch.hpu.random.seed - Sets the seed for generating random numbers to a random number for the current HPU.

  • torch.hpu.random.seed_all - Sets the seed for generating random numbers to a random number on all HPUs.

  • torch.hpu.random.initial_seed - Returns the current random seed of the current HPU.
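
A minimal usage sketch, assuming the functions listed above are available from the habana_frameworks.torch.hpu.random submodule:

import habana_frameworks.torch.hpu.random as htrandom

htrandom.manual_seed(42)                        # seed the current HPU device
state = htrandom.get_rng_state()                # capture the generator state as a ByteTensor
print('initial_seed: ', htrandom.initial_seed())
htrandom.set_rng_state(state)                   # restore the captured state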