Memory Scrub Verification Tool

The Memory Scrub Verification (MSV) tool verifies that the high-bandwidth memory (HBM) on the Gaudi device is properly scrubbed during the boot process. Memory scrubbing is crucial in a Hypervisor environment because it guarantees that the device holds no residual data from previous use. This keeps your workloads secure and isolated from others and prevents exposure of potentially sensitive information.

The Hypervisor should run MSV after the previous virtual machine has been destroyed and all Gaudi 3 devices have completed their boot process, and before the kernel-mode driver is loaded.

This section explains how to integrate MSV into your system, covering the process of installing the tool, building the source code and incorporating the executable into the Hypervisor workflow.

Options and Usage

The following table lists the available MSV options and describes how to use them.

Note

Make sure to unload the driver before using MSV.
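As a pre-flight check for the note above, a Hypervisor script can confirm that the driver module is not loaded before it starts MSV. The following is a minimal sketch, not part of MSV itself. It assumes a Linux host where /proc/modules lists loaded modules (module name in the first column) and assumes the kernel-mode driver module is named habanalabs; adjust the name to match your installation. The helper names are hypothetical.

```python
def is_module_loaded(proc_modules_text: str, module_name: str) -> bool:
    """Return True if module_name is listed in the given /proc/modules content.

    Each line of /proc/modules starts with the module name, followed by
    whitespace-separated fields (size, use count, and so on).
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == module_name:
            return True
    return False


def driver_is_loaded(module_name: str = "habanalabs",
                     modules_path: str = "/proc/modules") -> bool:
    """Read the live module list and check for the driver.

    Assumption: "habanalabs" is the driver module name on this host.
    """
    with open(modules_path, "r", encoding="utf-8") as handle:
        return is_module_loaded(handle.read(), module_name)
```

A workflow script would call driver_is_loaded() and refuse to start MSV while it returns True.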

Example:

$ sudo hbm_scrubbing_validator -busId 0000:09:00.0 -num_of_samples 1000 -sample_size 2 -o /home/user/failed_addresses.txt

Option
    Description

-num_of_samples <int>
    Number of samples to validate. Must be greater than 0. Default is 1. A sample is a logical block of Gaudi 3 HBM. The size of each sample is 128GB / num_of_samples (128GB is the Gaudi 3 HBM size).

-sample_size <int>
    Number of consecutive addresses to validate within each sample. Must be greater than 0. Default is 1. For example, a value of 4 checks 4 x 128b patches within each sample.

-status_read_retries <int>
    Number of retries while waiting for the boot process to complete. Default is 12 (each retry takes 5 seconds). Because the Hypervisor runs MSV, it must confirm that the tested devices have finished the boot stage before validation starts. This parameter sets how many times that check is retried.

-o <str>
    Path to the output file that will contain the addresses that failed validation.
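The arithmetic behind these options can be sketched in a few lines. This is an illustration only, not MSV code. It assumes "128GB" means 128 GiB (128 x 2^30 bytes), which the text above does not state, and all function names are hypothetical.

```python
# Assumption: "128GB" is read as 128 GiB. If the tool uses decimal
# gigabytes instead, replace 2**30 with 10**9.
HBM_SIZE_BYTES = 128 * 2**30


def sample_block_bytes(num_of_samples: int) -> int:
    """Size of one sample block: HBM size divided by num_of_samples."""
    if num_of_samples <= 0:
        raise ValueError("num_of_samples must be greater than 0")
    return HBM_SIZE_BYTES // num_of_samples


def addresses_checked(num_of_samples: int, sample_size: int) -> int:
    """Total consecutive-address checks: one run of sample_size per sample."""
    if num_of_samples <= 0 or sample_size <= 0:
        raise ValueError("both values must be greater than 0")
    return num_of_samples * sample_size


def max_boot_wait_seconds(status_read_retries: int = 12,
                          seconds_per_retry: int = 5) -> int:
    """Upper bound on the time spent waiting for boot to complete."""
    return status_read_retries * seconds_per_retry
```

With the defaults, max_boot_wait_seconds() gives 60 seconds. For the example command (-num_of_samples 1000 -sample_size 2), addresses_checked(1000, 2) gives 2000 checks, each sample block covering roughly 137 MB under the GiB assumption.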

Output examples:

  • Success output example:

    $ sudo hbm_scrubbing_validator -busId 0000:09:00.0 -num_of_samples 10000 -sample_size 4
    The boot process is complete, going to validate hbm scrubbing.
    Number of failed reads: 0
    
  • Failure output example:

    $ sudo hbm_scrubbing_validator -busId 0000:21:00.0 -num_of_samples 100000 -sample_size 8 -o /root/logs/failed_addresses.txt
    The boot process is complete, going to validate hbm scrubbing.
    Number of failed reads: 32
    

    MSV writes the addresses that failed validation to ‘/root/logs/failed_addresses.txt’ as follows:

    $ cat /root/logs/failed_addresses.txt
    2010000000184f0
    2010000000184f4
    2010000000184f8
    2010000000184fc
    201000000018500
    201000000018504
    .....
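To incorporate MSV into a Hypervisor workflow, a script needs to build the command line, read the failure count from the printed output, and optionally load the failed addresses. The following is a minimal sketch under stated assumptions: it relies only on the options and the output text shown in the examples above, it assumes the addresses in the output file are hexadecimal with one address per line, and it makes no assumption about the tool's exit code. All function names are hypothetical.

```python
import re
from typing import List, Optional

# Matches the summary line shown in the output examples above.
_FAILED_READS = re.compile(r"Number of failed reads:\s*(\d+)")


def build_msv_command(bus_id: str, num_of_samples: int = 1,
                      sample_size: int = 1,
                      output_path: Optional[str] = None) -> List[str]:
    """Build the argument list for an MSV run (suitable for subprocess.run)."""
    if num_of_samples <= 0 or sample_size <= 0:
        raise ValueError("num_of_samples and sample_size must be greater than 0")
    command = [
        "sudo", "hbm_scrubbing_validator",
        "-busId", bus_id,
        "-num_of_samples", str(num_of_samples),
        "-sample_size", str(sample_size),
    ]
    if output_path is not None:
        command += ["-o", output_path]
    return command


def failed_read_count(stdout_text: str) -> Optional[int]:
    """Return the reported number of failed reads, or None if not found."""
    match = _FAILED_READS.search(stdout_text)
    return int(match.group(1)) if match else None


def parse_failed_addresses(file_text: str) -> List[int]:
    """Parse the output file content into integers.

    Assumption: each line holds one hexadecimal address. Lines that do
    not parse as hexadecimal (blank lines, elision markers) are skipped.
    """
    addresses = []
    for line in file_text.splitlines():
        token = line.strip()
        if not token:
            continue
        try:
            addresses.append(int(token, 16))
        except ValueError:
            continue
    return addresses
```

A workflow would pass the result of build_msv_command() to subprocess.run with capture_output=True and text=True, then treat the device as clean only when failed_read_count() returns 0. A return value of None means the summary line was not found and should be handled as an error rather than as success.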