How to Pick Good Nodes in the Datacenter

In general, test each HLS-3 box with the HCCL benchmarks and confirm that its performance falls within the expected range. Run the benchmarks on 2 boxes, then 4 boxes, then 8 boxes, and so on, adding each group that passes to the set of known-good nodes. The following sections show how to run these tests across multiple HLS-3 boxes and list the expected HCCL benchmark performance. Note that a performance drop of up to 10% is possible as the number of HLS-3 boxes and spine switches increases.
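The acceptance rule and the doubling schedule described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the HCCL tooling: the function names within_expected and scaling_steps are hypothetical, and the reference bandwidth you pass in should be the expected figure for your own configuration.

```python
def within_expected(measured_gbps, reference_gbps, max_drop=0.10):
    """Return True if measured bandwidth is at most max_drop below the reference.

    max_drop=0.10 mirrors the "within 10%" allowance stated in the text.
    """
    if reference_gbps <= 0:
        raise ValueError("reference bandwidth must be positive")
    drop = (reference_gbps - measured_gbps) / reference_gbps
    return drop <= max_drop


def scaling_steps(total_boxes):
    """Group sizes to test in order: 2, 4, 8, ... not exceeding total_boxes."""
    steps = []
    size = 2
    while size <= total_boxes:
        steps.append(size)
        size *= 2
    return steps


if __name__ == "__main__":
    # Hypothetical measurement compared against a hypothetical reference.
    print(within_expected(500.0, 550.0))
    print(scaling_steps(8))
```

A node group that fails the check at a given step would be excluded before moving on to the next, larger group.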

Single HLS-3 Box Test

Once the job scheduler allocates a list of HLS-3 nodes, run the single-node HCCL tests on each node to confirm that both its scale-up and scale-out links are healthy:

cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test all_reduce --nranks 8 --loop 1000 --node_id 0 --size 256m \
    --ranks_per_node 8

[BENCHMARK] hcclAllReduce(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 512.957298 GB/s
[BENCHMARK]     Algo Bandwidth : 293.118456 GB/s
[BENCHMARK]     Time           : 0.915792 seconds


cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test all_gather --nranks 8 --loop 1000 --node_id 0 --size 16m \
    --ranks_per_node 8

[BENCHMARK] hcclAllGather(src!=dst, InputSizeInBytes=16777216, count=4194304, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 509.189283 GB/s
[BENCHMARK]     Algo Bandwidth : 72.741326 GB/s
[BENCHMARK]     Time           : 0.230642 seconds


cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test reduce_scatter --nranks 8 --loop 1000 --node_id 0 --size 256m \
    --ranks_per_node 8

[BENCHMARK] hcclReduceScatter(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 509.285134 GB/s
[BENCHMARK]     Algo Bandwidth : 582.040153 GB/s
[BENCHMARK]     Time           : 0.461197 seconds


cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test all2all --nranks 8 --loop 1000 --node_id 0 --size 256m \
    --ranks_per_node 8

[BENCHMARK] hcclAlltoAll(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 509.252740 GB/s
[BENCHMARK]     Algo Bandwidth : 582.003132 GB/s
[BENCHMARK]     Time           : 0.461227 seconds
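To compare many nodes, it helps to extract the figures from the [BENCHMARK] lines rather than read them by eye. The following is a minimal sketch that assumes the output layout shown in this section; the function name parse_benchmark and the dictionary keys are hypothetical, and the regular expressions tolerate the header being wrapped across two lines.

```python
import re

# Header line: collective name plus the run parameters in parentheses.
_HEADER = re.compile(
    r"\[BENCHMARK\]\s+(hccl\w+)\(.*?InputSizeInBytes=(\d+),\s*count=(\d+),"
    r"\s*dtype=(\w+),\s*iterations=(\d+)\)",
    re.DOTALL,
)
# Metric lines: "NW Bandwidth", "Algo Bandwidth", or "Time" followed by a number.
_METRIC = re.compile(
    r"\[BENCHMARK\]\s+(NW Bandwidth|Algo Bandwidth|Time)\s*:\s*([\d.]+)"
)


def parse_benchmark(text):
    """Parse one [BENCHMARK] block into a dict of parameters and metrics."""
    header = _HEADER.search(text)
    if header is None:
        raise ValueError("no [BENCHMARK] header found")
    result = {
        "collective": header.group(1),
        "input_bytes": int(header.group(2)),
        "count": int(header.group(3)),
        "dtype": header.group(4),
        "iterations": int(header.group(5)),
    }
    for name, value in _METRIC.findall(text):
        # "NW Bandwidth" -> "nw_bandwidth", "Time" -> "time"
        result[name.lower().replace(" ", "_")] = float(value)
    return result


if __name__ == "__main__":
    sample = """[BENCHMARK] hcclAllReduce(src!=dst, InputSizeInBytes=268435456,
count=67108864, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 512.957298 GB/s
[BENCHMARK]     Algo Bandwidth : 293.118456 GB/s
[BENCHMARK]     Time           : 0.915792 seconds
"""
    print(parse_benchmark(sample))
```

The parsed nw_bandwidth value can then be compared against the expected figure for the same collective and message size.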

Two HLS-3 Box Test Under a Single Leaf Switch

# 1st box:

cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test all_reduce --nranks 16 --loop 1000 --node_id 0 --size 256m \
    --ranks_per_node 8

# 2nd box:

cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test all_reduce --nranks 16 --loop 1000 --node_id 1 --size 256m \
    --ranks_per_node 8

[BENCHMARK] hcclAllReduce(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 549.816918 GB/s
[BENCHMARK]     Algo Bandwidth : 293.235689 GB/s
[BENCHMARK]     Time           : 0.915426 seconds
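The two commands above differ only in --node_id, with --nranks set to the number of boxes times --ranks_per_node. The sketch below generates one command per box on that pattern. It assumes the same pattern holds for 4, 8, or more boxes, which this section does not show explicitly; the helper name build_commands is hypothetical, and HLS_ID=0 is kept exactly as it appears in the commands above.

```python
def build_commands(test, num_boxes, size, comm_id, ranks_per_node=8, loops=1000):
    """Return one run_hccl_demo.py command line per box, indexed by node_id."""
    nranks = num_boxes * ranks_per_node  # e.g. 2 boxes x 8 ranks = 16
    commands = []
    for node_id in range(num_boxes):
        commands.append(
            f"HLS_ID=0 HCCL_COMM_ID={comm_id} python3 run_hccl_demo.py -clean "
            f"--test {test} --nranks {nranks} --loop {loops} "
            f"--node_id {node_id} --size {size} "
            f"--ranks_per_node {ranks_per_node}"
        )
    return commands


if __name__ == "__main__":
    for cmd in build_commands("all_reduce", 2, "256m", "10.111.233.253:5555"):
        print(cmd)
```

Each generated line would be run from $DEMOS_ROOT/gaudi/hccl_test on the corresponding box.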

# 1st box:

cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test reduce_scatter --nranks 16 --loop 1000 --node_id 0 --size 256m \
    --ranks_per_node 8

# 2nd box:

cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test reduce_scatter --nranks 16 --loop 1000 --node_id 1 --size 256m \
    --ranks_per_node 8

[BENCHMARK] hcclReduceScatter(src!=dst, InputSizeInBytes=268435456, count=67108864, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 543.741232 GB/s
[BENCHMARK]     Algo Bandwidth : 579.990648 GB/s
[BENCHMARK]     Time           : 0.462827 seconds

# 1st box:

cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test all_gather --nranks 16 --loop 1000 --node_id 0 --size 16m \
    --ranks_per_node 8

# 2nd box:

cd $DEMOS_ROOT/gaudi/hccl_test
HLS_ID=0 HCCL_COMM_ID=10.111.233.253:5555 python3 run_hccl_demo.py -clean \
    --test all_gather --nranks 16 --loop 1000 --node_id 1 --size 16m \
    --ranks_per_node 8

[BENCHMARK] hcclAllGather(src!=dst, InputSizeInBytes=16777216, count=4194304, dtype=float, iterations=1000)
[BENCHMARK]     NW Bandwidth   : 545.850580 GB/s
[BENCHMARK]     Algo Bandwidth : 36.390039 GB/s
[BENCHMARK]     Time           : 0.461039 seconds
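As a sanity check on a log, the Algo Bandwidth figures in this document are numerically consistent with InputSizeInBytes multiplied by the iteration count, divided by Time, expressed in decimal GB/s. This relationship is inferred from the numbers shown here rather than taken from a documented definition, and the function name algo_bandwidth_gbps is hypothetical.

```python
def algo_bandwidth_gbps(input_bytes, iterations, seconds):
    """Bytes moved per second across all iterations, in decimal GB/s (1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return input_bytes * iterations / seconds / 1e9


if __name__ == "__main__":
    # Two-box all_reduce run above: the log reports 293.235689 GB/s.
    print(round(algo_bandwidth_gbps(268435456, 1000, 0.915426), 2))
    # Two-box all_gather run above: the log reports 36.390039 GB/s.
    print(round(algo_bandwidth_gbps(16777216, 1000, 0.461039), 2))
```

A recomputed value that differs noticeably from the logged one suggests the log lines were mixed up between runs or truncated.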