NIC Isolation Configuration

This document describes the Habana Communication Library (HCL) runtime configuration mechanism for isolating specific Network Interface Card (NIC) ports while HCL is running. It is intended primarily for fault-injection testing, debugging, and maintenance scenarios, where you need to control which NIC ports the HCL communication stack considers available without rebooting the host.

Environment Variables

NIC isolation is implemented as a file-based control interface. The file paths can be defined using the following environment variables:

  • HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE: The protected NIC configuration file monitored by the fault-tolerance flow. It provides the requested NIC state per module and NIC. If the variable is not set, the default path is /tmp/nic_config.txt.

  • HCL_FAULT_TOLERANCE_NIC_REPORT_FILE: The report file that records NIC state change events and indications of when updates have been applied. If the variable is not set, the default path is /tmp/nic_report.txt.

You can also define HCL_FAULT_TOLERANCE_NIC_MONITOR_INTERVAL_SECONDS, which specifies how often, in seconds, the fault-tolerance monitor thread polls the NIC configuration file for changes. If the variable is not set, the default value is 60.
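
As a minimal sketch, the three variables can be exported in the shell that launches the HCL application, assuming they are read from the environment of that process. The two paths below are simply the documented defaults made explicit; the 10-second interval is an arbitrary illustrative value, not a recommendation.

```shell
# Make the documented default paths explicit.
export HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE=/tmp/nic_config.txt
export HCL_FAULT_TOLERANCE_NIC_REPORT_FILE=/tmp/nic_report.txt
# Illustrative value only: poll every 10 seconds instead of the default 60.
export HCL_FAULT_TOLERANCE_NIC_MONITOR_INTERVAL_SECONDS=10
```

A shorter interval means configuration changes are noticed sooner, at the cost of more frequent polling of the configuration file.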

Setup and File Permissions

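To create the configuration file with 644 permissions (writable by root only, readable by all users), run the following commands. The path falls back to the default if the environment variable is not set:

sudo touch "${HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE:-/tmp/nic_config.txt}"
sudo chmod 644 "${HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE:-/tmp/nic_config.txt}"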

The configuration file uses a line-oriented table format with three columns separated by the pipe character |, as in the following example:

module id | nic | status
0         | 20  | 1
0         | 21  | up
1         | 8   | 1
1         | 9   | up
2         | 8   | DOWN
2         | 5   | 0

Where:

  • module_id (the "module id" column) is the HCL module or device ID. For an 8-device system, valid values range from 0 to 7.

  • nic is the NIC port number. Valid values range from 0 to 23, depending on hardware.

  • status is the requested NIC state: UP, up, or 1 to enable the port; DOWN, down, or 0 to disable it.

In this example:

  • Module 0 has NIC ports 20 and 21 enabled

  • Module 1 has NIC ports 8 and 9 enabled

  • Module 2 has NIC ports 8 and 5 disabled

If the configuration file does not exist, HCL runs normally with all NIC ports in their default enabled state.

The report file is created by HCL.

Updating NIC Port States

To isolate selected NIC ports, update the configuration file with the desired status for each port:

  • To disable a NIC port, set the status to DOWN, down, or 0

  • To enable a NIC port, set the status to UP, up, or 1

For example, the following command disables NIC ports 20 and 21 on module 0:

cat > "${HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE:-/tmp/nic_config.txt}" << 'EOF'
module id | nic | status
0         | 20  | DOWN
0         | 21  | DOWN
EOF
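
When the same status is requested for several ports, the table can also be generated rather than typed by hand. The function below is a hypothetical helper, not part of HCL; it only prints a table in the layout shown above.

```shell
# Hypothetical helper (not part of HCL): print a NIC configuration table
# that requests the same status for several NIC ports of one module.
# Usage: make_nic_table MODULE_ID STATUS NIC [NIC...]
make_nic_table() {
    module_id=$1
    status=$2
    shift 2
    # Header row, then one padded row per NIC port.
    printf 'module id | nic | status\n'
    for nic in "$@"; do
        printf '%-9s | %-3s | %s\n' "$module_id" "$nic" "$status"
    done
}

# Example: request ports 20 and 21 on module 0 to go down.
make_nic_table 0 DOWN 20 21
```

Redirecting the output to the configuration file path (with sufficient permissions to write it) has the same effect as the here-document above.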

While updating the file, ensure you meet the parsing rules so that validation succeeds:

  • All NICs belonging to the same logical port (0, 1, or 2) must be in the same state (all UP or all DOWN). Mixed states within a logical port are not allowed: HCL treats such a configuration as invalid, ignores it, and writes an error to both the log and the report file.

  • At most one logical port may be DOWN at a time.

  • Lines starting with # are ignored.

  • Non-numeric characters are ignored.

  • Extra whitespace around values is ignored.

  • Each HLS node only processes entries matching its own module_id.

  • Rows with invalid format or NIC IDs are skipped with a debug-level log message.

  • At startup, the default NIC port state is UP. If a NIC port does not appear in the update table, the previous state is kept.
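
Because malformed rows are skipped with only a debug-level message, it can help to check a table before HCL picks it up. The function below is a hypothetical pre-check, not HCL's parser. It assumes an 8-device system (module IDs 0 to 7) and NIC ports 0 to 23, accepts either "module id" or "module_id" as the header, and does not check the logical-port rules, since the mapping of NICs to logical ports is not given here.

```shell
# Hypothetical pre-check (not HCL's parser): report rows that do not fit
# the documented three-column table format.
# Usage: check_nic_table CONFIG_FILE   (exit status 1 if any row is reported)
check_nic_table() {
    awk -F'|' '
        BEGIN { bad = 0 }
        /^[ \t]*#/ || /^[ \t]*$/ { next }        # skip comments and blank lines
        {
            line = $0                            # keep the row as written
            for (i = 1; i <= NF; i++)            # trim whitespace around values
                gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        NF == 3 && $1 ~ /^module[ _]id$/ { next }    # skip the header row
        NF != 3 || $1 !~ /^[0-7]$/ || $2 !~ /^[0-9]+$/ || $2 + 0 > 23 ||
        $3 !~ /^(UP|up|1|DOWN|down|0)$/ {
            printf "line %d: unexpected row: %s\n", NR, line
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, `check_nic_table /tmp/nic_config.txt` prints nothing and returns 0 when every row fits the assumed format.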

Report File

The report file lists all NIC state change events and indicates when updates have been applied, as in this example:

moduleId=0, startTimeMs=[10:14:53.484334], nic=20, newState=DOWN, endTimeMs=[10:14:59.097266]
moduleId=2, startTimeMs=[10:14:53.484257], nic=8, newState=DOWN, endTimeMs=[10:14:59.097792]
moduleId=4, startTimeMs=[10:14:53.482186], nic=15, newState=DOWN, endTimeMs=[10:14:59.098569]
moduleId=7, startTimeMs=[10:14:53.484370], nic=15, newState=DOWN, endTimeMs=[10:14:59.099556]
moduleId=6, startTimeMs=[10:14:53.484441], nic=3, newState=DOWN, endTimeMs=[10:14:59.138026]
moduleId=3, startTimeMs=[10:14:53.484435], nic=20, newState=DOWN, endTimeMs=[10:14:59.167210]
moduleId=1, startTimeMs=[10:14:53.484450], nic=8, newState=DOWN, endTimeMs=[10:14:59.209512]
moduleId=5, startTimeMs=[10:14:53.484435], nic=3, newState=DOWN, endTimeMs=[10:14:59.218237]
***********************************************
*** HOST UPDATE COMPLETED [10:14:59.218345] ***
***********************************************
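
A script that edits the configuration file may want to wait for this banner before continuing. The two functions below are hypothetical helpers, not part of HCL, and assume the report line layout shown above. The 120-second default timeout is an illustrative choice that allows for the default 60-second polling interval. The wait function does not distinguish a new banner from one left by an earlier update.

```shell
# Hypothetical helper (not part of HCL): wait until the report file contains
# a "HOST UPDATE COMPLETED" banner, or give up after TIMEOUT_SECONDS.
# Usage: wait_for_host_update REPORT_FILE [TIMEOUT_SECONDS]
wait_for_host_update() {
    report_file=$1
    timeout=${2:-120}
    waited=0
    while :; do
        if grep -q 'HOST UPDATE COMPLETED' "$report_file" 2>/dev/null; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Hypothetical helper: print "moduleId nic newState" for each event line.
# Usage: list_nic_events REPORT_FILE
list_nic_events() {
    sed -n 's/^moduleId=\([0-9]*\),.* nic=\([0-9]*\), newState=\([A-Za-z]*\),.*$/\1 \2 \3/p' "$1"
}
```

For example, `wait_for_host_update /tmp/nic_report.txt && list_nic_events /tmp/nic_report.txt` prints one line per recorded event once a banner is present.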

Logging and Diagnostics

HCL may log the following messages when the application starts or when it detects a configuration update:

Files found:

[INFO] NIC config file found: HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE. Starting dynamic monitoring.

Files not found (normal operation):

[INFO] NIC config files not found. Dynamic NIC configuration monitoring is disabled.

Update detected:

[INFO] NIC config update file detected. Processing changes...
[INFO] Successfully processed NIC configuration update.

Common Use Cases

This feature may be especially useful in the following scenarios:

  • Temporarily disabling one or more NIC ports for fault-injection testing or hardware maintenance.

  • Evaluating system behavior when network capacity is reduced.

  • Isolating network issues by selectively disabling problematic NIC ports.

  • Allowing non-root external tools to update NIC states.