NIC Isolation Configuration¶
This document describes the Habana Communication Library (HCL) runtime configuration mechanism for isolating specific Network Interface Card (NIC) ports while HCL is running. It is primarily intended for fault-injection testing, debugging, and maintenance scenarios in which you need to control which NIC ports the HCL communication stack considers available, without rebooting the host.
Environment Variables¶
NIC isolation is implemented as a file-based control interface. The file paths can be defined using the following environment variables:
HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE: The protected NIC configuration file monitored by the fault-tolerance flow. It provides the requested NIC state per module and NIC. If the file path is not defined as an environment variable, the default value is /tmp/nic_config.txt.
HCL_FAULT_TOLERANCE_NIC_REPORT_FILE: The report file for recording NIC state change events and indications when updates have been applied. If the file path is not defined as an environment variable, the default value is /tmp/nic_report.txt.
You can also define HCL_FAULT_TOLERANCE_NIC_MONITOR_INTERVAL_SECONDS, which specifies how often, in seconds, the fault-tolerance monitor thread polls the NIC configuration file for changes. If the variable is not set, the default value is 60.
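The defaults described above can be summarized in a short Python sketch that resolves the effective file paths and polling interval for a given environment. The `NicIsolationSettings` type and `resolve_settings` function are hypothetical helpers for external tooling, not part of HCL, and how HCL itself treats empty or non-integer values is not covered by this document.

```python
import os
from dataclasses import dataclass

# Defaults as stated in this document.
DEFAULT_CONFIG_FILE = "/tmp/nic_config.txt"
DEFAULT_REPORT_FILE = "/tmp/nic_report.txt"
DEFAULT_INTERVAL_SECONDS = 60


@dataclass(frozen=True)
class NicIsolationSettings:
    """Hypothetical container for the three settings (not an HCL type)."""
    config_file: str
    report_file: str
    monitor_interval_seconds: int


def resolve_settings(env=None):
    """Return the effective settings for `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return NicIsolationSettings(
        config_file=env.get("HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE", DEFAULT_CONFIG_FILE),
        report_file=env.get("HCL_FAULT_TOLERANCE_NIC_REPORT_FILE", DEFAULT_REPORT_FILE),
        # Assumption: the interval is a whole number of seconds.
        monitor_interval_seconds=int(
            env.get("HCL_FAULT_TOLERANCE_NIC_MONITOR_INTERVAL_SECONDS", DEFAULT_INTERVAL_SECONDS)
        ),
    )
```

A tool that edits the configuration file can call `resolve_settings()` so that it writes to the same path HCL is monitoring, provided it runs with the same environment as the HCL process.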
Setup and File Permissions¶
To create the configuration file with 644 permissions (writable by root only, readable by all users), use the following commands:
sudo touch HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE
sudo chmod 644 HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE
The configuration file uses a line-oriented table format with three columns separated by the pipe character |, as in the following example:
module id | nic | status
0 | 20 | 1
0 | 21 | up
1 | 8 | 1
1 | 9 | up
2 | 8 | DOWN
2 | 5 | 0
Where:
module_id is the HCL module or device ID. For an 8-device system, valid values range from 0 to 7.
nic is the NIC port number. Valid values range from 0 to 23, depending on hardware.
status is the NIC state.
In this example:
Module 0 has NIC ports 20 and 21 enabled
Module 1 has NIC ports 8 and 9 enabled
Module 2 has NIC ports 8 and 5 disabled
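For tools that generate the table programmatically, the following Python sketch renders a mapping of (module_id, nic) pairs to the pipe-separated layout shown in the example. The `render_nic_config` function is a hypothetical helper, not an HCL API. It always writes the numeric status form (1 for enabled, 0 for disabled) and reuses the header line exactly as it appears in the example.

```python
HEADER = "module id | nic | status"  # header line copied from the example table


def render_nic_config(states):
    """Render {(module_id, nic): is_up} as a pipe-separated table.

    Hypothetical helper: rows are sorted by module and NIC for readability;
    this document does not state that ordering matters to HCL.
    """
    lines = [HEADER]
    for (module_id, nic), is_up in sorted(states.items()):
        lines.append(f"{module_id} | {nic} | {1 if is_up else 0}")
    return "\n".join(lines) + "\n"
```

Passing `{(0, 20): True, (2, 8): False}` yields a header line followed by one row per entry, which can then be written to the configuration file with the appropriate privileges.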
If the configuration files do not exist, HCL runs normally with all NIC ports in their default enabled state.
The report file is created by HCL.
Updating NIC Port States¶
To isolate selected NIC ports, update the configuration file with the desired status for each port:
To disable a NIC port, set the status to DOWN, down, or 0
To enable a NIC port, set the status to UP, up, or 1
For example, the following command disables NIC ports 20 and 21 on module 0:
cat > HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE << 'EOF'
module id | nic | status
0 | 20 | DOWN
0 | 21 | DOWN
EOF
While updating the file, ensure you meet the parsing rules so that validation succeeds:
All NICs belonging to the same logical port (0, 1, or 2) must be in the same state (all UP or all DOWN). Mixed states within a logical port are not allowed; such configurations are treated as invalid and ignored, and HCL writes an error to both the log and the report file.
At most one logical port may be DOWN at a time.
Lines starting with # are ignored.
Non-numeric characters are ignored.
Extra whitespace around values is ignored.
Each HLS node only processes entries matching its own module_id.
Rows with invalid format or NIC IDs are skipped with a debug-level log message.
At startup, the default NIC port state is UP. If a NIC port does not appear in the update table, the previous state is kept.
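Several of the rules above can be illustrated with a Python sketch that applies a configuration table to the state of one module. This is an approximation for pre-checking a file before writing it, not HCL's actual parser. The `apply_nic_config` function is hypothetical, the case-insensitive status matching is an assumption, and the logical-port rules are left out because the mapping of NICs to logical ports is not given in this document.

```python
_UP = {"up", "1"}
_DOWN = {"down", "0"}
MAX_NIC = 23  # upper bound stated in this document; the real limit depends on hardware


def apply_nic_config(text, own_module_id, previous=None):
    """Return {nic: is_up} for one module after applying a config table.

    Sketch only: logical-port validation is omitted (mapping unknown).
    """
    if previous is None:
        # At startup every NIC port defaults to UP.
        state = {nic: True for nic in range(MAX_NIC + 1)}
    else:
        # NICs that do not appear in the table keep their previous state.
        state = dict(previous)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # invalid format: row is skipped
        try:
            module_id, nic = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # header row or malformed IDs: row is skipped
        token = fields[2].lower()  # assumption: status is case-insensitive
        if token in _UP:
            is_up = True
        elif token in _DOWN:
            is_up = False
        else:
            continue  # unknown status: row is skipped
        if module_id != own_module_id or not 0 <= nic <= MAX_NIC:
            continue  # other modules and out-of-range NIC IDs are not applied
        state[nic] = is_up
    return state
```

Calling the function a second time with the result of the first call as `previous` models an incremental update: only the NICs listed in the new table change state.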
Report File¶
In the report file, you can find all NIC state change events and indications when updates have been applied, as in this example:
moduleId=0, startTimeMs=[10:14:53.484334], nic=20, newState=DOWN, endTimeMs=[10:14:59.097266]
moduleId=2, startTimeMs=[10:14:53.484257], nic=8, newState=DOWN, endTimeMs=[10:14:59.097792]
moduleId=4, startTimeMs=[10:14:53.482186], nic=15, newState=DOWN, endTimeMs=[10:14:59.098569]
moduleId=7, startTimeMs=[10:14:53.484370], nic=15, newState=DOWN, endTimeMs=[10:14:59.099556]
moduleId=6, startTimeMs=[10:14:53.484441], nic=3, newState=DOWN, endTimeMs=[10:14:59.138026]
moduleId=3, startTimeMs=[10:14:53.484435], nic=20, newState=DOWN, endTimeMs=[10:14:59.167210]
moduleId=1, startTimeMs=[10:14:53.484450], nic=8, newState=DOWN, endTimeMs=[10:14:59.209512]
moduleId=5, startTimeMs=[10:14:53.484435], nic=3, newState=DOWN, endTimeMs=[10:14:59.218237]
***********************************************
*** HOST UPDATE COMPLETED [10:14:59.218345] ***
***********************************************
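An external tool may want to confirm that an update has been applied before continuing. The Python sketch below extracts the per-NIC events and the completion timestamp from report text. The line format is inferred solely from the example above, so the regular expressions and the `parse_report` function are assumptions rather than a documented schema. Other entries, such as error messages, are not matched.

```python
import re

# Patterns inferred from the example report; not a documented format.
_EVENT = re.compile(
    r"moduleId=(?P<module>\d+), startTimeMs=\[(?P<start>[^\]]+)\], "
    r"nic=(?P<nic>\d+), newState=(?P<state>\w+), endTimeMs=\[(?P<end>[^\]]+)\]"
)
_COMPLETED = re.compile(r"HOST UPDATE COMPLETED \[(?P<time>[^\]]+)\]")


def parse_report(text):
    """Return (events, completed_at) parsed from report file text.

    `events` is a list of dicts, one per state-change line; `completed_at`
    is the timestamp of the last completion banner, or None if absent.
    """
    events, completed_at = [], None
    for line in text.splitlines():
        match = _EVENT.search(line)
        if match:
            events.append({
                "module_id": int(match["module"]),
                "nic": int(match["nic"]),
                "state": match["state"],
                "start": match["start"],
                "end": match["end"],
            })
            continue
        match = _COMPLETED.search(line)
        if match:
            completed_at = match["time"]
    return events, completed_at
```

A polling loop could re-read the report file until `completed_at` is no longer `None`, then compare the returned events with the requested changes.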
Logging and Diagnostics¶
HCL may write the following log messages when the application starts or when a configuration update is detected:
Files found:
[INFO] NIC config file found: HCL_FAULT_TOLERANCE_NIC_CONFIG_FILE. Starting dynamic monitoring.
Files not found (normal operation):
[INFO] NIC config files not found. Dynamic NIC configuration monitoring is disabled.
Update detected:
[INFO] NIC config update file detected. Processing changes...
[INFO] Successfully processed NIC configuration update.
Common Use Cases¶
This feature may be especially useful in the following scenarios:
Temporarily disabling one or more NIC ports for fault injection testing or hardware maintenance scenarios.
Evaluating system behavior when network capacity is reduced.
Isolating network issues by selectively disabling problematic NIC ports.
Allowing non-root external tools to update NIC states.