In NSX 6.2 we introduced a new feature that lets users check the communication channels between NSX Manager, the control plane agent (netcpa) and the firewall agent. If a channel is broken, NSX Manager will perform a sync operation to attempt to recover. The following communication channels are checked, at the intervals listed below.
- NSX Manager to Firewall Agent – A heartbeat is sent every 3 minutes; if two iterations are lost, a sync will occur
- NSX Manager to Control Plane Agent – A heartbeat is sent every 2 minutes; if two iterations are lost, a sync will occur
- Host to Controller – Heartbeats are sent every 30 seconds; if three iterations are lost, a sync will occur
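The intervals above translate into a rough worst-case delay before a sync is triggered. The snippet below is only an illustrative sketch of that arithmetic (interval multiplied by the number of lost iterations); the dictionary and function names are made up for this example, and the real delay depends on where in the heartbeat cycle the failure happens.

```python
# Heartbeat interval (seconds) and lost-iteration threshold per channel,
# copied from the list above. Names are illustrative, not NSX identifiers.
CHANNELS = {
    "NSX Manager to Firewall Agent": (180, 2),
    "NSX Manager to Control Plane Agent": (120, 2),
    "Host to Controller": (30, 3),
}


def seconds_until_sync(interval_s, lost_iterations):
    """Approximate time before the lost heartbeats trigger a sync."""
    return interval_s * lost_iterations


if __name__ == "__main__":
    for name, (interval_s, lost) in CHANNELS.items():
        delay = seconds_until_sync(interval_s, lost)
        print(f"{name}: roughly {delay} seconds ({delay / 60:g} minutes)")
```

For the control plane agent this gives 240 seconds, which matches the roughly 4 minutes mentioned later in this post.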
To verify the connection, log into the vSphere Web Client and navigate to Networking & Security -> Installation -> Host Preparation -> Actions -> Communication Channel Health.
As you can see from the image, everything in this specific cluster is healthy and all channels are up.
Let’s play around with stopping the firewall agent and the control plane agent to see the new status. Log into the ESXi host and run the following commands.
[root@esx-01a:/var/log] /etc/init.d/netcpad status
netCP agent service is running
[root@esx-01a:/var/log] /etc/init.d/netcpad stop
watchdog-netcpa: Terminating watchdog process with PID 35036
Memory reservation released for netcpa
netCP agent service is stopped
As you can see above, I have stopped the netcpad service. If I run the channel check again, you will see the NSX Manager to Control Plane Agent channel reported as Down.
Note: As mentioned, it will take approximately 4 minutes to update the status as it will have to lose two heartbeats at 120 seconds each.
I will also stop the firewall agent, and you will see the NSX Manager to Firewall Agent channel reported as Down.
[root@esx-01a:/var/log] /etc/init.d/vShield-Stateful-Firewall status
vShield-Stateful-Firewall is running
[root@esx-01a:/var/log] /etc/init.d/vShield-Stateful-Firewall stop
watchdog-vShield-Stateful-Firewall: Terminating watchdog process with PID 35474
vShield-Stateful-Firewall stopped
watchdog-dfwpktlogs: Terminating watchdog process with PID 35454
Resource pool 'host/vim/vmvisor/vsfwd' released.
If you prefer to use REST API calls to check the communication health, you can do so as shown below, where host-28 is the ID of the host in this example. The call is GET https://nsxmanager/api/2.0/vdn/inventory/host/host-28/connection/status
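If you want to script the call above, here is a minimal Python sketch using only the standard library. It assumes HTTP basic authentication and an XML response body, and it simply prints whatever child elements come back rather than relying on specific field names. The helper names (build_status_request, flatten_xml, fetch_status) and the credentials are hypothetical; adjust them for your environment.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET


def build_status_request(manager, host_id, user, password):
    """Build the GET request for the connection status of one host."""
    url = (f"https://{manager}/api/2.0/vdn/inventory/host/"
           f"{host_id}/connection/status")
    # Assumption: the NSX Manager accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/xml")
    return request


def flatten_xml(xml_text):
    """Return {tag: text} for each direct child of the XML root element."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def fetch_status(manager, host_id, user, password, context=None):
    """Send the request and return the flattened response.

    Pass an ssl.SSLContext as `context` if the manager certificate
    needs special handling (for example a private CA).
    """
    request = build_status_request(manager, host_id, user, password)
    with urllib.request.urlopen(request, context=context, timeout=30) as resp:
        return flatten_xml(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder values; replace with your own manager, host ID and login.
    for tag, value in fetch_status("nsxmanager", "host-28",
                                   "admin", "password").items():
        print(f"{tag}: {value}")
```

Keeping the request building separate from the network call makes it easy to check the URL and headers without touching a live NSX Manager.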
It’s possible to change the default interval at which the host sends heartbeats by editing the file and line shown below. The value is in seconds. You can also change the other intervals, however you will need to call support for this, as it requires root shell access to the NSX Manager. Sorry 🙂
[root@esx-01a:/var/log] cat /usr/lib/vmware/netcpa/etc/netcpa.xml | grep Heart
To troubleshoot, or to find out whether you have had any connection issues or forced sync events, look in the logs below for the following entries.
NSX Manager Log (show log manager): Messages lost for application:
netcpa.log (/var/log/netcpa.log): Got mismatched VSM seqNum
If you received these error messages, you should see a sync shortly after. The sync message reported in the netcpa log is “Received full sync notification message”.
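If you have copies of these logs to hand, a small script can pick out the entries mentioned above. This is just a sketch: it does a plain substring match on the three messages quoted in this post, and the labels and function names are my own rather than anything NSX provides.

```python
# Log messages quoted in this post, mapped to a short description.
MARKERS = {
    "Messages lost for application:": "heartbeat loss (NSX Manager log)",
    "Got mismatched VSM seqNum": "sequence mismatch (netcpa.log)",
    "Received full sync notification message": "full sync (netcpa.log)",
}


def find_sync_events(lines):
    """Return (line_number, label, line) for every line containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for marker, label in MARKERS.items():
            if marker in line:
                hits.append((number, label, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a saved log file, e.g. a copy of /var/log/netcpa.log."""
    with open(path, errors="replace") as handle:
        return find_sync_events(handle)


if __name__ == "__main__":
    import sys
    for log_path in sys.argv[1:]:
        for number, label, line in scan_file(log_path):
            print(f"{log_path}:{number}: [{label}] {line}")
```

Run it with one or more log file paths as arguments; a sequence mismatch followed shortly by a full sync entry is the pattern described above.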
I have had customers ask me if something existed to check the health of the NSX components, and sure enough it does! I am hoping these checks help point everyone in the right direction if something is not working. Please feel free to let me know if you have any questions or comments!