Troubleshoot NSX Controller cluster status, roles and connectivity
VMware’s recommendation is to deploy NSX Controllers in odd numbers, three or greater. I have also heard, from my recent NSX ICM class, that three controllers are sufficient for most environments; there is rarely a need for more. However, I wouldn’t recommend anything fewer than three, because otherwise you won’t have any redundancy, and the NSX Controllers are the control plane for your NSX network, so redundancy matters. The reason you need three and not two is majority election: the cluster must be able to elect a majority of NSX Controllers, and with only two you can run into a split-brain scenario.
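A quick way to see why two controllers buy you nothing over one: a majority quorum is N/2 + 1 (integer division), and the cluster survives only as many failures as N minus that quorum. A minimal sketch:

```shell
# Majority quorum for a cluster of N controllers is N/2 + 1 (integer
# division); the cluster tolerates N minus quorum failures.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "$n controllers: quorum=$quorum, tolerates $(( n - quorum )) failure(s)"
done
```

With two controllers the quorum is still two, so losing either node stalls the cluster; three is the smallest size that tolerates a failure.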
Troubleshooting NSX Controller connectivity isn’t too difficult. In fact, if a controller is misbehaving, the quickest and easiest method is often to delete the rogue controller and deploy a new one, although you can also attempt to repair the troublesome controller. The first thing you want to do is confirm the status in the GUI by navigating to Networking & Security -> Installation -> Management. You should see a Status next to each of the NSX Controller nodes.
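If you prefer not to click through the GUI, the same controller status can be queried from the NSX Manager REST API. This is a sketch, assuming NSX for vSphere and its `/api/2.0/vdn/controller` endpoint; the hostname and credentials are placeholders for your own environment:

```shell
# Placeholder NSX Manager hostname; set NSX_PASSWORD in your environment.
NSX_MGR="nsxmgr.corp.local"

# Query the controller list from NSX Manager and pull out each node's
# status element from the XML response (assumed endpoint for NSX-v).
check_controllers() {
  curl -sk -u "admin:${NSX_PASSWORD}" \
    "https://${NSX_MGR}/api/2.0/vdn/controller" \
    | grep -oE '<status>[^<]+</status>'
}

# check_controllers   # healthy nodes should report a RUNNING status
```

The function is left uninvoked here since it needs a live NSX Manager; run `check_controllers` in your environment to compare against what the GUI shows.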
As you can see, one of my controllers is showing Disconnected. The first thing we want to do is find the master node. The easiest way to do this is to run the following command on each controller.
nsx-controller # show control-cluster roles
                    Listen-IP       Master?  Last-Changed    Count
api_provider        Not configured  No       05/02 19:17:24  13
persistence_server  N/A             No       05/02 19:17:24  10
switch_manager      127.0.0.1       No       05/02 19:17:24  13
logical_manager     N/A             No       05/02 19:17:24  13
directory_server    N/A             No       05/02 19:17:24  13
As you can see, the controller above is NOT the master. Logging into another controller and running the same command, I found the master.
nsx-controller # show control-cluster roles
                    Listen-IP       Master?  Last-Changed    Count
api_provider        Not configured  Yes      05/02 19:17:23  3
persistence_server  N/A             Yes      05/02 19:17:24  3
switch_manager      127.0.0.1       Yes      05/02 19:17:23  3
logical_manager     N/A             Yes      05/02 19:17:23  3
directory_server    N/A             Yes      05/02 19:17:23  3
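Eyeballing the Master? column on every controller gets tedious in larger clusters. A minimal sketch of scripting the check, assuming you have captured the `show control-cluster roles` output into a variable (the sample below is the non-master output from earlier):

```shell
# Role lines from 'show control-cluster roles' on the non-master controller
# shown earlier; in practice you would capture this from each node.
roles_output='api_provider Not configured No 05/02 19:17:24 13
persistence_server N/A No 05/02 19:17:24 10
switch_manager 127.0.0.1 No 05/02 19:17:24 13
logical_manager N/A No 05/02 19:17:24 13
directory_server N/A No 05/02 19:17:24 13'

# The Master? column is always the fourth field from the end of each line,
# regardless of whether Listen-IP is one word (N/A) or two (Not configured).
if echo "$roles_output" | awk '{print $(NF-3)}' | grep -qx Yes; then
  echo "This node is the cluster master"
else
  echo "Not the master - check the other controllers"
fi
```

For the sample above this reports that the node is not the master, matching what we saw by hand.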
To recover from a failure, you can attempt to run the command below to see whether the node will rejoin the cluster.
nsx-controller # join control-cluster 192.168.18.32 force
Clearing controller state and restarting
Stopping nicira-nvp-controller: [Done]
Clearing nicira-nvp-controller's state: OK
Starting nicira-nvp-controller: CLI revert file already exists
mapping eth0 -> bridged-pif
ssh stop/waiting
ssh start/running, process 12242
mapping breth0 -> eth0
mapping breth0 -> eth0
ssh stop/waiting
ssh start/running, process 12360
Setting core limit to unlimited
Setting file descriptor limit to 100000
nicira-nvp-controller [OK]
** Watching control-cluster history; ctrl-c to exit **
===================================
Host nsx-controller
Node 33efdea4-a465-4496-8d0b-81c25c238d7a (192.168.18.31)
---------------------------------
05/02 19:21:20: Joining cluster via node 192.168.18.32
05/02 19:21:20: Waiting to join cluster
05/02 19:21:52: Joined cluster; initializing local components
05/02 19:21:53: Initializing data contact with cluster
05/02 19:22:13: Fetching initial configuration data
05/02 19:22:31: Join complete
nsx-controller # show control-cluster startup-nodes
192.168.18.31, 192.168.18.32
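Every controller should report the same startup-node list; a stale or partially joined node will often disagree. A minimal sketch of comparing two lists after normalizing order and whitespace (the first list is from the output above, the second is a hypothetical sample from another controller):

```shell
# Startup-node list from the controller above, plus a hypothetical list
# from a second controller with the entries in a different order.
node1="192.168.18.31, 192.168.18.32"
node2="192.168.18.32, 192.168.18.31"

# Strip spaces, split on commas, sort, and rejoin so ordering differences
# do not trigger a false mismatch.
normalize() { tr -d ' ' | tr ',' '\n' | sort | paste -sd, -; }

if [ "$(echo "$node1" | normalize)" = "$(echo "$node2" | normalize)" ]; then
  echo "startup-node lists match"
else
  echo "MISMATCH - investigate the diverging controller"
fi
```

If the lists genuinely diverge, that controller is a good candidate for the force-rejoin (or delete-and-redeploy) steps above.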