Troubleshoot NSX Controller cluster status, roles and connectivity
VMware’s recommendation is to deploy NSX Controllers in odd numbers, three or more. In my recent NSX ICM class I also heard that three controllers are sufficient for most environments and there is rarely a need for more. I wouldn’t recommend fewer than three, though: the NSX Controllers are the control plane for your NSX network, so redundancy matters. The reason you need three and not two is that the cluster relies on a majority election. With two controllers, losing one node (or the link between them) leaves neither side with a majority, and you can run into a split-brain scenario.
Troubleshooting NSX Controller connectivity isn’t too difficult. If a controller misbehaves, you can either delete the rogue controller and deploy a new one, which is probably the easiest and quickest method, or attempt to repair the troublesome controller. The first thing to do is confirm the status in the GUI by navigating to Networking & Security -> Installation -> Management. You should see a Status next to each of the NSX Controller nodes.
In my case, one of my controllers is showing Disconnected. The first thing to try is finding the master node. The easiest way to do this is to run the following command on each controller.
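To make the majority argument concrete, here is a minimal Python sketch of the arithmetic. It is a generic illustration of majority quorums, not anything taken from the NSX Controller code, and the function names `quorum` and `tolerated_failures` are my own.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Number of nodes that can fail while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5):
        print(
            f"{size} controllers: majority={quorum(size)}, "
            f"tolerated failures={tolerated_failures(size)}"
        )
```

Two controllers tolerate zero failures, exactly like one controller, and four tolerate no more than three do, which is why odd numbers are the sensible choice.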
nsx-controller # show control-cluster roles
                          Listen-IP  Master?    Last-Changed  Count
api_provider         Not configured       No  05/02 19:17:24     13
persistence_server              N/A       No  05/02 19:17:24     10
switch_manager            127.0.0.1       No  05/02 19:17:24     13
logical_manager                 N/A       No  05/02 19:17:24     13
directory_server                N/A       No  05/02 19:17:24     13
As the Master? column shows, the controller above is NOT the master. Logging into another controller and running the same command, I found the master.
nsx-controller # show control-cluster roles
                          Listen-IP  Master?    Last-Changed  Count
api_provider         Not configured      Yes  05/02 19:17:23      3
persistence_server              N/A      Yes  05/02 19:17:24      3
switch_manager            127.0.0.1      Yes  05/02 19:17:23      3
logical_manager                 N/A      Yes  05/02 19:17:23      3
directory_server                N/A      Yes  05/02 19:17:23      3
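If you have several controllers to check, you could paste each output into a small script instead of reading the Master? column by eye. The Python sketch below is only an illustration: `parse_roles` is a hypothetical helper of mine, and it assumes every data row ends with the Master? value, a date, a time and a count, as in the output above.

```python
SAMPLE = """\
nsx-controller # show control-cluster roles
                          Listen-IP  Master?    Last-Changed  Count
api_provider         Not configured      Yes  05/02 19:17:23      3
persistence_server              N/A      Yes  05/02 19:17:24      3
switch_manager            127.0.0.1      Yes  05/02 19:17:23      3
logical_manager                 N/A      Yes  05/02 19:17:23      3
directory_server                N/A      Yes  05/02 19:17:23      3
"""


def parse_roles(output: str) -> dict:
    """Map each role name to True if this node is master for that role."""
    roles = {}
    for line in output.splitlines():
        fields = line.split()
        # Assumed row layout: <role> <listen-ip...> <Yes|No> <date> <time> <count>.
        # Counting from the end copes with "Not configured" being two words.
        if len(fields) >= 5 and fields[-4] in ("Yes", "No"):
            roles[fields[0]] = fields[-4] == "Yes"
    return roles


if __name__ == "__main__":
    roles = parse_roles(SAMPLE)
    masters = [name for name, is_master in roles.items() if is_master]
    print("Master for:", ", ".join(masters) if masters else "no roles")
```

The result is reported per role rather than as a single yes or no, because the output lists a separate Master? value for every role.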
To recover from the failure, you can run the command below on the disconnected node, passing the IP address of the master controller, to see if the node will rejoin the cluster.
nsx-controller # join control-cluster 192.168.18.32 force
Clearing controller state and restarting
Stopping nicira-nvp-controller: [Done]
Clearing nicira-nvp-controller's state: OK
Starting nicira-nvp-controller: CLI revert file already exists
mapping eth0 -> bridged-pif
ssh stop/waiting
ssh start/running, process 12242
mapping breth0 -> eth0
mapping breth0 -> eth0
ssh stop/waiting
ssh start/running, process 12360
Setting core limit to unlimited
Setting file descriptor limit to 100000
nicira-nvp-controller [OK]
** Watching control-cluster history; ctrl-c to exit **
===================================
Host nsx-controller
Node 33efdea4-a465-4496-8d0b-81c25c238d7a (192.168.18.31)
---------------------------------
05/02 19:21:20: Joining cluster via node 192.168.18.32
05/02 19:21:20: Waiting to join cluster
05/02 19:21:52: Joined cluster; initializing local components
05/02 19:21:53: Initializing data contact with cluster
05/02 19:22:13: Fetching initial configuration data
05/02 19:22:31: Join complete
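If you capture that history in a log, a few lines of Python can confirm that the join finished and show how long it took. This is a rough sketch and not part of any NSX tooling: `parse_history` is a hypothetical helper, it assumes each event line looks like "MM/DD HH:MM:SS: message", and because the log carries no year, the `year` parameter is a placeholder used only to build valid timestamps.

```python
import re
from datetime import datetime

HISTORY = """\
05/02 19:21:20: Joining cluster via node 192.168.18.32
05/02 19:21:20: Waiting to join cluster
05/02 19:21:52: Joined cluster; initializing local components
05/02 19:21:53: Initializing data contact with cluster
05/02 19:22:13: Fetching initial configuration data
05/02 19:22:31: Join complete
"""

# Assumed event format: "MM/DD HH:MM:SS: message"
EVENT = re.compile(r"(\d{2}/\d{2} \d{2}:\d{2}:\d{2}): (.+)")


def parse_history(output: str, year: int = 2000) -> list:
    """Return (timestamp, message) tuples for every event line found."""
    events = []
    for line in output.splitlines():
        match = EVENT.match(line.strip())
        if match:
            # The year is a placeholder; only differences between times matter.
            when = datetime.strptime(
                f"{year}/{match.group(1)}", "%Y/%m/%d %H:%M:%S"
            )
            events.append((when, match.group(2)))
    return events


if __name__ == "__main__":
    events = parse_history(HISTORY)
    joined = bool(events) and events[-1][1] == "Join complete"
    elapsed = (events[-1][0] - events[0][0]).total_seconds() if events else 0
    print(f"join complete: {joined}, elapsed: {elapsed:.0f}s")
```

For the run above, the first and last events are 71 seconds apart.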
Success! The node has rejoined the cluster, and the status shows Normal on both of my nodes. You can also run some additional commands to confirm the cluster status.
nsx-controller # show control-cluster startup-nodes
192.168.18.31, 192.168.18.32
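One simple check is to compare that list against the controllers you expect to be in the cluster. Here is a small Python sketch, assuming the output is a single comma-separated line of IP addresses as above; `parse_startup_nodes` and `missing_nodes` are hypothetical helper names, and the expected list is whatever you deployed.

```python
import ipaddress


def parse_startup_nodes(output: str) -> set:
    """Parse a comma-separated line of IP addresses into a set."""
    return {
        ipaddress.ip_address(part.strip())
        for part in output.split(",")
        if part.strip()
    }


def missing_nodes(output: str, expected: list) -> set:
    """Return the expected controller IPs absent from the startup-nodes list."""
    wanted = {ipaddress.ip_address(ip) for ip in expected}
    return wanted - parse_startup_nodes(output)


if __name__ == "__main__":
    output = "192.168.18.31, 192.168.18.32"
    expected = ["192.168.18.31", "192.168.18.32"]
    missing = missing_nodes(output, expected)
    print("missing controllers:", sorted(str(ip) for ip in missing) or "none")
```

Using `ipaddress` rather than plain string comparison also rejects a malformed address, since `ip_address` raises `ValueError` on invalid input.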
8 Comments
Hi Sean
Can the command “join control-cluster 192.168.18.32 force” only be executed from the master controller, or can it be executed from any of the controllers?
Let me know if my understanding is right.
Rajeev, no, it should be executed from the disconnected node. The IP you use is the IP of the master controller.
I have the same issue in my lab, but the join command doesn’t exist.
I’m running NSX 6.2.3. Did the syntax change?
I had a single controller in my cluster and it showed as disconnected. After running this force join command from the controller with its own IP, it connected and changed into a master role. I was then able to add the rest of the controllers.
If the faulty controller is the master, what is the next action plan to fix the issue?
Hi Sean,
What happens if the master for a particular role fails in the controller cluster? Will a new master be elected, and how does that process work?
Hi, I have one controller deployed. The command “show control-cluster roles”, run from the deployed controller’s CLI, shows a Count of 2 for api_provider, switch_manager, logical_manager and directory_server, and a Count of 1 for persistence_server. Why? Does it include NSX Manager?
What exactly does Count mean in the output of the “show control-cluster roles” command?