Troubleshoot NSX Controller cluster status, roles and connectivity

 
VMware’s recommendation is to deploy NSX controllers in odd numbers, three or more. In my recent NSX ICM class I also heard that three controllers are sufficient for most environments and that there is rarely a need for more. I wouldn’t recommend fewer than three, though: the NSX Controllers are the control plane for your NSX network, so redundancy matters. The reason you need three and not two is that the controllers elect a master by majority. With two nodes, losing either one leaves no majority, and you can run into a split-brain scenario.
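
To make the majority argument concrete, here is a minimal sketch of the arithmetic. It is generic quorum math, not anything taken from NSX itself, and the function names are my own.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """How many nodes can fail while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


for size in (1, 2, 3):
    print(size, majority(size), tolerated_failures(size))
```

A two-node cluster needs both nodes for a majority, so it tolerates no failures, exactly like a single node. Three is the smallest size that survives the loss of one controller.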
 
Troubleshooting NSX Controller connectivity isn’t too difficult. If a controller misbehaves, you can either delete the rogue controller and deploy a new one, which is probably the easiest and quickest method, or attempt to repair it. The first thing to do is confirm the status in the GUI by navigating to Networking & Security -> Installation -> Management. You should see a Status next to each of the NSX controller nodes.
 
 
In my lab, one of my controllers is showing Disconnected. The first thing to try is to find the master node. The easiest way to do this is to run the following command on each controller.
 

nsx-controller # show control-cluster roles
                          Listen-IP  Master?    Last-Changed  Count
api_provider         Not configured       No  05/02 19:17:24     13
persistence_server              N/A       No  05/02 19:17:24     10
switch_manager            127.0.0.1       No  05/02 19:17:24     13
logical_manager                 N/A       No  05/02 19:17:24     13
directory_server                N/A       No  05/02 19:17:24     13

 
As you can see, the controller above is NOT the master. Logging into another controller and running the same command, I found the master.
 

nsx-controller # show control-cluster roles
                          Listen-IP  Master?    Last-Changed  Count
api_provider         Not configured      Yes  05/02 19:17:23      3
persistence_server              N/A      Yes  05/02 19:17:24      3
switch_manager            127.0.0.1      Yes  05/02 19:17:23      3
logical_manager                 N/A      Yes  05/02 19:17:23      3
directory_server                N/A      Yes  05/02 19:17:23      3
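
If you want to check this from a script rather than by eye, the table can be parsed with a few lines of Python. This is only a sketch: it assumes the column layout shown above, the function names are hypothetical, and treating a node as "the master" only when every role reads Yes is my assumption based on these two outputs.

```python
def parse_roles(output):
    """Map each role name to True when its 'Master?' column reads Yes."""
    roles = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows end with: <Yes|No> <date> <time> <count>.
        # Counting from the right sidesteps the space in "Not configured".
        if len(fields) >= 5 and fields[-4] in ("Yes", "No"):
            roles[fields[0]] = fields[-4] == "Yes"
    return roles


def is_master_for_all_roles(output):
    """True only if at least one role was parsed and every role reads Yes."""
    roles = parse_roles(output)
    return bool(roles) and all(roles.values())


# Output pasted from the first controller above (trimmed to two rows).
sample = """\
api_provider         Not configured       No  05/02 19:17:24     13
switch_manager            127.0.0.1       No  05/02 19:17:24     13
"""
print(parse_roles(sample))
print(is_master_for_all_roles(sample))
```

Because mastership is reported per role, `parse_roles` returns the full map so you can inspect individual roles as well.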

 
To recover from a failure, you can run the command below on the disconnected node, giving it the IP address of the master controller, to see if the node will rejoin the cluster.
 

nsx-controller # join control-cluster 192.168.18.32 force
Clearing controller state and restarting
Stopping nicira-nvp-controller: [Done]
Clearing nicira-nvp-controller's state: OK
Starting nicira-nvp-controller: CLI revert file already exists
mapping eth0 -> bridged-pif
ssh stop/waiting
ssh start/running, process 12242
mapping breth0 -> eth0
mapping breth0 -> eth0
ssh stop/waiting
ssh start/running, process 12360
Setting core limit to unlimited
Setting file descriptor limit to 100000
 nicira-nvp-controller [OK]
** Watching control-cluster history; ctrl-c to exit **
===================================
Host nsx-controller
Node 33efdea4-a465-4496-8d0b-81c25c238d7a (192.168.18.31)
  ---------------------------------
  05/02 19:21:20: Joining cluster via node 192.168.18.32
  05/02 19:21:20: Waiting to join cluster
  05/02 19:21:52: Joined cluster; initializing local components
  05/02 19:21:53: Initializing data contact with cluster
  05/02 19:22:13: Fetching initial configuration data
  05/02 19:22:31: Join complete
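
If you capture that history output in a log, a short Python sketch like the one below can confirm that the join finished and show how long it took. It assumes the timestamp format shown above; the function name is hypothetical.

```python
from datetime import datetime


def join_summary(history):
    """Return (join_completed, seconds_between_first_and_last_event)."""
    events = []
    for line in history.splitlines():
        # Event lines look like "05/02 19:21:20: Joining cluster via node ..."
        stamp, sep, message = line.strip().partition(": ")
        if not sep:
            continue
        try:
            # The log carries no year; 2000 is a placeholder so strptime
            # has a complete date (it only matters for the subtraction).
            when = datetime.strptime("2000/" + stamp, "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped event line
        events.append((when, message))
    if not events:
        return False, 0
    completed = any(message == "Join complete" for _, message in events)
    elapsed = int((events[-1][0] - events[0][0]).total_seconds())
    return completed, elapsed


history = """\
  05/02 19:21:20: Joining cluster via node 192.168.18.32
  05/02 19:21:20: Waiting to join cluster
  05/02 19:21:52: Joined cluster; initializing local components
  05/02 19:21:53: Initializing data contact with cluster
  05/02 19:22:13: Fetching initial configuration data
  05/02 19:22:31: Join complete
"""
print(join_summary(history))  # (True, 71)
```

In my lab the whole rejoin took a little over a minute, most of it spent waiting to join and fetching the initial configuration.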

 
Success! The node has rejoined the cluster, and the status shows Normal on both of my nodes. You can also run some additional commands to confirm the cluster status.
 
 

nsx-controller # show control-cluster startup-nodes
192.168.18.31, 192.168.18.32
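
To compare that list against the controllers you expect to see, something like the following sketch works. The function name is hypothetical, and 192.168.18.33 is a made-up third controller used only for illustration.

```python
def missing_startup_nodes(output, expected):
    """Return the expected controller IPs absent from the startup-nodes list."""
    listed = {ip.strip() for ip in output.split(",") if ip.strip()}
    return set(expected) - listed


output = "192.168.18.31, 192.168.18.32"
# 192.168.18.33 is a made-up third controller, used only for illustration.
expected = ["192.168.18.31", "192.168.18.32", "192.168.18.33"]
print(missing_startup_nodes(output, expected))  # {'192.168.18.33'}
```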

 

3 Comments

  1. Rajeev - June 26, 2016 - 6:30 am

    Hi Sean

    Can the command “join control-cluster 192.168.18.32 force” only be executed from the master controller, or can it be executed from any of the controllers?
    Let me know if my understanding is right.

    • Imro - July 14, 2016 - 8:13 am

      Rajeev, no, it should be executed from the disconnected node. The IP you use is the IP of the master controller.

  2. Maher - August 23, 2016 - 7:27 pm

    I have the same issue in my lab, but the join command doesn’t exist.
    I’m running NSX 6.2.3. Did the syntax change?

