NSX controller deployment, deep dive, and connectivity verification

 
I recently had a few customers run into issues with control plane connectivity. I wanted to compile a list of steps to verify that the control plane is intact and healthy, along with useful commands for diagnosing problems. I wrote a previous post on how to Troubleshoot NSX Controller cluster status, roles, and connectivity, but I wanted to dive a little deeper this time.
 

Deployment

 
The first thing to mention, again, is that we recommend deploying a total of three NSX controllers. NSX controllers store VM, ESXi, logical switch, and VXLAN information, and they control virtual networks and transport zones in one of three control plane modes: Multicast, Unicast, and Hybrid, which is a combination of multicast and unicast. This is essential for decoupling NSX from the physical network. NSX controllers are very important, hence the need to deploy three of them for both slicing and redundancy. Controller failures do not impact data traffic, but VMware recommends deploying three so that a majority can elect the master controller and avoid split brain. With fewer you risk split brain and have less redundancy, and more are not necessary even for large environments because the controllers share the load. The NSX controllers distribute all network information to the ESXi hosts and communicate with them via the netcpa process on ESXi.

To further your redundancy, you will also want to set DRS anti-affinity rules to ensure that each controller resides on a separate host. You can set a rule via the web client as shown below.
 
[Screenshot: DRS rule for the controllers in the web client]
 
Once you deploy your controllers, you can use the GUI to monitor their state and ensure consistency. Unfortunately, the status shown there does not always reflect actual health, and further troubleshooting may be needed.
 
[Screenshot: NSX controller status in the GUI]
 

Control Plane and Connectivity Verification

To verify the state of your controllers, you can SSH into both the controllers and the ESXi hosts and run the following commands. Note where I am running each command below: esx-01a is the host and nsx-controller is the controller. It may be obvious, but each command only works on one or the other :). The following command shows whether the control plane is out of sync, along with all of your VXLAN network segments, their IPs, the controller that owns each VNI, the status of that controller, and much more. If you see something out of sync with the control plane, you can check /var/log/netcpad.log for problems, and you can run the REST API call “PUT https:///api/2.0/vdn/controller/synchronize” to resync the controllers.

 

[root@esx-01a:~] net-vdl2 -l
VXLAN Global States:
        Control plane Out-Of-Sync:      No
        UDP port:       8472
VXLAN VDS:      vds-site-a
        VDS ID: c2 fb 2e 50 fb 09 5f 02-99 94 60 9f 68 ed 95 33
        MTU:    1600
        Segment ID:     192.168.130.0
        Gateway IP:     192.168.130.1
        Gateway MAC:    00:50:56:01:20:a6
        Vmknic count:   1
                VXLAN vmknic:   vmk3
                        VDS port ID:    161
                        Switch port ID: 33554441
                        Endpoint ID:    0
                        VLAN ID:        0
                        IP:             192.168.130.52
                        Netmask:        255.255.255.0
                        Segment ID:     192.168.130.0
                        IP acquire timeout:     0
                        Multicast group count:  0
        Network count:  4
                VXLAN network:  5002
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled (multicast proxy,ARP proxy)
                        Controller:     192.168.110.32 (up)
                        MAC entry count:        1
                        ARP entry count:        0
                        Port count:     1
                VXLAN network:  5001
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled (multicast proxy,ARP proxy)
                        Controller:     192.168.110.33 (up)
                        MAC entry count:        3
                        ARP entry count:        0
                        Port count:     2
                VXLAN network:  5000
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled (multicast proxy,ARP proxy)
                        Controller:     192.168.110.32 (up)
                        MAC entry count:        0
                        ARP entry count:        0
                        Port count:     1
                VXLAN network:  5003
                        Multicast IP:   0.0.0.0
                        Control plane:  Disabled
                        MAC entry count:        0
                        ARP entry count:        0
                        Port count:     1
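If you want to check this output from a script rather than by eye, here is a minimal Python sketch. It assumes you have captured the text of net-vdl2 -l yourself (for example over SSH); the function names parse_vdl2 and unhealthy are my own, and the sample is a trimmed copy of the output above.

```python
import re

# Trimmed copy of the `net-vdl2 -l` output shown above.
SAMPLE = """\
VXLAN Global States:
        Control plane Out-Of-Sync:      No
        UDP port:       8472
        Network count:  4
                VXLAN network:  5002
                        Control plane:  Enabled (multicast proxy,ARP proxy)
                        Controller:     192.168.110.32 (up)
                VXLAN network:  5001
                        Control plane:  Enabled (multicast proxy,ARP proxy)
                        Controller:     192.168.110.33 (up)
                VXLAN network:  5003
                        Multicast IP:   0.0.0.0
                        Control plane:  Disabled
"""


def parse_vdl2(text):
    """Return (out_of_sync, networks) parsed from `net-vdl2 -l` text."""
    m = re.search(r"Control plane Out-Of-Sync:\s*(\S+)", text)
    out_of_sync = m is not None and m.group(1).lower() != "no"

    networks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"VXLAN network:\s*(\d+)", line)
        if m:
            current = int(m.group(1))
            networks[current] = {"control_plane": None, "controller": None, "state": None}
            continue
        if current is None:
            continue  # still in the global / VDS section
        m = re.match(r"Control plane:\s*(\S+)", line)
        if m:
            networks[current]["control_plane"] = m.group(1)
        m = re.match(r"Controller:\s*(\S+)\s*\((\w+)\)", line)
        if m:
            networks[current]["controller"] = m.group(1)
            networks[current]["state"] = m.group(2)
    return out_of_sync, networks


def unhealthy(networks):
    """VNIs with the control plane enabled but no controller reported as up."""
    return sorted(
        vni for vni, n in networks.items()
        if n["control_plane"] == "Enabled" and n["state"] != "up"
    )


out_of_sync, networks = parse_vdl2(SAMPLE)
print("Out of sync:", out_of_sync)
print("VNIs needing attention:", unhealthy(networks))
```

VNI 5003 is not flagged here because its control plane is disabled, so it has no controller to report.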

 
To ensure that your ESXi host and the controllers are communicating over the control plane and management network, you can run the following command. I only have two controllers in my lab environment due to resource constraints, but you should see each controller connected below through netcpa on TCP port 1234. You want each connection to show ESTABLISHED rather than waiting.
 

[root@esx-01a:~] esxcli network ip connection list | grep 1234
tcp         0       0  192.168.110.51:44885            192.168.110.33:1234   ESTABLISHED     35068  newreno  netcpa-worker
tcp         0       0  192.168.110.51:14004            192.168.110.32:1234   ESTABLISHED     35070  newreno  netcpa-worker
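The same check can be scripted. This is a rough Python sketch that assumes the column layout shown above (remote address in the fifth column, state in the sixth); controller_sessions and the EXPECTED list are names I made up, and EXPECTED holds the two lab controllers from the output.

```python
# The two lines returned by the esxcli command above.
SAMPLE = """\
tcp         0       0  192.168.110.51:44885            192.168.110.33:1234   ESTABLISHED     35068  newreno  netcpa-worker
tcp         0       0  192.168.110.51:14004            192.168.110.32:1234   ESTABLISHED     35070  newreno  netcpa-worker
"""

# Controllers this host should be talking to (the two in my lab).
EXPECTED = ["192.168.110.32", "192.168.110.33"]


def controller_sessions(text, port=1234):
    """Map each remote controller IP to the TCP state of its connection."""
    sessions = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        ip, _, remote_port = fields[4].rpartition(":")
        if remote_port == str(port):
            sessions[ip] = fields[5]
    return sessions


sessions = controller_sessions(SAMPLE)
missing = [ip for ip in EXPECTED if sessions.get(ip) != "ESTABLISHED"]
print("Sessions:", sessions)
print("Controllers without an established session:", missing)
```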

              
To verify that the controller processes are running correctly, run the following command. You should see both controller and java-dir-server running.
 

nsx-controller # show process
Name               CPU-Time    Elapsed-Time   Resident-Size    Virtual-Size
controller         00:00:00           30:08         5916 kB        51420 kB
cpp-domain         00:01:02           30:08       257916 kB       852616 kB
python-domain      00:02:49           30:07       104712 kB       189088 kB
java-persistence   00:02:48           30:07       451100 kB      3368672 kB
java-dir-server    00:00:47           30:07       117664 kB      6676704 kB
snmpd              00:00:00           30:17         4456 kB        48336 kB
ntp                00:00:00           30:08         1984 kB        33532 kB
api-server         00:00:00           30:17        18080 kB       101804 kB
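As a small sketch, the process table can be checked for the two processes mentioned above. The REQUIRED set simply mirrors that sentence and running_processes is a hypothetical helper name; the sample is the table from the output.

```python
# Process table from `show process` above.
SAMPLE = """\
Name               CPU-Time    Elapsed-Time   Resident-Size    Virtual-Size
controller         00:00:00           30:08         5916 kB        51420 kB
cpp-domain         00:01:02           30:08       257916 kB       852616 kB
python-domain      00:02:49           30:07       104712 kB       189088 kB
java-persistence   00:02:48           30:07       451100 kB      3368672 kB
java-dir-server    00:00:47           30:07       117664 kB      6676704 kB
snmpd              00:00:00           30:17         4456 kB        48336 kB
ntp                00:00:00           30:08         1984 kB        33532 kB
api-server         00:00:00           30:17        18080 kB       101804 kB
"""

# The two processes called out in the text above.
REQUIRED = {"controller", "java-dir-server"}


def running_processes(text):
    """Return the set of process names listed in the table."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] != "Name":  # skip header and blanks
            names.add(fields[0])
    return names


missing = REQUIRED - running_processes(SAMPLE)
print("Missing processes:", sorted(missing))
```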

 
It’s also possible to list all the logical router instances on a controller as well as on a host, as shown in the next two commands.
 

nsx-controller # show control-cluster logical-routers instance all
LR-Id      LR-Name                                            Universal Service-Controller Egress-Locale                        In-Sync    Sync-Category
0x1388     default+edge-2                                     false     192.168.110.32     local                                N/A        N/A

 

[root@esx-01a:~] net-vdr -I -l

VDR Instance Information :
---------------------------

Vdr Name:                   default+edge-2
Vdr Id:                     0x00001388
Number of Lifs:             4
Number of Routes:           5
State:                      Enabled
Controller IP:              192.168.110.32
Control Plane IP:           192.168.110.51
Control Plane Active:       Yes
Num unique nexthops:        1
Generation Number:          0
Edge Active:                No
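The host-side output is a simple list of "key: value" lines, so it is easy to test from a script. The sketch below is only an illustration: parse_vdr_instance and control_plane_ok are my own names, and it treats a Controller IP of 0.0.0.0 or an inactive control plane as a failure.

```python
# Trimmed copy of the `net-vdr -I -l` output above.
SAMPLE = """\
VDR Instance Information :
---------------------------

Vdr Name:                   default+edge-2
Vdr Id:                     0x00001388
State:                      Enabled
Controller IP:              192.168.110.32
Control Plane IP:           192.168.110.51
Control Plane Active:       Yes
"""


def parse_vdr_instance(text):
    """Collect the 'key: value' pairs for one VDR instance."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skips the banner and separator lines
            info[key.strip()] = value.strip()
    return info


def control_plane_ok(info):
    """True when a real controller IP is set and the control plane is active."""
    return (
        info.get("Controller IP") not in (None, "0.0.0.0")
        and info.get("Control Plane Active") == "Yes"
    )


info = parse_vdr_instance(SAMPLE)
print(info["Vdr Name"], "control plane OK:", control_plane_ok(info))
```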

 
If the command above returns a Controller IP of 0.0.0.0, you are most likely running into a problem with the control plane. The first thing to check is that the NSX Manager is pushing the correct information to the controllers and hosts. The file below stores the controller and VDR information; make sure it is populated with the correct IPs.
 

/etc/vmware/netcpa/config-by-vsm.xml

 

<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.168.110.31</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>A5:C6:A2:B2:57:97:36:F0:7C:13:DB:64:9B:86:E6:EF:1A:7E:5C:36</thumbprint>
    </connection>
    <connection id="0001">
      <port>1234</port>
      <server>192.168.110.32</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>12:E0:25:B2:E0:35:D7:84:90:71:CF:C7:53:97:FD:96:EE:ED:7C:DD</thumbprint>
    </connection>
    <connection id="0002">
      <port>1234</port>
      <server>192.168.110.33</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>BD:DB:BA:B0:DC:61:AD:94:C6:0F:7E:F5:80:19:44:51:BA:90:2C:8D</thumbprint>
    </connection>
  </connectionList>
  <localeId>
    <id>423A993F-BEE6-1285-58F1-54E48D508D90</id>
  </localeId>
  <vdrDvsList>
    <vdrDvs id="0000">
      <numActiveUplink>1</numActiveUplink>
      <numUplink>4</numUplink>
      <teamingPolicy>FAILOVER_ORDER</teamingPolicy>
      <uplinkPortNames>Uplink 4,Uplink 3,Uplink 2,Uplink 1</uplinkPortNames>
      <uuid>c2 fb 2e 50 fb 09 5f 02-99 94 60 9f 68 ed 95 33</uuid>
      <vxlanOnly>true</vxlanOnly>
    </vdrDvs>
  </vdrDvsList>
  <vdrInstanceList>
    <vdrInstance id="0000">
      <authToken>0f58a2b5-8ee1-482d-aa41-8da85f9596bd</authToken>
      <isUniversal>false</isUniversal>
      <localEgressRequired>false</localEgressRequired>
      <vdrId>5000</vdrId>
      <vdrName>default+edge-2</vdrName>
    </vdrInstance>
  </vdrInstanceList>
</config>
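To compare the controller IPs in this file against what you expect, a short Python sketch using the standard library XML parser is enough. The element names come from the file above; controllers_from_config is a hypothetical helper, and the sample is trimmed to the connectionList section with the thumbprints left out.

```python
import xml.etree.ElementTree as ET

# connectionList section of config-by-vsm.xml, trimmed from the file above.
SAMPLE = """\
<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.168.110.31</server>
      <sslEnabled>true</sslEnabled>
    </connection>
    <connection id="0001">
      <port>1234</port>
      <server>192.168.110.32</server>
      <sslEnabled>true</sslEnabled>
    </connection>
    <connection id="0002">
      <port>1234</port>
      <server>192.168.110.33</server>
      <sslEnabled>true</sslEnabled>
    </connection>
  </connectionList>
</config>
"""


def controllers_from_config(xml_text):
    """Return one dict per <connection> entry pushed by the NSX Manager."""
    root = ET.fromstring(xml_text)
    result = []
    for conn in root.findall("./connectionList/connection"):
        result.append({
            "id": conn.get("id"),
            "server": conn.findtext("server"),
            "port": int(conn.findtext("port")),
            "ssl": conn.findtext("sslEnabled") == "true",
        })
    return result


controllers = controllers_from_config(SAMPLE)
for c in controllers:
    print(c["id"], c["server"], c["port"], "ssl" if c["ssl"] else "no-ssl")
```

On a real host you would read the text from /etc/vmware/netcpa/config-by-vsm.xml instead of the embedded sample.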

 
Finally, the commands below show the cluster status, the roles, and one of my favorites: the history of the controllers. The history shows timestamps for when the nodes were restarted, when they joined the cluster and via which controller node, which roles were configured, any interruptions to connectivity, and so on. This is useful if your controllers get out of sync or show as disconnected. It gives you a timestamp for when the problem occurred, and from there you can reference the logs that I note towards the end of this post.
 

nsx-controller # show control-cluster status
Type                Status                                       Since
--------------------------------------------------------------------------------
Join status:        Join complete                                10/23 15:59:53
Majority status:    Connected to cluster majority                10/23 15:59:37
Restart status:     This controller can be safely restarted      10/23 15:59:50
Cluster ID:         4ae430e7-edcb-4554-bb19-6e176a5770e9
Node UUID:          4ae430e7-edcb-4554-bb19-6e176a5770e9

Role                Configured status   Active status
--------------------------------------------------------------------------------
api_provider        enabled             activated
persistence_server  enabled             activated
switch_manager      enabled             activated
logical_manager     enabled             activated
directory_server    enabled             activated
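For a quick scripted check of this output, the sketch below splits each row on runs of two or more spaces and reports any role that is not activated. It assumes the column layout shown above; parse_cluster_status is my own helper name.

```python
import re

# Status and role tables from `show control-cluster status` above.
SAMPLE = """\
Type                Status                                       Since
--------------------------------------------------------------------------------
Join status:        Join complete                                10/23 15:59:53
Majority status:    Connected to cluster majority                10/23 15:59:37
Restart status:     This controller can be safely restarted      10/23 15:59:50

Role                Configured status   Active status
--------------------------------------------------------------------------------
api_provider        enabled             activated
persistence_server  enabled             activated
switch_manager      enabled             activated
logical_manager     enabled             activated
directory_server    enabled             activated
"""


def parse_cluster_status(text):
    """Return (status, roles): status text per type, active status per role."""
    status, roles = {}, {}
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 3 or fields[0] in ("Type", "Role"):
            continue  # headers, separators, blank lines, two-column rows
        if fields[0].endswith(":"):
            status[fields[0].rstrip(":")] = fields[1]
        else:
            roles[fields[0]] = fields[2]
    return status, roles


status, roles = parse_cluster_status(SAMPLE)
inactive = [name for name, state in roles.items() if state != "activated"]
print("Majority:", status.get("Majority status"))
print("Roles not activated:", inactive)
```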

 

nsx-controller # show control-cluster role
                          Listen-IP  Master?    Last-Changed  Count
api_provider         Not configured       No  10/23 15:59:53      9
persistence_server              N/A       No  10/23 15:59:53      8
switch_manager            127.0.0.1       No  10/23 15:59:53      9
logical_manager                 N/A       No  10/23 15:59:53      9
directory_server                N/A       No  10/23 15:59:53      9

 

nsx-controller # show control-cluster history
===================================
Host nsx-controller
Node 4ae430e7-edcb-4554-bb19-6e176a5770e9 (192.168.110.31, nicira-nvp-controller.4.0.6.44780)
  ---------------------------------
  10/23 15:59:16: Node restarted
  10/23 15:59:21: Joining cluster via node 192.168.110.31
  10/23 15:59:21: Waiting to join cluster
  10/23 15:59:21: Role api_provider configured
  10/23 15:59:21: Role directory_server configured
  10/23 15:59:21: Role switch_manager configured
  10/23 15:59:21: Role logical_manager configured
  10/23 15:59:21: Role persistence_server configured
  10/23 15:59:21: Joining cluster via node 192.168.110.32
  10/23 15:59:21: Joining cluster via node 192.168.110.33
  10/23 15:59:23: Joined cluster; initializing local components
  10/23 15:59:23: Disconnected from cluster majority
  10/23 15:59:23: Connected to cluster majority
  10/23 15:59:23: Initializing data contact with cluster
  10/23 15:59:24: Interrupted connection to cluster majority
  10/23 15:59:37: Connected to cluster majority
  10/23 15:59:50: Fetching initial configuration data
  10/23 15:59:50: Role persistence_server activated
  10/23 15:59:53: Join complete
  10/23 15:59:53: Role api_provider activated
  10/23 15:59:53: Role directory_server activated
  10/23 15:59:53: Role logical_manager activated
  10/23 15:59:53: Role switch_manager activated
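Since the history is the quickest way to find when something went wrong, here is a minimal sketch that pulls out only the restart and connectivity-loss events with their timestamps. The KEYWORDS tuple is my own choice based on the messages in the output above, and the sample is a subset of those lines.

```python
import re

# Subset of the `show control-cluster history` lines above.
SAMPLE = """\
  10/23 15:59:16: Node restarted
  10/23 15:59:23: Joined cluster; initializing local components
  10/23 15:59:23: Disconnected from cluster majority
  10/23 15:59:23: Connected to cluster majority
  10/23 15:59:24: Interrupted connection to cluster majority
  10/23 15:59:37: Connected to cluster majority
  10/23 15:59:53: Join complete
"""

EVENT = re.compile(r"^\s*(\d{2}/\d{2} \d{2}:\d{2}:\d{2}): (.+)$")

# Message fragments treated as "something went wrong" (my own selection).
KEYWORDS = ("restarted", "Disconnected", "Interrupted")


def connectivity_events(text):
    """Return (timestamp, message) pairs for restarts and lost connectivity."""
    events = []
    for line in text.splitlines():
        m = EVENT.match(line)
        if m and any(k in m.group(2) for k in KEYWORDS):
            events.append((m.group(1), m.group(2)))
    return events


events = connectivity_events(SAMPLE)
for when, what in events:
    print(when, what)
```

The timestamps it prints are the ones to look for in the controller logs.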

 
A couple of logs that are useful for troubleshooting are as follows:
 
Host connectivity errors: “show log cloudnet/cloudnet_java-vnet-controller.log”
Storage latency: “show log cloudnet/cloudnet_java-zookeeper.log”
 
Finally, there is a way to collect the NSX controller logs through the command line. This is useful because you can gather logs from all three controllers at the same time instead of hitting the “Gather tech support logs” button and waiting for each individual controller to finish. To accomplish this, run the following commands on each controller. The copy file command just uses scp; I copied the file to an ESXi host, although you can copy it to a different remote server.
 

nsx-controller # save status-report controller1.log
................................................................................
nsx-controller# copy file  root@esxihost/tmp/

 
I hope this helps point everyone in the right direction for troubleshooting controller and control plane issues. Feel free to post any questions or comments below!
 

Posted by:

Sean Whitney

2 Comments

  1. Rajeev - February 13, 2016 - 5:31 am

    Hi Sean

    In the output for “show control-cluster role” shown above the controller is not master for any of the services.
    Do let me know if this is right. My understanding is that each controller should be master for at least 1 of the service components.
    Let me know if my understanding is right.

    • Sean Whitney - February 13, 2016 - 7:47 am

      Hi Rajeev,

      That is normal. Some controllers may not be a master for any of the roles, while another may be the master for all of the roles.

      Sean
