NSX controller deployment, deep dive, and connectivity verification
I recently had a few customers run into issues with control plane connectivity. I wanted to compile a list of steps to verify that the control plane is intact and free of issues, as well as provide useful commands for diagnosing any problems. I wrote a previous post on how to Troubleshoot NSX Controller cluster status, roles, and connectivity, but I wanted to dive a little deeper this time.
Deployment
The first thing to mention, again, is that we recommend deploying a total of three NSX controllers. NSX controllers store VM, ESXi host, logical switch, and VXLAN information, and they control virtual networks and transport zones via three different control plane modes: Multicast, Unicast, and Hybrid (a combination of multicast and unicast). This is essential for decoupling NSX from the physical network. NSX controllers are very important, hence the need to deploy three of them for both slicing and redundancy. Controller failures do not result in any impact to data traffic, but VMware recommends deploying a total of three so that a majority election of the master controller can occur and split brain is avoided. Any fewer and you risk split brain and reduced redundancy; any more is unnecessary even for large environments, as the controllers share the load. The NSX controllers distribute all network information to the ESXi hosts and communicate with them via the netcpa process on each host.
To further improve redundancy, you will also want to set DRS anti-affinity rules to ensure that each controller resides on a separate host. You can set a rule via the Web Client as shown below.
Once you deploy your controllers, you can use the GUI to monitor their state and ensure consistency. Unfortunately, the GUI does not always reflect true health in certain scenarios, and further troubleshooting may be needed.
Control Plane and Connectivity Verification
To verify the state of your controllers, you can SSH into both the controllers and the ESXi hosts and run the following commands. Please note the prompt in each example: esx-01a is the host and nsx-controller is the controller. Obvious, but each command only works on one or the other :). The following command shows you whether the control plane is out of sync, all of your VXLAN network segments, their IPs, the controller that owns each VNI, the status of that controller, and much more. If you see the control plane out of sync, you can always check /var/log/netcpad.log for problems, as well as run the REST API call “PUT https://
[root@esx-01a:~] net-vdl2 -l
VXLAN Global States:
	Control plane Out-Of-Sync:	No
	UDP port:	8472
VXLAN VDS:	vds-site-a
	VDS ID:	c2 fb 2e 50 fb 09 5f 02-99 94 60 9f 68 ed 95 33
	MTU:	1600
	Segment ID:	192.168.130.0
	Gateway IP:	192.168.130.1
	Gateway MAC:	00:50:56:01:20:a6
	Vmknic count:	1
		VXLAN vmknic:	vmk3
			VDS port ID:	161
			Switch port ID:	33554441
			Endpoint ID:	0
			VLAN ID:	0
			IP:	192.168.130.52
			Netmask:	255.255.255.0
			Segment ID:	192.168.130.0
			IP acquire timeout:	0
			Multicast group count:	0
	Network count:	4
		VXLAN network:	5002
			Multicast IP:	N/A (headend replication)
			Control plane:	Enabled (multicast proxy,ARP proxy)
			Controller:	192.168.110.32 (up)
			MAC entry count:	1
			ARP entry count:	0
			Port count:	1
		VXLAN network:	5001
			Multicast IP:	N/A (headend replication)
			Control plane:	Enabled (multicast proxy,ARP proxy)
			Controller:	192.168.110.33 (up)
			MAC entry count:	3
			ARP entry count:	0
			Port count:	2
		VXLAN network:	5000
			Multicast IP:	N/A (headend replication)
			Control plane:	Enabled (multicast proxy,ARP proxy)
			Controller:	192.168.110.32 (up)
			MAC entry count:	0
			ARP entry count:	0
			Port count:	1
		VXLAN network:	5003
			Multicast IP:	0.0.0.0
			Control plane:	Disabled
			MAC entry count:	0
			ARP entry count:	0
			Port count:	1
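If you are checking many hosts, eyeballing this output gets tedious. As a rough illustration of what to look for, here is a small Python sketch of my own (not part of any NSX or VMware tooling) that scans `net-vdl2 -l` output for the two red flags discussed above: an out-of-sync control plane and a controller that is not reported as up.

```python
# Hypothetical helper, not an NSX tool: scan `net-vdl2 -l` output for
# an out-of-sync control plane or controllers that are not "(up)".
def check_vdl2(output: str):
    issues = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Control plane Out-Of-Sync:"):
            # Anything other than "No" means the host lost sync with the controllers.
            if line.split(":", 1)[1].strip() != "No":
                issues.append("control plane out of sync")
        elif line.startswith("Controller:"):
            controller = line.split(":", 1)[1].strip()
            if "(up)" not in controller:
                issues.append("controller not up: " + controller)
    return issues

sample = """VXLAN Global States:
  Control plane Out-Of-Sync: No
VXLAN network: 5002
  Controller: 192.168.110.32 (up)
"""
print(check_vdl2(sample))  # an empty list means nothing suspicious was found
```

You could feed it the real output over SSH, but the parsing logic is the point: "Out-Of-Sync: No" and every VNI's controller showing "(up)" is what a healthy host looks like.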
To ensure that your ESXi host and the controllers are communicating over the control plane and management network, you can run the following command. I only have two controllers in my lab environment due to resource constraints, but you should see each controller connected below through netcpa. You want to see the connections ESTABLISHED rather than waiting.
[root@esx-01a:~] esxcli network ip connection list | grep 1234
tcp   0   0  192.168.110.51:44885  192.168.110.33:1234  ESTABLISHED  35068  newreno  netcpa-worker
tcp   0   0  192.168.110.51:14004  192.168.110.32:1234  ESTABLISHED  35070  newreno  netcpa-worker
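The same check can be automated. This is a quick sketch of mine (again, not a VMware tool) that parses `esxcli network ip connection list` output and pulls out every netcpa connection to controller port 1234 along with its TCP state, so a script can flag anything that is not ESTABLISHED:

```python
# Hypothetical check, not an NSX tool: extract (controller, state) pairs for
# every connection to controller port 1234 from esxcli output.
def netcpa_connections(esxcli_output: str):
    conns = []
    for line in esxcli_output.splitlines():
        fields = line.split()
        # esxcli columns: proto, recv-q, send-q, local addr, peer addr, state, ...
        if len(fields) >= 6 and fields[4].endswith(":1234"):
            conns.append((fields[4], fields[5]))
    return conns

sample = (
    "tcp 0 0 192.168.110.51:44885 192.168.110.33:1234 ESTABLISHED 35068 newreno netcpa-worker\n"
    "tcp 0 0 192.168.110.51:14004 192.168.110.32:1234 ESTABLISHED 35070 newreno netcpa-worker\n"
)
for peer, state in netcpa_connections(sample):
    print(peer, state)
```

A host missing a connection to one of the three controllers, or showing a state other than ESTABLISHED, is the first place to dig.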
To verify that the controller processes are running correctly, run the following command on a controller. You should see both the controller and java-dir-server processes running.
nsx-controller # show process
Name              CPU-Time  Elapsed-Time  Resident-Size  Virtual-Size
controller        00:00:00  30:08         5916 kB        51420 kB
cpp-domain        00:01:02  30:08         257916 kB      852616 kB
python-domain     00:02:49  30:07         104712 kB      189088 kB
java-persistence  00:02:48  30:07         451100 kB      3368672 kB
java-dir-server   00:00:47  30:07         117664 kB      6676704 kB
snmpd             00:00:00  30:17         4456 kB        48336 kB
ntp               00:00:00  30:08         1984 kB        33532 kB
api-server        00:00:00  30:17         18080 kB       101804 kB
It’s also possible to list all the logical router instances on a controller as well as on a host, as shown in the next two commands.
nsx-controller # show control-cluster logical-routers instance all
LR-Id   LR-Name         Universal  Service-Controller  Egress-Locale  In-Sync  Sync-Category
0x1388  default+edge-2  false      192.168.110.32      local          N/A      N/A
[root@esx-01a:~] net-vdr -I -l

VDR Instance Information :
---------------------------

Vdr Name:             default+edge-2
Vdr Id:               0x00001388
Number of Lifs:       4
Number of Routes:     5
State:                Enabled
Controller IP:        192.168.110.32
Control Plane IP:     192.168.110.51
Control Plane Active: Yes
Num unique nexthops:  1
Generation Number:    0
Edge Active:          No
If the command above returns a Controller IP of 0.0.0.0, you are most likely running into a control plane problem. The first thing to check is whether NSX Manager is pushing the correct information to the controllers and hosts. The file below stores the controller and VDR information; make sure it is populated with the correct IPs.
/etc/vmware/netcpa/config-by-vsm.xml
<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.168.110.31</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>A5:C6:A2:B2:57:97:36:F0:7C:13:DB:64:9B:86:E6:EF:1A:7E:5C:36</thumbprint>
    </connection>
    <connection id="0001">
      <port>1234</port>
      <server>192.168.110.32</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>12:E0:25:B2:E0:35:D7:84:90:71:CF:C7:53:97:FD:96:EE:ED:7C:DD</thumbprint>
    </connection>
    <connection id="0002">
      <port>1234</port>
      <server>192.168.110.33</server>
      <sslEnabled>true</sslEnabled>
      <thumbprint>BD:DB:BA:B0:DC:61:AD:94:C6:0F:7E:F5:80:19:44:51:BA:90:2C:8D</thumbprint>
    </connection>
  </connectionList>
  <localeId>
    <id>423A993F-BEE6-1285-58F1-54E48D508D90</id>
  </localeId>
  <vdrDvsList>
    <vdrDvs id="0000">
      <numActiveUplink>1</numActiveUplink>
      <numUplink>4</numUplink>
      <teamingPolicy>FAILOVER_ORDER</teamingPolicy>
      <uplinkPortNames>Uplink 4,Uplink 3,Uplink 2,Uplink 1</uplinkPortNames>
      <uuid>c2 fb 2e 50 fb 09 5f 02-99 94 60 9f 68 ed 95 33</uuid>
      <vxlanOnly>true</vxlanOnly>
    </vdrDvs>
  </vdrDvsList>
  <vdrInstanceList>
    <vdrInstance id="0000">
      <authToken>0f58a2b5-8ee1-482d-aa41-8da85f9596bd</authToken>
      <isUniversal>false</isUniversal>
      <localEgressRequired>false</localEgressRequired>
      <vdrId>5000</vdrId>
      <vdrName>default+edge-2</vdrName>
    </vdrInstance>
  </vdrInstanceList>
</config>
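Since this file is plain XML, comparing it against your expected controller list is easy to script. Here is a small sketch of my own (not a VMware utility) that uses Python's standard xml.etree.ElementTree module to pull the controller IPs out of config-by-vsm.xml so you can diff them against what NSX Manager shows:

```python
# Hypothetical sanity check, not a VMware tool: list the controller IPs
# that NSX Manager pushed into /etc/vmware/netcpa/config-by-vsm.xml.
import xml.etree.ElementTree as ET

def controller_ips(xml_text: str):
    root = ET.fromstring(xml_text)
    # Each <connection> under <connectionList> holds one controller endpoint.
    return [conn.findtext("server") for conn in root.iter("connection")]

# Trimmed-down sample with the same structure as the real file.
sample = """<config><connectionList>
  <connection id="0000"><port>1234</port><server>192.168.110.31</server></connection>
  <connection id="0001"><port>1234</port><server>192.168.110.32</server></connection>
</connectionList></config>"""
print(controller_ips(sample))  # ['192.168.110.31', '192.168.110.32']
```

On a real host you would read the file from /etc/vmware/netcpa/config-by-vsm.xml; if an IP is missing or stale, that points at NSX Manager not pushing configuration correctly.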
Finally, the commands below show you the cluster status, the roles, and one of my favorites, the history of the controllers. The history shows you timestamps for when the nodes were restarted, when they joined the cluster and via which controller node, which roles were configured, any interruptions to connectivity, and so on. This is useful if your controllers get out of sync or show as disconnected: it provides you with a timestamp of when the problem occurred, and from there you can reference the logs that I note toward the end of this post.
nsx-controller # show control-cluster status
Type                Status                                       Since
--------------------------------------------------------------------------------
Join status:        Join complete                                10/23 15:59:53
Majority status:    Connected to cluster majority                10/23 15:59:37
Restart status:     This controller can be safely restarted      10/23 15:59:50
Cluster ID:         4ae430e7-edcb-4554-bb19-6e176a5770e9
Node UUID:          4ae430e7-edcb-4554-bb19-6e176a5770e9

Role                Configured status   Active status
--------------------------------------------------------------------------------
api_provider        enabled             activated
persistence_server  enabled             activated
switch_manager      enabled             activated
logical_manager     enabled             activated
directory_server    enabled             activated
nsx-controller # show control-cluster role
                    Listen-IP       Master?  Last-Changed    Count
api_provider        Not configured  No       10/23 15:59:53  9
persistence_server  N/A             No       10/23 15:59:53  8
switch_manager      127.0.0.1       No       10/23 15:59:53  9
logical_manager     N/A             No       10/23 15:59:53  9
directory_server    N/A             No       10/23 15:59:53  9
nsx-controller # show control-cluster history
===================================
Host nsx-controller
Node 4ae430e7-edcb-4554-bb19-6e176a5770e9 (192.168.110.31, nicira-nvp-controller.4.0.6.44780)
---------------------------------
10/23 15:59:16: Node restarted
10/23 15:59:21: Joining cluster via node 192.168.110.31
10/23 15:59:21: Waiting to join cluster
10/23 15:59:21: Role api_provider configured
10/23 15:59:21: Role directory_server configured
10/23 15:59:21: Role switch_manager configured
10/23 15:59:21: Role logical_manager configured
10/23 15:59:21: Role persistence_server configured
10/23 15:59:21: Joining cluster via node 192.168.110.32
10/23 15:59:21: Joining cluster via node 192.168.110.33
10/23 15:59:23: Joined cluster; initializing local components
10/23 15:59:23: Disconnected from cluster majority
10/23 15:59:23: Connected to cluster majority
10/23 15:59:23: Initializing data contact with cluster
10/23 15:59:24: Interrupted connection to cluster majority
10/23 15:59:37: Connected to cluster majority
10/23 15:59:50: Fetching initial configuration data
10/23 15:59:50: Role persistence_server activated
10/23 15:59:53: Join complete
10/23 15:59:53: Role api_provider activated
10/23 15:59:53: Role directory_server activated
10/23 15:59:53: Role logical_manager activated
10/23 15:59:53: Role switch_manager activated
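Since the main value of the history is finding the timestamps of connectivity interruptions, you can triage it mechanically. This is a rough sketch of my own (not an NSX command) that scans `show control-cluster history` output and lists just the disconnect and interruption events worth chasing in the logs:

```python
# Hypothetical triage helper, not an NSX tool: pull out the timestamps of
# majority disconnects/interruptions from `show control-cluster history`.
def interruptions(history: str):
    events = []
    for line in history.splitlines():
        line = line.strip()
        if "Interrupted connection" in line or "Disconnected from cluster" in line:
            # Lines look like "10/23 15:59:24: Interrupted connection to cluster majority";
            # the first ": " (colon-space) separates the timestamp from the message.
            ts, _, msg = line.partition(": ")
            events.append((ts, msg))
    return events

sample = """10/23 15:59:23: Disconnected from cluster majority
10/23 15:59:23: Connected to cluster majority
10/23 15:59:24: Interrupted connection to cluster majority
10/23 15:59:37: Connected to cluster majority"""
for ts, msg in interruptions(sample):
    print(ts, msg)
```

Those timestamps are exactly what you then look for in the cloudnet logs mentioned below.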
A couple of logs that are useful for troubleshooting:
Host connectivity errors: “show log cloudnet/cloudnet_java-vnet-controller.log”
Storage latency: “show log cloudnet/cloudnet_java-zookeeper.log”
Finally, there is a way to collect the NSX controller logs through the command line. This is useful because you can gather logs from all three controllers at the same time instead of clicking the “Gather tech support logs” button and waiting for each individual controller to finish. To accomplish this, run the following commands on each controller. The copy command just uses scp; I copied the file to an ESXi host here, although you can copy it to any remote server.
nsx-controller # save status-report controller1.log
................................................................................
................................................................................
................................................................................
nsx-controller # copy file controller1.log root@esxihost:/tmp/
I hope this helps point everyone in the right direction for troubleshooting controller and control plane issues. Feel free to post any questions or comments below!
2 Comments
Hi Sean
In the output for “show control-cluster role” shown above the controller is not master for any of the services.
My understanding is that each controller should be master for at least one of the service components. Let me know if my understanding is right.
Hi Rajeev,
That is normal. Some controllers may not be a master for any of the roles, while another may be the master for all of the roles.
Sean