VMware NSX: Solving communication issues between VMware ESXi and VMware NSX Controllers

An issue we recently encountered was that we could not get dynamic routing to work: no OSPF neighbours were visible when running show ip ospf neighbor on the routers. We also had communication issues between deployed Edge Services Gateways (ESGs) and Distributed Logical Routers (DLRs).

Problem

The issue described above is of course very inconvenient, so we were eager to find a solution. After some basic troubleshooting and verifying our OSPF configuration, we quickly concluded that everything was configured as it should be. One thing we did notice was that when we put the DLRs and ESGs on the same ESXi host, everything worked like a charm! So maybe we had connection issues between the ESXi hosts and the controllers?

The next step in troubleshooting was to check the Installation tab in VMware vCenter (Networking & Security > Installation). According to VMware vCenter everything was working as expected and no issues were visible at first sight. However, when clicking the Communication Channel Health button on the impacted cluster, we noticed that everything was listed as down.

After some googling we found the magnificent blog post from Cormac Hogan concerning NSX troubleshooting tips. When checking the third step, we saw no connections from the ESXi hosts to the controllers on port 1234. So the issue was found, but what about the solution?
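
To make the port 1234 check repeatable across hosts, here is a minimal Python sketch that filters connection-list text for controller connections. It assumes the text comes from running esxcli network ip connection list on the ESXi host and that the columns appear in the order Proto, Recv Q, Send Q, Local Address, Foreign Address, State. The function name controller_connections and the sample rows (documentation addresses from 192.0.2.0/24) are made up for illustration and are not output from our environment.

```python
CONTROLLER_PORT = 1234  # port the ESXi hosts use to reach the NSX controllers


def controller_connections(esxcli_output, port=CONTROLLER_PORT):
    """Return (foreign_address, state) pairs for TCP connections to `port`.

    Assumes whitespace-separated rows in the column order:
    proto, recv-q, send-q, local address, foreign address, state, ...
    """
    matches = []
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Skip the header, the separator row, and anything that is not TCP.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        # Compare the port exactly so that e.g. 51234 does not match 1234.
        if foreign.rsplit(":", 1)[-1] == str(port):
            matches.append((foreign, state))
    return matches


# Illustrative sample only; addresses and world names are invented.
SAMPLE = """\
Proto  Recv Q  Send Q  Local Address     Foreign Address    State        World ID  CC Algo  World Name
-----  ------  ------  ----------------  -----------------  -----------  --------  -------  -------------
tcp         0       0  192.0.2.11:16594  192.0.2.201:1234   ESTABLISHED     35185  newreno  netcpa-worker
tcp         0       0  192.0.2.11:443    192.0.2.50:51234   ESTABLISHED     34000  newreno  rhttpproxy
tcp         0       0  192.0.2.11:22     0.0.0.0:0          LISTEN          33000  newreno  sshd
"""

if __name__ == "__main__":
    for foreign, state in controller_connections(SAMPLE):
        print(foreign, state)
```

An empty result for a host corresponds to what we saw on the impacted cluster: no connections to the controllers at all. Matching on the exact port rather than on a substring avoids false positives from ephemeral ports that happen to contain 1234.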

Solution

The solution appeared to be fairly straightforward:

  • Log in to vCenter
  • Find the impacted cluster that is prepared for NSX
  • For each host do the following:
    • Put host into maintenance mode
    • Remove host from the cluster
      • Uninstallation of the NSX agents will be triggered
    • Add host back into the cluster
    • Click the Resolve button in the vSphere Web Client (Networking and Security > Installation > Host Preparation)
      • NSX notices that the agents are not installed
      • Host will be rebooted and agents will be installed
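
We performed these steps by hand in the vSphere Web Client. For larger clusters, the vCenter part of the procedure could be scripted; the following is a rough sketch under stated assumptions, not what we actually ran. It assumes host, cluster and standalone_folder are pyVmomi managed objects (a HostSystem, the NSX-prepared ClusterComputeResource, and a host folder outside that cluster) and that wait_for_task blocks until a vCenter task finishes. The function name reprepare_host and its parameters are hypothetical. The Resolve step in Host Preparation is deliberately left as a manual action.

```python
def reprepare_host(host, cluster, standalone_folder, wait_for_task):
    """Move one host out of the NSX-prepared cluster and back in again.

    Assumed inputs (all hypothetical names):
      host              -- pyVmomi vim.HostSystem
      cluster           -- pyVmomi vim.ClusterComputeResource prepared for NSX
      standalone_folder -- pyVmomi vim.Folder for hosts outside the cluster
      wait_for_task     -- callable that blocks until a vCenter task completes
    Returns the list of steps that were carried out, in order.
    """
    done = []

    # 1. Put the host into maintenance mode (timeout=0 means no timeout).
    wait_for_task(host.EnterMaintenanceMode_Task(timeout=0))
    done.append("maintenance mode")

    # 2. Remove the host from the cluster by moving it into a host folder.
    #    Leaving the prepared cluster is what triggers the agent removal.
    wait_for_task(standalone_folder.MoveIntoFolder_Task(list=[host]))
    done.append("removed from cluster")

    # 3. Add the host back into the cluster.
    wait_for_task(cluster.MoveInto_Task(host=[host]))
    done.append("added back to cluster")

    # 4. Not automated here: click Resolve under Host Preparation so that
    #    the agents are reinstalled and the host is rebooted.
    return done
```

With pyVmomi installed, WaitForTask from pyVim.task could be passed as wait_for_task. Handling one host at a time keeps the rest of the cluster available while each host is in maintenance mode.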

Once this procedure had completed, we saw connections to the controllers when checking port 1234 and everything seemed to be fixed!

Conclusion

The solution to the connection issues between the ESXi hosts and the NSX controllers appears to be fairly easy. If there is a better solution, please feel free to let us know. We are still searching for the root cause of this issue.

Thanks to William De Keyzer and Rik Herlaar for the troubleshooting assistance!

yannickstruyf