We recently ran into an issue where we could not get dynamic routing to work: the routers saw no OSPF neighbours (verified by running show ip ospf neighbor on the routers), and we also had communication issues between deployed Edge Services Gateways (ESGs) and Distributed Logical Routers (DLRs).
This was of course very inconvenient, so we were eager to find a solution. After some basic troubleshooting and a review of our OSPF setup, we quickly concluded that everything was configured as it should be. One thing we did notice was that when we put the DLRs and ESGs on the same ESXi host, everything worked like a charm! So maybe we had connection issues between the ESXi hosts and the controllers?
The next step in troubleshooting was to check the Installation tab in VMware vCenter (Networking & Security > Installation). According to VMware vCenter everything was working as expected, and at first sight no issues were present. However, when we clicked the Communication Channel Health button on the impacted cluster, we noticed that everything was listed as down.
After some googling we found the magnificent blog post from Cormac Hogan on NSX troubleshooting tips. When working through the third step, we noticed that there were no connections from the ESXi hosts to the controllers on port 1234. So the issue was found, but what about the solution?
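The port 1234 check can be sketched in a few lines of Python. This is a minimal sketch, assuming the connection list comes from `esxcli network ip connection list` on the ESXi host. The function names, the column layout in `SAMPLE`, and the addresses are illustrative assumptions, not captured output from our environment. It matches on the foreign port exactly, so a client port such as 51234 is not mistaken for a controller connection.

```python
CONTROLLER_PORT = 1234  # port the hosts use to reach the NSX controllers


def controller_connections(esxcli_output, port=CONTROLLER_PORT):
    """Return (foreign_address, state) for each TCP row whose foreign port is `port`."""
    matches = []
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Assumed column order: Proto, Recv Q, Send Q, Local Address,
        # Foreign Address, State, ... (header and separator rows are skipped
        # because their first field is not "tcp").
        if len(fields) < 6 or fields[0] != "tcp":
            continue
        foreign, state = fields[4], fields[5]
        if foreign.rsplit(":", 1)[-1] == str(port):
            matches.append((foreign, state))
    return matches


def has_established_connection(esxcli_output, port=CONTROLLER_PORT):
    """True if at least one connection to `port` is in state ESTABLISHED."""
    return any(state == "ESTABLISHED"
               for _, state in controller_connections(esxcli_output, port))


# Illustrative sample only (assumed layout, placeholder addresses).
SAMPLE = """\
Proto  Recv Q  Send Q  Local Address     Foreign Address   State        World ID  CC Algo  World Name
-----  ------  ------  ----------------  ----------------  -----------  --------  -------  ----------
tcp         0       0  192.0.2.11:22     192.0.2.50:51234  ESTABLISHED     34567  newreno  sshd
tcp         0       0  192.0.2.11:40112  192.0.2.21:1234   ESTABLISHED     35185  newreno  netcpa-worker
"""

if __name__ == "__main__":
    for foreign, state in controller_connections(SAMPLE):
        print(foreign, state)
    print("controller reachable:", has_established_connection(SAMPLE))
```

To use it against a real host, save the command output to a file (for example over SSH) and pass the file contents to `has_established_connection`. An empty result corresponds to the symptom described above.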
The solution appeared to be fairly straightforward:
- Log in to vCenter
- Find the impacted cluster that is prepared for NSX
- For each host, do the following:
  - Put the host into maintenance mode
  - Remove the host from the cluster
    - Uninstallation of the NSX agents will be triggered
  - Add the host back into the cluster
  - Click the Resolve button in the vSphere Web Client (Networking & Security > Installation > Host Preparation)
    - NSX notices that the agents are not installed
    - The host will be rebooted and the agents will be installed
Once this procedure had completed, the check for connections on port 1234 returned output: the hosts now showed connections to the controllers, and everything seemed to be fixed!
The fix for the connection issues between the ESXi hosts and the NSX controllers turned out to be fairly easy. If there is a better solution, please feel free to let us know. We are still searching for the root cause of this issue.