Recently when attempting to add additional vRA appliances to a 7.4 cluster, I came across an issue where the process became stuck.
After clicking Join Cluster, the on-screen status became stuck at 48% with the message Finishing rabbitmq-server. The node was left alone for a couple of hours, but there was no progress.
The node was not dead, as /var/log/messages was still receiving updates, including the join progression status. At this time the only error in the logs was from the postgres DB, which was expected given the circumstances, so I started to look at what RabbitMQ was doing.
The command service rabbitmq-server status showed some information about the current state.
# service rabbitmq-server status
DIAGNOSTICS
===========
attempted to contact: ['rabbit@vra03']
rabbit@vra03:
* connected to epmd (port 4369) on vra03
* epmd reports node 'rabbit' running on port 25682
* TCP connection succeeded but Erlang distribution failed
* suggestion: hostname mismatch?
* suggestion: is the cookie set correctly?
* suggestion: is the Erlang distribution using TLS?
current node details:
- node name: 'rabbitmq-cli-1045@vra03'
- home dir: /var/lib/rabbitmq
- cookie hash: ***
Attempts to stop the server succeeded, but starting the service would just hang.
A few moments (hours) of troubleshooting (swearing) later, I came across this article on RabbitMQ hostnames. It was a simple thing to test, so I followed the steps.
NOTE: At this point, DNS and network connectivity between the vRA nodes had been verified.
Comparing the file /etc/rabbitmq/rabbitmq-env.conf between the new and existing nodes showed a difference.
Existing node
NODENAME=rabbit@vra01.corp.local
USE_LONGNAME=true
New Node
NODENAME=rabbit@vra03
USE_LONGNAME=true
At this point, I should state that the new nodes were deployed with the FQDN entered as the node hostname / IP address.
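The comparison above can also be scripted. The following is a minimal sketch, not part of vRA or RabbitMQ: the function names nodename_host and check_nodename are hypothetical, and it assumes the file contains a single NODENAME=rabbit@host line as in the listings above. It only compares strings and does not query DNS.

```shell
#!/bin/sh
# Hypothetical helpers, written for illustration only.

# Print the host part of NODENAME from a rabbitmq-env.conf style file.
nodename_host() {
    # Keep only NODENAME lines and strip everything up to the first '@'.
    sed -n 's/^NODENAME=[^@]*@//p' "$1" | tail -n 1
}

# Succeed only if the NODENAME host part equals the expected FQDN.
# Both values are supplied by the caller.
check_nodename() {
    conf_file="$1"
    expected_fqdn="$2"
    actual="$(nodename_host "$conf_file")"
    if [ "$actual" = "$expected_fqdn" ]; then
        echo "OK: NODENAME host is $actual"
    else
        echo "MISMATCH: NODENAME host is '$actual', expected '$expected_fqdn'"
        return 1
    fi
}
```

On a node, this could be called as check_nodename /etc/rabbitmq/rabbitmq-env.conf "$(hostname -f)", assuming hostname -f returns the FQDN the appliance was deployed with.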
I updated the NODENAME setting on the new nodes to be the FQDN instead of the short name and restarted the rabbitmq-server service. This was a great success: my first error message was gone. But I had a new, enthralling error message.
attempted to contact: [rabbit@vra03.corp.local]
rabbit@vra03:
* connected to epmd (port 4369) on vra03
* epmd reports node 'rabbit' running on port 25682
* TCP connection succeeded but Erlang distribution failed
Hostname mismatch: node "rabbit@vra03.corp.local" believes its host is different. Please ensure that hostnames resolve the same way locally and on "rabbit@vra03.corp.local"
current node details:
- node name: 'rabbitmq-cli-30@vra03.corp.local'
- home dir: /var/lib/rabbitmq
- cookie hash:
After troubleshooting until my voice was hoarse, I ran ps ax | grep rabbit and found that even though the rabbitmq-server service was stopped, there were still RabbitMQ processes running. Most of the processes appeared to have been started by VAMI as part of the cluster join process.
I killed all the PIDs and restarted rabbitmq-server. This time it started without issue, and the information returned by service rabbitmq-server status was correct.
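For anyone repeating the leftover-process check, here is a rough sketch of the same ps ax | grep rabbit step as a small filter. The function name rabbit_pids is hypothetical. It matches the same "rabbit" string I searched for, so it will not catch helper processes such as epmd whose command line does not mention RabbitMQ.

```shell
#!/bin/sh
# Hypothetical helper, written for illustration only.
# Reads `ps ax` style output on stdin and prints the PID (first column)
# of every line mentioning "rabbit", skipping the grep/awk helper
# processes that would otherwise match themselves.
rabbit_pids() {
    awk '/rabbit/ && !/grep|awk/ { print $1 }'
}
```

With the rabbitmq-server service stopped, ps ax | rabbit_pids should print nothing. Any PIDs it does print are candidates for the clean-up described above, and are worth reviewing by hand before passing them to kill.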
The next attempt to join the cluster worked successfully.
On the other node I was attempting to add to the cluster, instead of killing the processes, I rebooted the node, and it then joined the cluster successfully.
If you come across this issue and attempt to resolve it yourself, make sure that your vRA environment is backed up and take snapshots before attempting anything.
TL;DR - Before adding vRA 7.4 appliances to a cluster, check the rabbitmq-env.conf file on each new node and confirm that NODENAME uses the FQDN, as it does on the existing nodes.