Clustering RabbitMQ on Windows

Getting up and running with RabbitMQ (especially when using MassTransit) is a breeze. It may not exactly be the NancyFX Super-Duper-Happy-Path™, but it comes closer to it than any other ServiceBus/Broker I’ve come across.

However, the first time I laid eyes on the RabbitMQ guidance for Distribution, High Availability, and Clustering, I started thinking that maybe this wasn't such a good idea...

Anyway, a few days ago I finally spent a couple of hours together with our Ops guys building a lab environment. Turns out that setting up a RabbitMQ cluster on Windows Server isn’t that hard after all, but we did run into a few gotchas along the way. Thought I’d share the experience with you!

[Image: RabbitMQ cluster diagram]

High-level overview of the setup

We decided to use dedicated virtual machines for running the Rabbit cluster. The alternative would have been to run it on the same boxes that run our application services, but somehow it feels better to have it separated. I’m sure you can argue pros and cons here, but I’m going to leave that to someone else.

The RabbitMQ boxes have a load balancer in front so that the application services don’t need to know which physical machine(s) to connect to. The load balancer is currently configured with a simple round robin algorithm and will automatically remove non-responsive machines from the rotation.

Installing

First thing to do is to install Erlang and RabbitMQ. Just get the latest stable versions from erlang.org and rabbitmq.com. By default, RabbitMQ installs its database in the %APPDATA% folder of whoever installs the service, which seems far from ideal in a server environment. So before running the installation we made sure to set the %RABBITMQ_BASE% environment variable to a more appropriate location (for us it's d:\RabbitMQ). Other than that, the installation is just a matter of doing the Next->Next->Next->Confirm dance.
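
If you want to script this, setting the variable machine-wide from an elevated command prompt could look something like the following (d:\RabbitMQ is just the location we picked, adjust to taste):

rem /M makes the variable system-wide; requires an elevated prompt
setx RABBITMQ_BASE "d:\RabbitMQ" /M

Note that setx only affects processes started afterwards, so set the variable before running the installer.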

Next we want to enable the management web interface. This is done by first enabling the plugin:

rabbitmq-plugins enable rabbitmq_management

Next, the changes have to be properly applied by reinstalling the service:

rabbitmq-service stop
rabbitmq-service install
rabbitmq-service start

You can make sure that the management interface is working by pointing a web browser to http://localhost:15672.

Erlang Cookie

The documentation for RabbitMQ mentions that in order to connect a RabbitMQ cluster, all instances must share the same Erlang cookie. The cookie is a simple text file called .erlang.cookie that resides in your user folder (i.e. c:\Users\[UserName]). In order for the RabbitMQ service to work, the same cookie file must also exist in your %WINDIR% (usually c:\Windows). All this is usually handled automatically by the installer and command line tools.

However, there is one important gotcha: if you log on to the box with a new account, that account gets a new Erlang cookie that doesn't match the one the RabbitMQ service is running with, which means it won't be able to administer the service. The easy solution is to copy the cookie in c:\Windows to the user directory of everyone who may want to use the RabbitMQ command line tools. We are probably going to handle this by pushing a login script to all users of the box.
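
In case it helps, that login script could be as simple as this one-liner (assuming a default install, where the service's cookie lives in %WINDIR%):

rem overwrite the per-user cookie with the one the service uses
copy /Y "%WINDIR%\.erlang.cookie" "%USERPROFILE%\.erlang.cookie"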

(Windows) Firewall settings

This information is somewhat scattered in the RabbitMQ documentation, so here goes. You need the following things opened up:

Port   Used for
5672   The main AMQP port that clients use to talk to the broker.
15672  The management web interface.
4369   EPMD (the Erlang Port Mapper Daemon), which makes sure that the nodes can find each other.
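
If you prefer scripting the firewall over clicking through the GUI, something along these lines should do the trick (the rule names are arbitrary):

rem open the AMQP, management, and EPMD ports for inbound TCP
netsh advfirewall firewall add rule name="RabbitMQ AMQP" dir=in action=allow protocol=TCP localport=5672
netsh advfirewall firewall add rule name="RabbitMQ Management" dir=in action=allow protocol=TCP localport=15672
netsh advfirewall firewall add rule name="Erlang EPMD" dir=in action=allow protocol=TCP localport=4369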

Finally, you also need to allow the Erlang runtime (erl.exe) to communicate between the nodes. You can either set a port range yourself in rabbitmq.config (inet_dist_listen_min and inet_dist_listen_max) and open up that range in the firewall, or you can do what we did and allow erl.exe to communicate freely. Note that the Erlang installation contains two bin folders that both contain erl.exe. The one that seems to be used by the service is in the erts subdirectory, named something like C:\Program Files\erl5.9.3.1\erts-5.9.3.1\bin.
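
If you go the port range route instead, the rabbitmq.config entry could look something like this sketch (25672 is just an example, pick a range that suits your environment):

%% rabbitmq.config - restrict Erlang distribution to a fixed port
[
  {kernel, [
    {inet_dist_listen_min, 25672},  %% example values, pick your own
    {inet_dist_listen_max, 25672}
  ]}
].

Remember to restart the service after changing the config, and to open the same range in the firewall on every node.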

Joining a cluster

Creating and/or joining a cluster is actually the easiest bit. You just log on to one of the machines and issue these commands (in this case joining the cluster via the node on RABBIT01):

rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@RABBIT01
rabbitmqctl start_app

NOTE that the host name of the server (RABBIT01 in this case) should be in UPPER CASE when working with Windows instances.

We decided to go with only disc nodes, mostly because it doesn't seem like using RAM nodes would be much of a performance gain anyway. But I may be completely off here...
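
If you do want a RAM node despite that, join_cluster takes a flag for it:

rabbitmqctl join_cluster --ram rabbit@RABBIT01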

Mirroring your queues

When the cluster is up and running, queues and exchanges can be created the same way you usually do. However, the queues will still be localized to a single box, and if that box fails the queue is gone. This is likely not what you want if the reason for clustering was to make things Highly Available.

In previous versions of RabbitMQ, a queue was mirrored by setting the "x-ha-policy" argument when declaring the queue. As of version 3.0, you instead set policies using either the management web interface or the command line tool:

rabbitmqctl set_policy ha-all "^ha\." "{""ha-mode"":""all""}"

The above line creates a new policy called "ha-all" (the name is not important) that matches all queues whose names start with "ha." (using a regular expression) and sets the ha-mode policy to all, which mirrors those queues to all nodes in the cluster.
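
You can check that the policy took effect, and see which queues it currently applies to, with the standard rabbitmqctl listing commands:

rabbitmqctl list_policies
rabbitmqctl list_queues name policy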

Profit!

After getting everything up and running, we started trying to break things. Mostly everything worked just as expected. When stopping the service on one machine (or completely shutting down the machine), the cluster recovered as expected. It is important to realize that messages that arrived in the cluster while a node was down won't be mirrored to that node when it comes back up. So in a scenario where you need to restart all machines in a cluster, you should follow a procedure such as this:

  1. Restart node 1.
  2. Bring node 1 back into the cluster (this usually happens automatically).
  3. Monitor your queues (for example with the command shown below this list) and wait until all unmirrored messages have been processed.
  4. Restart node 2.
  5. Etc, etc.
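
One way to keep an eye on the mirroring is to list the mirror pids per queue; in RabbitMQ 3.x something like this should work:

rabbitmqctl list_queues name messages slave_pids synchronised_slave_pids

A queue whose synchronised_slave_pids list is shorter than its slave_pids still has messages that only exist on some of the nodes.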

Please let me know if I’ve missed something crucial or if there are things you’d like to know more about!



12 responses to “Clustering RabbitMQ on Windows”

  1. Hello,

    Thank you for the article! It sums up very well all the operations needed to set up a cluster with mirroring. I'm also setting this up and testing it.

    We are also setting this up on Windows, which you seem to be doing too. May I ask what you are using as a load balancer in front of the rabbit servers?

    Thanks,


    1. We already had a hardware load balancer that we use for everything else, so we set it up for this too. I think it's an F5, but pretty much anything should do the trick.


      1. Wow! Thanks for the prompt response!


  2. Great post! We are still battling to get our nodes to cluster… Hopefully we will get it sorted soon.

    I spotted one mistake in the post 🙂 the EPMD port is 4369 and not 4639. (there is a little dyslexic kid in all of us 🙂 )


    1. Ahh, thanks for spotting that Brendan! Mistake now corrected.. And good luck clustering! 🙂


  3. Thanks for putting this together. Love the blog name.


  4. Just say no to rabbit clustering. It does not support network partitions (even their own docs say as much). Split brain + messaging = suck


    1. Yes, that has already bitten us a few times (though we are currently on a rather unmanaged environment that goes up/down as it sees fit...)

      What would you suggest instead? Single rabbit nodes with redundancy built on the level above instead? Or something other than Rabbit?


      1. We have recently been in touch with a RabbitMQ consultancy regarding clustering and mirroring and their stance was “yeah, don’t do that”. We are moving it up a level, as you say.


      2. Do you use any out-of-the-box solution for that level above, or did you stay with a custom implementation? Also, I assume that you still use queue mirroring, but only as a hot standby rather than in load-balanced mode.
        Thanks!


  5. What do you mean by a level above? Can you elaborate?

