Socorro releases RabbitMQ into production!
On Tuesday, the Socorro team (mainly led by Selena Deckelmann and Brandon Burton) released RabbitMQ into production for Socorro crashes. This was a huge team effort, and it’s a tremendous accomplishment. I’m proud to have been involved in this process.
RabbitMQ will help us process crashes faster, more efficiently and without as much database traffic. We were also able to shut down Monitor, which had previously been responsible for queuing crashes, removing our “single point of failure” from our production environment.
So how does it all work?
A brief rundown of how Socorro works
A user crashes, and they submit that crash to us through the Breakpad dialog that comes up. The crash is collected by the (aptly named) Collectors, which write it to disk. Collectors are deliberately simple: they take the submission, write it to disk, and move on to the next connection.
That crash is then picked up by the CrashMovers, which insert it into a storage system of our choice (currently HBase, and we also insert the raw crash into Postgres). Crashes are then queued for processing, to avoid overwhelming our processors; the Processors process the raw crash, insert the resulting data into Postgres and HBase, and the crash becomes available for viewing on the Reporter (the web interface).
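The collector → crashmover → processor flow above can be sketched roughly as follows. This is a simplified illustration, not Socorro's actual code: the function names, the dict standing in for HBase/Postgres, and the trivial "signature" logic are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

def collect(crash: dict, spool_dir: Path) -> Path:
    """Collector: write the raw crash to disk and return immediately."""
    path = spool_dir / f"{crash['crash_id']}.json"
    path.write_text(json.dumps(crash))
    return path

def move(path: Path, storage: dict) -> str:
    """CrashMover: read the spooled crash, insert it into storage
    (a dict here, standing in for HBase/Postgres), and hand back the
    crash ID so it can be queued for processing."""
    crash = json.loads(path.read_text())
    storage[crash["crash_id"]] = {"raw": crash}
    return crash["crash_id"]

def process(crash_id: str, storage: dict) -> None:
    """Processor: turn the raw crash into processed data the Reporter
    web UI can display (here, just the top frame as a 'signature')."""
    raw = storage[crash_id]["raw"]
    storage[crash_id]["processed"] = {"signature": raw["stack"][0]}

# Walk one crash through the whole pipeline.
spool = Path(tempfile.mkdtemp())
storage = {}
spooled = collect({"crash_id": "abc123", "stack": ["nsFoo::Bar"]}, spool)
crash_id = move(spooled, storage)
process(crash_id, storage)
```

The key property this sketch tries to show is that each stage is independent and communicates only through durable intermediate state (disk, storage, a queue), so any stage can be restarted or scaled without the others noticing.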
What RabbitMQ means for all of this
Previously, the queuing of crashes was handled by the Monitor. It relied on a connection to the database, where it maintained and managed its queue. Besides increasing database traffic, the many temporary writes created a health issue for the database. Our queuing system dates back to before there was a real queue solution we could use, and also reflects the lack of resources Socorro had until just a few years ago. The Monitor was also a singleton, meaning that if it went down, processing stopped.
Monitor’s status as a singleton also means it can’t easily be scaled, which makes certain goals harder, such as processing 100% of crashes to help users identify why they crashed.
RabbitMQ, on the other hand, can be deployed as a cluster, is built to be used as a queue, is fast and efficient, and reduces database traffic by eliminating the reliance on the database for queuing. RabbitMQ also retries automatically (if an ACK is not received for a particular crash ID), and in our configuration the queue persists across restarts (or if a processor client fails). It is scalable and designed for high-volume systems.
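The ACK-and-retry behavior described above can be modeled in a few lines. This is a toy in-memory model of RabbitMQ's at-least-once delivery, not a real client (in production you would talk to the broker with a library such as pika, with the queue declared durable); the class and method names are invented for illustration.

```python
from collections import deque

class AckQueue:
    """Toy model of at-least-once delivery: a delivered message stays
    'unacked' and is requeued unless the consumer ACKs it."""

    def __init__(self):
        self._ready = deque()    # messages waiting for delivery
        self._unacked = {}       # delivery tag -> crash ID, delivered but not ACKed
        self._next_tag = 0

    def publish(self, crash_id: str) -> None:
        self._ready.append(crash_id)

    def get(self):
        """Deliver one message with a delivery tag, or None if empty."""
        if not self._ready:
            return None
        self._next_tag += 1
        crash_id = self._ready.popleft()
        self._unacked[self._next_tag] = crash_id
        return self._next_tag, crash_id

    def ack(self, tag: int) -> None:
        """Consumer confirms the crash was processed; drop it for good."""
        del self._unacked[tag]

    def requeue_unacked(self) -> None:
        """What the broker does when a consumer dies without ACKing."""
        self._ready.extend(self._unacked.values())
        self._unacked.clear()

# A processor takes a crash, dies before ACKing, and the crash survives.
q = AckQueue()
q.publish("abc123")
tag1, cid1 = q.get()        # delivered to processor 1, which then crashes
q.requeue_unacked()         # broker notices the dead consumer
tag2, cid2 = q.get()        # same crash redelivered to processor 2
q.ack(tag2)                 # processor 2 finishes; crash leaves the queue
```

This is why losing a processor no longer loses crashes: until a processor explicitly ACKs a crash ID, the broker considers it undelivered work.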
A great team
RabbitMQ wouldn’t have been possible without a great team of developers and IT engineers, including Selena Deckelmann, Erik Rose, Lars Lohn and Brandon Burton. Working with such a smart group of people is truly inspiring, and I can’t wait to see what we can do with RabbitMQ.