We do a lot of interesting work on the engineering side here at MailChimp. We rarely talk about the infrastructure pieces (except when bad things happen, in the interest of transparency). The lack of information isn’t due to any sort of secrecy—we’re just busy. Lately we’ve noticed a rise in interest about how we approach scalability, so I thought it would be useful to post some information on the topic.
Today, I’ll share some high-level numbers and a brief overview. Going forward, we’ll get into the nitty-gritty details.
We have millions of users, and we’re adding about 5,000 new ones every day. Even after accounting for duplicate emails across our users, that works out to billions of unique addresses (our customers’ customers) receiving our emails, opening them, clicking them, and showing up in analytics. Our MTAs send more than three billion emails to those addresses every month.
In an average hour, thousands of requests per second that aren’t absorbed by Akamai hit our load balancers. Excluding all replication replay from the count, tens of thousands of queries per second bypass every cache and hit our primary database shards. Most of those queries come from background jobs and sending, not front-end requests. And since a significant portion of our users are now international, we don’t really have a slow period in terms of load.
Our MySQL shards form the core layer that carries the data for all those users, emails, and analytics. Counting just one copy of user data and excluding all caching, backups, search clusters, and logs, we have over 17TB on fast disks (RAID 10s of SSDs or 15k SAS drives) directly accessible at full speed by users.
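Sharding means the app has to map each user to the right database shard before running a query. The post doesn’t describe MailChimp’s actual routing logic, so the sketch below is only an illustration of one common pattern — a static shard map keyed by user id — with hypothetical shard names and VIP hostnames:

```python
# Hypothetical shard map. MailChimp's real routing is not described in the
# post; a lookup table like this is just one common way to do it.
SHARDS = {
    0: "mysql://shard0-vip:3306",
    1: "mysql://shard1-vip:3306",
    2: "mysql://shard2-vip:3306",
    3: "mysql://shard3-vip:3306",
}

def shard_for_user(user_id: int) -> str:
    """Route a user to a shard (simple modulo here for illustration).

    The key property: the mapping is deterministic, so every request for
    a given user always lands on the same shard.
    """
    return SHARDS[user_id % len(SHARDS)]

print(shard_for_user(12345))  # prints mysql://shard1-vip:3306
```

Note that the app talks to a VIP rather than a physical host, so failing over between the two masters in a shard doesn’t require any change to this mapping.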
Once you add back all the components I just excluded, the data weight is substantial. For the 17TB of user database shards alone, we keep five copies in varying states of readiness. That gives us redundancy, the ability to do most maintenance without downtime, and several layers of protection against data loss.
If you’re interested, those five copies of each customer’s data break down like this:
- Two identical MySQL servers ready to serve queries in each shard: dual 6-8-core CPUs, 96GB of RAM, Percona 5.5, and RAID 10s of fast disks. The app points at a VIP that’s moved between them as needed.
- A separate MySQL slave running in a different city and datacenter.
- Nightly streaming binary backups (via Percona XtraBackup) of each MySQL instance, taken off the standby master.
- Nightly logical backups (via mysqldump) of every individual MySQL database (users are spread across many databases per instance), taken off the geographic replication slave.
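To make the two nightly backup tiers concrete, here’s a hedged sketch of the command lines each tier boils down to. The hosts, paths, and flag choices are illustrative assumptions, not MailChimp’s actual scripts — XtraBackup streams a binary copy of the whole instance, while mysqldump produces one logical dump per user database:

```python
from datetime import date

def binary_backup_cmd(datadir="/var/lib/mysql"):
    """Streaming binary backup of a whole instance (Percona XtraBackup).

    Illustrative flags only; a real script would add credentials,
    compression, and a destination for the xbstream output.
    """
    return ["xtrabackup", "--backup", "--stream=xbstream",
            f"--datadir={datadir}"]

def logical_backup_cmds(databases, outdir="/backups"):
    """One mysqldump per user database, run against the geographic slave.

    --single-transaction takes a consistent InnoDB snapshot without
    locking the tables for the duration of the dump.
    """
    stamp = date.today().isoformat()
    return [
        ["mysqldump", "--single-transaction", db,
         f"--result-file={outdir}/{db}-{stamp}.sql"]
        for db in databases
    ]
```

The design trade-off between the two tiers: the binary backup restores an entire instance quickly, while the per-database logical dumps let you restore a single customer’s data without touching anyone else on the shard.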
The database shards in the MailChimp app are just one component of our setup. Around them, we have layers of caching, app servers, job servers, load balancers, search clusters (individual Elasticsearch clusters carrying 20TB+), internal analytics datastores, and all sorts of other interesting things. Beyond MailChimp, we have many other entire projects that need to be reliable and fast: Mandrill, TinyLetter, AlterEgo, Email Genome (570GB of data in Redis/RAM alone), and more. We’ll get into those components in future posts. Altogether, we have hundreds of servers, mostly bare metal with high-end specs.
Though we like to think MailChimp has a fun and creative vibe both inside and outside the office, you can rest assured that we’re very serious about engineering and infrastructure. We’re constantly and rigorously adding capacity, tuning, protecting, and monitoring everything. Maybe most interesting is the fact that our dev team is sharding off a little itself, and forming a devops team (here’s another company’s take on that) just to manage MailChimp’s growth.