Algolia is a distributed Search-as-a-Service API that processes more than 4 billion user-generated queries per month. Performance is in Algolia’s DNA: the service is optimized to reply within milliseconds from any location worldwide while maintaining high availability. Today, Algolia’s co-founder and CTO Julien Lemoine provides details on their architecture, showing how they designed their fault-tolerant service and how they use Leaseweb.
Under the hood
Building a highly available API distributed over 12 regions isn’t an easy process – especially since our users can select the set of regions they want to use. Under the hood, we automatically replicate their data and synchronize every change across all of their regions. This approach is very different from a CDN, which only keeps a cache of recently accessed data: we guarantee our users that each of the regions they have selected holds all of their data locally and can serve their search requests. In other words, all write operations of our users are applied in all of their regions, even at a very high rate of write operations per second.
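To make the contrast with a CDN concrete, here is a minimal sketch (our illustration, not Algolia’s actual code, using hypothetical Region and WriteOperation types): every write a customer sends is fanned out to every region that customer has selected, so each region ends up holding the full index locally rather than a partial cache.

```cpp
// Minimal sketch (not Algolia's actual code): every write a customer sends is
// fanned out to all of the regions that customer has selected, so each region
// holds the full data set locally instead of a cache of recent objects.
#include <iostream>
#include <string>
#include <vector>

struct WriteOperation {               // hypothetical representation of one write
    std::string object_id;
    std::string body;
};

struct Region {                       // hypothetical region holding a full copy
    std::string name;
    std::vector<WriteOperation> log;  // every write ends up here, not a cache
    void apply(const WriteOperation& op) { log.push_back(op); }
};

int main() {
    // The customer picked three regions; all of them receive every write.
    std::vector<Region> selected = {{"us-east"}, {"eu-west"}, {"ap-southeast"}};

    WriteOperation op{"object42", R"({"title":"new product"})"};
    for (Region& region : selected) {
        region.apply(op);             // synchronized in every selected region
    }

    for (const Region& region : selected) {
        std::cout << region.name << " holds " << region.log.size()
                  << " operation(s)\n";
    }
    return 0;
}
```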
The Core Stack
- Kernel: Linux 3.13 highly tuned in terms of IO/Memory Usage
- Webserver: nginx
- API code: C++ code directly embedded inside Nginx as a module
- DNS: NSOne + Route53
- Redis: for rate limit check & real-time logs
- Website/Dashboard: Rails
- Monitoring: Server Density + custom monitoring infrastructure
- Code management: GitHub
The hardware
In early 2013, we started with a small setup consisting of an Intel Xeon E3 1245v2, 2x Intel SSD 320 series 120GB in RAID-0 and 32GB of RAM. This hardware was reasonably priced, more powerful than comparable cloud platforms, and allowed us to launch the service in Europe and US-East.
This setup enabled us to tune the kernel for I/O scheduling and virtual memory, which was critical to take advantage of all the available physical resources. Even so, we soon discovered that our limits were the amount of RAM and I/O. We were using around 10GB of RAM for indexing (applying write operations), which left only 20GB of RAM for caching the files used to answer search queries. Our goal has always been to keep customer indices in memory in order to deliver millisecond response times, so this hardware setup effectively capped us at 20GB of index data, which was too small.
After this first setup, we tried different machines with single- and dual-socket CPUs, 128GB and 256GB of RAM, and different models and sizes of SSDs.
We finally found an optimal setup: a machine with an Intel Xeon E5 1650v2, 128GB of RAM and 2x 400GB Intel S3710 SSDs in RAID-0. The SSD model was very important for durability; we burned through a lot of SSDs before finding the right one, able to operate in production for years with very good performance.
High availability
In order to be highly available, we duplicate all of our users’ data on three different machines that act as perfect replicas. Any of these machines can be the master and accept write operations. To make sure the three machines always stay synchronized, every write operation received is copied to all machines, and we trigger a consensus vote to assign a job identifier to this new write operation.
The job identifier has the nice property of being a sequential ID: there are no missing IDs in the sequence. This assignment process makes checking consistency between all the machines easy: each machine only needs the whole sequence and applies it in the same order.
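Here is a minimal sketch of this idea, under our own assumptions rather than Algolia’s actual implementation: a stand-in consensus object hands out gap-free sequential job IDs, and each machine applies jobs strictly in ID order, so two machines are in sync exactly when they have applied the same last ID.

```cpp
// Minimal sketch (assumptions on our side, not Algolia's implementation): the
// consensus round assigns each accepted write a gap-free, strictly increasing
// job ID, and every machine applies the jobs in exactly that order. Checking
// that two machines are in sync then reduces to comparing the last applied ID.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Stand-in for the consensus step: in reality the leader proposes the ID and a
// majority of the three machines has to acknowledge it.
class Consensus {
    uint64_t next_id_ = 1;
public:
    uint64_t assign_job_id() { return next_id_++; }   // sequential, no gaps
};

class Machine {
    uint64_t last_applied_ = 0;
    std::map<uint64_t, std::string> pending_;         // out-of-order jobs wait here
public:
    void receive(uint64_t job_id, const std::string& op) {
        pending_[job_id] = op;
        // Apply strictly in sequence: a missing ID blocks everything after it.
        while (!pending_.empty() && pending_.begin()->first == last_applied_ + 1) {
            last_applied_ = pending_.begin()->first;
            pending_.erase(pending_.begin());
        }
    }
    uint64_t last_applied() const { return last_applied_; }
};

int main() {
    Consensus consensus;
    Machine a, b;

    uint64_t id1 = consensus.assign_job_id();
    uint64_t id2 = consensus.assign_job_id();

    a.receive(id1, "add object"); a.receive(id2, "update object");
    b.receive(id2, "update object");          // arrives out of order: buffered
    std::cout << "b applied up to job " << b.last_applied() << "\n";   // 0
    b.receive(id1, "add object");             // gap filled, both jobs applied
    std::cout << "in sync: " << (a.last_applied() == b.last_applied()) << "\n";
    return 0;
}
```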
Building a consensus algorithm is a really complex task that requires extensive expertise. We first tried to use ZooKeeper, but its timeout-based approach was not good enough for us: while we need to react within a few milliseconds, it can take several seconds to react to a problem. We decided to implement a well-known algorithm directly in our indexing code: the Raft consensus algorithm, which is easier to understand than Paxos.
Now what happens in the case of a failure?
- If one of the three machines is down: there is no impact on our end users; the two remaining machines can still apply write operations and reply to search queries. These machines keep all write operations locally and send them to the third machine when it comes back to life.
- If two of the three machines are down: search is still available, but indexing operations are not applied because no consensus is possible (see the sketch below).
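A minimal sketch of the failure behaviour described above, assuming the usual majority rule for a three-node consensus (our simplification, not Algolia’s code): writes need two healthy machines, while search only needs one.

```cpp
// Minimal sketch of the failure scenarios above (assumed majority rule, not
// Algolia's code): with 3 replicas, a write needs a majority (2 machines) to
// reach consensus, while a search only needs a single healthy machine.
#include <iostream>

constexpr int kClusterSize = 3;

bool can_index(int healthy_machines) {
    // Consensus needs a strict majority of the cluster.
    return healthy_machines > kClusterSize / 2;
}

bool can_search(int healthy_machines) {
    // Any single machine holds the full data set and can answer queries.
    return healthy_machines >= 1;
}

int main() {
    for (int healthy = 3; healthy >= 0; --healthy) {
        std::cout << healthy << " machine(s) up -> "
                  << "indexing: " << (can_index(healthy) ? "yes" : "no") << ", "
                  << "search: "   << (can_search(healthy) ? "yes" : "no") << "\n";
    }
    return 0;
}
```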
Our goal is to make sure the two- or three-host failure scenario cannot happen! We started with three machines hosted in the same data center. Although they were on different network equipment, that was not enough, because the network was the major source of issues. We then improved our setup by splitting the machines across two different data centers and network providers to reduce the probability of failure. We go even further in the US, where we have 60% of our customers: we spread the machines across three different data centers and network providers. Leaseweb is one of the three providers we are using in the US. With three physical data centers, three independent service providers and three independent Autonomous Systems, we can focus on building resilient software without worrying about the underlying infrastructure or BGP politics.
As the consensus requires several round trips between machines for each write, it can become a bottleneck. To mitigate this possible congestion, we batch write operations just before the consensus step, and we made sure the three different providers are close together (< 5ms between them).
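The batching step could look roughly like the minimal sketch below (our assumption of how such a batcher might work, with a hypothetical ConsensusBatcher class and an arbitrary batch size): pending writes are accumulated and committed through a single consensus round per batch, amortizing the round trips.

```cpp
// Minimal sketch of batching before consensus (our assumptions, not the actual
// implementation): instead of paying the consensus round trips for every single
// write, pending writes are accumulated and proposed as one batch.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

class ConsensusBatcher {
    std::vector<std::string> pending_;
    std::size_t max_batch_size_;
    std::size_t rounds_ = 0;
public:
    explicit ConsensusBatcher(std::size_t max_batch_size)
        : max_batch_size_(max_batch_size) {}

    void submit(const std::string& op) {
        pending_.push_back(op);
        if (pending_.size() >= max_batch_size_) flush();
    }

    // One consensus round (a few inter-provider round trips) covers the batch.
    void flush() {
        if (pending_.empty()) return;
        ++rounds_;
        std::cout << "consensus round " << rounds_ << " commits "
                  << pending_.size() << " operation(s)\n";
        pending_.clear();
    }

    std::size_t rounds() const { return rounds_; }
};

int main() {
    ConsensusBatcher batcher(100);
    for (int i = 0; i < 250; ++i) {
        batcher.submit("write #" + std::to_string(i));
    }
    batcher.flush();  // commit the remaining partial batch
    std::cout << "250 writes handled in " << batcher.rounds()
              << " consensus rounds\n";
    return 0;
}
```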
The compromise
According to the CAP theorem, it is impossible for a distributed system to provide Consistency, Availability and Partition Tolerance at the same time.
Partition tolerance is the most critical property for us, since a network partition can otherwise result in desynchronized content (also known as split-brain). Our consensus approach lets our system tolerate partitions while staying highly available.
After a consensus, each machine applies the write operation locally, without further communication with the other machines. So, in terms of the theorem, we compromise on Consistency: we don’t guarantee that all nodes see exactly the same data at the same time, but they will all receive every operation. In practice, there is less than one second between the application of an operation on the first machine and on the last one, so it is normally not visible to end users.
Distribution
Services are becoming more and more global, and serving search queries from only one worldwide region is far from optimal. For example, search hosted only in US-East offers a very different experience depending on where users are searching from: latency goes from a few milliseconds for users in US-East to several hundred milliseconds for users in Asia, and that’s without taking into account the bandwidth limitations of saturated overseas fibers.
We have seen some companies put a CDN on top of a search engine to address this issue. In our opinion this ends up causing more problems than it adds value: invalidating the cache is a nightmare, and it only improves speed for the small percentage of queries that are made frequently. It was clear to us that in order to solve this problem we would need to replicate indices in different regions and keep them loaded in memory in order to answer user queries efficiently.
What we need is inter-region replication on top of our existing three-machine cluster replication. A replica can rely on a single machine, since it is only used for search queries; all write operations still go to the customer’s original cluster.
The implementation of this architecture is modeled on our consensus-based stream of operations. After consensus, each cluster sends its own stream of write operations to all replicas as batches of operations, to avoid as much latency as possible: sending jobs one by one would result in too many round trips with the replicas. Our consensus approach simplified the development of this feature a lot, since we already have a consistent stream of operations!
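Here is a minimal sketch of that replication stream, under our own assumptions (hypothetical Cluster and ReplicaRegion types, not the real code): the cluster keeps its consensus-ordered stream of jobs, and each replica only receives the tail of the stream it has not applied yet, as a single batch.

```cpp
// Minimal sketch of inter-region replication (our assumptions, not the real
// code): the cluster already has a consistent, consensus-ordered stream of
// write operations, so each replica region just tracks the last job ID it has
// applied and receives the missing tail of the stream as one batch.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Job { uint64_t id; std::string op; };

struct ReplicaRegion {
    std::string name;
    uint64_t last_applied = 0;
    void apply_batch(const std::vector<Job>& batch) {
        for (const Job& job : batch) last_applied = job.id;
    }
};

class Cluster {
    std::vector<Job> committed_;   // consensus-ordered stream of operations
public:
    void commit(const std::string& op) {
        committed_.push_back({committed_.size() + 1, op});
    }
    // Send everything the replica has not seen yet, as a single batch.
    void replicate_to(ReplicaRegion& replica) const {
        std::vector<Job> batch;
        for (const Job& job : committed_)
            if (job.id > replica.last_applied) batch.push_back(job);
        if (!batch.empty()) replica.apply_batch(batch);
    }
};

int main() {
    Cluster us_east;                 // the customer's main three-machine cluster
    ReplicaRegion tokyo{"ap-northeast"};

    us_east.commit("add object 1");
    us_east.commit("add object 2");
    us_east.replicate_to(tokyo);     // one batch instead of one trip per job
    us_east.commit("add object 3");
    us_east.replicate_to(tokyo);     // only the new tail is sent

    std::cout << tokyo.name << " is caught up to job " << tokyo.last_applied << "\n";
    return 0;
}
```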
The last part is redirecting end-user queries to the closest location. We use anycast DNS with Geo-IP routing to perform this operation: each customer has a custom record that contains the list of regions they have selected. We also use DNS to load-balance traffic between the hosts of a region, with a health check that removes failed hosts from the pool. Since the health check alone is not enough for very high availability, we have also implemented retry logic in all our API clients, which are aware of the DNS records of the three master hosts. We can even fall back on another DNS provider and TLD, to be sure there is no single point of failure in our architecture!
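The client-side retry could look roughly like this minimal sketch (hypothetical host names and a mocked request function, not the real API clients): the client walks through the DNS records of the three master hosts, plus a fallback domain, until one of them answers.

```cpp
// Minimal sketch of the client-side retry strategy (hypothetical host names,
// not the real records): the API client knows the DNS records of the three
// master hosts, plus a fallback domain on a second DNS provider/TLD, and walks
// through them until one host answers.
#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Stand-in for an actual HTTPS call; returns the response body on success.
using RequestFn = std::function<std::optional<std::string>(const std::string&)>;

std::optional<std::string> search_with_retry(const std::vector<std::string>& hosts,
                                             const RequestFn& do_request) {
    for (const std::string& host : hosts) {
        if (auto response = do_request(host)) {
            return response;               // first healthy host wins
        }
        std::cout << host << " failed, trying next host\n";
    }
    return std::nullopt;                   // every host and the fallback failed
}

int main() {
    // Hypothetical per-application records, ending with a fallback TLD.
    std::vector<std::string> hosts = {
        "appid-1.example.net", "appid-2.example.net", "appid-3.example.net",
        "appid-1.example-fallback.org"};

    RequestFn fake_request = [](const std::string& host) -> std::optional<std::string> {
        // Simulate the first host being down.
        if (host == "appid-1.example.net") return std::nullopt;
        return "{\"hits\": []}";
    };

    if (auto result = search_with_retry(hosts, fake_request)) {
        std::cout << "got response: " << *result << "\n";
    }
    return 0;
}
```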
Read more
We see more and more services facing problems similar to ours: serving a worldwide audience with a multi-region infrastructure while keeping some information, like logins or content, consistent worldwide. Having a multi-region infrastructure is mandatory today to achieve an excellent user experience. This approach can be used, for example, to distribute read-only replicas of a database that remain consistent worldwide.
You can read more about our architecture in this High Scalability blog post or in the talk I will give on the 4th of June at the Leaseweb Tech Summit.