Whenever we build a new CDN SuperPoP or improve our infrastructure in some other way, we keep asking ourselves how to keep our systems scalable. You see, we did not just ‘create a CDN’; we created a CDN that can compete with the big players in the market, so we built it with big numbers in mind. We have reserved half a terabit per second of capacity for streaming, served from about 1,500 SSDs of half a terabyte each. With this setup we can handle hundreds of thousands of requests per second. But this is where things get a little complicated: how exactly do we keep track of who is doing what on our CDN nodes?
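For a rough sense of scale, the back-of-the-envelope math works out as follows. The SSD count, drive size, and streaming capacity come from the figures above; the 400,000 requests-per-second figure is a hypothetical value picked from the "hundreds of thousands" range just to illustrate per-request bandwidth:

```python
# Back-of-the-envelope capacity math using the figures above.
num_ssds = 1500
ssd_capacity_tb = 0.5          # half a terabyte per SSD

total_storage_tb = num_ssds * ssd_capacity_tb
print(f"Total SSD storage: {total_storage_tb:.0f} TB")   # 750 TB

streaming_capacity_gbps = 500  # half a terabit per second
requests_per_second = 400_000  # hypothetical, within "hundreds of thousands"

# If the full streaming capacity were spread evenly over that many
# concurrent requests, each one would get:
bits_per_request = streaming_capacity_gbps * 1e9 / requests_per_second
print(f"Average per-request bandwidth: {bits_per_request / 8 / 1024:.0f} KiB/s")
```

Even fully loaded, each request still gets well over a hundred kibibytes per second on average, which is why the accounting problem, not raw capacity, is the interesting part.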
One way would be to use virtualization to separate customers and do accounting based on the interface counters (preferably on the hypervisors). We did not choose this route, mainly because of the I/O overhead we expected.
The solution we chose is log analysis on the access logs. This has the positive side effect that the same pipeline can collect analytics on the CDN traffic: browser versions, device types, popular URLs and more. Our dashboard displays these numbers in near real time, with at most a five-minute delay. The data flows directly from the log lines into the database, where our Hadoop aggregation jobs run continuously. The aggregated results are stored and accessible from our control panel, where they can be viewed in real time and interactively filtered by various dimensions.
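To make the idea concrete, here is a toy sketch of the kind of per-line extraction and counting the aggregation jobs perform. The real pipeline runs on Hadoop at far larger scale; the log format, field names, and sample lines below are illustrative assumptions (a common NCSA-style combined format), not our exact setup:

```python
import re
from collections import Counter

# Illustrative combined-log-format pattern; our real format may differ.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def aggregate(lines):
    """Count requests per URL and per user agent from raw access-log lines."""
    urls, agents = Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip malformed lines rather than failing the job
        urls[m.group("url")] += 1
        agents[m.group("agent")] += 1
    return urls, agents

# Two hypothetical sample log lines.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /video.mp4 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Oct/2023:13:55:37 +0000] "GET /video.mp4 HTTP/1.1" 206 256 "-" "VLC/3.0"',
]
urls, agents = aggregate(sample)
print(urls.most_common(1))  # [('/video.mp4', 2)]
```

The same pass that bills a customer for their traffic also yields the browser, device, and URL statistics essentially for free, which is why we preferred this route over interface counters.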
Right now we are running 15 database servers that work together as one distributed database. These servers do not run a relational, ACID-compliant database like MySQL; instead, we use Cassandra, a scalable NoSQL key-value store. On this data we continuously run aggregation jobs to calculate the real-time numbers displayed in our CDN control panel.
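The five-minute freshness guarantee comes from bucketing counters by time window. Below is a minimal sketch of that idea: a plain Python dict stands in for the Cassandra counter table, and the row-key layout of (customer, metric, time bucket) is an assumption for illustration, not our actual schema:

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # five-minute aggregation window

# Stand-in for a distributed counter table; real storage is Cassandra.
store = defaultdict(int)

def bucket(ts: datetime) -> int:
    """Round a timestamp down to the start of its five-minute window."""
    epoch = int(ts.timestamp())
    return epoch - epoch % BUCKET_SECONDS

def record(customer: str, metric: str, ts: datetime, value: int = 1):
    """Increment the counter for this customer/metric/window."""
    store[(customer, metric, bucket(ts))] += value

# Hypothetical usage: two requests from customer "acme" in one window.
now = datetime(2023, 10, 10, 13, 57, 12, tzinfo=timezone.utc)
record("acme", "requests", now)
record("acme", "requests", now)
key = ("acme", "requests", bucket(now))
print(store[key])  # 2
```

Because keys are self-contained (customer, metric, window), the counters shard naturally across nodes, which is what lets a key-value store like Cassandra scale this workload where a single relational server would not.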
Big data analytics is a key feature of our CDN product, and we are very proud of the top-notch engineers who developed it.