Online infrastructure must be able to handle unprecedented amounts of traffic during Black Friday and other large-scale traffic surges. Here are our best practices, what you can expect, and how you can prepare.
This guest blog was written by Gabor Vincze, Senior Consultant of Infrastructure and Services at Yusp by Gravity R&D.
The Why
It is a fair assumption that more traffic means more value for your business, which in turn usually translates into more revenue. This is particularly true for SaaS providers – which is why dealing with events such as Black Friday is what we call “a good problem”.
The use case is common among SaaS providers that host services where unpredicted traffic surges can happen. What these surges are and how to deal with them is the primary focus of this blog.
Events such as Black Friday, live sports broadcasts, and limited-time offerings attract significant audiences which – if you are lucky – put real pressure on your platform.
While Gravity has been dealing with such cases for a long time (and can say there is no single recipe for handling them), we have developed best practices for preparing and maintaining our infrastructure so it can serve personalized recommendations without interruption despite unusual traffic patterns.
The How
In a perfect world, you would have unlimited processing power. However, much as we would like to, we do not live in a perfect world – and regardless of your architecture setup, this will never be the case.
With Leaseweb, we have been continuously working to find a sweet spot for scaling our infrastructure up and down before, during, and after these events. (And interestingly enough, the problem described in this article was the original issue that led to the birth of hyperscalers).
So, how do we try to find this sweet spot?
Warmup. Analyze. Optimize. Analyze again. Optimize. Monitor. Keep it up and running. Scale down. Analyze. Go back to the setup with organic growth.
Warmup means dedicating and starting additional instances so that more processing power is available, and it must be done before the event. If the warmup is not finished by the time the event begins, you may be in for some serious issues.
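To make the warmup step a bit more concrete, here is a minimal sketch of a pre-event warmup check. It assumes a hypothetical provisioning API – provision_instance() and is_instance_ready() are stand-ins, not a specific provider's SDK – and simply verifies that all extra instances are ready well before the event starts.

```python
# Minimal pre-event warmup sketch. provision_instance() and is_instance_ready()
# are hypothetical stand-ins for your hosting provider's API.
from datetime import datetime, timedelta, timezone
import time

EVENT_START = datetime.now(timezone.utc) + timedelta(hours=12)  # assumption: event starts in 12 hours
WARMUP_MARGIN = timedelta(hours=2)      # finish well before the event begins
EXTRA_INSTANCES = 8                     # capacity added on top of the baseline

def provision_instance(index: int) -> str:
    """Hypothetical provider call that starts a new instance and returns its id."""
    return f"warmup-node-{index}"

def is_instance_ready(instance_id: str) -> bool:
    """Hypothetical health check: caches loaded, data sets transferred, app responding."""
    return True

def warmup() -> None:
    deadline = EVENT_START - WARMUP_MARGIN
    pending = {provision_instance(i) for i in range(EXTRA_INSTANCES)}
    while pending:
        pending = {i for i in pending if not is_instance_ready(i)}
        if pending and datetime.now(timezone.utc) >= deadline:
            raise RuntimeError(f"Warmup missed its deadline; still pending: {sorted(pending)}")
        if pending:
            time.sleep(30)              # poll until every new instance reports ready
    print(f"{EXTRA_INSTANCES} extra instances warmed up before {deadline.isoformat()}")

warmup()
```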
What About Autoscaling?
Autoscaling can handle typical traffic patterns, where the system has some time to adjust the assigned resources. Used properly, it can also deliver significant cost savings.
For example, typical traffic patterns for different industries include primetime in the evenings for streaming providers and morning surges in publishing. These patterns largely follow the sleeping habits of the audience.
As regular traffic patterns have smaller spikes, they can be described with a “sine-like” curve.
Let’s compare this with a Black Friday traffic surge:
Handling this spike with autoscaling is usually not possible, as starting up new instances, warming up caches, and transferring settings and data sets to the new instances simply cannot be done in time to keep up with such a steep traffic increase.
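A toy simulation helps illustrate the difference. The numbers below (provisioning delay, capacity step, traffic shapes) are assumptions for illustration only, not measurements: a reactive autoscaler comfortably follows a gentle daily curve, but during a step-like Black Friday surge the new capacity arrives too late.

```python
# Toy autoscaling simulation; all numbers are illustrative assumptions.
import math

PROVISION_DELAY_MIN = 20      # assumed minutes to boot, warm caches, transfer data
CAPACITY_STEP = 200           # requests/min added per scaling action
BASE_CAPACITY = 1000          # requests/min the baseline setup can serve

def simulate(traffic, label):
    capacity, pending, dropped = BASE_CAPACITY, [], 0.0
    for minute, load in enumerate(traffic):
        # capacity ordered earlier comes online once its provisioning delay has elapsed
        capacity += sum(step for ready_at, step in pending if ready_at <= minute)
        pending = [(ready_at, step) for ready_at, step in pending if ready_at > minute]
        if load > capacity * 0.8 and not pending:          # reactive scale-out trigger
            pending.append((minute + PROVISION_DELAY_MIN, CAPACITY_STEP))
        dropped += max(0.0, load - capacity)
    print(f"{label}: ~{dropped:.0f} requests over capacity during the day")

minutes = range(24 * 60)
daily = [1000 + 400 * math.sin(2 * math.pi * m / (24 * 60)) for m in minutes]  # sine-like pattern
spike = [800 if m < 600 or m >= 720 else 4000 for m in minutes]                # two-hour surge

simulate(daily, "Regular daily pattern")
simulate(spike, "Black Friday spike")
```

In this toy model the daily curve loses only a negligible slice of traffic, while the spike overwhelms the available capacity for most of the surge – which is exactly why the extra capacity has to be warmed up in advance.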
The Avalanche Effect
In a microservice architecture, different components (services) are connected to each other and the cooperation of these services can serve complex functions.
This idea stems from real life. When you order a burger, the components come from different areas: someone grew the crops, someone else milled them into wheat flour, and another person shipped the flour to a place where someone baked a bun out of it. There is a very extensive microservice architecture behind a diner.
Now, let’s imagine that our diner can serve 300 burgers a day (which means preparing 300 pieces of meat, 300 buns, 900 slices of pickles, and so on). Slicing a pickle takes far less time than cooking a piece of meat. This means the throughput of the system is determined by the slowest process – which in this case is cooking the meat.
The avalanche effect comes into play when rush hour hits and you try to serve more burgers per hour than you can prepare ingredients for. The result is huge lines at the front: people wait too long and eventually leave to eat somewhere else.
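In queueing terms, the diner’s throughput is capped by its slowest stage, and once orders arrive faster than that stage can work, the queue grows without bound. Here is a back-of-the-envelope sketch; the stage rates are made-up numbers:

```python
# Back-of-the-envelope model of the diner; stage rates are illustrative assumptions.
STAGE_RATE_PER_HOUR = {       # how many burgers each stage could handle on its own
    "bake buns": 60,
    "cook meat": 25,          # the slowest stage caps the whole pipeline
    "slice pickles": 120,
    "assemble burgers": 80,
}

bottleneck = min(STAGE_RATE_PER_HOUR, key=STAGE_RATE_PER_HOUR.get)
throughput = STAGE_RATE_PER_HOUR[bottleneck]
print(f"System throughput: {throughput} burgers/hour, limited by '{bottleneck}'")

def queue_after(hours: float, arrival_rate: float) -> float:
    """Orders still waiting after `hours` when customers arrive at `arrival_rate`/hour."""
    return max(0.0, (arrival_rate - throughput) * hours)

for rate in (20, 25, 40, 60):                 # the last two model the rush-hour avalanche
    print(f"arrivals {rate}/h -> {queue_after(3, rate):.0f} orders waiting after 3 hours")
```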
Graceful Degradation
Graceful degradation is an effective method to keep your platform running when dealing with unprecedented traffic surges (and therefore also avoid the avalanche effect). The idea is to gradually and temporarily remove (automatically or manually) live features and functionalities to ease the load on the backends (computing applications, database clusters, document search, etc).
A simple example is returning five items in response to a search request instead of 10. This smaller response takes load off the search backend, and the user has to interact again to receive the additional results if they so choose.
Graceful degradation in real life would mean that your diner serves only medium-cooked meat, as it takes less time and relieves some pressure on the “backend” (the cooks). The next level would be veggie burgers, with virtually no cooking time. As you can tell, graceful degradation buys you some headroom – however, there are limits here too.
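As a rough sketch of the search example above (the threshold, the load metric, and the helper names are assumptions, not a specific product’s API), a degradation switch can be as simple as capping the number of results whenever backend load crosses a limit:

```python
# Minimal graceful-degradation sketch; thresholds and helpers are illustrative assumptions.
FULL_RESULTS = 10
DEGRADED_RESULTS = 5
LOAD_THRESHOLD = 0.85              # fraction of backend capacity considered "under pressure"

def current_backend_load() -> float:
    """Hypothetical metric source, e.g. CPU usage or queue depth from monitoring."""
    return 0.92

def result_limit() -> int:
    # Degrade gracefully: smaller responses mean cheaper queries and lighter payloads.
    return DEGRADED_RESULTS if current_backend_load() > LOAD_THRESHOLD else FULL_RESULTS

def search(query: str, catalog: list[str]) -> list[str]:
    matches = [item for item in catalog if query.lower() in item.lower()]
    return matches[: result_limit()]

catalog = [f"product {i}" for i in range(100)]
print(search("product", catalog))  # returns 5 items while the backend is under pressure
```

The same switch can be flipped automatically by monitoring or manually from an operations dashboard, matching the “automatically or manually” point above.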
The What
Let’s explore why it is important to work with occasional events (such as large-scale Black Friday promotions) and what the business value is. The following chart is a visualization of the four possible outcomes of an event such as Black Friday:
I know it looks a bit chaotic – but hear me out.
It is a fact that promotions such as Black Friday can attract more customers. If it’s done right, new customers will become long-term customers. The more customers you serve with attractive content, the longer-lasting the effect of the event will be.
As shown in the chart above, there can be at least four outcomes at the end of the day:
- Flawless Operation – you performed as well as you had predicted, served a broad audience, and increased your baseline on a long-term basis. As the name suggests, this is the ideal outcome.
- Graceful Degradation – you were able to serve the entire crowd, but some intentional service quality degradations were applied in order to serve them. The quality of the service was therefore lower overall. Your baseline will still be elevated, but lower than the “flawless” level.
- Temporary Failure – you crashed every time there was huge traffic, but after each crash you were able to recover the system. This outcome has even more periods of lower-quality service. Because of this, there were fewer conversions than average on the day, and the long-term effect lands just slightly below your baseline.
- Long Blackout – you crashed when the first storm hit and were not able to fully recover. Because of this, your service was down most of the day. The users you attracted will try other services, since you were not able to serve them. While this scenario is not ideal, users are aware that Black Friday can cause services to crash and do not necessarily churn because of a failed Black Friday. Due to this, your baseline stays as it was and does not dip any lower.
For those just starting on their Black Friday path, it is completely normal to start at “Long Blackout”. Good planning, continuous optimization, and responsive teams and vendors are key to Black Friday success. Keep working towards “Flawless Operation” and remember – the journey may not necessarily be easy, but in the long run events such as Black Friday are most definitely worth your time and investments.