In this 2-part mini-series, Joshua Hoffman examines some of the common issues companies face when designing for scalability. Read part 1 here.
In my previous blog I looked at what I call the first three sins of web scale – pride (the refusal to use tools not invented here), envy (the desire for a more exciting project) and gluttony (ignoring scope and capacity). Today I’ll discuss the other four sins you need to be aware of when building and deploying your app or product. So without further ado, let’s check them out.
Lust: Premature Optimisation
Continuing with our examples of the seven deadly sins of web scale, this next scenario comes from a company I’ll call Audiogarden that needed to build a timeline service. A timeline service is a backend service that generates the “activity feed” for each user on the site, and it is a critical part of the user experience on any “social” site.
When you first start building a web application and you’ve never done this kind of thing before, you do what’s called the “naive” design: you write your app in your chosen language, back it with a database, store each post in the database as it arrives, and when another user logs in you look that information up and display it to them. It’s simple enough, but it doesn’t scale. To scale you need to decouple things and break them apart, and if you continue to grow you’ll eventually need to build a dedicated service just for this task. Pure database queries aren’t going to be able to handle the load that a caching model can.
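As a rough illustration (the names here are hypothetical, not Audiogarden’s actual code), here is the difference between the naive “pull” design, which scans the post log on every read, and a cached “push” design that fans each post out to followers’ timelines at write time:

```python
from collections import defaultdict, deque

# Hypothetical in-memory model contrasting the two designs.
posts = []                     # global post log: (author, text)
follows = defaultdict(set)     # user -> set of users they follow
timelines = defaultdict(lambda: deque(maxlen=100))  # cached feeds

def post_naive(author, text):
    posts.append((author, text))            # just store it

def timeline_naive(user):
    # "Pull" model: scan every post at read time -- O(total posts),
    # and the cost grows with the whole site, not with one user.
    return [p for p in posts if p[0] in follows[user]]

def post_fanout(author, text, followers):
    # "Push" model: write the post into each follower's cached
    # timeline up front, so each read is a cheap lookup.
    posts.append((author, text))
    for f in followers:
        timelines[f].appendleft((author, text))

def timeline_fanout(user):
    return list(timelines[user])
```

A real timeline service would keep the cached feeds in something like memcached or Redis rather than process memory, but the shape of the trade-off — pay at write time so reads stay cheap — is the same.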
To address this issue a talented engineer was tasked with building a replacement to support the growth and scale that was anticipated. The engineer went off and spent months in isolation researching and writing code. Eventually he produced a two-tier service. It was a great piece of software: it handled the timeline well, and it was scalable and reliable in the right ways. So what went wrong? The engineer had committed what I call the sin of Lust, or premature optimisation. By the time the new timeline service was put into production (a challenging migration) the requirements had changed. The service was so optimised for the original problem that it simply didn’t fit anymore and had to be replaced. The result was months of work wasted on a great piece of software that was no longer usable. This is why you should never optimise anything until you know what you actually need to do. It also stresses the importance of staying in touch with stakeholders throughout development rather than working on projects in isolation.
Wrath: Insufficient Testing
The next example takes us back to Hipster and involves a job that should be really simple: removing a driver. An engineer was tasked with unloading the IPMI driver from all of the hosts; at the time the company had about 1,000 servers. The IPMI driver lets the host OS communicate with the management board directly, rather than over the board’s network interface. It was discovered that, because of a bug, sending certain commands to a loaded driver could deadlock the system. If the driver was never loaded, there was no issue. Since no functionality required talking to the IPMI interface through the host OS, the decision was made to push a config to all servers blacklisting the driver so that it would never be loaded again. This was done without incident.
The next step was to find all the servers where the driver was loaded and unload it. The engineer picked a host to test the procedure and it worked flawlessly. He chose a second host, repeated the same steps, and it worked flawlessly as well. Deciding that two hosts were sufficient for testing, he kicked off a job across the entire fleet to unload the driver. What wasn’t known at the time was that the two test servers were the unusual case: the more common case was that unloading the driver triggered the bug and deadlocked the machine. To make things worse, all of the servers were in 4-in-1 chassis, with four hosts sharing two power supplies, and the only way to recover a deadlocked system was to power cycle it. So even if only two of the four servers in a chassis were locked, you still had to power cycle the whole chassis, and the healthy servers you took down might be running something important like a master database.
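A common safeguard against exactly this failure is a staged rollout that expands in small batches and halts on the first unhealthy host. A minimal sketch, assuming hypothetical `run_job` and `check_health` callables supplied by the caller:

```python
def staged_rollout(hosts, run_job, check_health, batch_sizes=(1, 5, 25)):
    """Run run_job on hosts in growing batches, aborting as soon as
    any host in a batch fails its health check. batch_sizes covers
    the canary stages; the remainder of the fleet runs last."""
    done = 0
    stages = list(batch_sizes) + [len(hosts)]   # final stage = the rest
    for size in stages:
        batch = hosts[done:done + size]
        if not batch:
            break
        for h in batch:
            run_job(h)                           # apply the change
        bad = [h for h in batch if not check_health(h)]
        if bad:
            raise RuntimeError(f"aborting rollout: unhealthy hosts {bad}")
        done += len(batch)
    return done
```

With canary batches of 1, then 5, then 25, a bug that hits most machines is caught after at most a handful of hosts instead of 800.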
Back in the office, engineers started seeing servers go down left and right on the dashboards. I made the call to put the site into read-only mode so that we could find out what was happening. The damage was somewhat contained in that the host running the job itself locked up after about the 800th server had gone down, but now we had an even bigger problem: we had automated so much of the data center that only two or three technicians were on site. Everyone available in the office had to pile into cars and drive thirty minutes to the data center, where the rest of the day was spent identifying the locked servers, manually pulling them out of their chassis to power cycle them, and shoving them back in. Eight hours later the site was back online.
I call this the sin of Wrath or insufficient testing. The result was approximately 800 physical servers that required a hands-on fix before the site was up and running. It was a significant outage that even made the news and took a lot of time and manpower to fix simply because the engineer hadn’t done adequate testing.
Greed: Making Stuff Tightly Coupled or Monolithic
In the next case study we go back to Pink Shoe Linux. As an early part of their effort to deliver an enterprise platform, they created an online service for managing the software on customers’ servers. It was very helpful for people with fleets of machines and it worked very well. The engineers built the service on a bespoke content management system and chose the database already in use at the company, which was Oracle. While writing the software they hard-coded Oracle database queries throughout the entire code base. This wasn’t an issue until customers started requesting an on-site version. That was a lucrative opportunity, but a difficult one if a customer didn’t have an Oracle license and didn’t want to buy one.
This is what I call the sin of Greed, or making things tightly coupled or monolithic. If I had to give one piece of advice to anyone creating a new application it would be this: never tightly couple your data source to your application. As you scale up this will hurt you again and again, because depending on the mechanism you’ve chosen, the software may be doing things you can’t easily find and fix. The interactions between the data source and the application cannot easily be teased apart. This is the number one challenge in taking something that worked well at a small scale up to a very large web scale.
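One common way to keep that coupling out is to put all data access behind a narrow interface that the rest of the application codes against. A sketch with made-up names (the real service managed packages on servers, but these classes are mine, not Pink Shoe Linux’s code):

```python
from abc import ABC, abstractmethod

class PackageStore(ABC):
    """The only surface the application sees; swapping the backing
    database means swapping one class, not every call site."""
    @abstractmethod
    def list_packages(self, host_id): ...
    @abstractmethod
    def add_package(self, host_id, name): ...

class InMemoryStore(PackageStore):
    """Toy backend; an OracleStore or PostgresStore would implement
    the same two methods with real queries."""
    def __init__(self):
        self._data = {}
    def list_packages(self, host_id):
        return sorted(self._data.get(host_id, set()))
    def add_package(self, host_id, name):
        self._data.setdefault(host_id, set()).add(name)

# Application code only ever touches PackageStore, never vendor SQL:
def report(store, host_id):
    return f"{host_id}: {', '.join(store.list_packages(host_id))}"
```

Had the original queries lived behind something like `PackageStore`, shipping an on-site version would have meant writing one new backend class instead of years of untangling.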
The result of this sin was years of work to clean out and abstract away all of the Oracle database queries before the code base was clean and separate. In the meantime, to satisfy customers, the company had to pay for Oracle licenses to ship with the product so that big customers could use the service without first buying a license themselves.
Sloth: Avoiding Maintenance and/or Documentation
Our last example comes from a company I’ll call Americans on the Internet. This company was such an early adopter of technology that nothing like the standardized protocols we use today existed; everything it built was proprietary. When they first released their service it ran on one Stratus server, the same kind of server used at the time by banks and hospitals because they were very reliable. That level of reliability came with a price though, and these machines were very expensive, costing up to half a million dollars. They could support a lot of users, but if you went over capacity even by just a little you had to make another costly purchase to keep running your service.
The decision was made to move to an HP-UX platform; unix servers were dropping in price, and a plan was made to migrate the Stratus data onto the new hosts. As this was not a simple task, developers immediately started building new software on unix. To make the transition easy, one of the original Stratus engineers built a gateway service to broker the proprietary Stratus protocol to TCP/IP, so that all of the new software could talk to the gateway service.
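The real Stratus protocol is long gone, but the gateway’s job can be sketched with an entirely invented record format: accept a legacy fixed-width record on one side and re-emit it as a length-prefixed TCP frame that the new unix services can parse. Every detail here, including the record layout, is illustrative only:

```python
import struct

# Invented legacy layout: 8-byte opcode + 16-byte payload, null-padded.
LEGACY_RECORD = struct.Struct("8s16s")

def legacy_to_tcp(record: bytes) -> bytes:
    """Translate one fixed-width legacy record into a
    length-prefixed frame suitable for a TCP stream."""
    opcode, payload = LEGACY_RECORD.unpack(record)
    body = opcode.rstrip(b"\0") + b" " + payload.rstrip(b"\0")
    return struct.pack("!I", len(body)) + body   # 4-byte length prefix

def tcp_to_fields(frame: bytes):
    """Parse a gateway frame back into (opcode, payload)."""
    (length,) = struct.unpack("!I", frame[:4])
    opcode, _, payload = frame[4:4 + length].partition(b" ")
    return opcode, payload
```

The translation itself is trivial; the lesson of this case study is that even a trivial broker becomes a black box once its source and documentation are lost.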
After a few years, HP-UX became too expensive because of the need to buy HP servers and licenses, so the decision was made to move to linux on commodity hardware. The problem was that no one could find the documentation or the source code for the gateway service; there was no way to be sure exactly what this binary was doing or how it accomplished its task. The original engineer who had written the gateway had retired and left the company, and no one could find him. This is the sin of Sloth, or avoiding maintenance and/or documentation. The fix took months of work, and there were multiple outages from failed attempts before a working replacement was created with good docs and source code.
I hope these case studies give you something to take away and inform the work you do next. If you are starting a new company or a new project, hopefully you can now avoid the Seven Deadly Sins of Web Scale.