Throughout my career I’ve had the opportunity to work at a variety of different companies both large and small. They each had their own set of unique challenges regarding growth but one thing I noticed with time and experience was that the solutions to the problems they faced were not specific to the company itself. The approaches that were taken and the lessons that were learned could be extrapolated and applied to many of the situations facing a company looking to expand and grow technically.
There is a concept in some religions that before you save a sinner you have to tell them how they have sinned. In other words, if someone doesn’t know what the problem is they won’t be able to change. For a company just starting out, there are no wrong ways to build and deploy your app or product. Once you begin to grow however, you realize there are things you didn’t know and that some or all of the decisions that you made at the beginning were mistakes. This is the point where you need to decide how to address these issues. New companies are started all the time so I decided to draw from my experience to put together what I call the Seven Deadly Sins of Web Scale using seven real world examples from my career.
Pride: Refusal to Use Tools Not Invented Here
The first case study involves a company we’ll call Boohoo that operates many data centers. Like most companies that started early on in the technology scene they had to build their own management platform because there wasn’t a lot of software available that met the needs of running a large internet application or had the ability to manage a large amount of hardware reliably and efficiently. The first data center was built to suit the requirements at the time but as they grew they needed to expand. This was a fork in the road for Boohoo: did they want to enhance and build on their first platform or start over with something completely different?
A new team of engineers was tasked with building out the second data center and instead of deciding to enhance or fix the current technology they wanted to build something completely new. They would build it right this time, it would fix all of the problems, and then they would migrate everything from the old platform to the new and everything would work great.
These engineers committed what I call the sin of Pride or the refusal to use tools not invented here. Instead of using in-house technology or any of the available open-source options they decided they could build it better each time.
The end result is that years later they now have eight data centers that run on eight different data center management platforms and have learned nothing in the process. The job for the engineers who have to support them is much more difficult because not only do they have to code for multiple APIs in order to communicate between the data centers and their different platforms but they must also support multitudes of other software that have been built to abstract on top of them. It also means that troubleshooting any particular team’s software is challenging at best.
Envy: The Desire For a More Exciting Project
Our next case study comes from a company we’ll call Pink Shoe Linux. Pink Shoe Linux wanted to migrate from a bespoke document publishing system to Docbook. Docbook is a specification that can be used for authoring material on the web or published in printed form. You write the content in XML and it gives you some advantages such as having a single source that can generate multiple outputs as well as being able to do things like make an instructor course manual that has quiz answers and a student version that does not. The challenge is in how you generate the output and there are many tools to choose from.
Pink Shoe was already using many Docbook features but with a proprietary build system that needed to be maintained by the people who were authoring the coursework. The decision was made to migrate to an open source standard tool chain for Docbook called Publican. The advantage with Publican was that there would be no need to maintain internally built tools and there were already available resources from within the organization. An engineer was chosen and it was estimated that the task would take a few weeks to get the toolchain working, define the workflow, and write up some documentation.
After several weeks had gone by I happened to take a look at the code and noticed something that rang a few alarm bells. It appeared that the code as written was able to function with or without Docbook. Digging into the situation a little further I learned that the engineer had always wanted to write his own publishing system and had taken this project as an opportunity to do that. In order to meet the requirements he added in the ability to support Docbook. The engineer had committed what I call the sin of Envy or the desire for a more exciting pro
ject. A lot of day to day work isn’t glamorous or fun but keeping it simple and effective is, to me, the right answer. This a challenge because you want to keep engineers happy and give them exciting projects to work on but if you’re committed to the success of the company and you want to build out everything the company will need to handle the load it’s going to face then you have to have discipline in situations like these.
The result of this was a new bespoke tool that replaced the old one that mostly worked with Docbook however it broke a critical feature that was used to export portable translation objects. These objects are sent to a translation company which translates the material into other languages and then sends it back so that it can be imported and you have multiple language versions of your document. Two weeks before the publishing deadline to send the objects to the translator and almost four months after the project started I was called in to fix the situation. Most of the existing code had to be thrown out and the standard tool was able to be implemented within the deadline. What should have been a relatively simple project ended up with months of work in the trash.
Gluttony: Ignoring Scope and Capacity
The next example comes from a company I’ll call Hipster. This is a company that started out small and experienced such rapid growth that the peak traffic for the front end from when I first started had become the low point after only a few months. Everything had been built so quickly that there hadn’t been time to implement network metrics. The decision was made to instrument everything and turn it on. At the time there we were using an in-house metrics platform built on OpenTSD running on an Hbase cluster. The Network team set everything up and started shipping metrics without telling anyone. Over 2000 network ports started sending every frame to the Hbase cluster which was already being used to monitor critical systems that were running the site. The whole thing fell over because of the sin I call Gluttony or ignoring scope and capacity. The sheer volume of new metrics killed dashboard performance for everyone which then made troubleshooting actual issues almost impossible.
If you work in engineering everything you do requires capacity planning. It requires an awareness because there is no unlimited resource. Whether you are using an internal resource or a vendor such as a CDN or cloud provider like LeaseWeb you need to plan at every stage. We’ll cover the next four deadly sins of web scale in part 2.