Recently, I’ve been reading lots of articles and stories about Docker, container orchestration, and microservice deployment, and how they have saved tons of time (and occasionally the day) for one company or another. While I genuinely believe that containers, microservices, and container orchestration are great, I always stand by a simple rule: the right tool for the right task. Here’s a DevOps project I worked on recently for Mouseflow which exemplifies this rule.
A little background about the project
Over 500 Linux and Windows servers hosted at two data centers. Most applications are in .NET, split into fewer than 10 services, some heavily loaded and some not. HBase is the main database, ElasticSearch stores series data, we use Ansible to deploy servers, and of course OddEye for monitoring.
We are using TeamCity to build and MSDeploy to deploy our .NET applications on Windows servers. Usually, deployment takes less than 10 minutes, and during the procedure the services remain online with zero downtime.
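For context, a TeamCity build step that pushes a packaged site with MSDeploy looks roughly like the sketch below; the package name, target server, and site name are hypothetical placeholders, not our actual setup.

```shell
REM Sketch of an MSDeploy package sync (Windows cmd syntax).
REM "MyApp.zip", the server name, and the user are hypothetical placeholders.
msdeploy.exe -verb:sync ^
  -source:package="MyApp.zip" ^
  -dest:auto,computerName="https://web01:8172/msdeploy.axd",userName="deploy",authType="Basic" ^
  -setParam:name="IIS Web Application Name",value="MyApp"
```

MSDeploy compares source and destination and transfers only the changed files, which is part of why deployments stay fast.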
What are the goals?
- No SPOF
- Maximum performance
- Fewer servers (we already have more than 500)
- Easy infrastructure management
- Predicting as many potential issues as possible
- Automating as many tasks as possible
- Spending as little time as possible on maintenance, per machine and across the infrastructure
Lots of people will say that the only, or at least the most optimal, solution to achieve these goals is to deploy Docker containers, orchestrate them with Kubernetes, or at least go to AWS and order tons of cloud instances. But one of our goals is “maximum performance”. This means we must be as close to the hardware as possible, so cloud instances with noisy neighbors are not an option for us.
Getting close with bare metal
Containers are probably the most performant way to virtualize environments, but every abstraction layer costs some performance: maybe just a little, but it does. Also, neither Hadoop/HBase nor ElasticSearch, which are our main services, plays well with containers. So the best remaining option is to work with dedicated servers. The thing is that lots of people forget about battle-tested old tools, and forget that most of these tools are still in use by the biggest players.
We do not want to waste hundreds of man-hours or keep lots of people on a team to manage these servers. We want to do this efficiently with a small team. So how to achieve that? We needed a reliable data center which provides good servers at reasonable prices in the US and Europe. After some research, we chose to work with Leaseweb.
There is an important task for the local personnel of the data centers we’re working with: they should deliver servers with KVM configured to our settings, so we have out-of-band access to the servers if things go wrong and we lose contact with the OS over SSH. We ordered an initial batch of racks and servers and started configuring our system. The first thing was to create a TFTP server with install images, which hold some basic configuration and our SSH keys. With that done, we could boot a machine via PXE and automatically install the OS on it.
The procedure of delivery of servers is the following:
- We ask the data center to deliver a certain amount of servers and configure KVMs with desired IP addresses
- When the servers are delivered, we power them on via KVM and boot them via PXE
- After several minutes, the OS is installed and ready to use.
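The PXE piece of the steps above boils down to a single config file served from the TFTP root; the kernel/initrd paths and the preseed URL below are hypothetical examples, not our real setup.

```shell
# Sketch: a minimal PXELINUX config written into the TFTP root.
# Installer paths and the preseed URL are hypothetical placeholders.
cat > /srv/tftp/pxelinux.cfg/default <<'EOF'
DEFAULT autoinstall
LABEL autoinstall
  KERNEL ubuntu-installer/amd64/linux
  APPEND initrd=ubuntu-installer/amd64/initrd.gz auto=true priority=critical url=http://deploy.example.internal/preseed.cfg
EOF
```

Any machine that PXE-boots against this TFTP server picks up the same automated install, which is what makes the per-server delivery procedure hands-off.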
When the basic OS is installed and configured, we need to deploy software and services. Most of our servers are running HDFS/HBase or ElasticSearch, so the task is to automate the installation and configuration process. It did not take us long to choose Ansible as the preferred automation tool. The reason was obvious: Ansible relies on SSH and doesn’t require any agent installation on the target systems. As we install SSH keys during OS installation, nothing else needs to be done on the target machines to use them as Ansible clients. We installed Ansible on the head machine and created several playbooks to automate HDFS/HBase and ElasticSearch installations.
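A minimal sketch of what such a playbook and its invocation might look like; the host group, role names, and file names here are hypothetical, not our actual playbooks.

```shell
# Sketch: write a minimal playbook, then run it over SSH.
# Host group "elasticsearch" and the roles are hypothetical placeholders.
cat > elasticsearch.yml <<'EOF'
- hosts: elasticsearch
  become: true
  roles:
    - java
    - elasticsearch
EOF

# Agentless: this only needs the SSH keys already installed during PXE install.
ansible-playbook -i inventory.ini elasticsearch.yml
```

Because there is no agent, adding a freshly installed server to the cluster is just a matter of adding its address to the inventory file.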
As I have mentioned above, many of our applications run on .NET on Windows, so our development team uses MSDeploy to handle the correct deployment of applications. When we switch to .NET Core on Linux, which is on our roadmap, we will write several Ansible playbooks for that as well.
The result, in numbers:
- 2x DevOps engineers
- 4x monitoring engineers (24/7/365 human monitoring in shifts)
- Over 500 servers
- Over 2PB of HBase data
- Over 50B documents in ElasticSearch
My message isn’t new: choose the right tool for the right task; what is trending may not be the best fit for your case. And of course, it is definitely possible to manage a metal monster with a micro team.