As an IaaS provider, we host multiple storage systems to support our Clouds. Unfortunately, earlier this week one of these platforms was causing issues, resulting in major headaches for both us and, more importantly, some of our customers.
We want to be totally transparent. Therefore, this blog post explains the root cause of the problem and the steps we’re taking to prevent it from happening again.
To explain it in technical terms: we traced the problem back to the amount of spare capacity required by one of our ZFS storage platforms. These platforms store part of our Public Cloud’s data. This system needs a certain amount of free capacity in order to keep operating at peak efficiency. When free capacity drops below that amount, the platform suffers from performance and latency issues.
When we designed the platform, we estimated, together with the supplier, that a safe used/free capacity ratio would be around 80% / 20%. However, we decided to take a more conservative approach and set an even lower limit for used capacity. As the platform grew, we kept a close eye on capacity so that we could plan further expansion in time.
As soon as we saw the platform approaching 60% capacity, we started building the new storage setup. The new system is now ready for production, with the current system at 70% capacity. Unexpectedly, the platform started showing problems at 70%, triggering the chain of events that has been our main focus for the last few days and that ultimately resulted in downtime on large parts of this Cloud platform. We have since set the trigger limits for these platforms much lower, and will be increasing capacity at a much earlier stage.
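To illustrate the kind of trigger limits described above, here is a minimal sketch of a capacity check with two thresholds: one that starts expansion work and one that marks the point where problems are expected. This is not our actual monitoring code. The names `CapacityThresholds` and `capacity_status` and the example values of 50% and 65% are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityThresholds:
    """Hypothetical trigger limits, expressed as fractions of total capacity."""
    expand_at: float    # used fraction at which expansion work should start
    critical_at: float  # used fraction at which performance problems are expected


def capacity_status(used_bytes: int, total_bytes: int,
                    thresholds: CapacityThresholds) -> str:
    """Classify a storage pool as 'ok', 'expand' or 'critical'."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_fraction = used_bytes / total_bytes
    # Check the most severe limit first so it takes precedence.
    if used_fraction >= thresholds.critical_at:
        return "critical"
    if used_fraction >= thresholds.expand_at:
        return "expand"
    return "ok"


# Illustrative values only; the real limits are not stated in this post.
limits = CapacityThresholds(expand_at=0.50, critical_at=0.65)
```

With these example limits, a pool at 60% used would be flagged for expansion well before it reaches the 70% mark at which our platform started showing problems.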
When we managed to fix the underlying problem, we were still assessing the impact of this chain of events and the outage that followed. Once the situation became clear, everybody in the organization pulled together to help resolve it.
We have now worked through all the customer tickets one by one and have fixed the majority of the instances, with fewer than 100 instances unfortunately still affected. We are now contacting these customers to update them on the status of their infrastructure. Some of these instances will need to be brought back online by reinstalling the operating system.
We feel it is important to explain what happened and how we have dealt with it. We have set up an email address (firstname.lastname@example.org), which is being constantly monitored, so that our customers can send us their comments and feedback. We appreciate any and all responses, as they will help us to further improve our services.
We understand that situations like these have a much higher impact than years of stable service. We sincerely apologize for the inconvenience caused by your infrastructure being unavailable. We hope this post provides insight into what happened, and what steps we have taken to prevent this from happening again.