Do we need this ELK stack?

Guillaume Charhon · Published in Teemo Tech Blog · Nov 26, 2018

Do we need this ELK stack? That’s the question I asked myself as a CTO while analyzing where my team spends its time.

1. Why did we install ELK in our system?

To give you some background: one part of our job is buying advertising inside mobile apps. In our early days, we built what we call a “bidder”. It is simply a server in charge of bidding on advertising opportunities in real time. In other words, it spends money in real time, and we needed easy-to-read graphs of our spending. Back in 2015, we had free Google Cloud Platform credits, so it was a no-brainer to opt for the ELK solution and install it on Google Compute Engine instances. ELK is a great product. Everything is very well documented. There is everything you need to get started, even if you have no experience with logging/monitoring systems.

So we started to insert into ELK all the logs we wanted to derive graphs from. It was straightforward. It worked, and we were the happiest people in the world.

2. First feelings of “something is wrong”

After a year, our stack began to feel a bit legacy. The Elastic team works hard, and there is roughly one major version released per year. What company wants to be outdated? Not us. So one of our team members spent a few weeks updating everything. ELK was plugged in everywhere, so we needed to go through each project to update every library while running two ELK stacks in parallel to perform the migration. Team members were enthusiastic about ELK and built even more dashboards.

3. Entering a nightmare

Fast forward another year: the business was growing at an exponential rate and we were storing a much higher volume of logs inside our ELK stack. We began to have scaling issues. The worst part is that these issues happened during peak hours, so it was not uncommon to have outages on our ELK stack exactly when we needed it most. It could also cause incidents on perfectly healthy services. Most of these outages happened because we had not provisioned enough instances. The rest were caused by misconfiguration or mistakes in grok expressions, resulting in errors while parsing log entries. To sum up: we had a piece of infrastructure that was everywhere, where we put everything, that fell down when we needed it most and took up most of our time.
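To illustrate the kind of failure mode a grok mistake can produce, here is a simplified Python analogue (not our actual Logstash configuration): a grok expression ultimately compiles down to a regular expression, and any log line that no longer matches it stops producing clean, queryable fields.

```python
import re

# A grok expression such as "%{IP:client} %{WORD:method} %{NUMBER:duration}"
# compiles down to a regular expression roughly like this (simplified):
LOG_PATTERN = re.compile(
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) (?P<method>\w+) (?P<duration>\d+(?:\.\d+)?)"
)

def parse(line):
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        # In Logstash, a line that does not match gets tagged _grokparsefailure
        # and piles up as unparsed events instead of usable fields.
        return None
    return match.groupdict()

print(parse("10.0.0.12 GET 0.042"))  # parses into client/method/duration fields
print(parse("GET 10.0.0.12 0.042"))  # None: a reordered log format breaks parsing
```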

4. Primitive human reaction: we need to scale it!

We wanted to scale ELK and add as many servers as required. However, we knew we were not experts in ELK management, so we considered an external hosting service such as Elastic Cloud. It seemed expensive for what we were trying to achieve, though. We would also have had to keep a local Logstash instance with this solution, which would not have fixed our mistakes in grok parsing.

5. Asking the right question: what are we trying to achieve?

So we decided to take a step back from the situation, and only by doing this did we understand that we simply wanted graphs to monitor our system. As a matter of fact, we did not need to put a huge amount of data into ELK just to draw a bunch of graphs. Because ELK is such a complete product, we were tempted from the beginning to put all our logs into it just to derive a few metrics that are then displayed on a graph. In our case, we just needed metrics!

There is actually a conceptual difference between two kinds of data: logs and metrics. Any developer knows what logs are: records of things that happen. Metrics are a bit more subtle. They are a measurement of something (for example, CPU utilization or the number of HTTP requests). They are usually the output of an aggregate mathematical function (mean, median, decile, etc.) and are published at a fixed frequency to a monitoring system (for example, one data point per minute). This means that the number of data points is not proportional to the load.
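To make the distinction concrete, here is a minimal sketch of the metrics approach (the names and the `publish` helper are illustrative stand-ins, not our production code): events are counted in memory, and only one aggregated data point per metric is emitted every minute, whatever the traffic.

```python
import time
from collections import defaultdict

def publish(name, value, timestamp):
    """Stand-in for the real call to the monitoring backend."""
    print(f"{timestamp:.0f} {name}={value}")

class MetricAggregator:
    """Turns a stream of events into one data point per metric per interval."""

    def __init__(self, flush_interval_s=60):
        self.flush_interval_s = flush_interval_s
        self.counters = defaultdict(int)
        self.last_flush = time.time()

    def increment(self, name, value=1):
        # Called on every event (e.g. every HTTP request): in-memory only, no I/O.
        self.counters[name] += value
        if time.time() - self.last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self):
        # Whether the service handled 100 or 100,000 requests this minute,
        # this emits exactly one point per counter.
        now = time.time()
        for name, value in self.counters.items():
            publish(name, value, now)
        self.counters.clear()
        self.last_flush = now

metrics = MetricAggregator(flush_interval_s=60)
metrics.increment("http_requests")  # call this wherever the event happens
```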

6. Transitioning to a new paradigm

We wrote a few lines of code in each of our microservices to calculate the metrics we needed. We chose to display these metrics in Stackdriver Monitoring, which was convenient for us as we were already on Google Cloud Platform. It is a completely managed service, so it is not a burden for our operations team, and we could have switched to another metrics reporting service. That’s it! We could also have kept ELK for this usage, but we preferred a more managed solution. For details on our error logs, we already relied on Sentry. For “usual” logs, now reduced to only what’s necessary, we use Stackdriver Logging to centralize them. We did lose the powerful “drill down” feature of ELK: we can no longer dig into a red spike on a graph until we spot the exact set of logs causing the issue.
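For reference, writing one custom metric point with the google-cloud-monitoring Python client looks roughly like this in recent versions of the library (the project ID, metric name and value below are placeholders, not our actual code):

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-gcp-project"  # placeholder project ID

# Describe the time series: a custom metric attached to a generic resource.
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/bidder/bids_per_minute"  # illustrative name
series.resource.type = "global"

# One data point: the value aggregated over the last interval.
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"double_value": 1234.0}})
series.points = [point]

# Send the point to Stackdriver Monitoring.
client.create_time_series(name=project_name, time_series=[series])
```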

Conclusion

ELK is a great product. It is actually the best solution to store and analyze logs. It is well documented and easy to deploy. However, when we needed to scale up, we faced multiple issues and had to rethink our monitoring strategy. We also had to scope our needs: it was no longer possible to store everything and decide later what we wanted to do with the data. By planning ahead what we want to display, we moved from a heavy, time-consuming monitoring stack to a very light and reliable one.
