Distributed computing is hard. Nodes fail, services die and take down other related services, messages get lost in transit over flaky networks, race conditions lead to incorrect states, and so on. It's a big, bad world out there. As developers, we keep our sanity by making sure we are armed with the best tools for monitoring, alerting, and logging. At HyperTrack, we use a bunch of open source and paid tools to make sure everything is always up. One of those tools is the ELK stack (Elasticsearch/Logstash/Kibana) for logging.
Logging in a distributed world
Debugging issues across services is hard. Multiple services talk to each other, and while debugging it's difficult to figure out which one is causing the problem.
Let's take our MQTT architecture for communicating with our SDKs as an example. Mobile SDKs talk to an MQTT broker over IoT, which invokes AWS Lambda, which in turn calls our API. That's five different services talking to each other. When we are debugging an issue, we often need to correlate events across services to identify the root cause.
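To make that chain concrete, the Lambda hop in the middle might look roughly like this. This is only a sketch: the `API_URL` endpoint and payload shape are hypothetical, not taken from our actual code.

```python
import json
import urllib.request

# Hypothetical endpoint -- the real API URL is not shown in this post.
API_URL = "https://api.example.com/v1/sdk-events"

def build_request(event: dict) -> urllib.request.Request:
    """Build the POST request that forwards a broker message to the API."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def lambda_handler(event, context):
    """AWS Lambda entry point: forward the device event and report status."""
    with urllib.request.urlopen(build_request(event)) as resp:
        return {"statusCode": resp.status}
```

Each hop like this one is a place where a message can be dropped or mangled, which is exactly why correlating logs across services matters.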
As soon as we had more than three services running, we started feeling the frustration of having to debug across different environments. Debugging would be cumbersome, frustrating, and time-consuming. We started looking for tools that could make us more efficient at debugging. We wanted a centralized logging tool that all our services could send their logs to, so that we could search across all related services together and correlate their logs.
What’s out there
This is what the existing solution space looks like:
Paid logging services
We used several different ones for a while (Papertrail, Logentries). They are great to get started with and to use for a single service. However, with different log formats and multiple services, they were hard to customize, didn't really solve many of our problems, and got expensive really quickly.
Self-hosted solutions
There are several options here. ELK or EKK (replace Logstash with Kafka for moar scale!) and Graylog are the top contenders.
Since we had some experience with ELK, we decided to go ahead and set it up.
Our Solution: The ELK Stack
The ELK stack, as the name suggests, has three components:
- Elasticsearch – Stores and indexes logs, making them searchable
- Logstash – Ingests logs from various sources, tags them, and converts them to structured JSON
- Kibana – A frontend for Elasticsearch that lets you visualize and search your logs
For Elasticsearch and Kibana, we just use AWS Elasticsearch. It's cost-efficient and easy to scale, deploy, and operate. Logstash runs on an EC2 instance, and all our services send their logs to it (most over rsyslog, some over HTTP). Logstash has three parts:
- Input – Defines which protocols Logstash should listen on for ingesting logs
- Filter – Structures the log lines into JSON
- Output – Dumps the structured logs to a destination (Elasticsearch in our case)
This is what our logstash config looks like:
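A minimal sketch of such a pipeline, with a TCP input for rsyslog, an HTTP input, and an Elasticsearch output — the ports, grok pattern, endpoint, and index name below are illustrative assumptions, not our production values:

```conf
input {
  tcp {
    port => 5000          # rsyslog forwards over TCP
    type => "syslog"
  }
  http {
    port => 8080          # services that log over HTTP
    type => "http"
  }
}

filter {
  grok {
    # Parse the standard syslog line into structured fields
    match => { "message" => "%{SYSLOGLINE}" }
  }
}

output {
  elasticsearch {
    hosts => ["https://our-domain.es.amazonaws.com:443"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```

The input/filter/output sections map directly to the three parts described above.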
Now that we have our logging infrastructure set up, the next step is to build dashboards on top of the metrics in Elasticsearch. Kibana does a great job of visualizing data using simple search filters.
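Kibana's search bar accepts Lucene query syntax over the structured fields that Logstash produced. A couple of illustrative queries — the field names here are assumptions, not our actual schema:

```
service: "api" AND status: [500 TO 599]

response_time_ms: >1000 AND NOT path: "/health"
```

The first finds 5xx responses from one service; the second surfaces slow requests while excluding health checks.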
Another important use case for us is alerting on top of our logs. These could be alerts for high response times, high memory usage, or exception events that we want to be notified about. Elastalert is a fantastic open source library from Yelp that does exactly this. You can specify alert rules in a YAML file and configure alerts to be sent over email, Slack, PagerDuty, etc.
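A sketch of what such a rule might look like — a frequency rule that fires when error logs spike. The index pattern, field name, and threshold are illustrative assumptions:

```yaml
# Alert to Slack when error-level logs spike
name: api-error-spike
type: frequency          # fire when num_events occur within timeframe
index: logs-*
num_events: 50
timeframe:
  minutes: 5
filter:
- term:
    level: "error"
alert:
- "slack"
slack_webhook_url: "https://hooks.slack.com/services/..."
```

Running elastalert against the same Elasticsearch cluster means alerts and dashboards share one source of truth.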
Like what we are working on? Come work with us, we are hiring!