This summer I had the opportunity to present my practical fault detection concepts and hands-on approach at two conferences: first at Velocity, and then at SRECon16 Europe. The SRECon16 page also contains the recorded video.
If you’re interested at all in tackling non-trivial timeseries alerting use cases (e.g. working with seasonal or trending data), this video should be useful to you.
It’s basically me trying to convey, in a concrete way, why I think the big-data and math-centered algorithmic approaches come with a variety of problems that make them unrealistic and unfit, whereas the real breakthroughs happen when tools recognize the symbiotic relationship between operators and software, and focus on supporting a collaborative, iterative process for managing alerting over time.
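To make that concrete with a small example: one approach that goes a long way for seasonal data is to compare the current value against the same time one week earlier, rather than against a static threshold. Below is a minimal sketch of that idea (not taken from the talk); fetch_value is a hypothetical stand-in for a query against your timeseries store, and the deviation ratio is an arbitrary placeholder.

```python
# Minimal sketch of a seasonality-aware check: compare "now" against the
# same moment one week ago and alert on large relative deviation.
# fetch_value() and max_ratio are hypothetical placeholders.

WEEK_SECONDS = 7 * 24 * 3600

def fetch_value(metric, ts):
    """Hypothetical stand-in: return the datapoint for `metric` at unix time `ts`."""
    raise NotImplementedError

def deviates_from_last_week(metric, now, max_ratio=1.5):
    """True if the current value strays too far from last week's baseline."""
    current = fetch_value(metric, now)
    baseline = fetch_value(metric, now - WEEK_SECONDS)
    if baseline == 0:
        # Avoid division by zero: flag any nonzero value where there was none.
        return current != 0
    ratio = current / baseline
    return ratio > max_ratio or ratio < 1 / max_ratio
```

The point is not this particular rule; it’s that a rule this simple is something an operator can read, question and tune, which is exactly the collaborative, iterative process the talk argues for.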
NOTICE: DISCONTINUATION OF SNAP TELEMETRY PROJECT. The Snap Telemetry project will no longer be maintained by Intel. Intel will not provide or guarantee development of or support for Snap, including but not limited to, maintenance, bug fixes, new releases or updates. Patches to this project are no longer accepted by Intel. If you have an ongoing need to use Snap, are interested in independently developing it, or would like to maintain patches for the community, please create your own fork of the project.
For several years I’ve worked with Graphite, Grafana and statsd on a daily basis and have been participating in the community. All three are fantastic tools and solve very real problems; hence my continued use and recommendation. However, through colleagues, random folks on IRC, and my own experience, I’ve seen a plethora of often subtle issues, gotchas and insights, which today I’d like to share.
I hope this will prove useful to users while we, open source monitoring developers, work on ironing out these kinks.
Collectd is a program that you can run on your systems to gather statistics on performance, processes, and the overall status of the system in question. When you send these statistics to a time series database like Graphite, you’ll need some way to access and visualize all that data; after all, if you collect data but have no way to use or access it, it’s not going to do you a whole lot of good.
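Collectd ships a write_graphite plugin for exactly this, but the plaintext protocol that Graphite’s carbon listener speaks is simple enough to illustrate directly. Here’s a sketch, assuming carbon is listening on localhost:2003; the host, port and metric name are placeholders for your setup.

```python
import socket
import time

# Sketch: send one datapoint to Graphite's plaintext (carbon) listener,
# which accepts lines of the form "metric.path value timestamp".
# Host, port and metric name below are assumptions; adjust to your setup.
GRAPHITE_HOST = "localhost"
GRAPHITE_PORT = 2003  # carbon's default plaintext port

def send_metric(path, value, timestamp=None):
    """Send a single 'path value timestamp' line over TCP."""
    ts = int(timestamp if timestamp is not None else time.time())
    line = "{} {} {}\n".format(path, value, ts)
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(line.encode("ascii"))

send_metric("collectd.example_host.load.load.shortterm", 0.42)
```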
It used to be that each machine had one purpose. Your load balancer machines talked to your web server machines, which talked to your database machines. Instrumentation from these systems didn’t tend to be spectacular, so the most practical approach was to rely on, and page on, machine metrics such as CPU. Sure, it wasn’t perfect, but it caught most of the cases where the service got overloaded. There was the odd false positive.
Some of the most fun I had last year was the few weeks I spent in Colorado and Utah, learning how to fly sailplanes. It’s really challenging, especially for pilots who are used to flying with engines.
It’s a very raw, yet zen-like experience, and I’m hooked. You have to learn to rely on your senses a lot more than your instruments.
The experience has made me think more about parallels between monitoring and aviation, a topic I’ve written about before.
Grafana is used by hundreds of thousands of users with a wide variety of data sources. Among these, there is a division in approaches to collecting the data: logging, as exemplified by Elasticsearch as part of the ELK stack (Elasticsearch, Logstash and Kibana), and metrics, as exemplified by Prometheus.
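To illustrate the division with a hypothetical example: the same failed request can be captured as a log event (a rich, discrete record of the kind you’d ship to Elasticsearch) or as a metric (a cheap, pre-aggregated counter of the kind Prometheus scrapes).

```python
import json
import time

# Hypothetical example: the same failed HTTP request captured both ways.

# As a log event: one rich, discrete record per occurrence (logging approach).
log_event = json.dumps({
    "ts": time.time(),
    "level": "error",
    "handler": "/api/checkout",
    "status": 500,
    "duration_ms": 137,
    "msg": "upstream timeout",
})

# As a metric: a counter aggregated in-process, here rendered in
# Prometheus' text exposition format (metrics approach).
errors_total = 1
metric_sample = 'http_errors_total{{handler="/api/checkout"}} {}'.format(errors_total)

print(log_event)
print(metric_sample)
```

The log keeps every detail of every event but pays a storage and indexing cost per occurrence; the counter throws detail away up front in exchange for cheap, fast aggregation.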
What do I mean by monitoring? Monitoring means knowing what’s going on inside your system: how much traffic it’s getting, how it’s performing, and how many errors there are.
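As a small, hypothetical illustration of those three concerns (traffic, performance, errors), here is what instrumenting a request handler with statsd’s UDP line protocol might look like; the metric names and address are placeholders.

```python
import socket
import time

# Hypothetical sketch: emit traffic, performance and error metrics for one
# request using statsd's UDP line protocol ("bucket:value|type").
STATSD_ADDR = ("localhost", 8125)  # statsd's default port; adjust as needed
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def statsd_send(line):
    sock.sendto(line.encode("ascii"), STATSD_ADDR)

def handle_request():
    statsd_send("myapp.requests:1|c")  # traffic: count every request
    start = time.time()
    try:
        pass  # ... real request handling would go here ...
    except Exception:
        statsd_send("myapp.errors:1|c")  # errors: count failures
        raise
    finally:
        elapsed_ms = (time.time() - start) * 1000
        statsd_send("myapp.response_time:{:.1f}|ms".format(elapsed_ms))  # performance

handle_request()
```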
I’m starting to think there’s a lot that the monitoring world can learn from the aviation world, and vice-versa.
Those who know me are probably starting to roll their eyes; I really like talking about airplanes. I love flying, especially when I’m the one doing it. I got my private pilot’s license many years ago, and am in the middle of a glider rating. I wish I had more time to go up.