
DDoS protection: Observability, automation, and curiosity

2024-09-20 15 min

DDoS (distributed denial of service) attacks have been launched by sophisticated state actors, teenage hackers, and even malicious people with little technical know-how. Their criterion for success is simple: If your service ends up being denied, it’s a win.

Even if engineers work 24/7 to monitor for attacks, they may only have minutes to respond — and it still may be too late. DDoS mitigation requires an understanding of internet engineering and a lot of forward-thinking.

Fortunately, there are ways to use observability and automation as tools to detect and mitigate DDoS attacks. In the latest episode of “Grafana’s Big Tent” podcast, host Matt Toback, Grafana Labs VP of Culture, discusses the topic with Dee Kitchen, Grafana Labs VP of Engineering for Databases (Loki, Mimir, Tempo, Pyroscope), and Alex Forster, an engineering manager at Cloudflare, a company whose services include cloud cybersecurity and DDoS mitigation.

Forster works on the customer-facing parts of Cloudflare’s DDoS mitigation, such as educating the community on how mitigations work and providing visibility into what the mitigations are doing. Fun fact: Kitchen was the engineering manager for the DDoS feature team at Cloudflare before coming to Grafana Labs.

You can read some of the show’s highlights below, but listen to the full episode to find out more about DDoS and the history of attacks (the first one might be more recent than you think), hear about the big Rapid Reset attack in 2023, and learn about whether it’s smart to automate based on third-party data. (You can also find the full transcript here.)

Note: The following are highlights from episode 3, season 2 of “Grafana’s Big Tent” podcast. The transcript below has been edited for length and clarity.

Coupling control and observability

Matt Toback: Dee, how would you describe what we’re talking about today?

Dee Kitchen: I came up with this topic, and it was mainly because it’s one that’s been lingering for more than a decade. I was thinking of Fail2Ban, a Python program that looks at your log files and the failed authentication attempts for your SSH logs, web application, or whatever, and then it does some action. Like if it spots three failures from a given IP in a 50-minute window, it will drop in an iptables rule, maybe for 45 minutes. The attacker goes away and it slows everything down.

It led me to realize that Fail2Ban is really interesting because it’s an observability signal that is directly coupled to a control mechanism in action. It doesn’t provide you a dashboard, it doesn’t provide you an alert so much — it might send you an email for notifications — but it’s just tightly coupling control to observability. I found that interesting because we’re in observability, where you go beyond monitoring into having a look at everything and being able to answer any question.

But I think there’s more beyond that, which is: What do you do when you can ask any question?
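As a concrete illustration of the Fail2Ban-style coupling Dee describes, a minimal sketch might look like the following: tail an auth log, count failures per IP inside a window, and insert a firewall rule when a threshold is crossed. The log pattern, thresholds, and iptables invocations are illustrative assumptions, not Fail2Ban’s actual implementation.

```python
import re
import subprocess
import sys
import time
from collections import defaultdict, deque

# Illustrative values, echoing Dee's example: three failures in a 50-minute
# window trigger a ban that is lifted again after 45 minutes.
FAIL_PATTERN = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")
WINDOW_SECONDS = 50 * 60
MAX_FAILURES = 3
BAN_SECONDS = 45 * 60

failures = defaultdict(deque)   # ip -> timestamps of recent failures
banned = {}                     # ip -> time the ban was applied

def block(ip):
    # The observability signal is coupled directly to a control action.
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)
    banned[ip] = time.time()

def unblock_expired():
    for ip, since in list(banned.items()):
        if time.time() - since > BAN_SECONDS:
            subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=True)
            del banned[ip]

def handle_line(line):
    match = FAIL_PATTERN.search(line)
    if not match:
        return
    ip, now = match.group(1), time.time()
    window = failures[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_FAILURES and ip not in banned:
        block(ip)

if __name__ == "__main__":
    # e.g. pipe logs in: journalctl -f -u ssh | python watcher.py
    for line in sys.stdin:
        handle_line(line)
        unblock_expired()
```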

Matt: Do you think that there’s a component of observability that, because it came from monitoring, is less owned by a different group or is less tactical?

Dee: I do think there’s this myth that we would go and sit and look at a dashboard all day. We kind of wanna avoid that. You shouldn’t need to go to the dashboard — and that’s what we’ve been saying at Grafana for a while, right? You alert on it; act on it; trigger the actual workflow. But when you’re doing ops and you’re on-call, when you receive the same critical alert or warning alert and you take the same action, you should automate it away and never look at that dashboard again.

Matt: So if denial of service is self-evident, shouldn’t there be this deep fear about making automated observability tools? You could end up blocking traffic yourself as a result of a poorly built tool.

Dee: There’s definitely a risk and fear there, but I think the benefits are great because you can sleep at night and not do the ops.

Alex Forster: To frame this in specific terms of DDoS detection and mitigation, we often have other teams inside Cloudflare, or even maybe customers or press, come to us and say, “Have you been tracking this attack campaign?” or “Have you seen some nation state doing this?” And the answer is always, “Well, we haven’t been looking, so we’re going to have to go look.”

We would consider it a failure on our part if we had teams staring at graphs 24/7 in order to pilot our DDoS mitigation systems. That would mean they are not reliable enough to work on their own without a human needing to babysit them. It’s not an easy task to get to a place where you can trust your systems to react correctly without human intervention. But you can get there — and when you do, it is a great place to be.

Automation > human reactions

Matt: Observability equals control. Alex, this idea of leveraging automation and not being afraid of what automation can do as it relates to that — is it advisable?

Alex: It’s definitely advisable if you can pull it off. It is tricky to get right. For instance, you can automate the take-down of the majority of your capacity and cause yourself an outage. There are a lot of guardrails that need to be put in place if you’re going to automate based on real-time information. That being said, it’s the only way to get to really high availability and go from three or four 9s to five 9s and beyond. You need to be able to react faster than a human could, and the only way to do that is with automation.

Matt: In order to do that, you have to have a serious span of control as far as where you can make changes. Does that become challenging from an organization standpoint?

Alex: Absolutely. It depends on the types of systems you run, but the larger you are, the larger a unified control plane gets.

Dee: Three 9s is 43 minutes of downtime per month, or thereabouts. When your manager turns around and goes, “Three 9s is not good enough, we need four 9s,” what that means is that you only get four minutes of downtime per month. Now, most on-call engineers can’t answer the page that fast. It takes 30 to 60 seconds for the page to even be delivered, then you have to notice it, get your laptop out or turn on your computer, or quickly rush over. You can’t respond in the four minutes that you’ve got. You’ve gotta remove humans from that decision-making response process, which means you have to automate the hell out of this.

All of these observability signals are now available not just for a human to understand, but for a human to automate, such that a computer can act on them. Because you really want a machine to do this for you, even with the risk.
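For reference, the arithmetic behind Dee’s three-9s and four-9s figures is simply the availability target applied to a 30-day month:

```python
# Downtime budget per 30-day month for a given number of nines.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for nines in (3, 4, 5):
    budget = MINUTES_PER_MONTH * 10 ** -nines
    print(f"{nines} nines -> {budget:.1f} minutes of downtime per month")
# 3 nines -> 43.2 minutes, 4 nines -> 4.3 minutes, 5 nines -> 0.4 minutes
```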

Alex: I think DDoS is, again, a great example of this. We try to kick in within seconds, not minutes. And of course, you want your DDoS mitigation to kick in quickly, but imagine how you would achieve that if you had to have humans triaging DDoS attacks. SOC analysts wouldn’t have enough time to respond in order to keep a really aggressive SLA.

Dee: The more we automate, the calmer your life is as an ops person. We should be seeking out where to automate continuously, everywhere. There are signs of this all over the place, but a lot of engineers don’t realize it. AWS Auto Scaling is just automation based on those signals; Fail2Ban is automation based on a signal. Whatever these are, they’re all protecting us, protecting our sleep, which is an important thing.

I run my own website, and when I was on holiday, I got a DDoS attack. It was 130,000 HTTP requests in about a five-second window. It knocked me straight offline. It took me ages to figure out that the commonality of all the traffic was this HTTP header that I wasn’t logging. Not only was I not logging it, but I also didn’t have a simple, programmable endpoint where I could block it. I had to write code to block it.
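The code Dee had to write could be as small as a middleware in front of the application. Here’s a hedged sketch of the idea; the header name and value are hypothetical placeholders, and the real attack signature would come from your own logs.

```python
# A minimal WSGI middleware sketch: reject requests carrying a suspicious
# header value. The header and value below are illustrative only.
SUSPICIOUS_HEADER = "HTTP_X_EXAMPLE_TOOL"   # WSGI key for "X-Example-Tool"
BLOCKED_VALUES = {"bad-client/1.0"}

class HeaderBlockMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ.get(SUSPICIOUS_HEADER) in BLOCKED_VALUES:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"blocked\n"]
        return self.app(environ, start_response)

# Usage: wrap your existing WSGI app, e.g. app = HeaderBlockMiddleware(app)
```

The larger point, though, is that having this kind of programmable control point in place ahead of the service beats writing it in the middle of an incident.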

Matt: Alex, pretend Dee doesn’t have deep knowledge. How would you help?

Alex: This would be where observability is key. Oftentimes, you need to research these things historically. For instance, if the attack is intermittent and you’re not able to catch it in real time, you need to be recording what you need to know ahead of time — even before you see what the attack is doing.

Matt: Or… time travel.

Dee: Yeah, that’s the other way out. But this is why I think these things are observability, because observability means recording enough that you can answer all questions, not just monitoring some things that you already know you’re gonna ask about.

Managing mitigations and planning for the future

Matt: When you put observability in place, or you start to have automatic mitigations in place, is there a long tail of people trying these things anywhere or everywhere all the time? Or does it feel like a whack-a-mole, where one thing is popular for a bit and then it disappears, and then there’s a new thing that’s popular?

Alex: It’s both. At the cutting edge of this industry, there is a whack-a-mole game going on, but the vast majority of DDoS mitigations are pretty standard methods that are well understood.

Dee: There are tools that someone smart once wrote to launch a certain type of attack, but a lot of other people don’t really know how to write such a thing, so they package it. A lot of attacks come from these packaged generators, which means once you’ve figured them out and blocked them, you’re blocking everything that all of those various tools are using. But there’s always more that can be done.

Matt: Alex, when you go to the protocols team and say, “We need to instrument this,” or “We need to start watching this,” do you also follow up and say, “Now that this has happened, can you think of five more things that we should be observing that we’re not?”

Alex: Yes! Very much. When we’re caught in a situation where we don’t have observability of something that we didn’t think we needed but actually did, it always makes us step back and say, “Where else are our blind spots? Where else could we be collecting data that may come in handy in the future?” You don’t want to collect data that you are confident that you won’t need, but then how can you really ever be confident that you won’t need that data? It’s usually a gut check.

Matt: I’ve heard people talk about putting developers on a “data diet,” or that they can’t log everything because it’s far too expensive. This is like that on steroids, isn’t it? How do you think about the value of what’s worthwhile and what’s not, when you have no idea what’s worthwhile and what’s not... until you do?

Alex: There are compromises to be made there. In Cloudflare’s case, DDoS protection is a core value prop. So for us, it isn’t too hard to get resources if we need them, but we still need to justify why we need to record things.

One example: Our DDoS mitigation systems act entirely on traffic inbound to our network. It’s very hard for us to justify, from a DDoS perspective, recording information about traffic leaving our network. It’s not something we need, even though in some cases, maybe for research purposes, it would be interesting. Even when your service is a core part of your business, those sorts of considerations do come into play.

Matt: Dee, how do we think about it here at Grafana Labs?

Dee: Tensions, trade-offs. On one hand you want to observe everything. And then at some point in time, we ask, “How much does the Loki team’s logging cost us?” You’ve gotta make judgment calls, and sometimes you turn around, see something you are logging, and go, maybe I will never, ever need that.

The question about control is an interesting one because it gives you a context to ask, “Is this valuable? Are you ever gonna use this to drive something? Would you use it to scale something out? Does it tell you some signal?” If it doesn’t, perhaps you don’t need it.

What individuals should be logging and observing

Matt: In terms of observability, if someone is running their own network or endpoint service, where is a practical place to start?

Alex: Cover your basics. Your most important systems should be logging. If they’re network gear, you should be collecting SNMP counters. If they are Linux services, you should be collecting CPU and memory usage data, and you should be storing it. Even in giant distributed systems, those are some of the most common places we go to troubleshoot.

If you are specifically developing software and you have control over what sorts of metrics a system is outputting, you are going to want to put some thought into observing any place where you could imagine something spinning out or taking too long, and adding instrumentation there. It will make your life as a developer much easier. But if you are a systems administrator or network engineer working on third-party or other software you do not control, make sure you have the basics there.
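To illustrate Alex’s point about instrumenting any place that could spin out or take too long, here’s a small sketch using the Python prometheus_client library (one option among many; the metric name and the work being timed are placeholders):

```python
# Sketch: time a potentially slow operation and expose it as a Prometheus
# histogram that your monitoring stack can scrape and alert on.
import time
from prometheus_client import Histogram, start_http_server

# Hypothetical metric name; use names that describe what you actually measure.
REQUEST_SECONDS = Histogram("upstream_request_seconds",
                            "Time spent calling the upstream service")

def call_upstream():
    with REQUEST_SECONDS.time():     # records the duration of every call
        time.sleep(0.05)             # stand-in for the real work

if __name__ == "__main__":
    start_http_server(8000)          # exposes /metrics for scraping
    while True:
        call_upstream()
```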

Dee: The ability to get PCAPs and the ability to get logs off of absolutely every machine will help. But let’s go for the most common ones. I would argue for having a gateway ahead of your internet-facing website stuff, where you can log every single aspect of an HTTP request and everything that’s going on, and that provides a control point there as well. You need the observability and the control plane basically in one, ahead of your service. Decoupling them is gonna put you at risk.
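A bare-bones version of that gateway-side logging, to pair with the blocking middleware sketched earlier, might look like this (the JSON field names are illustrative, not a standard schema):

```python
# Sketch: a WSGI middleware that emits one structured JSON line per request,
# capturing every header so there is no "header I wasn't logging."
import json
import sys
import time

class RequestLogMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        record = {
            "ts": time.time(),
            "remote_addr": environ.get("REMOTE_ADDR"),
            "method": environ.get("REQUEST_METHOD"),
            "path": environ.get("PATH_INFO"),
            "query": environ.get("QUERY_STRING"),
            "headers": {k[5:]: v for k, v in environ.items() if k.startswith("HTTP_")},
        }
        json.dump(record, sys.stdout)
        sys.stdout.write("\n")
        return self.app(environ, start_response)
```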

Rate limits as the gateway into automating

Matt: How do you start automating?

Dee: Well, a simple way is a rate limit. It takes a signal and counts it, just like the Fail2Ban one at the very beginning of the conversation. Figure out what your threshold is for “normal” and set the rate limit above that, and that’s your first automated defense. And if that never fires, great. That five minutes costs you nothing. But it will fire.
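A rate limit really can be that small. A rough sliding-window sketch follows; the window and threshold are placeholders you would set above your own observed “normal.”

```python
# Sketch: count requests per client in a rolling window and reject anything
# over a threshold chosen comfortably above normal traffic.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 1
LIMIT = 200          # hypothetical threshold above observed normal

_requests = defaultdict(deque)   # ip -> timestamps of recent requests

def allow(ip: str) -> bool:
    now = time.time()
    window = _requests[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) <= LIMIT

# In a request handler: if not allow(client_ip), return 429 Too Many Requests.
```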

Alex: I would fully endorse that, and having a positive security model, which includes rate limiting and includes firewalling off things that shouldn’t be accessible. Understanding what “normal” is, so you won’t allow your systems to go outside that range, is the most significant and probably easiest thing you can do to harden yourself up. You also want to implement these measures in a way that won’t affect your good traffic.

Dee: We need to explain what a positive security model is. The default model on the internet — which presumes trust and everything is good and was not built for security — is allow everything unless someone is explicitly denied. That’s crazy. You need to switch this more towards deny everything unless it is explicitly allowed. Most people can understand that a single web server is not doing more than 20,000 requests per second, so you can immediately go, “If it’s above a few hundred requests per second on a single server and I am not a huge website, that’s a good, safe rate limit.”
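To make “deny everything unless it is explicitly allowed” concrete, here’s a toy sketch; the routes and methods are purely illustrative:

```python
# Sketch of a positive security model: reject by default, allow only what is
# explicitly listed.
ALLOWED = {
    ("GET", "/"),
    ("GET", "/blog"),
    ("POST", "/api/comments"),
}

def is_allowed(method: str, path: str) -> bool:
    # Anything not explicitly allowed is denied by default.
    return (method, path) in ALLOWED
```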

Matt: Sure. But I take that and say use the observability you already have, or the metrics you’ve already been collecting, as a stepped approach to informing what’s normal and how you should make decisions around that. Of course there’s gonna be outliers, but those could be the first places where you start.

Dee: I had a manager once, the CTO, and he would come up to me and ask, “Is everything good at the moment?” It’s a terrifying question ‘cause you can’t possibly answer “yes” to it, because what if he knows something that you don’t know? Writing dashboards that show you how to answer yes is a really good way of highlighting most of the observability of a system to the point where you can confidently say, “Nothing is wrong. We know what normal is.” It’s a good way to lay a foundation to being able to spot bad.

The power of curiosity

Matt: How do you remain informed and how do you continue to get smarter about this subject as time goes on?

Dee: I think curiosity is everything — asking, “How did that happen?” And this is the same with any incident on-call, right? It’s not enough to go, “When we scale up, it solves the problem.” Do you understand what happened and why? And if we go down that rabbit hole, we become better engineers.

Alex: I agree with that completely. There are two forcing functions that are going to make you get observability into your systems. One of them is pride and curiosity in yourself as an engineer. The other is angry customers yelling at you and telling you to fix what you run. I would suggest being the former so that when the latter happens, you’re prepared.

I don’t think the work is as much as people think it is, and it will pay off in the long run, even if you don’t know exactly when. Bringing a curious mind to these observability spaces is really an all-around positive for you as an engineer.

“Grafana’s Big Tent” podcast wants to hear from you. If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at bigtent@grafana.com. You can also catch up on the first and second season of “Grafana’s Big Tent” on Apple Podcasts and Spotify.