How I Built a Real-Time DDoS Detection Engine from Scratch

I am a Cybersecurity and Cloud Security enthusiast passionate about automation, DevSecOps, and securing cloud infrastructures. I focus on building resilient and secure systems through security best practices and automation.

Introduction

Imagine you run a website that stores files for thousands of users. Everything runs smoothly until one day your server slows to a crawl and legitimate users can't get in. You check the logs and see that one IP address sent 10,000 requests in the last minute. You've just been hit by a denial-of-service attack.

A DDoS (Distributed Denial of Service) attack is the same idea spread across many machines: the attacker floods your server with so many fake requests that it has no resources left to serve real users. It's like someone sending thousands of people to crowd a small shop so actual customers can't get in.

In this post I'll walk you through how I built an automated system that watches all incoming traffic, learns what normal looks like, and blocks attackers automatically — all without a human having to watch a screen.

What the Project Does

The system sits alongside a Nextcloud server (a self-hosted file storage app, like Google Drive) and does five things continuously:

  1. Reads every HTTP request as it arrives

  2. Counts requests per IP and globally

  3. Learns what normal traffic looks like

  4. Flags anything that deviates from normal

  5. Blocks attackers at the firewall level and alerts the team on Slack

The entire thing runs as a background daemon — a program that never stops — and makes decisions in real time without any human involvement.

The Architecture

Before diving into the individual pieces, here is how everything connects:

Internet Traffic
      ↓
   Nginx (front door — logs every request as JSON)
      ↓
   Nextcloud (the actual app users interact with)
      ↓
   Your Detector (reads logs, runs math, blocks bad guys)
      ↓
   Dashboard + Slack Alerts + iptables blocks

Nginx sits in front of Nextcloud as a reverse proxy. Think of it like a receptionist: every visitor goes through it first, and it writes each one down in a log file. The detector is not in the request path itself; it reads that log file in real time.

Piece 1 — Reading the Logs

Nginx is configured to write logs in JSON format. Every HTTP request produces one log line like this (shown here spread over several lines for readability):

{
  "source_ip": "41.58.2.1",
  "timestamp": "2024-11-14T15:04:05Z",
  "method": "GET",
  "path": "/login",
  "status": 200,
  "response_size": 4321
}

JSON is used instead of the default Nginx format because it is easy for Python to read. Instead of splitting strings, you just ask for a field by name:

import json

data = json.loads(log_line)  # log_line: one raw line read from the Nginx log
ip = data["source_ip"]
status = data["status"]

The detector opens this file and reads it line by line forever. When there is no new line yet, it waits 50 milliseconds and tries again. This is called tailing a file — the same thing the tail -f command does in Linux.
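As a rough illustration of the tailing loop described above, here is a minimal sketch. The names `follow`, `parse_line`, and `max_idle_polls` are hypothetical and not taken from the project; `max_idle_polls` exists only so the loop can end in a test. The sketch assumes the log is never rotated while it is being read.

```python
import json
import time
from typing import Iterator, Optional, TextIO


def follow(fh: TextIO, poll_interval: float = 0.05,
           max_idle_polls: Optional[int] = None) -> Iterator[str]:
    """Yield each complete line appended to fh, polling like `tail -f`.

    max_idle_polls is a hypothetical knob so the loop can end in tests;
    leave it as None to follow the file forever.
    """
    idle = 0
    partial = ""
    while max_idle_polls is None or idle < max_idle_polls:
        chunk = fh.readline()
        if not chunk:
            idle += 1
            time.sleep(poll_interval)  # no new data yet: wait, then retry
            continue
        idle = 0
        partial += chunk
        if partial.endswith("\n"):  # only emit lines that are fully written
            yield partial.rstrip("\n")
            partial = ""


def parse_line(line: str) -> Optional[dict]:
    """Return the decoded log record, or None if the line is not a JSON object."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None
```

Skipping lines that fail to parse, rather than crashing, keeps one malformed entry from stopping the whole detector.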

Piece 2 — The Sliding Window

Once you have a log line, you need to answer one question: how many requests has this IP sent recently?

"Recently" is defined as the last 60 seconds. The naive approach would be to count requests per calendar minute, but that has a blind spot. If an attacker sends 1000 requests split across the last 5 seconds of one minute and the first 5 seconds of the next, each minute's counter shows only 500. A threshold set anywhere above 500 never fires, even though 1000 requests arrived within 10 seconds.

A sliding window fixes this. It always looks at the last 60 seconds from right now, regardless of where the clock minute boundary falls.
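The boundary problem can be checked with a few lines of arithmetic. This is only an illustration with made-up timestamps (1000 requests, one every 10 ms, straddling the 60-second mark), using integer milliseconds so the bucket math stays exact.

```python
from collections import Counter

# 1000 requests, one every 10 ms, from t=55s to t=64.99s (in milliseconds).
timestamps_ms = [55_000 + i * 10 for i in range(1000)]

# Fixed buckets: one counter per calendar minute.
fixed_buckets = Counter(t // 60_000 for t in timestamps_ms)

# Sliding window: everything within 60 seconds of "now".
now_ms = 65_000
sliding_count = sum(1 for t in timestamps_ms if now_ms - t <= 60_000)

print(dict(fixed_buckets))  # {0: 500, 1: 500}
print(sliding_count)        # 1000
```

Each fixed bucket sees half the burst, while the sliding window sees all of it.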

The data structure used is a deque (double-ended queue). Think of it as a list where you can efficiently add to one end and remove from the other.

Here is how it works:

from collections import deque
import time

window = deque()
is_error = False  # True when the request's status was 4xx or 5xx

# When a new request arrives, add it to the right
window.append((time.time(), is_error))

# Remove entries older than 60 seconds from the left
# Entries are in time order so the oldest is always on the left
now = time.time()
while window and now - window[0][0] > 60:
    window.popleft()

# Requests seen in the last 60 seconds (divide by 60 for requests per second)
rate = len(window)

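As time moves forward, old entries fall off the left each time the window is trimmed, so the window slides with time. After trimming, len(window) is the exact request count for the last 60 seconds. Dividing it by 60 gives an average in requests per second, which is the unit the baseline uses.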

Two windows are maintained simultaneously:

  • Per-IP window — one deque per IP address

  • Global window — one deque for all traffic combined
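One way to keep the per-IP and global windows together is sketched below. The class name `SlidingCounter` and its methods are hypothetical, and timestamps are passed in explicitly (rather than read from `time.time()`) so the behavior is easy to test.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60


class SlidingCounter:
    """Per-IP and global sliding windows of (timestamp, is_error) entries."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.per_ip = defaultdict(deque)  # one deque per source IP
        self.global_window = deque()      # one deque for all traffic

    def _evict(self, window, now):
        # Entries are appended in time order, so the oldest is on the left.
        while window and now - window[0][0] > self.window_seconds:
            window.popleft()

    def record(self, ip, now, is_error=False):
        entry = (now, is_error)
        self.per_ip[ip].append(entry)
        self.global_window.append(entry)

    def count(self, ip, now):
        window = self.per_ip[ip]
        self._evict(window, now)
        return len(window)

    def global_count(self, now):
        self._evict(self.global_window, now)
        return len(self.global_window)

    def rate(self, ip, now):
        """Average requests per second for this IP over the window."""
        return self.count(ip, now) / self.window_seconds
```

In a long-running daemon, deques for IPs that have gone quiet would also need to be deleted from `per_ip` periodically, otherwise the dictionary grows without bound.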

Piece 3 — The Baseline

Knowing the current rate is not enough. You need context. Is 15 requests per second a lot? It depends entirely on what normal looks like for your server.

At 3am, your server might normally get 2 requests per second. At noon it might get 30. A hardcoded threshold of "block anyone over 50 req/s" would miss a 3am attack running at 40 req/s, which is 20 times normal, while a lower threshold tuned for 3am would block legitimate noon traffic.

The solution is a rolling baseline — the system watches your actual traffic and learns what normal looks like.

Every second, the detector records how many requests arrived that second. It keeps a rolling 30-minute window of these per-second counts. Every 60 seconds it calculates two numbers from that data:

Mean — the average requests per second:

mean = sum(data) / len(data)

Standard deviation — how much the numbers normally vary from the mean:

import math

variance = sum((x - mean) ** 2 for x in data) / len(data)
stddev = math.sqrt(variance)

These two numbers are your baseline. They are recalculated every 60 seconds, so they always reflect recent reality. If your site gets a surge of legitimate traffic at noon, the baseline starts shifting at the next recalculation and, as the surge fills more of the 30-minute window, stops treating that traffic as suspicious.

The system also keeps per-hour slots. Traffic at 9am tends to look different from traffic at 3am. When there is enough data for the current hour, it prefers that over the full 30-minute average — because the current hour is the most relevant comparison.
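The choice between hourly data and the rolling window might look like the sketch below. The post does not say how much data counts as "enough", so `MIN_HOURLY_SAMPLES` is an assumed value, and `select_samples` is a hypothetical helper name.

```python
MIN_HOURLY_SAMPLES = 300  # assumption: the post does not state the real cutoff


def select_samples(hourly_slots, rolling_samples, hour,
                   min_samples=MIN_HOURLY_SAMPLES):
    """Prefer the current hour's per-second counts when there are enough.

    hourly_slots maps hour of day (0-23) to a list of per-second counts;
    rolling_samples is the 30-minute rolling window.
    """
    hourly = hourly_slots.get(hour, [])
    if len(hourly) >= min_samples:
        return list(hourly)
    return list(rolling_samples)
```

Whichever list is returned is then fed into the same mean and standard deviation calculation.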

Floor values are applied to prevent extreme sensitivity during quiet periods:

effective_mean = max(mean, 1.0)    # never below 1 req/s
effective_stddev = max(stddev, 0.5) # never below 0.5

Without floors, a server with almost zero traffic at 3am might have a mean of 0.1 and a stddev of 0.05. A single legitimate morning user sending 2 requests per second would then score (2 - 0.1) / 0.05 = 38 standard deviations above normal and could trigger a ban.
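Putting the rolling window, the two statistics, and the floors together gives something like the sketch below. `RollingBaseline` and its method names are hypothetical; the 30-minute window and the floor values come from the text above.

```python
import math
from collections import deque


class RollingBaseline:
    """Rolling window of per-second request counts with floored statistics."""

    def __init__(self, window_seconds=30 * 60, mean_floor=1.0, stddev_floor=0.5):
        # maxlen makes the deque discard the oldest sample automatically.
        self.samples = deque(maxlen=window_seconds)
        self.mean_floor = mean_floor
        self.stddev_floor = stddev_floor

    def add_second(self, request_count):
        """Record how many requests arrived during the last second."""
        self.samples.append(request_count)

    def compute(self):
        """Return (effective_mean, effective_stddev) with floors applied."""
        if not self.samples:
            return self.mean_floor, self.stddev_floor
        data = list(self.samples)
        mean = sum(data) / len(data)
        variance = sum((x - mean) ** 2 for x in data) / len(data)
        stddev = math.sqrt(variance)
        return max(mean, self.mean_floor), max(stddev, self.stddev_floor)
```

Calling `compute()` once every 60 seconds, rather than on every request, keeps the cost of iterating over 1800 samples negligible.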

Piece 4 — The Detection Logic

Now you have two things: the current rate for an IP, and the baseline for what normal looks like. The detection logic compares them using a Z-score.

A Z-score answers one question: how many standard deviations away from normal is this number?

zscore = (current_rate - mean) / stddev

If normal is 12 req/s and stddev is 2:

  • 14 req/s → Z-score of 1.0 → within normal range

  • 18 req/s → Z-score of 3.0 → suspicious, right at the alert threshold

  • 50 req/s → Z-score of 19.0 → definitely an attack

The system fires an alert when either of two conditions is true — whichever happens first:

# Condition 1 — statistically anomalous
if zscore > 3.0:
    flag_as_anomaly()

# Condition 2 — raw rate too high relative to baseline
elif current_rate > 5 * mean:
    flag_as_anomaly()

The second condition catches edge cases where the standard deviation is large (traffic varies a lot normally) and the Z-score alone might be too lenient.
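The two conditions can be folded into one function, sketched here under the assumption that the floors from the baseline section are applied before the comparison. The function name `classify` and its return values are hypothetical.

```python
from typing import Optional

Z_THRESHOLD = 3.0
RATE_MULTIPLIER = 5.0


def classify(current_rate: float, mean: float, stddev: float,
             z_threshold: float = Z_THRESHOLD,
             rate_multiplier: float = RATE_MULTIPLIER) -> Optional[str]:
    """Return "zscore" or "rate" if the rate is anomalous, else None.

    All rates are in requests per second.
    """
    mean = max(mean, 1.0)      # same floors as the baseline section
    stddev = max(stddev, 0.5)
    zscore = (current_rate - mean) / stddev
    if zscore > z_threshold:
        return "zscore"
    if current_rate > rate_multiplier * mean:
        return "rate"
    return None


print(classify(14, 12, 2))   # None
print(classify(50, 12, 2))   # zscore
print(classify(55, 10, 20))  # rate
```

The last call shows the case the second condition exists for: with a standard deviation of 20, a rate of 55 scores only 2.25, but it is still more than five times the mean.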

There is also an error surge check. If an IP is getting a lot of 4xx or 5xx responses, that suggests probing behavior: someone trying every door, hoping one is unlocked. When an IP's error rate exceeds 3x the baseline error rate, the detection thresholds for that IP are tightened automatically.
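A sketch of the error surge check follows. The 3x factor is from the text; the tightened threshold values and the floor on the baseline error rate are assumptions made for illustration, since the post does not state them. The window entries are assumed to be (timestamp, is_error) tuples as in the sliding window section.

```python
ERROR_SURGE_FACTOR = 3.0
NORMAL_THRESHOLDS = (3.0, 5.0)     # (z-score threshold, rate multiplier)
TIGHTENED_THRESHOLDS = (2.0, 3.0)  # assumed values, not from the post
BASELINE_ERROR_FLOOR = 0.01        # assumed floor so a 0% baseline is usable


def error_rate(window):
    """Fraction of (timestamp, is_error) entries that were 4xx/5xx."""
    if not window:
        return 0.0
    return sum(1 for _, is_error in window if is_error) / len(window)


def thresholds_for(ip_error_rate, baseline_error_rate):
    """Return tighter thresholds when an IP's error rate is surging."""
    baseline = max(baseline_error_rate, BASELINE_ERROR_FLOOR)
    if ip_error_rate > ERROR_SURGE_FACTOR * baseline:
        return TIGHTENED_THRESHOLDS
    return NORMAL_THRESHOLDS
```

The returned pair would then be passed to whatever function evaluates the Z-score and rate conditions for that IP.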

Global traffic is checked every 5 seconds using the same logic. A global spike means many IPs are attacking simultaneously — you cannot block all of them, so the system sends a Slack alert for the team to respond manually.

Piece 5 — Blocking with iptables

When an IP is flagged, it needs to be stopped immediately. Sending a message back saying "you're blocked" wastes server resources. The attack traffic needs to disappear before it even reaches Nginx.

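iptables is the command-line front end to netfilter, the Linux kernel's built-in packet filter. Its rules are evaluated low in the network stack, before a packet ever reaches Nginx or any other application. When you add a DROP rule for an IP, that IP's packets are thrown away at the kernel level. The attacker doesn't get an error message. Their connection attempts simply time out. One caveat: the INPUT chain only sees traffic addressed to the host itself. If Nginx runs in a Docker container with a published port, those packets are forwarded to the container and pass through the FORWARD chain instead, where Docker provides the DOCKER-USER chain for rules like this one.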

import subprocess

def ban(ip):
    # Must run as root; check=True raises CalledProcessError if iptables fails
    subprocess.run(
        ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
        check=True,
    )

Breaking that command down:

  • iptables — the firewall tool

  • -I INPUT — insert at the TOP of the incoming traffic chain

  • -s 41.58.2.1 — match traffic from this source IP

  • -j DROP — silently discard the packet

The -I flag inserts at the top rather than appending to the bottom. This matters because iptables checks rules in order — if your DROP rule is at the bottom and there is an ACCEPT rule above it, traffic gets accepted before reaching your block.
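Because the address comes from a log file, it may be worth validating it before handing it to the firewall. iptables also accepts hostnames for -s, so a malformed value could do something unexpected. The sketch below is a defensive variant, not the project's code: `build_ban_command` and the injectable `runner` parameter are hypothetical, and using `ip6tables` for IPv6 addresses is an addition the post does not mention.

```python
import ipaddress
import subprocess


def build_ban_command(ip: str) -> list:
    """Validate the address and return the argv list for the DROP rule."""
    addr = ipaddress.ip_address(ip)  # raises ValueError if ip is not an IP
    tool = "iptables" if addr.version == 4 else "ip6tables"
    return [tool, "-I", "INPUT", "-s", str(addr), "-j", "DROP"]


def ban(ip: str, runner=subprocess.run) -> None:
    """Insert the rule; runner is injectable so tests need no root access."""
    runner(build_ban_command(ip), check=True)
```

Passing the command as a list (rather than a shell string) already avoids shell injection; the `ipaddress` check additionally rejects anything that is not a literal address.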

The ban happens within 10 seconds of detection. A Slack message is sent simultaneously telling the team who was blocked, why, and how long the ban lasts.
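The post does not say how the Slack message is delivered. Assuming a Slack incoming webhook, which accepts a JSON body with a `text` field, the notification could be built roughly like this. The function names and the message wording are hypothetical.

```python
import json
import urllib.request


def format_ban_message(ip, reason, duration_seconds):
    """Describe who was blocked, why, and for how long (None = permanent)."""
    if duration_seconds is None:
        length = "permanently"
    else:
        length = f"for {duration_seconds // 60} minutes"
    return f"Blocked {ip} {length}. Reason: {reason}"


def build_slack_request(webhook_url, text):
    """Build (but do not send) the POST request for an incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a matter of calling `urllib.request.urlopen(request, timeout=5)` from a background thread, so that a slow Slack response never delays the ban itself.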

The Unban Schedule

Not every block needs to be permanent. The system uses a backoff schedule:

Offense    Ban Duration
1st        10 minutes
2nd        30 minutes
3rd        2 hours
4th+       Permanent

A background thread checks every 10 seconds whether any ban has expired. When one has, it removes the iptables rule, updates the system state, and sends a Slack notification. If the same IP gets caught again, the next ban is longer.

To remove a ban:

# -D deletes the rule that matches this exact specification
subprocess.run([
    "iptables", "-D", "INPUT",
    "-s", ip,
    "-j", "DROP"
], check=True)
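The backoff schedule and the expiry check can be expressed as two small functions. This is a sketch with hypothetical names; it assumes active bans are kept in a dictionary mapping each IP to its expiry timestamp, with None meaning permanent.

```python
BAN_SCHEDULE_SECONDS = [10 * 60, 30 * 60, 2 * 60 * 60]  # 1st, 2nd, 3rd offense


def ban_duration(offense_number):
    """Return the ban length in seconds, or None for a permanent ban."""
    if offense_number < 1:
        raise ValueError("offense_number starts at 1")
    if offense_number <= len(BAN_SCHEDULE_SECONDS):
        return BAN_SCHEDULE_SECONDS[offense_number - 1]
    return None  # 4th offense and beyond


def expired_bans(active_bans, now):
    """Return the IPs whose ban has run out; permanent bans never expire."""
    return [
        ip for ip, expires_at in active_bans.items()
        if expires_at is not None and expires_at <= now
    ]
```

The background thread would call `expired_bans` every 10 seconds and remove the firewall rule for each IP it returns.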

The Live Dashboard

A Flask web server runs on port 8080 and serves a page that refreshes every 3 seconds automatically using a simple HTML meta tag:

<meta http-equiv="refresh" content="3">

The dashboard shows:

  • Currently banned IPs with countdown timers

  • Global requests per second

  • Top 10 source IPs

  • CPU and memory usage

  • Current baseline mean and stddev

  • System uptime

Every page load collects fresh data from all components and rebuilds the HTML. Simple and reliable.
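Rebuilding the page on every load can be as simple as formatting a string. The sketch below covers only the HTML rendering and leaves out Flask itself; the `state` dictionary layout and the function names are assumptions made for illustration.

```python
import html


def format_countdown(seconds_left):
    """Format remaining ban time as MM:SS, or "permanent" for None."""
    if seconds_left is None:
        return "permanent"
    minutes, seconds = divmod(max(0, int(seconds_left)), 60)
    return f"{minutes:02d}:{seconds:02d}"


def render_dashboard(state, now):
    """Build the full page from a snapshot of the detector's state."""
    rows = "".join(
        f"<tr><td>{html.escape(ip)}</td>"
        f"<td>{format_countdown(None if exp is None else exp - now)}</td></tr>"
        for ip, exp in sorted(state["bans"].items())
    )
    return (
        "<!doctype html><html><head>"
        '<meta http-equiv="refresh" content="3">'
        "<title>Detector</title></head><body>"
        f"<p>Global req/s: {state['global_rps']:.1f}</p>"
        f"<p>Baseline: mean {state['mean']:.2f}, stddev {state['stddev']:.2f}</p>"
        f"<table><tr><th>Banned IP</th><th>Time left</th></tr>{rows}</table>"
        "</body></html>"
    )
```

A Flask route would return the result of `render_dashboard(current_state, time.time())`, and the meta tag makes the browser request it again every 3 seconds.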

What I Learned

The most important insight from this project is that detection only works when it is relative. A hardcoded threshold is always wrong for someone. Traffic patterns are unique to every server, every time of day, every day of the week. The baseline approach means the system adapts automatically, and the per-hour slots become more useful the longer it runs.

The second insight is that separation of concerns makes complex systems manageable. Each file has exactly one job. The sliding window does not know about Slack. The notifier does not know about iptables. When something breaks, you know exactly which file to look at.


Source Code

The full source code is available at: https://github.com/RichardBenjamin/hng14-Stage3

The live metrics dashboard is available at: http://hngstagekene3.duckdns.org:8080


Built as part of the HNG DevSecOps track — Stage 3.