
Detection Engineering Against MITRE ATT&CK: Rules That Survive a Real Attacker

Most detection rules we audit are tuned to the alerts vendors ship and the IOCs from last year's incident. They catch the easy stuff and miss everything a real attacker does. This is the detection engineering workflow we use — built around ATT&CK techniques, tested with a purple team, and measured with metrics that aren't just 'rules created'.

Why most detection programs fail their first real attacker

If you've ever inherited a SIEM, you know what's in it: a few hundred rules shipped by the vendor, a couple dozen rules someone wrote during an incident two years ago, and an alert queue full of noise that nobody triages. The rules catch the easy stuff — known bad hashes, specific malware signatures, "the password 'password' was used" — and miss almost everything else. When a real attacker shows up, the SOC's first warning is the customer email saying their data is on a leak site.

The reason isn't that detection engineering is hard. It's that most teams approach it backwards. They start from "what alerts does the vendor ship?" and try to tune them, instead of starting from "what techniques do attackers use?" and writing detections for those. The result is a detection set that's optimized for the demos vendors give, not the attacks that actually happen.

This is the workflow we use to fix that. It's built around MITRE ATT&CK techniques, tested with a purple team, and measured with metrics that mean something. None of it requires a specific SIEM or a specific budget — it works with whatever you have.


1. What ATT&CK actually is, and why it matters

MITRE ATT&CK is a catalog of techniques attackers use, organized by tactic (the "why") and technique (the "how"). At time of writing the Enterprise matrix has roughly 200 techniques and more than 400 sub-techniques, covering everything from initial access to exfiltration. Each technique has a unique ID (T1059 for Command and Scripting Interpreter, T1078 for Valid Accounts, and so on), a description of how it's used, examples from real intrusions, and suggested data sources for detection.

The reason ATT&CK matters for detection engineering is that it gives you a finite, enumerable target. Instead of trying to "detect attacks" in the abstract, you're trying to detect a specific list of techniques. You can measure your coverage by counting which techniques have at least one detection rule against them, identify the gaps explicitly, and prioritize the gaps that matter most for your environment.

The trap is treating ATT&CK coverage as the goal. Coverage is a metric, not the goal — the goal is to catch attackers. We'll come back to this.

2. IOC-based vs behavior-based detection

Most legacy detection is indicator-based: this hash is bad, this IP is bad, this domain is bad. The problem with IOCs is that they're trivially mutable. An attacker can change the hash by recompiling, change the IP by spinning up a new VM, change the domain by registering another. By the time the IOC ends up in a feed, the attacker has moved on. IOC-based detection catches yesterday's attackers.

Behavior-based detection looks at what an action does, not what the artifact looks like. "A PowerShell process spawned from Word with a network connection to a non-corporate domain" is a behavior. It catches the attacker even if they've never used that specific binary before, because the behavior stays the same regardless of the payload.

The shift from IOC-based to behavior-based detection is the single biggest leverage point in a detection program. Every rule you write should be one or the other; we recommend behavior-based whenever possible, and IOC-based only as a complement for high-confidence indicators with short useful lives.
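
As a sketch, the Word-spawns-PowerShell behavior above can be written as a simple predicate over process events. The field names (`process`, `parent`, `dest_domain`) and the corporate-domain allowlist are assumptions; map them to your EDR's actual schema:

```python
# Hypothetical process-event schema; adjust field names to your EDR.
OFFICE_PARENTS = {"winword.exe", "excel.exe", "powerpnt.exe"}
CORP_DOMAINS = {"corp.example.com", "sso.example.com"}  # assumed allowlist

def is_suspicious(event: dict) -> bool:
    """Behavior-based check: fires on the pattern, not on a hash or an IP."""
    return (
        event.get("process", "").lower() == "powershell.exe"
        and event.get("parent", "").lower() in OFFICE_PARENTS
        and event.get("dest_domain") not in CORP_DOMAINS
    )
```

Note that nothing in the predicate names a specific binary, hash, or IP: recompiling the payload or rotating infrastructure doesn't evade it.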

3. Data sources, the prerequisite nobody discusses

A detection rule is only as good as the data it sits on top of. Before you can detect anything meaningful, you need the telemetry. The minimum set we look for on every engagement:

  • Endpoint: process creation events with full command lines, parent process, and user. EDR (CrowdStrike, SentinelOne, Defender for Endpoint) gives you this; raw Sysmon on Windows or auditd on Linux gives you a workable subset for free.
  • Authentication: every login event from your IdP (Okta, Azure AD, Google Workspace) and your VPN. Attribute on user, source IP, device, and result.
  • Cloud control plane: CloudTrail, Azure Activity Log, GCP Audit Logs. Every API call, who made it, from where, what the result was.
  • Network: at minimum, DNS query logs and proxy/egress logs. Full packet capture is nice but rarely necessary; the metadata catches most of what behavior-based rules need.
  • Application: auth logs from your critical apps, including SaaS. Every "failed login" and "permission changed" event.

If any of these are missing, the rules you write against them will silently fail. Audit your telemetry coverage before you write rules — the first deliverable of a serious detection program is a data-source map, not a rule.
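
A data-source map can start as a checklist in code. This sketch uses the five categories above; the source names and descriptions are illustrative:

```python
# The audit deliverable described above, as a checklist. Names are illustrative.
REQUIRED_SOURCES = {
    "endpoint_process": "EDR or Sysmon/auditd process creation with command line",
    "authentication": "IdP and VPN login events",
    "cloud_control_plane": "CloudTrail / Azure Activity Log / GCP Audit Logs",
    "network": "DNS query logs and proxy/egress logs",
    "application": "Auth and permission-change events from critical apps",
}

def telemetry_gaps(available: set[str]) -> list[str]:
    """Return required sources we don't ingest; rules on these silently fail."""
    return sorted(s for s in REQUIRED_SOURCES if s not in available)
```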

4. The rule-writing workflow, end to end

Here's the loop we run for every new detection. It looks bureaucratic written down; in practice it takes a few hours per rule once the team is used to it.

Step 1: Pick a technique

Start with the ATT&CK technique you want to cover. For the first few rules, pick from the "top 20 most observed" list — techniques like T1059.001 (PowerShell), T1078.004 (cloud accounts), T1110.003 (password spraying), T1486 (data encrypted for impact). Don't try to cover the long tail first; cover the techniques you'll actually see.

Step 2: Enumerate the procedures

A technique is the abstract concept ("password spraying"). Procedures are the concrete ways attackers do it ("authenticating against /api/login with a list of common passwords from a rotating set of source IPs"). You need to write the rule against the procedure, not the technique. Read the ATT&CK procedure examples for the technique, plus any vendor threat-intel reports you have. List the specific behaviors you want to catch.

Step 3: Identify the data source

Which telemetry would show this behavior? Endpoint process events? Cloud audit logs? Authentication events? Pick one. If you don't have the data source, this is where you stop and go fix the data first.

Step 4: Write the rule

Write it in your SIEM's query language. Aim for specificity over volume — a rule that fires on five things a week is more useful than a rule that fires on five hundred. Your goal is high signal, not high volume.
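
To make the specificity point concrete, here is a sketch of a T1110.003 (password spraying) rule as a pure function over authentication events, rather than any particular SIEM's query language. The threshold and field names are assumptions you would tune:

```python
from collections import defaultdict

# Assumed threshold: distinct users failed from one source IP in the window.
SPRAY_THRESHOLD = 10

def password_spray_sources(events: list[dict]) -> list[str]:
    """Return source IPs that failed logins against many distinct accounts."""
    failures = defaultdict(set)
    for e in events:
        if e.get("result") == "failure":
            failures[e["src_ip"]].add(e["user"])
    return sorted(ip for ip, users in failures.items() if len(users) >= SPRAY_THRESHOLD)
```

Keying on distinct accounts per source, not total failures, is what keeps this rule quiet: one user fat-fingering a password fifty times never trips it.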

Step 5: Test against the attack

This is the step everyone skips. Before you ship the rule, you have to verify it actually catches the attack you wrote it for. The way to do that is run the attack in a controlled environment and see if the rule fires. Tools that help: Atomic Red Team (a library of commands that execute specific ATT&CK techniques), Caldera (a more elaborate framework for chained techniques), and your own homegrown scripts.

If the rule doesn't fire, it's broken. Fix it before you ship.
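
The test loop itself is small. In this sketch the attack execution and the alert query are stubbed out; in practice the first shells out to something like Atomic Red Team and the second hits your SIEM's API:

```python
# Step 5 in miniature: run the attack, then check whether the rule fired.
# execute_attack and query_alerts are stand-ins for your harness and SIEM API.
def run_detection_test(execute_attack, query_alerts, rule_id: str) -> bool:
    """False means the rule is broken: fix it before you ship."""
    execute_attack()
    return any(alert["rule_id"] == rule_id for alert in query_alerts())
```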

Step 6: Test against false positives

Equally important: test the rule against a week of historical normal-traffic logs. How many times does it fire? If it fires 200 times a week, you have a noise problem. Tune until you have a manageable rate (we aim for <5 fires per rule per week as a starting point; tighter for low-priority rules, looser for the most critical ones).
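
The same check as code, assuming a rule is any predicate over events and using the 5-fires-per-week starting budget from above:

```python
# Assumed starting budget from the text; tune per rule priority.
WEEKLY_FIRE_BUDGET = 5

def weekly_fire_rate(rule, historical_events: list[dict]) -> int:
    """Replay a week of historical events through the rule and count fires."""
    return sum(1 for event in historical_events if rule(event))

def within_budget(rule, historical_events: list[dict]) -> bool:
    return weekly_fire_rate(rule, historical_events) <= WEEKLY_FIRE_BUDGET
```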

Step 7: Ship with documentation

Every rule needs documentation: which ATT&CK technique it covers, what the rule logic does, what the expected fire rate is, and — most importantly — what the analyst should do when it fires. A rule without a runbook is a noise generator.
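
One way to make the documentation requirement enforceable is to model it as a structure and reject rules without a runbook in review or CI. The field names here are illustrative:

```python
from dataclasses import dataclass

# Illustrative documentation contract for Step 7; enforce it in review or CI.
@dataclass
class DetectionRule:
    rule_id: str
    attack_technique: str       # e.g. "T1110.003"
    logic_summary: str
    expected_weekly_fires: int
    runbook: str                # what the analyst does when it fires

    def is_shippable(self) -> bool:
        """No runbook or no technique mapping: it's a noise generator."""
        return bool(self.runbook.strip()) and bool(self.attack_technique)
```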

Step 8: Feedback loop

After the rule is in production, monitor it. If it never fires, either you have no attacks of that type or the rule is broken. If it fires too much, tune it. If the fires it produces are always false positives, kill it and write a different one. Detection is iterative.

5. Purple teaming, where the leverage actually is

A purple team is a joint exercise between attackers (red) and defenders (blue) where the goal isn't competition — it's calibration. The red team runs ATT&CK techniques in a controlled way, the blue team watches their telemetry, and together they figure out which detections fire, which don't, and why.

The format we use:

  1. Pick 10 techniques to test in a single session, balanced across tactics (initial access, execution, persistence, lateral movement, exfiltration).
  2. The red team runs each technique, one at a time, with a clear timestamp and source attribution. "At 14:32, from this host, I will run a kerberoast against this account."
  3. The blue team watches and records one of three results for each technique: detected and alerted, detected in telemetry but not alerted, or not detected at all.
  4. For every "not detected" or "telemetry but not alerted," the team writes a detection on the spot. Same loop as the rule-writing workflow above, compressed.
  5. At the end, the red team reruns the techniques. The new rules should fire.

A single full-day purple team session is worth more than a month of solo detection engineering. The reason is that the feedback loop is tight — you write a rule, you test it against the real attack five minutes later, and you know whether it works.
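
The session scorecard is simple enough to sketch: each tested technique lands in one of the three buckets from step 3, and everything that isn't "alerted" becomes rule-writing work:

```python
from collections import Counter

# The three outcomes from step 3 of the session format above.
OUTCOMES = ("alerted", "telemetry_only", "missed")

def session_summary(results: dict[str, str]) -> Counter:
    """results maps technique ID -> outcome; returns counts per bucket."""
    assert all(outcome in OUTCOMES for outcome in results.values())
    return Counter(results.values())

def needs_new_rules(results: dict[str, str]) -> list[str]:
    """Techniques to write detections for on the spot (step 4)."""
    return sorted(t for t, outcome in results.items() if outcome != "alerted")
```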

6. Metrics that mean something

Most detection programs measure the wrong things: number of rules written, number of alerts per day, mean time to triage. Those are activity metrics, not effectiveness metrics. Here are the metrics we use instead.

  • Coverage by technique: what percentage of ATT&CK techniques relevant to your environment have at least one tested detection. Use the MITRE ATT&CK Navigator to visualize.
  • Detection-as-code coverage: what percentage of your rules are version-controlled, peer-reviewed, and tested in CI. (If the answer is "what's CI for detections," see section 7.)
  • Time to detect (TTD): from a tested attack execution, how long until the first alert fires. Measured in seconds for endpoint events, minutes for cloud control plane, ideally under 10 minutes for everything.
  • True positive rate per rule: of the alerts a rule produces in a quarter, what percentage were real incidents (not benign true positives or false positives).
  • Rules killed in the last quarter: a healthy detection program retires bad rules. If you're only adding, you're accumulating noise.

Don't measure all five at once. Pick two — coverage by technique and TTD — and report on them monthly. Add the others as the program matures.
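
Coverage by technique is a one-liner once you have a list of relevant techniques and a map of tested rules. The inputs here are illustrative:

```python
# Coverage by technique: fraction of relevant techniques with at least one
# *tested* detection. Technique and rule IDs below are illustrative.
def coverage_by_technique(relevant: set[str], tested_rules: dict[str, str]) -> float:
    """tested_rules maps rule ID -> ATT&CK technique ID (tested rules only)."""
    covered = relevant & set(tested_rules.values())
    return len(covered) / len(relevant) if relevant else 0.0
```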

7. Detection as code

Every detection rule should live in Git. Pull requests, code review, branching, the whole thing. The CI pipeline runs syntactic checks (does the rule parse?), unit tests (does it produce expected output on synthetic data?), and integration tests where possible (does it fire when we replay a known-attack capture?). Promotion to production happens through CI.

The benefits are the same as for application code: change history, peer review, the ability to roll back, the ability to refactor without losing state. The friction is the same too — it takes a quarter to get a team comfortable with the workflow. The investment pays back the first time someone tries to figure out why a rule was changed in an incident review.

Tools that help: Sigma (a vendor-neutral rule format), Panther (SIEM with detection-as-code built in), and the various open-source rule libraries that ship with Splunk, Elastic, and the others.

8. The noise problem and how to actually fix it

Every SOC complains about alert fatigue. The fix is rarely "tune the rules." The fix is usually "kill the rules." Most alert fatigue comes from a small number of rules generating the bulk of the noise. Audit your alert volume by rule, identify the top 20 producers, and for each one ask: in the last quarter, how many of these alerts turned into real incidents? If the answer is zero, the rule is noise. Kill it.

You'll feel anxious doing this — the rule must be there for a reason — but the reason is usually that someone wrote it during an incident two years ago and nobody has reviewed it since. Kill it. Your SOC will thank you. Coverage will go down by one technique; you can write a better rule for that technique later.
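
The audit itself is a few lines: count alerts by rule, take the top producers, and flag the ones that produced zero incidents last quarter. Names are illustrative:

```python
from collections import Counter

def kill_candidates(alerts: list[str], incident_rules: set[str], top_n: int = 20) -> list[str]:
    """Top alert producers (by rule ID) that turned into zero real incidents.

    alerts is one rule ID per alert fired this quarter; incident_rules is the
    set of rule IDs whose alerts led to at least one real incident.
    """
    top = [rule for rule, _count in Counter(alerts).most_common(top_n)]
    return [rule for rule in top if rule not in incident_rules]
```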

9. Integration with incident response

Detection without response is theatre. Every rule should have a runbook attached: when this fires, here's what the analyst does. The runbook should be specific. Not "investigate the host" — "run these three queries, check these four things, escalate if any of them are non-empty." A good runbook turns a detection into a decision tree, and a decision tree is executable by an analyst at 3 a.m. who has never seen this rule fire before.

Tier the runbooks. For high-confidence alerts (a known critical detection on a sensitive asset), the runbook can include automation — quarantine the host, disable the user, snapshot the disk. For medium-confidence alerts, the runbook escalates to a human. For low-confidence alerts, the runbook puts the alert in a triage queue. Don't put low-confidence alerts in front of your analysts; they'll learn to ignore them.
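
The tiering reads naturally as a dispatch table. The actions here are placeholders for whatever automation or SOAR hooks you actually have:

```python
# Confidence-tiered routing from the text; action strings are placeholders
# for your automation/SOAR hooks.
def route_alert(confidence: str) -> str:
    return {
        "high": "automate: quarantine host, disable user, snapshot disk",
        "medium": "escalate to on-call analyst",
        "low": "triage queue (never the analyst's main view)",
    }[confidence]
```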

10. Putting it together — what good looks like

A mature detection engineering program has these properties: every rule is mapped to an ATT&CK technique, every rule has a documented test that proves it catches the attack, every rule has a runbook attached, the rules live in Git with peer review, the team tracks coverage by technique and TTD as their primary metrics, the team runs purple team exercises at least quarterly, and the team kills bad rules as aggressively as it adds new ones.

None of this requires a specific tool. We've seen great programs running on Splunk, Elastic, Panther, Sentinel, and homegrown ELK stacks. The tool is the lowest-leverage part of the decision; the workflow is what matters.

The teams that get this right go from "we hope our SIEM catches things" to "we know exactly which techniques we cover, which we don't, and which to invest in next." That's the difference between detection theatre and a detection program that actually catches attackers.

Want us to level up your detections?

Purple team workshops, ATT&CK-aligned detection content, and a measurable handoff to your SOC. Includes a follow-up retest of every gap we find.