Agentic AI for Incident Response

March 15, 2026

Introduction

Security operations have quietly entered a strange era. Organizations now collect more telemetry than they can realistically analyze. SaaS audit logs, identity telemetry, endpoint logs, and network captures all flow into centralized platforms, and warehousing that data has become the norm in many environments. As security practitioners we have long treated more data as a blessing, sometimes only in retrospect, but we are still figuring out how to meaningfully collect, correlate, and tag it for security context.1

You might argue that discrete alerts and SIEM correlation solve this information sprawl, and you would be generally correct. This is precisely how these platforms were designed to extract security value from massive datasets. But I challenge you to find a practitioner who has not felt the pressure of a false negative, an inaccurate correlation, or a rule that means well but simply will not be quiet long enough to fix. That frustration is not just anecdotal: qualitative work with SOC analysts has shown that “false positive” is often shorthand for alerts triggered by benign but poorly contextualized behavior, which is exactly the sort of gap agentic systems may be able to narrow if they are given environmental context and tight evaluation loops.1

Worse yet, analysts can realistically only investigate one alert at a time. Outside of elaborate orchestration playbooks or complex VQL scripts, of which I have authored some monsters, most investigations still occur on a signal-by-signal basis. In practice we compensate through delegation: “Jeff, take a look at those event logs and see what sticks out. I’ll start cutting apart the network telemetry.”

This is concurrency, but it is bounded and it scales poorly. Tools like Plaso can help tear through artifacts in bulk, but in environments spanning hundreds or thousands of hosts the problem remains. Exploring the data is only part of the challenge. Warehousing it, moving it, and querying it quickly becomes an operational burden of its own.

Agentic AI, buzzword though it may be, is beginning to find a legitimate niche in exploring this kind of abstract telemetry. Recent surveys of LLM-based multi-agent systems describe the same pattern many of us are discovering operationally: single agents struggle once the work requires decomposition, specialization, and iterative feedback, while multi-agent designs become more useful when tasks can be split into independently verifiable subtasks.2

The short version: We already parallelize incident response with people. Agentic systems are interesting because they let us parallelize methodology, not just labor.

Key Takeaways

If you only read one section of this article, read this one. The rest is just me unpacking the details:

  • Agentic AI becomes useful in incident response when it can decompose investigative work into bounded tasks, operate across large telemetry sets, and return structured outputs that other agents or humans can challenge.
  • The real value is not “AI intuition.” It is contextual analysis at scale: discovery data, historical case memory, and investigative tools can be combined so alerts are interpreted against the environment they occurred in.
  • False positives remain one of the biggest risks. Narrow agent scope, structured artifacts, and explicit devil’s advocate review are more important than clever prompting.
  • Human decision gates still matter for high-impact actions, especially containment and production detection deployment. The goal is faster, better-supported judgment, not blind automation.
  • As organizations take lessons learned and institutional memory seriously, they often discover they need something more durable than ad hoc notes. Agentic systems benefit from memory layers that can be searched, compared, and reused across cases.
  • In practice, the most useful multi-agent patterns are not exotic. They are orchestrator-worker decomposition, specialist teams, adversarial review, and disciplined convergence before action.

The Foundational Concepts

Agentic AI sounds far more formal than it usually is. In practice, it often starts with a collection of markdown files that define the scope within which an agent should operate. These agent definitions typically include tools, skills, and codified methodology. This structure allows an orchestrator to establish clear procedures for different agents to follow, while also enabling programmatic checks and balances. Those controls become critically important when agents interact with production systems. The strongest public writeups on production agent systems keep arriving at the same conclusion: orchestration and context management matter more than clever prompts, because most failure modes come from poor task decomposition, duplicated work, or missing guardrails rather than a lack of raw model intelligence.2
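
To make that less abstract, here is a minimal sketch of what one of those agent definitions might look like once loaded into code. The field names and the `within_scope` check are illustrative, not taken from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class AgentDefinition:
    """Illustrative in-memory form of a markdown agent definition."""
    name: str
    scope: str  # the analytical boundary the agent must stay inside
    tools: list[str] = field(default_factory=list)       # tools the orchestrator exposes
    methodology: list[str] = field(default_factory=list)  # codified procedure steps

triage = AgentDefinition(
    name="event-log-triage",
    scope="Windows event logs for a single host; no containment actions",
    tools=["EvtxECmd", "ip-enrichment"],
    methodology=[
        "Parse the provided EVTX files only",
        "Emit findings as structured records, never freeform prose",
        "Escalate anything requiring action to the orchestrator",
    ],
)

def within_scope(agent: AgentDefinition, requested_tool: str) -> bool:
    """A programmatic check the orchestrator can run before granting a tool call."""
    return requested_tool in agent.tools
```

The point is not the data structure; it is that scope and tooling live outside the prompt, where the orchestrator can enforce them mechanically.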

At its core, the idea is fairly simple. A team of concurrently running agents can be deployed to apply methodological expertise across large datasets and distill their findings into structured outputs for further analysis. Instead of relying entirely on static correlation rules, these agents can discover relationships between signals in a manner that more closely resembles how human analysts actually work through investigations.

When considering incident response workflows, the PICERL model provides a useful structure to build from. In this framework we might deploy an agent, or a team of agents, dedicated to each phase: preparation, identification, containment, eradication, recovery, and lessons learned. Later sections will discuss some of the limitations of mapping agents directly to this model, but as a conceptual starting point it serves as a useful way to reason about agentic workflows. NIST’s incident handling guidance still provides the most useful public baseline here: even though the newest revision reorganizes recommendations around CSF 2.0 rather than the older lifecycle language, the underlying discipline of preparation, analysis, containment, eradication, recovery, and post-incident improvement remains directly relevant.3

Diagram showing a conceptual PICERL-aligned agent pipeline across preparation, identification, containment, eradication, recovery, and lessons learned.
Figure 1. A conceptual PICERL-aligned agent pipeline, including the devil's advocate control between identification and later response phases.

A useful caveat: PICERL is a great way to reason about responsibilities. It is a terrible way to imagine how a real incident will politely behave on a timeline.

Preparation Agent

The preparation agent is largely responsible for evaluating an organization’s security posture. In practice, this may represent a team of agents dedicated to performing and evaluating vulnerability scanning and light penetration testing. Systems that approximate this model already exist in platforms such as Pentera, Vonahi, and NodeZero. While these tools do not and likely will not replace full penetration testing for some time, they serve as useful indicators during the preparedness phase. In many cases they are capable of identifying and distilling low-hanging fruit in ways that make them more advanced than traditional vulnerability scanners. MITRE’s CALDERA project is a useful public analogue for this phase because it demonstrates how adversary emulation can be automated in a way that reduces routine assessment effort while still being grounded in ATT&CK-modeled behavior.4

Truth be told, this is a meaningful step forward for organizations that cannot afford regular penetration tests or that simply need to evaluate their environments more often than consultants can reasonably be brought in.

Providing timely and accurate preparation guidance is also genuinely difficult. When new infrastructure, software, or services are introduced into an environment, evaluating their impact across the broader system landscape is rarely straightforward. It is not always as simple as gathering the IT, security, and executive teams together to workshop the risks. These changes often require investigation across multiple systems and dependencies, which is where agentic AI can assist. Agents can inventory the new system, trace trust relationships and dependencies, compare its configuration against known baselines, and surface likely logging gaps or risky access paths. That does not replace human risk judgment, but it gives the people making those decisions a much clearer view of what they are actually introducing into the environment.

The development of incident response playbooks and tabletop exercises that accurately reflect real organizational topology is also notoriously difficult. Most practitioners have encountered canned tabletop scenarios and thought, “This doesn’t quite match our environment, but we will make it work.” Agentic AI offers a compelling alternative. By incorporating real system discovery and environmental context, it becomes possible to generate tabletop scenarios tailored to the organization itself. CISA’s tabletop exercise materials are valuable precisely because they are customizable; the missed opportunity in many environments is not a lack of templates, but a lack of current environmental data to tailor those templates in a way that reflects the systems people actually have to defend.5

For example, a Windows Server 2008 host discovered during system inventory might become the simulated beachhead for an adversary. From there the scenario could evolve into an attempt to execute Get-VeeamCredentials against a Veeam instance in the environment. These contextualized scenarios can produce far more valuable exercises than generic tabletop templates.

Why this matters: Generic tabletop exercises test whether people can follow a script. Contextualized exercises test whether your environment is actually ready.

Identification Agent

Again, think of this as a team of agents rather than a neat little box. The identification agent can be fed with data from the preparation agent’s corpus, which we will discuss further when we examine Agentic Data Contracts. This creates a capability that security practitioners have been trying, and often failing, to implement well for years: real contextual awareness.

User and Entity Behavior Analytics (UEBA) has approximated this concept for some time, but agentic systems expand on it significantly. In this model, agents are capable of recognizing externally facing systems, their operational quirks, and sometimes even the historical decisions that placed them in less secure states. For example, an organization may be required to keep legacy NTLM authentication enabled in order to communicate with an ancient piece of medical equipment in the office next door. Alerts will rarely contain this level of operational context. Agentic systems, however, can reference discovery data and environmental documentation to understand why these configurations exist. That emphasis on contextualized alarms aligns closely with what SOC analysts themselves say they need from security tooling: not simply more alerts, but alerts that are explainable and grounded in legitimate organizational behavior.1

From Context To Detection

The identification agent is capable of far more than contextual awareness. While humans remain well suited to investigating discrete events, identification agents can simultaneously identify alerting behavior, classify it, and codify those findings into shared knowledge bases that enable further investigation across the environment.

In practical terms, that means a strong identification layer should be able to do at least four things well:

  • recover the environmental context that an alert usually omits
  • classify and cluster related behaviors across the available telemetry
  • turn repeated patterns into candidate detections while the investigation is still active
  • push newly identified behaviors back into retrospective hunting across historical data

One way to think about this process is as rapid rule creation during investigation. Instead of manually writing detection logic after an investigation concludes, the system continuously identifies clusters of behaviors and correlations during analysis. Humans naturally perform this kind of pattern recognition during investigations. The difference is that detection rules are often cumbersome to author and deploy in real time. Allowing agentic systems to pursue and codify these patterns creates an environment that supports an immediate feedback loop between investigation and detection.
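
A toy version of that loop helps illustrate the idea: cluster repeated behaviors by how many distinct hosts exhibit them, then promote anything past a threshold into a candidate detection for review. The event shape and threshold here are assumptions for illustration:

```python
def candidate_rules(events: list[dict], min_hosts: int = 3) -> list[dict]:
    """Group observed behaviors by pattern and promote those seen on enough
    distinct hosts into candidate detections for human or adversarial review."""
    hosts_by_pattern: dict[str, set[str]] = {}
    for ev in events:
        hosts_by_pattern.setdefault(ev["pattern"], set()).add(ev["host"])
    return [
        {"pattern": pat, "host_count": len(hosts), "status": "CANDIDATE"}
        for pat, hosts in hosts_by_pattern.items()
        if len(hosts) >= min_hosts
    ]

events = [
    {"host": "HOST-01", "pattern": "Get-MpPreference | ConvertTo-Json"},
    {"host": "HOST-02", "pattern": "Get-MpPreference | ConvertTo-Json"},
    {"host": "HOST-03", "pattern": "Get-MpPreference | ConvertTo-Json"},
    {"host": "HOST-01", "pattern": "whoami"},
]
rules = candidate_rules(events)
```

Real implementations cluster on far richer features than an exact command line, but the shape of the loop, identify during analysis and codify immediately, is the same.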

When properly tuned, alert triage becomes far more manageable. At the same time, retrospective threat hunting becomes significantly more powerful. Agents can take newly identified behaviors and search for them across historical telemetry or across additional hosts in the environment.

A major enabler of this capability is the use of Model Context Protocol services and command line tooling. Providing agents with tools such as EvtxECmd for Windows event log analysis or enrichment capabilities for external IP addresses allows them to perform meaningful investigation rather than simply speculate about possible causes. That distinction matters: recent guidance on evaluating LLMs for cybersecurity tasks argues that cyber assessments should emphasize realistic, tool-using tasks rather than trivia-style factual recall, because real defensive work depends on adapting to operational detail rather than merely reciting security knowledge.6

One challenge organizations encounter when adopting these systems is the potential for false positives. Agentic systems can generate convincing explanations for activity that ultimately proves benign. In practice, the most useful counterweights tend to be fairly simple:

  • limit the scope of individual agents so they operate within well defined analytical boundaries
  • require structured outputs instead of letting every finding remain a persuasive blob of freeform prose
  • introduce adversarial review agents whose job is to challenge conclusions and surface benign explanations

Those controls are not simply implementation preferences. They are core reliability mechanisms. Limiting scope reduces spurious reasoning, while structured outputs and adversarial review make it easier to compare competing interpretations instead of accepting the first plausible story the system tells.7
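
Enforcing structured outputs can be as unglamorous as a field check before a finding is allowed to proceed. A minimal sketch, with a hypothetical required-field set:

```python
# Hypothetical minimum schema for a finding; adjust to your artifact format.
REQUIRED_FIELDS = {"signal_id", "behavior", "events", "severity", "proposed_logic"}

def admit_finding(finding: dict) -> tuple[bool, list[str]]:
    """Reject findings that arrive as persuasive prose rather than structured
    records; missing fields are returned so the agent can be re-prompted."""
    missing = sorted(REQUIRED_FIELDS - finding.keys())
    return (not missing, missing)

ok, missing = admit_finding(
    {"signal_id": "SIG-005", "behavior": "Defender polling", "severity": 80}
)
```

Anything that fails the gate goes back to the agent, not forward to a human; the structured record is what makes later adversarial review comparable at all.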

Devil’s Advocate Agent

The devil’s advocate agent is a critical control in the agentic AI process. AI systems tend to be overly optimistic. When they generate a finding, they will often commit to that conclusion even when plausible alternative explanations exist. This behavior stems from how these models operate. They are pattern-matching systems that strongly associate certain behaviors with certain outcomes: stochastic parrots, to borrow the familiar phrase. When a pattern aligns with a known signal of malicious activity, the model may not initially consider contrarian indicators.8

As a result, asking an AI system to critique its own reasoning is often difficult. Once a conclusion has been reached within the same reasoning context, the model is predisposed to reinforce that conclusion rather than challenge it. That is one reason debate-style and reflection-style agent architectures are so interesting in practice: they create explicit mechanisms for critique, memory, and revision rather than assuming a single pass will get the answer right.7

This is where devil’s advocacy agents become valuable. By separating the reasoning process and removing the original investigative context, the system can approach the same problem from a different analytical perspective. In practice, I often want a devil’s advocate agent doing something as simple as:

Determine valid links between legitimate administrative activity and a given alert.

Simple prompts like this have repeatedly saved me from significant frustration while iterating on detection rules. By forcing the system to evaluate a finding without the original investigative bias, it becomes much more capable of exploring alternative explanations within the same dataset.
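
The mechanics matter less than the boundary: the devil's advocate receives the structured artifact but never the investigator's narrative. A sketch of that separation, where `call_model` is a stand-in for whatever LLM client you actually use:

```python
def call_model(system: str, user: str) -> str:
    """Stand-in for a real LLM client call; replace with your own."""
    raise NotImplementedError

def build_review_prompt(finding: dict) -> tuple[str, str]:
    """Build the devil's advocate prompt from the structured artifact alone.
    The investigator's freeform reasoning is deliberately stripped so the
    original conclusion cannot anchor the review."""
    system = (
        "You are a skeptical reviewer. Determine valid links between "
        "legitimate administrative activity and the alert described below, "
        "and list benign explanations that would weaken the finding."
    )
    artifact = "\n".join(
        f"{key}: {value}"
        for key, value in sorted(finding.items())
        if key != "investigator_notes"  # the context boundary
    )
    return system, artifact

def devils_advocate(finding: dict) -> str:
    return call_model(*build_review_prompt(finding))
```

The `investigator_notes` exclusion is the whole trick: the reviewer sees what was observed, not what the first agent concluded about it.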

We use this pattern in practice when balancing seemingly suspicious activity against the broader detection ecosystem. A security investigator may surface behavior that strongly resembles attacker tradecraft, while the devil’s advocate agent is explicitly tasked with asking whether an existing security control, EDR workflow, or administrative process explains the same signal more convincingly.

In one redacted example, the investigator concluded that a remote management agent pattern had likely spread more broadly than the few confirmed service installs we could prove directly. The devil’s advocate review pushed back and identified a high-confidence benign explanation: the same Defender polling behavior was already being generated across the environment by a pre-existing security agent. That does not make the original signal useless, but it does change what we do with it. Instead of escalating a likely false compromise narrative into the detection pipeline, we can mark the candidate as benign, preserve the rationale, and focus engineering attention on signals that survive adversarial review.

### Signal: SIG-005 (Detection Gap)
**Title**: FindRemoteManagementAgentBroadReach — remote management agent Defender polling across 17 hosts implies wider compromise than 3 confirmed service installs
**Discovered**: 2026-03-10T00:00:00Z
**Status**: DA_REVIEWED
**Behavior**: The characteristic Defender reconnaissance pattern (`Get-MpPreference | ConvertTo-Json -Compress`, `Get-MpComputerStatus | ConvertTo-Json -Compress`) was observed on 17 distinct hosts beginning as early as 21:00:02 UTC — 10 minutes before the first confirmed service install event at 21:10:58. Only 3 of those hosts had a corresponding service creation record.
**Events**: EventId 400 (Windows PowerShell), 14 hosts showing the pattern with no corresponding 7045; earliest anchor host: `HOST-01` at `2026-03-09T21:00:02Z`.
**Payload fields**: See SIG-003 `HostApplication` values.
**Proposed logic**: This gap is best addressed by combining SIG-001 (novel 7045 for remote management agent) with SIG-003 (PowerShell 400 Defender polling). The combined signal would identify hosts where SIG-003 fires but SIG-001 never fired — implying a stealth install without 7045. This is a correlation gap, not a single-event gap.
**Severity**: 80 — confirmed attacker-pattern activity on domain controllers and production servers with no corresponding install record.
**Suppressors**: No known legitimate tool produces this exact pattern in this environment. Would need a per-environment remote management inventory baseline to suppress.

**DA Assessment**: CANDIDATES_FOUND
**DA Notes**: High-confidence benign candidate: an existing security agent. The "17-host pattern with no 7045" is explained entirely by pre-existing Defender polling from that agent. The pattern on the 14 non-7045 hosts predates the incident by 4–26 days per host (earliest: `HOST-02` on `2026-02-11`), which strongly weakens the stealth-install hypothesis.
**DA Findings**: `reports/redacted/da/SIG-005.md`

I generally avoid asking devil’s advocate agents for simple yes or no conclusions. Instead, the agent should examine the telemetry, alert, or hypothesis produced by another agent and introduce plausible explanations that may weaken or invalidate the original conclusion. The goal is to balance the system’s natural optimism by injecting skepticism into the analytic process.

In practice, I often place devil’s advocacy between security analysis agents and detection engineering agents. This helps prevent the introduction of rules that would trigger on common administrative behavior. These agents frequently identify activity generated by EDR platforms, legitimate administrator workflows, or routine background processes. By surfacing these explanations early, the detection pipeline avoids introducing fragile or noisy rules without requiring the analysis agents to manually comb through large volumes of event logs.

Context management is critical in these systems. By carefully controlling what each agent sees and how conclusions are challenged, agentic pipelines can remain both exploratory and disciplined in their reasoning.

Containment Agent

Deploying containment actions through agentic systems is a terrifying concept. I am not pretending otherwise. Do not do it blindly.

Practitioner note: If your organization cannot tolerate a bad automated containment decision, then your system should not be making one.

However, if the agent defers containment decisions to human approval gates, the human becomes an analyzer of the incident as a whole rather than an investigator of isolated pieces of telemetry. This creates a useful control point while still allowing organizations to benefit from the scalability of agentic systems. That bias toward human review is also consistent with current public work on LLMs in cybersecurity, which repeatedly emphasizes evaluation against realistic tasks, careful interpretation of outputs, and safeguards around high-impact use cases rather than blind operational trust.6

Some organizations may choose to allow automated actions such as account lockouts if they have strong enough safeguards in place. That decision ultimately depends on organizational tolerance for risk and the consequences of a malfunctioning system. In practice, I have found it far more valuable to ask LLMs to produce containment recommendations in the form of concise executive summaries. This allows me to rapidly evaluate whether an alert represents a true or false positive.

In my own agentic workflows these summaries are paired with auditable artifacts and prebuilt queries that allow quick validation of the system’s conclusions. The goal is not blind trust in the AI’s output, but rapid verification. Regardless of how advanced these systems become, availability remains a core pillar of cybersecurity. We should not become the architects of our own failure by delegating business-impacting containment decisions entirely to autonomous systems.
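
In code, that pairing can be as simple as a proposal record that carries its own evidence and validation queries, plus a status only a named human can change. The field names here are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContainmentProposal:
    """A containment action the agent may propose but never execute itself."""
    action: str                    # e.g. "disable a specific account"
    summary: str                   # concise executive summary for the approver
    evidence: list                 # paths to auditable artifacts
    validation_queries: list       # prebuilt queries to verify the conclusion
    status: str = "PENDING_HUMAN_APPROVAL"
    decided_by: Optional[str] = None

def approve(p: ContainmentProposal, analyst: str) -> ContainmentProposal:
    # The decision gate: nothing executes until a named human signs off.
    p.status, p.decided_by = "APPROVED", analyst
    return p

proposal = ContainmentProposal(
    action="Disable service account used for lateral movement",
    summary="Account abused for lateral movement; low business impact to disable.",
    evidence=["reports/redacted/da/SIG-005.md"],
    validation_queries=["EventId=4624 for the flagged account over the last 24h"],
)
```

Whether approval then triggers automation or a runbook is an organizational choice; the record of who decided, and on what evidence, is the part worth standardizing.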

Instead, containment agents should augment skilled analysts by surfacing the evidence needed to make decisions quickly. In an active compromise, delaying containment may be unacceptable. At the same time, poorly targeted containment can cause significant operational disruption. Effective containment requires precision, and agentic systems can help analysts achieve that precision without removing the human decision point.

Eradication Agent

The eradication agent represents one of the more powerful components of an agentic incident response workflow. Once investigators accumulate a large set of indicators of compromise and related artifacts, distilling that information into an actionable remediation plan becomes its own problem.

Telling a CEO that Kerberos TGT theft means the krbtgt account needs two password resets is rarely helpful, and assigning system administrators to manually search for residual malware artifacts across the environment is often an inefficient use of time immediately following an incident.

Agentic systems are well suited for this stage. Once equipped with validated indicators and investigative context, they can execute remediation playbooks at scale. Tasks that previously relied on large distributed PowerShell scripts or manual remediation efforts can instead be executed by agents that actively search for indicators, remove them where appropriate, and report back when human intervention is required. There is already early published work pointing in this direction: recent research on automated response playbook generation uses structured incident data to produce CACAO-aligned response procedures, suggesting a plausible path from investigative output to reproducible remediation artifacts.9

Rather than treating eradication as a purely manual cleanup process, the model shifts toward enabling agents to perform targeted threat hunting with full knowledge of the incident context. This frees system administrators and security practitioners to focus on higher impact remediation efforts and strategic recovery tasks that should not be delegated entirely to automated systems. For example, agents might automatically deploy memory collection tools across affected hosts, analyze the collected telemetry for signs of in-memory implants, and surface confirmed indicators that require deeper remediation.
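
The fan-out itself is mundane. A sketch of a parallel sweep, where `hunt_host` stands in for whatever EDR or remote-execution API would actually perform the per-host hunt:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical validated indicators distilled from the investigation.
INDICATORS = ["remote-agent service install", "Defender polling one-liner"]

def hunt_host(host: str) -> dict:
    """Stand-in for a per-host hunt; in practice this drives an EDR or
    remote-execution API and returns confirmed hits for human review."""
    return {"host": host, "indicators_checked": len(INDICATORS), "hits": []}

def eradication_sweep(hosts: list, workers: int = 8) -> list:
    # Agents hunt in parallel; anything with hits escalates to a human.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hunt_host, hosts))

results = eradication_sweep([f"HOST-{i:02d}" for i in range(1, 18)])
```

The interesting engineering lives inside `hunt_host`; the sweep is just the scaffolding that lets incident context reach every host at once.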

Another way to phrase it: dedicate highly specialized human resources to critical tasks/systems, and let agentic AI handle tedium and implementation details.

Instead of collecting artifacts in bulk and warehousing them for later analysis, eradication becomes a live and interactive process. The environment can be searched dynamically, indicators can be validated in near real time, and remediation actions can be performed quickly while analysts maintain oversight.

Diagram comparing traditional eradication, which follows a collect, warehouse, analyze, and remediate flow, against agentic eradication, which loops through hunt, validate, remediate, and re-hunt.
Figure 2. Traditional eradication tends to batch analysis before action, while agentic eradication compresses the cycle into a tighter hunt and remediation loop.

Recovery Agent

Recovery represents the resumption of normal business operations without introducing additional risk. The latter requirement is often the more difficult one in practice.

Recovery agents become particularly valuable at this stage because of their ability to execute repetitive operational tasks at scale. An agent may take responsibility for rebuilding low value or low risk workstations from known good images, applying baseline configurations, and installing required software. It can also deploy remediation controls intended to reduce the risk profile of restored systems, or immediately audit those systems for lingering indicators of compromise.

The range of possible recovery actions is broad, but the level of oversight can be adjusted to match the criticality of the systems involved. In practice, this becomes a form of resource consolidation: agentic systems are well suited to routine deployment and validation tasks, while human responders stay focused on high value systems and specialized recovery work.

The end result is faster recovery, better allocation of human expertise, and a reduced likelihood that systems are returned to service with unresolved security concerns.

Lessons Learned Agent

The lessons learned phase is often the most difficult point in the timeline. It can be uncomfortable to discuss openly. Sometimes the conclusion is that a system was deployed without fully considering certain risks, leaving a gap in the organization’s security posture. But if the only lesson learned is that systems should have been deployed more securely, the conclusion is probably incomplete. Rarely is a security incident caused by one clean failure. More often, a handful of weaknesses line up at exactly the wrong time.

For example, how did the adversary obtain administrative access after the initial compromise? Was that escalation path something that could have been prevented earlier? Is it something that should now be incorporated into the organization’s defensive model? Agentic systems are particularly effective at identifying these patterns in retrospect and surfacing them in ways that are easier to understand.

They can also translate technical findings into summaries appropriate for different audiences. Using the earlier NTLM example, it can be difficult to explain to executives why continued reliance on legacy protocols introduces risk. Agentic systems can distill these findings into concise executive summaries that translate complex technical issues into actionable business intelligence after an incident.

Another area where these agents excel is identifying missing telemetry. Language models are strong pattern recognition systems. When reviewing an incident they can quickly identify where expected visibility was absent. If a particular logging source, monitoring tool, or defensive control would normally appear in similar investigations, the system can surface that gap and recommend ways to address it. In many cases this results in practical leads for system administrators and procurement teams to evaluate improvements to the environment.

In practical terms, a good lessons learned agent should be able to retain and resurface at least a few categories of knowledge:

  • investigative facts worth revisiting later, such as attack paths, exploited conditions, and key indicators
  • defensive gaps, such as missing telemetry, brittle detections, or places where human escalation lagged
  • operational follow-ups, such as candidate tabletop scenarios, remediation tasks, and regression checks
  • rejected hypotheses, so future agents do not keep rediscovering the same dead ends without new evidence

Lessons learned agents can also feed their findings back into a shared knowledge base. This context becomes increasingly valuable as the incident response cycle repeats. If a major vulnerability was recently remediated in a Microsoft SQL deployment, that condition should be monitored for regression in the future. It should also become a candidate for future tabletop exercises and defensive testing. Architectures that explicitly preserve prior failures and feedback in memory are still an active research area, but the direction is clear: systems improve when they can retain structured reflections rather than forcing every investigation to start from scratch.7

That evolution usually happens in stages. Teams often begin with human-readable notes, graduate into a more structured case warehouse, and eventually discover that supporting cross-case retrieval for both humans and agents pushes them toward something more queryable and durable:

| Memory approach | Works well when | Starts to break when |
| --- | --- | --- |
| ad hoc notes and markdown documents | the goal is lightweight human-oriented summaries and one-off retrospectives | querying becomes common, consistency matters, or agents need to reuse material across cases |
| case warehouse with structured artifacts | investigations need repeatable structure, preserved evidence, and cleaner case separation | cross-case comparison and historical retrieval become routine rather than occasional |
| queryable institutional memory layer | prior incidents, remediations, and hypotheses must be searched and surfaced reliably by both humans and agents | the organization is unwilling to take on the schema and operational discipline that this implies |
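
That last stage is less exotic than it sounds. Even a single SQLite table gets you cross-case retrieval that both humans and agents can call before re-investigating a dead end. A minimal sketch with made-up case records:

```python
import sqlite3

# A minimal queryable memory layer: one table, searched by keyword.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE case_memory (case_id TEXT, category TEXT, note TEXT)")
con.executemany(
    "INSERT INTO case_memory VALUES (?, ?, ?)",
    [
        ("IR-2026-003", "defensive_gap", "No PowerShell script block logging on member servers"),
        ("IR-2026-003", "rejected_hypothesis", "Stealth installs: Defender polling predates incident"),
        ("IR-2026-007", "follow_up", "Re-test remediated MSSQL vulnerability quarterly"),
    ],
)

def recall(term: str) -> list:
    """Cross-case retrieval both humans and agents can call before re-investigating."""
    return con.execute(
        "SELECT case_id, category, note FROM case_memory WHERE note LIKE ?",
        (f"%{term}%",),
    ).fetchall()
```

A real deployment would want full-text search, embeddings, or both, but the discipline of writing structured records at case close is what makes any of that retrieval possible.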

These agents can also highlight attack paths that were available but not used. It is common to see an adversary enter through one access vector while other exposures remain present. An attacker might compromise an environment through an SSL VPN, even though RDP was publicly exposed the entire time. The fact that the adversary did not use a particular access path does not mean it was not viable. Lessons learned agents can surface these missed opportunities quickly so that they can be addressed before the next incident.

Finally, lessons learned agents can serve as interactive resources for stakeholders. Because they operate at the end of the incident response cycle, they often possess distilled context from every stage of the investigation. This allows them to answer questions about the incident holistically and provide decision makers with clearer insight into what occurred, why it happened, and what changes are necessary moving forward.

Another way to phrase it: the lessons learned phase is where the organization decides whether the incident was an expensive interruption or an expensive education.

Parallelizing Agentic AI

While PICERL is often presented as a largely synchronous process, reality tends to look very different. As the saying goes, everyone has a plan until they get hit in the face. Incident response behaves much the same way. The lifecycle is described as a structured sequence, but real incidents often involve identification, eradication, containment, and recovery happening at once in a feedback loop that looks more like controlled chaos than a clean flow diagram.

Fortunately, concurrency in agentic systems is relatively inexpensive when compared to human capital. Detection rules can be authored and deployed while agents simultaneously pursue additional threat hunting hypotheses. Containment actions can be prepared or presented for approval while other agents continue mapping the environment and gathering context. As agents expand their understanding of an incident, they can spread outward through the environment in parallel, increasing investigative coverage far faster than a single analyst or small team could reasonably achieve.

By giving agents the ability to create and delegate background tasks, individual agents can effectively act as miniature orchestrators for their own investigative workflows. A single analysis agent might simultaneously investigate multiple hosts, generate several variants of a detection rule, or launch targeted hunts across different telemetry sources.

Diagram showing an investigative context branching into parallel hypotheses, converging through a devil's advocate stage, then flowing to a human decision gate and adversarial path shaping.
Figure 3. A concurrency model for branch exploration, parallel hypothesis testing, and adversarial path shaping with human decision gates.

Common Integration Patterns

When people discuss “agent teams” or “parallel agents,” the implementation is usually much less mystical than the branding would suggest. Most real systems tend to converge on the same handful of patterns, just with slightly different names depending on which framework, vendor, or research paper is currently in vogue.

The most obvious pattern is simple subtask fan-out. One parent agent receives the initial context, carves the problem into smaller bounded questions, and dispatches those questions to concurrent workers. Perhaps one worker inspects a single host, another validates a persistence mechanism, and another checks whether a specific indicator appears anywhere else in the telemetry corpus. This is probably the most practical integration pattern because it allows the system to widen investigative coverage without forcing every agent to drag around the full context of the incident. Both the academic survey literature and public engineering writeups from deployed systems point to this same orchestrator-worker pattern as the practical default once work can be decomposed into parallel threads with clearly bounded outputs.2
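As a rough sketch of that fan-out, the orchestrator carves the case into bounded questions and dispatches them to concurrent workers. Here `run_worker` is a hypothetical stand-in for a real agent call, shown with Python's thread pool; every name in this snippet is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_worker(question: dict) -> dict:
    # Placeholder for a real agent invocation; each worker answers one
    # bounded question and returns a small structured result.
    return {"question": question["id"],
            "finding": f"checked {question['scope']}",
            "confidence": "low"}

def fan_out(questions: list[dict], max_workers: int = 4) -> list[dict]:
    """Dispatch bounded subtasks concurrently and collect structured results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_worker, q): q for q in questions}
        for fut in as_completed(futures):
            results.append(fut.result())
    return results

findings = fan_out([
    {"id": "host-web01", "scope": "host web01 autoruns"},
    {"id": "persistence", "scope": "scheduled task review"},
    {"id": "ioc-sweep", "scope": "hash sweep across telemetry"},
])
```

The important property is not the thread pool itself but the contract: each worker receives only its own bounded question and returns only a structured summary, never the full case context.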

Another common pattern is the specialist team. Instead of spawning several copies of the same worker, the orchestrator dispatches the work to agents with different operational scopes. One might focus on endpoint telemetry, another on identity, another on network telemetry, and another on detection engineering. This mirrors how human teams actually work. We do not generally ask one person to be the world’s foremost expert in Kusto, Windows internals, Azure sign-in logs, and remediation planning all at once, so there is little reason to force that expectation onto a model either.

There is also the reviewer pair or adversarial pair pattern, which I think is where things start to become genuinely useful. One agent proposes a conclusion, a remediation, or a detection rule. Another agent is then tasked with trying to tear that conclusion apart. Not politely, either. It should actively look for benign administrative explanations, contradictory telemetry, or practical reasons the recommendation would be noisy or harmful. This is still a concurrent pattern, but importantly it is not concurrency for the sake of raw speed. It is concurrency used to inject skepticism.
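A minimal sketch of the adversarial pair, with hypothetical `propose` and `challenge` functions standing in for the two agents. The critic starts from an assumption of legitimacy and consults a made-up catalog of known-benign tooling; none of these names reflect a real system:

```python
def propose(signal: dict) -> dict:
    # Investigator stance: treat the signal as suspicious until challenged.
    return {"signal": signal["name"],
            "verdict": "suspicious",
            "basis": "unusual parent process"}

def challenge(proposal: dict, benign_catalog: set[str]) -> dict:
    # Critic stance: assume legitimacy and hunt for a benign explanation.
    if proposal["signal"] in benign_catalog:
        return {"verdict": "benign-explanation-found",
                "explanation": "known remote management tool"}
    return {"verdict": "no-benign-explanation", "explanation": None}

proposal = propose({"name": "ScreenConnect.WindowsClient.exe"})
review = challenge(proposal, benign_catalog={"ScreenConnect.WindowsClient.exe"})
# An orchestrator or human gate then reconciles the two structured outputs.
```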

Finally, many systems end up with something that looks suspiciously like map-reduce, even if no one wants to call it that. Agents spread out across hosts, entities, or telemetry stores, gather partial findings, and then some coordinating process has to merge those findings back into a coherent incident narrative. This convergence step matters a lot. If you skip it, you have not built a parallel investigative system. You have simply built a much faster way to generate noise.
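That convergence step can be made explicit as a reduce over partial findings. This toy `merge_findings` function (my own naming, not any framework's API) deduplicates indicators and orders events into a single timeline:

```python
def merge_findings(partials: list[dict]) -> dict:
    """Reduce step: fold per-host partial findings into one incident narrative."""
    indicators: set[str] = set()
    timeline: list[tuple[str, str]] = []
    for p in partials:
        indicators.update(p.get("indicators", []))
        timeline.extend(p.get("events", []))
    timeline.sort()  # (timestamp, description) tuples sort chronologically
    return {"indicators": sorted(indicators), "timeline": timeline}

summary = merge_findings([
    {"indicators": ["1.2.3.4"],
     "events": [("2026-03-01T10:05Z", "VPN login")]},
    {"indicators": ["1.2.3.4", "evil.dll"],
     "events": [("2026-03-01T10:20Z", "DLL dropped")]},
])
```

Trivial as it looks, this is the step that turns parallel workers into an investigation rather than a pile of disconnected reports.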

So while the terminology varies (background tasks, tool workers, agent swarms, subagents, whatever else we decide to call them this quarter), the operational model is usually the same. Constrain the scope, let agents do independent work concurrently, force them to return structured outputs, and then make them converge before anything consequential happens.

If you strip away the branding, most of these systems are just combinations of four reusable moves:

  • fan out bounded subtasks so no single agent carries the whole case
  • specialize agents by telemetry source, domain, or operational responsibility
  • assign at least one reviewer or skeptic to challenge the first conclusion
  • force convergence back into a single structured narrative before action is taken

The anti-pattern: spawning more agents than you have synthesis for. That does not create insight. It creates a distributed denial of service against your own attention span.

Branch Exploration

As humans, we are generally limited to exploring a single hypothesis at a time. Unless we are preemptively writing multiple detection rules or queries, we tend to follow one investigative thread until it either proves useful or leads to a dead end. A telemetry source may point us toward another artifact, which leads to another system, and so on until the path is exhausted.

Occasionally an investigation presents several plausible branches of inquiry at once. In these cases we must evaluate each hypothesis sequentially, deciding which path to explore first while other possibilities remain untested.

Agentic systems change this dynamic. Concurrent investigations allow multiple hypotheses to be explored in parallel, each with the same initial investigative context. What might normally require a full team of analysts can instead be handled by a set of specialized agents executing subtasks concurrently within the same investigation.

Each branch becomes an independent investigation, and the system converges the agents only after those hypotheses have been evaluated. This is another natural point to introduce devil’s advocate agents and human decision gates.

Adversarial Path Shaping

Another interesting capability emerges when we consider adversarial path shaping in our concurrency model. A long-standing goal in security operations has been the ability to influence how an attacker moves through an environment. If defenders can guide an adversary toward highly observable systems or restricted pathways, they gain valuable time and visibility.

For example, if agents observe an adversary relying heavily on SMB for lateral movement, temporary firewall rules could be deployed to restrict SMB traffic across segments of the network. While this may not immediately stop the intrusion, it can slow the attacker and force them into alternative behaviors that generate additional telemetry. Each delay gives defenders more time to investigate, contain, and understand the scope of the incident.
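One way to keep shaping actions safe is to make every restriction temporary and approval-gated. The hypothetical `shape_path` helper below sketches that idea; the field names are assumptions for illustration, not a real firewall API:

```python
from datetime import datetime, timedelta, timezone

def shape_path(protocol: str, ttl_minutes: int) -> dict:
    """Draft a temporary, expiring restriction rather than a permanent block."""
    now = datetime.now(timezone.utc)
    return {
        "action": f"restrict {protocol} between workstation segments",
        "requires_approval": True,  # shaping actions still pass a human gate
        "expires_at": (now + timedelta(minutes=ttl_minutes)).isoformat(),
        "rollback": f"remove temporary {protocol} rule",
    }

rule = shape_path("SMB", ttl_minutes=120)
```

Expiry and rollback are first-class fields because a shaping action that outlives the incident quietly becomes an outage.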

This approach turns the response model from reactive investigation into active influence over the adversary’s operating conditions. If it can be done safely, that is a meaningful defensive advantage.

Agentic AI in Practice

The following architecture reflects a real-world implementation rather than a theoretical model.

At Huntress, I have personally worked on building agentic AI capabilities for incident response. Our implementation differs in some ways from the conceptual model described earlier, but many of the core principles remain the same.

Orchestration of the overall AI session is handled by a primary session agent. AGENTS.md files define telemetry sources, investigation scenarios, and behavioral constraints for the various agents. Because our threat hunting platform is built on Kusto and Azure Data Explorer, the examples here naturally reflect that environment.

When an investigation begins, we provide as much context as possible to an orchestrator agent. The orchestrator does not perform deep analysis. Instead, it distills the incoming signal into structured metadata and maintains operational control of the multi-agent workflow. It is important to keep the orchestrator’s context intentionally minimal. Its role is to reduce an event to a set of objective facts and then supervise the agents responsible for deeper analysis.

This design lets the system recover from failures and creates natural feedback loops between agents. Agents can export artifacts that other agents evaluate, propose remediations around, or use to generate alternative hypotheses.

Diagram showing a Huntress agentic incident response pipeline spanning orchestration, analysis, and detection lanes with investigator, devil's advocate, detection engineering, and reporting stages.
Figure 4. A practical multi-agent incident response pipeline showing how orchestration, investigation, detection engineering, and reporting interact in a production environment.

Data Warehouse Agent

Once the initial context has been established, a data warehouse agent is spawned. This agent manages the lifecycle of investigation artifacts, prevents cross contamination between cases, and ensures that each incident maintains a predictable structure. Importantly, other agents never receive the raw artifact data itself. They receive references such as filenames or structured pointers, which keeps their working context small while still allowing them to access what they need.
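A toy sketch of the reference-passing idea, assuming a simple on-disk layout. `CaseWarehouse` and its fields are illustrative inventions, not our production schema:

```python
import hashlib
import tempfile
from pathlib import Path

class CaseWarehouse:
    """Stores raw artifacts per case; other agents get references, never raw data."""

    def __init__(self, root: Path):
        self.root = root

    def store(self, case_id: str, name: str, data: bytes) -> dict:
        case_dir = self.root / case_id  # one directory per case prevents cross contamination
        case_dir.mkdir(parents=True, exist_ok=True)
        (case_dir / name).write_bytes(data)
        return {  # downstream agents only ever see this pointer
            "case": case_id,
            "artifact": name,
            "sha256": hashlib.sha256(data).hexdigest(),
        }

with tempfile.TemporaryDirectory() as root:
    warehouse = CaseWarehouse(Path(root))
    ref = warehouse.store("INC-0042", "autoruns.json", b"{}")
```

The hash in the reference doubles as an integrity check: any agent that later dereferences the pointer can verify the artifact has not drifted since it was stored.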

Once the warehouse agent has scaffolded the incident and created the initial artifact structure, the investigative phase begins.

Security Investigator Agent

The security investigator agent performs the initial investigative analysis. It is equipped with Model Context Protocol integrations and access to investigative tooling such as Kusto, Azure Data Explorer, ElasticSearch, Validin, Censys, and VirusTotal. Its scope is intentionally broad. Rather than aggressively filtering for false positives, we allow it to explore telemetry and surface hypotheses freely.

To manage the risk of context overload, the investigator agent can spawn background subtasks that execute independently. These subtasks perform targeted hunts or queries and return summarized results to the parent agent. This keeps the investigator focused on high value findings rather than raw telemetry exploration and makes the process remarkably fast: an investigation spanning more than seventy hosts can often complete in under twenty minutes.

Institutional Knowledge Through KQL Functions

We assist the investigator agent with a library of helper tools implemented as KQL functions. These queries were originally authored by human analysts and represent years of practical investigative methodology. AI was later used to standardize their output formats.

By wrapping these techniques into predictable functions, we allow the agent to apply institutional knowledge without requiring it to invent new analytical logic from scratch. The result is more deterministic and reliable output during investigations.
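Conceptually, each helper pairs a parameterized query with a fixed output schema, so agents call named techniques instead of improvising analytics. The snippet below is a deliberately simplified sketch; the helper name, table, and columns are invented for illustration and do not reflect our actual KQL function library:

```python
# Registry of institutional-knowledge helpers: each entry binds a
# parameterized KQL template to the schema it promises to return.
HELPERS = {
    "rare_parent_child": {
        "kql": ("ProcessEvents"
                " | where ParentName == '{parent}'"
                " | summarize count() by ChildName"),
        "returns": ["ChildName", "count_"],
    },
}

def render_helper(name: str, **params: str) -> str:
    """Render a named helper into an executable query string."""
    return HELPERS[name]["kql"].format(**params)

query = render_helper("rare_parent_child", parent="rundll32.exe")
```

Because the output schema is fixed per helper, downstream agents can parse results deterministically instead of reverse-engineering whatever column layout a freshly invented query happens to emit.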

The investigator agent can also reference past cases stored in the warehouse structure. We maintain a shared knowledge base containing indicators of compromise, adversary techniques, and previous investigative outcomes. This includes hypotheses that were rejected in earlier investigations so that agents avoid repeatedly exploring unproductive paths without new evidence. MITRE ATT&CK remains the most useful public backbone for this sort of organizational memory because it gives defenders a common vocabulary for tactics, techniques, detections, and coverage discussions across cases.10

Security Report and Detection Suggestions

The investigator agent produces two primary artifacts:

  • A security report describing what the system observed and how it arrived at those conclusions. This report is often imperfect, which is intentional. It reflects the system’s exploratory reasoning.
  • A structured list of detection suggestions. Each suggestion describes a behavioral pattern that the system believes may represent malicious activity and should potentially be monitored in the future.

These artifacts become the inputs to the next control stage.
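The detection-suggestion contract might look something like the following sketch; the exact fields here are assumptions for illustration, not our production schema:

```python
from dataclasses import dataclass, asdict, field
import json

@dataclass
class DetectionSuggestion:
    """Structured handoff between the investigator and downstream agents."""
    behavior: str                     # the pattern, not the single triggering event
    rationale: str                    # why the investigator believes it matters
    telemetry: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)

suggestion = DetectionSuggestion(
    behavior="rundll32.exe spawning a child outside its usual set",
    rationale="observed spawning an unsigned binary from a user-writable path",
    telemetry=["ProcessEvents"],
    examples=["rundll32.exe -> C:\\Users\\Public\\a.exe"],
)
payload = json.dumps(asdict(suggestion))  # serialized for the next control stage
```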

Devil’s Advocate Agent

The investigator output is passed to a devil’s advocate agent. This agent has access to the same investigative tools as the investigator but receives none of its reasoning context.

Its purpose is to explore alternative explanations for the behaviors identified by the investigator. In practice, this agent frequently surfaces legitimate explanations such as remote management tools, endpoint detection activity, legitimate drivers, or common administrative workflows.

By assuming legitimacy rather than maliciousness, the devil’s advocate agent counterbalances the optimistic bias of the investigator. Because the investigator exports its findings using a structured data contract, multiple devil’s advocate agents can be spawned concurrently to evaluate different signals.

The output of this stage is another structured artifact describing the signal under review, any benign hypotheses discovered, and the evidence examined.

Detection Engineer Agent

The next agent in the pipeline is the detection engineer agent. Like the previous stage, detection engineers can be spawned concurrently for individual signals. The detection engineer receives only two inputs: the investigator’s detection suggestion and the devil’s advocate analysis. Without additional context from earlier stages, the agent focuses purely on evaluating whether a reliable detection can be built.

The agent explores telemetry across a representative dataset to evaluate potential false positives. Instead of focusing on the specific example that triggered the detection suggestion, it attempts to infer the underlying behavioral pattern.

For example, the investigator might suggest that rundll32.exe spawning a particular executable is suspicious. The detection engineer may instead recognize that the real pattern is that rundll32.exe typically spawns a predictable set of child processes. The detection rule can then be generalized to identify deviations from that pattern.
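That generalization step can be sketched as a simple frequency baseline: learn which children a parent normally spawns from historical telemetry, then flag deviations. This toy Python version stands in for the equivalent KQL; the threshold and tuple shape are assumptions:

```python
from collections import Counter

def baseline_children(events: list[tuple[str, str]], parent: str,
                      min_share: float = 0.01) -> set[str]:
    """Learn the usual child processes of a parent from historical telemetry."""
    counts = Counter(child for p, child in events if p == parent)
    total = sum(counts.values())
    return {child for child, n in counts.items() if n / total >= min_share}

def deviations(events: list[tuple[str, str]], parent: str,
               baseline: set[str]) -> list[str]:
    """Flag children of the parent that fall outside the learned baseline."""
    return [child for p, child in events if p == parent and child not in baseline]

history = ([("rundll32.exe", "werfault.exe")] * 500
           + [("rundll32.exe", "conhost.exe")] * 300)
baseline = baseline_children(history, "rundll32.exe")
fresh = [("rundll32.exe", "powershell.exe"), ("rundll32.exe", "werfault.exe")]
alerts = deviations(fresh, "rundll32.exe", baseline)  # flags only powershell.exe
```

Note that the rule keys on the deviation, not the specific malicious binary, which is what lets it survive trivial attacker renaming.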

The detection engineer tests candidate rules across historical data to evaluate both true positives and false positives. If a rule cannot be tuned to an acceptable false positive rate, the agent is fully empowered to reject it.

Approved detections are exported as a structured contract describing the detection logic, the reasoning behind it, and the telemetry required.

KQL Contract Agent

The detection contract is passed to the KQL contract agent, which converts the detection description into production ready KQL functions. Instead of requiring the detection engineer to manage function interfaces or output schemas, the KQL contract agent handles this translation from example telemetry and optional negative test cases. The result is a .kql file containing a complete detection query.
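A toy sketch of that translation step, rendering a contract into a Kusto function definition. The contract fields and the query body are invented for illustration, and a real implementation would also emit the schema and test cases:

```python
def render_kql_function(contract: dict) -> str:
    """Translate a detection contract into a .kql function definition."""
    header = f"// {contract['name']}: {contract['reasoning']}"
    return (f"{header}\n"
            f".create-or-alter function {contract['name']}() {{\n"
            f"{contract['query']}\n"
            f"}}\n")

contract = {
    "name": "Rundll32UnusualChild",
    "reasoning": "rundll32.exe normally spawns a small, predictable child set",
    "query": "ProcessEvents | where ParentName == 'rundll32.exe'",
}
kql_text = render_kql_function(contract)
# kql_text would then be written out as Rundll32UnusualChild.kql
```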

Production Engineer Agent

The generated KQL file is passed to the production engineer agent. This is the only agent authorized to perform write operations against the KQL cluster.

All rule deployments are gated by human approval. The system presents the artifacts associated with the detection, including the investigative context, devil’s advocate analysis, and detection logic.

This gate is non-negotiable: the fastest way to lose trust in an agentic detection system is to let it silently ship bad rules into production.

If the rule is approved, the production engineer deploys it to the cluster. If it is rejected, the orchestrator routes the feedback back to the detection engineer for refinement or discards the rule entirely.
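The routing logic at this gate is simple enough to sketch directly; `route_review` is a hypothetical helper, not our actual implementation:

```python
def route_review(decision: str, rule: dict, feedback=None) -> dict:
    """Human gate: approve deploys; reject routes feedback back or discards."""
    if decision == "approve":
        return {"next": "deploy", "rule": rule["name"]}
    if decision == "reject" and feedback:
        return {"next": "refine", "rule": rule["name"], "feedback": feedback}
    return {"next": "discard", "rule": rule["name"]}

outcome = route_review("reject", {"name": "Rundll32UnusualChild"},
                       feedback="too noisy on print servers")
```

The key design choice is that rejection with feedback is a distinct outcome from rejection without it: the former feeds the detection engineer a concrete refinement target, while the latter ends the rule's lifecycle cleanly.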

This stage is where most human interaction occurs, particularly when refining detection logic.

Threat Intelligence and Adversary Analysis

Once a rule is deployed, several concurrent processes begin. The production engineer validates the rule against live telemetry, while an adversary tactics and threat intelligence agent analyzes the full case bundle. This enrichment process identifies potential adversary techniques and clusters similar behaviors across investigations for future reuse.

Reporting Agent

The final stage in the workflow is the reporting agent. This agent compiles the artifacts generated during the investigation and produces the final incident summary, including the incident narrative, detections introduced, artifacts collected, and actions taken. This becomes the primary organizational record of the investigation.

Conclusion

Agentic AI is not magic, and it is not a replacement for sound security engineering, disciplined investigation, or experienced responders. What it does offer is something far more practical: a way to apply security methodology concurrently, consistently, and at a scale that human teams alone often cannot sustain.

That is the real promise here. Not autonomous systems making reckless decisions in production, but structured teams of agents that can investigate, challenge one another, distill findings, and surface actionable conclusions for humans to validate. When designed carefully, these systems do not reduce the role of the practitioner. They amplify it. They free analysts from repetitive investigative drag, accelerate feedback loops between hunting and detection, and allow organizations to respond to incidents with a level of parallelism that better matches the scale of modern telemetry.

Security incidents are rarely linear, tidy, or kind enough to respect the boundaries of a lifecycle model. They sprawl, branch, and evolve while you, the security practitioner, are still trying to understand them. If agentic AI has a legitimate place in incident response, it is here: helping defenders impose structure on chaos without losing the nuance and skepticism that good security work demands.

The goal is not to build a machine that replaces the investigator. The goal is to build systems that let investigators move faster, reason more broadly, and defend environments with a depth that would otherwise be operationally out of reach.

Footnotes

  1. Bushra A. Alahmadi, Louise Axon, and Ivan Martinovic, “99% False Positives: A Qualitative Study of SOC Analysts’ Perspectives on Security Alarms,” 31st USENIX Security Symposium (USENIX Security ‘22), 2022. This is one of the strongest public studies on alert fatigue because it shows that many painful “false positives” are actually legitimate behaviors explained by missing context, and it argues that useful alarms should be reliable, explainable, analytical, contextual, and transferable.

  2. Taicheng Guo et al., “Large Language Model Based Multi-agents: A Survey of Progress and Challenges,” Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024; Anthropic, “How we built our multi-agent research system,” June 13, 2025. The survey is the scholarly anchor; Anthropic’s engineering writeup is a useful implementation companion because it describes the same orchestrator-worker and context-management tradeoffs in production.

  3. Paul Cichonski et al., “Computer Security Incident Handling Guide,” NIST SP 800-61 Rev. 2, 2012; NIST, “Incident Response Recommendations and Considerations for Cybersecurity Risk Management,” SP 800-61 Rev. 3 announcement, April 3, 2025. I am using Rev. 2 for the familiar lifecycle framing and Rev. 3 to acknowledge the current NIST posture as of April 3, 2025.

  4. MITRE, “CALDERA.” CALDERA is a strong public reference point for the preparation phase because it shows how automated adversary emulation can support security posture validation and detection improvement without pretending to replace full red teaming.

  5. CISA, “CISA Tabletop Exercise Packages.” These materials matter here because they are explicitly designed to be customized; agentic systems become most interesting when they can tailor scenarios to the organization’s real topology instead of reusing generic exercises.

  6. Jeff Gennari, Shing-hon Lau, Samuel Perl, Joel Parish, and Girish Sastry, “Considerations for Evaluating Large Language Models for Cybersecurity Tasks,” Software Engineering Institute, 2024. This is a good grounding reference for cyber-specific LLM evaluation because it argues for realistic task design, careful scoping, and minimizing misleading results in high-stakes workflows.

  7. Yilun Du et al., “Improving Factuality and Reasoning in Language Models through Multiagent Debate,” Proceedings of Machine Learning Research 235, 2024; Noah Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning,” 2023. Debate is the best public analogue I know for the “devil’s advocate” pattern, while Reflexion is useful for the idea that systems can preserve structured feedback and learn from prior failures without retraining the model itself.

  8. Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?,” FAccT ‘21, 2021. I am using this citation narrowly for the caution that coherent language should not be mistaken for grounded reasoning or accountability.

  9. Alina-Elena Oprea et al., “Automated Generation of Cybersecurity Response Playbooks via Large Language Models,” Procedia Computer Science 270, 2025. This is early work, but it is directly relevant because it connects structured incident data to CACAO-aligned response playbooks rather than treating remediation as free-form text generation.

  10. MITRE, “MITRE ATT&CK.” ATT&CK is the shared public knowledge base I would reach for first when building organizational memory around adversary techniques, hypotheses, and coverage gaps across investigations.