AI Agent Observability: Seeing What Your AI Is Actually Doing
Introduction
The scariest thing about autonomous AI agents isn’t that they fail — it’s that you don’t know they’re failing until it’s too late.
An AI agent is a system where an LLM makes decisions, calls tools, browses the web, reads files, and executes actions — all with varying degrees of autonomy. When everything works, it’s magic. But when something goes wrong — an agent exfiltrates data, deletes a database, hallucinates a transaction, or follows an injection payload — how do you even know?
You need observability: the ability to trace every decision, measure every output, and detect anomalies before they become incidents.
In our previous posts, we covered the attack side — Insecure Agent Design (the problem of excessive agency), Prompt Injection (the root cause of most exploits), and MLSecOps Pipeline Security (securing the infra). This post is the defense: observability for AI agents.
Why observability matters more for AI agents than traditional software
Traditional software is deterministic — given the same input, it produces the same output. An LLM agent is stochastic, tool-calling, and stateful. A production bug in a web server returns a 500 error. A production bug in an AI agent deletes your database and credits the attacker’s crypto wallet.
The Observability Stack for AI Agents
Traditional observability (logs, metrics, traces) catches infrastructure failures. AI agent observability must catch behavioral failures — the agent doing the wrong thing even though every individual step succeeded.
Here’s the layered stack you need:
| Layer | What It Tracks | Tooling |
|---|---|---|
| Tracing | Every LLM call, tool invocation, and decision step | LangFuse, LangSmith, Arize Phoenix |
| Monitoring | Latency, token usage, error rates, refusal rates | Datadog, Grafana, Prometheus |
| Evaluation | Output quality, safety, alignment | DeepEval, RAGAS, custom evals |
| Auditing | Full history of agent actions for compliance | Custom logging, vector DB, S3/Blob |
| Alerting | Anomalous behavior detection | Rule-based + ML anomaly detection |
Not “if” but “when”
You will deploy an agent that does something unexpected. The question is whether you’ll find out from your observability dashboard — or from a user on Reddit.
Architecture Overview
The following diagram shows how observability fits into a production agent system:
flowchart TB
subgraph "Agent System"
A[User Input] --> B[Agent Orchestrator]
B --> C[LLM Call]
C --> D[Tool Selection]
D --> E[Tool Execution]
E --> F[Response Generation]
F --> G[User Output]
end
subgraph "Observability Layer"
H[Tracer] -->|Spans| I[(Trace Store)]
J[Metrics Collector] --> K[(Prometheus)]
L[Eval Pipeline] --> M[(Eval Results)]
N[Audit Logger] --> O[(Audit Archive)]
end
subgraph "Alerting & Dashboards"
P[Anomaly Detector] --> Q[Alert Manager]
Q --> R[PagerDuty / Slack]
S[Grafana Dashboard] --> T[Human Operator]
end
C -.->|trace| H
D -.->|trace| H
E -.->|trace| H
E -.->|latency, tokens| J
F -.->|eval| L
B -.->|every action| N
K -.->|query| S
M -.->|score| S
I -.->|query| T
O -.->|compliance audit| T
M -.->|threshold breach| P
K -.->|metric spike| P
Every LLM call, every tool execution, every decision rationale is captured. Metrics flow to Prometheus for real-time dashboards. Eval pipelines score outputs offline. Anomaly detection triggers alerts. Everything also goes to an immutable audit log for compliance.
Implementation: Tracing Agent Decisions
A Custom Tracer Decorator
The foundation of agent observability is tracing — capturing every step as a structured span with inputs, outputs, timestamps, and rationale. Here’s a Python tracer you can wrap around any agent function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
import json
import time
import uuid
from datetime import datetime, timezone
from functools import wraps
from typing import Any, Callable, Optional
class TraceSpan:
"""A single span in a trace — represents one step of agent execution."""
def __init__(self, name: str, trace_id: str, parent_id: Optional[str] = None):
self.span_id = uuid.uuid4().hex[:16]
self.trace_id = trace_id
self.parent_id = parent_id
self.name = name
self.start_time = datetime.now(timezone.utc)
self.end_time: Optional[datetime] = None
self.input: Any = None
self.output: Any = None
self.metadata: dict = {}
self.error: Optional[str] = None
def close(self):
self.end_time = datetime.now(timezone.utc)
@property
def duration_ms(self) -> float:
if self.end_time is None:
return 0.0
return (self.end_time - self.start_time).total_seconds() * 1000
def to_dict(self) -> dict:
return {
"span_id": self.span_id,
"trace_id": self.trace_id,
"parent_id": self.parent_id,
"name": self.name,
"start_time": self.start_time.isoformat(),
"end_time": self.end_time.isoformat() if self.end_time else None,
"duration_ms": self.duration_ms,
"input": self.input,
"output": self.output,
"metadata": self.metadata,
"error": self.error,
}
class AgentTracer:
"""Collects and exports traces for an agent session."""
def __init__(self):
self.spans: list[TraceSpan] = []
self._current_span: Optional[TraceSpan] = None
def trace(self, name: str):
"""Decorator: wrap any agent step (LLM call, tool exec, decision)."""
def decorator(func: Callable):
@wraps(func)
def wrapper(*args, **kwargs):
parent_id = self._current_span.span_id if self._current_span else None
span = TraceSpan(name, trace_id=uuid.uuid4().hex[:16], parent_id=parent_id)
span.input = {"args": repr(args), "kwargs": {k: repr(v) for k, v in kwargs.items()}}
self.spans.append(span)
previous = self._current_span
self._current_span = span
try:
result = func(*args, **kwargs)
span.output = result
return result
except Exception as e:
span.error = str(e)
raise
finally:
span.close()
self._current_span = previous
return wrapper
return decorator
def export(self) -> list[dict]:
"""Export all spans for ingestion (LangFuse, S3, etc.)."""
return [s.to_dict() for s in self.spans]
# ── Usage example ──────────────────────────────────────────────────
tracer = AgentTracer()
class MyAgent:
@tracer.trace("llm_call")
def query_llm(self, prompt: str) -> str:
# Simulated LLM call
time.sleep(0.1)
return f"Response to: {prompt[:50]}..."
@tracer.trace("tool_exec")
def read_file(self, path: str) -> str:
with open(path, "r") as f:
return f.read()
@tracer.trace("decision")
def choose_tool(self, task: str, tools: list[str]) -> str:
# Simulated tool selection
return tools[0] if tools else "none"
# Run the agent
agent = MyAgent()
response = agent.query_llm("What files should I read?")
tool = agent.choose_tool("read logs", ["read_file", "search_web"])
content = agent.read_file("/tmp/app.log")
# Export traces
for span in tracer.export():
print(json.dumps(span, indent=2, default=str))
Anomalous Tool Call Sequence Detection
One of the most powerful observability patterns is detecting suspicious tool call chains — sequences that wouldn’t make sense for the agent’s legitimate task:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
from collections import defaultdict
from typing import list
SUSPICIOUS_CHAINS = [
# Data exfiltration patterns
["read_file", "compose_email", "send_email"],
["list_directory", "read_file", "http_request"],
["read_database", "write_file", "upload_file"],
# Privilege escalation patterns
["read_config", "execute_shell", "create_user"],
["list_instances", "create_instance", "modify_firewall"],
# Credential harvesting
["read_env", "http_request"],
["read_file", "base64_encode", "http_post"],
]
class ChainDetector:
def __init__(self, window: int = 5):
self.window = window
self.history: list[str] = []
def observe(self, tool_name: str) -> Optional[str]:
"""Record a tool call and check for suspicious chains."""
self.history.append(tool_name)
if len(self.history) > self.window:
self.history.pop(0)
for chain in SUSPICIOUS_CHAINS:
if len(chain) > len(self.history):
continue
# Check if the tail of history matches this chain
tail = self.history[-len(chain):]
if tail == chain:
return f"SUSPICIOUS CHAIN DETECTED: {' → '.join(chain)}"
return None
# ── Example ────────────────────────────────────────────────────────
detector = ChainDetector()
# Normal agent activity
for tool in ["read_file", "search_web", "read_file", "query_llm"]:
alert = detector.observe(tool)
if alert:
print(f"⚠️ {alert}")
# Anomalous activity
for tool in ["list_directory", "read_file", "http_request"]:
alert = detector.observe(tool)
if alert:
print(f"🚨 {alert}")
Detection is table stakes
Every agent deployment should have chain detection. The difference between a normal “helpful agent” trace and a data exfiltration trace is often just the order of 3-4 tool calls.
What to Monitor
Beyond traces, you need real-time metrics. Here are the seven signals that matter most for AI agent security:
1. Refusal Rate
The rate at which your agent refuses to execute a user request. A sudden drop means alignment is being bypassed.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Prometheus-style metric pseudocode
REFUSAL_RATE = gauge(
name="agent_refusal_rate",
help="Percentage of requests the agent refused",
)
# Alert rule: refusal rate drops below 1% (baseline is ~5%)
ALERT_RULE = """
- alert: RefusalRateDrop
expr: agent_refusal_rate < 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "Agent refusal rate dropped below 1% — possible jailbreak"
"""
2. Tool Call Frequency
Spikes in dangerous tools are red flags:
| Metric | Normal | Suspicious |
|---|---|---|
read_file calls/min | 2-10 | 50+ |
http_request calls/min | 1-5 | 30+ |
execute_shell calls/min | 0-1 | 10+ |
delete_* calls | 0 | Any |
3. Context Window Usage
Prompt injection attacks often stuff huge payloads into the context window. Monitor both token count and compression ratio:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
CONTEXT_USAGE = gauge(
name="agent_context_tokens",
help="Tokens consumed per LLM call",
labels=["agent_id", "model"],
)
# Alert: context usage > 80% of model limit
ALERT_RULE = """
- alert: ContextWindowSpike
expr: agent_context_tokens / agent_context_limit > 0.8
for: 2m
labels:
severity: warning
annotations:
summary: "Context window above 80% — possible injection payload"
"""
4. Response Entropy
Jailbroken models produce more variable, less predictable responses. Statistical entropy of token probabilities is a strong signal:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import math
from collections import Counter
def response_entropy(tokens: list[str]) -> float:
"""Compute Shannon entropy of token distribution."""
total = len(tokens)
freq = Counter(tokens)
entropy = -sum(
(count / total) * math.log2(count / total)
for count in freq.values()
)
return entropy / math.log2(total) if total > 1 else 0.0 # Normalized
# Baseline runs (benign queries): entropy ~0.3-0.5
# Jailbroken agent: entropy ~0.7-0.9
# Anomaly alert threshold: entropy > 0.65
5. Latency Distribution
Model extraction attacks have distinct latency signatures — attackers probe the model with structured inputs that produce unusually consistent response times:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
LATENCY_P95 = gauge(
name="agent_llm_latency_p95",
help="P95 latency of LLM calls in milliseconds",
)
# Alert: latency drops consistently below 200ms
# (indicates attacker is probing with cached/simple queries)
ALERT_RULE = """
- alert: LowLatencyProbe
expr: agent_llm_latency_p95 < 200
for: 10m
labels:
severity: warning
annotations:
summary: "Consistently low latency — possible extraction attack"
"""
6. Token Throughput
Data exfiltration shows up as massive token throughput spikes — the agent is reading and emitting far more data than its task requires:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Total tokens in + out per session
TOKEN_THROUGHPUT = counter(
name="agent_token_throughput",
help="Total tokens processed per session",
labels=["session_id"],
)
# Session-level threshold: 500K tokens in 1 minute
ALERT_RULE = """
- alert: TokenExfiltration
expr: rate(agent_token_throughput[1m]) > 500000
labels:
severity: critical
annotations:
summary: "Session processing >500K tokens/min — possible exfiltration"
"""
7. Tool Call Chains (Detailed)
We already covered chain detection above — but in production, you want this running as a streaming pipeline:
flowchart LR
A[Tool Call Event] --> B[Sliding Window Buffer]
B --> C[Chain Matcher]
C --> D{Match?}
D -->|Yes| E[🔴 Critical Alert]
D -->|No| F[✓ Normal Flow]
E --> G[Auto-Kill Agent Session]
E --> H[Freeze Tool Permissions]
E --> I[Notify Security Team]
Real Incidents Where Observability Would Have Helped
1. Auto-GPT Buying Domains (2023)
Auto-GPT autonomously purchased domains with real money because there was no budget guardrail and no tool call audit trail. With observability:
- A budget counter metric would have fired at the first dollar spent
- A tool call trace would have shown:
search_domain → add_to_cart → checkout - A chain detector would have flagged
add_to_cart → checkoutas suspicious without human confirmation
2. GitHub Issue Agent Cross-Repo Theft (2025)
An agent reading GitHub Issues was tricked into exfiltrating an entire private codebase. With observability:
- File access monitoring would have shown the agent reading 1000+ files in 30 seconds — far beyond any legitimate triage task
- Token throughput would have spiked as it encoded and exfiltrated those files
- Tool chain detection would have flagged
read_file → base64_encode → http_request
3. Email Assistant CVE-2024-5184
This vulnerability in an AI email assistant allowed prompt injection via email content to trigger unintended tool calls. With observability:
- Refusal rate would have dropped as the injected prompts overrode safety instructions
- Context window would have spiked with the injected payload
- Tool calls would have deviated from the agent’s normal pattern
Every incident in the AI security landscape shares a common thread: the absence of observability. Trace, monitor, evaluate, and alert — or accept that you’ll discover failures from your users, not your dashboards.
Integration with Existing Tools
You don’t need to build everything from scratch. The ecosystem has matured significantly:
W&B (Weights & Biases) for Experiment Tracking
W&B isn’t just for training runs — it’s excellent for tracking agent behavior across sessions:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import wandb
# Initialize a W&B run for your agent session
run = wandb.init(
project="ai-agent-observability",
config={
"agent_type": "code-assistant",
"llm_model": "gpt-4o",
"max_tool_calls": 100,
"guardrails": ["nemo", "custom-content-safety"],
},
)
# Log every agent step
for step in agent_steps:
wandb.log({
"step": step.number,
"tool_called": step.tool_name,
"latency_ms": step.latency_ms,
"tokens_in": step.tokens_in,
"tokens_out": step.tokens_out,
"refusal": int(step.was_refused),
"eval_safety_score": step.safety_score,
"eval_helpfulness": step.helpfulness_score,
})
run.finish()
LangFuse for LLM Tracing
LangFuse provides purpose-built tracing for LLM applications, including agent-specific features:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
langfuse = Langfuse(
secret_key="sk-lf-...",
public_key="pk-lf-...",
host="https://cloud.langfuse.com",
)
@observe()
def agent_step(task: str, tool: str, input_data: str) -> str:
"""Each agent step is automatically traced with LangFuse."""
# The @observe decorator captures timing, input, output
# Add custom metadata for tool-specific details
langfuse_context.update_current_observation(
input=input_data,
metadata={
"tool_name": tool,
"task_description": task,
"agent_version": "v2.1.0",
}
)
result = execute_tool(tool, input_data)
# Score the output automatically
langfuse.score(
name="safety",
value=compute_safety_score(result),
comment="Automated safety evaluation",
)
return result
Prometheus + Grafana for Metrics
Standardize on OpenMetrics format for all agent metrics:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
from prometheus_client import Counter, Histogram, Gauge, start_http_server
# Define agent metrics
AGENT_LLM_CALLS = Counter(
"agent_llm_calls_total",
"Total LLM calls",
["agent_id", "model", "status"],
)
AGENT_TOOL_CALLS = Counter(
"agent_tool_calls_total",
"Total tool calls",
["agent_id", "tool_name", "status"],
)
AGENT_LATENCY = Histogram(
"agent_step_duration_ms",
"Duration of agent steps",
["agent_id", "step_type"],
buckets=[50, 100, 200, 500, 1000, 2000, 5000],
)
AGENT_CONTEXT_TOKENS = Gauge(
"agent_context_tokens",
"Context tokens used per call",
["agent_id"],
)
# Start metrics server
start_http_server(8000)
# Use in your agent loop
def run_agent_step(agent_id: str, model: str, prompt: str):
AGENT_LLM_CALLS.labels(agent_id=agent_id, model=model, status="started").inc()
with AGENT_LATENCY.labels(agent_id=agent_id, step_type="llm").time():
response = llm_call(prompt)
AGENT_LLM_CALLS.labels(agent_id=agent_id, model=model, status="success").inc()
AGENT_CONTEXT_TOKENS.labels(agent_id=agent_id).set(count_tokens(prompt + response))
Custom Guardrails
Guardrails act as the last line of defense — they inspect agent outputs before execution:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
from typing import Any
class AgentGuardrail:
"""Check an agent action before it executes."""
def check_tool_call(
self,
tool_name: str,
tool_args: dict[str, Any],
context: dict[str, Any],
) -> tuple[bool, str]:
"""Return (allowed, reason)."""
raise NotImplementedError
def check_output(
self,
output: str,
context: dict[str, Any],
) -> tuple[bool, str]:
"""Return (passed, reason)."""
raise NotImplementedError
class BudgetGuardrail(AgentGuardrail):
"""Prevent agents from spending more than budget."""
def __init__(self, max_cost_per_session: float = 0.01):
self.max_cost = max_cost_per_session
self.total_spent = 0.0
def check_tool_call(
self,
tool_name: str,
tool_args: dict[str, Any],
context: dict[str, Any],
) -> tuple[bool, str]:
if tool_name == "checkout":
cost = tool_args.get("amount", 0)
if self.total_spent + cost > self.max_cost:
return False, f"Budget exceeded: ${self.total_spent + cost:.2f} > ${self.max_cost:.2f}"
return True, ""
class DataExfilGuardrail(AgentGuardrail):
"""Block large-scale data movement."""
def __init__(self, max_bytes: int = 1_000_000):
self.max_bytes = max_bytes
def check_tool_call(
self,
tool_name: str,
tool_args: dict[str, Any],
context: dict[str, Any],
) -> tuple[bool, str]:
if tool_name == "http_request":
data_size = len(str(tool_args.get("data", "")))
if data_size > self.max_bytes:
return False, f"Data size {data_size} exceeds limit {self.max_bytes}"
return True, ""
Building Your Observability Maturity Model
Organizations don’t go from zero to fully observable overnight. Here’s a maturity model to guide your journey:
| Level | Name | What You Have | What You’re Missing |
|---|---|---|---|
| 1 | Blind | Basic request logging | Traces, evals, real-time alerts |
| 2 | Aware | LLM call tracing (LangFuse) | Tool call monitoring, anomaly detection |
| 3 | Monitored | Full tracing + metrics (Prometheus) | Evals, guardrails, automated alerting |
| 4 | Autonomous | All layers + auto-remediation | Nothing — you’ve arrived |
Start at Level 1, target Level 3
Level 1 is dangerous but fast. Level 3 is safe but takes investment. Skip Level 4 until your organization has mature MLOps and security operations — autonomous remediation of agent failures is an advanced capability.
References
- LangFuse Documentation — Open-source LLM observability and tracing. https://langfuse.com/docs
- Arize AI — LLM Observability — Tracing, evaluation, and monitoring for LLM applications. https://arize.com/llm-observability/
- Guardrails AI — Guardrails for LLM outputs. https://www.guardrailsai.com/
- NVIDIA NeMo Guardrails — Open-source toolkit for adding guardrails to LLM systems. https://github.com/NVIDIA/NeMo-Guardrails
- OWASP LLM06 — Excessive Agency — The OWASP Top 10 for LLM Applications vulnerability classification. https://genai.owasp.org/llm-top-10/
- DeepEval — LLM evaluation framework. https://github.com/confident-ai/deepeval
- Prometheus & Grafana — Metrics and dashboarding. https://prometheus.io / https://grafana.com
- Weights & Biases — Experiment tracking for ML and LLM agents. https://wandb.ai
- The Auto-GPT Incident (2023) — First major autonomous agent failure in the wild. https://news.ycombinator.com/item?id=35220568
- CVE-2024-5184 — Email assistant prompt injection vulnerability. https://nvd.nist.gov/vuln/detail/CVE-2024-5184
This is Blog Post 5 of the AI Hacking & Defense Series. Read the companion posts: Insecure Agent Design (the vulnerability that observability mitigates), Prompt Injection (the attack vector), and MLSecOps Pipeline Security (securing the infrastructure).
