Agent Monitoring
Real-time visibility into agent execution with comprehensive session tracking, performance analytics, cost analysis, and resource optimization insights - all without a single line of instrumentation code.
Overview
Agent Inspector provides zero-instrumentation monitoring for AI agents during development and testing. Simply route your LLM calls through the proxy, and gain instant visibility into every session, tool call, token consumed, and dollar spent. Unlike production observability platforms, Agent Inspector is purpose-built for pre-production environments where you need deep, granular insights to validate agent behavior before deployment.
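For example, if your agent already uses an OpenAI-compatible SDK, routing through the proxy is typically just a base-URL override. This is a minimal sketch; the proxy address and model are placeholders, not Agent Inspector defaults:

```python
# A minimal sketch, assuming an OpenAI-compatible client. The proxy URL is a
# placeholder - use whatever address your Agent Inspector instance listens on.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical proxy address
    api_key="YOUR_PROVIDER_KEY",          # forwarded to the upstream provider
)

# Every call made through this client now shows up in the dashboard:
# sessions, tool calls, token counts, latency, and cost - no extra code.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize today's open tickets."}],
)
print(response.choices[0].message.content)
```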
The monitoring dashboard transforms raw execution data into actionable insights: identifying performance bottlenecks, detecting cost inefficiencies, tracking model usage patterns, and highlighting optimization opportunities - all in real-time as your agents run.
Session Timeline
The Sessions view provides a chronological record of every agent execution captured by Agent Inspector. Each session represents a complete agent interaction, from initial prompt to final response, with full traceability of all LLM calls, tool invocations, and state transitions.
Each session entry displays critical execution metadata:
- Session ID: Unique identifier for tracing and debugging
- Timestamp: When the session was executed
- Status: Completed successfully, failed, or timed out
- Duration: Total execution time from start to finish
- Messages: Number of LLM interactions in this session
- Tokens: Total tokens consumed (input + output)
- Tools Used: Which tools were invoked and how many times
- Model: LLM model and version used
Detailed Session Inspection
Click into any session for complete execution transparency. The session detail view reconstructs the entire conversation flow, showing the exact sequence of events, tool calls with their parameters and results, token consumption per step, and timing breakdown.
This level of visibility is critical for:
- Debugging unexpected agent behaviors
- Understanding why certain decisions were made
- Identifying inefficient tool usage patterns
- Tracking down performance bottlenecks
- Validating tool call parameters and responses
Token Usage Analytics
Token consumption directly impacts both cost and performance. Agent Inspector provides comprehensive token tracking to help you understand spending patterns, optimize prompts, and make informed decisions about model selection.
Key Metrics
The token usage dashboard shows:
- Total Tokens: Cumulative consumption across all sessions - your overall agent activity volume
- Estimated Cost: Total spending based on current API pricing
- Models Used: Number of unique models, indicating setup complexity
- Input vs. Output Distribution: Visual breakdown showing where your tokens (and money) go
- Token Usage by Model: Top 5 models ranked by consumption with input/output split
Input vs. Output Tokens
This chart shows how your tokens are split between input (sent to the model) and output (generated by the model). Input tokens are your prompts, context, and tool results. Output tokens are the model's responses.
For most modern LLMs, output tokens are often priced higher per token than input, so this ratio has a direct impact on your total spend (exact prices depend on the specific model you're using).
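To see how the split translates into spend, here is a small pricing sketch with hypothetical per-1K-token rates (real prices vary by model and provider):

```python
# Hypothetical prices - substitute your provider's actual rates.
INPUT_PRICE_PER_1K = 0.003   # $ per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.015  # $ per 1K output tokens

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one session given its input/output token split."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# Same total volume, opposite splits: the output-heavy session costs far more.
print(round(session_cost(8_000, 2_000), 3))  # input-heavy:  $0.054
print(round(session_cost(2_000, 8_000), 3))  # output-heavy: $0.126
```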
What "Healthy" Ratios Look Like
These are typical patterns, not strict rules:
- 40-60% input / 40-60% output
  Balanced usage. Common for many general-purpose agents.
- High output (70%+ of total tokens)
  Response-heavy agents: chat assistants, content generators, report writers.
  → You're paying mostly for generation.
- High input (70%+ of total tokens)
  Context-heavy agents: analyzers, RAG/search pipelines, classifiers, auditors.
  → You're paying mostly to feed the model large contexts.
Cost Considerations
If you see a very high output share on expensive models, it likely means long, verbose responses are driving most of your cost. You can often optimize by:
- Tightening response instructions (shorter or more structured answers).
- Using cheaper models for high-volume, verbose tasks.
- Reserving top-tier models for the steps where quality really matters.
High output isn't "bad" by itself. What matters is whether the pattern matches the job of the agent and your budget.
Optimization Opportunities
If input tokens are very high:
- Reduce system prompt length - remove unnecessary instructions
- Summarize conversation history instead of sending full context
- Remove redundant context or examples
- Use more concise tool descriptions
If output tokens are very high:
- Set stricter max_tokens limits (see the sketch after this list)
- Instruct the agent to be more concise in the system prompt
- Consider if the agent is generating unnecessary explanations
- Use cheaper models for verbose but simple responses
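For instance, a hard output cap plus a terser system prompt might look like the sketch below (OpenAI-style chat API assumed; parameter names differ across providers):

```python
# A sketch only: parameter names follow the OpenAI chat API and may differ
# elsewhere. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # A tighter system prompt nudges the model toward shorter answers.
        {"role": "system", "content": "Answer in at most three sentences. No preamble."},
        {"role": "user", "content": "Explain what P95 latency means."},
    ],
    max_tokens=150,  # hard cap on output tokens for this call
)
print(response.usage.completion_tokens)  # check how close you are to the cap
```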
Red Flags to Watch For
- 🚨 Sudden cost spikes - Investigate what changed in your agent or prompts
- 🚨 Very high output ratio with expensive models - Paying maximum rates unnecessarily
- 🚨 Single model dominating when cheaper alternatives exist for some tasks
- 🚨 Tokens per session increasing over time - Context or prompt bloat
Model Usage & Performance
Multi-model agents are increasingly common, but understanding which models are being used, why, and how they perform is complex. Agent Inspector tracks model selection patterns and performance characteristics across all sessions.
Model Distribution
See which models your agent uses and how frequently. Model usage tracking reveals:
- Model Adoption: Which models are being used across sessions
- Selection Frequency: How often each model is chosen
- Model Versions: Track version consistency and detect unintended upgrades
- Multi-Model Patterns: Understand when and why model switching occurs
Model Performance
Compare response times across models to spot performance bottlenecks and keep a consistent user experience. This view shows both Average and P95 (95th percentile) response times.
How to Read This Chart
Each model displays two bars:
- Average Response Time - Typical response duration across all calls.
- P95 Response Time - The time under which 95% of requests complete (the slowest 5% of responses take longer than this).
Why P95 Matters
Averages can hide serious issues. A model with a 1.5s average might still have a P95 of 8s, meaning 1 in 20 requests is painfully slow.
P95 surfaces inconsistent performance that the average masks, and is critical for:
- SLA design
- User experience guarantees
- Alerting and capacity planning
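If you want to reproduce these two numbers from raw response times, here is a minimal sketch (times in seconds):

```python
import statistics

def latency_summary(response_times: list[float]) -> tuple[float, float]:
    """Return (average, P95) for a list of response times in seconds."""
    avg = statistics.mean(response_times)
    # quantiles(n=20) returns the 5th..95th percentile cut points;
    # the last one is the 95th percentile.
    p95 = statistics.quantiles(response_times, n=20)[-1]
    return avg, p95

times = [1.1, 1.3, 0.9, 1.2, 1.4, 1.0, 7.8, 1.2, 1.1, 1.3]
avg, p95 = latency_summary(times)
print(f"avg={avg:.2f}s  p95={p95:.2f}s")  # one 7.8s outlier pulls P95 far above the average
```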
What to Look For
Use these as rules of thumb (not hard rules):
- P95 < 2× Average → ✅ Consistent, predictable performance
- P95 < 3 seconds → ✅ Good UX for most interactive use cases
- Large gap between Avg and P95 → ⚠️ High variance, investigate causes
- P95 > 10 seconds → 🚨 Users are likely experiencing frustrating delays
Example:
- Model A: Avg 1.2s, P95 2.5s → Consistent, reliable for production
- Model B: Avg 1.5s, P95 8.2s → Unreliable, high variance, needs investigation
Model A is preferable for production not just because it's slightly faster on average, but because its performance is predictable.
Performance Optimization
If all models are slow:
- Reduce unnecessary context (shorten conversation history / retrieved docs).
- Trim long system prompts and templates.
- Use streaming responses to improve perceived latency.
If one model is consistently slower:
- Check that it's used only where its strengths are needed (e.g., complex reasoning, not trivial tasks).
- Consider switching to a faster/cheaper model for time-sensitive paths.
- Investigate whether specific input patterns (very long prompts, large tool outputs, certain tasks) correlate with slow requests.
Cost Analysis
The cost dashboard helps you understand where your money goes across models and workloads:
- Total Cost - Cumulative spend across all sessions.
- Cost per 1K Tokens - Effective cost normalized per 1,000 tokens, so you can compare efficiency across models.
- Cost by Model - Per-model breakdown showing total spend and relative efficiency.
- Pricing Updated Date - When pricing data was last refreshed (provider prices can change frequently).
💡 Cost per 1K Tokens
Cost per 1K tokens normalizes spending independent of volume, making models easy to compare side by side.
It answers questions like: "Am I using a premium model where a cheaper one would be good enough?"
This metric is usually calculated as:
(Total cost for the model ÷ Total tokens for that model) × 1,000
It reflects your effective price, including your mix of input/output tokens and any discounts, not just the provider's list price.
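In code, the formula is a one-liner. A minimal sketch, using the per-model totals from the example analysis that follows:

```python
def cost_per_1k(total_cost: float, total_tokens: int) -> float:
    """Effective cost per 1,000 tokens for one model."""
    return (total_cost / total_tokens) * 1000

# Per-model totals from the example analysis below.
print(round(cost_per_1k(35.00, 500_000), 3))  # 0.07  -> $0.070 per 1K
print(round(cost_per_1k(10.67, 400_000), 3))  # 0.027 -> $0.027 per 1K
```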
Example Analysis
Suppose your dashboard shows:
- GPT-4:
  - Total Cost: $35.00
  - Tokens: 500K
  - Cost per 1K: $0.070
- Claude 3.5 Sonnet:
  - Total Cost: $10.67
  - Tokens: 400K
  - Cost per 1K: $0.027
Observation: GPT-4 is roughly 2.6× more expensive per token.
Action: Test whether Claude 3.5 Sonnet can handle some tasks currently handled by GPT-4 without hurting quality.
Potential Savings: If you move equivalent workloads from GPT-4 to Claude 3.5 Sonnet, you can save around 60% on those tokens (e.g., turning a $35 workload into roughly $13-14).
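The savings estimate is simple arithmetic on the effective per-1K prices; a sketch using the same figures:

```python
# Effective per-1K prices from the dashboard figures above.
gpt4_per_1k = 35.00 / 500     # $0.070 (500K tokens = 500 thousands)
claude_per_1k = 10.67 / 400   # ~$0.027

# Re-price the same 500K-token workload on the cheaper model.
current = 35.00
projected = 500 * claude_per_1k
print(f"${projected:.2f} projected, {1 - projected / current:.0%} saved")
# -> $13.34 projected, 62% saved (in line with the ~60% estimate above)
```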
Cost Optimization Strategies
Quick Wins
- Use tiered models
  Fast/cheap models for simple or high-volume tasks, premium models for complex reasoning and critical paths.
- Cache repeated requests
  Avoid recomputing responses for identical or highly similar inputs (see the sketch after this list).
- Optimize prompts
  Shorter, more focused prompts reduce input token costs and can simplify outputs.
- Test cheaper alternatives
  Many tasks (classification, extraction, basic Q&A) don't need the most expensive model.
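A minimal sketch of caching identical requests in memory, assuming an OpenAI-style client; production setups usually add TTLs, persistence, and near-duplicate matching:

```python
import hashlib
import json

_response_cache: dict[str, str] = {}

def _cache_key(model: str, messages: list[dict]) -> str:
    """Stable key for an exact-duplicate request."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(client, model: str, messages: list[dict]) -> str:
    """Serve identical requests from memory; call the LLM only on a miss."""
    key = _cache_key(model, messages)
    if key not in _response_cache:
        resp = client.chat.completions.create(model=model, messages=messages)
        _response_cache[key] = resp.choices[0].message.content
    return _response_cache[key]
```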
Advanced Strategies
- A/B test models
  Compare quality vs. cost using real traffic and success metrics.
- Implement routing logic
  Automatically route requests based on task type, complexity, or user tier (free vs. paid); see the sketch after this list.
- Set budget alerts
  Monitor spend per project/session and flag sudden spikes or anomalies.
- Negotiate volume pricing
  If your usage is high, talk to providers about discounts or committed-use pricing.
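A hedged sketch of routing logic; the model names and the task/tier heuristic are placeholders, not recommendations:

```python
# Hypothetical routing table: cheap model by default, premium only where it pays off.
ROUTES = {
    "simple": "gpt-4o-mini",  # classification, extraction, short answers
    "complex": "gpt-4o",      # multi-step reasoning, critical paths
}

def pick_model(task_type: str, user_tier: str = "free") -> str:
    """Choose a model from task type and user tier."""
    if task_type == "complex" and user_tier == "paid":
        return ROUTES["complex"]
    return ROUTES["simple"]

print(pick_model("complex", user_tier="paid"))  # gpt-4o
print(pick_model("complex", user_tier="free"))  # gpt-4o-mini
```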
Model Trends Over Time
Track how model usage and token consumption evolve over time to understand growth patterns, validate optimizations, and catch regressions early.
The timeline chart shows two lines:
- Requests (cyan): How many calls per day
- Tokens (purple): Token consumption per day
Understanding the Trends
- When lines are parallel: Consistent token usage per request (healthy, predictable)
- When tokens increase faster than requests: Growing prompt sizes or output length - investigate why
- When requests spike: Usage growth or event-driven activity - is it expected?
Example Analysis
Day 1: 100 requests, 50K tokens = 500 tokens/request
Day 7: 150 requests, 112K tokens = 747 tokens/request
Tokens per request increased 49% → investigate prompt changes
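The same check is easy to automate against daily totals; a sketch, assuming a simple list of (requests, tokens) pairs rather than any particular export format:

```python
# Assumed shape: one (requests, tokens) pair per day, oldest first.
daily = [(100, 50_000), (120, 66_000), (150, 112_000)]

per_request = [tokens / requests for requests, tokens in daily]
change = per_request[-1] / per_request[0] - 1

print([round(v) for v in per_request])          # [500, 550, 747]
print(f"tokens/request change: {change:+.0%}")  # +49%
```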
Using Trends for Planning
- Capacity Planning: Extrapolate growth rate for budgeting future costs
- Performance Monitoring: Spot degradation before it becomes critical
- Validate Optimizations: Confirm that changes actually reduced costs
- Correlate with Deployments: See how code changes affect behavior
Example Optimization Validation:
Before: 200K tokens/day @ $5.60/day
After: 150K tokens/day @ $3.20/day
Result: 43% cost reduction ✅
Tool Usage & Performance
Tools are how agents interact with the world: file systems, APIs, databases, and external services. Understanding which tools are used, how often, and how well they perform is critical for optimizing agent behavior, controlling costs, and avoiding performance bottlenecks.
Tool Adoption & Frequency
A ranked list of all tools showing execution frequency, revealing your agent's "core functions" and identifying unused capabilities.
Each tool displays:
- Tool Name: Identifier for the function or API
- Visual Bar: Relative frequency (longest bar = most frequently used)
- Execution Count: Exact number of times this tool was called
What This Reveals
Identify unused tools:
get_weather: 450 calls
get_news: 23 calls
get_stocks: 0 calls → Remove this - adds cognitive load for no benefit
Spot inefficiency:
search_database: 890 calls
cache_lookup: 12 calls ← cache hits should be much higher!
→ Agent isn't using cache effectively
Validate expected behavior:
For a customer support agent:
lookup_ticket: 234 calls ✅ (expected high)
create_ticket: 45 calls ✅ (expected lower)
delete_ticket: 89 calls ⚠️ (investigate why so many deletions)
Every tool you define adds to the context sent with each LLM call. Unused tools waste tokens, increase costs, and add cognitive load to the model. Removing unused tools makes your agent faster, cheaper, and often more accurate.
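A sketch of pruning unused tool definitions before they reach the model, assuming you have the dashboard's call counts at hand and OpenAI-style function-calling schemas:

```python
# Call counts as reported by the dashboard (hypothetical values).
call_counts = {"get_weather": 450, "get_news": 23, "get_stocks": 0}

# Tool definitions in OpenAI-style function-calling format (schemas truncated).
all_tools = [
    {"type": "function", "function": {"name": "get_weather", "parameters": {}}},
    {"type": "function", "function": {"name": "get_news", "parameters": {}}},
    {"type": "function", "function": {"name": "get_stocks", "parameters": {}}},
]

# Keep only tools the agent actually uses; the rest just inflate every prompt.
active_tools = [
    t for t in all_tools if call_counts.get(t["function"]["name"], 0) > 0
]
print([t["function"]["name"] for t in active_tools])  # ['get_weather', 'get_news']
```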
Slow Tool Detection
Tools ranked by execution time to help you identify and fix performance bottlenecks. Slow tools directly impact user experience and can make your agent feel unresponsive.
Each tool shows:
- Average Duration: Typical execution time
- Max Duration: Worst-case performance (can indicate timeouts or stuck calls)
- Failure Rate: Percentage of calls that fail (if > 0%)
- Color Coding: Red (slowest) → Yellow (moderate) → Green (fastest)
Latency zones for interactive agents:
- ~0-100ms: Feels instant; below the threshold of conscious delay
- ~100-300ms: Still feels very fast / snappy for most interactions
- ~300-800ms: Noticeable, but usually acceptable if something visibly happens
- ~800ms-2s: Clearly "waiting", but tolerable for heavy operations
- >2s: Starts to feel slow; >5s is "something might be broken"
- Max in hundreds of seconds: Indicates timeouts or blocking operations
Understanding Failure Rates
Failure rates indicate tool reliability and are shown as a percentage on the bar:
- 0-5%: ✅ Acceptable - likely transient errors, monitor
- 5-10%: ⚠️ Warning - investigate patterns, add retry logic
- 10-20%: 🚨 Problem - serious reliability issue
- 20%+: 🔴 Critical - tool is broken, fix immediately or disable
Common causes of tool failures include:
- API rate limits being exceeded
- Network timeouts (tool takes too long)
- Invalid parameters from agent (malformed tool calls)
- External service downtime
- Authentication or permission issues
How to Fix Slow Tools
If a tool is slow (consistently >2 seconds):
- File & I/O tools: Avoid reading huge files synchronously, stream large results, add size limits
- External API calls: Cache results when possible, use async/parallel calls, set timeouts
- Database queries: Add indexes, optimize queries, use connection pooling
Example Optimization:
Before: get_user_profile - 3.2s avg (database query every call)
After: get_user_profile - 0.1s avg (Redis cache, 5-minute TTL)
Result: 97% faster ✅
If failure rate is high:
- Check if tool timeout is too short (increase it)
- Add better error handling and retry logic (see the sketch after this list)
- Validate tool call parameters before execution
- Monitor external service status
- If >20% failure, consider disabling tool until fixed
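A minimal retry-with-backoff wrapper for a flaky tool; which exceptions count as transient, and the delay schedule, are assumptions to tune for your own tools:

```python
import time

def call_with_retries(tool_fn, *args, attempts: int = 3, base_delay: float = 0.5, **kwargs):
    """Call a tool, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return tool_fn(*args, **kwargs)
        except (TimeoutError, ConnectionError):  # assumed transient error types
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the agent
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```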
Execution Trends
Visualize how tool usage evolves over time to detect patterns, validate optimizations, and understand how agent behavior changes with updates. Select up to 3 tools to compare simultaneously.
The multi-line chart shows:
- Colored lines: One per selected tool
- Data points: Actual daily execution counts
- Gradient fill: Emphasizes trend direction
What Trends Reveal
Usage Patterns:
Tool A (blue): Flat line → Stable core functionality
Tool B (purple): Growing → New feature gaining adoption
Tool C (green): Declining → Being phased out or losing relevance
Correlation Between Tools:
search_database ↑ (increasing)
cache_lookup → (flat)
→ Problem: Cache isn't scaling with searches, investigate why

create_order ↑ (increasing)
send_confirmation ↑ (increasing at same rate)
→ Good: Tools working together as expected
Anomaly Detection:
Normal: ~50 calls/day
Spike: 450 calls/day on Nov 15
→ Investigate: Bug? Feature launch? Attack attempt?
Using Trends for Decisions
Capacity Planning:
Tool A growing 20% week-over-week
Current: 1,000 calls/day
Projected (4 weeks): 2,074 calls/day
→ Ensure tool can handle 3,000+/day
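The projection above is just compound growth; a sketch of the arithmetic:

```python
def project_calls(current_per_day: float, weekly_growth: float, weeks: int) -> float:
    """Project daily call volume assuming constant week-over-week growth."""
    return current_per_day * (1 + weekly_growth) ** weeks

print(round(project_calls(1_000, 0.20, 4)))  # 2074 calls/day in four weeks
```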
Feature Validation:
Deployed new search on Nov 10
search_advanced: 0 → 150 calls/day within 5 days
→ Feature successfully adopted ✅
Cost Optimization:
expensive_api: 500 calls/day @ $0.10 = $50/day
cheap_alternative: 50 calls/day @ $0.01 = $0.50/day
After promoting alternative:
expensive_api: 200 calls/day = $20/day
cheap_alternative: 350 calls/day = $3.50/day
→ Savings: $27.00/day (53%) ✅
Using These Views for Decisions
- Refine tool sets: Remove unused tools and simplify the agent's toolbox
- Optimize performance: Focus on tools with high latency and high call volume
- Validate changes: After deployments, watch trends to confirm tools are used as expected
- Control costs & risk: High-volume, slow, or error-prone tools are prime candidates for caching or redesign