Tim Quinteiro

How to monitor your MCP server in production

A practical guide to monitoring an MCP server in production: what to instrument, which metrics matter, and how to set up alerting for tool calls, sessions, and clients.

You shipped an MCP server. Real agents are calling it now. So the question stops being “does it work” and becomes “how do I know when it doesn’t, and how fast can I find out why.” This is a guide to monitoring an MCP server in production: what to instrument, which numbers actually tell you something, and how to wire up alerts that fire on the things you care about.

It is written for engineering teams who already run a general-purpose APM (Datadog, Sentry, New Relic) and have now added an MCP surface that the APM does not understand. If that is you, start here.

Why an MCP server needs its own monitoring

An MCP server is not a normal HTTP service. From the outside it looks like one request in, one response out. Inside, every exchange is a JSON-RPC message: a tool call, a prompt fetch, a resource read, an initialization handshake. Those are the units of work that matter, and they are invisible to anything that only sees the HTTP envelope.
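To make that concrete, here is a minimal sketch of the classification a generic APM never does: taking a raw JSON-RPC request body and naming the MCP unit of work inside it. The method strings (`tools/call`, `prompts/get`, `resources/read`, `initialize`) are MCP method names; the `McpKind` labels and the `classifyMcpMessage` helper are hypothetical names made up for this illustration.

```typescript
// Labels are this sketch's own; only the method strings come from MCP.
type McpKind = "tool_call" | "prompt_fetch" | "resource_read" | "handshake" | "other";

interface ClassifiedMessage {
  kind: McpKind;
  target: string | undefined; // tool name, prompt name, or resource URI
}

function kindOf(method: string): McpKind {
  switch (method) {
    case "tools/call": return "tool_call";
    case "prompts/get": return "prompt_fetch";
    case "resources/read": return "resource_read";
    case "initialize": return "handshake";
    default: return "other";
  }
}

// Assumes `body` is a single JSON-RPC request (not a batch, not a response).
function classifyMcpMessage(body: string): ClassifiedMessage {
  const msg = JSON.parse(body) as { method?: unknown; params?: unknown };
  const kind = kindOf(typeof msg.method === "string" ? msg.method : "");
  const params = (typeof msg.params === "object" && msg.params !== null
    ? msg.params
    : {}) as Record<string, unknown>;

  // tools/call and prompts/get carry `name`; resources/read carries `uri`.
  let raw: unknown = undefined;
  if (kind === "resource_read") raw = params["uri"];
  else if (kind === "tool_call" || kind === "prompt_fetch") raw = params["name"];

  return { kind, target: typeof raw === "string" ? raw : undefined };
}
```

Two requests that look identical at the HTTP layer (same path, same status code) come out of this as different kinds with different targets, which is the level the rest of this guide works at.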

When something breaks, the questions are MCP-shaped:

  • Which tool was called, and with what arguments?
  • Which client made the call: Claude Desktop, Cursor, Codex, or an agent you have never heard of?
  • Did the response actually come back, or did the model get an error it quietly swallowed?
  • Is this the same session that failed yesterday?
  • Why is one customer seeing a 30% error rate when everyone else is fine?

You can answer some of these with logs and a generic APM, but you will spend more time mapping MCP concepts onto HTTP spans than debugging. The point of MCP monitoring is to work in the units your protocol actually uses.

What to instrument

Five things cover almost every production question.

Tool calls. The center of everything. Capture the tool name, the arguments, the response (or the error), and the duration. Most incidents are a single tool misbehaving for a single class of input, and you cannot see that without the arguments next to the timing.
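If you were capturing this by hand, a wrapper around each tool handler is roughly what it would take. This is a sketch under assumptions, not any SDK's API: `ToolCallRecord`, `instrumentTool`, and the `sink` callback are hypothetical names, and the handler signature is simplified to one argument object in, one promise out.

```typescript
interface ToolCallRecord {
  tool: string;
  args: unknown;
  ok: boolean;
  result?: unknown;
  error?: string;
  durationMs: number;
}

type ToolHandler<A, R> = (args: A) => Promise<R>;

// Wraps a handler so every call emits one record: name, arguments,
// result or error, and wall-clock duration.
function instrumentTool<A, R>(
  tool: string,
  handler: ToolHandler<A, R>,
  sink: (record: ToolCallRecord) => void,
): ToolHandler<A, R> {
  return async (args: A): Promise<R> => {
    const startedAt = Date.now();
    try {
      const result = await handler(args);
      sink({ tool, args, ok: true, result, durationMs: Date.now() - startedAt });
      return result;
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      sink({ tool, args, ok: false, error, durationMs: Date.now() - startedAt });
      throw err; // rethrow so the caller still sees the failure
    }
  };
}
```

The arguments and the duration land in the same record on purpose: that pairing is what lets you spot "slow only when `limit` is over 1000" instead of just "sometimes slow."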

Resource reads. Resources expose your data for clients to read. They fail differently from tools: permissions, missing URIs, oversized payloads. Track them separately so a spike in resource errors does not hide inside your tool numbers.

Prompts. If your server exposes prompts, capture which ones are fetched and how often. A prompt that suddenly stops being requested usually means a client changed behavior.

Sessions. A session is the thread that ties a sequence of calls to one client connection. Session-level grouping is what lets you reconstruct “what was this agent actually trying to do” instead of staring at isolated calls.
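As a rough illustration of session-level grouping, the sketch below turns a flat stream of events into one ordered timeline per session. It assumes each event already carries a session identifier; the `SessionEvent` shape and the `sessionTimelines` helper are hypothetical names for this example.

```typescript
interface SessionEvent {
  sessionId: string;
  timestampMs: number;
  method: string; // e.g. "tools/call"
  target: string; // tool name, prompt name, or resource URI ("" if none)
  ok: boolean;
}

// Groups events by session and orders each group by time, so one
// session reads as a sequence of steps rather than isolated calls.
function sessionTimelines(events: SessionEvent[]): Map<string, SessionEvent[]> {
  const bySession = new Map<string, SessionEvent[]>();
  for (const ev of events) {
    const list = bySession.get(ev.sessionId);
    if (list) list.push(ev);
    else bySession.set(ev.sessionId, [ev]);
  }
  bySession.forEach((list) => {
    list.sort((a, b) => a.timestampMs - b.timestampMs);
  });
  return bySession;
}
```

Read in order, a timeline like "initialize, call `build`, call `deploy` (failed)" tells you what the agent was attempting in a way that three unrelated log lines do not.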

Client identity. Which client, which version. This is the single most useful breakdown dimension in practice, because regressions often track a client release rather than your own deploy.
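Client identity comes from the handshake: in MCP, the `initialize` request carries a `clientInfo` object with a `name` and a `version`. A minimal sketch of pulling it out defensively follows; `clientFromInitialize` and the `"unknown"` fallback are choices made for this example, not part of the protocol.

```typescript
interface ClientIdentity {
  name: string;
  version: string;
}

// Reads clientInfo from the params of an `initialize` request.
// Anything missing or malformed falls back to "unknown" rather than throwing,
// so a misbehaving client still gets a bucket of its own.
function clientFromInitialize(params: unknown): ClientIdentity {
  const fallback: ClientIdentity = { name: "unknown", version: "unknown" };
  if (typeof params !== "object" || params === null) return fallback;

  const info = (params as { clientInfo?: unknown }).clientInfo;
  if (typeof info !== "object" || info === null) return fallback;

  const fields = info as { name?: unknown; version?: unknown };
  return {
    name: typeof fields.name === "string" ? fields.name : "unknown",
    version: typeof fields.version === "string" ? fields.version : "unknown",
  };
}
```

Store the result on the session once, at handshake time, and every later call in that session inherits the client and version for free.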

The metrics that actually matter

Volume and a single global error rate are where most teams stop. They are not enough. The metrics that change how you operate are the ones broken down by dimension.

Error rate by client and by tool. A flat 2% error rate tells you little on its own. A 2% rate that is actually 40% on one tool for one client version is an incident, so always slice error rate by tool and by client.
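The slicing itself is a small group-by. Here is a sketch, assuming you have one record per call with the tool, the client, and whether it succeeded; `CallOutcome`, `RateRow`, and `errorRateByToolAndClient` are hypothetical names, and the `client` string format is up to you.

```typescript
interface CallOutcome {
  tool: string;
  client: string; // e.g. "cursor@1.4"; the format is an assumption
  ok: boolean;
}

interface RateRow {
  tool: string;
  client: string;
  calls: number;
  errors: number;
  errorRate: number;
}

// One row per (tool, client) pair, worst error rate first.
function errorRateByToolAndClient(calls: CallOutcome[]): RateRow[] {
  const index: Record<string, RateRow> = {};
  const rows: RateRow[] = [];
  for (const call of calls) {
    const key = `${call.tool}\u0000${call.client}`;
    let row = index[key];
    if (!row) {
      row = { tool: call.tool, client: call.client, calls: 0, errors: 0, errorRate: 0 };
      index[key] = row;
      rows.push(row);
    }
    row.calls += 1;
    if (!call.ok) row.errors += 1;
  }
  rows.forEach((row) => {
    row.errorRate = row.errors / row.calls;
  });
  return rows.sort((a, b) => b.errorRate - a.errorRate);
}
```

Feed it 95 clean calls from one client plus 5 calls from another where 2 fail, and the global rate is 2% while the top row is the failing pair at 40%.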

Duration percentiles per tool (p50 / p95). Averages hide the calls that matter. Track p50 and p95 per tool. The p95 on your slowest tool is usually what an agent experiences as “the server is slow.”
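For reference, here is one way to compute those numbers from raw durations. The sketch uses the nearest-rank definition of a percentile, which is a choice made here (other definitions interpolate between samples), and `latencyByTool` and its types are hypothetical names.

```typescript
interface LatencySample {
  tool: string;
  durationMs: number;
}

interface LatencyStats {
  count: number;
  mean: number;
  p50: number;
  p95: number;
}

// Nearest-rank percentile over a non-empty, ascending-sorted array.
function nearestRank(sortedAsc: number[], p: number): number {
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  return sortedAsc[Math.max(rank, 1) - 1];
}

function latencyByTool(samples: LatencySample[]): Map<string, LatencyStats> {
  const durations = new Map<string, number[]>();
  for (const s of samples) {
    const list = durations.get(s.tool);
    if (list) list.push(s.durationMs);
    else durations.set(s.tool, [s.durationMs]);
  }

  const stats = new Map<string, LatencyStats>();
  durations.forEach((list, tool) => {
    const sorted = list.slice().sort((a, b) => a - b);
    const total = sorted.reduce((sum, ms) => sum + ms, 0);
    stats.set(tool, {
      count: sorted.length,
      mean: total / sorted.length,
      p50: nearestRank(sorted, 50),
      p95: nearestRank(sorted, 95),
    });
  });
  return stats;
}
```

The mean is included only for contrast: with eight calls at 100 ms and two near one second, the mean sits in a range no real call landed in, while p50 and p95 each describe calls that actually happened.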

Session error concentration. Are errors spread evenly, or concentrated in a handful of sessions? Concentrated errors point at a specific client or a specific workflow; spread-out errors point at your server.
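One simple way to put a number on concentration is the share of all errors that came from the N worst sessions. This is a sketch of that particular measure, not a standard metric; `errorConcentration` is a hypothetical helper, and the input is assumed to be one session ID per failed call.

```typescript
// Returns the fraction (0..1) of errors that fall in the `topN` sessions
// with the most errors. `errorSessionIds` holds one entry per failed call.
function errorConcentration(errorSessionIds: string[], topN: number): number {
  if (errorSessionIds.length === 0) return 0;

  const counts = new Map<string, number>();
  for (const id of errorSessionIds) {
    counts.set(id, (counts.get(id) ?? 0) + 1);
  }

  const perSession: number[] = [];
  counts.forEach((n) => {
    perSession.push(n);
  });
  perSession.sort((a, b) => b - a);

  const topTotal = perSession.slice(0, topN).reduce((sum, n) => sum + n, 0);
  return topTotal / errorSessionIds.length;
}
```

A value near 1 for a small N says "go look at those sessions and the client behind them." A low value across many sessions says the problem is more likely on your side.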

Throughput by tool over time. A tool whose call volume drops to zero is often a more urgent signal than an error, because it means a client stopped calling it entirely, silently.
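A drop to zero only means something if the tool was busy before, so the check needs a baseline. Below is a minimal sketch, assuming you already have call counts per fixed time bucket for one tool, oldest first; `isThroughputCollapse` and the `minBaseline` default of 5 are hypothetical choices for illustration.

```typescript
// `countsPerBucket` is calls per fixed interval for one tool, oldest first.
// Flags a collapse when the latest bucket is empty but the tool's typical
// (median) volume in earlier buckets was at least `minBaseline`.
function isThroughputCollapse(countsPerBucket: number[], minBaseline: number = 5): boolean {
  if (countsPerBucket.length < 2) return false;

  const latest = countsPerBucket[countsPerBucket.length - 1];
  const earlier = countsPerBucket.slice(0, -1).sort((a, b) => a - b);
  // Upper median for even lengths; good enough for a baseline check.
  const median = earlier[Math.floor(earlier.length / 2)];

  return latest === 0 && median >= minBaseline;
}
```

Using the median rather than the mean keeps one unusual burst in the past from making a normally quiet tool look like it collapsed.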

Setting it up

There are two ways to get this data into Spanly. Pick whichever fits how you ship.

The SDK, available for TypeScript and Python, is a few lines in your server. In TypeScript:

import { SpanlyClient } from '@spanly/sdk';

const spanly = new SpanlyClient({
  apiKey: process.env.SPANLY_API_KEY,
});

spanly.monitor(mcpServer);

If you cannot or do not want to touch the server code (third-party servers, containerized stacks, anything you do not own), the CLI wraps or proxies it with no code change:

export SPANLY_API_KEY="spanly_us_..."

# wrap your MCP server (stdio or HTTP)
npx -y @spanly/spanly run -- node ./server.js

Either way, tool calls, sessions, and client identity start showing up within seconds. The TypeScript quickstart and Python quickstart walk through the full setup.

Alerting: fire on what you care about

Monitoring you have to remember to look at is monitoring you will not look at. Set alerts so the system tells you. A few that earn their keep:

  • Error rate over a threshold in a short window, for example error rate above 10% over the last 5 minutes. This catches a bad client release or a downstream dependency failing.
  • p95 latency on a critical tool above a ceiling. Slow is the failure mode agents notice first.
  • Throughput collapse. A tool that drops to zero calls when it normally runs steadily.

Each rule can fan out to email, Slack, and signed webhooks, so the alert lands wherever your on-call already lives. Start with two or three rules and tune the thresholds against a week of real traffic rather than guessing up front.
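To show what the first rule in the list above actually evaluates, here is a sketch of an error-rate check over a sliding window. It is an illustration of the logic, not Spanly's rule format: `ErrorRateRule`, `shouldFire`, and especially `minCalls` (a floor added here so one failed call out of one does not page anyone) are assumptions.

```typescript
interface TimedOutcome {
  timestampMs: number;
  ok: boolean;
}

interface ErrorRateRule {
  thresholdRate: number; // 0.1 means 10%
  windowMs: number;      // 5 * 60 * 1000 means the last 5 minutes
  minCalls: number;      // assumption: ignore windows with too little traffic
}

// True when the error rate inside the window ending at `nowMs`
// is strictly above the rule's threshold.
function shouldFire(rule: ErrorRateRule, outcomes: TimedOutcome[], nowMs: number): boolean {
  const windowStart = nowMs - rule.windowMs;
  const recent = outcomes.filter(
    (o) => o.timestampMs > windowStart && o.timestampMs <= nowMs,
  );
  if (recent.length < rule.minCalls) return false;

  const errors = recent.filter((o) => !o.ok).length;
  return errors / recent.length > rule.thresholdRate;
}
```

The threshold, the window, and the traffic floor are the three knobs worth tuning against that first week of real traffic.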

Keep your APM, add the MCP layer

None of this replaces your existing observability stack. Your APM still owns HTTP, infrastructure, and the rest of your service. MCP monitoring sits alongside it and adds the protocol layer your APM cannot see. Because Spanly propagates the W3C trace context on inbound MCP requests, every view links straight back to the matching trace in Datadog, Sentry, or New Relic. We go deeper on that split in MCP observability vs APM, and on the category itself in what MCP observability is.
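For readers who have not looked inside W3C trace context: the linking key is the `traceparent` header, which packs a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and a flags byte into one string. The sketch below parses version `00` of that header to show what there is to correlate on. It says nothing about how Spanly reads it internally, and `parseTraceparent` is a hypothetical helper.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex digits, shared with your APM trace
  parentId: string; // 16 lowercase hex digits, the calling span
  sampled: boolean; // lowest bit of the flags byte
}

// Parses a version-00 `traceparent` value. Returns null for anything
// malformed, including the all-zero IDs the spec treats as invalid.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;

  const traceId = match[1];
  const parentId = match[2];
  const flags = parseInt(match[3], 16);
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;

  return { traceId, parentId, sampled: (flags & 1) === 1 };
}
```

The `traceId` is the value to search for in Datadog, Sentry, or New Relic: if an MCP call and an APM trace share it, they belong to the same request.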
