MCP+
Precision Context Management for MCP Agents

A post-processing layer that wraps your MCP clients and returns only the needles - not the haystack - so you cut context bloat and cost without changing your agent.


Prathyusha Jwalapuram, Akhilesh Deepak Gotmare, Doyen Sahoo, Silvio Savarese, Junnan Li

Salesforce AI Research

Contact: pjwalapuram@salesforce.com


About

The promise of Model Context Protocol (MCP) servers is seamless tool integration, but the reality is often indiscriminate context bloat. Every time an agent fetches a 1000-line HTML document just to find a single button's ID, you aren't just paying for those tokens once — the context bloat compounds with every subsequent turn in the conversation. The task agent is forced to find needles in the haystack while you foot the bill for the hay. This inflates costs, exhausts context windows, and distracts even the most capable LLMs.

To mitigate this context bloat, we built MCP+: a server-, agent-, and task-agnostic post-processing layer designed to sit as a protective wrapper around your MCP clients. Instead of forcing your primary task agent to sift through a mountain of hay, MCP+ intercepts the tool outputs and hands over only the needles. By delivering a highly condensed context window, it achieves comparable reasoning performance while slashing associated inference costs by up to 75%.

MCP+ pipeline: without vs with MCP+.

The power of the expected_info argument

MCP+ acts as a pseudo-MCP server: a layer that requires zero changes to your existing agent's logic. To your task agent, it looks like the same MCP server with the same tools, but with one powerful new capability: the expected_info argument. This allows the agent to call a tool and define the specific needle it needs before the haystack even reaches its context.

It's the difference between calling a tool for a 2000-line nested JSON data dump and asking for specific data values for User IDs X, Y, and Z — one is a context-clogging haystack; the other is the exact token-efficient answer. By offloading the haystack sifting to a more economical model (e.g. GPT-5-mini or Gemini-2.5-Flash), you stop burning premium tokens on noise and start investing them in reasoning.
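As a sketch of what this looks like on the wire (the tool name get_user_records and the client plumbing here are hypothetical; only the expected_info argument comes from MCP+):

```python
# Hypothetical tool-call payloads illustrating the expected_info argument.
# The tool name and arguments are illustrative; expected_info is the one
# addition MCP+ makes to the tool schema.

standard_call = {
    "tool": "get_user_records",  # hypothetical tool name
    "arguments": {"user_ids": ["X", "Y", "Z"]},
}

# With MCP+, the agent states the needle up front; the wrapper filters
# the raw response before it ever reaches the agent's context.
mcp_plus_call = {
    "tool": "get_user_records",
    "arguments": {
        "user_ids": ["X", "Y", "Z"],
        "expected_info": "Only the account balances for users X, Y and Z",
    },
}

# The tool schema is otherwise unchanged: same server, same tools,
# plus one extra optional argument.
extra_args = set(mcp_plus_call["arguments"]) - set(standard_call["arguments"])
print(extra_args)  # {'expected_info'}
```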

Example: Yahoo Finance MCP Server

Here's an example output from the Yahoo Finance MCP server, where the agent has called the tool get_historical_stock_prices with the arguments {'ticker': 'MSFT', 'start_date': '2023-01-08', 'end_date': '2025-01-09'}:

Raw tool response (standard MCP)
[~7,000 tokens of raw daily price data: omitted here]

Running the same task with MCP+ adds expected_info: e.g. "Closing prices for MSFT on 2023-01-09 and 2025-01-08 to calculate investment returns." With that, MCP+ returns only the relevant slice to the task agent:

MCP+ filtered response
```
# ~2500 characters — only what was requested
=====DUAL EXTRACTION RESULTS=====
Two extraction methods were used. You can use either result, or combine
information from both as appropriate.
--------------------------------------------------------------------------------
DIRECT EXTRACTION:
--------------------------------------------------------------------------------
2023-01-09: Close = 221.8165893555; 2025-01-08: Close = 421.4510192871
--------------------------------------------------------------------------------
CODE-BASED EXTRACTION:
--------------------------------------------------------------------------------
{'2023-01-09_close': 221.8165893555, '2025-01-08_close': 421.4510192871}
```

The token count drops from roughly 7,000 to about 200, a reduction of more than 95%.

Intelligent activation and cost control

MCP+ uses an economical, low-latency model (e.g. GPT-5-mini) as a post-processing agent that performs both direct and code-based extraction to filter out the fluff while preserving the integrity of the original data. To keep the process efficient, we've included a customizable activation threshold: MCP+ only engages when the tool call output is large enough to threaten your context window or your budget. For smaller payloads, the data passes through untouched.

We also add guardrails that enforce a strict token ceiling, so the extracted output never exceeds the size of the original source. The result is a system that maximizes information density and delivers a consistent reduction in overhead without compromising performance.
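The gate described in the two paragraphs above can be sketched as follows. The function names, the 4-characters-per-token estimate, and the extraction stub are illustrative assumptions, not the actual MCP+ implementation:

```python
# Minimal sketch of the activation threshold and output guardrail.
# All names and the token heuristic are assumptions for illustration.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return len(text) // 4

def extract_with_cheap_llm(raw_output: str, expected_info: str) -> str:
    # Placeholder for the direct + code-based extraction performed by
    # an economical model (e.g. GPT-5-mini).
    return f"[extracted for: {expected_info}]"

def postprocess(raw_output: str, expected_info: str,
                token_threshold: int = 300) -> str:
    # Small payloads pass through untouched: filtering them would add
    # latency and an extra LLM call for negligible savings.
    if estimate_tokens(raw_output) <= token_threshold:
        return raw_output

    # Large payloads are handed to the economical extraction model.
    extracted = extract_with_cheap_llm(raw_output, expected_info)

    # Guardrail: the extracted output never exceeds the original source.
    if estimate_tokens(extracted) > estimate_tokens(raw_output):
        return raw_output
    return extracted
```

A short payload is returned verbatim, while a large payload comes back as only the requested slice, never longer than the original.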

Results

Measuring the impact of MCP+ on performance and inference costs

We ran a systematic evaluation across task domains, output formats, and agent LLMs. Using the MCP-Universe Benchmark, we measured the performance of Claude 4.0 Sonnet, GPT-5, and Gemini 3 Pro Preview as function-calling agents across:

  • Browser Navigation: Playwright (raw HTML DOM)
  • Financial Management: Yahoo Finance (structured JSON)
  • Web Search: Google Search (Serper API payloads)

In each case we compared standard tool-use with an MCP+ configuration (powered by GPT-5-mini) to measure the delta in performance and inference cost. We found that MCP+ leads to consistent cost savings while maintaining comparable performance when averaged across multiple runs.

Performance and Cost comparison: Standard vs MCP+ across domains and target LLMs.

Demo

MCP+: Arbitrage for developers to preserve premium LLM context

For developers using Cursor or Claude Code, context bloat is a direct drain on your innovation budget. When an MCP tool dumps 10,000 tokens of irrelevant metadata into your chat, it's stealing the tokens your model needs to refactor that function or debug that edge case.

MCP+ lets you do an economic arbitrage: by offloading the haystack sifting to a cheaper model, you preserve your premium Claude Opus 4.5 context for what actually matters — writing code. While this orchestration step introduces a small latency overhead, the tradeoff is a leaner, more focused, and significantly more cost-effective inference cycle.
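The arbitrage is easy to see with back-of-the-envelope math. The per-token prices below are purely illustrative assumptions (not real pricing for any model); the 10,000-token dump comes from the text:

```python
# Illustrative cost arbitrage: all prices are assumed, not real pricing.
PREMIUM_PRICE = 15.00 / 1_000_000   # $/input token, assumed premium model
CHEAP_PRICE   = 0.25 / 1_000_000    # $/input token, assumed economical model

raw_tokens      = 10_000   # a bloated tool response (from the text)
filtered_tokens = 200      # the needle after MCP+ filtering
later_turns     = 10       # turns in which that context is re-sent

# Without MCP+: the premium model re-reads the full dump on every turn,
# because context bloat compounds across the conversation.
cost_without = raw_tokens * later_turns * PREMIUM_PRICE

# With MCP+: the cheap model reads the dump once; the premium model
# re-reads only the filtered slice thereafter.
cost_with = (raw_tokens * CHEAP_PRICE
             + filtered_tokens * later_turns * PREMIUM_PRICE)

print(f"without MCP+: ${cost_without:.2f}, with MCP+: ${cost_with:.4f}")
```

Under these assumed prices, the compounding re-reads dominate the bill, which is why filtering once with a cheap model pays for itself many times over.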

Here's a demo showcasing MCP+ in action in Cursor with Playwright:

Demo: Cursor with Playwright - MCP+ in action.

Here's a demo showcasing MCP+ in action in Claude Code with Context7:

Demo: Claude Code with Context7 - MCP+ in action.

Here's a demo showcasing MCP+ in action in Agentforce Vibes with DX MCP:

Demo: Agentforce Vibes with DX MCP - MCP+ in action.

Installation

Prerequisites: Python 3.10+ and an OpenAI API key (or an API key for another economical LLM provider of your choice).

```bash
# Install MCP-Universe
$ pip install mcpuniverse

# Set your API key
$ export OPENAI_API_KEY=sk-...

# Wrap your existing MCP servers (requires path to client's MCP config file)
# E.g. run with ~/.claude.json for Claude Code
$ mcp-build-plus --mcp-config ~/.cursor/mcp.json
# This creates -plus versions of all your MCP servers (e.g. github → github-plus).

# Wrap specific servers only
$ mcp-build-plus --mcp-config ~/.cursor/mcp.json --servers github

# Adjust token threshold — MCP+ is invoked for responses beyond this length.
# If you specify a server, the threshold will apply only to that server.
$ mcp-build-plus --mcp-config ~/.cursor/mcp.json --token-threshold 300

# Use a different/cheaper model (default: gpt-5-mini)
$ mcp-build-plus --mcp-config ~/.cursor/mcp.json --llm-model gpt-5-mini-2025-08-07

# Use Gemini instead of OpenAI
$ mcp-build-plus --mcp-config ~/.cursor/mcp.json \
    --llm-provider gemini \
    --llm-model gemini-2.5-flash \
    --llm-api-key-env GOOGLE_API_KEY

# Use Anthropic instead of OpenAI
$ mcp-build-plus --mcp-config ~/.cursor/mcp.json \
    --llm-provider claude \
    --llm-model claude-haiku-4-5-20251001 \
    --llm-api-key-env ANTHROPIC_API_KEY
```

Restart Cursor or Claude Code. Your servers now have -plus versions with intelligent filtering.