
[Feature]: Benchmark MCP Server for Load Testing and Performance Analysis #1219

@crivetimihai

Description


🚀 Epic: Benchmark MCP Server for Load Testing and Performance Analysis

Goal

Provide a highly configurable, Go-based MCP server that generates an arbitrary number of tools, resources, and prompts with customizable response payloads, for benchmarking, load testing, and performance analysis of MCP Gateway implementations, clients, and protocol stacks. This lets developers validate scalability, measure throughput, and identify bottlenecks in their MCP infrastructure before production deployment.

Why Now?

As MCP Gateway evolves to support thousands of tools, resources, and prompts across federated gateways, teams need a reliable way to:

  1. Validate scalability - Test how systems handle 1,000, 10,000, or 100,000+ MCP primitives
  2. Measure performance - Benchmark tool invocation latency, resource access speed, and prompt generation throughput
  3. Test edge cases - Validate behavior with varying payload sizes (1 byte to 1MB+)
  4. Compare transports - Benchmark stdio vs SSE vs HTTP performance characteristics
  5. Stress test infrastructure - Identify memory leaks, connection limits, and CPU bottlenecks

This tool provides a standardized benchmarking platform for the entire MCP ecosystem, enabling apples-to-apples performance comparisons across implementations.


📖 User Stories

US-1: Performance Engineer - Large-Scale Tool Discovery Testing

As a Performance Engineer
I want to generate 10,000+ tools with configurable payload sizes
So that I can measure how MCP Gateway handles large tool listings and invocations

Acceptance Criteria:

Given I start the benchmark server with flags:
  -tools=10000 -tool-size=5000 -resources=0 -prompts=0
When an MCP client sends "tools/list" request
Then the server should:
  - Return all 10,000 tool definitions within 100ms
  - Each tool should have unique name "benchmark_tool_N"
  - Each tool should accept "param1" and "param2" arguments
  - Tool descriptions should indicate the tool number

When the client invokes "benchmark_tool_0"
Then the server should:
  - Return JSON response with ~5000 byte payload
  - Include timestamp, tool name, and passed arguments
  - Respond within 10ms

Technical Requirements:

  • Tool generation must be deterministic and repeatable
  • Memory usage should scale linearly (O(n)) with tool count
  • Support up to 100,000 tools without performance degradation
  • All tools registered during server startup

US-2: QA Engineer - Mixed Workload Testing

As a QA Engineer
I want to configure different payload sizes for tools, resources, and prompts independently
So that I can simulate realistic MCP workloads with varying response sizes

Acceptance Criteria:

Given I start the benchmark server with flags:
  -tools=1000 -tool-size=2000
  -resources=500 -resource-size=50000
  -prompts=200 -prompt-size=1000
When MCP clients access the server
Then the server should:
  - Return tools with ~2KB payloads
  - Return resources with ~50KB payloads
  - Return prompts with ~1KB payloads
  - Maintain consistent payload sizes across invocations

Technical Requirements:

  • Independent size controls: -tool-size, -resource-size, -prompt-size
  • Payload generation must be efficient (no excessive memory allocations)
  • Support payload sizes from 1 byte to 10MB+
  • Validate size parameters at startup

US-3: DevOps Engineer - Multi-Transport Benchmarking

As a DevOps Engineer
I want to run the benchmark server over stdio, SSE, and HTTP transports
So that I can compare protocol performance characteristics

Acceptance Criteria:

# STDIO Mode (for Claude Desktop integration)
Given I run: ./benchmark-server -tools=1000
Then the server communicates via stdin/stdout JSON-RPC

# SSE Mode (for web clients)
Given I run: ./benchmark-server -transport=sse -port=8080 -tools=1000
Then the server exposes:
  - SSE events at /sse
  - SSE messages at /messages
  - Health check at /health
  - Version info at /version

# HTTP Mode (for REST-like clients)
Given I run: ./benchmark-server -transport=http -port=9090 -tools=1000
Then the server accepts POST requests with JSON-RPC payloads at /

Technical Requirements:

  • All transports support same MCP protocol features
  • SSE transport must support Server-Sent Events streaming
  • HTTP transport must support streamable responses
  • Health/version endpoints work without authentication

US-4: Load Test Engineer - Stress Testing Infrastructure

As a Load Test Engineer
I want to generate extreme-scale configurations (100,000+ items)
So that I can identify breaking points in MCP infrastructure

Acceptance Criteria:

Given I start the benchmark server with:
  -tools=100000 -resources=50000 -prompts=10000
When the server starts
Then it should:
  - Register all items within 5 seconds
  - Report configuration via logs
  - Consume less than 500MB memory
  - Respond to tool/list requests within 200ms

Technical Requirements:

  • Fast registration (instant for 10K, <5s for 100K)
  • Efficient memory usage (no redundant data structures)
  • Configurable log levels (debug, info, warn, error, none)
  • Graceful handling of OS limits (file descriptors, memory)

US-5: Security Engineer - Authenticated Transport Testing

As a Security Engineer
I want to require Bearer token authentication for SSE/HTTP transports
So that I can test authentication flows in benchmarking scenarios

Acceptance Criteria:

Given I start with: ./benchmark-server -transport=sse -auth-token=secret123
When a client sends a request without Authorization header
Then the server responds with 401 Unauthorized

When a client sends: Authorization: Bearer secret123
Then the server accepts the request and processes normally

Given environment variable: AUTH_TOKEN=secret456
When I start with: ./benchmark-server -transport=sse
Then the server uses "secret456" as the auth token

Technical Requirements:

  • Bearer token validation on all endpoints except /health and /version
  • Support both CLI flag and environment variable
  • Return proper WWW-Authenticate header on 401 responses
  • Log authentication attempts at debug level

🏗 Architecture

Component Architecture

graph TB
    subgraph "Benchmark Server (Go)"
        A1[Flag Parser]
        A2[MCP Server Core]
        A3[Dynamic Handler Generator]
        A4[Tool Handlers 0..N]
        A5[Resource Handlers 0..N]
        A6[Prompt Handlers 0..N]
        A7[Transport Layer]
        A8[STDIO Transport]
        A9[SSE Transport]
        A10[HTTP Transport]
        A11[Auth Middleware]
    end

    subgraph "MCP Clients"
        B1[Claude Desktop]
        B2[Web Browser]
        B3[HTTP Client]
        B4[Load Test Tool]
    end

    A1 --> A2
    A1 --> A3
    A3 --> A4
    A3 --> A5
    A3 --> A6
    A2 --> A4
    A2 --> A5
    A2 --> A6
    A2 --> A7
    A7 --> A8
    A7 --> A9
    A7 --> A10
    A9 --> A11
    A10 --> A11

    B1 -->|JSON-RPC stdin| A8
    B2 -->|SSE /sse| A9
    B3 -->|HTTP POST /| A10
    B4 -->|Concurrent Requests| A9
    B4 -->|Concurrent Requests| A10

Payload Generation Flow

sequenceDiagram
    participant Client as MCP Client
    participant Server as Benchmark Server
    participant Handler as Tool Handler
    participant Generator as Payload Generator

    Client->>Server: tools/call {"name":"benchmark_tool_0"}
    Server->>Handler: invoke(toolName="benchmark_tool_0", args={})
    Handler->>Generator: generatePayload("benchmark_tool_0", size=5000)
    Generator->>Generator: base = "Response from benchmark_tool_0. "
    Generator->>Generator: filler = "This is benchmark data. " (repeated)
    Generator->>Generator: result = base + filler (truncated to 5000 bytes)
    Generator-->>Handler: payload (5000 bytes)
    Handler->>Handler: Build JSON response with tool name, timestamp, args, data
    Handler-->>Server: JSON response
    Server-->>Client: MCP ToolResult with payload

📋 File Structure

mcp-servers/go/benchmark-server/
├── main.go                 # Server implementation (691 lines)
├── go.mod                  # Go module definition
├── go.sum                  # Dependency checksums
├── Makefile               # Build automation with targets
├── Dockerfile             # Multi-stage container build
├── README.md              # Comprehensive documentation
└── dist/
    └── benchmark-server   # Compiled binary

⚙️ Command-Line Interface

Core Flags

| Flag | Default | Description |
|---|---|---|
| -transport | stdio | Transport type: stdio, sse, or http |
| -tools | 100 | Number of tools to generate |
| -resources | 100 | Number of resources to generate |
| -prompts | 100 | Number of prompts to generate |
| -tool-size | 1000 | Size of tool response payload in bytes |
| -resource-size | 1000 | Size of resource response payload in bytes |
| -prompt-size | 1000 | Size of prompt response payload in bytes |
| -port | 8080 | TCP port for SSE/HTTP transport |
| -listen | 0.0.0.0 | Listen interface for SSE/HTTP |
| -addr | - | Full listen address (overrides -listen/-port) |
| -public-url | - | External base URL for SSE clients |
| -auth-token | - | Bearer token for authentication (SSE/HTTP only) |
| -log-level | info | Logging level: debug, info, warn, error, none |
| -help | - | Show help message |

Environment Variables

| Variable | Description |
|---|---|
| AUTH_TOKEN | Bearer token for authentication (overrides -auth-token flag) |
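The flag table above could be wired up roughly as follows with the standard `flag` package. A minimal sketch, assuming the documented names and defaults; the `config` struct, its field names, and the validation rules are illustrative, and a `flag.FlagSet` is used so parsing is testable without touching `os.Args`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// config holds the parsed settings; field names are illustrative.
type config struct {
	transport                          string
	tools, resources, prompts          int
	toolSize, resourceSize, promptSize int
	port                               int
	authToken                          string
}

// parseFlags mirrors the flag table; AUTH_TOKEN takes precedence over -auth-token.
func parseFlags(args []string) (*config, error) {
	fs := flag.NewFlagSet("benchmark-server", flag.ContinueOnError)
	c := &config{}
	fs.StringVar(&c.transport, "transport", "stdio", "Transport type: stdio, sse, or http")
	fs.IntVar(&c.tools, "tools", 100, "Number of tools to generate")
	fs.IntVar(&c.resources, "resources", 100, "Number of resources to generate")
	fs.IntVar(&c.prompts, "prompts", 100, "Number of prompts to generate")
	fs.IntVar(&c.toolSize, "tool-size", 1000, "Tool response payload size in bytes")
	fs.IntVar(&c.resourceSize, "resource-size", 1000, "Resource response payload size in bytes")
	fs.IntVar(&c.promptSize, "prompt-size", 1000, "Prompt response payload size in bytes")
	fs.IntVar(&c.port, "port", 8080, "TCP port for SSE/HTTP transport")
	fs.StringVar(&c.authToken, "auth-token", "", "Bearer token (SSE/HTTP only)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if env := os.Getenv("AUTH_TOKEN"); env != "" {
		c.authToken = env // environment variable overrides the flag
	}
	if c.toolSize < 1 || c.resourceSize < 1 || c.promptSize < 1 {
		return nil, fmt.Errorf("payload sizes must be >= 1 byte")
	}
	return c, nil
}

func main() {
	c, err := parseFlags([]string{"-transport=sse", "-tools=1000"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("transport=%s tools=%d port=%d\n", c.transport, c.tools, c.port)
	// transport=sse tools=1000 port=8080
}
```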

📊 Usage Examples

Small Scale Testing (Development)

# Quick test with 10 items each
./benchmark-server -tools=10 -resources=10 -prompts=10 -log-level=debug

# Test specific type with custom size
./benchmark-server -tools=5 -tool-size=500 -resources=0 -prompts=0

Medium Scale Testing (Integration)

# Realistic workload
./benchmark-server -tools=1000 -resources=500 -prompts=200

# Mixed payload sizes
./benchmark-server -tools=1000 -tool-size=2000 \
                   -resources=500 -resource-size=50000 \
                   -prompts=200 -prompt-size=1000

Large Scale Testing (Performance)

# 10K tools with 5KB payloads
./benchmark-server -tools=10000 -tool-size=5000

# Mixed scale for gateway stress testing
./benchmark-server -tools=10000 -resources=5000 -prompts=1000 \
                   -tool-size=2000 -resource-size=10000 -prompt-size=500

Extreme Scale Testing (Limits)

# 100K tools (test discovery performance)
./benchmark-server -tools=100000 -resources=0 -prompts=0 -log-level=none

# Large payloads (test data transfer)
./benchmark-server -tools=100 -tool-size=1000000  # 1MB payloads

Multi-Transport Testing

# STDIO (Claude Desktop)
./benchmark-server -tools=1000

# SSE (Web clients)
./benchmark-server -transport=sse -port=8080 -tools=1000

# HTTP (REST clients)
./benchmark-server -transport=http -port=9090 -tools=1000

# SSE with authentication
./benchmark-server -transport=sse -port=8080 -auth-token=secret123 -tools=500

Claude Desktop Integration

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "benchmark": {
      "command": "/path/to/benchmark-server",
      "args": ["-tools=1000", "-resources=500", "-prompts=200"]
    }
  }
}

🔧 API Response Format

Tool Response

{
  "tool": "benchmark_tool_0",
  "timestamp": "2025-10-11T12:34:56Z",
  "arguments": {
    "param1": "value1",
    "param2": "value2"
  },
  "data": "Response from benchmark_tool_0. This is benchmark data. This is benchmark data..."
}

Resource Response

{
  "resource": "benchmark_resource_0",
  "timestamp": "2025-10-11T12:34:56Z",
  "data": "Response from benchmark_resource_0. This is benchmark data..."
}

Prompt Response

Prompt: benchmark_prompt_0

Timestamp: 2025-10-11T12:34:56Z

Arguments:
  - arg1: value1
  - arg2: value2

Response from benchmark_prompt_0. This is benchmark data...

📈 Performance Characteristics

Registration Speed

| Item Count | Registration Time | Memory Usage |
|---|---|---|
| 100 | <1ms | ~5MB |
| 1,000 | <10ms | ~10MB |
| 10,000 | <100ms | ~50MB |
| 100,000 | <5s | ~300MB |

Response Times

| Operation | 1,000 items | 10,000 items | 100,000 items |
|---|---|---|---|
| Tool listing | <10ms | <50ms | <200ms |
| Tool invocation | <5ms | <5ms | <5ms |
| Resource access | <5ms | <5ms | <5ms |
| Prompt generation | <5ms | <5ms | <5ms |

Payload Size Impact

| Payload Size | Tool Invocation Time | Memory per Request |
|---|---|---|
| 1KB | <5ms | ~2KB |
| 10KB | <5ms | ~12KB |
| 100KB | <10ms | ~105KB |
| 1MB | <20ms | ~1.1MB |
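Numbers like those in the tables above are machine-dependent. One quick way to reproduce the payload-generation cost is Go's `testing.Benchmark`, which works outside `go test`; the `generatePayload` body here is an illustrative reimplementation, not the project's actual code.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Illustrative reimplementation, repeated so this sketch is self-contained.
func generatePayload(name string, size int) string {
	if size <= 0 {
		return ""
	}
	base := "Response from " + name + ". "
	if size <= len(base) {
		return base[:size]
	}
	var b strings.Builder
	b.Grow(size + 24)
	b.WriteString(base)
	for b.Len() < size {
		b.WriteString("This is benchmark data. ")
	}
	return b.String()[:size]
}

func main() {
	// testing.Benchmark runs f with an auto-scaled iteration count and
	// reports ns/op, so ad-hoc timing needs no test harness.
	for _, size := range []int{1 << 10, 100 << 10} {
		res := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				generatePayload("benchmark_tool_0", size)
			}
		})
		fmt.Printf("size=%d: %s\n", size, res)
	}
}
```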

📋 Implementation Tasks

Phase 1: Core Server Implementation ✅

  • Project Structure

    • Create mcp-servers/go/benchmark-server/ directory
    • Initialize Go module with go.mod
    • Add github.com/mark3labs/mcp-go v0.32.0 dependency
  • Main Application (main.go)

    • Implement command-line flag parsing
    • Add logging infrastructure with levels
    • Create MCP server initialization
    • Implement transport selection logic

Phase 2: Dynamic Handler Generation ✅

  • Payload Generation

    • Implement generatePayload() function
    • Support arbitrary payload sizes
    • Use repeating filler text for efficiency
  • Handler Factories

    • Implement createToolHandler() factory
    • Implement createResourceHandler() factory
    • Implement createPromptHandler() factory
    • Support closure-based handler creation with size parameters

Phase 3: Tool/Resource/Prompt Registration ✅

  • Tool Registration Loop

    • Generate N tools with sequential names
    • Add tool descriptions and metadata
    • Register with MCP server
  • Resource Registration Loop

    • Generate N resources with URIs
    • Add resource descriptions
    • Register with MCP server
  • Prompt Registration Loop

    • Generate N prompts with names
    • Add prompt arguments
    • Register with MCP server
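The registration loops above share one shape: generate a sequential name, build a handler, register it. A library-agnostic sketch follows; `registry` is a hypothetical stand-in for the mcp-go server's registration API, and the real code would call the server's registration methods instead.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for the mcp-go server's registration API.
type registry struct {
	tools map[string]func() string
}

// Illustrative reimplementation, repeated so this sketch is self-contained.
func generatePayload(name string, size int) string {
	if size <= 0 {
		return ""
	}
	s := "Response from " + name + ". "
	for len(s) < size {
		s += "This is benchmark data. "
	}
	return s[:size]
}

// registerTools generates n sequentially named tools, as the Phase 3 loop describes.
func registerTools(r *registry, n, size int) {
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("benchmark_tool_%d", i)
		nm := name // capture the per-iteration value in the closure
		r.tools[name] = func() string { return generatePayload(nm, size) }
	}
}

func main() {
	r := &registry{tools: map[string]func() string{}}
	registerTools(r, 1000, 1000)
	fmt.Println(len(r.tools)) // 1000
}
```

Deterministic sequential names (`benchmark_tool_0` through `benchmark_tool_N-1`) are what make load-test assertions repeatable across runs.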

Phase 4: Transport Implementation ✅

  • STDIO Transport

    • Use server.ServeStdio() for stdin/stdout
    • Support JSON-RPC over stdio
    • Ignore auth-token in stdio mode
  • SSE Transport

    • Implement SSE server with /sse and /messages endpoints
    • Add health/version endpoints
    • Support Bearer token authentication
    • Implement logging middleware
  • HTTP Transport

    • Implement HTTP server with JSON-RPC POST endpoint
    • Add health/version endpoints
    • Support Bearer token authentication
    • Implement logging middleware

Phase 5: Authentication & Security ✅

  • Bearer Token Auth
    • Implement authMiddleware() function
    • Validate Authorization header format
    • Skip auth for health/version endpoints
    • Support both CLI flag and environment variable
    • Return proper 401 responses with WWW-Authenticate

Phase 6: Customizable Payload Sizes ✅

  • Separate Size Controls
    • Replace single -payload-size with three flags
    • Add -tool-size flag (default: 1000)
    • Add -resource-size flag (default: 1000)
    • Add -prompt-size flag (default: 1000)
    • Update handler factories to use separate sizes
    • Update logging to show all three sizes

Phase 7: Build Automation ✅

  • Makefile

    • Create build target with CGO_ENABLED=0
    • Add run target for quick testing
    • Add run-small, run-medium, run-large, run-xlarge presets
    • Add run-sse and run-http transport targets
    • Add clean target
    • Add help target with descriptions
    • Add tidy, fmt, test targets
  • Dockerfile

    • Multi-stage build with golang:1.23
    • Scratch-based final image
    • CGO_ENABLED=0 for static binary
    • Trimmed and stripped binary

Phase 8: Documentation ✅

  • README.md

    • Project overview and features
    • Quick start guide
    • Command-line options table
    • Usage examples (small, medium, large, extreme scale)
    • Claude Desktop integration example
    • Testing examples with curl
    • Makefile targets documentation
    • Benchmarking scenarios
    • Performance characteristics
    • API response format examples
    • Docker usage instructions
  • Code Documentation

    • File header with usage examples
    • Function docstrings
    • Inline comments for complex logic

Phase 9: Testing & Validation ✅

  • Functional Testing

    • Test tool listing with 100 items (default)
    • Test tool invocation with parameters
    • Test resource listing
    • Test resource reading
    • Test prompt listing
    • Test prompt generation
    • Test with 1,000 items (medium scale)
    • Test with 10,000 items (large scale)
  • Payload Size Testing

    • Verify tool payload size accuracy
    • Verify resource payload size accuracy
    • Verify prompt payload size accuracy
    • Test mixed payload sizes
  • Transport Testing

    • Test stdio mode with echo piping
    • Test SSE mode (manual verification)
    • Test HTTP mode (manual verification)
  • Performance Testing

    • Measure registration time for 10,000 items
    • Verify instant registration
    • Verify memory usage is reasonable

✅ Success Criteria

  • Functionality: Can generate 1 to 100,000+ tools/resources/prompts on demand
  • Customization: Separate size controls for tools, resources, and prompts
  • Performance: Instant registration for 10K items, <5s for 100K items
  • Transports: Full support for stdio, SSE, and HTTP transports
  • Authentication: Bearer token auth for SSE/HTTP with environment variable support
  • Build System: Makefile with convenient targets and Docker support
  • Documentation: Comprehensive README with examples and Claude Desktop integration
  • Testing: Verified with actual MCP protocol invocations
  • Logging: Configurable log levels with structured output
  • Standards: Full MCP 1.0 protocol compliance

📝 Additional Notes

🔹 Single Dependency: The server uses only the mcp-go library and the Go standard library, keeping the attack surface minimal and compilation fast.

🔹 Deterministic Behavior: Tool names, resource URIs, and prompt names are sequential and predictable, making it easy to write automated tests.

🔹 Efficient Memory Usage: Handlers are generated as closures that capture only the necessary data (name, size), avoiding redundant storage.

🔹 Payload Flexibility: Supports payloads from 1 byte to 10MB+, enabling testing of:

  • Small responses (metadata-heavy workloads)
  • Medium responses (typical tool outputs)
  • Large responses (data export, log streaming)

🔹 Real-World Simulation: The three-tier configuration (tools, resources, prompts with independent sizes) mirrors production MCP servers that expose different types of primitives with varying response characteristics.

🔹 Container-Ready: Dockerfile produces a 10MB scratch-based image with static binary, ideal for Kubernetes deployments and CI/CD pipelines.

🔹 Claude Desktop Compatible: Works out-of-the-box with Claude Desktop via stdio transport, allowing manual testing of tool discovery and invocation.

🔹 Future Extensions:

  • Sampling mode (random tool/resource/prompt selection)
  • Latency injection for network delay simulation
  • Error rate injection for failure testing
  • Prometheus metrics endpoint
  • OpenTelemetry tracing support

🏁 Definition of Done

  • All implementation tasks completed
  • Server runs with default configuration (100 items each)
  • Server handles extreme scale (10,000+ items) without errors
  • Separate payload size controls functional and tested
  • All three transports (stdio, SSE, HTTP) operational
  • Authentication works with both CLI flag and environment variable
  • Makefile targets work correctly
  • Dockerfile builds successfully
  • README documentation complete with examples
  • Code includes header comments with usage examples
  • Tested with actual MCP protocol requests
  • Performance characteristics documented
  • Project follows Go best practices (gofmt, proper error handling)
  • Binary size optimized (stripped and trimmed)

🎯 Use Cases

1. Gateway Scalability Testing

Test MCP Gateway with increasing tool counts to identify discovery bottlenecks.

2. Transport Performance Comparison

Benchmark stdio vs SSE vs HTTP to determine optimal transport for production.

3. Client Load Testing

Stress test MCP clients (Claude Desktop, web apps) with large tool catalogs.

4. Protocol Compliance Verification

Validate MCP protocol implementations handle large-scale tool/resource/prompt scenarios correctly.

5. Memory Profiling

Profile MCP Gateway memory usage under various load conditions (tool count × payload size).

6. Latency Analysis

Measure end-to-end latency from tool invocation to response across different scales.

7. Federation Testing

Test federated gateway scenarios with multiple benchmark servers exposing different scales.

8. CI/CD Performance Regression

Automated benchmarking in CI/CD to detect performance regressions across versions.

Metadata

Labels: enhancement (New feature or request), triage (Issues / Features awaiting triage)