
Sample MCP Server - Python (data-analysis-server) #900

@crivetimihai

Description

Overview

Create a sample MCP server in Python that provides comprehensive data analysis capabilities, including statistical analysis, visualization, and data manipulation, built on pandas, NumPy, and the scientific Python stack.

Server Specifications

Server Details

  • Name: data-analysis-server
  • Language: Python 3.11+
  • Location: mcp-servers/python/data_analysis_server/
  • Purpose: Demonstrate data science and analytics workflows via MCP

Core Features

  • Data loading from multiple formats (CSV, JSON, Parquet, SQL)
  • Statistical analysis and hypothesis testing
  • Data visualization and plotting
  • Data cleaning and transformation
  • Time series analysis
  • Machine learning pipeline integration

Tools Provided

1. load_dataset

Load data from various sources and formats

@dataclass
class DataLoadRequest:
    source: str  # file path, URL, or SQL connection string
    format: str  # csv, json, parquet, sql, excel
    options: Optional[Dict[str, Any]] = None
    sample_size: Optional[int] = None
    cache_data: bool = True
    dataset_id: Optional[str] = None
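A minimal sketch of how the loader behind this tool could dispatch on the request's format field using pandas readers. The helper and table names (`load_dataframe`, `READERS`) are illustrative, not part of the spec:

```python
import io

import pandas as pd

# Hypothetical dispatch table: maps the request's `format` field to a pandas reader
READERS = {
    "csv": pd.read_csv,
    "json": pd.read_json,
    "parquet": pd.read_parquet,
    "excel": pd.read_excel,
}

def load_dataframe(source, fmt, options=None, sample_size=None):
    reader = READERS.get(fmt)
    if reader is None:
        raise ValueError(f"unsupported format: {fmt}")
    df = reader(source, **(options or {}))
    # Optional down-sampling for large files, per the request's sample_size field
    if sample_size is not None and sample_size < len(df):
        df = df.sample(n=sample_size, random_state=0)
    return df
```

The `sql` format would take a separate SQLAlchemy path (`pd.read_sql` needs a connection, not a file path), which is why a plain dispatch table alone does not cover it.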

2. analyze_dataset

Comprehensive dataset analysis and profiling

@dataclass
class DataAnalysisRequest:
    dataset_id: str
    analysis_type: str  # descriptive, exploratory, correlation
    columns: Optional[List[str]] = None
    include_distributions: bool = True
    include_correlations: bool = True
    include_outliers: bool = True
    confidence_level: float = 0.95
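The profiling step could be sketched with plain pandas, covering the summary, correlation, and IQR-outlier pieces of the request (IQR matching the config's default outlier_method). The function name and result shape are assumptions for illustration:

```python
import pandas as pd

def descriptive_profile(df, columns=None, include_correlations=True):
    # Default to all numeric columns, as the optional `columns` field suggests
    cols = columns or df.select_dtypes("number").columns.tolist()
    profile = {"summary": df[cols].describe().to_dict()}
    if include_correlations and len(cols) > 1:
        profile["correlations"] = df[cols].corr().to_dict()
    # IQR-based outlier counts: points beyond 1.5 * IQR from the quartiles
    outliers = {}
    for c in cols:
        q1, q3 = df[c].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[c] < q1 - 1.5 * iqr) | (df[c] > q3 + 1.5 * iqr)
        outliers[c] = int(mask.sum())
    profile["outliers"] = outliers
    return profile
```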

3. statistical_test

Perform statistical hypothesis testing

@dataclass
class StatTestRequest:
    dataset_id: str
    test_type: str  # t_test, chi_square, anova, regression
    columns: List[str]
    groupby_column: Optional[str] = None
    hypothesis: Optional[str] = None
    alpha: float = 0.05
    alternative: str = "two-sided"

4. create_visualization

Generate statistical plots and charts

@dataclass
class VisualizationRequest:
    dataset_id: str
    plot_type: str  # histogram, scatter, box, heatmap, time_series
    x_column: str
    y_column: Optional[str] = None
    color_column: Optional[str] = None
    facet_column: Optional[str] = None
    title: Optional[str] = None
    save_format: str = "png"  # png, svg, html
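A sketch of the scatter case with Matplotlib's headless Agg backend, returning the rendered bytes rather than writing to disk (the function name and return convention are assumptions; the real server would also honor facet_column and the configured theme):

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: no display required on the server
import matplotlib.pyplot as plt
import pandas as pd

def render_scatter(df, x_column, y_column, color_column=None, title=None,
                   save_format="png"):
    fig, ax = plt.subplots()
    if color_column is not None:
        # One series per category so each gets its own color and legend entry
        for key, grp in df.groupby(color_column):
            ax.scatter(grp[x_column], grp[y_column], label=str(key))
        ax.legend(title=color_column)
    else:
        ax.scatter(df[x_column], df[y_column])
    ax.set_xlabel(x_column)
    ax.set_ylabel(y_column)
    if title:
        ax.set_title(title)
    buf = io.BytesIO()
    fig.savefig(buf, format=save_format)
    plt.close(fig)  # free the figure so long-running servers don't leak memory
    return buf.getvalue()
```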

5. transform_data

Apply data transformations and cleaning

@dataclass
class TransformRequest:
    dataset_id: str
    operations: List[Dict[str, Any]]
    create_new_dataset: bool = False
    new_dataset_id: Optional[str] = None
    
# Example operations:
# {"type": "drop_na", "columns": ["col1", "col2"]}
# {"type": "fill_na", "columns": ["col1"], "method": "mean"}
# {"type": "scale", "columns": ["col1", "col2"], "method": "standard"}
# {"type": "encode_categorical", "columns": ["category"], "method": "one_hot"}
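The operation dicts above suggest a small interpreter loop over the list. A hedged sketch covering the four example operation types (dispatch structure and helper name are assumptions):

```python
import pandas as pd

def apply_operations(df, operations):
    df = df.copy()  # never mutate the cached source dataset in place
    for op in operations:
        cols = op.get("columns", [])
        if op["type"] == "drop_na":
            df = df.dropna(subset=cols)
        elif op["type"] == "fill_na":
            for c in cols:
                fill = df[c].mean() if op.get("method") == "mean" else df[c].median()
                df[c] = df[c].fillna(fill)
        elif op["type"] == "scale" and op.get("method") == "standard":
            for c in cols:
                # Standard scaling: zero mean, unit variance
                df[c] = (df[c] - df[c].mean()) / df[c].std()
        elif op["type"] == "encode_categorical" and op.get("method") == "one_hot":
            df = pd.get_dummies(df, columns=cols)
        else:
            raise ValueError(f"unknown operation: {op['type']}")
    return df
```

Validating the operation dicts (e.g. via Pydantic models, already a dependency) before executing them would also serve the "safe evaluation of transformation operations" security requirement below.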

6. time_series_analysis

Analyze time series data patterns and trends

@dataclass
class TimeSeriesRequest:
    dataset_id: str
    time_column: str
    value_columns: List[str]
    frequency: Optional[str] = None  # D, W, M, Q, Y
    operations: Optional[List[str]] = None  # trend, seasonal, forecast
    forecast_periods: int = 12
    confidence_intervals: bool = True
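The full implementation would lean on statsmodels (decomposition, forecasting), but the trend operation can be illustrated with a plain pandas rolling mean. Function name and the window choice are illustrative assumptions:

```python
import pandas as pd

def trend_component(df, time_column, value_column, window=3):
    # Index by time so the rolling window follows chronological order
    s = df.sort_values(time_column).set_index(time_column)[value_column]
    # Centered rolling mean as a simple trend estimate; min_periods=1 keeps
    # the series full-length at the edges
    return s.rolling(window, center=True, min_periods=1).mean()
```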

7. query_data

SQL-like querying of loaded datasets

@dataclass
class DataQueryRequest:
    dataset_id: str
    query: str  # SQL-like syntax
    limit: Optional[int] = 1000
    offset: int = 0
    return_format: str = "json"  # json, csv, html
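For simple filters, pandas' own `DataFrame.query` expression syntax could back this tool before reaching for a real SQL parser. A minimal sketch (function name and paging semantics are assumptions):

```python
import pandas as pd

def query_dataset(df, query, limit=1000, offset=0, return_format="json"):
    result = df.query(query) if query else df
    # Apply offset/limit paging on the filtered rows
    page = result.iloc[offset: offset + (limit or len(result))]
    if return_format == "json":
        return page.to_dict(orient="records")
    if return_format == "csv":
        return page.to_csv(index=False)
    return page.to_html(index=False)
```

Note that `DataFrame.query` only covers row filtering, not joins or aggregates; full SQL-like syntax is what the `utils/query_parser.py` module in the structure below would provide.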

Implementation Requirements

Directory Structure

mcp-servers/python/data_analysis_server/
├── src/
│   └── data_analysis_server/
│       ├── __init__.py
│       ├── server.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── data_loader.py
│       │   ├── analyzer.py
│       │   └── transformer.py
│       ├── statistics/
│       │   ├── __init__.py
│       │   ├── descriptive.py
│       │   ├── hypothesis_tests.py
│       │   └── time_series.py
│       ├── visualization/
│       │   ├── __init__.py
│       │   ├── plots.py
│       │   └── charts.py
│       ├── storage/
│       │   ├── __init__.py
│       │   └── dataset_manager.py
│       └── utils/
│           ├── __init__.py
│           └── query_parser.py
├── tests/
├── requirements.txt
├── pyproject.toml
├── README.md
├── examples/
│   ├── sales_analysis.py
│   ├── time_series_example.py
│   └── statistical_testing.py
└── sample_data/
    ├── sales_data.csv
    ├── stock_prices.csv
    └── customer_data.json

Dependencies

# requirements.txt
mcp>=1.0.0
pandas>=2.1.0
numpy>=1.24.0
scipy>=1.11.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.13.0
plotly>=5.17.0
statsmodels>=0.14.0
pyarrow>=14.0.0  # for parquet support
sqlalchemy>=2.0.0
openpyxl>=3.1.0  # for Excel support
requests>=2.31.0
pydantic>=2.5.0

Configuration

# config.yaml
server:
  max_dataset_size: "1GB"
  cache_dir: "./data_cache"
  temp_dir: "./temp"
  
data_sources:
  allowed_protocols: ["http", "https", "file"]
  max_download_size: "500MB"
  timeout: 30
  
visualization:
  default_theme: "seaborn-v0_8"
  output_dir: "./plots"
  max_plot_points: 10000
  
statistics:
  default_confidence_level: 0.95
  max_categories: 50
  outlier_method: "iqr"  # iqr, zscore, isolation_forest
  
performance:
  chunk_size: 10000
  parallel_processing: true
  max_workers: 4

Usage Examples

Basic Data Analysis Workflow

# Load sales data
await mcp_client.call_tool("load_dataset", {
    "source": "./data/sales_2023.csv",
    "format": "csv",
    "dataset_id": "sales_data",
    "cache_data": True
})

# Analyze the dataset
analysis = await mcp_client.call_tool("analyze_dataset", {
    "dataset_id": "sales_data",
    "analysis_type": "exploratory",
    "include_distributions": True,
    "include_correlations": True
})

# Create visualization
viz = await mcp_client.call_tool("create_visualization", {
    "dataset_id": "sales_data",
    "plot_type": "scatter",
    "x_column": "price",
    "y_column": "quantity_sold",
    "color_column": "product_category",
    "title": "Price vs Quantity by Category"
})

Statistical Testing

# Load customer data
await mcp_client.call_tool("load_dataset", {
    "source": "./data/ab_test_results.csv", 
    "format": "csv",
    "dataset_id": "ab_test"
})

# Perform t-test
test_result = await mcp_client.call_tool("statistical_test", {
    "dataset_id": "ab_test",
    "test_type": "t_test",
    "columns": ["conversion_rate"],
    "groupby_column": "test_group",
    "hypothesis": "Group A != Group B",
    "alpha": 0.05
})

Time Series Analysis

# Load stock price data
await mcp_client.call_tool("load_dataset", {
    "source": "https://api.example.com/stock_data.json",
    "format": "json", 
    "dataset_id": "stock_prices"
})

# Time series analysis with forecasting
ts_analysis = await mcp_client.call_tool("time_series_analysis", {
    "dataset_id": "stock_prices",
    "time_column": "date",
    "value_columns": ["close_price"],
    "operations": ["trend", "seasonal", "forecast"],
    "forecast_periods": 30,
    "confidence_intervals": True
})

Data Transformation Pipeline

# Clean and transform data
transformed = await mcp_client.call_tool("transform_data", {
    "dataset_id": "raw_customer_data",
    "operations": [
        {"type": "drop_na", "columns": ["email", "age"]},
        {"type": "fill_na", "columns": ["income"], "method": "median"},
        {"type": "encode_categorical", "columns": ["region"], "method": "one_hot"},
        {"type": "scale", "columns": ["income", "age"], "method": "standard"}
    ],
    "create_new_dataset": True,
    "new_dataset_id": "clean_customer_data"
})

Advanced Features

  • Data Pipeline Automation: Chain multiple analysis operations
  • Interactive Dashboards: Generate web-based dashboards
  • Statistical Modeling: Advanced regression and classification
  • Anomaly Detection: Identify outliers and unusual patterns
  • Data Quality Assessment: Automated data quality scoring
  • Export Capabilities: Export results to various formats

Visualization Capabilities

  • Statistical Plots: Histograms, box plots, Q-Q plots
  • Correlation Matrices: Heatmaps and network graphs
  • Time Series Plots: Trends, seasonality, forecasts
  • Interactive Charts: Plotly-based interactive visualizations
  • Custom Styling: Configurable themes and styling options

Security Features

  • Data source validation and sandboxing
  • Query complexity limits
  • Memory usage monitoring
  • Safe evaluation of transformation operations
  • Audit logging for all data operations

Testing Requirements

  • Unit tests for all statistical functions
  • Integration tests with sample datasets
  • Performance tests with large datasets
  • Visualization output validation
  • Statistical accuracy verification

Acceptance Criteria

  • Python MCP server with 7+ data analysis tools
  • Support for multiple data formats (CSV, JSON, Parquet, SQL, Excel)
  • Comprehensive statistical analysis capabilities
  • Data visualization with multiple plot types
  • Time series analysis and forecasting
  • Data transformation and cleaning operations
  • SQL-like querying capabilities
  • Comprehensive test suite with sample data (>90% coverage)
  • Performance optimization for large datasets
  • Complete documentation with analysis examples

Priority

High - Demonstrates data science workflows essential for AI and analytics applications

Use Cases

  • Business intelligence and reporting
  • Data science experimentation
  • Statistical analysis and hypothesis testing
  • Data quality assessment
  • Exploratory data analysis (EDA)
  • Time series forecasting
  • A/B test analysis
  • Research data analysis

Labels

  • enhancement (New feature or request)
  • mcp-servers (MCP Server Samples)
  • oic (Open Innovation Community Contributions)
  • python (Python / backend development, FastAPI)