Description
Overview
Create a sample MCP Server in Python that provides comprehensive data analysis capabilities including statistical analysis, visualization, and data manipulation using pandas, numpy, and scientific computing libraries.
Server Specifications
Server Details
- Name: data-analysis-server
- Language: Python 3.11+
- Location: mcp-servers/python/data_analysis_server/
- Purpose: Demonstrate data science and analytics workflows via MCP
Core Features
- Data loading from multiple formats (CSV, JSON, Parquet, SQL)
- Statistical analysis and hypothesis testing
- Data visualization and plotting
- Data cleaning and transformation
- Time series analysis
- Machine learning pipeline integration
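The multi-format loading in the first feature could be a simple dispatch on the requested format. A minimal stdlib sketch (the `load_tabular` helper and its signature are hypothetical; a real loader would also cover Parquet, SQL, and Excel via pyarrow, SQLAlchemy, and openpyxl):

```python
import csv
import io
import json

def load_tabular(text: str, format: str) -> list[dict]:
    """Parse raw text into a list of row dicts, dispatching on format."""
    if format == "csv":
        # DictReader uses the first row as column names
        return list(csv.DictReader(io.StringIO(text)))
    if format == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {format}")

rows = load_tabular("price,qty\n9.5,3\n4.0,7\n", "csv")
```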
Tools Provided
1. load_dataset
Load data from various sources and formats

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class DataLoadRequest:
    source: str  # file path, URL, or SQL connection string
    format: str  # csv, json, parquet, sql, excel
    options: Optional[Dict[str, Any]] = None
    sample_size: Optional[int] = None
    cache_data: bool = True
    dataset_id: Optional[str] = None
```

2. analyze_dataset
Comprehensive dataset analysis and profiling
```python
@dataclass
class DataAnalysisRequest:
    dataset_id: str
    analysis_type: str  # descriptive, exploratory, correlation
    columns: Optional[List[str]] = None
    include_distributions: bool = True
    include_correlations: bool = True
    include_outliers: bool = True
    confidence_level: float = 0.95
```

3. statistical_test
Perform statistical hypothesis testing
```python
@dataclass
class StatTestRequest:
    dataset_id: str
    test_type: str  # t_test, chi_square, anova, regression
    columns: List[str]
    groupby_column: Optional[str] = None
    hypothesis: Optional[str] = None
    alpha: float = 0.05
    alternative: str = "two-sided"
```

4. create_visualization
Generate statistical plots and charts
```python
@dataclass
class VisualizationRequest:
    dataset_id: str
    plot_type: str  # histogram, scatter, box, heatmap, time_series
    x_column: str
    y_column: Optional[str] = None
    color_column: Optional[str] = None
    facet_column: Optional[str] = None
    title: Optional[str] = None
    save_format: str = "png"  # png, svg, html
```

5. transform_data
Apply data transformations and cleaning
```python
@dataclass
class TransformRequest:
    dataset_id: str
    operations: List[Dict[str, Any]]
    create_new_dataset: bool = False
    new_dataset_id: Optional[str] = None

# Example operations:
# {"type": "drop_na", "columns": ["col1", "col2"]}
# {"type": "fill_na", "columns": ["col1"], "method": "mean"}
# {"type": "scale", "columns": ["col1", "col2"], "method": "standard"}
# {"type": "encode_categorical", "columns": ["category"], "method": "one_hot"}
```

6. time_series_analysis
Analyze time series data patterns and trends
```python
@dataclass
class TimeSeriesRequest:
    dataset_id: str
    time_column: str
    value_columns: List[str]
    frequency: Optional[str] = None  # D, W, M, Q, Y
    operations: Optional[List[str]] = None  # trend, seasonal, forecast
    forecast_periods: int = 12
    confidence_intervals: bool = True
```

7. query_data
SQL-like querying of loaded datasets
```python
@dataclass
class DataQueryRequest:
    dataset_id: str
    query: str  # SQL-like syntax
    limit: Optional[int] = 1000
    offset: int = 0
    return_format: str = "json"  # json, csv, html
```

Implementation Requirements
Directory Structure
```text
mcp-servers/python/data_analysis_server/
├── src/
│   └── data_analysis_server/
│       ├── __init__.py
│       ├── server.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── data_loader.py
│       │   ├── analyzer.py
│       │   └── transformer.py
│       ├── statistics/
│       │   ├── __init__.py
│       │   ├── descriptive.py
│       │   ├── hypothesis_tests.py
│       │   └── time_series.py
│       ├── visualization/
│       │   ├── __init__.py
│       │   ├── plots.py
│       │   └── charts.py
│       ├── storage/
│       │   ├── __init__.py
│       │   └── dataset_manager.py
│       └── utils/
│           ├── __init__.py
│           └── query_parser.py
├── tests/
├── requirements.txt
├── pyproject.toml
├── README.md
├── examples/
│   ├── sales_analysis.py
│   ├── time_series_example.py
│   └── statistical_testing.py
└── sample_data/
    ├── sales_data.csv
    ├── stock_prices.csv
    └── customer_data.json
```
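The storage/dataset_manager.py module could be a thin registry that hands out `dataset_id` handles for the other tools to reference. A minimal in-memory sketch (class and method names are hypothetical; a real implementation would persist to the configured cache_dir and enforce max_dataset_size):

```python
from typing import Dict, List, Optional
import uuid

class DatasetManager:
    """In-memory registry mapping dataset_id -> rows."""

    def __init__(self) -> None:
        self._datasets: Dict[str, List[dict]] = {}

    def register(self, rows: List[dict], dataset_id: Optional[str] = None) -> str:
        # Caller-supplied ids (as in load_dataset) win; otherwise generate one
        dataset_id = dataset_id or uuid.uuid4().hex
        self._datasets[dataset_id] = rows
        return dataset_id

    def get(self, dataset_id: str) -> List[dict]:
        try:
            return self._datasets[dataset_id]
        except KeyError:
            raise KeyError(f"unknown dataset_id: {dataset_id}") from None

manager = DatasetManager()
ds_id = manager.register([{"x": 1}], dataset_id="sales_data")
```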
Dependencies
```text
# requirements.txt
mcp>=1.0.0
pandas>=2.1.0
numpy>=1.24.0
scipy>=1.11.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.13.0
plotly>=5.17.0
statsmodels>=0.14.0
pyarrow>=14.0.0  # for parquet support
sqlalchemy>=2.0.0
openpyxl>=3.1.0  # for Excel support
requests>=2.31.0
pydantic>=2.5.0
```

Configuration
```yaml
# config.yaml
server:
  max_dataset_size: "1GB"
  cache_dir: "./data_cache"
  temp_dir: "./temp"

data_sources:
  allowed_protocols: ["http", "https", "file"]
  max_download_size: "500MB"
  timeout: 30

visualization:
  default_theme: "seaborn-v0_8"
  output_dir: "./plots"
  max_plot_points: 10000

statistics:
  default_confidence_level: 0.95
  max_categories: 50
  outlier_method: "iqr"  # iqr, zscore, isolation_forest

performance:
  chunk_size: 10000
  parallel_processing: true
  max_workers: 4
```

Usage Examples
Basic Data Analysis Workflow
```python
# Load sales data
await mcp_client.call_tool("load_dataset", {
    "source": "./data/sales_2023.csv",
    "format": "csv",
    "dataset_id": "sales_data",
    "cache_data": True
})

# Analyze the dataset
analysis = await mcp_client.call_tool("analyze_dataset", {
    "dataset_id": "sales_data",
    "analysis_type": "exploratory",
    "include_distributions": True,
    "include_correlations": True
})

# Create visualization
viz = await mcp_client.call_tool("create_visualization", {
    "dataset_id": "sales_data",
    "plot_type": "scatter",
    "x_column": "price",
    "y_column": "quantity_sold",
    "color_column": "product_category",
    "title": "Price vs Quantity by Category"
})
```

Statistical Testing
```python
# Load customer data
await mcp_client.call_tool("load_dataset", {
    "source": "./data/ab_test_results.csv",
    "format": "csv",
    "dataset_id": "ab_test"
})

# Perform t-test
test_result = await mcp_client.call_tool("statistical_test", {
    "dataset_id": "ab_test",
    "test_type": "t_test",
    "columns": ["conversion_rate"],
    "groupby_column": "test_group",
    "hypothesis": "Group A != Group B",
    "alpha": 0.05
})
```

Time Series Analysis
```python
# Load stock price data
await mcp_client.call_tool("load_dataset", {
    "source": "https://api.example.com/stock_data.json",
    "format": "json",
    "dataset_id": "stock_prices"
})

# Time series analysis with forecasting
ts_analysis = await mcp_client.call_tool("time_series_analysis", {
    "dataset_id": "stock_prices",
    "time_column": "date",
    "value_columns": ["close_price"],
    "operations": ["trend", "seasonal", "forecast"],
    "forecast_periods": 30,
    "confidence_intervals": True
})
```

Data Transformation Pipeline
```python
# Clean and transform data
transformed = await mcp_client.call_tool("transform_data", {
    "dataset_id": "raw_customer_data",
    "operations": [
        {"type": "drop_na", "columns": ["email", "age"]},
        {"type": "fill_na", "columns": ["income"], "method": "median"},
        {"type": "encode_categorical", "columns": ["region"], "method": "one_hot"},
        {"type": "scale", "columns": ["income", "age"], "method": "standard"}
    ],
    "create_new_dataset": True,
    "new_dataset_id": "clean_customer_data"
})
```

Advanced Features
- Data Pipeline Automation: Chain multiple analysis operations
- Interactive Dashboards: Generate web-based dashboards
- Statistical Modeling: Advanced regression and classification
- Anomaly Detection: Identify outliers and unusual patterns
- Data Quality Assessment: Automated data quality scoring
- Export Capabilities: Export results to various formats
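The anomaly detection feature's default "iqr" method (per the config above) reduces to a quartile fence. A minimal stdlib sketch, assuming the conventional k = 1.5 multiplier (`iqr_outliers` is a hypothetical helper; the real server would likely delegate to pandas/scikit-learn):

```python
import statistics

def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) yields the three quartile cut points
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```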
Visualization Capabilities
- Statistical Plots: Histograms, box plots, Q-Q plots
- Correlation Matrices: Heatmaps and network graphs
- Time Series Plots: Trends, seasonality, forecasts
- Interactive Charts: Plotly-based interactive visualizations
- Custom Styling: Configurable themes and styling options
Security Features
- Data source validation and sandboxing
- Query complexity limits
- Memory usage monitoring
- Safe evaluation of transformation operations
- Audit logging for all data operations
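"Safe evaluation of transformation operations" can be enforced by validating each operation against a fixed allowlist before anything executes, so the server never evaluates arbitrary code. A sketch (the `validate_operations` helper is hypothetical; the allowlist mirrors the example operations above):

```python
ALLOWED_OPS = {"drop_na", "fill_na", "scale", "encode_categorical"}

def validate_operations(operations: list[dict]) -> None:
    """Reject unknown or malformed transform operations up front."""
    for op in operations:
        op_type = op.get("type")
        if op_type not in ALLOWED_OPS:
            raise ValueError(f"operation not allowed: {op_type!r}")
        if not isinstance(op.get("columns"), list):
            raise ValueError(f"'columns' must be a list in {op!r}")
```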
Testing Requirements
- Unit tests for all statistical functions
- Integration tests with sample datasets
- Performance tests with large datasets
- Visualization output validation
- Statistical accuracy verification
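A unit test for a statistical function might verify outputs against hand-computed values. A sketch, assuming a hypothetical `describe` helper standing in for statistics/descriptive.py:

```python
import statistics
import unittest

def describe(values: list[float]) -> dict:
    """Toy descriptive-stats helper: mean and sample standard deviation."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}

class TestDescribe(unittest.TestCase):
    def test_known_values(self):
        # For [2, 4, 6]: mean = 4, sample variance = 4, stdev = 2
        result = describe([2.0, 4.0, 6.0])
        self.assertAlmostEqual(result["mean"], 4.0)
        self.assertAlmostEqual(result["stdev"], 2.0)
```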
Acceptance Criteria
- Python MCP server with 7+ data analysis tools
- Support for multiple data formats (CSV, JSON, Parquet, SQL, Excel)
- Comprehensive statistical analysis capabilities
- Data visualization with multiple plot types
- Time series analysis and forecasting
- Data transformation and cleaning operations
- SQL-like querying capabilities
- Comprehensive test suite with sample data (>90% coverage)
- Performance optimization for large datasets
- Complete documentation with analysis examples
Priority
High - Demonstrates data science workflows essential for AI and analytics applications
Use Cases
- Business intelligence and reporting
- Data science experimentation
- Statistical analysis and hypothesis testing
- Data quality assessment
- Exploratory data analysis (EDA)
- Time series forecasting
- A/B test analysis
- Research data analysis
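The A/B test analysis use case, for example, reduces to a two-sample comparison. A minimal stdlib sketch of Welch's t statistic (`welch_t` is a hypothetical helper; the real server would use scipy.stats.ttest_ind, which also returns the p-value):

```python
import math
import statistics

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for two independent samples with unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))  # standard error of the mean difference
    return (statistics.fmean(a) - statistics.fmean(b)) / se
```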