Proposal: Add pd.check(df) utility function for quick dataset diagnostics

### Feature Type

- [x] Adding new functionality to pandas

- [ ] Changing existing functionality in pandas

- [ ] Removing existing functionality in pandas


### Problem Description

While working with pandas DataFrames during exploratory data analysis (EDA), analysts frequently perform the same manual steps to understand their dataset:

- Count null and non-null values
- Check unique value counts
- Estimate missing percentages

These operations are often repeated multiple times, especially after data cleaning, filtering, or merging. Currently, users rely on combinations like:
```
df.isnull().sum()
df.nunique()
df.notnull().sum()
```
There is no single built-in pandas utility that offers this all-in-one diagnostic view.
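For illustration, the manual workflow described above might look like the following. This is a minimal sketch: the sample data and the column labels (`unique`, `non_null`, `missing`, `missing_pct`) are assumptions chosen for this example, not part of any pandas API.

```python
import numpy as np
import pandas as pd

# Small sample frame with missing values (illustrative data only)
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", None, "Oslo"],
        "temp": [3.5, np.nan, np.nan, 4.0],
    }
)

# Stitch the separate per-column results together by hand
manual = pd.concat(
    {
        "unique": df.nunique(),          # distinct non-null values
        "non_null": df.notnull().sum(),  # count of present values
        "missing": df.isnull().sum(),    # count of missing values
    },
    axis=1,
)
# Fraction of missing rows per column, expressed as a percentage
manual["missing_pct"] = (df.isnull().mean() * 100).round(2)
print(manual)
```

This boilerplate has to be repeated, or wrapped in a private helper, every time the data changes.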

### Feature Description

Add a utility function pd.check(df) that returns a concise column-wise summary of a DataFrame’s structure, including:

- Unique values per column
- Non-null value counts
- Missing value counts
- Missing percentages (rounded to 2 decimals by default)

This function is designed to streamline early-stage exploratory data analysis by combining multiple common pandas operations into one reusable utility.

Suggested API:
```
def check(df: pd.DataFrame, round_digits: int = 2) -> pd.DataFrame:
    ...
```
- Optional round_digits parameter to control percentage precision
- Returns a pandas DataFrame
- No side effects (no printing)
- Aligns well with existing summary methods like `DataFrame.describe()`
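One way the suggested signature could be filled in is sketched below. `check` does not exist in pandas, and the output column names are assumptions made for this example; the linked package may differ in its details.

```python
import pandas as pd


def check(df: pd.DataFrame, round_digits: int = 2) -> pd.DataFrame:
    """Hypothetical sketch of the proposed summary (not a pandas API)."""
    is_missing = df.isna()
    return pd.DataFrame(
        {
            "unique": df.nunique(),            # distinct non-null values
            "non_null": df.notna().sum(),      # present values per column
            "missing": is_missing.sum(),       # missing values per column
            # mean of a boolean mask is the fraction of missing rows
            "missing_pct": (is_missing.mean() * 100).round(round_digits),
        }
    )


# Usage: one row per input column, no printing, input left untouched
sample = pd.DataFrame({"a": [1, 1, None], "b": ["x", None, None]})
result = check(sample)
print(result)
```

Returning a DataFrame indexed by column name keeps the result composable: it can be filtered or sorted like any other frame, for example `result[result["missing"] > 0]`.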

### Alternative Solutions

There are existing pandas functions like:

- `df.info()` – shows non-null counts and data types
- `df.describe()` – provides statistical summaries (numeric columns only by default)
- `df.isnull().sum()` – shows missing values per column
- `df.nunique()` – shows unique counts

However, none of these provide a combined summary in a single DataFrame format. Users must manually combine several operations, which can be repetitive and error-prone.
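As a rough illustration of the gap, the closest built-in call is probably `df.describe(include="all")`. The sketch below, using made-up sample data, shows that it reports non-null counts but fills `unique` only for non-numeric columns and has no missing-value row at all.

```python
import numpy as np
import pandas as pd

# Illustrative data: one text column and one numeric column, each with a gap
df = pd.DataFrame({"city": ["Oslo", "Lima", None], "temp": [3.5, np.nan, 4.0]})

desc = df.describe(include="all")

# 'count' is the non-null count for every column
print(desc.loc["count"])

# 'unique' is populated for the text column but NaN for the numeric one
print(desc.loc["unique"])

# There is no row for missing counts or missing percentages
print("missing" in desc.index)
```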

Third-party options:

**pandas-profiling** and **sweetviz** offer full data profiling, but they are heavyweight, generate HTML reports, and are not ideal for lightweight inspection or script-based pipelines.

My package [pandas_eda_check](https://pypi.org/project/pandas-eda-check/) implements this specific summary cleanly and could be a minimal addition to pandas.

### Additional Context

Why in pandas?

- Aligns with pandas’ mission of being a one-stop shop for tabular data operations
- Adds convenience and consistency to common EDA workflows
- Minimal overhead and easy to implement
- Could serve as a precursor to a more comprehensive eda submodule in the future

Reference Implementation

I've implemented this in an open-source utility here:
🔗 https://github.com/CS-Ponkoj/pandas_eda_check

PyPI: https://pypi.org/project/pandas-eda-check/

Open to Feedback

I’d love to hear from the maintainers and community about:

- Whether this function aligns with pandas’ philosophy
- Suggestions to improve the API or return format

If the proposal is accepted, I’m happy to submit a PR with tests and docs.

Thanks for your time and consideration.

Ponkoj Shill
PhD Candidate, ML Engineer
Email: [csponkoj@gmail.com](mailto:csponkoj@gmail.com)

Proposal: Add pd.check(df) utility function for quick dataset diagnostics #61691
