
Proposal: Add pd.check(df) utility function for quick dataset diagnostics #61691

Open
@CS-Ponkoj

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

While working with pandas DataFrames during exploratory data analysis (EDA), analysts frequently perform the same manual steps to understand their dataset:

  • Count null and non-null values
  • Check unique value counts
  • Estimate missing percentages

These operations are often repeated multiple times, especially after data cleaning, filtering, or merging. Currently, users rely on combinations like:

df.isnull().sum()   # missing values per column
df.nunique()        # unique values per column
df.notnull().sum()  # non-null counts per column

There is no single built-in pandas utility that offers this all-in-one diagnostic view.

Feature Description

Add a utility function pd.check(df) that returns a concise column-wise summary of a DataFrame’s structure, including:

  • Unique values per column
  • Non-null value counts
  • Missing value counts
  • Missing percentages (rounded to 2 decimals by default)

This function is designed to streamline early-stage exploratory data analysis by combining several common pandas operations into a single reusable utility.
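For illustration, the output might look like the following on a small frame (the exact result column labels here are assumptions for this sketch, not a settled format):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, None, 4], "b": ["x", "x", None, None]})

# Hypothetical result of pd.check(df): one row per column of df
#    unique  non_null  missing  missing_pct
# a       3         3        1         25.0
# b       1         2        2         50.0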

Suggested API:
def check(df: pd.DataFrame, round_digits: int = 2) -> pd.DataFrame: ...

  • Optional round_digits parameter to control percentage precision
  • Returns a pandas DataFrame
  • No side effects (no printing)
  • Aligns well with existing summary methods such as DataFrame.describe()
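A minimal sketch of one possible implementation, built entirely from existing DataFrame methods (again, the result column names are assumptions, not part of the proposal):

import pandas as pd

def check(df: pd.DataFrame, round_digits: int = 2) -> pd.DataFrame:
    # One row per column of df; each diagnostic becomes a column of the result
    return pd.DataFrame(
        {
            "unique": df.nunique(),
            "non_null": df.notnull().sum(),
            "missing": df.isnull().sum(),
            "missing_pct": (df.isnull().mean() * 100).round(round_digits),
        }
    )

Because each diagnostic is a Series indexed by the frame's columns, the DataFrame constructor aligns them automatically, and the function has no side effects such as printing.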

Alternative Solutions

There are existing pandas functions like:

  • df.info() – shows non-null counts and data types
  • df.describe() – provides statistical summaries (numeric columns only by default)
  • df.isnull().sum() – shows missing values per column
  • df.nunique() – shows unique counts

However, none of these provide a combined summary in a single DataFrame format. Users must manually combine several operations, which can be repetitive and error-prone.

Third-party options:

pandas-profiling and sweetviz offer full data profiling, but they are heavyweight, generate HTML reports, and are not well suited to lightweight inspection or script-based pipelines.

My package pandas_eda_check implements this specific summary cleanly and could be a minimal addition to pandas.

Additional Context

Why in pandas?

  • Aligns with pandas’ mission of being a one-stop shop for tabular data operations
  • Adds convenience and consistency to common EDA workflows
  • Minimal overhead and easy to implement
  • Could serve as a precursor to a more comprehensive eda submodule in the future

Reference Implementation

I've implemented this in an open-source utility here:
🔗 https://github.com/CS-Ponkoj/pandas_eda_check

PyPI: https://pypi.org/project/pandas-eda-check/

Open to Feedback

I’d love to hear from the maintainers and community about:

  • Whether this function aligns with pandas’ philosophy
  • Suggestions to improve the API or return format
  • If accepted, I’m happy to submit a PR with tests and docs

Thanks for your time and consideration.

Ponkoj Shill
PhD Candidate, ML Engineer
Email: [email protected]
