Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
While working with pandas DataFrames during exploratory data analysis (EDA), analysts frequently perform the same manual steps to understand their dataset:
- Count null and non-null values
- Check unique value counts
- Estimate missing percentages
These operations are often repeated multiple times, especially after data cleaning, filtering, or merging. Currently, users rely on combinations like:
df.isnull().sum()
df.nunique()
df.notnull().sum()
There is no single built-in pandas utility that offers this all-in-one diagnostic view.
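For illustration, the manual approach usually ends up looking something like the sketch below. The sample DataFrame, its column names, and the summary column labels are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "age": [25, np.nan, 31, 25],
    "city": ["Dhaka", "Pune", None, "Dhaka"],
})

# Three separate calls, each returning a Series indexed by column name
missing = df.isnull().sum()
unique = df.nunique()
non_null = df.notnull().sum()

# Stitching them together by hand into one table
summary = pd.concat(
    [unique, non_null, missing],
    axis=1,
    keys=["unique", "non_null", "missing"],  # assumed labels
)
summary["missing_pct"] = (summary["missing"] / len(df) * 100).round(2)
print(summary)
```

Each step is simple, but the whole sequence has to be retyped or copied every time the data changes.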
Feature Description
Add a utility function pd.check(df) that returns a concise column-wise summary of a DataFrame’s structure, including:
- Unique values per column
- Non-null value counts
- Missing value counts
- Missing percentages (rounded to 2 decimals by default)
This function is designed to streamline early-stage exploratory data analysis by combining several common pandas operations into one reusable utility.
Suggested API:
def check(df: pd.DataFrame, round_digits: int = 2) -> pd.DataFrame: ...
- Optional round_digits parameter to control percentage precision
- Returns a pandas DataFrame
- No side effects (no printing)
- Aligns well with existing summary methods like df.describe()
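A minimal sketch of how the suggested signature might be implemented using only existing pandas calls. The output column names (unique, non_null, missing, missing_pct) and the sample data are assumptions made for illustration; they are not fixed by this proposal and may differ from what pandas_eda_check returns.

```python
import numpy as np
import pandas as pd


def check(df: pd.DataFrame, round_digits: int = 2) -> pd.DataFrame:
    """Return a column-wise summary of unique, non-null, and missing values."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "unique": df.nunique(),          # distinct non-null values per column
        "non_null": df.notnull().sum(),
        "missing": missing,
    })
    # Share of missing rows per column, as a percentage
    summary["missing_pct"] = (missing / len(df) * 100).round(round_digits)
    return summary


# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "age": [25, np.nan, 31, 25],
    "city": ["Dhaka", "Pune", None, "Dhaka"],
})
result = check(df)
print(result)
```

Returning a DataFrame indexed by column name keeps the result easy to sort, filter, or join with other per-column metadata.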
Alternative Solutions
There are existing pandas functions like:
- df.info() – shows non-null counts and data types
- df.describe() – provides statistical summaries (numeric columns only by default)
- df.isnull().sum() – shows missing values per column
- df.nunique() – shows unique counts
However, none of these provide a combined summary in a single DataFrame format. Users must manually combine several operations, which can be repetitive and error-prone.
Third-party options:
pandas-profiling and sweetviz offer full data profiling, but they are heavyweight, generate HTML reports, and are not ideal for lightweight inspection or script-based pipelines.
My package pandas_eda_check implements this specific summary cleanly and could be a minimal addition to pandas.
Additional Context
Why in pandas?
- Aligns with pandas’ mission of being a one-stop shop for tabular data operations
- Adds convenience and consistency to common EDA workflows
- Minimal overhead and easy to implement
- Could serve as a precursor to a more comprehensive eda submodule in the future
Reference Implementation
I've implemented this in an open-source utility here:
🔗 https://github.com/CS-Ponkoj/pandas_eda_check
PyPI: https://pypi.org/project/pandas-eda-check/
Open to Feedback
I’d love to hear from the maintainers and community about:
- Whether this function aligns with pandas’ philosophy
- Suggestions to improve API or return format
If the proposal is accepted, I’m happy to submit a PR with tests and docs.
Thanks for your time and consideration.
Ponkoj Shill
PhD Candidate, ML Engineer