
Conversation

@tylerriccio33
Contributor

Summary

This method checks the percentage of null values in a column of a dataset. I added some tests and documentation. I'm seeking review, since navigating tolerance can be confusing.

I can add examples for doctests, if you guys use those?

The big thing missing is the icon, which prevents usage with get_tabular_report. See the xfail test I added, which will pass once the icon is in place. Is there some software you use to generate those? Thanks.
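For reference, here is a minimal sketch of what an xfail-marked test for this can look like (the col_vals_pct_null() signature and its tolerance argument are placeholders, not the exact code in the PR):

import pointblank as pb
import polars as pl
import pytest

@pytest.mark.xfail(reason="no step icon yet, so the tabular report cannot be built")
def test_pct_null_tabular_report():
    data = pl.DataFrame({"value": [1, None, 3, None]})

    validation = (
        pb.Validate(data=data)
        .col_vals_pct_null(columns="value", tolerance=0.5)  # hypothetical signature
        .interrogate()
    )

    # Expected to fail until the step icon is added
    validation.get_tabular_report()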

Related GitHub Issues and PRs

Checklist

  • [x] I understand and agree to the Code of Conduct.
  • [x] I have followed the Style Guide for Python Code as best as possible for the submitted code.
  • [x] I have added pytest unit tests for any new functionality.

@tylerriccio33
Contributor Author

CI passed except for pre-commit, which failed on formatting, but you can just override that. Maybe my ruff version is out of sync.

@tylerriccio33
Contributor Author

Any update here?

@rich-iannone
Member

Sorry again for the delay in responding. After reviewing the PR, I want to discuss an alternative approach that I think might work better for this. The main point is that Pointblank already has the tools to accomplish what col_vals_pct_null() does, by combining existing validation methods with our composable features.

Option 1: Using col_vals_not_null() with thresholds

Since you want to check that a certain percentage of values are null/None, we can flip this around and check that a certain percentage are not null, using the threshold system:

import pointblank as pb
import polars as pl

# Sample data with 20% null values
data = pl.DataFrame({
    "id": range(1, 11),
    "value": [1, 2, None, 4, 5, None, 7, 8, 9, 10]
})

# Check that AT MOST 20% are null (i.e., at least 80% are not null)
# We set the threshold to allow up to 20% failure
validation = (
    pb.Validate(data=data)
    .col_vals_not_null(
        columns="value",
        thresholds=0.20,  # Allow up to 20% to fail (be null)
        brief="Value column should have at most 20% null values"
    )
    .interrogate()
)

validation
(screenshot of the validation report)

Option 2: Using pre= with existing comparison methods

For more fine-grained control (like checking if the null/None percentage is within a specific range, below a fixed value, etc.), you can use a preprocessing function with existing validation methods:

def get_null_pct(tbl):
    """Calculate percentage of null values in 'value' column."""
    import narwhals as nw
    df = nw.from_native(tbl)
    total = df.select(nw.len()).item()
    n_null = df.select(nw.col("value").is_null().sum()).item()
    
    # Return a single-row, single-column table with the percentage
    return nw.from_native(pl.DataFrame({"null_pct": [n_null / total]}))

validation = (
    pb.Validate(data=data)
    .col_vals_lt(
        columns="null_pct",
        value=0.10,
        pre=get_null_pct,
        thresholds=pb.Thresholds(critical=1),
        brief="Null percentage should be less than 10%"
    )
    .col_vals_between(
        columns="null_pct",
        left=0.15,
        right=0.25,
        pre=get_null_pct,
        thresholds=pb.Thresholds(critical=1),
        brief="Null percentage should be between 15% and 25%"
    )
    .interrogate()
)

validation
(screenshot of the validation report)

The composable nature of pre= combined with our existing validation methods gives us a lot of flexibility:

  • it prevents method proliferation: we could create specialized methods for every possible column statistic (.pct_null(), .pct_duplicate(), .pct_outliers(), etc.), but this would lead to an explosion of similar methods
  • more expressive: the pre= approach makes it crystal clear what's being calculated and validated
  • consistent with the R version of Pointblank: we're working toward feature parity with the R version of Pointblank first (in terms of validation methods), then we can explore additional validation types
  • flexible: need to check the null percentage against dynamic thresholds, or combine it with other metrics? The pre= approach handles it all (see the sketch below)
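
For example, the same pre= pattern can be parameterized and reused across columns and comparison methods. This is just an illustrative sketch (the pct_null() factory and the "other" column are made up for the example):

import narwhals as nw
import polars as pl
import pointblank as pb

def pct_null(column: str):
    """Build a pre= function that returns the null percentage for `column`."""
    def _pre(tbl):
        df = nw.from_native(tbl)
        total = df.select(nw.len()).item()
        n_null = df.select(nw.col(column).is_null().sum()).item()
        return nw.from_native(pl.DataFrame({"null_pct": [n_null / total]}))
    return _pre

data = pl.DataFrame({
    "value": [1, 2, None, 4, 5, None, 7, 8, 9, 10],
    "other": [None, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# One helper, reused with different columns and different cutoffs
validation = (
    pb.Validate(data=data)
    .col_vals_lt(columns="null_pct", value=0.25, pre=pct_null("value"))
    .col_vals_lt(columns="null_pct", value=0.15, pre=pct_null("other"))
    .interrogate()
)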

Thanks for putting the work into the PR (I really value your contributions to the project), but I think we should leave this particular PR out. Please let me know your thoughts on this whole approach; I'm hoping you'll find it reasonable!
