
Conversation

@tylerriccio33
Contributor

Summary

This method checks the percentage of null values in a column of a dataset. I added some tests and documentation. I'm seeking review, since navigating tolerance can be confusing.

I can add examples for doctests, if you guys use those?

The big thing missing is the icon, which prevents usage with get_tabular_report. See the xfail test I added, which will pass once the icon is in place. Is there some software you use to generate those? Thanks.
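For reference, here is a minimal sketch of what an xfail-marked test for this can look like (the col_vals_pct_null() signature and its tolerance argument are placeholders, not the exact code in the PR):

import pointblank as pb
import polars as pl
import pytest

@pytest.mark.xfail(reason="no step icon yet, so the tabular report cannot be built")
def test_pct_null_tabular_report():
    data = pl.DataFrame({"value": [1, None, 3, None]})

    validation = (
        pb.Validate(data=data)
        .col_vals_pct_null(columns="value", tolerance=0.5)  # hypothetical signature
        .interrogate()
    )

    # Expected to fail until the step icon is added
    validation.get_tabular_report()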

Related GitHub Issues and PRs

Checklist

  • [x] I understand and agree to the Code of Conduct.
  • [x] I have followed the Style Guide for Python Code as best as possible for the submitted code.
  • [x] I have added pytest unit tests for any new functionality.

@tylerriccio33
Contributor Author

CI passed except for pre-commit, which failed on formatting, but you can just override that. Maybe my ruff version is out of sync.

@tylerriccio33
Contributor Author

Any update here?

@rich-iannone
Member

Sorry again for the delay in responding. After reviewing the PR, I want to discuss an alternative approach that I think might work better for this. The main point is that Pointblank already has the tools to accomplish what col_vals_pct_null() does, by combining existing validation methods with our composable features.

Option 1: Using col_vals_not_null() with thresholds

Since you want to check that a certain percentage of values are null/None, we can flip this around and check that a certain percentage are not null, using the threshold system:

import pointblank as pb
import polars as pl

# Sample data with 20% null values
data = pl.DataFrame({
    "id": range(1, 11),
    "value": [1, 2, None, 4, 5, None, 7, 8, 9, 10]
})

# Check that AT MOST 20% are null (i.e., at least 80% are not null)
# We set the threshold to allow up to 20% failure
validation = (
    pb.Validate(data=data)
    .col_vals_not_null(
        columns="value",
        thresholds=0.20,  # Allow up to 20% to fail (be null)
        brief="Value column should have at most 20% null values"
    )
    .interrogate()
)

validation
(screenshot of the validation report)

Option 2: Using pre= with existing comparison methods

For more fine-grained control (like checking if the null/None percentage is within a specific range, below a fixed value, etc.), you can use a preprocessing function with existing validation methods:

def get_null_pct(tbl):
    """Calculate percentage of null values in 'value' column."""
    import narwhals as nw
    df = nw.from_native(tbl)
    total = df.select(nw.len()).item()
    n_null = df.select(nw.col("value").is_null().sum()).item()
    
    # Return a single-row, single-column table with the percentage
    return nw.from_native(pl.DataFrame({"null_pct": [n_null / total]}))

validation = (
    pb.Validate(data=data)
    .col_vals_lt(
        columns="null_pct",
        value=0.10,
        pre=get_null_pct,
        thresholds=pb.Thresholds(critical=1),
        brief="Null percentage should be less than 10%"
    )
    .col_vals_between(
        columns="null_pct",
        left=0.15,
        right=0.25,
        pre=get_null_pct,
        thresholds=pb.Thresholds(critical=1),
        brief="Null percentage should be between 15% and 25%"
    )
    .interrogate()
)

validation
(screenshot of the validation report)

The composable nature of pre= combined with our existing validation methods gives us a lot of flexibility:

  • it prevents method proliferation: we could create specialized methods for every possible column statistic (.pct_null(), .pct_duplicate(), .pct_outliers(), etc.), but this would lead to an explosion of similar methods
  • more expressive: the pre= approach makes it crystal clear what's being calculated and validated
  • consistent with the R version of Pointblank: we're working toward feature parity with the R version of Pointblank first (in terms of validation methods), then we can explore additional validation types
  • flexible: need to check the null percentage against dynamic thresholds, or combine it with other metrics? The pre= approach handles it all (see the sketch below)
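
For example, the same pre= pattern can be parameterized and reused across columns and comparison methods. This is just an illustrative sketch (the pct_null() factory and the "other" column are made up for the example):

import narwhals as nw
import polars as pl
import pointblank as pb

def pct_null(column: str):
    """Build a pre= function that returns the null percentage for `column`."""
    def _pre(tbl):
        df = nw.from_native(tbl)
        total = df.select(nw.len()).item()
        n_null = df.select(nw.col(column).is_null().sum()).item()
        return nw.from_native(pl.DataFrame({"null_pct": [n_null / total]}))
    return _pre

data = pl.DataFrame({
    "value": [1, 2, None, 4, 5, None, 7, 8, 9, 10],
    "other": [None, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# One helper, reused with different columns and different cutoffs
validation = (
    pb.Validate(data=data)
    .col_vals_lt(columns="null_pct", value=0.25, pre=pct_null("value"))
    .col_vals_lt(columns="null_pct", value=0.15, pre=pct_null("other"))
    .interrogate()
)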

Thanks for putting the work into the PR (I really value your contributions to the project), but I think we should leave this particular PR out. Please let me know your thoughts on this whole approach; I'm hoping you'll find it reasonable!
