modified TableDescription formatting #616

AnirudhVIyer · 2023-06-13T20:53:55Z

## Describe your changes

Modified the TableDescription class so that we now:

- Check the data type of each column (numeric or categorical)
- Check whether a column contains numerical data in str form (this creates NaN issues) and issue a warning
- Calculate mean, min, max, count, std, and percentiles for numeric data
- Calculate unique, top, and frequency for categorical data
- Apply styling to indicate columns with a data mismatch
- Format values so we don't encounter scientific notation, e.g. 1.621e+1
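To illustrate the formatting point, here is a minimal sketch of fixed-point rendering. The helper name `format_stat` is hypothetical and is not taken from the PR's code; the default of 4 decimals follows the commit message "fixed table vals to 4 decimal points".

```python
def format_stat(value, decimals=4):
    """Render floats in fixed-point notation; pass other values through.

    Hypothetical helper for illustration, not the function used in the PR.
    """
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return value
    if isinstance(value, int):
        return str(value)
    return f"{value:.{decimals}f}"


# Python's default string conversion switches to scientific notation
# for very large (or very small) floats.
print(str(1e16))
print(format_stat(1e16))
print(format_stat(16.21))
```

Non-numeric cells such as `top` for a categorical column are returned unchanged, so the same helper can be applied to every cell of the table.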

Now the profile command behaves similarly to the pandas describe function.

Jupysql profile

Pandas profile

## Issue number

Closes #459

## Checklist before requesting a review

- Performed a self-review of my code
- Formatted my code with pkgmt format
- Added tests (when necessary).
- Added docstring documentation and update the changelog (when needed)

📚 Documentation preview 📚: https://jupysql--616.org.readthedocs.build/en/616/

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1204482537184202

AnirudhVIyer · 2023-06-13T22:56:59Z

@edublancas
I checked how pandas handles profiling with the describe method and tried to make our TableDescription class behave the same way.

The reason we see a lot of NaN values is the different operations we run on categorical vs. numerical data. Pandas handles this in the following way:

| Type | Operations |
| --- | --- |
| Categorical | `top`, `unique`, `count`, `freq`; the rest are NaN |
| Numerical | `min`, `max`, `25%`, `50%`, `75%`, `mean`, `unique`, `count`, `stdev`; the rest are NaN |

I have replicated that here. There is also an issue when numerical (int/float) data is stored as a string: that data gets treated as categorical data.
In pandas there is no warning for this.
We show a message and highlight such a column with a different colour.
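The split between categorical and numerical statistics described in this comment can be sketched with the standard library alone. This is an illustration under stated assumptions, not the TableDescription implementation: `describe_column` and `STAT_NAMES` are hypothetical names, `None` stands in for the NaN cells, and the column is assumed to hold at least two values.

```python
import statistics
from collections import Counter

# Hypothetical list of statistics shown in the profile table
STAT_NAMES = ["count", "unique", "top", "freq", "mean", "std",
              "min", "25%", "50%", "75%", "max"]


def describe_column(values):
    """Summarize one column; statistics that do not apply stay None."""
    summary = dict.fromkeys(STAT_NAMES)  # every statistic starts as None
    summary["count"] = len(values)
    summary["unique"] = len(set(values))

    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        # Quartiles with linear interpolation between data points
        q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary.update({
            "mean": statistics.fmean(values),
            "std": statistics.stdev(values),
            "min": min(values),
            "25%": q1, "50%": q2, "75%": q3,
            "max": max(values),
        })
    else:
        # Most frequent value and how often it appears
        summary["top"], summary["freq"] = Counter(values).most_common(1)[0]
    return summary


numbers = describe_column([82, 93, 75, 60])
countries = describe_column(["usa", "uk", "usa"])
print(numbers)
print(countries)
```

A column of numeric strings such as `["48", "50"]` takes the categorical branch here, which is the situation the warning is meant to flag.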

edublancas

looks great overall!

can we change the red background for yellow? Also, let's add some message below the table explaining why we have the yellow background.

something like "columns A, B, C are categorical, hence median, mean, cannot be computed"

AnirudhVIyer · 2023-06-14T13:50:50Z

> looks great overall!
>
> can we change the red background for yellow? Also, let's add some message below the table explaining why we have the yellow background.
>
> something like "columns A, B, C are categorical, hence median, mean, cannot be computed"

Sure! making the changes

AnirudhVIyer · 2023-06-14T18:14:30Z

@edublancas made the changes

edublancas · 2023-06-15T18:39:29Z

please merge from master, I think the error you're getting got fixed already

@tonykploomber: is the error caused by the mariadb image problem you encountered? because the logs suggest the problem is with oracle

tonykploomber · 2023-06-15T18:50:31Z

It seems the latest build has this oracledb error too; however, it does not affect the testing flow.

I believe rebasing onto the latest master will solve the issue; however, we still need to check what is going on with oracledb.

edublancas · 2023-06-15T18:51:44Z

@tonykploomber: can you open a dummy PR to master? let's see if things are working. otherwise we'll need to fix oracle as well

AnirudhVIyer · 2023-06-15T18:53:10Z

I will rebase and try again.

src/sql/inspect.py

AnirudhVIyer · 2023-06-21T16:11:12Z

@edublancas ready for review

edublancas

I tried testing this and I was expecting the island column to be yellow and to see the warning message since island is a string:

(using the penguins.csv dataset we download in the quick start)

am I doing something wrong?

please share the data you used for your tests so I can check it out

AnirudhVIyer · 2023-06-21T20:06:04Z

> I tried testing this and I was expecting the island column to be yellow and to see the warning message since island is a string:
> (using the penguins.csv dataset we download in the quick start)
> am I doing something wrong?
>
> please share the data you used for your tests so I can check it out

Hey, so the island column in the penguins dataset is made up of strings and is a categorical column (it is not made up of integers). We show a column in yellow and add a warning message only if the column is made of strings but contains numerical values (like ints/floats), because in that case we cannot properly query the statistics through SQL.

In the churn.csv file, for example, the totalCharges column represents integer values, but they are stored as strings.

you can test it on either churn.csv file or this mock dataset:

```sql
%%sql sqlite://
CREATE TABLE people (
    name varchar(50),
    age varchar(50),
    number int,
    country varchar(50),
    gender_1 varchar(50),
    gender_2 varchar(50)
);
INSERT INTO people VALUES ('joe', '48', 82, 'usa', '0', 'male');
INSERT INTO people VALUES ('paula', '50', 93, 'uk', '1', 'female');
```

```
%sqlcmd profile -t people
```
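For readers who want to see which columns of this mock dataset would count as "strings that contain numbers", here is a rough sketch using Python's built-in `sqlite3` module. It is not the check implemented in the PR; `is_number` and `numeric_string_columns` are hypothetical helpers, and the rule used (every non-NULL value is a str that parses as a float) is an assumption.

```python
import sqlite3


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def numeric_string_columns(conn, table):
    """List columns whose non-NULL values are all numeric-looking strings."""
    flagged = []
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    for column in conn.execute(f"PRAGMA table_info({table})").fetchall():
        name = column[1]
        rows = conn.execute(
            f'SELECT "{name}" FROM {table} WHERE "{name}" IS NOT NULL'
        ).fetchall()
        values = [row[0] for row in rows]
        if values and all(isinstance(v, str) and is_number(v) for v in values):
            flagged.append(name)
    return flagged


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (name varchar(50), age varchar(50), number int,
                         country varchar(50), gender_1 varchar(50),
                         gender_2 varchar(50));
    INSERT INTO people VALUES ('joe', '48', 82, 'usa', '0', 'male');
    INSERT INTO people VALUES ('paula', '50', 93, 'uk', '1', 'female');
    """
)
print(numeric_string_columns(conn, "people"))
```

With the two rows above, `age` and `gender_1` are flagged, while `number` is a real integer column and the remaining columns hold non-numeric text.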

edublancas

ok, I think I did something wrong because now when I profile penguins.csv I see it in yellow:

suggestion: can we make the yellow background softer? (the current one is too strong) the yellow one you show in the warning looks better, let's highlight the columns with the same color:

also, note that the warning shows.

`gender` <- backticks

can we change the backticks for something else? perhaps we can use

<code>column_name</code>
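A minimal sketch of how a warning could wrap column names in `<code>` tags instead of backticks. The helper `mismatch_warning` and the message wording are hypothetical, not the text used in the PR.

```python
from html import escape


def mismatch_warning(columns):
    """Build an HTML warning naming the flagged columns (hypothetical wording)."""
    if not columns:
        return ""
    # Escape names so characters like "<" cannot break the surrounding HTML
    names = ", ".join(f"<code>{escape(name)}</code>" for name in columns)
    return (
        f"Columns {names} store numeric values as strings; "
        "statistics such as mean cannot be computed for them."
    )


print(mismatch_warning(["gender"]))
```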

AnirudhVIyer · 2023-06-21T21:03:30Z

> ok, I think I did something wrong because now when I profile penguins.csv I see it in yellow:
>
> suggestion: can we make the yellow background softer? (the current one is too strong) the yellow one you show in the warning looks better, let's highlight the columns with the same color:
>
> also, note that the warning shows.
> `gender` <- backticks
> can we change the backticks for something else? perhaps we can use
> <code>column_name</code>

Making the changes to the styling, don't know why penguins.csv has the yellow columns now, runs fine locally. I will check it out.

edublancas · 2023-06-21T21:06:20Z

> Making the changes to the styling, don't know why penguins.csv has the yellow columns now, runs fine locally. I will check it out.

I think it's good that the island column in the penguins dataset is displayed on yellow, since we can't compute the statistics on strings. so let's keep it (if it's a bug, let's make it a feature 😂)

having said that, we can also display a message like:

the island column is a string type. cannot calculate a,b,c,d...

AnirudhVIyer · 2023-06-22T15:47:54Z

> > Making the changes to the styling, don't know why penguins.csv has the yellow columns now, runs fine locally. I will check it out.
>
> I think it's good that the island column in the penguins dataset is displayed on yellow, since we can't compute the statistics on strings. so let's keep it (if it's a bug, let's make it a feature 😂)
>
> having said that, we can also display a message like:
> the island column is a string type. cannot calculate a,b,c,d...

Hey @edublancas, I fixed the bug (it was a global CSS styling issue). We can add styling for categorical columns, but I think it would make the table look a bit shabby if there are many such columns.

For instance, in the churn dataset most of the columns are categorical, and for those we calculate top, freq, unique and count. So I am not sure we should add styling to categorical columns.

AnirudhVIyer added 2 commits June 13, 2023 16:43

modified TableDescription formatting

7da38b1

added tests and fixed table vals to 4 decimal points

49ecf81

AnirudhVIyer marked this pull request as ready for review June 13, 2023 22:48

AnirudhVIyer requested a review from edublancas as a code owner June 13, 2023 22:48

edublancas requested changes Jun 14, 2023


edublancas requested a review from mehtamohit013 June 14, 2023 04:23

AnirudhVIyer added 5 commits June 14, 2023 10:44

warning-colour changes to yellow, output message generated

879413f

format and lint message generation

559641b

fixed linting errors for message

941375d

message grammar check

c843cb6

profile message logic updated

e7a65bb

AnirudhVIyer requested a review from edublancas June 14, 2023 18:14

edublancas requested a review from neelasha23 June 15, 2023 18:38

Merge branch 'ploomber:master' into 459-sqlcmd-profile-improvements

f920df9