ENH/VIS: Pass DataFrame column to size argument in df.scatter #8244

Closed
TomAugspurger opened this issue Sep 11, 2014 · 8 comments
Comments

@TomAugspurger
Contributor

You can already kind of do this by passing in a numpy array:

In [83]: df = pd.DataFrame(np.random.randn(100, 2))

In [84]: df['z'] = np.random.uniform(size=(100))

In [85]: df.plot(kind='scatter', x=0, y=1, s=df.z.values * 1000)
Out[85]: <matplotlib.axes._subplots.AxesSubplot at 0x11a81df60>

[Image: the resulting scatter plot]

But when I merge #7780 (coloring by column) it would be natural (and awesome) to do df.plot(kind='scatter', x='x', y='y', c='color', s='size')

Shouldn't be too hard if we're willing.

@TomAugspurger TomAugspurger added this to the Someday milestone Sep 11, 2014
@shoyer
Member

shoyer commented Sep 11, 2014

The reason I didn't do this in #7780 is that, unlike coloring by column, "size" needs to be in the right units for the result to look reasonable. So we would need to invent another argument (e.g., s_scale) to convert the column's values to a sensible size in printer points. We could pick some sort of sane default based on the statistics of the "size" column. It would possibly be worth looking at how ggplot handles this.
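
To make the "sane default based on the statistics" idea concrete, here is a minimal sketch and not a proposed implementation. The helper name default_s_scale and the target_median_area parameter are hypothetical. The value 36 is an assumption, chosen because it matches the default scatter marker area in recent matplotlib releases (lines.markersize squared). The sketch picks a multiplier so that the median of the column lands on that area.

```python
from statistics import median

def default_s_scale(values, target_median_area=36.0):
    """Hypothetical helper: pick a multiplier so the median value maps to target_median_area."""
    # Use magnitudes so that negative values do not produce a negative scale.
    magnitudes = [abs(float(v)) for v in values]
    typical = median(magnitudes)
    if typical == 0:
        # No usable statistic to scale by; fall back to leaving values unchanged.
        return 1.0
    return target_median_area / typical

# Usage (assumption: df has a numeric column 'z'):
#   scale = default_s_scale(df['z'])
#   df.plot(kind='scatter', x=0, y=1, s=df['z'].abs().values * scale)
```

A median-based default is only one option; it ignores the spread of the column, so heavily skewed data could still produce very large or very small markers.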

@jorisvandenbossche
Member

@TomAugspurger Something else: which matplotlib style did you use in the plot above? I think the plots in our docs should look like that! If it is a style that you can express in rcParams, then we could update https://github.com/pydata/pandas/blob/master/pandas/tools/plotting.py#L34 (e.g. the grid lines -> white lines).

@shoyer
Member

shoyer commented Sep 11, 2014

@jorisvandenbossche This is the style you get from importing seaborn. Just running import seaborn should do it.

By the way, if you haven't tried Seaborn, you should definitely check it out. It has a very well thought out design (both the API and the graphics style).

@jorisvandenbossche
Member

Ah, OK. Yes, I know seaborn, but have not yet really used it. In any case, we could maybe copy some of the rcParams to update the style of the plots in our docs.

@onesandzeroes
Contributor

The seaborn style looks like it's just ggplot's default style. It's one of the built in styles in matplotlib 1.4, so if you wanted to use that for the docs, then you could just do:

import matplotlib.style
matplotlib.style.use('ggplot')

@onesandzeroes
Contributor

Also, at first glance the way ggplot handles this doesn't seem super complicated; it seems like it's all done here. So basically, it sets up a range between 1 and 6 (units are arbitrary, we'll just have to pick a range that looks good I guess) and normalizes the values to that range.

The main difference is that I think ggplot is scaling based on the radius, whereas matplotlib's scatter s argument sets the marker area (in points squared), so we might need to transform? There's a bit of discussion on SO here; the scaling in the second example looks quite good.
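
One way the radius-to-area transform could look, as a rough sketch under assumptions: values are normalized onto a ggplot-like (1, 6) range that is treated as a linear size (radius-like), then squared, since marker area grows with the square of linear size. The function name radius_scaled_sizes and the points_per_unit factor are hypothetical; the factor only converts the arbitrary (1, 6) units into something visible in points and would need tuning by eye.

```python
def radius_scaled_sizes(vals, radius_range=(1.0, 6.0), points_per_unit=4.0):
    """Hypothetical helper: map values linearly onto radius_range, then square to get areas."""
    vals = [float(v) for v in vals]
    lo, hi = min(vals), max(vals)
    span = hi - lo
    r_min, r_max = radius_range
    sizes = []
    for v in vals:
        # Constant input has no spread, so place every point mid-range.
        t = (v - lo) / span if span else 0.5
        # Linear (radius-like) size in points; points_per_unit is an assumed factor.
        linear = (r_min + t * (r_max - r_min)) * points_per_unit
        # Scatter sizes are areas, so square the linear size.
        sizes.append(linear ** 2)
    return sizes

# Usage (assumption: df2 has a numeric column 'x'):
#   df2.plot(kind='scatter', x='x', y='y', s=radius_scaled_sizes(df2['x']))
```

With this approach the user-facing numbers stay small, like ggplot's (1, 6), while the value passed to matplotlib is still an area.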

@onesandzeroes
Contributor

To me, the sizes seem pretty good if we just pick sensible defaults for the min and max point size, and then normalize the values to that range, e.g.:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

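def convert_to_points(vals, size_range=(50, 1000)):
    # Linearly rescale vals so that vals.min() -> min_size and vals.max() -> max_size.
    min_size, max_size = size_range
    val_range = vals.max() - vals.min()
    # Note: val_range is zero when all values are equal, so the division below is undefined.
    normalized_vals = (vals - vals.min()) / val_range
    point_sizes = min_size + normalized_vals * (max_size - min_size)
    return point_sizes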

df2 = pd.DataFrame({
    'x': np.linspace(0, 50, 6),
    'y': np.linspace(0, 20, 6)
})
df2.plot(kind='scatter', x='x', y='y', s=convert_to_points(df2.x.values))

[Image: point_scaling]

I can't claim to have the best eye for visual design though, so if anyone can suggest scaling methods that work better than a straight linear transform I'm happy to hear them. If the aim is to provide an argument that lets people adjust the min and max size up and down, it might also be nice to present the user with more sensible numbers, like ggplot does with its default (1, 6) range.

@jreback jreback changed the title ENH/VIS: Pass DataFrame column to size argument in df.scatter ENH/VIS: Pass DataFrame column to size argument in df.scatter Nov 24, 2014
@jreback jreback modified the milestones: 0.16.0, Someday Dec 6, 2014
@jreback jreback modified the milestones: 0.16.1, 0.16.0 Mar 5, 2015
@jreback jreback modified the milestones: 0.16.1, 0.17.0 Apr 28, 2015
@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@TomAugspurger
Contributor Author

Dupe of #16827

@TomAugspurger TomAugspurger modified the milestones: Contributions Welcome, No action Jul 6, 2018