add train_test_split function to DataFrame #6687


Closed
cloga wants to merge 1 commit

Conversation

@cloga commented Mar 22, 2014

Use a random sample to split a DataFrame into a train set and a test set, ready for cross validation.

@jreback (Contributor) commented Mar 22, 2014

this is ok but should be in pandas/util/testing.py

also need a test or 2 for this to validate it's working

@TomAugspurger (Contributor)

Does it belong in testing? I'd say it's fine for NDFrames. The train / test split is a machine learning thing used all the time for choosing models.

What's the status on https://github.com/paulgb/sklearn-pandas? It doesn't seem to be very active.

@TomAugspurger (Contributor)

Also, we should talk about how broadly we want to support machine learning / stats preprocessing. Why add this one but not every cross-validation method in http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cross_validation? The preprocessing methods there are also arguably useful, again depending on whether this is out of scope or not.

test_size = int(len(self) * test_rate)

if random_state:
    random.seed(random_state)
Contributor (inline review comment)

In pandas.util.testing there's a class RNGContext that's a context manager for the random state. I'd use that to return the user to their original state once you're done with the splitting, rather than modifying the global state.
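A rough sketch of that pattern (a hand-rolled stand-in for RNGContext written against NumPy's global state; illustrative only, not the PR's code):

import contextlib
import numpy as np

@contextlib.contextmanager
def rng_context(seed):
    # Save the global NumPy random state, seed it for the duration of the
    # with-block, and restore the original state afterwards so the caller's
    # RNG is left untouched.
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)

# Example: draw reproducible test-row positions without permanently reseeding.
with rng_context(42):
    test_positions = np.random.permutation(100)[:25]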

@TomAugspurger (Contributor)

Maybe we could put a bunch of these methods under pandas.stats, rather than adding even more methods to Series and DataFrame?

@cloga (Author) commented Mar 22, 2014

Actually, I didn't add this function to DataFrame for sklearn; sklearn already provides enough functions for machine learning, and in some cases I use sklearn and pandas together. For engineering work that is probably enough, and it is also efficient.
But when I use pandas and statsmodels together, I need this function to construct the train set and test set. I think some users like me, who are learning machine learning algorithms from a statistics point of view rather than an engineering point of view, will have the same concern. sklearn doesn't provide as many statistical metrics as statsmodels, and for me statsmodels is more pandas-friendly than sklearn; when I use pandas with sklearn I have to convert the pandas objects to ndarrays.
To be honest, I don't know whether this function is within the scope of pandas; maybe it belongs in statsmodels.

@jreback (Contributor) commented Mar 22, 2014

@TomAugspurger you raise a lot of good questions.
So how much pre-scikit-learn stuff should pandas support?

I agree that this should, at the very least, be a separate module,

pandas.stats.learn maybe.

The question is how much preprocessing pandas should support.

Do you want to send an email to the scikit-learn dev and pydata lists to get some feedback?

@jorisvandenbossche (Member)

I personally think this is a little bit out of scope for main pandas at this moment.

  • If we add it, I would at least make it more general, e.g. just a split function (that can be random or not), or a sample function; there are other reasons to split a dataframe in two apart from train/test datasets (a rough sketch follows below).
  • But indeed, maybe we should think more generally about whether this kind of function belongs in pandas, in sklearn/statsmodels/a separate project, or in a module within pandas. I also think we should be cautious about overloading DataFrame/Series with too many additional methods.

See also statsmodels/statsmodels#1498, which asks to add this to statsmodels.
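As a concrete illustration of the more general split/sample idea above (a hypothetical sketch; split_frame and its frac parameter are invented names, not an existing pandas API):

import numpy as np
import pandas as pd

def split_frame(df, frac=0.75, random_state=None):
    # Split a DataFrame into two pieces by drawing a random boolean mask;
    # `first` holds roughly `frac` of the rows, `second` holds the rest.
    rng = np.random.RandomState(random_state)
    mask = rng.rand(len(df)) < frac
    return df[mask], df[~mask]

# Example: an approximate 75/25 split usable as train/test, but equally
# usable for any other two-way partition of the rows.
df = pd.DataFrame({"x": range(10), "y": range(10)})
first, second = split_frame(df, frac=0.75, random_state=0)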

@jseabold (Contributor)

I tend to agree that this is out of scope, just because it's kind of a "so what?" proposition if all you have is pandas installed, though I don't feel all that strongly about it. Statsmodels could easily add pandas-aware CV tools from sklearn. I have many of them littered in my personal code; I just haven't gotten around to adding them to the project.

@hayd (Contributor) commented Mar 27, 2014

Don't we have an issue for splitting with a groupby? +1 on being unsure about scope.

You can get a sample with a groupby, which may be faster (e.g. a 10% sample):

g = df.groupby(np.random.randint(0, 10, (len(df),)) == 0, as_index=False)
train = g.get_group(False)
test = g.get_group(True)

tbh I would have thought that scikit-learn had some methods to do this (e.g. in an ensemble)...

...I kind of think the solution could be for scikit-learn etc. to support pandas objects; perhaps we could write a simple decorator to (if passed a frame/series) extract values, execute, and then (if appropriate) re-glue the index/columns to the result (a rough sketch is below). (Probably a non-starter for scikit-learn if there's a perf hit?)
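A minimal sketch of that decorator idea (purely illustrative; pandas_aware is an invented name, not an existing scikit-learn or pandas API):

import functools
import numpy as np
import pandas as pd

def pandas_aware(func):
    # Wrap a function that expects plain ndarrays so it also accepts a
    # DataFrame: extract .values, call the function, and re-attach the
    # index/columns when the output shape still matches the input.
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        if isinstance(data, pd.DataFrame):
            result = func(data.values, *args, **kwargs)
            if isinstance(result, np.ndarray) and result.shape == data.shape:
                return pd.DataFrame(result, index=data.index, columns=data.columns)
            return result
        return func(data, *args, **kwargs)
    return wrapper

@pandas_aware
def center(arr):
    # Works on a plain ndarray: subtract the column means.
    return arr - arr.mean(axis=0)

centered = center(pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}))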

@jtratner (Contributor)

Isn't that the equivalent of just boolean indexing though?

random.seed(random_state)
test_index = random.sample(self.index, test_size)
df_train = self.ix[test_index]
df_test = self.ix[[i for i in self.index if i not in test_index]]
Contributor (inline review comment)

You can do boolean indexing (as @jtratner mentions): self.ix[~test_index]
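For reference, a mask-based version of that split (an illustrative sketch; note the mask has to be a boolean array for ~ to negate it, rather than a list of labels):

import numpy as np
import pandas as pd

def mask_split(df, test_rate=0.25, random_state=None):
    # Mark test rows in a boolean mask, then index with the mask and its
    # negation instead of rebuilding a label list for the training rows.
    rng = np.random.RandomState(random_state)
    test_size = int(len(df) * test_rate)
    mask = np.zeros(len(df), dtype=bool)
    mask[rng.choice(len(df), size=test_size, replace=False)] = True
    return df[~mask], df[mask]  # (train, test)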

@hayd (Contributor) commented Mar 27, 2014

@jtratner ahem, it totally is!

@jreback (Contributor) commented Mar 27, 2014

should use iloc for the indexing
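A sketch of that suggestion: select rows by position with iloc from a shuffled range, rather than label-based .ix lookups (illustrative only; test_rate is an assumed parameter name):

import numpy as np
import pandas as pd

def iloc_split(df, test_rate=0.25, random_state=None):
    # Shuffle the row positions once, slice off the test positions, and
    # select both pieces with position-based iloc indexing.
    rng = np.random.RandomState(random_state)
    positions = rng.permutation(len(df))
    test_size = int(len(df) * test_rate)
    test_pos, train_pos = positions[:test_size], positions[test_size:]
    return df.iloc[train_pos], df.iloc[test_pos]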

@jreback (Contributor) commented Apr 22, 2014

Closing; out of scope for pandas.
