add train_test_split function to DataFrame #6687


Closed
cloga wants to merge 1 commit

Conversation

@cloga commented Mar 22, 2014

Use a random sample to split a DataFrame into a train set and a test set, ready for cross validation.

@jreback (Contributor) commented Mar 22, 2014

this is ok but should be in pandas/util/testing.py

also need a test or 2 for this to validate it's working

@TomAugspurger (Contributor)

Does it belong in testing? I'd say it's fine for NDFrames. The train / test split is a machine learning thing used all the time for choosing models.

What's the status on https://github.com/paulgb/sklearn-pandas? It doesn't seem to be very active.

@TomAugspurger (Contributor)

Also, we should talk about how broadly we want to support machine learning / stats preprocessing. Why add this one but not every cross-validation method in http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cross_validation? The preprocessing methods there are also arguably useful, again depending on whether this is out of scope or not.

test_size = int(len(self) * test_rate)

if random_state:
    random.seed(random_state)
Contributor (inline review comment)

In pandas.util.testing there's a class RNGContext that's a context manager for the random state. I'd use that to return the user to their original state once you're done with the splitting, rather than modifying the global state.
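A rough sketch of that pattern (a hand-rolled stand-in for RNGContext written against NumPy's global state; illustrative only, not the PR's code):

import contextlib
import numpy as np

@contextlib.contextmanager
def rng_context(seed):
    # Save the global NumPy random state, seed it for the duration of the
    # with-block, and restore the original state afterwards so the caller's
    # RNG is left untouched.
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)

# Example: draw reproducible test-row positions without permanently reseeding.
with rng_context(42):
    test_positions = np.random.permutation(100)[:25]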

@TomAugspurger (Contributor)

Maybe we could put a bunch of these methods under pandas.stats, rather than adding even more methods to Series and DataFrame?

@cloga (Author) commented Mar 22, 2014

Actually, I didn't add this function to DataFrame for sklearn; sklearn already provides enough functions for machine learning, and in some cases I use sklearn and pandas together. For engineering work that is probably enough, and it is also efficient.
But when I use pandas and statsmodels together, I need this function to construct the train set and test set. I think some users like me, who are learning machine learning algorithms from a statistics point of view rather than an engineering point of view, will have the same concern. sklearn doesn't provide as many statistical metrics as statsmodels, and for me statsmodels is more pandas-friendly than sklearn; when I use pandas with sklearn I have to convert the pandas objects to ndarrays.
To be honest, I don't know whether this function is within the scope of pandas; maybe it belongs in statsmodels.

@jreback (Contributor) commented Mar 22, 2014

@TomAugspurger you raise a lot of good questions.
So how much pre-scikit-learn stuff should pandas support?

I agree that this should, at the very least, be a separate module,

pandas.stats.learn maybe.

The question is how much preprocessing pandas should support.

Do you want to send an email to the scikit-learn dev and pydata lists to get some feedback?

@jorisvandenbossche (Member)

I personally think this is a little bit out of scope for main pandas at this moment.

  • If we add it, I would at least make it more general, e.g. just a split function (that can be random or not), or a sample function; there are other reasons to split a dataframe in two apart from train/test datasets (a rough sketch follows below).
  • But indeed, maybe we should think more generally about whether this kind of function belongs in pandas, in sklearn/statsmodels/a separate project, or in a module within pandas. I also think we should be cautious about overloading DataFrame/Series with too many additional methods.

See also statsmodels/statsmodels#1498, which asks to add this to statsmodels.
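As a concrete illustration of the more general split/sample idea above (a hypothetical sketch; split_frame and its frac parameter are invented names, not an existing pandas API):

import numpy as np
import pandas as pd

def split_frame(df, frac=0.75, random_state=None):
    # Split a DataFrame into two pieces by drawing a random boolean mask;
    # `first` holds roughly `frac` of the rows, `second` holds the rest.
    rng = np.random.RandomState(random_state)
    mask = rng.rand(len(df)) < frac
    return df[mask], df[~mask]

# Example: an approximate 75/25 split usable as train/test, but equally
# usable for any other two-way partition of the rows.
df = pd.DataFrame({"x": range(10), "y": range(10)})
first, second = split_frame(df, frac=0.75, random_state=0)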

@jseabold (Contributor)

I tend to agree that this is out of scope, just because it's kind of a "so what?" proposition if all you have is pandas installed, though I don't feel all that strongly about it. Statsmodels could easily add pandas-aware CV tools from sklearn. I have many of them littered in my personal code; I just haven't gotten around to adding them to the project.

@hayd (Contributor) commented Mar 27, 2014

Don't we have an issue for splitting with a groupby? +1 on being unsure about scope.

You can get a sample with a groupby, which may be faster (e.g. a 10% sample):

g = df.groupby(np.random.randint(0, 10, (len(df),)) == 0, as_index=False)
train = g.get_group(False)
test = g.get_group(True)

tbh I would have thought that scikit-learn had some methods to do this (e.g. in an ensemble)...

...I kind of think the solution could be for scikit-learn etc. to support pandas objects; perhaps we could write a simple decorator to (if passed a frame/series) extract values, execute, and then (if appropriate) re-glue the index/columns to the result (a rough sketch is below). (Probably a non-starter for scikit-learn if there's a perf hit?)
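A minimal sketch of that decorator idea (purely illustrative; pandas_aware is an invented name, not an existing scikit-learn or pandas API):

import functools
import numpy as np
import pandas as pd

def pandas_aware(func):
    # Wrap a function that expects plain ndarrays so it also accepts a
    # DataFrame: extract .values, call the function, and re-attach the
    # index/columns when the output shape still matches the input.
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        if isinstance(data, pd.DataFrame):
            result = func(data.values, *args, **kwargs)
            if isinstance(result, np.ndarray) and result.shape == data.shape:
                return pd.DataFrame(result, index=data.index, columns=data.columns)
            return result
        return func(data, *args, **kwargs)
    return wrapper

@pandas_aware
def center(arr):
    # Works on a plain ndarray: subtract the column means.
    return arr - arr.mean(axis=0)

centered = center(pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}))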

@jtratner (Contributor)

Isn't that the equivalent of just boolean indexing though?

random.seed(random_state)
test_index = random.sample(self.index, test_size)
df_train = self.ix[test_index]
df_test = self.ix[[i for i in self.index if i not in test_index]]
Contributor (inline review comment)

You can do boolean indexing (as @jtratner mentions): self.ix[~test_index]
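For reference, a mask-based version of that split (an illustrative sketch; note the mask has to be a boolean array for ~ to negate it, rather than a list of labels):

import numpy as np
import pandas as pd

def mask_split(df, test_rate=0.25, random_state=None):
    # Mark test rows in a boolean mask, then index with the mask and its
    # negation instead of rebuilding a label list for the training rows.
    rng = np.random.RandomState(random_state)
    test_size = int(len(df) * test_rate)
    mask = np.zeros(len(df), dtype=bool)
    mask[rng.choice(len(df), size=test_size, replace=False)] = True
    return df[~mask], df[mask]  # (train, test)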

@hayd (Contributor) commented Mar 27, 2014

@jtratner ahem, it totally is!

@jreback (Contributor) commented Mar 27, 2014

should use iloc for the indexing
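A sketch of that suggestion: select rows by position with iloc from a shuffled range, rather than label-based .ix lookups (illustrative only; test_rate is an assumed parameter name):

import numpy as np
import pandas as pd

def iloc_split(df, test_rate=0.25, random_state=None):
    # Shuffle the row positions once, slice off the test positions, and
    # select both pieces with position-based iloc indexing.
    rng = np.random.RandomState(random_state)
    positions = rng.permutation(len(df))
    test_size = int(len(df) * test_rate)
    test_pos, train_pos = positions[:test_size], positions[test_size:]
    return df.iloc[train_pos], df.iloc[test_pos]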

@jreback (Contributor) commented Apr 22, 2014

Closing; out of scope for pandas.
