add train_test_split function to DataFrame #6687
Conversation
Use random sampling to split a DataFrame into a training set and a test set, ready for cross-validation.
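Pieced together from the diff fragments below, the proposed usage would look roughly like this (method name, parameter names, and return order are inferred from the PR, not an existing pandas API):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100), "y": np.random.randn(100)})

# hypothetical call to the method this PR adds; `test_rate` and
# `random_state` are the parameter names visible in the diff below
df_train, df_test = df.train_test_split(test_rate=0.25, random_state=42)
```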
this is ok, but should be in pandas/util/testing.py; also need a test or 2 for this to validate it's working
Does it belong in testing? I'd say it's fine for NDFrames. The train / test split is a machine learning thing used all the time for choosing models. What's the status on https://github.com/paulgb/sklearn-pandas? It doesn't seem to be very active.
Also, we should talk about how broadly we want to support the machine learning / stats preprocessing stuff. Why add this one, but not every cross-validation method in http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cross_validation? The preprocessing methods there are also arguably useful, again depending on whether this is out of scope or not.
```python
test_size = int(len(self) * test_rate)

if random_state:
    random.seed(random_state)
```
In pandas.util.testing there's a class RNGContext that's a context manager for the random state. I'd use that to return the user to their original state once you're done with the splitting, rather than modifying the global state.
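A minimal sketch of that suggestion (RNGContext lives in pandas.util.testing and restores NumPy's global RNG state on exit; note the PR draws from the stdlib random module, so it would need to switch to np.random for this to apply):

```python
import numpy as np
import pandas.util.testing as tm

with tm.RNGContext(42):
    # draws in here are reproducible given the seed...
    test_positions = np.random.permutation(100)[:25]
# ...and the caller's global np.random state is restored afterwards
```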
Maybe we could put a bunch of these methods under
Actually, I added this function to DataFrame not for sklearn; sklearn already provides enough functions for machine learning. In some cases I use sklearn and pandas together, and for engineering work this may be enough and also highly efficient.
@TomAugspurger you raise a lot of good questions. I agree that this should maybe be a separate module, at the very least pandas.stats.learn. The question is how much preprocessing pandas should support. Want to send an email to the scikit-learn-dev and pydata lists to get some feedback?
I personally think this is a little bit out of scope for main pandas at this moment.
See also statsmodels/statsmodels#1498, which asks to add this to statsmodels.
I tend to agree that this is out of scope, just because it's kind of a "so what?" proposition if all you have is pandas installed, though I don't feel all that strongly about it. Statsmodels could easily add pandas-aware CV tools from sklearn. I have many of them littered in my personal code, just haven't gotten around to adding them to the project.
Don't we have an issue for splitting with a groupby? +1 on being unsure about scope. You can get a sample with a groupby, which may be faster (e.g. 10%):
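The snippet that followed this comment didn't survive extraction; one plausible reading, sketched under the assumption that the grouping key is a random boolean mask, is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(100)})

# group on a random boolean key; the True group holds ~10% of the rows
key = np.random.rand(len(df)) < 0.1
groups = dict(list(df.groupby(key)))
df_test, df_train = groups[True], groups[False]
```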
tbh I would have thought that scikit-learn had some methods to do this (e.g. in an ensemble)... I kind of think the solution could be for scikit-learn etc. to support pandas objects; perhaps we could write a simple decorator to (if passed a frame/series) extract values, execute, and then (if appropriate) reglue index/columns to the result. (Probably a non-starter for scikit-learn if there's a perf hit?)
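A rough sketch of that decorator idea (the name pandas_aware is hypothetical; it re-glues the index and columns only when the output shape still matches the input):

```python
import functools
import pandas as pd

def pandas_aware(func):
    # hypothetical: unwrap a frame/series to its values, call through,
    # and re-attach labels when the result shape allows it
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        if isinstance(data, pd.DataFrame):
            out = func(data.values, *args, **kwargs)
            if getattr(out, "shape", None) == data.shape:
                return pd.DataFrame(out, index=data.index, columns=data.columns)
            return out
        if isinstance(data, pd.Series):
            out = func(data.values, *args, **kwargs)
            if getattr(out, "shape", None) == data.shape:
                return pd.Series(out, index=data.index, name=data.name)
            return out
        return func(data, *args, **kwargs)
    return wrapper
```

Wrapping, say, a scikit-learn transform this way would hand back a labeled frame instead of a bare ndarray, which is the "reglue" step described above.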
Isn't that the equivalent of just boolean indexing though?
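Spelled out, the equivalence is just (a sketch, reusing df from above):

```python
import numpy as np

mask = np.random.rand(len(df)) < 0.1
df_test, df_train = df[mask], df[~mask]
```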
```python
random.seed(random_state)
test_index = random.sample(self.index, test_size)
df_train = self.ix[test_index]
df_test = self.ix[[i for i in self.index if i not in test_index]]
```
You can do boolean indexing (as @jtratner mentions): self.ix[~test_index]
@jtratner ahem, it totally is!
should use iloc for the indexing
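Folding the review comments together, a hypothetical rework of the split with positional indexing might look like:

```python
import numpy as np

def train_test_split(df, test_rate=0.25, random_state=None):
    # sketch only: permute positions, then slice with iloc as suggested;
    # a local RandomState avoids touching the global RNG state
    rng = np.random.RandomState(random_state)
    order = rng.permutation(len(df))
    test_size = int(len(df) * test_rate)
    return df.iloc[order[test_size:]], df.iloc[order[:test_size]]  # train, test
```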
closing..out of scope of pandas
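For reference, the split the thread defers to scikit-learn exists there as train_test_split, which accepts and returns DataFrames (shown with the modern module path, not the cross_validation module current at the time):

```python
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)
```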