Suggestion: method to slice strings using index columns (start and end) in dataframe #8748
Comments
Not sure I like it, but we could consider
@MarkInLabcoat can you give a code example here, i.e. something copy-pastable (and indicate where the syntax is wanted)? We mainly need to boil down the example so it's very clear what is needed/wanted.
There are really two enhancements here:
@jim22k the direct slicing on
But allowing it to be list-likes is indeed an enhancement request.
@jreback
As I said before, the
@jorisvandenbossche
I came across this issue while searching for a solution to a similar problem. Are there any updates on @MarkInLabcoat's enhancement request No. 2?
No, but pull requests are welcome.
Closing this for now. PRs welcome.
Here's an idea I came up with. Split the variable-offset slicing task into several slicing subtasks, each with its own fixed offset (first creating value counts of the various offsets encountered in the data). For each of these fixed offsets, taken in turn, we can handle all matching rows in a single call to the current, limited str.slice() method. In my test, this approach took around 1 second to process a 1-million-row Series of strings (with 8 different offset values).
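The snippet from this comment is not preserved here, so the following is only a minimal sketch of the idea as described, not the commenter's code. The helper name `slice_by_offset_groups` and its arguments are hypothetical. It batches rows that share the same (start, stop) pair and calls the existing `str.slice()` once per batch:

```python
import pandas as pd


def slice_by_offset_groups(strings, starts, stops):
    """Slice each string with its own bounds, batching rows that share bounds.

    Hypothetical helper: assumes `strings` has a unique index and that
    `starts`/`stops` line up with it row by row.
    """
    bounds = pd.DataFrame({"start": starts, "stop": stops}, index=strings.index)
    out = pd.Series(index=strings.index, dtype=object)  # starts out all-NaN
    # One vectorised str.slice() call per distinct (start, stop) pair.
    # Rows with a missing bound are skipped by groupby and stay NaN.
    for (start, stop), idx in bounds.groupby(["start", "stop"]).groups.items():
        out.loc[idx] = strings.loc[idx].str.slice(int(start), int(stop))
    return out


s = pd.Series(["abcdef", "ghijkl", "mnopqr"])
result = slice_by_offset_groups(s, pd.Series([0, 2, 0]), pd.Series([3, 4, 3]))
```

This only pays off when the number of distinct bound pairs is small relative to the number of rows, as in the 8-offset case mentioned above.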
There is a really fast way to do this.
This solution is the same as Psidom's solution in https://stackoverflow.com/a/45523050/6004997; however, I added the ability to use both a start and an end index.
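The code for this comment is also missing from the page. A plausible sketch, assuming the approach is a plain list comprehension over the zipped columns (the column names `string`, `start` and `end` follow the issue description; the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "string": ["abcdef", "ghijkl", "mnopqr"],
        "start": [0, 2, 1],
        "end": [3, 4, 6],
    }
)

# Slice each string with its own bounds in a single pass, using ordinary
# Python slicing and skipping the per-row overhead of DataFrame.apply.
df["sliced"] = [s[a:b] for s, a, b in zip(df["string"], df["start"], df["end"])]
```

This assumes the columns contain no missing values; a NaN in `string` or a float bound would raise a `TypeError`.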
What about implementing the following slice function? The df.start and df.end columns contain the start and end index required to slice the df.string.
Currently we can slice columns with a fixed start and end index
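For reference, fixed-bound slicing with the existing string accessor looks roughly like this (a small sketch with made-up data; the column name `string` follows the description above):

```python
import pandas as pd

df = pd.DataFrame({"string": ["abcdef", "ghijkl"]})

# Every row is sliced with the same start and stop.
fixed = df["string"].str.slice(1, 4)

# The indexing form of the accessor is equivalent.
fixed_alt = df["string"].str[1:4]
```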
However it would be great if we could do this using variable start and stop indices from the dataframe itself, without the need for lambda functions.
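The lambda-based workaround referred to here is presumably something along these lines (a sketch with made-up data, not the original example code):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "string": ["abcdef", "ghijkl", "mnopqr"],
        "start": [0, 2, 1],
        "end": [3, 4, 6],
    }
)

# Row-wise apply: works, but calls the lambda once per row, which is slow
# on large frames and fails if a bound is NaN.
df["sliced"] = df.apply(lambda row: row["string"][row["start"]:row["end"]], axis=1)
```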
Possible complications:
I can imagine that this would be complicated by the presence of NaN values in the column.
You could either force users to clean up their data first, so they can only apply the function if the column dtypes of the start and stop are integers (basically: take your dirty boots off before stepping into the house!).
Or you could be nice, and apply the slice function to anything in the target column that looks like a string, using anything in the start and end columns that looks like an integer (not that I would have a clue how to do that!). Under this strategy, NaN would be returned only when invalid strings, NaN or float bounds, or out-of-range indices are encountered.
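To make the second, lenient option concrete, here is one possible sketch. The helper `safe_slice` is hypothetical and not part of pandas, and treating out-of-range bounds as invalid is an assumption taken from the wording above (plain Python slicing would silently clip them instead):

```python
import numbers

import numpy as np
import pandas as pd


def safe_slice(value, start, stop):
    """Return value[start:stop], or NaN if any input is unusable (hypothetical helper)."""
    if not isinstance(value, str):
        return np.nan  # NaN, None or any other non-string value
    for bound in (start, stop):
        # Reject floats, NaN, None and booleans; accept Python and NumPy integers.
        if isinstance(bound, bool) or not isinstance(bound, numbers.Integral):
            return np.nan
    if not (0 <= start <= stop <= len(value)):
        return np.nan  # assumption: out-of-range bounds count as invalid
    return value[start:stop]


strings = ["abcdef", None, "xyz", "hello"]
starts = [1, 0, None, 2]
ends = [4, 2, 3, 99]
result = pd.Series(
    [safe_slice(s, a, b) for s, a, b in zip(strings, starts, ends)], dtype=object
)
```

Note that an integer column containing NaN is stored as float64 by default, so with this check every row of such a column would come back as NaN unless the bounds are held in a nullable integer or object column.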
This problem was raised along with #8747 in a StackOverflow question. Some code and examples are given.
http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram
edit: here is some example code, including a current workaround. Sorry, I'll make sure the code is included immediately next time.