Skip to content

Suggestion: method to slice strings using index columns (start and end) in dataframe #8748

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
teese opened this issue Nov 6, 2014 · 12 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Strings String extension data type and string data

Comments

@teese
Copy link

teese commented Nov 6, 2014

What about implementing the following slice function? The df.start and df.end columns contain the start and end index required to slice the df.string.

df.sliced = df.string[df.start:df.end]

Currently we can slice columns with a fixed start and end index

df.sliced = df.string.str.slice(1, -1)

However it would be great if we could do this using variable start and stop indices from the dataframe itself, without the need for lambda functions.

Possible complications:
I can imagine that this would be complicated by the presence of NaN values in the column.

You could either force users to clean up their data first, so they can only apply the function if the column dtypes of the start and stop are integers (basically: take your dirty boots off before stepping into the house!).

Or you could be nice, and apply the slice function to anything in the target column that looks like a string, using anything in the start and end columns that looks like an integer. (not that I would have a clue how to do that!) Using this strategy, return NaN only when invalid strings, NaN or floats, or index-out-of-range are encountered?

This problem was raised along with #8747 in a StackOverflow question. Some code and examples are given.
http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram

edit: here is some example code, including a current workaround. Sorry, I'll make sure the code is included immediately next time.

import pandas as pd
#some example strings for slicing, start indices, stop indices
h1,h2,h3 = 'MPTMGFWVYITVE','MP-NSSLVYIGLE','MPLETQDALYVAL' 
s1, s2, s3 = 1,1,2
e1, e2, e3 = 7,7,8
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'hit':[h1,h2,h3],'start': [s1, s2, s3],'end': [e1, e2, e3]})
#reorder columns
df = df[['hit', 'start', 'end']]

#HERE is where the new syntax would be useful to slice the strings, example
#df.sliced = df.hit.str[df.start:df.end] 
#or the equivalent df['sliced'] = df['hit'].str[df['start']:df['end']]

#Current workaround using a lambda function that specifies the start & end columns.
fn_slice_hit = lambda x : x['hit'][x['start']:x['end']]
#apply the slice function to the dataframe
df['sliced'] = df.apply(fn_slice_hit, axis = 1)

In [2]: df
Out[2]: 
             hit  start  end  sliced
0  MPTMGFWVYITVE      1    7  PTMGFW
1  MP-NSSLVYIGLE      1    7  P-NSSL
2  MPLETQDALYVAL      2    8  LETQDA

[3 rows x 4 columns]
@shoyer
Copy link
Member

shoyer commented Nov 6, 2014

Not sure I like it, but we could consider df.string.str[df.start:df.end].

df.string[df.start:df.end] it out because it conflicts with standard series indexing.

@jreback
Copy link
Contributor

jreback commented Nov 7, 2014

@MarkInLabcoat can you give a code-example here, e.g. copy/pastable (and indicate where the syntax is wanted). Mainly need to boil down the example so its very clear what is needed/wanted.

@jreback jreback added API Design Strings String extension data type and string data Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 7, 2014
@jim22k
Copy link

jim22k commented Nov 7, 2014

There are really two enhancements here:

  1. A new notation for string slicing df.string.str[1:-1] instead of df.string.str.slice(1, -1)
  2. Allowing the slice parameters start and stop to be Series objects rather than ints

@jorisvandenbossche
Copy link
Member

@jim22k the direct slicing on str already works!

In [1]: s = pd.Series(['abcde', 'fghij'])

In [2]: s
Out[2]:
0    abcde
1    fghij
dtype: object

In [3]: s.str[1:-1]
Out[3]:
0    bcd
1    ghi
dtype: object

But allowing it to be list-likes is indeed an enhancement request.

@teese
Copy link
Author

teese commented Nov 8, 2014

@jreback
Thanks for the tip. I added some example code.
@jorisvandenbossche
As you say, the enhancement request involves accepting a list-like (list or series) rather than an integer. Perhaps also a list of tuples containing the start and stop indices?
@jim22k
If the notation for slicing was changed to df.string.str[1:-1], this would be have an advantage in that it is compatible with stride df.string.str[1:-1: 2], using a similar syntax as a for single string.

@jorisvandenbossche
Copy link
Member

As I said before, the df.string.str[1:-1] notation already works, nothing to change there!
Although slicing with a stride does not work there, but that seems more like a bug.
UPDATE: opened an issue for that: #8754

@teese
Copy link
Author

teese commented Nov 8, 2014

@jorisvandenbossche
sorry, you're right, df.string.str[1:-1] already works.
However:
stride: df.string.str[0:5:2] gives the same output as df.string.str[0:5]
reverse: df.string.str[::-1] simply gives the original series as output. Another bug?
Are there other string functions that we haven't tested yet?

@hanfang
Copy link

hanfang commented Mar 21, 2016

I came across this issue when I was searching for solution of a similar problem. Just wondering if there are any updates on @MarkInLabcoat's enhancements request No.2?
2. Allowing the slice parameters start and stop to be Series objects rather than ints

@jreback
Copy link
Contributor

jreback commented Mar 21, 2016

no but pull requests are welcome

@jreback jreback added this to the Next Major Release milestone Mar 21, 2016
@wesm wesm added the Won't Fix label Jul 6, 2018
@wesm
Copy link
Member

wesm commented Jul 6, 2018

Closing this for now. PRs welcome

@wesm wesm closed this as completed Jul 6, 2018
@mirekphd
Copy link

mirekphd commented Jun 5, 2019

Here's an idea I came up with. Split the variable-offset slicing task into several slicing subtasks each with its own fixed offset (first creating value counts of the various offsets encountered in the data). For each of these fixed offsets taken in turn we can handle all rows in a single call to the current limited str.slice() method.

Here is a sample code snippet which takes around 1 sec to process a 1 million row Series of strings (with 8 different offset values):

# get the array of indexes of the searched substring
searched_substr_ind = cur_raw_data.str.find(searched_substr) 

# prepare the list of variable offsets
substr_offsets = pd.Series(searched_substr_ind).value_counts().index.astype("int64")
substr_offsets = substr_offsets[substr_offsets > 0]

# by default insert raw missings code:
cur_cleaned_data = pd.Series(np.repeat(raw_missings_code, len(cur_raw_data)))

# loop over all substring offsets 
# (caution: can get slow if having too many)
for cur_offset in substr_offsets:
        
    # identify rows with the current offset
    cur_offset_rows = np.where(searched_substr_ind == cur_offset)
    
    # for all the identified rows extract the slice using current offset
    cur_cleaned_data.iloc[cur_offset_rows] = \
        cur_raw_data.iloc[cur_offset_rows].str.slice(cur_offset+searched_substr_len, 
                                                     cur_offset+searched_substr_len+ret_substr_len)

print(pd.Series(cur_cleaned_data).value_counts())

@Quetzalcohuatl
Copy link

There is a really fast way to do this.

df['new_column'] = [A[B:C] for A, B, C in zip(df.A, df.start_index df.end_index)]

This solution is the same as Psidom's solution in https://stackoverflow.com/a/45523050/6004997 however I added ability to use start and end index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Strings String extension data type and string data
Projects
None yet
Development

No branches or pull requests

9 participants