Suggestion: method to slice strings using index columns (start and end) in dataframe #8748

teese · 2014-11-06T17:34:58Z

What about implementing the following slice function? The df.start and df.end columns contain the start and end index required to slice the df.string.

df.sliced = df.string[df.start:df.end]

Currently we can slice columns with a fixed start and end index

df.sliced = df.string.str.slice(1, -1)

However it would be great if we could do this using variable start and stop indices from the dataframe itself, without the need for lambda functions.

Possible complications:
I can imagine that this would be complicated by the presence of NaN values in the column.

You could either force users to clean up their data first, so they can only apply the function if the column dtypes of the start and stop are integers (basically: take your dirty boots off before stepping into the house!).

Or you could be nice, and apply the slice function to anything in the target column that looks like a string, using anything in the start and end columns that looks like an integer. (not that I would have a clue how to do that!) Using this strategy, return NaN only when invalid strings, NaN or floats, or index-out-of-range are encountered?

This problem was raised along with #8747 in a StackOverflow question. Some code and examples are given.
http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram

edit: here is some example code, including a current workaround. Sorry, I'll make sure the code is included immediately next time.

import pandas as pd
#some example strings for slicing, start indices, stop indices
h1,h2,h3 = 'MPTMGFWVYITVE','MP-NSSLVYIGLE','MPLETQDALYVAL' 
s1, s2, s3 = 1,1,2
e1, e2, e3 = 7,7,8
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'hit':[h1,h2,h3],'start': [s1, s2, s3],'end': [e1, e2, e3]})
#reorder columns
df = df[['hit', 'start', 'end']]

#HERE is where the new syntax would be useful to slice the strings, example
#df.sliced = df.hit.str[df.start:df.end] 
#or the equivalent df['sliced'] = df['hit'].str[df['start']:df['end']]

#Current workaround using a lambda function that specifies the start & end columns.
fn_slice_hit = lambda x : x['hit'][x['start']:x['end']]
#apply the slice function to the dataframe
df['sliced'] = df.apply(fn_slice_hit, axis = 1)

In [2]: df
Out[2]: 
             hit  start  end  sliced
0  MPTMGFWVYITVE      1    7  PTMGFW
1  MP-NSSLVYIGLE      1    7  P-NSSL
2  MPLETQDALYVAL      2    8  LETQDA

[3 rows x 4 columns]

The text was updated successfully, but these errors were encountered:

shoyer · 2014-11-06T18:58:21Z

Not sure I like it, but we could consider df.string.str[df.start:df.end].

df.string[df.start:df.end] it out because it conflicts with standard series indexing.

jreback · 2014-11-07T17:14:41Z

@MarkInLabcoat can you give a code-example here, e.g. copy/pastable (and indicate where the syntax is wanted). Mainly need to boil down the example so its very clear what is needed/wanted.

jim22k · 2014-11-07T19:17:52Z

There are really two enhancements here:

A new notation for string slicing df.string.str[1:-1] instead of df.string.str.slice(1, -1)
Allowing the slice parameters start and stop to be Series objects rather than ints

jorisvandenbossche · 2014-11-07T19:20:55Z

@jim22k the direct slicing on str already works!

In [1]: s = pd.Series(['abcde', 'fghij'])

In [2]: s
Out[2]:
0    abcde
1    fghij
dtype: object

In [3]: s.str[1:-1]
Out[3]:
0    bcd
1    ghi
dtype: object

But allowing it to be list-likes is indeed an enhancement request.

teese · 2014-11-08T16:18:09Z

@jreback
Thanks for the tip. I added some example code.
@jorisvandenbossche
As you say, the enhancement request involves accepting a list-like (list or series) rather than an integer. Perhaps also a list of tuples containing the start and stop indices?
@jim22k
If the notation for slicing was changed to df.string.str[1:-1], this would be have an advantage in that it is compatible with stride df.string.str[1:-1: 2], using a similar syntax as a for single string.

jorisvandenbossche · 2014-11-08T16:22:14Z

As I said before, the df.string.str[1:-1] notation already works, nothing to change there!
Although slicing with a stride does not work there, but that seems more like a bug.
UPDATE: opened an issue for that: #8754

teese · 2014-11-08T20:23:35Z

@jorisvandenbossche
sorry, you're right, df.string.str[1:-1] already works.
However:
stride: df.string.str[0:5:2] gives the same output as df.string.str[0:5]
reverse: df.string.str[::-1] simply gives the original series as output. Another bug?
Are there other string functions that we haven't tested yet?

hanfang · 2016-03-21T18:56:56Z

I came across this issue when I was searching for solution of a similar problem. Just wondering if there are any updates on @MarkInLabcoat's enhancements request No.2?
2. Allowing the slice parameters start and stop to be Series objects rather than ints

jreback · 2016-03-21T19:09:26Z

no but pull requests are welcome

wesm · 2018-07-06T21:49:52Z

Closing this for now. PRs welcome

mirekphd · 2019-06-05T09:26:03Z

Here's an idea I came up with. Split the variable-offset slicing task into several slicing subtasks each with its own fixed offset (first creating value counts of the various offsets encountered in the data). For each of these fixed offsets taken in turn we can handle all rows in a single call to the current limited str.slice() method.

Here is a sample code snippet which takes around 1 sec to process a 1 million row Series of strings (with 8 different offset values):

# get the array of indexes of the searched substring
searched_substr_ind = cur_raw_data.str.find(searched_substr) 

# prepare the list of variable offsets
substr_offsets = pd.Series(searched_substr_ind).value_counts().index.astype("int64")
substr_offsets = substr_offsets[substr_offsets > 0]

# by default insert raw missings code:
cur_cleaned_data = pd.Series(np.repeat(raw_missings_code, len(cur_raw_data)))

# loop over all substring offsets 
# (caution: can get slow if having too many)
for cur_offset in substr_offsets:
        
    # identify rows with the current offset
    cur_offset_rows = np.where(searched_substr_ind == cur_offset)
    
    # for all the identified rows extract the slice using current offset
    cur_cleaned_data.iloc[cur_offset_rows] = \
        cur_raw_data.iloc[cur_offset_rows].str.slice(cur_offset+searched_substr_len, 
                                                     cur_offset+searched_substr_len+ret_substr_len)

print(pd.Series(cur_cleaned_data).value_counts())

Quetzalcohuatl · 2020-01-30T20:23:56Z

There is a really fast way to do this.

df['new_column'] = [A[B:C] for A, B, C in zip(df.A, df.start_index df.end_index)]

This solution is the same as Psidom's solution in https://stackoverflow.com/a/45523050/6004997 however I added ability to use start and end index.

jreback added API Design Strings String extension data type and string data Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 7, 2014

jreback added this to the Next Major Release milestone Mar 21, 2016

jreback added Difficulty Intermediate labels Mar 21, 2016

wesm added the Won't Fix label Jul 6, 2018

wesm closed this as completed Jul 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Suggestion: method to slice strings using index columns (start and end) in dataframe #8748

Suggestion: method to slice strings using index columns (start and end) in dataframe #8748

teese commented Nov 6, 2014

shoyer commented Nov 6, 2014

Uh oh!

jreback commented Nov 7, 2014

Uh oh!

jim22k commented Nov 7, 2014

Uh oh!

jorisvandenbossche commented Nov 7, 2014

Uh oh!

teese commented Nov 8, 2014

Uh oh!

jorisvandenbossche commented Nov 8, 2014

Uh oh!

teese commented Nov 8, 2014

Uh oh!

hanfang commented Mar 21, 2016

Uh oh!

jreback commented Mar 21, 2016

Uh oh!

wesm commented Jul 6, 2018

Uh oh!

mirekphd commented Jun 5, 2019 •

edited

Loading

Uh oh!

Quetzalcohuatl commented Jan 30, 2020

Uh oh!

Uh oh!

Suggestion: method to slice strings using index columns (start and end) in dataframe #8748

Suggestion: method to slice strings using index columns (start and end) in dataframe #8748

Comments

teese commented Nov 6, 2014

shoyer commented Nov 6, 2014

Uh oh!

jreback commented Nov 7, 2014

Uh oh!

jim22k commented Nov 7, 2014

Uh oh!

jorisvandenbossche commented Nov 7, 2014

Uh oh!

teese commented Nov 8, 2014

Uh oh!

jorisvandenbossche commented Nov 8, 2014

Uh oh!

teese commented Nov 8, 2014

Uh oh!

hanfang commented Mar 21, 2016

Uh oh!

jreback commented Mar 21, 2016

Uh oh!

wesm commented Jul 6, 2018

Uh oh!

mirekphd commented Jun 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Quetzalcohuatl commented Jan 30, 2020

Uh oh!

mirekphd commented Jun 5, 2019 •

edited

Loading