Document Using Regex for str.split #25296

stevenlis · 2019-02-13T04:02:52Z

import pandas as pd
df = pd.DataFrame({'col': ['a-b-c+e=d,f#t']*5})
df.col.str.split('+|=', expand=True)

Problem description

When passing two patterns separated by | to the str.split() method, if one of them is +, pandas raises the following error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
 in 
      2 import pandas as pd
      3 df = pd.DataFrame({'col': ['a-b-c+e=d,f#t']*5})
----> 4 df.col.str.split('+|=', expand=True)

~\Anaconda3\lib\site-packages\pandas\core\strings.py in split(self, pat, n, expand)
   2328     @copy(str_split)
   2329     def split(self, pat=None, n=-1, expand=False):
-> 2330         result = str_split(self._data, pat, n=n)
   2331         return self._wrap_result(result, expand=expand)
   2332 

~\Anaconda3\lib\site-packages\pandas\core\strings.py in str_split(arr, pat, n)
   1458             if n is None or n == -1:
   1459                 n = 0
-> 1460             regex = re.compile(pat)
   1461             f = lambda x: regex.split(x, maxsplit=n)
   1462     res = _na_map(f, arr)

~\Anaconda3\lib\re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~\Anaconda3\lib\re.py in _compile(pattern, flags)
    299     if not sre_compile.isstring(pattern):
    300         raise TypeError("first argument must be string or compiled pattern")
--> 301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):
    303         if len(_cache) >= _MAXCACHE:

~\Anaconda3\lib\sre_compile.py in compile(p, flags)
    560     if isstring(p):
    561         pattern = p
--> 562         p = sre_parse.parse(p, flags)
    563     else:
    564         pattern = None

~\Anaconda3\lib\sre_parse.py in parse(str, flags, pattern)
    853 
    854     try:
--> 855         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    856     except Verbose:
    857         # the VERBOSE flag was switched on inside the pattern.  to be

~\Anaconda3\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
    414     while True:
    415         itemsappend(_parse(source, state, verbose, nested + 1,
--> 416                            not nested and not items))
    417         if not sourcematch("|"):
    418             break

~\Anaconda3\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
    614             if not item or (_len(item) == 1 and item[0][0] is AT):
    615                 raise source.error("nothing to repeat",
--> 616                                    source.tell() - here + len(this))
    617             if item[0][0] in _REPEATCODES:
    618                 raise source.error("multiple repeat",

error: nothing to repeat at position 0

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.1
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: 0.11.0
IPython: 7.1.1
sphinx: 1.7.6
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.5
lxml: 4.2.4
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


WillAyd · 2019-02-13T04:25:43Z

This is not a bug as you would need to escape the plus sign if using a regular expression.

That said, this feature is not documented, so I think we can repurpose this issue to document support for regex splitting.
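For illustration, here is a minimal sketch of three ways to write a pattern that splits on a literal + or =. It uses only the standard-library re module, on the assumption (based on the traceback above, where str_split calls re.compile(pat)) that the same pattern strings can be passed to str.split. The variable names are made up for this example.

```python
import re

value = 'a-b-c+e=d,f#t'

# Option 1: escape the plus sign by hand; the raw string keeps the backslash intact.
escaped_by_hand = r'\+|='

# Option 2: use a character class, inside which + is an ordinary character.
char_class = r'[+=]'

# Option 3: let re.escape build the alternation from a list of literal separators.
separators = ['+', '=']
built = '|'.join(re.escape(sep) for sep in separators)

# All three patterns split the sample value at the same places.
results = [re.split(p, value) for p in (escaped_by_hand, char_class, built)]
print(results[0])  # ['a-b-c', 'e', 'd,f#t']
```

Option 3 is probably the safest choice when the separators come from user input, since no one has to remember which characters are special.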

stevenlis · 2019-02-13T04:55:29Z

The behavior is inconsistent, though, as it seems + is the only character that will cause this issue.

# this works:
df.col.str.split(',|#|=|-', expand=True)
# this does not:
df.col.str.split(',|#|=|-|+', expand=True)
# and you have to escape the plus sign:
df.col.str.split(r',|#|=|-|\+', expand=True)

WillAyd · 2019-02-13T04:56:59Z

It's consistent with regex behavior, where + is a special character. You will get the same error with *, among others.
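A quick way to check this claim is to try compiling each candidate separator as a bare alternative and see which ones the re module rejects. This is only a sketch: the helper name compiles and the list of candidates are chosen for this example and are not part of pandas.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Each candidate is tried as an unescaped alternative after a plain comma.
candidates = [',', '#', '=', '-', '+', '*', '?']
status = {ch: compiles(',|' + ch) for ch in candidates}

# The quantifiers +, * and ? have nothing to repeat, so they are rejected.
rejected = [ch for ch, ok in status.items() if not ok]
print(rejected)  # ['+', '*', '?']
```

Note that some special characters, such as ., compile without error but match more than the literal character, so a pattern that does not raise is not necessarily doing a literal split.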

zangell44 · 2019-02-15T00:18:39Z

I can work on putting this in the documentation. Would you be okay with localized documentation in all of the str methods where this is applicable?

WillAyd · 2019-02-15T04:22:46Z

@zangell44 I think it is documented in most methods, but sure, if you see others where it isn't, by all means include them in a PR.

WillAyd changed the title ~~Passing "+" as one character with "|" separator in str.split causes error~~ Document Using Regex for str.split Feb 13, 2019

WillAyd added Docs good first issue labels Feb 13, 2019

vandenn mentioned this issue May 2, 2019

DOC: Add regex example in str.split docstring #26267

Merged

4 tasks

gfyoung closed this as completed in #26267 May 3, 2019

gfyoung pushed a commit that referenced this issue May 3, 2019

DOC: Add regex example in str.split docstring (#26267)

e854ccf

Closes gh-25296

vandenn added a commit to vandenn/pandas that referenced this issue May 3, 2019

DOC: Add regex example in str.split docstring (pandas-dev#26267)

e5db601

Closes pandas-devgh-25296

vandenn added a commit to vandenn/pandas that referenced this issue May 3, 2019

DOC: Add regex example in str.split docstring (pandas-dev#26267) (#2)

79e3613

Closes pandas-devgh-25296
