Document Using Regex for str.split #25296

Closed
stevenlis opened this issue Feb 13, 2019 · 5 comments · Fixed by #26267
Comments

@stevenlis

import pandas as pd
df = pd.DataFrame({'col': ['a-b-c+e=d,f#t']*5})
df.col.str.split('+|=', expand=True)

Problem description

When two patterns separated by | are passed to the str.split() method and one of them is +, pandas raises the following error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
 in 
      2 import pandas as pd
      3 df = pd.DataFrame({'col': ['a-b-c+e=d,f#t']*5})
----> 4 df.col.str.split('+|=', expand=True)

~\Anaconda3\lib\site-packages\pandas\core\strings.py in split(self, pat, n, expand)
   2328     @copy(str_split)
   2329     def split(self, pat=None, n=-1, expand=False):
-> 2330         result = str_split(self._data, pat, n=n)
   2331         return self._wrap_result(result, expand=expand)
   2332 

~\Anaconda3\lib\site-packages\pandas\core\strings.py in str_split(arr, pat, n)
   1458             if n is None or n == -1:
   1459                 n = 0
-> 1460             regex = re.compile(pat)
   1461             f = lambda x: regex.split(x, maxsplit=n)
   1462     res = _na_map(f, arr)

~\Anaconda3\lib\re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~\Anaconda3\lib\re.py in _compile(pattern, flags)
    299     if not sre_compile.isstring(pattern):
    300         raise TypeError("first argument must be string or compiled pattern")
--> 301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):
    303         if len(_cache) >= _MAXCACHE:

~\Anaconda3\lib\sre_compile.py in compile(p, flags)
    560     if isstring(p):
    561         pattern = p
--> 562         p = sre_parse.parse(p, flags)
    563     else:
    564         pattern = None

~\Anaconda3\lib\sre_parse.py in parse(str, flags, pattern)
    853 
    854     try:
--> 855         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    856     except Verbose:
    857         # the VERBOSE flag was switched on inside the pattern.  to be

~\Anaconda3\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
    414     while True:
    415         itemsappend(_parse(source, state, verbose, nested + 1,
--> 416                            not nested and not items))
    417         if not sourcematch("|"):
    418             break

~\Anaconda3\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
    614             if not item or (_len(item) == 1 and item[0][0] is AT):
    615                 raise source.error("nothing to repeat",
--> 616                                    source.tell() - here + len(this))
    617             if item[0][0] in _REPEATCODES:
    618                 raise source.error("multiple repeat",

error: nothing to repeat at position 0

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.1
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: 0.11.0
IPython: 7.1.1
sphinx: 1.7.6
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.5
lxml: 4.2.4
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Feb 13, 2019

This is not a bug: you need to escape the plus sign when using a regular expression.

That said, this feature is not documented, so I think we can re-purpose this issue to document support for regex splitting.
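
As a rough illustration of the escaping mentioned above, the pattern can be built from literal delimiters with the standard library's `re.escape`, so that `+` is matched literally. This is a sketch, not pandas' documented recipe; the names `delimiters`, `pattern`, and `result` are chosen here for illustration only.

```python
import re

import pandas as pd

df = pd.DataFrame({'col': ['a-b-c+e=d,f#t'] * 5})

# Literal delimiters to split on (hypothetical list for this example).
delimiters = ['+', '=']

# re.escape backslash-escapes regex metacharacters such as "+",
# so each delimiter is matched as plain text.
pattern = '|'.join(re.escape(d) for d in delimiters)

# The multi-character pattern is compiled as a regular expression.
result = df['col'].str.split(pattern, expand=True)
print(result.iloc[0].tolist())
```

Each row splits into three pieces here, because only `+` and `=` act as separators.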

@WillAyd changed the title from "Passing "+" as one character with "|" separator in str.split causes error" to "Document Using Regex for str.split" on Feb 13, 2019
@stevenlis
Author

stevenlis commented Feb 13, 2019

The behavior is inconsistent, though, as it seems + is the only character that causes this issue.

# this works:
df.col.str.split(',|#|=|-', expand=True)
# this does not:
df.col.str.split(',|#|=|-|+', expand=True)
# and you have to escape the plus sign (raw string, so the backslash reaches the regex engine as-is):
df.col.str.split(r',|#|=|-|\+', expand=True)
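
One possible alternative to escaping each delimiter is a regex character class, inside which `+` has no special meaning. The sketch below assumes the same sample frame as the report; placing `-` last in the brackets keeps it from being read as a range. The name `parts` is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col': ['a-b-c+e=d,f#t'] * 5})

# Inside [...], "+" is an ordinary character; "-" goes last so it is
# treated as a literal hyphen rather than a range operator.
parts = df['col'].str.split(r'[,#=+-]', expand=True)
print(parts.iloc[0].tolist())
```

With all five delimiters in the class, each row splits into seven single-letter pieces.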

@WillAyd
Member

WillAyd commented Feb 13, 2019

It's consistent with regex behavior, where + is a special character. You will get the same error with *, among others.
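
The point about other special characters can be checked directly with the standard library's `re` module, independent of pandas. The helper `compiles` below is a hypothetical name introduced for this sketch; it only reports whether a pattern compiles.

```python
import re


def compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Quantifiers with nothing before them fail to compile;
# an escaped "+" and ordinary characters compile fine.
for pat in ['+|=', '*|=', '?|=', r'\+|=', '-|=']:
    print(repr(pat), compiles(pat))
```

A leading `+`, `*`, or `?` is a quantifier with nothing to repeat, which is why those patterns are rejected while `-|=` is accepted.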

@zangell44
Contributor

I can work on putting this in the documentation. Would you be okay with localized documentation in all of the str methods where this is applicable?

@WillAyd
Member

WillAyd commented Feb 15, 2019

@zangell44 I think it is documented in most methods, but sure: if you see others where it isn't, by all means include them in a PR.

3 participants