@@ -168,28 +168,37 @@ Extracting Substrings
168168
169169.. _text.extract :
170170
171- The method ``extract `` (introduced in version 0.13) accepts `regular expressions
172- <https://docs.python.org/2/library/re.html> `__ with match groups. Extracting a
173- regular expression with one group returns a Series of strings.
171+ Extract first match in each subject (extract)
172+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
174173
175- .. ipython :: python
174+ .. versionadded :: 0.13.0
175+
176+ .. warning ::
177+
178+ In version 0.18.0, ``extract `` gained the ``expand `` argument. When
179+ ``expand=False `` it returns a ``Series ``, ``Index ``, or
180+ ``DataFrame ``, depending on the subject and regular expression
181+ pattern (same behavior as pre-0.18.0). When ``expand=True `` it
182+ always returns a ``DataFrame ``, which is more consistent and less
183+ confusing from the perspective of a user.
176184
177- pd.Series([' a1' , ' b2' , ' c3' ]).str.extract(' [ab](\d)' )
185+ The ``extract `` method accepts a `regular expression
186+ <https://docs.python.org/2/library/re.html> `__ with at least one
187+ capture group.
178188
179- Elements that do not match return `` NaN ``. Extracting a regular expression
180- with more than one group returns a DataFrame with one column per group.
189+ Extracting a regular expression with more than one group returns a
190+ DataFrame with one column per group.
181191
182192.. ipython :: python
183193
184194 pd.Series([' a1' , ' b2' , ' c3' ]).str.extract(' ([ab])(\d)' )
185195
186- Elements that do not match return a row filled with ``NaN ``.
187- Thus, a Series of messy strings can be "converted" into a
188- like-indexed Series or DataFrame of cleaned-up or more useful strings,
189- without necessitating ``get() `` to access tuples or ``re.match `` objects.
190-
191- The results dtype always is object, even if no match is found and the result
192- only contains ``NaN ``.
196+ Elements that do not match return a row filled with ``NaN ``. Thus, a
197+ Series of messy strings can be "converted" into a like-indexed Series
198+ or DataFrame of cleaned-up or more useful strings, without
199+ necessitating ``get() `` to access tuples or ``re.match `` objects. The
200+ result's dtype is always object, even if no match is found and the
201+ result contains only ``NaN ``.
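The like-indexed, ``NaN``-filled behaviour described above can be checked with a minimal sketch. The names ``messy`` and ``cleaned`` and the index labels are illustrative assumptions, not part of the documented example, and ``expand=True`` is passed explicitly so the result is a ``DataFrame`` regardless of the version's default.

```python
import pandas as pd

# Hypothetical input: the labels "x", "y", "z" are chosen only to show
# that the result keeps the index of the subject Series.
messy = pd.Series(["a1", "b2", "c3"], index=["x", "y", "z"])

# Two capture groups give two columns; "c3" does not match [ab].
cleaned = messy.str.extract(r"([ab])(\d)", expand=True)
print(cleaned)

# The non-matching element yields a row filled with NaN.
print(cleaned.loc["z"].isna().all())
```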
193202
194203Named groups like
195204
@@ -201,9 +210,109 @@ and optional groups like
201210
202211.. ipython :: python
203212
204- pd.Series([' a1' , ' b2' , ' 3' ]).str.extract(' (?P<letter>[ab])?(?P<digit>\d)' )
213+ pd.Series([' a1' , ' b2' , ' 3' ]).str.extract(' ([ab])?(\d)' )
214+
215+ can also be used. Note that any capture group names in the regular
216+ expression will be used for column names; otherwise capture group
217+ numbers will be used.
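As a rough illustration of the column-naming rule, the sketch below runs the same pattern with and without group names. The variable names ``named`` and ``unnamed`` are assumptions made for this example only.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "3"])

# Named groups: the group names become the column labels.
named = s.str.extract(r"(?P<letter>[ab])?(?P<digit>\d)", expand=True)
print(named.columns.tolist())

# Unnamed groups: the columns are labelled by group position, from 0.
unnamed = s.str.extract(r"([ab])?(\d)", expand=True)
print(unnamed.columns.tolist())

# The optional letter group did not participate for "3", so it is NaN there.
print(named)
```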
218+
219+ Extracting a regular expression with one group returns a ``DataFrame ``
220+ with one column if ``expand=True ``.
221+
222+ .. ipython :: python
223+
224+ pd.Series([' a1' , ' b2' , ' c3' ]).str.extract(' [ab](\d)' , expand = True )
225+
226+ It returns a Series if ``expand=False ``.
227+
228+ .. ipython :: python
229+
230+ pd.Series([' a1' , ' b2' , ' c3' ]).str.extract(' [ab](\d)' , expand = False )
231+
232+ Calling on an ``Index `` with a regex with exactly one capture group
233+ returns a ``DataFrame `` with one column if ``expand=True ``.
234+
235+ .. ipython :: python
236+
237+ s = pd.Series([" a1" , " b2" , " c3" ], [" A11" , " B22" , " C33" ])
238+ s
239+ s.index.str.extract(" (?P<letter>[a-zA-Z])" , expand = True )
240+
241+ It returns an ``Index `` if ``expand=False ``.
242+
243+ .. ipython :: python
244+
245+ s.index.str.extract(" (?P<letter>[a-zA-Z])" , expand = False )
246+
247+ Calling on an ``Index `` with a regex with more than one capture group
248+ returns a ``DataFrame `` if ``expand=True ``.
249+
250+ .. ipython :: python
251+
252+ s.index.str.extract(" (?P<letter>[a-zA-Z])([0-9]+)" , expand = True )
253+
254+ It raises ``ValueError `` if ``expand=False ``.
255+
256+ .. code-block :: python
257+
258+ >> > s.index.str.extract(" (?P<letter>[a-zA-Z])([0-9]+)" , expand = False )
259+ ValueError: only one regex group is supported with Index
260+
261+ The table below summarizes the behavior of ``extract(expand=False) ``
262+ (input subject in first column, number of groups in regex in
263+ first row)
264+
265+ +--------+---------+------------+
266+ | | 1 group | >1 group |
267+ +--------+---------+------------+
268+ | Index | Index | ValueError |
269+ +--------+---------+------------+
270+ | Series | Series | DataFrame |
271+ +--------+---------+------------+
272+
273+ Extract all matches in each subject (extractall)
274+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
275+
276+ .. _text.extractall :
277+
278+ Unlike ``extract `` (which returns only the first match),
279+
280+ .. ipython :: python
281+
282+ s = pd.Series([" a1a2" , " b1" , " c1" ], [" A" , " B" , " C" ])
283+ s
284+ s.str.extract(" [ab](?P<digit>\d)" )
285+
286+ .. versionadded :: 0.18.0
287+
288+ the ``extractall `` method returns every match. The result of
289+ ``extractall `` is always a ``DataFrame `` with a ``MultiIndex `` on its
290+ rows. The last level of the ``MultiIndex `` is named ``match `` and
291+ indicates the order in the subject.
292+
293+ .. ipython :: python
294+
295+ s.str.extractall(" [ab](?P<digit>\d)" )
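To make the shape of the ``MultiIndex`` concrete, the following sketch inspects the index of the ``extractall`` result for the same input. The variable name ``matches`` is an assumption introduced for illustration.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])

# One row per match; "c1" has no match for [ab] and contributes no rows.
matches = s.str.extractall(r"[ab](?P<digit>\d)")

# The last index level is named "match" and counts matches per subject.
print(matches.index.names)
print(matches.index.tolist())
print(matches["digit"].tolist())
```

Subject ``A`` contains two matches, so it appears twice, with ``match`` values 0 and 1.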
296+
297+ When each subject string in the Series has exactly one match,
298+
299+ .. ipython :: python
300+
301+ s = pd.Series([' a3' , ' b3' , ' c2' ])
302+ s
303+ two_groups = ' (?P<letter>[a-z])(?P<digit>[0-9])'
304+
305+ then ``extractall(pat).xs(0, level='match') `` gives the same result as
306+ ``extract(pat) ``.
307+
308+ .. ipython :: python
309+
310+ extract_result = s.str.extract(two_groups)
311+ extract_result
312+ extractall_result = s.str.extractall(two_groups)
313+ extractall_result
314+ extractall_result.xs(0 , level = " match" )
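The equivalence can also be checked programmatically rather than by eye. This is a minimal sketch: the names ``first`` and ``first_of_all`` are assumptions, ``expand=True`` is passed explicitly so that ``extract`` returns a ``DataFrame``, and the comparison is done on plain column values to avoid depending on index or dtype details.

```python
import pandas as pd

s = pd.Series(["a3", "b3", "c2"])
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

# First match per subject.
first = s.str.extract(two_groups, expand=True)

# All matches, then keep only match number 0 for each subject.
first_of_all = s.str.extractall(two_groups).xs(0, level="match")

# With exactly one match per subject, both hold the same values.
print(first.to_dict("list") == first_of_all.to_dict("list"))
```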
205315
206- can also be used.
207316
208317 Testing for Strings that Match or Contain a Pattern
209318~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -288,7 +397,8 @@ Method Summary
288397 :meth: `~Series.str.endswith `,Equivalent to ``str.endswith(pat) `` for each element
289398 :meth: `~Series.str.findall `,Compute list of all occurrences of pattern/regex for each string
290399 :meth: `~Series.str.match `,"Call ``re.match `` on each element, returning matched groups as list"
291- :meth: `~Series.str.extract `,"Call ``re.match `` on each element, as ``match `` does, but return matched groups as strings for convenience."
400+ :meth: `~Series.str.extract `,"Call ``re.search `` on each element, returning DataFrame with one row for each element and one column for each regex capture group"
401+ :meth: `~Series.str.extractall `,"Call ``re.findall `` on each element, returning DataFrame with one row for each match and one column for each regex capture group"
292402 :meth: `~Series.str.len `,Compute string lengths
293403 :meth: `~Series.str.strip `,Equivalent to ``str.strip ``
294404 :meth: `~Series.str.rstrip `,Equivalent to ``str.rstrip ``