io.html.read_html support XPath expressions for table selection #5416
Closed
48ed43b (phaebz) io.html.read_html support XPath expressions for table selection (only…
de94512 (phaebz) Coverage tests for `match` and `attr` parameters
9a300b4 (phaebz) XPath expression has to match table elements only
563a955 (phaebz) Further testing for XPath feature
432269b (phaebz) Correct format specifiers for XPath string formatting
c2bcbc9 (phaebz) Release notes addition
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -165,13 +165,15 @@ class _HtmlFrameParser(object): | |
See each method's respective documentation for details on their | ||
functionality. | ||
""" | ||
def __init__(self, io, match, attrs): | ||
def __init__(self, io, match, attrs, xpath): | ||
self.io = io | ||
self.match = match | ||
self.attrs = attrs | ||
self.xpath = xpath | ||
|
||
def parse_tables(self): | ||
tables = self._parse_tables(self._build_doc(), self.match, self.attrs) | ||
tables = self._parse_tables(self._build_doc(), self.match, self.attrs, | ||
self.xpath) | ||
return (self._build_table(table) for table in tables) | ||
|
||
def _parse_raw_data(self, rows): | ||
|
@@ -227,7 +229,7 @@ def _parse_td(self, obj): | |
""" | ||
raise NotImplementedError | ||
|
||
def _parse_tables(self, doc, match, attrs): | ||
def _parse_tables(self, doc, match, attrs, xpath): | ||
"""Return all tables from the parsed DOM. | ||
|
||
Parameters | ||
|
@@ -242,6 +244,9 @@ def _parse_tables(self, doc, match, attrs): | |
A dictionary of table attributes that can be used to disambiguate | ||
multiple tables on a page. | ||
|
||
xpath : str or None | ||
An XPath style string used to filter for tables to be returned. | ||
|
||
Raises | ||
------ | ||
ValueError | ||
|
@@ -393,7 +398,7 @@ def _parse_tbody(self, table): | |
def _parse_tfoot(self, table): | ||
return table.find_all('tfoot') | ||
|
||
def _parse_tables(self, doc, match, attrs): | ||
def _parse_tables(self, doc, match, attrs, xpath): | ||
element_name = self._strainer.name | ||
tables = doc.find_all(element_name, attrs=attrs) | ||
|
||
|
@@ -481,24 +486,36 @@ def _parse_tr(self, table): | |
expr = './/tr[normalize-space()]' | ||
return table.xpath(expr) | ||
|
||
def _parse_tables(self, doc, match, kwargs): | ||
pattern = match.pattern | ||
def _parse_tables(self, doc, match, kwargs, xpath): | ||
if xpath: | ||
xpath_expr = xpath | ||
tables = doc.xpath(xpath_expr) | ||
|
||
# 1. check all descendants for the given pattern and only search tables | ||
# 2. go up the tree until we find a table | ||
query = '//table//*[re:test(text(), %r)]/ancestor::table' | ||
xpath_expr = u(query) % pattern | ||
if not all(table.tag == 'table' for table in tables): | ||
raise ValueError("XPath expression %s matched non-table elements" % xpath) | ||
|
||
# if any table attributes were given build an xpath expression to | ||
# search for them | ||
if kwargs: | ||
xpath_expr += _build_xpath_expr(kwargs) | ||
if not tables: | ||
raise ValueError("No tables found using XPath expression %s" % xpath) | ||
return tables | ||
|
||
tables = doc.xpath(xpath_expr, namespaces=_re_namespace) | ||
else: | ||
Review comment: I think it would be nice to dump the below into a function...but not necessary for this PR. |
||
pattern = match.pattern | ||
|
||
if not tables: | ||
raise ValueError("No tables found matching regex %r" % pattern) | ||
return tables | ||
# 1. check all descendants for the given pattern and only search tables | ||
# 2. go up the tree until we find a table | ||
query = '//table//*[re:test(text(), %r)]/ancestor::table' | ||
xpath_expr = u(query) % pattern | ||
|
||
# if any table attributes were given build an xpath expression to | ||
# search for them | ||
if kwargs: | ||
xpath_expr += _build_xpath_expr(kwargs) | ||
|
||
tables = doc.xpath(xpath_expr, namespaces=_re_namespace) | ||
|
||
if not tables: | ||
raise ValueError("No tables found matching regex %r" % pattern) | ||
return tables | ||
|
||
def _build_doc(self): | ||
""" | ||
|
@@ -688,15 +705,22 @@ def _validate_flavor(flavor): | |
|
||
|
||
def _parse(flavor, io, match, header, index_col, skiprows, infer_types, | ||
parse_dates, tupleize_cols, thousands, attrs): | ||
parse_dates, tupleize_cols, thousands, attrs, xpath): | ||
flavor = _validate_flavor(flavor) | ||
compiled_match = re.compile(match) # you can pass a compiled regex here | ||
|
||
if xpath and not _HAS_LXML: | ||
raise ValueError("XPath table selection needs the lxml module, " | ||
"please install it.") | ||
|
||
# hack around python 3 deleting the exception variable | ||
retained = None | ||
for flav in flavor: | ||
parser = _parser_dispatch(flav) | ||
p = parser(io, compiled_match, attrs) | ||
if xpath and flav in ('bs4', 'html5lib'): | ||
raise NotImplementedError | ||
|
||
p = parser(io, compiled_match, attrs, xpath) | ||
|
||
try: | ||
tables = p.parse_tables() | ||
|
@@ -714,7 +738,7 @@ def _parse(flavor, io, match, header, index_col, skiprows, infer_types, | |
|
||
def read_html(io, match='.+', flavor=None, header=None, index_col=None, | ||
skiprows=None, infer_types=None, attrs=None, parse_dates=False, | ||
tupleize_cols=False, thousands=','): | ||
tupleize_cols=False, thousands=',', xpath=None): | ||
r"""Read HTML tables into a ``list`` of ``DataFrame`` objects. | ||
|
||
Parameters | ||
|
@@ -795,6 +819,12 @@ def read_html(io, match='.+', flavor=None, header=None, index_col=None, | |
thousands : str, optional | ||
Separator to use to parse thousands. Defaults to ``','``. | ||
|
||
xpath : str or None, optional | ||
If not ``None`` try to identify the set of tables to be read by an | ||
XPath string; takes precedence over ``match``. Defaults to ``None``. | ||
Note: This functionality is not (yet) available with the Beautiful Soup | ||
parser (``flavor=bs4``). | ||
|
||
Returns | ||
------- | ||
dfs : list of DataFrames | ||
|
@@ -840,4 +870,4 @@ def read_html(io, match='.+', flavor=None, header=None, index_col=None, | |
raise ValueError('cannot skip rows starting from the end of the ' | ||
'data (you passed a negative value)') | ||
return _parse(flavor, io, match, header, index_col, skiprows, infer_types, | ||
parse_dates, tupleize_cols, thousands, attrs) | ||
parse_dates, tupleize_cols, thousands, attrs, xpath) |
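The XPath branch added in this diff does two checks on the result: every matched element must be a `table`, and the match must not be empty. The sketch below illustrates that logic only. It is not the PR's code: it uses the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset) instead of lxml, and `select_tables` and the sample markup are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree needs valid XML).
HTML = """<html><body>
<table id="a"><tr><td>1</td></tr></table>
<div id="b"><p>not a table</p></div>
<table id="c"><tr><td>2</td></tr></table>
</body></html>"""


def select_tables(root, xpath):
    """Return the elements matched by ``xpath``.

    Mirrors the two checks in the diff: reject matches that include
    non-table elements, then reject an empty result.
    """
    tables = root.findall(xpath)
    if not all(el.tag == "table" for el in tables):
        raise ValueError(
            "XPath expression %r matched non-table elements" % xpath)
    if not tables:
        raise ValueError(
            "No tables found using XPath expression %r" % xpath)
    return tables


root = ET.fromstring(HTML)
selected = select_tables(root, ".//table[@id='c']")
print([t.get("id") for t in selected])
```

Note the order of the checks: `all(...)` over an empty list is true, so an empty match falls through to the second error rather than the first.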
Can you use `%r` here instead of `%s`?
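To illustrate what this suggestion changes, here is a small sketch comparing the two format specifiers on an error message like the ones in the diff. The XPath string is a made-up example; the point is only that `%r` inserts `repr()` of the value, so the expression appears quoted and stray whitespace becomes visible.

```python
# Hypothetical XPath string with an easy-to-miss trailing space.
xpath = "//table[@id='stats'] "

# %s inserts str(xpath): the expression blends into the message text.
msg_s = "No tables found using XPath expression %s" % xpath

# %r inserts repr(xpath): the expression is quoted, so its exact
# boundaries (including the trailing space) are unambiguous.
msg_r = "No tables found using XPath expression %r" % xpath

print(msg_s)
print(msg_r)
```

This is also consistent with the existing regex branch, which already reports its pattern with `%r`.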