html parsing with phantomjs? #5404

Closed
gliptak opened this issue Oct 31, 2013 · 11 comments
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Comments

@gliptak
Contributor

gliptak commented Oct 31, 2013

I was looking into expanding the pandas.io.data functionality to read options data from Google Finance. After reading

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#html-gotchas

I tried various combinations for parsing

http://www.google.com/finance/option_chain?q=GOOG

without success. The page formats itself using javascript, so it has to be "executed" in a browser.

selenium/phantomjs seems to be able to process the page:

$ sudo aptitude install phantomjs
$ pip install selenium
$ ipython
In [1]: from selenium import webdriver
In [2]: browser = webdriver.PhantomJS()
In [3]: browser.get('http://www.google.com/finance/option_chain?q=IBM')
In [4]: exp = browser.find_element_by_id('expirations')
In [5]: exp.find_elements_by_tag_name('option')[2].text

Could they be considered for inclusion as a parsing dependency?

Using phantomjs might also help with other HTML parsing issues experienced when using bs4/lxml/html5lib.

@jtratner
Contributor

Maybe it would be easier to just load the page with selenium/phantomjs, grab the HTML once it has rendered, and then pass that to the HTML parser?
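A minimal sketch of that handoff, using only the standard library. The `RENDERED_HTML` string and the `OptionTextParser` helper are made up for illustration; in practice the string would come from a headless browser (e.g. selenium's `browser.page_source`) after the page's JavaScript has run:

```python
from html.parser import HTMLParser

# Stand-in for HTML captured from a headless browser once the
# page's JavaScript has rendered (e.g. browser.page_source).
RENDERED_HTML = """
<div id="expirations">
  <select>
    <option>Nov 1, 2013</option>
    <option>Nov 8, 2013</option>
    <option>Nov 16, 2013</option>
  </select>
</div>
"""

class OptionTextParser(HTMLParser):
    """Collect the text of every <option> element in the blob."""

    def __init__(self):
        super().__init__()
        self._in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        # Only keep non-whitespace text that sits inside an <option>.
        if self._in_option and data.strip():
            self.options.append(data.strip())

parser = OptionTextParser()
parser.feed(RENDERED_HTML)
print(parser.options)  # ['Nov 1, 2013', 'Nov 8, 2013', 'Nov 16, 2013']
```

The point is that the browser automation and the parsing stay decoupled: pandas (or any parser) only ever sees a plain HTML string, regardless of how it was rendered.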

@gliptak
Contributor Author

gliptak commented Oct 31, 2013

The API offers operations very similar to those of bs4/lxml/html5lib, and in addition one can interact with the page (e.g. clicking a button or a link) for further processing. bs4/lxml/html5lib could potentially be replaced entirely.

Here is a list of the operations available on the <div id="expirations"> element:

exp.clear                               exp.find_elements_by_partial_link_text
exp.click                               exp.find_elements_by_tag_name
exp.find_element                        exp.find_elements_by_xpath
exp.find_element_by_class_name          exp.get_attribute
exp.find_element_by_css_selector        exp.id
exp.find_element_by_id                  exp.is_displayed
exp.find_element_by_link_text           exp.is_enabled
exp.find_element_by_name                exp.is_selected
exp.find_element_by_partial_link_text   exp.location
exp.find_element_by_tag_name            exp.location_once_scrolled_into_view
exp.find_element_by_xpath               exp.parent
exp.find_elements                       exp.send_keys
exp.find_elements_by_class_name         exp.size
exp.find_elements_by_css_selector       exp.submit
exp.find_elements_by_id                 exp.tag_name
exp.find_elements_by_link_text          exp.text
exp.find_elements_by_name               exp.value_of_css_property

@jtratner
Contributor

selenium and phantomjs have much more overhead than bs4 or html5lib, right? But if you want to put together another 'flavor' of HTML parser for consideration, it would be interesting (especially if it works better!)

@cpcloud
Member

cpcloud commented Oct 31, 2013

Yep, I think this would be nice. I can help you navigate the code if you want.

@jtratner
Contributor

Thanks for suggesting this btw - didn't realize you could use
selenium/phantomjs like that!

@gliptak
Contributor Author

gliptak commented Oct 31, 2013

I didn't know this either before yesterday :)

@gliptak
Contributor Author

gliptak commented Oct 31, 2013

What is the status of #5395 (it modifies pandas/io/html.py significantly)?

@cancan101
Contributor

@gliptak I am hoping #5395 gets accepted soon.

@ghost

ghost commented Dec 30, 2013

This looks out of scope for pandas. HTML scraping is fair enough (though it brought many dependencies
with it), but if users need to scrape the DOM, I think they should use selenium or whatever tool they choose
and give pandas a rendered HTML blob.

The feature-creep boundary is always fuzzy; I suggest we draw the line here in this case.

Objections?

@cpcloud
Member

cpcloud commented Dec 30, 2013

no objections here ... def no more deps (even opt ones) for html ... it's pretty hefty as is

@ghost

ghost commented Jan 1, 2014

@gliptak, thanks for the idea. Pandas' focus is not HTML scraping, and the existing functionality
strikes a good balance between simplicity and power. Users are free to retrieve HTML content
using any library they choose and pass it to pandas. As your code example demonstrates
well, it's fairly simple to do.

closing.

This issue was closed.