html parsing with phantomjs? #5404

Closed
gliptak opened this issue Oct 31, 2013 · 11 comments
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Comments

@gliptak
Contributor

gliptak commented Oct 31, 2013

I was looking into expanding the pandas.io.data functionality to read options data from Google Finance. After reading

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#html-gotchas

I tried various combinations for parsing

http://www.google.com/finance/option_chain?q=GOOG

without success. The page formats itself using javascript, so it has to be "executed" in a browser.

selenium/phantomjs seems to be able to process the page:

$ sudo aptitude install phantomjs
$ pip install selenium
$ ipython
In [1]: from selenium import webdriver
In [2]: browser = webdriver.PhantomJS()
In [3]: browser.get('http://www.google.com/finance/option_chain?q=IBM')
In [4]: exp = browser.find_element_by_id('expirations')
In [5]: exp.find_elements_by_tag_name('option')[2].text

Could they be considered for inclusion as a parsing dependency?

Using phantomjs might also help with other HTML parsing issues experienced when using bs4/lxml/html5lib.

@jtratner
Contributor

Maybe it would be easier to just load the page with selenium/phantomjs, grab the HTML once it has rendered, and then pass that to the HTML parser?
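A minimal sketch of that handoff, using only the standard library. The `RENDERED_HTML` string and the `OptionTextParser` helper are made up for illustration; in practice the string would come from a headless browser (e.g. selenium's `browser.page_source`) after the page's JavaScript has run:

```python
from html.parser import HTMLParser

# Stand-in for HTML captured from a headless browser once the
# page's JavaScript has rendered (e.g. browser.page_source).
RENDERED_HTML = """
<div id="expirations">
  <select>
    <option>Nov 1, 2013</option>
    <option>Nov 8, 2013</option>
    <option>Nov 16, 2013</option>
  </select>
</div>
"""

class OptionTextParser(HTMLParser):
    """Collect the text of every <option> element in the blob."""

    def __init__(self):
        super().__init__()
        self._in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        # Only keep non-whitespace text that sits inside an <option>.
        if self._in_option and data.strip():
            self.options.append(data.strip())

parser = OptionTextParser()
parser.feed(RENDERED_HTML)
print(parser.options)  # ['Nov 1, 2013', 'Nov 8, 2013', 'Nov 16, 2013']
```

The point is that the browser automation and the parsing stay decoupled: pandas (or any parser) only ever sees a plain HTML string, regardless of how it was rendered.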

@gliptak
Contributor Author

gliptak commented Oct 31, 2013

The API offers operations very similar to those of bs4/lxml/html5lib, and in addition one can interact with the page (e.g. clicking a button or a link) for further processing. bs4/lxml/html5lib could potentially be replaced entirely.

Here is a list of the operations available on the <div id="expirations"> element:

exp.clear                               exp.find_elements_by_partial_link_text
exp.click                               exp.find_elements_by_tag_name
exp.find_element                        exp.find_elements_by_xpath
exp.find_element_by_class_name          exp.get_attribute
exp.find_element_by_css_selector        exp.id
exp.find_element_by_id                  exp.is_displayed
exp.find_element_by_link_text           exp.is_enabled
exp.find_element_by_name                exp.is_selected
exp.find_element_by_partial_link_text   exp.location
exp.find_element_by_tag_name            exp.location_once_scrolled_into_view
exp.find_element_by_xpath               exp.parent
exp.find_elements                       exp.send_keys
exp.find_elements_by_class_name         exp.size
exp.find_elements_by_css_selector       exp.submit
exp.find_elements_by_id                 exp.tag_name
exp.find_elements_by_link_text          exp.text
exp.find_elements_by_name               exp.value_of_css_property

@jtratner
Contributor

selenium and phantomjs have much more overhead than bs4 or html5lib, right? But if you want to put together another 'flavor' of HTML parser for consideration, it would be interesting (especially if it works better!)

@cpcloud
Member

cpcloud commented Oct 31, 2013

Yep, I think this would be nice. I can help you navigate the code if you want.

@jtratner
Contributor

Thanks for suggesting this btw - didn't realize you could use
selenium/phantomjs like that!

@gliptak
Contributor Author

gliptak commented Oct 31, 2013

I didn't know this either before yesterday :)

@gliptak
Contributor Author

gliptak commented Oct 31, 2013

What is the status of #5395 (it modifies pandas/io/html.py significantly)?

@cancan101
Contributor

@gliptak I am hoping #5395 gets accepted soon.

@ghost

ghost commented Dec 30, 2013

This looks out of scope for pandas. HTML scraping is fair enough (though it brought many dependencies
with it), but if users need to scrape the DOM, I think they should use selenium or whatever tool they choose
and give pandas a rendered HTML blob.

The feature-creep boundary is always fuzzy; I suggest we draw the line here in this case.

Objections?

@cpcloud
Member

cpcloud commented Dec 30, 2013

no objections here ... def no more deps (even opt ones) for html ... it's pretty hefty as is

@ghost

ghost commented Jan 1, 2014

@gliptak, thanks for the idea. Pandas' focus is not HTML scraping, and the existing functionality
strikes a good balance between simplicity and power. Users are free to retrieve HTML content
using any library they choose and pass it to pandas. As your code example demonstrates
well, it's fairly simple to do.

closing.

This issue was closed.