html parsing with phantomjs? #5404
Comments
Maybe it's easier to just load the page, get the HTML after loading with selenium/phantomjs, and then pass it to the HTML parser?
The API offers operations very similar to those of bs4/lxml/html5lib, and in addition one can interact with the page (clicking a button, link, etc.) for further processing. bs4/lxml/html5lib might be fully replaced. Here is a list of operations on the
selenium and phantomjs have much more overhead than bs4 or html5lib, right? But if you want to put together another 'flavor' of HTMLParser to consider, it would be interesting (especially if it worked better!)
Yep I think this would be nice. I can help you navigate the code if you want.
Thanks for suggesting this btw - didn't realize you could use
I didn't know this either before yesterday :)
What is the status of #5395 (it modifies
This looks out of scope for pandas. HTML scraping is fair enough (though it brought many dependencies). The feature/creep boundary is always fuzzy; I suggest we draw the line here in this case. Objections?
no objections here ... def no more deps (even opt ones) for html ... it's pretty hefty as is
@gliptak, thanks for the idea. Pandas' focus is not html scraping, the existing functionality closing.
I was looking into expanding the pandas.io.data functionality to read options data from Google Finance. After reading
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#html-gotchas
I tried various combinations for parsing
http://www.google.com/finance/option_chain?q=GOOG
without success. The page formats itself using JavaScript, so it has to be "executed" in a browser.
selenium/phantomjs seems to be able to process the page.
Can they be considered for inclusion as a parsing dependency?
Using phantomjs might also help with other HTML parsing issues experienced when using bs4/lxml/html5lib.
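For illustration, here is a minimal sketch of the "render first, then parse" idea discussed in this thread. It is not code from pandas or from the original post. The helper names `fetch_rendered_html` and `read_rendered_tables` are hypothetical, and the sketch assumes a Selenium WebDriver class is passed in by the caller and that `pandas.read_html` can find a `<table>` in the rendered page. A page that fills in its tables asynchronously may also need an explicit wait before `page_source` is read, which is left out here.

```python
from io import StringIO


def fetch_rendered_html(url, make_driver):
    """Load `url` in a browser session and return the HTML after scripts ran.

    `make_driver` is any zero-argument callable returning a Selenium-style
    WebDriver (hypothetical parameter name, an assumption of this sketch).
    """
    driver = make_driver()
    try:
        driver.get(url)            # the browser executes the page's JavaScript
        return driver.page_source  # HTML of the DOM as currently rendered
    finally:
        driver.quit()              # always release the browser process


def read_rendered_tables(url, make_driver, parse=None):
    """Render `url` in a browser, then hand the resulting HTML to a table parser.

    By default the parser is pandas.read_html, imported lazily so that pandas
    is only required when the default is actually used.
    """
    if parse is None:
        import pandas as pd
        parse = pd.read_html
    html = fetch_rendered_html(url, make_driver)
    # Wrap the markup in a file-like object rather than passing a raw string.
    return parse(StringIO(html))
```

Usage would look roughly like `read_rendered_tables("http://www.google.com/finance/option_chain?q=GOOG", webdriver.Firefox)` after `from selenium import webdriver`. Older Selenium releases also shipped `webdriver.PhantomJS`, which was removed in Selenium 4, so a headless Firefox or Chrome driver is the likelier choice today. Keeping the browser as a caller-supplied argument also keeps selenium out of the parsing code's own dependencies, which is the concern raised in the comments.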