Force https for urls in scraper #2222
OpenAPI Changes: No detectable change.
I'm getting 70853 on both main and your branch after restarting all containers when running:
from learning_resources.site_scrapers.utils import scraper_for_site
scraper = scraper_for_site("http://micromasters.mit.edu/scm/")
content = scraper.scrape()
print(len(content))
But this change makes sense to me, and it seems plausible that the issue isn't always reproducible, so 👍
Edit: After making sure EMBEDDINGS_EXTERNAL_FETCH_USE_WEBDRIVER=True, I now get the expected difference. 👍
@@ -5,6 +5,7 @@
 def scraper_for_site(url):
+    url = url.replace("http://", "https://")
BTW http:// is legal in query parameters, so theoretically this could change more than the protocol. (E.g., our login URL is https://api.rc.learn.mit.edu/learn/login?next=https://rc.learn.mit.edu/)
Very likely those sorts of URLs don't matter for this scraping (and maybe changing http to https in query params would be desirable, too).
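If limiting the change to the protocol ever becomes desirable, a minimal sketch using the standard library's urllib.parse could look like the following. The helper name force_https is hypothetical and not part of this PR; the PR itself uses str.replace, which is simpler and may well be sufficient here.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url: str) -> str:
    """Upgrade an http scheme to https without touching the rest of the URL.

    Hypothetical helper for illustration only; not the code in this PR.
    """
    parts = urlsplit(url)
    # Only the scheme component is rewritten, so an "http://" that appears
    # inside the path or a query parameter is left as-is.
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```

With this approach a URL such as http://example.com/login?next=http://example.com/ would have only its leading scheme upgraded. One caveat: urlunsplit can normalize some edge cases (for example, dropping a trailing "?" with an empty query), so the output is not guaranteed to be byte-identical to the input apart from the scheme.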
oops, meant to approve!
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/7216
Description (What does it do?)
This PR forces http start URLs to https before the scraping process begins. Some of the APIs our ETLs hit return http URLs, and the redirect to https is causing unexpected behavior.
How can this be tested?