Skip to content

Force https for urls in scraper #2222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 1, 2025
Merged

Conversation

shanbady
Copy link
Contributor

@shanbady shanbady commented Apr 30, 2025

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/7216

Description (What does it do?)

This PR makes it so that http start urls are forced to https before beginning the scraping process. Some of the apis our etls hit return http urls and the redirect to https is causing unexpected behavior.

How can this be tested?

  1. Make sure EMBEDDINGS_EXTERNAL_FETCH_USE_WEBDRIVER is True (even as you switch branches to test)
  2. On main (or any other branch but this one) - scrape an "http" url using the scraper and observe the content size:
from learning_resources.site_scrapers.utils import scraper_for_site
scraper = scraper_for_site("http://micromasters.mit.edu/scm/")
content = scraper.scrape()
print(len(content))
  1. checkout this branch.
  2. re-run the script and observe the content length - it should be significantly more since it scraped multiple pages

@shanbady shanbady added the Needs Review An open Pull Request that is ready for review label Apr 30, 2025
Copy link

OpenAPI Changes

Show/hide No detectable change.

@shanbady shanbady marked this pull request as ready for review April 30, 2025 18:41
@ChristopherChudzicki ChristopherChudzicki self-assigned this May 1, 2025
Copy link
Contributor

@ChristopherChudzicki ChristopherChudzicki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting 70853 on both main and your branch after restarting all containers when running

from learning_resources.site_scrapers.utils import scraper_for_site
scraper = scraper_for_site("http://micromasters.mit.edu/scm/")
content = scraper.scrape()
print(len(content))

But this change makes sense to me, and seems plausible that it's not always reproducible, so 👍

Edit: Making sure EMBEDDINGS_EXTERNAL_FETCH_USE_WEBDRIVER=True, now I get expected difference. 👍

@@ -5,6 +5,7 @@


def scraper_for_site(url):
url = url.replace("http://", "https://")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW http:// is legal in query parameters, so theoretically this could change more than the protocol. (E.g., our login URL is https://api.rc.learn.mit.edu/learn/login?next=https://rc.learn.mit.edu/)

Very likely those sorts of URLs don't matter for this scraping (and maybe changing http to https in query params would be desirable, too).

Copy link

github-actions bot commented May 1, 2025

OpenAPI Changes

Show/hide No detectable change.

Copy link

github-actions bot commented May 1, 2025

OpenAPI Changes

Show/hide No detectable change.

Copy link

github-actions bot commented May 1, 2025

OpenAPI Changes

Show/hide No detectable change.

Copy link
Contributor

@ChristopherChudzicki ChristopherChudzicki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, meant to approve!

@shanbady shanbady merged commit e41d4e5 into main May 1, 2025
13 checks passed
@shanbady shanbady deleted the shanbady/https-for-marketing-urls branch May 1, 2025 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review An open Pull Request that is ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants