Force https for urls in scraper #2222

shanbady · 2025-04-30T18:40:56Z

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/7216

Description (What does it do?)

This PR forces http start URLs to https before the scraping process begins. Some of the APIs our ETLs hit return http URLs, and the redirect to https causes unexpected behavior.

How can this be tested?

Make sure EMBEDDINGS_EXTERNAL_FETCH_USE_WEBDRIVER is True (even as you switch branches to test).
On main (or any branch other than this one), scrape an "http" URL using the scraper and observe the content size:

from learning_resources.site_scrapers.utils import scraper_for_site
scraper = scraper_for_site("http://micromasters.mit.edu/scm/")
content = scraper.scrape()
print(len(content))

Check out this branch.
Re-run the script and observe the content length. It should be significantly larger, since the scraper now scrapes multiple pages.

github-actions · 2025-04-30T18:41:13Z

OpenAPI Changes

No detectable change.

ChristopherChudzicki

I'm getting 70853 on both main and your branch after restarting all containers when running

from learning_resources.site_scrapers.utils import scraper_for_site
scraper = scraper_for_site("http://micromasters.mit.edu/scm/")
content = scraper.scrape()
print(len(content))

But this change makes sense to me, and seems plausible that it's not always reproducible, so 👍

Edit: Making sure EMBEDDINGS_EXTERNAL_FETCH_USE_WEBDRIVER=True, now I get expected difference. 👍

ChristopherChudzicki · 2025-05-01T14:23:01Z

learning_resources/site_scrapers/utils.py

@@ -5,6 +5,7 @@


 def scraper_for_site(url):
+    url = url.replace("http://", "https://")


BTW http:// is legal in query parameters, so theoretically this could change more than the protocol. (E.g., our login URL is https://api.rc.learn.mit.edu/learn/login?next=https://rc.learn.mit.edu/)

Very likely those sorts of URLs don't matter for this scraping (and maybe changing http to https in query params would be desirable, too).
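As an illustration of the point above, here is a minimal sketch of a scheme-only upgrade using the standard library's `urllib.parse`. The helper name `force_https` is hypothetical and is not part of this PR, which uses `str.replace`. The sketch only shows how the query string could be left untouched if that ever matters.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url: str) -> str:
    """Hypothetical helper: upgrade only the URL scheme from http to https."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        # SplitResult is a namedtuple, so _replace returns a modified copy.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


# The scheme is upgraded, but an http:// inside the query string is preserved.
start_url = "http://example.com/login?next=http://example.com/"
print(force_https(start_url))

# For comparison, str.replace rewrites every occurrence, including the query.
print(start_url.replace("http://", "https://"))
```

Whether preserving `http://` inside query parameters is desirable depends on the URLs being scraped. As noted above, rewriting them may be acceptable or even preferable here.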

github-actions · 2025-05-01T15:59:41Z

OpenAPI Changes

No detectable change.

github-actions · 2025-05-01T16:02:37Z

OpenAPI Changes

No detectable change.

github-actions · 2025-05-01T16:17:27Z

OpenAPI Changes

No detectable change.

ChristopherChudzicki

oops, meant to approve!

shanbady added 2 commits April 30, 2025 13:48

force http to https

d60f615

fixing util method to convert to https

d51b706

shanbady added the Needs Review label Apr 30, 2025

shanbady marked this pull request as ready for review April 30, 2025 18:41

ChristopherChudzicki self-assigned this May 1, 2025

ChristopherChudzicki reviewed May 1, 2025

adding test mode field

9af6840

adding filters to pull in test_mode resources

2ff2ac9

shanbady added 2 commits May 1, 2025 12:16

Revert "adding test mode field"

2646c1b

This reverts commit 9af6840.

Revert "adding filters to pull in test_mode resources"

18d49ec

This reverts commit 2ff2ac9.

ChristopherChudzicki approved these changes May 1, 2025

shanbady merged commit e41d4e5 into main May 1, 2025
13 checks passed

shanbady deleted the shanbady/https-for-marketing-urls branch May 1, 2025 20:29

This was referenced May 6, 2025

Release 0.32.0 #2227

Closed

Release 0.32.0 #2228

Closed

Release 0.31.2 #2229

Closed

Release 0.31.2 #2243

Closed

Release 0.31.3 #2244

Merged
