multi page marketing site scraping #2196

shanbady · 2025-04-11T20:38:40Z

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/7081

Description (What does it do?)

This PR allows us to scrape multiple pages for a given marketing site (for purposes of embedding).
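For readers unfamiliar with the idea, here is a minimal sketch of multi-page scraping. It is not the code in this PR: the names `LinkAndTextParser`, `scrape_site`, and the `fetch` callable are hypothetical, and it uses only the standard library rather than a webdriver. It follows same-host links from a start page and concatenates the text of every page it visits.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collect visible text and href values from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def scrape_site(start_url, fetch, max_pages=10):
    """Breadth-first crawl of same-host pages; returns their combined text.

    `fetch` is a hypothetical callable (url -> HTML string). In a real
    setup it might wrap an HTTP client or a webdriver; that is an assumption.
    """
    host = urlparse(start_url).netloc
    queue, seen, chunks = [start_url], {start_url}, []
    while queue and len(chunks) < max_pages:
        url = queue.pop(0)
        parser = LinkAndTextParser()
        parser.feed(fetch(url))
        chunks.append(" ".join(parser.text_parts))
        for href in parser.links:
            # Resolve relative links and drop fragments before comparing.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return "\n\n".join(chunks)


# Usage with in-memory pages (made-up URLs), so no network access is needed.
pages = {
    "https://example.org/program/": '<a href="/program/faq/">FAQ</a><p>Overview</p>',
    "https://example.org/program/faq/": '<a href="https://other.example/">x</a><p>Questions</p>',
}
combined = scrape_site("https://example.org/program/", pages.__getitem__)
```

The crawl stays on the starting host and is capped by `max_pages`, so a site with many tabs cannot cause an unbounded fetch.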

How can this be tested?

Check out this branch
Rebuild your web and celery containers
Set settings.EMBEDDINGS_EXTERNAL_FETCH_USE_WEBDRIVER = True
Restart the celery container (docker compose down/up celery)
Run the task to fetch marketing page data:

from learning_resources.tasks import scrape_marketing_pages
scrape_marketing_pages.run()

Inspect the content of the marketing pages:

from learning_resources.models import ContentFile
cfs = ContentFile.objects.filter(file_type="marketing_page")
print(cfs.first().content)

For MicroMasters program pages, the content should include text from all of the pages (the tabs at the top):

from learning_resources.models import ContentFile
ContentFile.objects.filter(file_type="marketing_page", learning_resource__url__icontains='micromasters')
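One way to make the "content from all the pages" check less manual is a small helper that reports which expected tab headings are absent from a scraped string. This is only a sketch: `missing_sections` and the heading names are hypothetical, and in a Django shell the `content` argument would be something like `cf.content` from the query above.

```python
def missing_sections(content, expected_headings):
    """Return the headings that do not appear in the content (case-insensitive)."""
    haystack = content.casefold()
    return [h for h in expected_headings if h.casefold() not in haystack]


# Made-up sample text standing in for a ContentFile's content.
sample = "Overview of the program. Courses you will take. FAQ and pricing."
absent = missing_sections(sample, ["Overview", "Courses", "Instructors"])
```

An empty list means every expected heading was found somewhere in the text; it does not prove that each tab was scraped in full.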

github-actions · 2025-04-11T20:39:01Z

OpenAPI Changes

No detectable change.

abeglova

lgtm

shanbady added 8 commits April 9, 2025 15:19

adding initial base scraper

47e7a3e

making scrape task use scraper class

9b27fd6

adding utils file

3b7b163

adding utils file

cc7626d

fixing mock

03870c6

test fixes

9a47025

adding implicit wait

f98128a

adding test for utils

d5a76aa

shanbady changed the title ~~multi marketing page scraping~~ multi page marketing site scraping Apr 11, 2025

shanbady marked this pull request as ready for review April 11, 2025 20:40

abeglova self-assigned this Apr 14, 2025

abeglova approved these changes Apr 14, 2025

shanbady merged commit 4f82990 into main Apr 14, 2025
12 checks passed

shanbady deleted the shanbady/multi-page-scraping branch April 14, 2025 18:45

This was referenced Apr 29, 2025

Release 0.32.0 #2218

Closed

Release 0.32.0 #2227

Closed

Release 0.32.0 #2228

Closed

Release 0.31.2 #2229

Closed

This was referenced May 9, 2025

Release 0.31.2 #2243

Closed

Release 0.31.3 #2244

Merged

3 participants