Feedback #19

@jonashaag

Gave this a try :-)

Feedback:

  • If this library works as advertised it'd be huge!
  • mlscraper.html is missing from the PyPI package.
  • When no scraper can be found, the error message could be more helpful:
    mlscraper.training.NoScraperFoundException: did not find scraper
    It would be nice if the error message indicated which fields couldn't
    be found in the HTML.
    Even at DEBUG log level, the output isn't really helpful.
  • Training the scraper was really slow (I gave up after 15 min).
  • See more notes in my script below.

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

jonas_url = "https://github.com/jonashaag"
resp = requests.get(jonas_url)
resp.raise_for_status()

page = Page(resp.content)
sample = Sample(
    page,
    {
        "name": "Jonas Haag",
        "followers": "329",  # Does not work if 329 is passed as an int.
        #'company': '@QuantCo',  # Does not work.
        "twitter": "@_jonashaag",  # Does not work without the "@".
        "username": "jonashaag",
        "nrepos": "282",
    },
)

training_set = TrainingSet()
training_set.add_sample(sample)

scraper = train_scraper(training_set)

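# Apply the trained scraper to a different profile page.
lorey_url = "https://github.com/lorey"
resp = requests.get(lorey_url)
resp.raise_for_status()
result = scraper.get(Page(resp.content))
print(result)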
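
To illustrate the kind of guidance the error message could give, here is a rough sketch of a pre-training check that reports which sample fields never appear in the page text. `find_missing_fields` and `_TextExtractor` are hypothetical helpers written for this illustration, not part of mlscraper. The check uses only the standard library and looks only at text nodes, so a value that exists solely in an HTML attribute would be reported as missing.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects all text nodes of an HTML document (script/style included)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def find_missing_fields(html, item):
    """Return the keys of `item` whose value does not occur in the page text.

    Hypothetical helper, not part of mlscraper. Values are compared as
    strings, so an int such as 329 is checked as "329".
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so multi-word values still match.
    text = " ".join(" ".join(parser.chunks).split())
    return [key for key, value in item.items() if str(value) not in text]
```

With the script above, one could call `find_missing_fields(resp.text, {...})` on the sample dict before `train_scraper` to see which fields are worth double-checking.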
