Tool to help scrape, mirror, and push content to an s3 website, and then queue it in Pocket.
The 'mirror site' also includes a .png screenshot and .pdf "print to PDF" version.
Additional ability to consume links via Slack Bot or Google Tasks API (see settings.cfg.example)
- scrape site using firefox (selenium) + load `.xpi` plugins
- parse article using newspaper3k (text)
- take screenshot
- print `.pdf`
- format `.html`
- upload files to `s3` bucket under
- queue item in Pocket (`-p/--push-pocket`)
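
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes Selenium 4 and newspaper3k are installed, and the `artifact_paths` and `mirror` names, the hash-based folder layout, and the default binary paths are all hypothetical.

```python
import base64
import hashlib
from html import escape
from pathlib import Path


def artifact_paths(url: str, out_dir: str) -> dict:
    """Hypothetical layout: one folder per URL, named by a short hash of the URL."""
    slug = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    base = Path(out_dir) / slug
    return {kind: base / f"index.{kind}" for kind in ("html", "png", "pdf")}


def mirror(url, out_dir, xpi_files=(), geckodriver="bin/geckodriver",
           firefox_binary="bin/firefox/firefox"):
    # Third-party imports stay local so artifact_paths() works without them.
    from newspaper import Article
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    paths = artifact_paths(url, out_dir)
    paths["html"].parent.mkdir(parents=True, exist_ok=True)

    options = Options()
    options.add_argument("-headless")
    options.binary_location = str(Path(firefox_binary).resolve())
    service = Service(executable_path=str(Path(geckodriver).resolve()))
    driver = webdriver.Firefox(options=options, service=service)
    try:
        for xpi in xpi_files:  # load .xpi plugins for this session only
            driver.install_addon(str(Path(xpi).resolve()), temporary=True)
        driver.get(url)
        page_source = driver.page_source
        driver.save_screenshot(str(paths["png"]))  # .png screenshot
        pdf_b64 = driver.print_page()  # "print to PDF", returned as base64
        paths["pdf"].write_bytes(base64.b64decode(pdf_b64))
    finally:
        driver.quit()

    article = Article(url)
    article.download(input_html=page_source)  # reuse the rendered HTML
    article.parse()
    body = f"<h1>{escape(article.title)}</h1>\n<pre>{escape(article.text)}</pre>"
    paths["html"].write_text(body, encoding="utf-8")
    return paths
```

The upload and Pocket steps are left out here; only the browser and parsing steps are shown.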
Similar to scrape url, but accepts URLs from a Slack bot. The Slack bot runs synchronously, so it is best run as a systemd service.
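
As a rough sketch of how such a bot might pick URLs out of messages: Slack wraps links in message text as `<https://example.com>` or `<https://example.com|label>`. The example assumes the `slack_bolt` library in Socket Mode (which would match the `bot_token`/`app_token` pair in the settings); `extract_urls`, `run_bot`, and `handle_url` are hypothetical names, and the real bot may be built differently.

```python
import re

# Slack renders links in message text as <url> or <url|label>.
SLACK_LINK = re.compile(r"<(https?://[^>|\s]+)(?:\|[^>]*)?>")


def extract_urls(text: str) -> list:
    """Return the URLs found in a Slack message, in order of appearance."""
    return SLACK_LINK.findall(text)


def run_bot(bot_token: str, app_token: str, handle_url) -> None:
    # Imported lazily so extract_urls() can be used without slack_bolt installed.
    from slack_bolt import App
    from slack_bolt.adapter.socket_mode import SocketModeHandler

    app = App(token=bot_token)

    @app.event("message")
    def on_message(event, say):
        for url in extract_urls(event.get("text") or ""):
            handle_url(url)  # e.g. the scrape/mirror step
            say(f"queued {url}")

    # start() blocks the calling thread, hence running it as a service.
    SocketModeHandler(app, app_token).start()
```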
(TODO: dockerize this with something like selenium firefox as a base image.)
Probably doesn't work on Windows without a few tweaks to pathing.
- Python 3.9
- poetry
- firefox (tested on `89.0.1`)
- geckodriver (tested on `0.30.0`)
- firefox dependencies (varies by system). Example for Debian 11:
- libgtk-3-0
- gconf-service
- libasound2
- libatk1.0-0
- libc6
- libcairo2
- libcups2
- libdbus-1-3
- libexpat1
- libfontconfig1
- libgcc1
- libgconf-2-4
- libgdk-pixbuf2.0-0
- libglib2.0-0
- libnspr4
- libpango-1.0-0
- libpangocairo-1.0-0
- libstdc++6
- libx11-6
- libx11-xcb1
- libxcb1
- libxcomposite1
- libxcursor1
- libxdamage1
- libxext6
- libxfixes3
- libxi6
- libxrandr2
- libxrender1
- libxss1
- libxtst6
- ca-certificates
- fonts-liberation
- libnss3
- lsb-release
- xdg-utils
- wget
- geckodriver Supported platforms
- Firefox Releases - e.g. `https://archive.mozilla.org/pub/firefox/releases/{{ firefox_version }}/linux-x86_64/en-US/firefox-{{ firefox_version }}.tar.bz2`
- geckodriver Releases - e.g. `https://github.com/mozilla/geckodriver/releases/download/v{{ geckodriver_version }}/geckodriver-v{{ geckodriver_version }}-linux64.tar.gz`
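
The two templated URLs can be filled in with a small helper like the one below. The `download_urls` name is hypothetical; the defaults are simply the versions listed as tested above.

```python
FIREFOX_URL = (
    "https://archive.mozilla.org/pub/firefox/releases/"
    "{version}/linux-x86_64/en-US/firefox-{version}.tar.bz2"
)
GECKODRIVER_URL = (
    "https://github.com/mozilla/geckodriver/releases/download/"
    "v{version}/geckodriver-v{version}-linux64.tar.gz"
)


def download_urls(firefox_version: str = "89.0.1",
                  geckodriver_version: str = "0.30.0") -> dict:
    """Fill the version placeholders in both release URLs."""
    return {
        "firefox": FIREFOX_URL.format(version=firefox_version),
        "geckodriver": GECKODRIVER_URL.format(version=geckodriver_version),
    }


if __name__ == "__main__":
    for name, url in download_urls().items():
        print(name, url)
```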
| folder | description |
|---|---|
| `xpi/` | firefox plugins that get loaded into selenium, e.g. bypass-paywall-chrome |
| `bin/` | EXPECTS: geckodriver and compatible `firefox/firefox` binary |
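
A quick pre-flight check for the `bin/` layout might look like this. It is only a sketch: `check_bin` is a hypothetical helper, and it assumes the two binaries live at `bin/geckodriver` and `bin/firefox/firefox` relative to the project root.

```python
from pathlib import Path

# Relative paths the tool is assumed to expect under the project root.
EXPECTED_BINARIES = ("bin/geckodriver", "bin/firefox/firefox")


def check_bin(project_root: str = ".") -> list:
    """Return the expected binaries that are missing under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_BINARIES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = check_bin()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("bin/ looks complete")
```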
poetry install
- Obtain a pocket consumer key
- update `pocket_consumer_key` in `settings.cfg`
- use `./readcli pocket gen-access-token` to get an access token
- update `pocket_access_token` in `settings.cfg`
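
Once both values are set, queueing a URL comes down to a single POST to Pocket's v3 `add` endpoint. The sketch below uses only the standard library and is not necessarily how `readcli` does it; `build_add_request` and `queue_in_pocket` are hypothetical names.

```python
import json
import urllib.request

POCKET_ADD_URL = "https://getpocket.com/v3/add"


def build_add_request(url: str, consumer_key: str,
                      access_token: str) -> urllib.request.Request:
    """Build (but do not send) the request that queues `url` in Pocket."""
    payload = {
        "url": url,
        "consumer_key": consumer_key,
        "access_token": access_token,
    }
    return urllib.request.Request(
        POCKET_ADD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json; charset=UTF-8",
            "X-Accept": "application/json",
        },
    )


def queue_in_pocket(url: str, consumer_key: str, access_token: str) -> dict:
    # Sends the request; raises urllib.error.HTTPError on a non-2xx response.
    request = build_add_request(url, consumer_key, access_token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```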
- setup an s3 bucket website (e.g. with domain name)
- create IAM user with s3 bucket permissions
- update in `settings.cfg`: `bucket_name`, `bucket_is_domain_alias`, `aws_access_key_id`, `aws_secret_access_key`
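
With those settings in place, uploading the mirrored files could be sketched as below, assuming `boto3` is installed. The `upload_plan` and `upload` names and the key prefix are hypothetical. Setting `ContentType` explicitly matters for a website bucket, since otherwise browsers may download the files instead of rendering them.

```python
import mimetypes
from pathlib import Path


def upload_plan(files, prefix: str) -> list:
    """Pair each local file with its S3 key and a guessed Content-Type."""
    plan = []
    for name in files:
        path = Path(name)
        content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        plan.append((path, f"{prefix.strip('/')}/{path.name}", content_type))
    return plan


def upload(files, prefix, bucket_name, aws_access_key_id, aws_secret_access_key):
    # Imported lazily so upload_plan() works without boto3 installed.
    import boto3

    s3 = boto3.client(
        "s3",
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
    )
    for path, key, content_type in upload_plan(files, prefix):
        s3.upload_file(str(path), bucket_name, key,
                       ExtraArgs={"ContentType": content_type})
```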
- setup a slack bot
- update in `settings.cfg`: `bot_token`, `app_token`
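
To confirm that every key mentioned above has been filled in, a small check like the following could help. It is a sketch under assumptions: the section names inside `settings.cfg` are not shown here, so the helper ignores sections and only looks at key names; `missing_keys` is a hypothetical name.

```python
import configparser

REQUIRED_KEYS = (
    "pocket_consumer_key",
    "pocket_access_token",
    "bucket_name",
    "bucket_is_domain_alias",
    "aws_access_key_id",
    "aws_secret_access_key",
    "bot_token",
    "app_token",
)


def missing_keys(cfg_text: str) -> list:
    """Return required keys that are absent or empty, whatever section they are in."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    found = {}
    for section in parser.sections():
        found.update(parser.items(section))
    return [key for key in REQUIRED_KEYS if not found.get(key, "").strip()]


if __name__ == "__main__":
    with open("settings.cfg", encoding="utf-8") as handle:
        print(missing_keys(handle.read()) or "all keys set")
```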