A Python tool that scrapes web content and converts it to clean markdown format using the Firecrawl API

Orinks/CLI-scrape

Docs Scraper

A Python-based web scraping tool that converts web content to markdown format using the Firecrawl API.

Features

  • Converts web pages to clean markdown format
  • Supports JavaScript rendering for dynamic content
  • Configurable wait times for dynamic content loading
  • Custom HTTP headers support
  • Interactive or command-line output file selection
  • Error handling for failed scrapes and file operations

Requirements

  • Python 3.6 or higher
  • Firecrawl API key (sign up at firecrawl.dev)
  • Required Python packages (installed via requirements.txt):
    • python-dotenv
    • firecrawl
    • requests

Setup

  1. Clone this repository
  2. Create a .env file in the root directory with your Firecrawl API key:
    FIRECRAWL_API_KEY=your_api_key_here
    
  3. Install dependencies:
    pip install -r requirements.txt
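
Once the .env file from step 2 exists, the key could be read roughly as shown here. This is a minimal sketch rather than the repository's actual code: the helper names `require_api_key` and `load_key_from_dotenv` are hypothetical, and it assumes the script uses python-dotenv's `load_dotenv()` to copy .env entries into the process environment.

```python
import os


def require_api_key(env=None):
    """Return FIRECRAWL_API_KEY from a mapping, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set; add it to your .env file")
    return key


def load_key_from_dotenv():
    """Load .env into the environment, then read the key (hypothetical helper)."""
    # Imported here so require_api_key stays usable without python-dotenv installed.
    from dotenv import load_dotenv

    load_dotenv()  # looks for a .env file and adds its entries to os.environ
    return require_api_key()
```

Failing early with a clear message keeps a missing key from surfacing later as an opaque API error.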

Usage

Run the script with a URL:

python main.py <url>

Optional arguments:

  • --wait <seconds>: Time to wait for dynamic content to load, in seconds
  • --js: Enable JavaScript rendering
  • --headers "key1:value1,key2:value2": Custom HTTP headers as comma-separated key:value pairs
  • --output <file>: Output markdown file path (if omitted, the script prompts for one)
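
The options above map naturally onto Python's standard argparse module. The sketch below is an assumption about how such a command line could be declared, not the contents of main.py: the `build_parser` helper is hypothetical, and treating `--wait` as a float is a guess, since the README only says "seconds".

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Scrape a URL and save it as markdown.",
    )
    parser.add_argument("url", help="Page to scrape")
    # Assumption: seconds may be fractional, so parse as float.
    parser.add_argument("--wait", type=float, default=None, metavar="seconds",
                        help="Time to wait for dynamic content")
    parser.add_argument("--js", action="store_true",
                        help="Enable JavaScript rendering")
    parser.add_argument("--headers", default=None,
                        help='Custom headers, e.g. "key1:value1,key2:value2"')
    parser.add_argument("--output", default=None, metavar="file",
                        help="Output markdown file path")
    return parser


args = build_parser().parse_args(["https://example.com", "--js", "--wait", "5"])
print(args.url, args.js, args.wait, args.output)
```

Leaving `--output` as `None` when it is not given makes it easy to tell when to fall back to the interactive prompt.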

Examples:

# Basic usage with output prompt
python main.py https://example.com

# Enable JavaScript rendering and specify output file
python main.py https://example.com --js --output result.md

# Wait for dynamic content and add custom headers
python main.py https://example.com --wait 5 --headers "User-Agent:Mozilla/5.0"
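
The `--headers` value in the last example has to be turned into a mapping before it can be sent with a request. A minimal sketch of that step, assuming the documented "key1:value1,key2:value2" format; the `parse_headers` function is hypothetical and may differ from what main.py does:

```python
def parse_headers(raw):
    """Turn "key1:value1,key2:value2" into a dict, raising ValueError on bad input."""
    headers = {}
    for pair in raw.split(","):
        # Split on the first colon only, so values such as URLs keep their colons.
        name, sep, value = pair.partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not name:
            raise ValueError(f"Invalid header {pair!r}; expected key:value")
        headers[name] = value
    return headers


print(parse_headers("User-Agent:Mozilla/5.0"))  # {'User-Agent': 'Mozilla/5.0'}
```

One limitation of a comma-separated format is that a header value containing a comma cannot be expressed without some escaping scheme, which this sketch does not attempt.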

Error Handling

The script handles the following error scenarios:

  • Invalid URLs or connection errors
  • Missing API key
  • Invalid header format
  • File write permission issues
  • Failed JavaScript rendering
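
As an illustration of the file-write case, the sketch below catches `OSError`, the base class of `PermissionError` and `FileNotFoundError`, and reports the failure instead of crashing. The `save_markdown` helper and its return convention are assumptions made for this example, not the script's actual behavior.

```python
from pathlib import Path


def save_markdown(markdown, output_path):
    """Write markdown to output_path; return True on success, False on failure."""
    try:
        Path(output_path).write_text(markdown, encoding="utf-8")
    except OSError as exc:
        # Covers PermissionError, missing directories, and similar file errors.
        print(f"Could not write {output_path}: {exc}")
        return False
    return True
```

A caller could use the return value to decide whether to prompt for a different output path or exit with a non-zero status.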

Output

The script generates a markdown (.md) file containing:

  • Converted web content in markdown format
  • Preserved heading structure
  • Formatted links and images
  • Tables and lists (if present in source)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is open source and available under the MIT License.
