A powerful Python tool for scanning websites to identify broken links, dead URLs, and accessibility issues. Perfect for website maintenance, SEO audits, and quality assurance.
- Comprehensive Link Scanning: Crawls websites recursively to find all links
- Multiple Export Formats: Export results to JSON and CSV formats
- Configurable Depth: Control how deep the crawler goes into your site
- Domain Filtering: Option to scan only same-domain links or include external links
- Rate Limiting: Built-in delays to be respectful to target servers
- Detailed Reporting: Categorizes links as working, broken, or error states
- Real-time Progress: Live updates during scanning with emoji indicators
- Flexible Configuration: Customizable via command-line arguments
- Python 3.7+
- Required packages:
  - `requests` - HTTP library for making web requests
  - `beautifulsoup4` - HTML parsing library
  - `lxml` - XML and HTML parser (recommended as the BeautifulSoup backend)
 
- Clone or download the script: `wget https://raw.githubusercontent.com/rakeshf/broken_link_checker.py` or `curl -O https://raw.githubusercontent.com/your-repo/broken_link_checker.py`
- Install dependencies: `pip install requests beautifulsoup4 lxml`
- Make it executable (optional): `chmod +x broken_link_checker.py`
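If you prefer a requirements file (the troubleshooting section below uses `pip install -r requirements.txt`), a minimal `requirements.txt` matching the dependencies above would be:

```text
requests
beautifulsoup4
lxml
```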
Basic usage:

```bash
python broken_link_checker.py <website_url>
```

| Option | Description | Default |
|---|---|---|
| --max-urls <number> | Maximum URLs to scan | 100 | 
| --max-depth <number> | Maximum crawl depth | 2 | 
| --delay <seconds> | Delay between requests (be respectful!) | 1.0 | 
| --external | Include external links (default: same domain only) | False | 
| --json <filename> | Save results to JSON file | None | 
| --csv <filename> | Save results to CSV file | None | 
Basic website scan:
```bash
python broken_link_checker.py https://example.com
```

Comprehensive scan with custom settings:

```bash
python broken_link_checker.py https://example.com \
  --max-urls 500 \
  --max-depth 3 \
  --delay 2 \
  --external
```

Export results to files:

```bash
# JSON export
python broken_link_checker.py https://example.com --json results.json
# CSV export
python broken_link_checker.py https://example.com --csv results.csv
# Both formats
python broken_link_checker.py https://example.com \
  --json results.json \
  --csv results.csv
```

Large website audit:

```bash
python broken_link_checker.py https://mybigsite.com \
  --max-urls 1000 \
  --max-depth 4 \
  --delay 1.5 \
  --json comprehensive_audit.json \
  --csv comprehensive_audit.csv
```

The tool provides real-time feedback with emoji indicators:
- 🕷️ Crawling pages
- ✅ Working links
- ❌ Broken links
- ⚠️ Error links
- 📊 CSV report saved
- 💾 JSON report saved
Example JSON report:

```json
{
  "scan_info": {
    "start_time": "2024-01-15T10:30:00",
    "end_time": "2024-01-15T10:35:30",
    "duration_seconds": 330.5,
    "start_domain": "example.com",
    "max_urls": 100,
    "max_depth": 2,
    "delay": 1.0,
    "same_domain_only": true
  },
  "statistics": {
    "total_urls_processed": 87,
    "working_links_count": 82,
    "broken_links_count": 3,
    "error_links_count": 2,
    "visited_pages_count": 15
  },
  "results": {
    "working_links": [...],
    "broken_links": [...],
    "error_links": [...]
  }
}
```

Example CSV report:

| url | status | status_code | final_url | error | type | timestamp |
|---|---|---|---|---|---|---|
| https://example.com/page1 | working | 200 | https://example.com/page1 | | | 2024-01-15T10:30:15 |
| https://example.com/broken | broken | 404 | https://example.com/broken | | | 2024-01-15T10:30:20 |
| https://timeout.com/page | error | | | Connection timeout | check | 2024-01-15T10:30:25 |
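The JSON report is easy to post-process. A minimal sketch, assuming a `results.json` produced with `--json` and the structure shown above (the per-link field names mirror the CSV columns and are an assumption, since the `[...]` entries are elided in the sample):

```python
import json

# Load a report produced with: python broken_link_checker.py ... --json results.json
with open("results.json") as f:
    report = json.load(f)

stats = report["statistics"]
print(f"Processed {stats['total_urls_processed']} URLs: "
      f"{stats['broken_links_count']} broken, {stats['error_links_count']} errors")

# Print each broken link; "url" and "status_code" are assumed field names (see CSV columns)
for link in report["results"]["broken_links"]:
    print(link.get("url"), link.get("status_code"))
```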
- Regular Health Checks: Schedule weekly/monthly scans to catch broken links early
- Post-Migration Audits: Verify all links work after site migrations or redesigns
- Content Updates: Check links after major content updates
- SEO Audits: Broken links hurt SEO rankings - find and fix them
- User Experience: Ensure visitors don't hit dead ends
- Link Building: Verify outgoing links to maintain site credibility
- Pre-Launch Testing: Scan staging sites before going live
- Continuous Integration: Integrate into CI/CD pipelines (see the sketch after this list)
- Quality Assurance: Regular checks as part of QA processes
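For CI/CD integration, a thin wrapper that runs the scanner and fails the build when problems are found is usually enough. A minimal sketch, assuming the CLI and JSON report format shown above (the staging URL and report filename are placeholders):

```python
#!/usr/bin/env python3
"""CI gate: fail the pipeline if the scan reports broken or error links."""
import json
import subprocess
import sys

REPORT = "ci_link_report.json"  # placeholder output filename

# Run the checker against the site under test (URL is a placeholder)
subprocess.run(
    [sys.executable, "broken_link_checker.py", "https://staging.example.com",
     "--max-urls", "200", "--delay", "1", "--json", REPORT],
    check=True,
)

with open(REPORT) as f:
    stats = json.load(f)["statistics"]

problems = stats["broken_links_count"] + stats["error_links_count"]
if problems:
    print(f"Found {problems} problem links - failing the build")
    sys.exit(1)
print("All links OK")
```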
For small sites (< 100 pages):
`--max-urls 200 --max-depth 3 --delay 0.5`

For medium sites (100-1000 pages):

`--max-urls 500 --max-depth 2 --delay 1.0`

For large sites (1000+ pages):

`--max-urls 1000 --max-depth 2 --delay 2.0`

- Always use delays (--delay) to avoid overwhelming servers
- Start with smaller scans to test site behavior
- Monitor server response and increase delays if needed
- Consider time zones - scan during off-peak hours for target sites
Common issues and suggested fixes:
- Slow responses or timeouts: increase the delay (--delay 2) or reduce concurrent requests
- Rate limiting by the target server: increase the delay significantly (--delay 5) and reduce the maximum number of URLs
- Scans taking too long: reduce --max-urls and run multiple smaller scans
- Too many external links: use the --external flag carefully and consider separate scans for internal and external links
The tool generates detailed reports that can be used for:
- Spreadsheet Analysis: Open CSV in Excel/Google Sheets
- Database Import: Import CSV into databases for further analysis (see the sketch after this list)
- API Integration: Use JSON output in other tools and services
- Reporting: Generate management reports from the data
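As an example of the database route, the CSV report can be loaded into SQLite with the standard library alone. A minimal sketch, assuming a `results.csv` with the columns shown earlier:

```python
import csv
import sqlite3

# Load a report produced with: python broken_link_checker.py ... --csv results.csv
conn = sqlite3.connect("link_audit.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS links (
           url TEXT, status TEXT, status_code TEXT, final_url TEXT,
           error TEXT, type TEXT, timestamp TEXT
       )"""
)

with open("results.csv", newline="") as f:
    rows = [
        (r["url"], r["status"], r["status_code"], r["final_url"],
         r["error"], r["type"], r["timestamp"])
        for r in csv.DictReader(f)
    ]

conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Example query: broken links only
for url, code in conn.execute("SELECT url, status_code FROM links WHERE status = 'broken'"):
    print(url, code)
```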
Contributions are welcome! Areas for improvement:
- Additional export formats (XML, HTML reports)
- Web interface
- Database storage options
- Advanced filtering options
- Performance optimizations
This project is open source. Feel free to use, modify, and distribute.
```bash
# Install all dependencies
pip install -r requirements.txt

# Or install individually
pip install requests beautifulsoup4 lxml
```

```bash
# Make script executable
chmod +x broken_link_checker.py

# Run with Python explicitly
python3 broken_link_checker.py https://example.com
```

For very large scans:
- Use smaller --max-urls values
- Reduce --max-depth
- Run multiple focused scans instead of one large scan
A REST API is provided for programmatic scanning, status checking, and result retrieval.
```bash
uvicorn broken_link_checker_api:app --host 0.0.0.0 --port 8000 --reload
```

Or:

```bash
python broken_link_checker_api.py api
```

POST /scan
Request JSON:
```json
{
  "url": "https://example.com",
  "max_urls": 100,
  "max_depth": 2,
  "delay": 1.0,
  "same_domain_only": true
}
```

- Only `url` is required; other fields are optional.
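For example, starting a scan from Python with `requests` (a minimal sketch, assuming the API is running locally on port 8000 as started above):

```python
import requests

# Kick off a scan; payload fields match the request schema above
resp = requests.post(
    "http://localhost:8000/scan",
    json={"url": "https://example.com", "max_urls": 100, "max_depth": 2},
    timeout=600,  # generous timeout: the sample response below reports "Scan completed"
)
resp.raise_for_status()
print(resp.json())
```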
Response:
```json
{
  "message": "Scan completed",
  "scan_id": "e1b2c3d4-...",
  "result_file": "download/example_com_20250630.json",
  "statistics": { ... },
  "max_urls": 100
}
```

GET /status/{scan_id}
Response:
```json
{
  "status": "completed",
  "total_urls_processed": 100,
  "working_links": 90,
  "broken_links": 8,
  "error_links": 2,
  "broken_links_list": [...],
  "error_links_list": [...],
  "start_domain": "example.com",
  "max_urls": 100,
  "max_depth": 2,
  "delay": 1.0,
  "same_domain_only": true
}
```

- `status` can be `in_progress`, `completed`, or `not_started`.
GET /results/{scan_id}
Returns the full scan results in JSON format.
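A typical client-side pattern is to poll the status endpoint and fetch the full results once the scan completes. A minimal sketch, assuming the API on localhost:8000 and a `scan_id` returned by `POST /scan`:

```python
import time
import requests

BASE = "http://localhost:8000"
scan_id = "e1b2c3d4-..."  # placeholder: use the scan_id returned by POST /scan

# Poll /status/{scan_id} until the scan is no longer in progress
while True:
    status = requests.get(f"{BASE}/status/{scan_id}", timeout=30).json()
    if status["status"] != "in_progress":
        break
    time.sleep(5)

if status["status"] == "completed":
    # results now holds the full JSON report from /results/{scan_id}
    results = requests.get(f"{BASE}/results/{scan_id}", timeout=60).json()
    print(f"Scan finished: {status['broken_links']} broken, {status['error_links']} errors")
else:
    print("Scan not found or not started")
```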
Result files are stored in the download/ directory with a name based on the URL and date, e.g.:
download/example_com_20250630.json
You can download this file directly from the server if you expose a static file endpoint or via SFTP.
Visit http://localhost:8000/docs for interactive Swagger UI.
Happy Link Checking! 🔗✨
For questions, issues, or feature requests, please open an issue in the repository.