A powerful, type-safe web scraping library for TypeScript and Bun with zero external dependencies. Built entirely on Bun's native APIs for maximum performance and minimal footprint.
- Zero Dependencies - Built entirely on Bun native APIs
- Fully Typed - Complete TypeScript support with type inference
- High Performance - Optimized for speed on Bun's native runtime
- Client-Side Rendering - Support for JavaScript-heavy sites (React, Vue, Next.js)
- Pagination - Automatic pagination detection and traversal
- Ethical Scraping - Robots.txt support and user-agent management
- Content Extraction - Readability-style main content extraction
- Contact Information - Automatic extraction of emails, phone numbers, addresses, and social profiles
- Metadata Extraction - Open Graph, Twitter Cards, and Schema.org structured data
- Language Detection - Multi-language detection with confidence scoring
- Accessibility Analysis - WCAG compliance checking with scoring
- Performance Metrics - Resource analysis and optimization hints
- ML-Ready Features - Sentiment analysis, entity extraction, text statistics
- Change Detection - Track content changes over time with diff algorithms
- Rate Limiting - Built-in token bucket rate limiter with burst support
- Smart Caching - LRU cache with TTL support and disk persistence
- Automatic Retries - Exponential backoff retry logic with budgets
- Monitoring - Performance metrics and analytics
- Session Management - Cookie jar and session persistence
- Pipeline Architecture - Pipeline-based data extraction and transformation
- Validation - Built-in schema validation for extracted data
- Multiple Export Formats - JSON, CSV, XML, YAML, Markdown, HTML
- Security Tested - Comprehensive XSS, injection, and edge case testing
bun add ts-web-scraper
import { createScraper } from 'ts-web-scraper'
// Create a scraper instance
const scraper = createScraper({
rateLimit: { requestsPerSecond: 2 },
cache: { enabled: true, ttl: 60000 },
retry: { maxRetries: 3 },
})
// Scrape a website
const result = await scraper.scrape('https://example.com', {
extract: doc => ({
title: doc.querySelector('title')?.textContent,
headings: Array.from(doc.querySelectorAll('h1')).map(h => h.textContent),
}),
})
console.log(result.data)
The main scraper class provides a unified API for all scraping operations:
import { createScraper } from 'ts-web-scraper'
const scraper = createScraper({
// Rate limiting
rateLimit: {
requestsPerSecond: 2,
burstSize: 5
},
// Caching
cache: {
enabled: true,
ttl: 60000,
maxSize: 100
},
// Retry logic
retry: {
maxRetries: 3,
initialDelay: 1000
},
// Performance monitoring
monitor: true,
// Change tracking
trackChanges: true,
// Cookies & sessions
cookies: { enabled: true },
})
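With this configuration, repeated requests to the same URL within the cache TTL should be served from the cache, which shows up in the monitoring stats (getStats is covered in the monitoring section below). A minimal sketch:
// Hit the same URL twice; the second request should come from the cache,
// since caching is enabled above and both calls fall within the 60s TTL.
await scraper.scrape('https://example.com')
await scraper.scrape('https://example.com')
console.log(scraper.getStats().cacheHitRate) // should reflect the cached second request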
Extract and transform data using pipelines:
import { extractors, pipeline } from 'ts-web-scraper'
const extractProducts = pipeline()
.step(extractors.structured('.product', {
name: '.product-name',
price: '.product-price',
rating: '.rating',
}))
.map('parse-price', p => ({
...p,
price: Number.parseFloat(p.price.replace(/[^0-9.]/g, '')),
}))
  .filter('in-stock', p => p.price > 0)
.sort('by-price', (a, b) => a.price - b.price)
const result = await extractProducts.execute(document)
Track content changes over time:
const scraper = createScraper({ trackChanges: true })
// First scrape
const result1 = await scraper.scrape('https://example.com', {
extract: doc => ({ price: doc.querySelector('.price')?.textContent }),
})
// result1.changed === undefined (no previous snapshot)
// Second scrape
const result2 = await scraper.scrape('https://example.com', {
extract: doc => ({ price: doc.querySelector('.price')?.textContent }),
})
// result2.changed === false (if price hasn't changed)
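Because changed is computed against the previous snapshot, a simple loop is enough for lightweight monitoring. A minimal sketch reusing the API shown above (the interval and selector are illustrative):
// Poll a page and log whenever the tracked extraction differs from the last snapshot.
async function watchPrice(url: string, intervalMs: number) {
  const watcher = createScraper({ trackChanges: true })
  while (true) {
    const result = await watcher.scrape(url, {
      extract: doc => ({ price: doc.querySelector('.price')?.textContent }),
    })
    if (result.changed)
      console.log(`Price changed at ${url}:`, result.data)
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
}
watchPrice('https://example.com', 60_000)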
Export scraped data to multiple formats:
import { exportData, saveExport } from 'ts-web-scraper'
// Export to JSON
const json = exportData(data, { format: 'json', pretty: true })
// Export to CSV
const csv = exportData(data, { format: 'csv' })
// Save to file (format auto-detected from extension)
await saveExport(data, 'output.csv')
await saveExport(data, 'output.json')
await saveExport(data, 'output.xml')
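The remaining formats from the feature list follow the same pattern; the file extensions below are assumptions based on the auto-detection described above:
// Assumed extensions for the other supported formats (YAML, Markdown, HTML)
await saveExport(data, 'output.yaml')
await saveExport(data, 'output.md')
await saveExport(data, 'output.html')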
Automatically traverse paginated content:
for await (const page of scraper.scrapeAll('https://example.com/posts', {
extract: doc => ({
posts: extractors.structured('article', {
title: 'h2',
content: '.content',
}).execute(doc),
}),
}, { maxPages: 10 })) {
console.log(`Page ${page.pageNumber}:`, page.data)
}
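The iterator composes with the export helpers. This sketch accumulates each page's extracted data and writes a single file, reusing saveExport from the export section:
// Gather every page's extracted data, then persist it in one export.
const pages: unknown[] = []
for await (const page of scraper.scrapeAll('https://example.com/posts', {
  extract: doc => ({
    posts: extractors.structured('article', { title: 'h2', content: '.content' }).execute(doc),
  }),
}, { maxPages: 10 })) {
  pages.push(page.data)
}
await saveExport(pages, 'posts.json')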
Track and analyze scraping performance:
const scraper = createScraper({ monitor: true })
await scraper.scrape('https://example.com')
await scraper.scrape('https://example.com/page2')
const stats = scraper.getStats()
console.log(stats.totalRequests) // 2
console.log(stats.averageDuration) // Average time per request
console.log(stats.cacheHitRate) // Cache effectiveness
const report = scraper.getReport()
console.log(report) // Formatted performance report
Validate extracted data against schemas:
const result = await scraper.scrape('https://example.com', {
extract: doc => ({
title: doc.querySelector('title')?.textContent,
price: Number.parseFloat(doc.querySelector('.price')?.textContent || '0'),
}),
validate: {
title: { type: 'string', required: true },
price: { type: 'number', min: 0, required: true },
},
})
if (result.success) {
// Data is valid and typed
console.log(result.data.title, result.data.price)
}
else {
console.error(result.error)
}
For full documentation, visit https://ts-web-scraper.netlify.app
bun test
The test suite provides comprehensive coverage of:
- Core scraping functionality (static & client-side rendered)
- Content extraction (main content, contact info, metadata)
- Analysis features (accessibility, performance, ML, language detection)
- Rate limiting, caching, and retry logic
- Data extraction pipelines and validation
- Change detection and monitoring
- Export formats and session management
- Security (XSS, injection attacks, sanitization)
- Edge cases (malformed HTML, extreme values, encoding issues)
Please see our releases page for more information on what has changed recently.
Please see CONTRIBUTING for details.
For help, discussion about best practices, or any other conversation that would benefit from being searchable:
For casual chit-chat with others using this package:
Join the Stacks Discord Server
"Software that is free, but hopes for a postcard." We love receiving postcards from around the world showing where Stacks is being used! We showcase them on our website too.
Our address: Stacks.js, 12665 Village Ln #2306, Playa Vista, CA 90094, United States
We would like to extend our thanks to the following sponsors for funding Stacks development. If you are interested in becoming a sponsor, please reach out to us.
The MIT License (MIT). Please see LICENSE for more information.
Made with 💙