What is Scrapling?
Scrapling is an adaptive Python web scraping framework created by developer D4Vinci. It climbed GitHub Trending in 2026, passed 57,000 stars, and has become a representative of the new generation of scraping tools.
Why Do You Need Scrapling?
Traditional scraping tools (such as Scrapy and BeautifulSoup) share three major pain points:
- Website structure changes break code: CSS selectors and XPath expressions stop matching as soon as the page is redesigned
- Anti-bot systems block requests: Protections like Cloudflare Turnstile and Akamai often answer plain HTTP requests with a 403
- Poor scalability: Scaling from small-scale scraping to large-scale concurrent crawling requires rewriting lots of code
Scrapling's core design philosophy is "One library, zero compromises":
- Adaptive Parser: Automatically learns website structure and re-locates elements after page updates
- Built-in Anti-blocking: Works out of the box to bypass mainstream protections like Cloudflare Turnstile
- Scrapy-like API: Developers familiar with Scrapy can migrate seamlessly
- From Single Request to Full Crawling: The same API supports simple fetching and large-scale concurrent crawling
Scrapling vs Traditional Scraping Tools
| Feature | BeautifulSoup | Scrapy | Scrapling |
|---|---|---|---|
| Learning Curve | Low | Medium | Medium |
| Adaptive Parsing | ❌ | ❌ | ✅ |
| Bypass Cloudflare | ❌ | Needs Plugin | ✅ Built-in |
| Concurrent Crawling | ❌ | ✅ | ✅ |
| Dynamic Page Support | ❌ | Needs Middleware | ✅ Built-in Playwright |
| Pause/Resume | ❌ | Needs Extension | ✅ Built-in |
| Streaming Output | ❌ | ❌ | ✅ |
Installing Scrapling
Requirements
- Python 3.10+
- pip package manager
Quick Install
pip install scrapling
Verify Installation
from scrapling.fetchers import Fetcher
# Test basic functionality
p = Fetcher.get('https://example.com')
print(p.css('title::text').get())  # Outputs page title
Optional Dependencies
The fetchers, including browser-based rendering for JavaScript websites, ship as an optional extra. Install the extra, then download the browser dependencies:
pip install "scrapling[fetchers]"
scrapling install
Getting Started: Your First Crawler
Example 1: Simple Page Scraping
Let's start with a simple example: scraping headlines from Hacker News.
from scrapling.fetchers import Fetcher
# Send HTTP request
page = Fetcher.get('https://news.ycombinator.com/')
# Use CSS selector to extract data
stories = page.css('.titleline > a')
for story in stories[:5]:  # Take only the first 5
    title = story.text
    link = story.attrib.get('href', '')
    print(f"Title: {title}")
    print(f"Link: {link}")
    print("-" * 40)
Sample Output:
Title: Show HN: I built a real-time code collaboration tool
Link: https://github.com/example/collab-tool
----------------------------------------
Title: Ask HN: What's your favorite Python library in 2026?
Link: https://news.ycombinator.com/item?id=123456
----------------------------------------
Example 2: Adaptive Parsing (Auto-save)
Scrapling's core feature is adaptive parsing. On the first scrape, pass auto_save=True so that Scrapling stores a fingerprint of the elements your selector matched. After the website is redesigned, pass adaptive=True and Scrapling uses the stored fingerprint to find the target elements again.
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://quotes.toscrape.com/')
# First scrape: enable auto_save
quotes = page.css('.quote', auto_save=True)
for quote in quotes[:3]:
    text = quote.css('.text::text').get()
    author = quote.css('.author::text').get()
    print(f"{text} - {author}")
If the website structure changes:
# Subsequent scrapes: pass adaptive=True
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote', adaptive=True)  # Automatically adapts to new structure!
for quote in quotes[:3]:
    text = quote.css('.text::text').get()
    author = quote.css('.author::text').get()
    print(f"{text} - {author}")
💡 How it works: Scrapling records multiple features of an element (tag type, nearby text, attribute patterns, etc.). Even if CSS class names change, it can re-locate the element through other features.
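The idea in the note above can be sketched with the standard library alone. This is not Scrapling's actual implementation: the feature set, the equal weighting, the 0.5 threshold, and the names fingerprint, similarity, and relocate are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fingerprint(tag, attrs, text, parent_tag):
    """Collect the features used to recognise an element later (illustrative set)."""
    return {"tag": tag, "attrs": dict(attrs), "text": text, "parent": parent_tag}


def similarity(saved, candidate):
    """Score two fingerprints from 0 to 1 as an equal-weight average of four features."""
    tag_score = 1.0 if saved["tag"] == candidate["tag"] else 0.0
    parent_score = 1.0 if saved["parent"] == candidate["parent"] else 0.0
    text_score = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    keys = set(saved["attrs"]) | set(candidate["attrs"])
    if keys:
        same = sum(1 for k in keys if saved["attrs"].get(k) == candidate["attrs"].get(k))
        attr_score = same / len(keys)
    else:
        attr_score = 1.0  # neither element has attributes
    return (tag_score + parent_score + text_score + attr_score) / 4


def relocate(saved, candidates, threshold=0.5):
    """Return the candidate most similar to the saved fingerprint, or None if none is close enough."""
    best = max(candidates, key=lambda c: similarity(saved, c), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best
```

In this sketch a price element whose class name changed still wins, because its tag, parent, and remaining attributes keep the score above the threshold, while unrelated elements fall well below it.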
Advanced Features
1. Bypassing Cloudflare Turnstile
Many websites use Cloudflare Turnstile or other anti-bot systems. Scrapling's StealthyFetcher can handle these protections automatically.
from scrapling.fetchers import StealthyFetcher
# Enable adaptive mode
StealthyFetcher.adaptive = True
# Automatically bypass Cloudflare
page = StealthyFetcher.fetch(
    'https://example-protected-site.com',
    headless=True,      # Headless browser mode
    network_idle=True   # Wait for network idle
)
# Extract data normally
products = page.css('.product-item')
for product in products:
    name = product.css('.name::text').get()
    price = product.css('.price::text').get()
    print(f"{name}: {price}")
Key Parameter Explanation:
- headless=True: Runs the browser without a visible window
- network_idle=True: Waits until network activity has settled before extracting (suitable for SPA apps)
- StealthyFetcher.adaptive = True: Enables adaptive parsing for pages returned by this fetcher
2. Dynamic Page Rendering (Playwright)
For websites requiring JavaScript rendering, use DynamicFetcher.
from scrapling.fetchers import DynamicFetcher
page = DynamicFetcher.fetch(
    'https://spa-example.com',
    wait_selector='.content-loaded',  # Wait for specific element to appear
    timeout=30000                     # Timeout in milliseconds
)
# Extract dynamically loaded content
articles = page.css('article')
for article in articles:
    title = article.css('h2::text').get()
    summary = article.css('.summary::text').get()
    print(f"{title}\n{summary}\n")
3. Asynchronous Concurrent Scraping
Use AsyncFetcher for high-concurrency scraping.
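import asyncio
from scrapling.fetchers import AsyncFetcher

async def fetch_multiple_pages():
    urls = [
        'https://example.com/page/1',
        'https://example.com/page/2',
        'https://example.com/page/3',
        'https://example.com/page/4',
        'https://example.com/page/5',
    ]
    # Allow at most 3 requests in flight at a time
    limit = asyncio.Semaphore(3)

    async def fetch(url):
        async with limit:
            return await AsyncFetcher.get(url)

    # Send requests concurrently; failures come back as exception objects
    pages = await asyncio.gather(*(fetch(url) for url in urls), return_exceptions=True)
    for url, page in zip(urls, pages):
        if isinstance(page, Exception):
            print(f"{url}: Request failed")
        else:
            title = page.css('title::text').get()
            print(f"{url}: {title}")

asyncio.run(fetch_multiple_pages())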
Spider Framework: Large-Scale Crawling
Scrapling provides a Scrapy-like Spider framework that supports large-scale concurrent crawling.
Basic Spider
from scrapling.spiders import Spider, Response

class QuoteSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        # Extract quotes from current page
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall()
            }
        # Go to next page
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield self.follow(next_page, callback=self.parse)

# Start the spider
QuoteSpider().start()
Concurrency Configuration
from scrapling.spiders import Spider, Response

class MultiPageSpider(Spider):
    name = "multi-page"
    start_urls = [f"https://example.com/page/{i}" for i in range(1, 101)]
    # Configure concurrency
    custom_settings = {
        "concurrency": 5,         # Max concurrency
        "download_delay": 1,      # Download delay in seconds
        "robots_txt_obey": True,  # Obey robots.txt
    }

    async def parse(self, response: Response):
        title = response.css('h1::text').get()
        yield {"url": response.url, "title": title}

MultiPageSpider().start()
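To make the robots.txt setting above less abstract, here is a minimal sketch of what obeying robots.txt involves, using Python's standard urllib.robotparser rather than Scrapling itself. The helper names build_robots and is_allowed are hypothetical, and the rules are passed in as lines so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser


def build_robots(lines):
    """Parse robots.txt content that has already been downloaded and split into lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Check a URL against the parsed rules before requesting it."""
    return parser.can_fetch(user_agent, url)
```

A crawler would call is_allowed for every URL it is about to request and skip the ones that come back False.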
Pause and Resume
Scrapling supports checkpoint-based persistence. Stop the crawl gracefully with Ctrl+C, and the next start resumes from the last checkpoint.
from scrapling.spiders import Spider, Response

class LongRunningSpider(Spider):
    name = "long-crawl"
    start_urls = ["https://large-site.com/"]
    custom_settings = {
        "checkpoint_dir": "./checkpoints",  # Checkpoint directory
    }

    async def parse(self, response: Response):
        # Extract data...
        yield {"data": "..."}
        # Continue crawling
        for link in response.css('a::attr(href)').getall():
            yield self.follow(link, callback=self.parse)

LongRunningSpider().start()
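For readers curious what a checkpoint has to contain, the sketch below persists the two pieces of state a crawl needs in order to resume: the URLs still pending and the URLs already seen. It is an illustration only; the JSON layout and the function names save_checkpoint and load_checkpoint are assumptions, not Scrapling's on-disk format.

```python
import json
import os


def save_checkpoint(path, pending, seen):
    """Write crawl state to disk, replacing the old file atomically."""
    state = {"pending": list(pending), "seen": sorted(seen)}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    # os.replace swaps the file in one step, so an interrupted save
    # never leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path, start_urls):
    """Return (pending, seen); start fresh from start_urls if no checkpoint exists."""
    if not os.path.exists(path):
        return list(start_urls), set()
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return state["pending"], set(state["seen"])
```

Writing to a temporary file first is the design choice that matters here: Ctrl+C can arrive at any moment, including halfway through a save.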
Real-World Examples
Case 1: Scraping GitHub Trending Projects
from scrapling.fetchers import Fetcher

def scrape_github_trending():
    page = Fetcher.get('https://github.com/trending')
    repos = page.css('.Box-row')
    trending = []
    for repo in repos[:10]:
        name = repo.css('h2 a::text').get('').strip()
        description = repo.css('p.col-9::text').get('').strip()
        stars = repo.css('[href$=stargazers] span::text').get('').strip()
        language = repo.css('[itemprop=programmingLanguage]::text').get('').strip()
        trending.append({
            "name": name,
            "description": description,
            "stars": stars,
            "language": language
        })
    return trending

if __name__ == "__main__":
    results = scrape_github_trending()
    for repo in results:
        print(f"📦 {repo['name']}")
        print(f"   {repo['description'][:80]}...")
        print(f"   ⭐ {repo['stars']} | 📝 {repo['language']}")
        print()
Case 2: E-commerce Price Monitoring
from scrapling.fetchers import StealthyFetcher
import json
from datetime import datetime

def monitor_prices():
    urls = [
        "https://amazon.com/dp/B08N5WRWNW",
        "https://amazon.com/dp/B0BSHF7WHW",
        "https://amazon.com/dp/B09G9FPHY6",
    ]
    results = []
    for url in urls:
        page = StealthyFetcher.fetch(url, headless=True)
        title = page.css('#productTitle::text').get('').strip()
        price = page.css('.a-price .a-offscreen::text').get('').strip()
        results.append({
            "url": url,
            "title": title,
            "price": price,
            "timestamp": datetime.now().isoformat()
        })
        print(f"✅ {title[:50]}... - {price}")
    # Save to JSON
    with open('price_monitor.json', 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    print(f"\n📊 Saved {len(results)} price records to price_monitor.json")

if __name__ == "__main__":
    monitor_prices()
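The script above stores prices as raw strings such as "$1,299.99", which cannot be compared between runs. A small helper along these lines could convert them to numbers first. The name parse_price is hypothetical, and the sketch assumes US-style formatting with a comma as the thousands separator and a period as the decimal point.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the first number from a price string, or return None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators; Decimal avoids float rounding on currency values
    return Decimal(match.group().replace(",", ""))
```

Returning None for an empty string matters in practice, because the selectors above fall back to '' when a page fails to load as expected.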
Case 3: Streaming Output (Real-time Processing)
For long-running crawlers, you can use streaming mode to process data in real-time.
import asyncio
from scrapling.spiders import Spider, Response

class StreamingSpider(Spider):
    name = "streaming-demo"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            item = {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get()
            }
            yield item  # Yield immediately, no need to wait for completion
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield self.follow(next_page, callback=self.parse)

# Stream consumption
async def main():
    spider = StreamingSpider()
    async for item in spider.stream():
        # Process each item in real-time
        print(f"Received: {item['text'][:50]}... by {item['author']}")
        # Can immediately store in database, send to message queue, etc.

asyncio.run(main())
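The ordering that streaming gives you can be seen with a plain async generator and no scraping library at all. In this sketch the functions crawl and consume are hypothetical stand-ins, and asyncio.sleep(0) takes the place of a network request; the log records the order in which things happen.

```python
import asyncio


async def crawl(pages, log):
    """Yield items page by page; each page is a list of already-extracted items."""
    for number, items in enumerate(pages, start=1):
        await asyncio.sleep(0)  # stands in for the network request
        log.append(f"fetched page {number}")
        for item in items:
            yield item


async def consume(pages):
    """Handle every item as soon as it is yielded and return the event log."""
    log = []
    async for item in crawl(pages, log):
        log.append(f"handled {item}")
    return log
```

Running asyncio.run(consume([["a", "b"], ["c"]])) shows that both items from page 1 are handled before page 2 is fetched, which is the property that lets a long crawl feed a database or queue continuously.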
Advanced Tips
1. Proxy Rotation
from scrapling.spiders import Spider, Response

class ProxySpider(Spider):
    name = "proxy-spider"
    start_urls = ["https://httpbin.org/ip"]
    custom_settings = {
        "proxy_list": [
            "http://proxy1.example.com:8080",
            "http://proxy2.example.com:8080",
            "http://proxy3.example.com:8080",
        ],
        "proxy_rotation": "per_request",  # Rotate proxy per request
    }

    async def parse(self, response: Response):
        ip = response.json().get('origin')
        print(f"Current IP: {ip}")
        yield {"ip": ip}
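Per-request rotation itself is simple round-robin. The sketch below shows the idea with itertools.cycle; the ProxyRotator class is a hypothetical illustration and not part of Scrapling.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies in round-robin order, one per request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the proxy to use for the next request."""
        return next(self._pool)
```

Round-robin spreads requests evenly across the pool; a production setup would usually also drop proxies that keep failing, which this sketch leaves out.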
2. Custom Export Pipeline
from scrapling.spiders import Spider, Response
import csv

class CSVSpider(Spider):
    name = "csv-export"
    start_urls = ["https://quotes.toscrape.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.csv_file = open('quotes.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.csv_file)
        self.writer.writerow(['Text', 'Author', 'Tags'])

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = ', '.join(quote.css('.tag::text').getall())
            self.writer.writerow([text, author, tags])
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield self.follow(next_page, callback=self.parse)

    def close(self, reason):
        self.csv_file.close()
        print("✅ Data saved to quotes.csv")
3. Development Mode (Cache Responses)
When debugging parsing logic, avoid repeated requests to the server.
from scrapling.spiders import Spider, Response

class DevSpider(Spider):
    name = "dev-mode"
    start_urls = ["https://example.com"]
    custom_settings = {
        "dev_mode": True,             # Enable dev mode
        "cache_dir": "./http_cache",  # Cache directory
    }

    async def parse(self, response: Response):
        # First run: cache response to disk
        # Subsequent runs: read directly from disk, no network request needed
        title = response.css('h1::text').get()
        yield {"title": title}
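The caching behaviour described in the comments above boils down to a lookup keyed by URL. Here is a minimal stdlib sketch of that idea. The function cached_fetch and its fetch parameter (any callable that takes a URL and returns the response body as text) are assumptions for illustration, not Scrapling's API.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="./http_cache"):
    """Return the body for url, calling fetch(url) only if it is not cached on disk."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, fixed-length file name
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body
```

Deleting the cache directory is enough to force fresh requests once the parsing logic is finished.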
FAQ
Q1: What's the difference between Scrapling and Scrapy?
A: Scrapling can be seen as a modernized, enhanced version of Scrapy:
- Adaptive Parsing: Scrapy's selectors fail after page redesigns, while Scrapling adapts automatically
- Built-in Anti-blocking: Scrapy needs extra middleware to bypass Cloudflare, Scrapling works out of the box
- Simpler API: Scrapling's Fetcher API is better suited for small-scale quick scraping
If you're already familiar with Scrapy, migrating to Scrapling has almost no learning cost.
Q2: How to handle pages after login?
A: Use StealthyFetcher or DynamicFetcher and drive the login form through the page_action hook, which receives the underlying Playwright page:
from scrapling.fetchers import DynamicFetcher

def login(page):
    # Perform login operation (via Playwright)
    page.fill('#username', 'your_username')
    page.fill('#password', 'your_password')
    page.click('#login-button')
    page.wait_for_selector('#dashboard')  # Wait for redirect after login
    return page

page = DynamicFetcher.fetch(
    'https://example.com/login',
    headless=True,
    page_action=login
)
# Now you can scrape post-login content
data = page.css('.private-data::text').getall()
Q3: How to limit crawling speed to avoid being blocked?
A: Configure download delay and concurrency limits in the Spider:
from scrapling.spiders import Spider

class PoliteSpider(Spider):
    name = "polite"
    custom_settings = {
        "concurrency": 2,         # Lower concurrency
        "download_delay": 2,      # 2-second interval between requests
        "robots_txt_obey": True,  # Obey robots.txt
    }
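A download delay amounts to enforcing a minimum gap between consecutive requests. The sketch below shows one way to do that outside any framework. The Throttle class is hypothetical, and the clock and sleep parameters exist only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay, in seconds, between consecutive calls to wait()."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least `delay` seconds have passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() right before each request gives the same effect as a 2-second download delay when the throttle is created with Throttle(2).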
Q4: What selectors does Scrapling support?
A: Scrapling supports both CSS selectors and XPath:
# CSS Selector
page.css('.class-name::text').get()
page.css('#id-name').getall()
# XPath
page.xpath('//div[@class="example"]/text()').get()
Summary
Scrapling is one of the most noteworthy Python scraping frameworks of 2026. It combines adaptive parsing, anti-blocking capabilities, and a Scrapy-like API, so developers can build resilient crawlers with little code.
Core Advantages Recap:
- ✅ Adaptive Parser: Automatically re-locates elements after page redesigns
- ✅ Built-in Anti-blocking: Out-of-the-box bypass for Cloudflare Turnstile
- ✅ From Single Request to Full Crawling: Same API covers all scenarios
- ✅ Concurrency, Pause/Resume, Streaming Output: Production-grade features included
Resource Links:
- GitHub Repository
- Official Documentation
- Discord Community
If you found this article helpful, feel free to share it with more developers!