Unlock efficient Python web scraping with requests, BeautifulSoup for static sites, Selenium for dynamic content, and advanced tools like Firecrawl and AgentQL. Includes code examples, error handling, and optimization tips for scalable data extraction.
## Core Principles for Effective Web Scraping
Adopt modular, readable Python code following PEP 8. Focus on efficiency, ethical practices like rate limiting, and respecting robots.txt. Always start with site exploration to map data structures.
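As a rough illustration of the robots.txt and rate-limiting advice above, the sketch below uses the standard library's `urllib.robotparser` to filter candidate URLs and read a crawl delay. The robots.txt text, the `my-scraper` user agent, and the helper names are assumptions made for the example; a real scraper would load the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network I/O."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def allowed_urls(parser, urls, user_agent="my-scraper"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]

parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/articles/1",
    "https://example.com/private/admin",
]
permitted = allowed_urls(parser, candidates)
# Use the site's requested delay, falling back to 1 second if none is given.
delay = parser.crawl_delay("my-scraper") or 1

for url in permitted:
    # A real loop would fetch the page here and then sleep for `delay` seconds.
    print(f"would fetch {url}, then wait {delay}s")
```

Keeping the permission check in its own function makes it easy to run before any request is issued, whichever fetching library is used.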
## Scraping Static Websites with Requests and BeautifulSoup
For basic HTML pages, fetch content via HTTP and parse selectively.
```python
import requests
from bs4 import BeautifulSoup
import time
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
url = 'https://example.com'
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
titles = soup.find_all('h2', class_='title')
data = [{'title': title.text.strip()} for title in titles]
print(data)
time.sleep(1) # Rate limiting
```
## Dynamic Sites with Selenium
Handle JavaScript-rendered pages using a headless browser.
```python
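from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com')
    # Wait up to 10 s for the JS-rendered elements instead of sleeping a fixed time
    elements = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.dynamic-content'))
    )
    data = [elem.text for elem in elements]
finally:
    driver.quit()  # Always release the browser, even if the wait times out
print(data)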
```
## Advanced Text Extraction with Firecrawl and Jina
Firecrawl excels at deep crawling; Jina for AI-enhanced structuring.
```python
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key='your_key')
result = app.crawl_url('https://example.com', params={'maxDepth': 2})
print(result['data'])
```
Use Jina for semantic parsing pipelines.
## Complex Interactions with AgentQL and Multion
Automate logins or forms with AgentQL for structured queries.
```python
from agentql import AgentQL
aql = AgentQL()
data = aql.run('https://example.com', 'form input[name="username"]')
print(data)
```
Multion suits exploratory tasks like ticket booking.
## Data Validation, Cleaning, and Storage
Ensure data integrity before saving.
```python
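import json

import pandas as pd

# `data` is the list of dicts produced by an earlier scraping step
# Validate: drop incomplete rows and exact duplicates
df = pd.DataFrame(data)
df = df.dropna().drop_duplicates()

# Store the cleaned records as CSV and JSON
df.to_csv('scraped_data.csv', index=False)
with open('data.json', 'w') as f:
    json.dump(df.to_dict(orient='records'), f, indent=2)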
```
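When pandas is not available, or when records should be checked before they reach a DataFrame, the same validation idea can be sketched in plain Python. The `clean_records` helper and its `required` parameter are hypothetical names for this example, and the sketch assumes every field value is hashable.

```python
def clean_records(records, required=("title",)):
    """Strip strings, drop records missing a required field, remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Normalize whitespace on string values only
        normalized = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Skip records where a required field is missing, empty, or None
        if any(not normalized.get(field) for field in required):
            continue
        # Use the sorted items as a hashable fingerprint for de-duplication
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned

raw = [
    {"title": "  First post "},
    {"title": ""},
    {"title": "First post"},
    {"title": None},
]
cleaned = clean_records(raw)
print(cleaned)
```

Normalizing before de-duplicating means values that differ only in surrounding whitespace collapse into one record.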
## Robust Error Handling and Retries
Use exponential backoff for reliability.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = 'https://example.com'
session = requests.Session()
# Retry up to 3 times, with exponential backoff, on transient HTTP errors
retry_strategy = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount('http://', adapter)
session.mount('https://', adapter)
try:
    response = session.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f'Error: {e}')
```
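For operations that do not go through a `requests` session, such as a browser step or an SDK call, the same exponential backoff idea can be sketched as a small helper. The `retry_with_backoff` function, its parameters, and the `flaky` demo are assumptions for illustration, not part of any library.

```python
import time

def retry_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry.

    Re-raises the last exception once all retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))

# Demo: a hypothetical flaky operation that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

waits = []
# Record the waits instead of sleeping so the demo runs instantly
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the helper easy to test. In production the default `time.sleep` applies, and catching a narrower exception type than `Exception` is usually preferable.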
## Performance Boosts and Concurrency
Parallelize I/O-bound scraping with a thread pool for speed.
```python
from concurrent.futures import ThreadPoolExecutor

def scrape_url(url):
    # Single scrape logic
    pass

urls = ['url1', 'url2']
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(scrape_url, urls))
```
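With `executor.map`, the first exception surfaces while iterating the results and the remaining results are lost. A possible variation, sketched below, submits each URL separately and collects successes and failures independently. The `scrape_url` body and the `scrape_all` helper are placeholders assumed for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_url(url):
    """Hypothetical stand-in for real scrape logic."""
    if "bad" in url:
        raise ValueError(f"cannot scrape {url}")
    return {"url": url, "length": len(url)}

def scrape_all(urls, max_workers=5):
    """Scrape URLs concurrently, returning (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so failures can be attributed
        futures = {executor.submit(scrape_url, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = str(exc)
    return results, errors

results, errors = scrape_all(["https://example.com/a", "https://example.com/bad"])
print(results)
print(errors)
```

Failed URLs end up in `errors`, where they can be logged or queued for a retry without discarding the pages that succeeded.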
Cache repeated requests with `requests-cache` (`pip install requests-cache`), which provides a `CachedSession` that can stand in for `requests.Session`.
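To show what such a cache does without depending on a third-party package, here is a minimal in-memory sketch with time-based expiry. The `TTLCache` class and `fake_fetch` function are hypothetical and only illustrate the idea; they do not reflect how `requests-cache` is implemented.

```python
import time

class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # url -> (timestamp, value)

    def get_or_fetch(self, url, fetch):
        """Return the cached value for url, calling fetch(url) only when needed."""
        now = self.clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = fetch(url)
        self._store[url] = (now, value)
        return value

fetch_log = []

def fake_fetch(url):
    """Hypothetical fetcher that records each real call."""
    fetch_log.append(url)
    return f"<html>{url}</html>"

cache = TTLCache(ttl=60)
first = cache.get_or_fetch("https://example.com", fake_fetch)
second = cache.get_or_fetch("https://example.com", fake_fetch)
print(len(fetch_log))
```

The second lookup is served from memory, so the site is hit only once within the expiry window.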
## Essential Dependencies
```bash
pip install requests beautifulsoup4 selenium lxml pandas firecrawl agentql requests-cache
```
## Best Practices Summary
- Modular functions for reusability.
- Log everything for debugging.
- Cache and profile code.
- Ethical scraping: delays, headers, ToS compliance.
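The logging advice from the summary can be sketched with the standard `logging` module. The `fetch_logged` wrapper, the `scraper` logger name, and the two demo fetchers are assumptions for this example.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("scraper")

def fetch_logged(url, fetch):
    """Run fetch(url), logging the start, the outcome, and any failure."""
    logger.info("fetching %s", url)
    try:
        body = fetch(url)
    except Exception:
        # logger.exception records the traceback alongside the message
        logger.exception("failed to fetch %s", url)
        return None
    logger.info("fetched %s (%d chars)", url, len(body))
    return body

def broken_fetch(url):
    """Hypothetical fetcher that always times out."""
    raise TimeoutError("timed out")

ok = fetch_logged("https://example.com", lambda url: "<html></html>")
failed = fetch_logged("https://example.com/slow", broken_fetch)
```

Returning `None` on failure is one possible convention; re-raising after logging works equally well when the caller handles retries.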