- JSON Endpoint Scraper - Fast scraping using Reddit's `.json` endpoints (no authentication required; see the sketch after this list)
- Advanced Requests Scraper - Custom pagination and bulk scraping capabilities
- Proxy Rotation - Automatic proxy switching with health monitoring
- Captcha Solving - Automated captcha handling using Capsolver API
- User Agent Rotation - Realistic browser simulation
- Rate Limiting - Respectful request throttling
- Rich CLI Interface - Beautiful command-line interface with progress bars
- Multiple Export Formats - JSON and CSV output with full comment thread data
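The JSON-endpoint approach is simple enough to illustrate directly. The sketch below uses plain requests rather than the scraper's own code; the URL pattern and response fields are Reddit's public listing format:

# Minimal illustration of the .json endpoint idea (not the scraper's internal code):
# appending .json to a subreddit URL returns its listing without authentication.
import requests

headers = {"User-Agent": "RedditScraper/1.0.0"}  # a descriptive User-Agent reduces throttling
resp = requests.get(
    "https://www.reddit.com/r/python/hot.json",
    params={"limit": 10},
    headers=headers,
    timeout=10,
)
resp.raise_for_status()

for child in resp.json()["data"]["children"]:
    post = child["data"]
    print(post["title"], post["score"])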
git clone https://github.com/proxidize/reddit-scraper.git
cd reddit-scraper
uv venv
source .venv/bin/activate
uv pip install -e .
git clone https://github.com/proxidize/reddit-scraper.git
cd reddit-scraper
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -e .[dev]
uv pip install -e .[dev]
python tests/run_tests.py
pytest tests/ -v --cov=reddit_scraper
pytest tests/unit/ -v -m unit
pytest tests/integration/ -v -m integration
pytest tests/ -v -m "not slow"
pytest tests/ --cov=reddit_scraper --cov-report=html
- unit - Fast unit tests
- integration - Integration tests that may hit external APIs
- slow - Slow tests that should be skipped in CI
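As a hypothetical illustration of how these markers drive test selection (the test names and bodies below are made up; only the marker names come from the list above):

# pytest -m unit runs only the first test; pytest -m "not slow" skips the second.
import pytest

@pytest.mark.unit
def test_subreddit_name_normalization():
    assert "Python".lower() == "python"

@pytest.mark.slow
def test_full_subreddit_scrape():
    # A long-running end-to-end scrape would live here.
    pass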
docker build -t reddit-scraper .
docker run -v $(pwd)/config.json:/app/config.json reddit-scraper interactive --config config.json
docker run -v $(pwd)/config.json:/app/config.json reddit-scraper json subreddit python --limit 10 --config config.json
docker run -v $(pwd)/config.json:/app/config.json -v $(pwd)/output:/app/output reddit-scraper json subreddit python --limit 10 --output output/posts.json --config config.json
python3 -m reddit_scraper.cli interactive
python3 -m reddit_scraper.cli interactive --config config.json
python3 -m reddit_scraper.cli json subreddit python --limit 10
python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50
Note: If you've properly installed the package with `pip install -e .`, you can use `reddit-scraper` directly instead of `python3 -m reddit_scraper.cli`.
The scraper uses a JSON configuration file to manage all settings including proxies, captcha solvers, and scraping preferences.
Copy `config.example.json` to `config.json` and edit:
{
  "proxies": [
    {
      "host": "proxy1.example.com",
      "port": 8080,
      "username": "your_proxy_username",
      "password": "your_proxy_password",
      "proxy_type": "http"
    },
    {
      "host": "proxy2.example.com",
      "port": 1080,
      "username": "your_proxy_username",
      "password": "your_proxy_password",
      "proxy_type": "socks5"
    }
  ],
  "captcha_solvers": [
    {
      "api_key": "CAP-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "provider": "capsolver",
      "site_keys": {
        "reddit.com": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-",
        "www.reddit.com": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"
      }
    }
  ],
  "scraping": {
    "default_delay": 1.0,
    "max_retries": 3,
    "requests_per_minute": 60,
    "user_agent": "RedditScraper/1.0.0",
    "rotate_user_agents": true
  }
}
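Once config.json is in place, the same settings can be read programmatically through get_config_manager (used again in the Python API section below). A small sketch; the attribute names on the returned scraping config mirror the JSON keys above and, apart from default_delay, are assumptions:

from reddit_scraper import get_config_manager

# Load and validate config.json, then inspect the scraping settings.
config_manager = get_config_manager("config.json")
scraping = config_manager.get_scraping_config()

print("Delay between requests:", scraping.default_delay)
print("Requests per minute:", scraping.requests_per_minute)  # assumed attribute name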
- Multiple Proxies: Add multiple HTTP and SOCKS5 proxies for automatic rotation
- Captcha Solving: Integrate with Capsolver for automated captcha handling with custom site keys
- Input Validation: Automatic validation of subreddit names, usernames, and other inputs
- Flexible Configuration: Easy JSON-based configuration management with validation
- Health Monitoring: Built-in proxy health checking and performance monitoring
cp config.example.json config.json
nano config.json
python3 -m reddit_scraper.cli status --config config.json
The scraper includes robust input validation and data processing capabilities:
- Subreddit Names: Validates format, length (1-21 chars), and checks for reserved names
- Usernames: Validates Reddit username format (3-20 chars, alphanumeric plus underscore/hyphen)
- Post IDs: Ensures proper Reddit post ID format
- URLs: Validates and normalizes Reddit URLs
- Comment Threading: Maintains proper parent-child relationships in comment trees
- Data Cleaning: Removes unnecessary metadata while preserving essential information
- Field Standardization: Consistent field names and data types across all scraped content
from reddit_scraper import JSONScraper, ValidationError

scraper = JSONScraper()
try:
    # Invalid subreddit names are rejected before any request is made.
    posts = scraper.scrape_subreddit("invalid-name!", "hot", 10)
except ValidationError as e:
    print(f"Validation error: {e}")
python3 -m reddit_scraper.cli interactive [--config CONFIG_FILE]
python3 -m reddit_scraper.cli json subreddit SUBREDDIT_NAME [--config CONFIG_FILE] [options]
python3 -m reddit_scraper.cli json user USERNAME [options]
python3 -m reddit_scraper.cli json comments SUBREDDIT POST_ID [options]
python3 -m reddit_scraper.cli json subreddit-with-comments SUBREDDIT_NAME [options]
Extract rich comment data with full thread structure:
python3 -m reddit_scraper.cli json subreddit-with-comments python --limit 10 --include-comments --comment-limit 20 --output posts_with_comments.json
python3 -m reddit_scraper.cli json comments python POST_ID --sort best --output single_post_comments.json
python3 -m reddit_scraper.cli json user username --limit 25 --sort top --output user_posts.json
Comment Data Includes:
- Author information and scores
- Full comment text and timestamps
- Nested reply structure
- Thread hierarchy and relationships
- Community engagement metrics
Real Example (Actual Scraped Data):
{
  "title": "A simple home server to wirelessly stream any video file",
  "author": "Enzo10091",
  "score": 8,
  "num_comments": 1,
  "comment_count_scraped": 1,
  "comments": [
    {
      "id": "lwg8h3x",
      "author": "ismail_the_whale",
      "body": "nice, but you really have to clean this up. i guess you're not a python dev.\n\n- use snake_case\n- use a pyproject.toml file",
      "score": 2,
      "created_utc": 1755262448.0,
      "parent_id": "t3_1mqw7zr",
      "replies": []
    }
  ]
}
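Because every comment carries a replies list of the same shape, the nested thread can be walked with a few lines of plain Python. A minimal sketch that flattens the structure shown above (field names taken from the example):

# Flatten a nested comment thread into (depth, author, body) rows.
def walk_comments(comments, depth=0):
    for comment in comments:
        yield depth, comment["author"], comment["body"]
        # Each reply has the same shape as a top-level comment, so recurse.
        yield from walk_comments(comment.get("replies", []), depth + 1)

post = {"comments": [{"author": "ismail_the_whale", "body": "nice, but ...", "replies": []}]}
for depth, author, body in walk_comments(post["comments"]):
    print("  " * depth + f"{author}: {body[:60]}")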
python3 -m reddit_scraper.cli requests paginated SUBREDDIT_NAME [options]
python3 -m reddit_scraper.cli status --config config.json
python3 -m reddit_scraper.cli test-proxies --config config.json --test-urls 3
cp config.example.json config.json
nano config.json
python3 -m reddit_scraper.cli status --config config.json
python3 -m reddit_scraper.cli search "python tips" --subreddit python
python3 -m reddit_scraper.cli search "neural networks" --subreddit MachineLearning
Reddit has some protection against automated scraping:
- Some subreddits may trigger captcha challenges (r/webscraping, etc.)
- Large bulk requests may hit rate limits (a simple retry sketch follows the recommendations below)
- Search endpoints work but may be slower than direct scraping
Recommended approach:
- Use interactive mode for best success rate
- Start with popular, stable subreddits like python and technology
- Use proxies and captcha solving for reliable large-scale scraping
- Search functionality works well for targeted queries
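When scripting against the Python API, one way to stay under those rate limits is a simple delay-and-retry loop. A rough sketch of the idea; the retry policy here is illustrative, not the scraper's built-in behavior:

import time
from reddit_scraper import JSONScraper

scraper = JSONScraper(delay=2.0)  # delay parameter as shown in the Python API examples below

def scrape_with_retries(subreddit, sort="hot", limit=25, max_retries=3):
    """Retry with a growing pause between attempts (illustrative policy only)."""
    for attempt in range(max_retries):
        try:
            return scraper.scrape_subreddit(subreddit, sort, limit)
        except Exception:
            # Back off progressively before the next attempt.
            time.sleep(2.0 * (attempt + 1))
    return []

posts = scrape_with_retries("python", limit=25)
print(f"Fetched {len(posts)} posts")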
python3 -m reddit_scraper.cli interactive --config config.json
python3 -m reddit_scraper.cli json subreddit python --limit 10
python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50
python3 -m reddit_scraper.cli search "python tips" --subreddit python
python3 -m reddit_scraper.cli requests paginated python --max-posts 100
python3 -m reddit_scraper.cli status --config config.json
python3 -m reddit_scraper.cli test-proxies --config config.json
Subreddits that work well:
- python, programming, technology
- news, todayilearned
- entrepreneur, startups
- `--config`, `-c` - Path to configuration file
- `--output`, `-o` - Output file path
- `--format` - Output format (json, csv)
- `--limit` - Number of items to fetch
- `--sort` - Sort method (hot, new, top, rising, etc.)
- `--delay` - Delay between requests (seconds)
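For example, several of these options can be combined in a single call (assuming, as the commands above suggest, that they all apply to the json subreddit subcommand):

python3 -m reddit_scraper.cli json subreddit python --limit 25 --sort top --format csv --output top_python.csv --delay 2.0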
# Basic scraping with default settings
# (setup_advanced_features is assumed here to be importable from the package)
from reddit_scraper import JSONScraper, get_config_manager, setup_advanced_features

scraper = JSONScraper()
posts = scraper.scrape_subreddit("python", "hot", 50)

# Advanced scraping: load config.json and wire in proxies and captcha solving
config_manager = get_config_manager("config.json")
proxy_manager, captcha_solver = setup_advanced_features(config_manager)

advanced_scraper = JSONScraper(
    proxy_manager=proxy_manager,
    captcha_solver=captcha_solver,
    delay=config_manager.get_scraping_config().default_delay,
)
posts = advanced_scraper.scrape_subreddit("MachineLearning", "top", 1000)
from reddit_scraper import ProxyManager

proxy_manager = ProxyManager()
# Register an HTTP proxy: host, port, username, password, proxy type
proxy_manager.add_proxy("proxy.example.com", 8080, "user", "pass", "http")

# Check that every registered proxy is reachable, then report the stats
proxy_manager.health_check_all()
stats = proxy_manager.get_proxy_stats()
print(f"Healthy proxies: {stats['healthy_proxies']}/{stats['total_proxies']}")
from reddit_scraper import CaptchaSolverManager

solver = CaptchaSolverManager("YOUR_CAPSOLVER_API_KEY")

# Check the Capsolver balance, then solve a reCAPTCHA v2 for the given page and site key
solution = solver.check_balance_and_solve(
    solver.solver.solve_recaptcha_v2,
    "https://reddit.com",
    "site_key_here",
)
if solution.success:
    print(f"Captcha solved: {solution.solution}")
- Always respect Reddit's Terms of Service
- Don't overload Reddit's servers
- Consider using the official API for commercial use
- Default: 1 second delay between requests
- Use appropriate delays between requests
- Increase delay for large-scale operations
- Monitor proxy health to avoid IP bans
- Store scraped data responsibly
- Respect user privacy
- Don't republish personal information
reddit-scraper test-proxies
reddit-scraper status
reddit-scraper status
- Increase the `--delay` parameter
- Use a configuration file with multiple proxies
- Reduce `--limit` per request
This project integrates with Capsolver for automated captcha solving, supporting:
- reCAPTCHA v2/v3
- hCaptcha
- FunCaptcha
- Image-to-text captchas
Compatible with Reddit's public JSON endpoints for FREE data access.
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project is for educational and research purposes. Please respect Reddit's Terms of Service and robots.txt.
For a detailed walkthrough of how this Reddit scraper was built, including the challenges faced and solutions implemented, read our comprehensive blog post:
Reddit Scraper: How to Scrape Reddit for Free
The blog post covers:
- Why Python was chosen for this project
- How pagination problems were solved
- Different approaches for small vs large scraping jobs
- Proxy rotation and error handling strategies
- Real-world examples and use cases
For issues, questions, or feature requests, please open an issue on GitHub or contact support@proxidize.com.
Note: This tool is designed for ethical data collection and research purposes. Always comply with Reddit's Terms of Service and respect rate limits.