## Web Mining

  1. scrapy
    Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
    Project Source: https://github.com/scrapy/scrapy
    Project Homepage: http://scrapy.org/

  2. Pattern
    Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
    Project Source: https://github.com/clips/pattern
    Project Homepage: http://www.clips.ua.ac.be/pages/pattern

  3. portia
    Portia is a tool for visually scraping web sites without any programming knowledge.
    Project Source: https://github.com/scrapinghub/portia

  4. python-goose
    HTML content / article extractor, web scraping library in Python.
    Project Source: https://github.com/grangier/python-goose

  5. newspaper
    News extraction, article extraction and content curation in Python.
    Project Source: https://github.com/codelucas/newspaper
    Project Homepage: http://newspaper.readthedocs.org/en/latest/

  6. gensim
    Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.
    Project Source: https://github.com/piskvorky/gensim
    Project Homepage: http://radimrehurek.com/gensim/

  7. distribute_crawler
    A distributed web crawler.
    Project Source: https://github.com/gnemoug/distribute_crawler

  8. pyspider
    A spider system in Python.
    Project Source: https://github.com/binux/pyspider

  9. tagger
    A Python module for extracting relevant tags from text documents.
    Project Source: https://github.com/apresta/tagger

  10. cola
    A distributed crawling framework.
    Project Source: https://github.com/chineking/cola
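
For orientation, the basic step that crawlers like the ones listed above build on, pulling links out of a fetched page, can be sketched with the Python standard library alone. This is a minimal illustration and does not use the API of any project in this list; the names `LinkExtractor` and `extract_links` are hypothetical, and the sketch assumes the HTML has already been downloaded into a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links a crawler would follow.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A full framework adds what this sketch leaves out: fetching, scheduling, deduplication, politeness rules and structured item extraction.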
