Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
tags: framework data-mining web-scraping

StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is released under the Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.
tags: web-crawler

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
tags: web-crawler web-crawling web-data-crawling

Mixnode is a fast, flexible and massively scalable web crawler in the cloud. Using Mixnode eliminates the need for upfront investment in infrastructure, hardware, software and labour that would be required if you built or ran your own web crawler.
tags: crawling web-crawler web-crawling web-scraper web-scraping

ACHE is a web crawler for domain-specific search.
tags: web-crawler web-crawling web-scraper web-scraping