Apache Nutch

Apache Nutch alternatives

Scrapy
Scrapy is an open source, collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
tags: framework data-mining web-scraping

StormCrawler
StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is released under the Apache License 2.0 and consists of a collection of reusable resources and components, written mostly in Java.
tags: web-crawler

Heritrix
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
tags: web-crawler web-crawling web-data-crawling

Mixnode
Mixnode is a fast, flexible and massively scalable web crawler in the cloud. Using Mixnode eliminates the need for upfront investment in infrastructure, hardware, software and labour that would be required if you built or ran your own web crawler.
tags: crawling web-crawler web-crawling web-scraper web-scraping

ACHE Crawler
ACHE is a web crawler for domain-specific search.
tags: web-crawler web-crawling web-scraper web-scraping