Apache Nutch

Apache Nutch alternatives

  • Scrapy

  • Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

    tags: framework data-mining web-scraping
  • StormCrawler

  • StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is licensed under the Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.

    tags: web-crawler
  • Heritrix

  • Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

    tags: web-crawler web-crawling web-data-crawling
  • Mixnode

  • Mixnode is a fast, flexible and massively scalable web crawler in the cloud. Using Mixnode eliminates the need for upfront investment in infrastructure, hardware, software and labour that would be required if you built or ran your own web crawler.

    tags: crawling web-crawler web-crawling web-scraper web-scraping
  • ACHE Crawler

  • ACHE is a web crawler for domain-specific search.

    tags: web-crawler web-crawling web-scraper web-scraping