Heritrix!

For future documentation improvements, we maintain a Documentation Wishlist (http://webarchive.jira.com/wiki/display/Heritrix/Documentation+Wishlist).
An Introduction to Heritrix (https://webarchive.jira.com/wiki/download/attachments/5441/Mohr-et-al-2004.pdf) provides more detailed information on the structure and design of Heritrix. Some very old information can still be gleaned from the old wiki (http://web.archive.org/web/*/http://crawler.archive.org/cgi-bin/wiki.pl?HomePage).
# Mailing lists

Heritrix alternatives

  • Algolia

  • Algolia provides a developer-friendly RESTful API for instant search on websites and in apps. Most web services and mobile apps, such as Spotify, Salesforce or Amazon, need to provide fast and meaningful access to database objects via a simple search box. People want to find songs, invoices and products in just a few keystrokes.

    tags: api application-search developer-tools full-text-search indexed-search
  • Mixnode

  • Mixnode is a fast, flexible and massively scalable web crawler in the cloud. Using Mixnode eliminates the need for upfront investment in infrastructure, hardware, software and labour that would be required if you built or ran your own web crawler.

    tags: crawling web-crawler web-crawling web-scraper web-scraping
  • Apache Nutch

  • Apache Nutch --

    tags: web-crawler web-crawling web-scraper
  • Expertrec Search Engine

  • Expertrec site search engine helps you add ultra-fast search to your website. It adds autosuggest and search listing pages where people can find products with a few keystrokes. Along with this, you get complete control over your search results, with full merchandising options and access to real-time search analytics.

    tags: angularjs node.js objective python ruby
  • ACHE Crawler

  • ACHE is a web crawler for domain-specific search.

    tags: web-crawler web-crawling web-scraper web-scraping
  • StormCrawler

  • StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is released under the Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.

    tags: web-crawler
  • Google Custom Search Engine

  • With Google Custom Search, add a search box to your homepage to help people find what they need on your website.

    tags: embeddable search-engine