I’m sure everyone is already familiar with Jakarta’s Lucene project. However, did you know that there are other, lesser-known open source full-text search engines written in Java? Well, there are a bunch of them, and they are by no means less capable than the de facto standard.
- Lucene – The de facto open source search index, used almost everywhere. Features include ranked searching, boolean and phrase queries, fielded searching and date-range searching. Lucene also serves as the search engine of Nutch.
- Egothor – Its impressive demo is worth a look. Key features include: indexing of HTML, PDF, PS, and Microsoft DOC and XLS files; Golomb, Elias-Gamma and block coding; a universal stemmer that can process almost any language; and both a Boolean model and a vector model.
- Carrot2 – Carrot2 is a research framework for experimenting with automated querying of various data sources (such as search engines), processing the search results and visualizing them.
- BDDBot – BDDBot is a web robot, search engine, and web server written entirely in Java. It was written by Tim Macinta for his book (co-authored with Wes Sonnenreich), a Web Developer’s Guide to Search Engines. It was written as an example for a chapter on how to write your own search engine, and as such it is very simplistic.
- MG4J – MG4J lets you build compressed full-text indices for large collections of documents using sophisticated techniques such as interpolative coding. Moreover, it provides utility classes that are essential in any serious text-processing activity.
- eXist – Primarily designed as an XML database; however, it includes an inverted index that speeds up XPath-based queries. The author describes it as follows: “Indexing in eXist is based on a numbering scheme which supports quick identification of structural relationships between nodes, such as parent-child, ancestor-descendant or previous-/next-sibling. This way, a wide range of common path expressions is processed only using indexing information”.
- JXTA Search – JXTA Search is a JXTA service which enables efficient search in distributed networks. JXTA Search is based on technology originally developed by Infrasearch, which was acquired by Sun in March 2001. JXTA Search searches for content and services on JXTA nodes and on the web from either network. I’m not 100% certain whether this project includes its own full-text search engine, but from a quick glance it appears to.
- XQEngine – A full-text search engine for XML documents. Utilizes XQuery as its front-end query language. XPath expressions let you specify constraints on attributes and element hierarchies, in addition to the specific word content.
- Zilverline – Searches a set of files and directories within a directory. PDF, Word, txt, java, CHM and HTML are supported, as well as zip and rar files. Search results can be retrieved from local disk or remotely, if you run a webserver on your machine. Files inside zip, rar and chm files are extracted, indexed and can be cached. The cache can be mapped to sit behind your webserver as well.
- XXL – XXL is a Java library that contains a rich infrastructure for implementing advanced query processing functionality. The library offers low-level components like access to raw disks as well as high-level ones like a query optimizer. On the intermediate levels, XXL provides a demand-driven cursor algebra, a framework for indexing and a powerful package for supporting aggregation.
- Red Piranha – Red Piranha combines Lucene (Searching Ability), XML-RDF (ability to learn), Tomcat (for P2P Power) and Spring (Ease of use) to not only let you find anything, anywhere, but to actually understand what you are looking for.
- Regain – Regain is a search engine that doesn’t search the web, but searches your own files and documents. There are two versions of Regain: the desktop search and the server search. The desktop search is meant for a normal desktop computer and offers a fast search for documents or intranet webpages. The server search can be installed on web servers. It provides searching functionality for a website or for intranet fileservers.
- Solr – Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP APIs, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat. It also includes an extensible plugin architecture.
- OpenGrok – OpenGrok is a fast and usable source code search and cross reference engine. It helps you search, cross-reference and navigate your source tree. It can understand various program file formats and version control histories like SCCS, RCS, CVS and Subversion. OpenGrok provides a fast search engine that can: search for full text, definitions, symbols, path and revision history; limit searches to any subtree (hierarchical search); search using a Google-like query syntax (e.g. path:Makefile defs:target); search for files modified within a date range; and search using wildcards like * (many characters) or ? (one character).
- Terrier – TERabyte RetrIEveR is a comprehensive, flexible, robust, and transparent platform for research and experimentation in text retrieval. Terrier has been tested with collections of at least 25 million documents. It offers out-of-the-box indexing for documents of various formats, such as HTML, PDF, or Microsoft Word, Excel and PowerPoint files. It supports classic retrieval models, such as tf-idf and Okapi’s BM25, as well as several language models and Rocchio’s query expansion.
- JZKit 2 – JZKit 2 is a toolset for building advanced search and retrieve applications. The framework provides components that cover all aspects of building searching applications from directory and collection description services through to record schema / syntax translation and aggregate item deduplication.
- Argos – Argos is a Java-based interface designed to provide unified methods for querying internet search engines. Currently many search engines provide their own interfaces for programmatic access. These access mechanisms vary from simply providing search results in XML to supplying code in one or more languages.
- Snapper – This full-text indexing and search engine is designed to work on millions of documents in an unlimited number of different intranet / LAN “sites”. The included search client application is completely XML/XSLT based, representing search and result pages as easily customizable XML/XSLT/HTML documents. Common office file formats are supported by native Java file parsers: MS Office, Outlook, PDF, HTML, TXT, ZIP, tar.gz, PST, pictures and scanned images. Document metadata from relational databases can be merged into the document index. Index data can be updated incrementally.
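
Several of the projects above, Lucene among them, are built around an inverted index: a map from each term to the documents that contain it, which is what makes boolean queries cheap. For anyone who hasn't seen one, here is a minimal sketch in plain Java. It is not taken from any of the listed projects; the class name `TinyInvertedIndex` and its methods `add` and `searchAll` are made up for illustration, and it leaves out everything a real engine adds on top (ranking, stemming, compression, on-disk storage).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class; not the API of Lucene or any project listed above.
class TinyInvertedIndex {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Lowercase the text and split it on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Index a document: record its id under every term it contains.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Boolean AND query: ids of documents that contain every given term.
    List<Integer> searchAll(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
            if (docs == null) {
                return new ArrayList<>(); // an unknown term means no document can match
            }
            if (result == null) {
                result = new TreeSet<>(docs); // copy so the index itself is not modified
            } else {
                result.retainAll(docs);       // intersect with the next posting list
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a full-text search library");
        index.add(2, "Solr is a search server based on Lucene");
        index.add(3, "eXist is an XML database");

        System.out.println(index.searchAll("lucene", "search")); // prints [1, 2]
        System.out.println(index.searchAll("xml"));              // prints [3]
    }
}
```

The query never scans the documents themselves; it only intersects the posting lists of the query terms. Real engines keep those lists compressed, which is roughly where codings such as the Golomb and Elias-Gamma schemes mentioned for Egothor come in.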
I’m sure that there are other projects out there that I have missed, so if you know of any, please let me know!