MG4J – A free full-text search engine for large document collections

verytrivial · on April 26, 2017

That name sounds very familiar, as does the feature set. Managing Gigabytes[1], or "mg", was the output of a University of Melbourne and RMIT research project in the 1990s. It went on to be commercialized as SIM and later TeraText[2], and has largely disappeared into the government intelligence indexing and consulting-heavy systems space (where it is now presumably being trounced by Palantir).

[1] https://www.amazon.com/Managing-Gigabytes-Compressing-Indexi... - Note review from Peter Norvig!

[2] http://www.teratext.com/

timb07 · on April 26, 2017

That's exactly what I thought - I worked on index construction for MG back in 1994. (Note, although my name is Tim Bell, I'm not Timothy C. Bell, the coauthor of "Managing Gigabytes".)

vigna · on April 27, 2017

I don't know how this project ended up here at this moment, but as one of the authors let me answer the main questions.

1) The name is just a coincidence. I learned originally about indexing from the "Managing Gigabytes" book, and that's the reason for the name, but the book is now completely obsolete, and, even at that time, it contained a significant number of red herrings. There's no connection or code or idea sharing of any kind.

2) MG4J is our playground for doing research in information retrieval. This means, for example, that we designed new data structures, such as Elias-Fano indexing, which give MG4J ridiculously faster times in benchmarks (see https://github.com/lintool/IR-Reproducibility). Elias-Fano is now the main Facebook indexing algorithm and it is slowly percolating to Lucene (look in the sources).

3) You can define your queries using a very rich interval language with a very fast implementation based on new algorithms. You can easily create parallel indices with text and tagging and ask whether a phrase falls into an area tagged as "location", for example.

4) MG4J is a project of two people, and at this time I'm the only maintainer. You cannot expect it to be as refined as Lucene or Solr. But you can very easily hack into it (even without modifying the sources), which is why it has been popular with people experimenting with indexing. For example, there are many tools to manipulate indices: splitting them with a specified strategy, combining them, etc.

5) So if you want an out-of-the-box solution for indexing, forget about it. If you want a fun playground for doing research or a very efficient backbone on which to build your infrastructure, MG4J might be useful to you. We used it recently for http://wikirank.di.unimi.it/ .
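
For readers who have not met the Elias-Fano representation mentioned above, here is a minimal sketch of the basic idea in Python. It is not MG4J's code or API: the names `ef_encode` and `ef_get` are made up for illustration, the low bits are kept in a plain list rather than a packed bit array, and the "select" step is a linear scan where a real implementation would use a constant-time select structure.

```python
def ef_encode(values, universe):
    """Encode a non-decreasing list of ints in [0, universe).

    Returns (low_bits, lows, upper), where `upper` is a Python int
    used as a bit vector. Hypothetical helper, for illustration only.
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    # Number of low bits per element: floor(log2(universe / n)), at least 0.
    low_bits = max(0, (universe // n).bit_length() - 1)
    mask = (1 << low_bits) - 1
    # Low parts are stored verbatim, low_bits bits each.
    lows = [v & mask for v in values]
    # High parts are stored in unary: element i sets bit (high_i + i).
    upper = 0
    for i, v in enumerate(values):
        upper |= 1 << ((v >> low_bits) + i)
    return low_bits, lows, upper


def ef_get(encoded, i):
    """Return the i-th value (0-based) from an ef_encode() result."""
    low_bits, lows, upper = encoded
    if not 0 <= i < len(lows):
        raise IndexError(i)
    # select(i): position of the i-th set bit, found here by linear scan.
    seen = -1
    pos = 0
    while True:
        if (upper >> pos) & 1:
            seen += 1
            if seen == i:
                break
        pos += 1
    # The high part is the number of zeros before that bit, i.e. pos - i.
    return ((pos - i) << low_bits) | lows[i]


postings = [3, 4, 7, 13, 14, 15, 21, 43]   # e.g. sorted document IDs
enc = ef_encode(postings, universe=64)
decoded = [ef_get(enc, i) for i in range(len(postings))]
```

With n elements below u, this layout uses roughly 2 + log2(u/n) bits per element while still allowing random access to any element, which is what makes it attractive for posting lists.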

dumbfounder · on April 26, 2017

Blast from the past! Distributed is a bit of a stretch, I think you need to coordinate all of that yourself. It is no more distributed than Lucene (I think).

Their fastutil stuff is pretty interesting though for creating highly optimized algorithms. Lots of primitive-based data structures that are fast and memory efficient.

styfle · on April 26, 2017

How does this compare to Elasticsearch or Solr?

drdaeman · on April 26, 2017

I think it makes more sense to compare it with Lucene (which both ElasticSearch and Solr are based on) or, say, Xapian.

Based on a PDF http://grupoweb.upf.es/WRG/mir2ed/pdf/slides_appendixA.pdf page 14 (linked from http://stackoverflow.com/q/5028314/116546), I think the differences are that MG4J has constant RAM usage (as opposed to Lucene's linear one) but is somewhat more CPU-intensive.

Haven't used either directly.

fizx · on April 26, 2017

The comparison is with an 8-year-old version of Lucene. Lucene is (optionally) constant RAM now.

fizx · on April 26, 2017

As someone who has run multi-terabyte Lucene-based installs, managing gigabytes isn't that interesting ;)

It's competing with a project with hundreds of committers, an amazing ecosystem, tons of diverse users, etc.

dozzie · on April 26, 2017

And both are written in Java. Using a different runtime would make a differentiating point in some applications.

regularfry · on April 26, 2017

There was a C port at one point - https://github.com/dbalmain/ferret, maybe others. No idea if it's current or what the feature set comparison might look like.

dozzie · on April 27, 2017

Good to know there is something like this. Maybe I'll do something with it in the future. Thanks.

woliveirajr · on April 26, 2017

Some links on the unimi.it site are broken.

bawllz · on April 26, 2017

how is this on the first page of hackernews?

anigbrowl · on April 27, 2017

It's hard to answer that without any idea of why you find its presence surprising.

ww520 · on April 27, 2017

What is the problem exactly?