I am a big fan of Project Gutenberg. I love reading, and I love technology, and Project Gutenberg certainly combines the two! So when I needed some test data for my MongoDB integration with Elasticsearch, I downloaded all the plain text books from the project's FTP site. There is a lot more data available on the site, but for this project I was interested only in the actual books, not in the various other formats. Once I had all the books, I loaded them into a MongoDB database, connected it to Elasticsearch, and indexed all of them.

At the end of this little project, I had 37 GB worth of text files, a 35 GB Elasticsearch index, and close to 78K files, all searchable in milliseconds. Some books came in multiple files or had several versions, so the actual number of files is much higher than the number of books: Project Gutenberg offers 43,856 free ebooks to download.

I am sure there are much better book content search tools out there, and this little project is merely a demonstration of MongoDB and Elasticsearch integration rather than an actually useful search of all the books available at Project Gutenberg, though it certainly could be turned into one with a little extra work.

This is what I did. First, get all the books! If you do this, it may take a while: you will be downloading a lot of books, about 37 GB worth of them at the time of this writing.

wget -nd -r -l 10 -A.txt ftp://ftp.ibiblio.org/pub/docs/books/gutenberg/

The -nd flag puts all the .txt files into a single directory instead of recreating Project Gutenberg's directory structure.

After all the books were available locally, I read them into MongoDB with a Ruby script available here: https://github.com/eglute/fun-with-gutenberg/blob/master/guten_mongo.rb. The biggest issue was Ruby encoding: each book had its own encoding, and I needed to convert them all into a valid encoding for MongoDB. I think this was related more to the Ruby driver for MongoDB than to MongoDB itself.
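
The fix itself is small. Roughly speaking, it looks like this (a sketch rather than the exact code from the script, with a placeholder file name):

# Read the raw bytes, then force them into valid UTF-8,
# replacing any bytes that cannot be converted.
raw  = File.read('cwsne10.txt', mode: 'rb')
text = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')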

Since some books were larger than the maximum MongoDB BSON document size of 16 MB, the books were stored as GridFS objects. To learn more about GridFS, check out this page: http://docs.mongodb.org/manual/core/gridfs/.
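
For reference, writing a book into GridFS with a recent version of the Ruby driver looks roughly like this (a minimal sketch, not the actual guten_mongo.rb code; the database and file names are just examples):

require 'mongo'

client = Mongo::Client.new(['localhost:27017'], database: 'gutenberg')

# GridFS splits the content across documents in fs.chunks and keeps
# the metadata (filename, length, upload date) in fs.files.
file = Mongo::Grid::File.new(File.binread('moby11.txt'), filename: 'moby11.txt')
client.database.fs.insert_one(file)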

Once all the books were in MongoDB, it was time to index them with Elasticsearch. Elasticsearch's data-import plugins are called rivers, and I used the MongoDB river for indexing: https://github.com/richardwilly98/elasticsearch-river-mongodb.

First, configure the river:

curl -XPUT "localhost:9200/_river/mongogridfs/_meta" -d'
{
  "type": "mongodb",
  "mongodb": {
    "db": "gutenberg",
    "collection": "fs",
    "gridfs": true
  },
  "index": {
    "name": "gutenberg",
    "type": "files"
  }
}'

Note that both my MongoDB mongos process and Elasticsearch were on the same server, so no special configuration was needed. On the largest Rackspace cloud server instance, indexing took about an hour for all the books.
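
One simple way to keep an eye on the river is to check how many documents have made it into the index so far. A minimal sketch with the elasticsearch gem (host and index name as configured above; this check is not part of my original scripts):

require 'elasticsearch'

es = Elasticsearch::Client.new(host: 'localhost:9200')
# Count the documents the river has pushed into the index so far;
# once it matches the number of files in GridFS, indexing is done.
puts es.count(index: 'gutenberg')['count']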

Once the books were indexed, I wrote another simple Ruby script for searching: https://github.com/eglute/fun-with-gutenberg/blob/master/guten_search.rb. The Ruby client for Elasticsearch certainly makes it very easy to connect to the index as well as to search it. The script searches the Elasticsearch index and, for any hits, fetches the related MongoDB objects by object id; a rough sketch of that flow follows the sample output below. Here is a sample subset of output:

ElasticSearch + MongoDB Based Project Gutenberg Word Search

It took 22 ms to search 77931 books for words 'Euroclydon'!
Found 53 results

-----------------------------------------------------------------
Book: How Spring Came in New England
By: Charles Dudley Warner
Release Date: March, 2002  [Etext #3131]
Language: English, original file name: cwsne10.txt

-----------------------------------------------------------------
Book:  Moby Dick; or The Whale
By:  Herman Melville
Release Date:  
Language: , original file name: moby11.txt

-----------------------------------------------------------------
Book: From Edinburgh to India & Burmah
By: William G. Burn Murdoch
Release Date: September 24, 2007 [EBook #22749]
Language: English, original file name: 22749-8.txt
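
As promised above, here is a rough sketch of the search flow (not the actual guten_search.rb; the query_string query and the assumption that the GridFS content is searchable through the default field are mine, so the details may differ from the real mapping):

require 'elasticsearch'
require 'mongo'

es    = Elasticsearch::Client.new(host: 'localhost:9200')
mongo = Mongo::Client.new(['localhost:27017'], database: 'gutenberg')

# Search the index, then look up each hit's metadata in fs.files.
# The river keeps the MongoDB ObjectId as the Elasticsearch _id,
# which is what links a hit back to its GridFS file.
results = es.search(index: 'gutenberg', body: {
  query: { query_string: { query: 'Euroclydon' } }
})

results['hits']['hits'].each do |hit|
  meta = mongo['fs.files'].find(_id: BSON::ObjectId.from_string(hit['_id'])).first
  puts meta['filename'] if meta
end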

Right now, the script searches for all the words individually, so it is not terribly useful. Perhaps one day it will support actual phrase searching or something more interesting.
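
For what it is worth, phrase search is mostly a matter of swapping the query type, something along these lines (a sketch only; the content field name is an assumption about the river's mapping and may not match your index):

require 'elasticsearch'

es = Elasticsearch::Client.new(host: 'localhost:9200')
# match_phrase only matches documents where the words appear
# together and in order, unlike the word-by-word search above.
es.search(index: 'gutenberg', body: {
  query: { match_phrase: { content: 'Call me Ishmael' } }
})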

Thanks for reading!

-eglute