Channel: Articles and Investigations - ProPublica

New Open Source Project: Daybreak, a Simple Key/Value Database for Ruby

A couple of weeks ago, in an article about the science behind the Message Machine project, we mentioned the custom key-value store we built to store non-relational data. Today, we're open sourcing that library, which we're calling Daybreak.

Daybreak is a simple key-value store for Ruby that operates just like a Ruby hash. Writes are stored in an append-only file and flushed asynchronously, though there are options to commit a write atomically. It is faster than pstore and simpler to use than dbm. Because your data is stored in an in-memory hash table, you also get Ruby conveniences like each, filter, map and reduce. You can install it by running gem install daybreak, and the code is over on GitHub.
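To make the append-only idea concrete, here is a minimal sketch of a toy store that logs every write to a file and rebuilds an in-memory hash by replaying that log on open. It is an illustration only: the TinyLogStore class, its JSON line format, and its synchronous writes are assumptions made for this example, not Daybreak's actual implementation or file format.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical toy store: NOT Daybreak's real code or on-disk format.
class TinyLogStore
  def initialize(path)
    @path = path
    @table = {}                       # all reads are served from this hash
    replay if File.exist?(@path)      # rebuild state from earlier writes
    @log = File.open(@path, 'a')      # append mode: old records are never rewritten
  end

  def [](key)
    @table[key.to_s]
  end

  def []=(key, value)
    # Record the write in the log first, then update the in-memory table.
    @log.puts(JSON.generate([key.to_s, value]))
    @table[key.to_s] = value
  end

  def keys
    @table.keys
  end

  # Push buffered writes to the operating system and then to disk.
  def flush!
    @log.flush
    @log.fsync
  end

  def close
    @log.close
  end

  private

  # Read the log from the top; a later record for the same key wins.
  def replay
    File.foreach(@path) do |line|
      key, value = JSON.parse(line)
      @table[key] = value
    end
  end
end

path = File.join(Dir.mktmpdir, 'tiny.db')
store = TinyLogStore.new(path)
store['a'] = 1
store['a'] = 2        # the old record stays in the file; the new one is appended
store[:b] = 'two'     # keys are converted to strings
store.flush!
store.close

reopened = TinyLogStore.new(path)
puts reopened['a']    # prints 2
puts reopened['b']    # prints two
reopened.close
```

Because records are only ever appended, a write never has to rewrite the file, and reopening the store is a matter of replaying the log in order.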

Daybreak's API mirrors Ruby's hash interface, which makes it convenient to use. The docs have a simple walkthrough of the API, but let's create a simple search engine to showcase Daybreak's abilities. Here's the class we'll fill out in this post:

    class Search
      def initialize
        # tk
      end

      def query(query)
        # tk
      end

      def add(docs)
        # tk
      end
    end

First, let's store the documents in a database:

    require 'daybreak'

    class Search
      def initialize
        # create the storage database
        @docs_db = Daybreak::DB.new('./docs.db')
      end

      # ...

      # add some documents
      def add(docs)
        # Daybreak keys are strings so we'll want to convert them back to
        # integers to find the next key
        max = (@docs_db.keys.map(&:to_i).max || 0)
        docs.each do |doc|
          max += 1
          @docs_db[max] = doc
        end
        # we'll make sure our changes are flushed to disk
        @docs_db.flush!
      end
    end
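The conversion back to integers matters because strings compare character by character. A short sketch with a plain array of string keys (standing in for what the keys call returns) shows the difference:

```ruby
# Stand-in for the string keys a store would return.
keys = %w[1 2 9 10]

string_max = keys.max                  # => "9", because "9" sorts after "10" as a string
integer_max = keys.map(&:to_i).max     # => 10

# The next document id, as computed in the add method above.
next_id = (keys.map(&:to_i).max || 0) + 1      # => 11

# With no keys yet, max returns nil, so the `|| 0` fallback starts ids at 1.
first_id = ([].map(&:to_i).max || 0) + 1       # => 1

puts string_max, integer_max, next_id, first_id
```

Without the to_i step, the eleventh document would overwrite the tenth, since the largest string key would still appear to be "9".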

To have an effective search engine, we'll also need an index that maps each word to the documents that contain it. So let's add another database to hold that index:

    require 'daybreak'
    require 'set'

    class Search
      def initialize
        # create the storage database
        @docs_db = Daybreak::DB.new('./docs.db')
        # Ruby objects work too, this db will have a default value of an empty
        # set
        @index_db = Daybreak::DB.new('./index.db') { |k| Set.new }
      end

      # add some documents
      def add(docs)
        # Daybreak keys are strings so we'll want to convert them back to
        # integers to find the next key
        max = (@docs_db.keys.map(&:to_i).max || 0)
        docs.each do |doc|
          max += 1
          @docs_db[max] = doc
          tokens = doc.split(/ +/)
          # create a simple index mapping each word to the ids of the
          # documents that contain it
          tokens.each { |t| @index_db[t.downcase] = @index_db[t.downcase] << max }
        end
        # we'll make sure our changes are flushed to disk
        @docs_db.flush!
        @index_db.flush!
      end
    end
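The index line reassigns the set instead of only appending to it. One way to see why explicit assignment is the safe habit is a plain Ruby Hash whose default block returns a fresh Set without storing it. This sketch uses only Hash and Set as an analogy; it makes no claim about how Daybreak handles default values internally.

```ruby
require 'set'

# The block supplies a default for missing keys but does not store it.
index = Hash.new { |_hash, _key| Set.new }

# Appending to the default mutates a throwaway Set; nothing goes through []=.
index['human'] << 1
stored_after_append = index.key?('human')      # => false

# Assigning the result back is an explicit write the container can see.
index['human'] = index['human'] << 1
index['human'] = index['human'] << 3
stored_after_assign = index.key?('human')      # => true
ids = index['human'].to_a                      # => [1, 3]

puts stored_after_append, stored_after_assign, ids.inspect
```

A store that persists data can only log what passes through its assignment method, so writing the updated set back makes the change explicit.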

And finally, let's write the method that takes a query and returns the documents containing any of the words in it:

    class Search
      def query(query)
        tokens = query.split(/ +/)
        # Find documents with the query terms
        ids = tokens.reduce([]) { |m, t| m + @index_db[t.downcase].to_a }.uniq
        # Finally grab the text and return it.
        ids.map { |id| @docs_db[id] }
      end
    end
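The query method above returns documents that match any of the query terms. As a sketch of the alternative, the example below builds the same kind of word-to-ids index with a plain Hash and compares "match any" with "match all". The match_any and match_all helpers and the sample documents are hypothetical names and data chosen for this illustration, not part of Daybreak.

```ruby
require 'set'

docs = {
  1 => 'To define the reality of the human condition',
  2 => 'To release the human imagination',
  3 => 'To cease being the intellectual dupes'
}

# Build the inverted index: lowercased word => Set of document ids.
index = Hash.new { |hash, key| hash[key] = Set.new }
docs.each do |id, text|
  text.split(/ +/).each { |token| index[token.downcase] << id }
end

# Union: ids of documents containing at least one query term.
def match_any(index, query)
  query.split(/ +/).reduce(Set.new) do |ids, term|
    ids | index.fetch(term.downcase, Set.new)
  end
end

# Intersection: ids of documents containing every query term.
def match_all(index, query)
  sets = query.split(/ +/).map { |term| index.fetch(term.downcase, Set.new) }
  sets.reduce { |ids, set| ids & set } || Set.new
end

any_ids = match_any(index, 'human condition').to_a.sort   # => [1, 2]
all_ids = match_all(index, 'human condition').to_a.sort   # => [1]

puts any_ids.inspect, all_ids.inspect
```

Matching any term favors recall, which suits a small demo; matching all terms is stricter and returns only documents that mention every word.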

Here's an example of how to use the above class:

    searcher = Search.new
    searcher.add([
      "To define the reality of the human condition and to make our definitions public.",
      "To confront the new facts of history-making in our time, and their meaning for the problem of political responsibility.",
      "Continually to investigate the causes of war, and among them to locate the decisions and defaults of elite circles.",
      "To release the human imagination, to explore all the alternatives now open to the human community by transcending both the mere exhortation of grand principle and the mere opportunist reaction.",
      "To demand full information of relevance to human destiny and the end of decisions made in irresponsible secrecy.",
      "To cease being the intellectual dupes of political patrioteers."
    ])

    searcher.query("human condition")
    # => ["To define the reality of the human condition and to make our definitions public.",
    #     "To release the human imagination, to explore all the alternatives now open to the human community by transcending both the mere exhortation of grand principle and the mere opportunist reaction.",
    #     "To demand full information of relevance to human destiny and the end of decisions made in irresponsible secrecy."]

In another program we can reopen the database and perform another query:

    search2 = Search.new
    search2.query("explore")
    # => ["To release the human imagination, to explore all the alternatives now open to the human community by transcending both the mere exhortation of grand principle and the mere opportunist reaction."]

Those are the basics. If it sounds useful to you, head on over to GitHub and kick the tires. If you run into bugs, open an issue on GitHub, and of course we're always happy to receive pull requests!

Let us know if you end up using it in a project by emailing us at opensource@propublica.org.

