I’ve begun working with Sinatra, and I’m really enjoying it as a lightweight alternative to Rails. One of today’s Dev Bootcamp challenges involved creating a simple Sinatra app that would display all the anagrams of words in a database based on a submitted word.
Sinatra makes setting up such an app very easy – the real fun came in optimizing the app for speed.
We pre-populated our database with about 236,000 words. How can we make it quick and simple for our app to find all the anagrams of a word in our database?
The answer lies in database indices. In short, they work a lot like an index of a book. If you’ve got an app in which users can submit posts, and a post belongs to one user, it’d be a good idea to add an index of
user_id to your posts table. That way, when trying to retrieve all the posts of a specific user, the database can simply refer to the posts with the specific user ID rather than checking every post for that user ID.
For this challenge, however, we had to come up with our own method to reach ludicrous speed in a search through 236,000 records for the correct anagrams of a given word.
Thankfully, through examining solutions of my fellow cohort mates, and a discussion of the topic with fellow boot Johnathan Weisner, I managed to reach ludicrous speed.
One possible solution would be to index on each word in the database, getting all the permutations of a given word and then checking the database for each permutation.
For example, in our Word model, we could do something like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
This will work, but it will be incredibly slow. For example, the word “banana” has 720 permutations, and most of those permutations aren’t valid words. There’s no point in checking our database for a few hundred invalid words. And the longer the word, the permutations grow at a factorial rate. While the database quickly skips over any invalid words because they don’t have an index, this isn’t the most optimal solution.
Another option could be adding a column to the database that keeps track of the length of each word, and adding the length as an index. We could then grab all the words that are the same length as the word in question. We could also split the given word into an array and sort its letters, do the same for all the words of the same length, and check whether the words sorted in alphabetical order match the sorted given word array:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
I felt like I was getting somewhere with this solution, but for words of a length below about eight, the number of permutations we were checking in the first solution was significantly smaller than the number of words that had a length below eight.
Fortunately, there’s a much better and much simpler solution than either of these, inspired by the sorting bit in the last solution: We can add a column of
sorted to our database where we’ll store the words with their letters in alphabetical order, index on that column, and then query the database for any word with the same arrangement of alphabetically sorted letters.
Now we can write our super simple
.get_anagrams() function like so:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The given word is sorted and assigned to a variable
word_to_check. Then I simply query the database for all records where the
sorted attribute matches
word_to_check. And because I have indexed on sorted words, the search has reached a much higher, dare I say, ludicrous, speed.
Also, notice the Active Record method,
.pluck() which plucks out the attributes you want so that you don’t end up with an array of a bunch of objects – just a nice array of words.
I’ve also included the Active Record callback I used to set the
sorted attribute of each word. I was reading each line of an extremely large data file to create each record in the database, and this callback allows me to alphabetically sort the characters of each word and store that value into the
sorted attribute. This happens inside of
sort_word which is called before each record is saved by the helpfully named
Database indexes can be a very powerful tool for optimizing what would otherwise be slow queries. You should add this knowledge to your web application-building arsenal.