Intro to Elasticsearch & Searchkick
Searchkick: The Problem with Gems
The first honest feedback I received from a classmate after working together on a project was:
"You should use gems more. Don’t be so afraid of them."
He was right. I was gem-averse because I was overwhelmed by foreign libraries – struggling to install them in my app, not understanding their documentation, and feeling like I needed to understand every bit of code that they added. So I avoided them. And that’s when I had to admit to myself that I was afraid of gems.
BAM! A straight shot, right to my pride.
But looking back on myself during my thrilling yet terrifying coding beginnings, I can empathize. Yes, I’ve come to learn that libraries save me valuable time because I don’t have to handroll common features like password encryption each time I build a new app. But would I instruct another code newbie to just use Pundit or CanCanCan the first time they implemented authorization? That would be completely irresponsible! It would be like telling a 3rd grader to use a calculator instead of learning their multiplication tables.
When I first learned about Abstraction as one of the four pillars of OOP, it was in the context of abstracting complexity away from average users. But now I’ve realized the joke is on me: it’s about abstracting things away from the programmers! I’ll be humming along, using gems left and right inside my application until BAM! A system error that is hard to debug because I don’t understand how the gem works.
I’ve heard that being a good programmer is realizing that someone else has already solved many of the problems I will encounter. But as a code newbie, I believe it’s important to start at the bottom levels of abstraction. Before I can “stand on the shoulders of giants” (libraries), I want to understand the problems they had to solve first.
Just use Searchkick, they said.
At my job, I needed to implement a search function for matching invoices to existing payors in the database. If an invoice’s payor was Walmart, it wouldn’t mean that the invoice text would be that straightforward. It might read: “Walmart Inc.” or “Walmart Elgin branch” or sometimes for accounting reasons “0B223Walmart.”
My team decided to do a spike for this feature using the Searchkick gem. Like a responsible dev, I go straight to the documentation, which seems pretty comprehensive (I judge “comprehensiveness” by how long it takes me to scroll to the bottom).
In my development database, I have only four payors with the names: Costco, Walmart, Whole Foods and Amazon. All I have to do to make my payors searchable is:
1. Add the gem.
2. Add searchkick to the Payor model:
3. Hop into the rails console and reindex the Payor model:
4. And then run a search!
Unfortunately, I soon realize that my search is pretty dumb. When I search for anything other than exact name matches or very near matches (like “walmar”), my query results are empty. I need my Searchkick to handle real-world payor names like “Walmart Inc.” or “WAL001”.
Upon further scrolling of the Searchkick docs, I find a section on Partial Matches. This looks promising. I can add an or operator to my query, as well as add word_start to my Payor model, and this will make my searching a bit smarter.
This works. The problem is, I don’t feel smarter.
Here’s the thing: the Searchkick documentation tells me how to “add data to the search index” by “reindexing”. I know how to reindex in the console, and I know I need to reindex every time the database is updated, but I don’t know what reindexing actually means. I don’t even know what the “search index” is, and I don’t really understand what is happening when I add searchkick “partial matching” options to my Payor model. Searchkick is starting to feel like a blackbox, and that does not feel good.
The magic powering the Searchkick library is actually a technology called Elasticsearch. The great thing about Searchkick is that it allows anyone to easily leverage Elasticsearch’s powers (which are built in Java) within a Rails application. It does the integration for me. But when something is abstracted away, it can prevent me from learning an important technology or computer science concept. The hard part is knowing when I need to take the deep dive.
Elasticsearch: A Deep Dive
Disclaimer: In the grand scheme of Elasticsearch, this is nowhere close to a deep dive. But I’m assuming you, like I did, are starting from zero.
Right now, I know that Elasticsearch is used for search. That’s about it.
BUT WAIT… why can’t I just SQL?
Hold the phone, people. If I’m thinking about tools I understand and how I can use them to solve my matching invoice payor problems, I think of SQL! I use SQL’s LIKE all the time in my database interface to quickly look up records:
SELECT * FROM payors WHERE name LIKE '%Wal%'
I might love SQL, but it was not built for search. The above query took my database client approximately 1.8 seconds to search 6,000 payor records. It’s using a full table scan: because the pattern starts with a wildcard, every single payor name is read in turn and checked against the WHERE condition. It can search, but for SQL “search is a feature, while in Elasticsearch it’s the essence.” To get more speed and power, we not only need a different type of search, but a completely different type of database. Introducing….
The Inverted Index
Elasticsearch is a NoSQL database. And until now, I’ve never encountered one or understood why I would want to. What else is even out there besides rows and columns?
DOCUMENTS, that’s what.
Imagine I need to store a bunch of e-books. How much sense would a table make in storing this data? I would have one column for the title, one column for the author, and one column for… the rest of the book! Document databases are more efficient when storing massive amounts of text or raw data. And with the rise of unstructured data (aka big data), document databases are pretty popular. This is just one of the use cases of document databases; search is another.
Coming from a relational mindset, document databases might feel weird. What’s more is that common terms used to describe structures in relational databases are different or completely absent in document land. So first, I’ll map a record in an RDBMS to one in a Document Store:
Now, I’ll zoom out to map common terms in RDBMS world to Document world:
So why is Elasticsearch, a document database, so good at searching? Remember how the Searchkick docs said I needed to reindex my model each time I create or update a record? When I reindex, I am updating the inverted index, a special structure in Elasticsearch. I think of the inverted index like an actual index in a textbook, which lists common terms and the pages on which they are mentioned. When documents are indexed, each unique word from the specified field gets its own row in a table, and each row points to the documents in which that word appears.
Say I have three payor documents in my database, each with a name field:
Doc 1: Organic Foods Inc
Doc 2: Almost Organic LLC
Doc 3: Barely Organic Food Inc
I want to index these documents. This is a representation of the inverted index that gets created:
The inverted index not only allows superior speed of retrieval, but it also provides relevancy. If I search for “organic” and “llc”, Elasticsearch will return all three docs, but Doc 2 will have the highest relevancy score because both terms are matched.
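To make the idea concrete for myself, here is a minimal Ruby sketch of an inverted index built from the three payor names above. It is a toy, not how Elasticsearch is implemented: the `analyze`, `build_inverted_index` and `search` helpers are names I made up, and the "relevancy score" is just a count of matched terms, whereas real Elasticsearch scoring is more sophisticated.

```ruby
# Toy inverted index over the three example payor documents.
DOCS = {
  1 => "Organic Foods Inc",
  2 => "Almost Organic LLC",
  3 => "Barely Organic Food Inc"
}.freeze

# Hypothetical stand-in for an analyzer: lowercase, then split into words.
def analyze(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Map each unique term to the list of document ids that contain it.
def build_inverted_index(docs)
  index = Hash.new { |hash, term| hash[term] = [] }
  docs.each do |doc_id, text|
    analyze(text).uniq.each { |term| index[term] << doc_id }
  end
  index
end

# Naive relevancy: one point per query term found in a document.
def search(index, query)
  scores = Hash.new(0)
  analyze(query).each do |term|
    index.fetch(term, []).each { |doc_id| scores[doc_id] += 1 }
  end
  scores.sort_by { |doc_id, score| [-score, doc_id] }
end

INDEX = build_inverted_index(DOCS)
INDEX.each { |term, doc_ids| puts "#{term.ljust(8)} -> #{doc_ids.join(', ')}" }

# Pairs of [doc_id, score], best match first.
p search(INDEX, "organic llc")
```

Searching for "organic llc" touches only two rows of the index instead of reading every document, and Doc 2 comes out on top because it is the only one listed under both terms.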
Pause. If it sounds like I’ve said the word index a lot in different contexts, you’re right, and maybe confused. Coming from a relational mindset, I had only used the term indexing to mean adding an index to a column to make queries faster. But it’s not just semantics, people! Indexing can refer to a few very different things. Let me break it down:
Document World:
Index (noun): Similar to a database in RDBMS world, except a document database can be made up of multiple indices.
Example: Ellen’s Lobster Buffet uses a document database which contains the Orders Index, the Customers Index, the Invoices index, etc. All of these indices are contained in an Elasticsearch cluster, but I’m not even gonna get into what that is.
Index or Reindex (verb): To update the inverted index. Reindexing lets the inverted index “recalculate” itself to take into account any new or updated records.
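Example: After creating or updating a record, the index must ALWAYS be updated to match. (Searchkick’s default callbacks handle single-record changes; a full reindex is needed after changing the searchkick options on the model.)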
Inverted Index (noun): A structure in Elasticsearch which you can think of like an index in a book. Words or phrases become keys that point to the document or documents they appear in.
Example: The inverted index is the secret sauce of fast search.
Relational World:
Index (noun): When I add an index to a column, I am actually mapping its value to a row for faster searching. This makes retrievals much faster: instead of performing a full-table scan and reading each row, the system finds that value in its internal index and can go directly to the row which contains it.
Example: Hey Percy, throw an index on the email column so we can have faster user lookups!
It’s interesting to note that indexing a column in relational land and using an inverted index in document land are very similar in nature: both use keys in an index to increase the speed of data retrieval.
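Here is a small Ruby sketch of that relational idea, under some loud assumptions: the `User` struct, the sample emails and the helper names are all invented for illustration, and I use a plain Hash as the "index" even though real relational databases typically use tree structures. The point is only the difference between reading every row and jumping straight to one.

```ruby
# Hypothetical users table, represented as an array of rows.
User = Struct.new(:id, :email)

USERS = [
  User.new(1, "percy@example.com"),
  User.new(2, "ellen@example.com"),
  User.new(3, "karen@example.com")
].freeze

# Full table scan: read each row until the email matches.
def full_scan(users, email)
  users.find { |user| user.email == email }
end

# "Index on the email column": map each value to its row position.
def build_email_index(users)
  users.each_with_index.map { |user, row| [user.email, row] }.to_h
end

# Indexed lookup: find the row position by key, then go directly to the row.
def indexed_lookup(users, index, email)
  row = index[email]
  row && users[row]
end

EMAIL_INDEX = build_email_index(USERS)

p full_scan(USERS, "ellen@example.com")
p indexed_lookup(USERS, EMAIL_INDEX, "ellen@example.com")
```

Both calls return the same row; the indexed version simply skips the row-by-row reading.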
Getting Started with Kibana
Now that I have a handle on what makes Elasticsearch special, I’ll use it to search my app’s data without going through the Searchkick abstraction. When I was in my Rails console using the Searchkick DSL, the results returned a curl request in addition to the matching record. This curl request was generated by Searchkick and then sent to my Elasticsearch database. Without knowing much about Elasticsearch, I might not have picked up on the fact that an HTTP interface is how I interact with it. Searchkick is abstracting this layer away from us, in the same way that ActiveRecord abstracts away SQL by generating it “under the hood” and sending it to our relational database.
To interface with my Elasticsearch database directly, I need to download a key tool in the Elastic Stack: Kibana. This is a browser-based interface where I can interact with the Elasticsearch database using the Elasticsearch DSL (the stuff in the body of the HTTP request). Kibana is also used to visualize data and perform many other analytics and aggregations that I won’t get into here. Here’s how to get set up:
1. If you haven’t already, install and launch Elasticsearch:
brew install elasticsearch
brew services start elasticsearch
2. Install and start Kibana:
brew install kibana
kibana
3. Navigate to Kibana’s port in the browser:
http://localhost:5601
4. Connect Kibana with my Payor index which I unknowingly created when I added searchkick to my Payor model:
Here I see a complete list of my existing Elasticsearch indices, each with a timestamp. If I add Searchkick to more than one model, each model gets its own index (i.e. the payors index, the invoices index, the borrowers index). I was a bit surprised to see one index per model; I thought I would only see one index. But document databases and relational databases don’t always have analogous parts. An Elasticsearch database can be a collection of many indexes, whereas an RDBMS database is a collection of many tables.
One more note – an index is different from an index pattern. An index pattern is how Kibana accesses my Elasticsearch indexes. I can actually put several indexes inside of one index pattern by using wildcards (*). If I were to just enter a wildcard in the index pattern box, Kibana would combine all of my indexes into one. For now, I’ll just specify my payors development index:
I can check out the index pattern I just created by clicking on it in the left-hand sidebar. Listed are all the fields that live on the Payor index. Fields are kind of like RDBMS columns. But this is where comparing relational and document databases is like comparing macaroons and macarons: their purpose is the same but they don’t have analogous parts!
On my Payor index I see fields that I expect to be listed: id, created_at, name …but HOLD THE PHONE what is name.analyzed? I did not create this field on my Payor model!
When I have a property (name) which is a string, Elasticsearch’s default behavior is to create TWO fields– one of datatype keyword and one of datatype text. Both fields get their own inverted indexes created, but the text field gets analyzed, whereas the keyword field does not. Analyzers are the bread and butter of Elasticsearch: they convert text into tokens which become keys in the field’s inverted index. I’ll get more into analyzers soon, but know that the analyzed field is far better for search, and the keyword field is used for sorting and aggregations.
Next, I’ll navigate to the development tools on the sidebar. This is where I will create a query using the Elasticsearch DSL to search my inverted index.
Building a Query
Like every HTTP request, I can break it down into three components:
1. HTTP verb: I use PUT or POST when I create records or indices. I’ve already created my payors_development index via Searchkick, so I will use GET requests for the rest of my queries.
2. URL: The index pattern that Kibana will search.
3. JSON body: This is the Elasticsearch DSL.
Together these components make up an Elasticsearch query:
GET payors_development_20190101165750784/_search
{
  "query": {
    "match": {
      "name": "walmart"
    }
  }
}
Using the Elasticsearch DSL to query my database is equivalent to using SQL to query my database. An Elasticsearch DSL query is simply JSON sent in the body of an HTTP request, and the response is a JSON object (the search results). This means that I can query the Elasticsearch server with any programming language I want and receive a universal JSON response back. Elasticsearch is language-agnostic!
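As a sketch of what "any programming language" could look like, here is the same query assembled in plain Ruby with only the standard library. A few assumptions: Elasticsearch is listening on its default local port 9200, `build_search_request` is a helper name I invented, and I use POST because the search endpoint accepts it and it sidesteps HTTP clients that dislike a body on GET. The request is only built and printed here, not sent.

```ruby
require "json"
require "net/http"
require "uri"

# Assumption: a local Elasticsearch node on its default port.
ES_URL = "http://localhost:9200"

# Build (but do not send) a match query against one field of one index.
def build_search_request(index, field, text)
  uri = URI("#{ES_URL}/#{index}/_search")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(query: { match: { field => text } })
  request
end

request = build_search_request("payors_development_20190101165750784", "name", "walmart")

puts "#{request.method} #{request.path}"
puts request.body

# To actually send it against a running node, something like:
#   response = Net::HTTP.start(request.uri.hostname, request.uri.port) { |http| http.request(request) }
#   results  = JSON.parse(response.body)
```

The body that gets printed is the same JSON I typed into Kibana; Kibana is just a friendlier place to type it.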
Turns out that when I run the above query, it returns zero results. Why? Because it’s not an exact match to “Walmart.” But surely Elasticsearch is supposed to be smarter than this…?
Remember that the name property is a multifield, with two inverted indexes! I need to search the name.analyzed because part of the analysis is lowercasing all the terms for consistency. I’ll try again:
Great! But I need to be more realistic and search for a payor name that is a little less straightforward:
It returns no results! BUT WHY?!?!?!
Tokenizers and Analyzers
The single most important advantage of Elasticsearch is that it can understand my intent, whereas SQL can’t do that very well. But how? Remember that the inverted index is essential to search, and it is the analyzer’s job to create keys or tokens in the inverted index. The query string 001walmart doesn’t seem to match anything on my inverted index.
Digging deeper into what an analyzer actually does, I discover it has two main functions:
1. normalizing terms (lowercasing letters, converting numerals into words, or removing irrelevant text like html tags)
2. tokenizing terms.
Tokenization is the action of breaking up words or phrases. What Elasticsearch allows me to control is how I break them down. Each analyzer can have several token filters (which normalize terms) but has only one tokenizer. There are many different types of analyzers that ES has prepackaged that I can choose from. I can also create my very own custom analyzer.
To figure out the problem at hand, I need to investigate how the field name.analyzed is being analyzed. To see the mappings configurations that Searchkick set up for me, I’ll run the query:
For property name, its field analyzed is using the searchkick_index analyzer. But if I go to Elasticsearch’s list of built-in analyzers, it’s not there! That’s because this is a custom analyzer that comes with Searchkick. I can find out how this custom analyzer is built by running the query:
Lovely! I now see the components that make up the searchkick_index analyzer: its tokenizer and filters. Without needing to know about these analyzers, I can see how they tokenize words using a nifty Elasticsearch tool called the analyze API. Using this I’ll replicate the searchkick_index analyzer by specifying the same settings in my query, and a POST request:
What’s returned are the keys listed in my Payor inverted index. Looks like the standard analyzer delimits tokens based on whitespace.
However, to get the full picture I also need to find out how the query string 001walmart is analyzed. Elasticsearch automatically runs the query string through the same analysis as the field being queried. So, I’ll run the query string using the same settings as the searchkick_index analyzer:
Clearly this does not match any of the terms in my inverted index, which explains why I get zero results when searching for it.
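The mismatch can be reproduced with a rough Ruby stand-in for an analyzer. This is only an approximation I wrote for illustration: `tokenize`, `LOWERCASE` and `analyze` are made-up names, the tokenizer simply splits on anything that is not a letter or digit, and the real searchkick_index analyzer applies more filters than the single lowercase step shown here.

```ruby
# Hypothetical tokenizer: split text into runs of letters and digits.
def tokenize(text)
  text.scan(/[[:alnum:]]+/)
end

# Hypothetical token filter: normalize case.
LOWERCASE = ->(token) { token.downcase }

# An analyzer is one tokenizer followed by zero or more token filters.
def analyze(text, filters: [LOWERCASE])
  tokenize(text).map do |token|
    filters.reduce(token) { |current, filter| filter.call(current) }
  end
end

indexed_terms = analyze("Walmart")     # what lands in the inverted index
query_terms   = analyze("001walmart")  # what the search looks up

p indexed_terms
p query_terms
p indexed_terms & query_terms          # terms the two sides have in common
```

Because nothing in this pipeline splits where digits turn into letters, "001walmart" stays a single token, and a single token that is not in the inverted index matches nothing.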
I can go to the amazing Elasticsearch documentation and try to find an analyzer that will break up words when it encounters a character changing from number to letter, OR I can build my own custom analyzer. The latter sounds more fun.
I’m looking for something that acts similar to Searchkick’s word_middle option. The word_middle option allows me to search for any part of a token. I have no idea how this works, but I can find out by using the same settings query to see Searchkick’s complete list of custom analyzers:
The searchkick_word_middle_index uses a special filter called searchkick_ngram.
NGram-ming
My favorite analogy from the Elasticsearch docs is concerning NGrams: they act like sliding windows that move across the word and we control the length of this window. For example, the phrase “01walmart” with a min_gram of 1 and max_gram of 2 would be broken up into these tokens:
0 01 1 1w w wa a al l lm m ma a ar r rt t
The NGram is actually a special type of tokenizer that can also be used as a filter, as in the case of Searchkick. This is useful because I want to break up words into smaller pieces, and NGrams are a good way to do that. Note that it’s not the only way to solve my problem – I could have also used a Pattern tokenizer to delimit numbers and letters using regular expressions. However, using an NGram will match more documents.
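The sliding window is easy to sketch in Ruby. The `ngrams` helper below is my own illustration, not Elasticsearch code: at each starting position it emits every substring whose length falls between min_gram and max_gram, which is how I understand the NGram tokenizer to order its output.

```ruby
# Slide a window across the text, emitting every gram whose length is
# between min_gram and max_gram (inclusive) at each starting position.
def ngrams(text, min_gram:, max_gram:)
  grams = []
  (0...text.length).each do |start|
    (min_gram..max_gram).each do |size|
      break if start + size > text.length
      grams << text[start, size]
    end
  end
  grams
end

puts ngrams("01walmart", min_gram: 1, max_gram: 2).join(" ")

# With a window of exactly three, the messy invoice text and the clean
# payor name finally share some tokens.
invoice_grams = ngrams("001walmart", min_gram: 3, max_gram: 3)
payor_grams   = ngrams("walmart", min_gram: 3, max_gram: 3)
p invoice_grams & payor_grams
```

The shared tokens in the last line are what give a search for "001walmart" something to match against a payor named "walmart".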
My next logical step is to figure out how to change the analyzer of the name.analyzed field. However, I won’t be able to edit my existing index mapping through Kibana. This is because changing a mapping would invalidate the already existing inverted index. To change a mapping, I need to create a new index and reindex my data into it (one reason changing how a field is analyzed can be expensive: every document has to be indexed again). To do this, I’ll hop back to Searchkick. I’ve got a much better handle on Elasticsearch and can actually leverage Searchkick’s power the way I want!
Searchkick With A Better View
Those Searchkick docs are much easier on the eyes now that I know Elasticsearch basics. I can actually bypass much of Searchkick’s built-in “magic” and just use the Elasticsearch DSL directly!
On my Payor model, first I must tell Searchkick to use custom mappings in addition to Searchkick’s built-in settings.
Next I create a custom mapping and specify that the field name.analyzed will use my custom-built analyzer my_special_analyzer.
Last, I create my_special_analyzer in the settings. Note that I specify this analyzer to use lowercasing and asciifolding filters in addition to a custom-built tokenizer called my_special_ngrammer which will use a min_gram and max_gram of three.
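The three steps above can be sketched as plain Ruby hashes that mirror the Elasticsearch DSL. Treat this as an approximation with assumptions: the constant names are mine, the exact nesting of the mappings hash depends on the Elasticsearch version (older versions expect an extra document-type level), and the searchkick call in the comment reflects my reading of the Searchkick docs (merge_mappings, mappings and settings options) rather than code run here.

```ruby
require "json"

# Analysis settings: a custom analyzer built from an ngram tokenizer
# plus the lowercase and asciifolding token filters.
PAYOR_SETTINGS = {
  analysis: {
    analyzer: {
      my_special_analyzer: {
        type: "custom",
        tokenizer: "my_special_ngrammer",
        filter: %w[lowercase asciifolding]
      }
    },
    tokenizer: {
      my_special_ngrammer: {
        type: "ngram",
        min_gram: 3,
        max_gram: 3
      }
    }
  }
}.freeze

# Mapping: point the analyzed sub-field of name at the custom analyzer.
PAYOR_MAPPINGS = {
  properties: {
    name: {
      type: "keyword",
      fields: {
        analyzed: { type: "text", analyzer: "my_special_analyzer" }
      }
    }
  }
}.freeze

# In the Payor model (assumed usage, not executed in this sketch):
#   searchkick merge_mappings: true,
#              mappings: PAYOR_MAPPINGS,
#              settings: PAYOR_SETTINGS

puts JSON.pretty_generate(settings: PAYOR_SETTINGS, mappings: PAYOR_MAPPINGS)
```

Printing the hashes as JSON is a handy way to compare them against what Kibana reports for the index after reindexing.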
Finally, I’ll reindex my Payor model in my console, create a new Payor index pattern in Kibana, and then use the following query in Kibana:
If I wanted to make this same request using Searchkick instead, I’d use the query:
Taaaadaaaaa! I just made my search a little bit smarter.
Resources
There’s a lot of Elasticsearch vocabulary. This is a good list of key concepts and terms: https://logz.io/blog/10-elasticsearch-concepts/
Old but not dated - this is great for understanding NoSQL databases coming from an RDBMS mindset: https://msdn.microsoft.com/en-us/magazine/hh547103.aspx
The Elasticsearch docs are a goldmine of information. If you read straight through them, I think it’s safe to put ES on your resume: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html
**BADASS LADY CODER ALERT**
Did you know Karen Spärck Jones laid the groundwork for the modern search engine? She combined statistics, linguistics and programming to create index-term weighting algorithms, which look at how often a word occurs and how many documents it appears in. In general, the more documents a term appears in, the less useful it is for telling relevant results apart. Read a fascinating interview with her here.