Most databases are astonishingly inept at extracting actionable knowledge from your data. Sure, they can filter by timestamp or exact values, but can they perform full-text search, handle synonyms, and score documents by relevance? Elasticsearch is the answer.
Elasticsearch is a real-time distributed search and analytics engine. It allows you to explore your data at a speed and at a scale never before possible. It is used for full-text search, structured search, analytics, and all three in combination:
- A distributed real-time document store where every field is indexed and searchable
- A distributed search engine with real-time analytics
- Capable of scaling to hundreds of servers and petabytes of structured and unstructured data
Document Oriented
Objects in an application are seldom just a simple list of keys and values. More often than not, they are complex data structures that may contain dates, geo locations, other objects, or arrays of values.
Elasticsearch is document oriented, meaning that it stores entire objects or documents. It not only stores them, but also indexes the contents of each document in order to make them searchable. In Elasticsearch, you index, search, sort, and filter documents—not rows of columnar data. This is a fundamentally different way of thinking about data and is one of the reasons Elasticsearch can perform complex full-text search.
Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for documents. JSON serialization is supported by most programming languages, and has become the standard format used by the NoSQL movement. It is simple, concise, and easy to read.
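To make this concrete, here is a minimal sketch of a complex object serialized to JSON (the field names are invented for the example):

```python
import json

# A hypothetical user object: nested objects, dates, and arrays of values
user = {
    "name": "John Smith",
    "joined": "2014-09-15",
    "interests": ["sports", "music"],
    "address": {"city": "London", "country": "UK"},
}

# Serialize the whole object to a JSON document
doc = json.dumps(user, sort_keys=True)
print(doc)
```

The round trip is lossless: `json.loads(doc)` reproduces the original structure, which is part of what makes JSON a convenient document format.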
Indexing
An Elasticsearch cluster can contain multiple indices, which in turn contain multiple types. These types hold multiple documents, and each document has multiple fields.
- Index (verb): To index a document is to store it in an index (noun) so that it can be retrieved and queried. This is much like the `INSERT` keyword in SQL, except that if the document already exists, the new document replaces the old.
- Inverted index: Relational databases add an index, such as a B-tree index, to specific columns in order to improve the speed of data retrieval. Elasticsearch and Lucene use a structure called an inverted index for exactly the same purpose. By default, every field in a document is indexed (has an inverted index) and is thus searchable. A field without an inverted index is not searchable.
So, for an employee directory example, consider the following. In practice this is easy: we can create the index, the document type, and the document itself, all with a single command:
```
PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests":   [ "sports", "music" ]
}
```
Searching—The Basic Tools
We can throw JSON documents at Elasticsearch and retrieve each one by ID. But the real power of Elasticsearch lies in its ability to make sense out of chaos — to turn Big Data into Big Information.
This is the reason that we use structured JSON documents, rather than amorphous blobs of data. Elasticsearch not only stores the document, but also indexes the content of the document in order to make it searchable.
Every field in a document is indexed and can be queried. And it’s not just that: during a single query, Elasticsearch can use all of these indices to return results at breathtaking speed. That’s something that you could never consider doing with a traditional database.
A search can be a structured query on concrete fields, a full-text query that returns documents sorted by relevance, or a combination of the two.
While many searches will just work out of the box, to use Elasticsearch to its full potential, you need to understand three subjects:
- Mapping: How the data in each field is interpreted
- Analysis: How full text is processed to make it searchable
- Query DSL: The flexible, powerful query language used by Elasticsearch
Each of these is a big subject in its own right, and we explain them in detail in the sections that follow.
Exact Values Versus Full Text
Data in Elasticsearch can be broadly divided into two types: exact values and full text.
Exact values are exactly what they sound like. Examples are a date or a user ID, but can also include exact strings such as a username or an email address. The exact value `Foo` is not the same as the exact value `foo`. The exact value `2014` is not the same as the exact value `2014-09-15`.

Full text, on the other hand, refers to textual data—usually written in some human language—like the text of a tweet or the body of an email.
Full text is often referred to as unstructured data, which is a misnomer—natural language is highly structured. The problem is that the rules of natural languages are complex, which makes them difficult for computers to parse correctly. For instance, consider this sentence:
May is fun but June bores me.
Does it refer to months or to people?
Exact values are easy to query. The decision is binary; a value either matches the query, or it doesn’t. This kind of query is easy to express with SQL:
```
WHERE name    = "John Smith"
  AND user_id = 2
  AND date    > "2014-09-15"
```
Querying full-text data is much more subtle. We are not just asking, “Does this document match the query?” but “How well does this document match the query?” In other words, how relevant is this document to the given query?
We seldom want to match the whole full-text field exactly. Instead, we want to search within text fields. Not only that, but we expect search to understand our intent:
- A search for `UK` should also return documents mentioning the `United Kingdom`.
- A search for `jump` should also match `jumped`, `jumps`, `jumping`, and perhaps even `leap`.
- `johnny walker` should match `Johnnie Walker`, and `johnnie depp` should match `Johnny Depp`.
To facilitate these types of queries on full-text fields, Elasticsearch first analyzes the text, and then uses the results to build an inverted index.
Inverted Index
Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
For example, let’s say we have two documents, each with a `content` field containing the following:

- `The quick brown fox jumped over the lazy dog`
- `Quick brown foxes leap over lazy dogs in summer`

To create an inverted index, we first split the `content` field of each document into separate words (which we call terms, or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this:

```
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
```
Now, if we want to search for `quick brown`, we just need to find the documents in which each term appears:

```
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
-------------------------
Total   |   2   |  1
```
Both documents match, but the first document has more matches than the second. If we apply a naive similarity algorithm that just counts the number of matching terms, then we can say that the first document is a better match—is more relevant to our query—than the second document.
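The index and the naive counting algorithm described above can be sketched in a few lines of Python. This is an illustrative toy, not how Lucene actually stores or scores its index:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Build the inverted index: term -> set of document IDs.
# Splitting on whitespace keeps terms case-sensitive, like the table above.
index = defaultdict(set)
for doc_id, content in docs.items():
    for term in content.split():
        index[term].add(doc_id)

def search(query):
    """Naive similarity: count how many query terms each document contains."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search("quick brown"))  # doc 1 matches both terms, doc 2 only "brown"
```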
But there are a few problems with our current inverted index:
- `Quick` and `quick` appear as separate terms, while the user probably thinks of them as the same word.
- `fox` and `foxes` are pretty similar, as are `dog` and `dogs`; they share the same root word.
- `jumped` and `leap`, while not from the same root word, are similar in meaning. They are synonyms.
With the preceding index, a search for `+Quick +fox` wouldn’t match any documents. (Remember, a preceding `+` means that the word must be present.) Both the term `Quick` and the term `fox` have to be in the same document in order to satisfy the query, but the first doc contains `quick fox` and the second doc contains `Quick foxes`.

Our user could reasonably expect both documents to match the query. We can do better.
If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:
- `Quick` can be lowercased to become `quick`.
- `foxes` can be stemmed (reduced to its root form) to become `fox`. Similarly, `dogs` could be stemmed to `dog`.
- `jumped` and `leap` are synonyms and can be indexed as just the single term `jump`.
Now the index looks like this:
```
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
-------------------------
```
But we’re not there yet. Our search for `+Quick +fox` would still fail, because we no longer have the exact term `Quick` in our index. However, if we apply the same normalization rules that we used on the `content` field to our query string, it would become a query for `+quick +fox`, which would match both documents!

This is very important. You can find only terms that exist in your index, so both the indexed text and the query string must be normalized into the same form.
This process of tokenization and normalization is called analysis.
Analysis and Analyzers
Analysis is a process that consists of the following:
- First, tokenizing a block of text into individual terms suitable for use in an inverted index,
- Then normalizing these terms into a standard form to improve their “searchability,” or recall
This job is performed by analyzers. An analyzer is really just a wrapper that combines three functions into a single package:
- Character filters: First, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert `&` characters to the word `and`.
- Tokenizer: Next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
- Token filters: Last, each term is passed through any token filters in turn, which can change terms (for example, lowercasing `Quick`), remove terms (for example, stopwords such as `a`, `and`, `the`), or add terms (for example, synonyms like `jump` and `leap`).
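The three stages compose naturally as functions. A sketch, with illustrative filter choices rather than Elasticsearch's actual built-ins:

```python
import re

def char_filter(text):
    """Character filter: tidy the string before tokenization."""
    return text.replace("&", " and ")

def tokenizer(text):
    """Tokenizer: split into individual terms on non-word characters."""
    return re.findall(r"\w+", text)

STOPWORDS = {"a", "and", "the"}

def token_filters(tokens):
    """Token filters: lowercase each term and drop stopwords."""
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    return token_filters(tokenizer(char_filter(text)))

print(analyze("Quick & the Dead"))  # ['quick', 'dead']
```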
Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes. We discuss these in detail in Custom Analyzers.
However, Elasticsearch also ships with prepackaged analyzers that you can use directly. We list the most important ones next and, to demonstrate the difference in behavior, we show what terms each would produce from this string:
"Set the shape to semi-transparent by calling set_trans(5)"
- Standard analyzer: The standard analyzer is the default analyzer that Elasticsearch uses. It is the best general choice for analyzing text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce

  `set, the, shape, to, semi, transparent, by, calling, set_trans, 5`

- Simple analyzer: The simple analyzer splits the text on anything that isn’t a letter, and lowercases the terms. It would produce

  `set, the, shape, to, semi, transparent, by, calling, set, trans`

- Whitespace analyzer: The whitespace analyzer splits the text on whitespace. It doesn’t lowercase. It would produce

  `Set, the, shape, to, semi-transparent, by, calling, set_trans(5)`

- Language analyzers: Language-specific analyzers are available for many languages. They can take the peculiarities of the specified language into account. For instance, the `english` analyzer comes with a set of English stopwords (common words like `and` or `the` that don’t have much impact on relevance), which it removes. It is also able to stem English words because it understands the rules of English grammar. The `english` analyzer would produce

  `set, shape, semi, transpar, call, set_tran, 5`

  Note how `transparent`, `calling`, and `set_trans` have been stemmed to their root forms.
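The simple and whitespace analyzers are easy enough to approximate in a few lines; the sketch below reproduces their token lists for the example string (the standard and `english` analyzers involve Unicode segmentation and stemming, and are not reproduced here):

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def simple_analyzer(text):
    # Split on anything that isn't a letter, and lowercase
    return re.findall(r"[a-z]+", text.lower())

def whitespace_analyzer(text):
    # Split on whitespace only; no lowercasing
    return text.split()

print(simple_analyzer(TEXT))
print(whitespace_analyzer(TEXT))
```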
When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index.
Full-text queries, which we discuss later, understand how each field is defined, and so they can do the right thing:
- When you query a full-text field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.
- When you query an exact-value field, the query will not analyze the query string, but instead search for the exact value that you have specified.
Mapping
In order to be able to treat date fields as dates, numeric fields as numbers, and string fields as full-text or exact-value strings, Elasticsearch needs to know what type of data each field contains. This information is contained in the mapping.
Every type has its own mapping, or schema definition. A mapping defines the fields within a type, the datatype for each field, and how the field should be handled by Elasticsearch. A mapping is also used to configure metadata associated with the type.
Elasticsearch supports the following simple field types:
- String: `string`
- Whole number: `byte`, `short`, `integer`, `long`
- Floating-point: `float`, `double`
- Boolean: `boolean`
- Date: `date`
When you index a document that contains a new field—one previously not seen—Elasticsearch will use dynamic mapping to try to guess the field type from the basic datatypes available in JSON, using the following rules:
```
JSON type                        Field type
-------------------------------------------
Boolean: true or false           boolean
Whole number: 123                long
Floating point: 123.45           double
String, valid date: 2014-09-15   date
String: foo bar                  string
```
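These rules can be sketched as a small type-guessing function. The date check below is a crude stand-in for Elasticsearch's real date detection, which accepts many formats:

```python
import re

def guess_field_type(value):
    """Guess a field type from a JSON value, mimicking dynamic mapping."""
    if isinstance(value, bool):      # must come before the int check:
        return "boolean"             # bool is a subclass of int in Python
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):  # looks like yyyy-MM-dd
            return "date"
        return "string"
    return "object"

print(guess_field_type(123))           # long
print(guess_field_type("2014-09-15"))  # date
print(guess_field_type("foo bar"))     # string
```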
Search Lite
A `GET` is fairly simple—you get back the document that you ask for. Let’s try something a little more advanced, like a simple search!

The first search we will try is the simplest search possible. We will search for all employees, with this request:

```
GET /megacorp/employee/_search
```
Search with Query DSL
Query-string search is handy for ad hoc searches from the command line, but it has its limitations. Elasticsearch provides a rich, flexible query language called the query DSL, which allows us to build much more complicated, robust queries.
The domain-specific language (DSL) is specified using a JSON request body. We can represent the previous search for all Smiths like so:
```
GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}
```
Combining Multiple Clauses
Query clauses are simple building blocks that can be combined with each other to create complex queries. Clauses can be as follows:
- Leaf clauses (like the `match` clause) that are used to compare a field (or fields) to a query string.
- Compound clauses that are used to combine other query clauses. For instance, a `bool` clause allows you to combine other clauses that either `must` match, `must_not` match, or `should` match if possible. They can also include non-scoring filters for structured search:
{ "bool": { "must": { "match": { "tweet": "elasticsearch" }}, "must_not": { "match": { "name": "mary" }}, "should": { "match": { "tweet": "full text" }}, "filter": { "range": { "age" : { "gt" : 30 }} } } }
It is important to note that a compound clause can combine any other query clauses, including other compound clauses. This means that compound clauses can be nested within each other, allowing the expression of very complex logic.
Queries and Filters
The DSL used by Elasticsearch has a single set of components called queries, which can be mixed and matched in endless combinations. This single set of components can be used in two contexts: filtering context and query context.
When used in filtering context, the query is said to be a “non-scoring” or “filtering” query. That is, the query simply asks the question: “Does this document match?”. The answer is always a simple, binary yes|no.
- Is the `created` date in the range `2013`–`2014`?
- Does the `status` field contain the term `published`?
- Is the `lat_lon` field within `10km` of a specified point?
When used in a querying context, the query becomes a “scoring” query. Similar to its non-scoring sibling, it determines whether a document matches, but it additionally determines how well the document matches.
A typical use for a query is to find documents:
- Best matching the words `full text search`
- Containing the word `run`, but maybe also matching `runs`, `running`, `jog`, or `sprint`
- Containing the words `quick`, `brown`, and `fox`—the closer together they are, the more relevant the document
- Tagged with `lucene`, `search`, or `java`—the more tags, the more relevant the document
A scoring query calculates how relevant each document is to the query, and assigns it a relevance `_score`, which is later used to sort matching documents by relevance. This concept of relevance is well suited to full-text search, where there is seldom a completely “correct” answer.

Filtering queries are simple checks for set inclusion or exclusion, which makes them very fast to compute. Various optimizations can be leveraged when at least one of your filtering queries is “sparse” (has few matching documents), and frequently used non-scoring queries can be cached in memory for faster access.
In contrast, scoring queries have to not only find matching documents, but also calculate how relevant each document is, which typically makes them heavier than their non-scoring counterparts. Also, query results are not cacheable.
Thanks to the inverted index, a simple scoring query that matches just a few documents may perform as well as, or better than, a filter that spans millions of documents. In general, however, a filter will outperform a scoring query, and it will do so consistently.
The goal of filtering is to reduce the number of documents that have to be examined by the scoring queries.
As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else.
Full-Text Search
The searches so far have been simple: single names, filtered by age. Let’s try a more advanced, full-text search—a task that traditional databases would really struggle with.
We are going to search for all employees who enjoy rock climbing:
```
GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}
```
You can see that we use the same `match` query as before to search the `about` field for “rock climbing”. We get back two matching documents:

```
{
   ...
   "hits": {
      "total":     2,
      "max_score": 0.16273327,
      "hits": [
         {
            ...
            "_score": 0.16273327,
            "_source": {
               "first_name": "John",
               "last_name":  "Smith",
               "age":        25,
               "about":      "I love to go rock climbing",
               "interests":  [ "sports", "music" ]
            }
         },
         {
            ...
            "_score": 0.016878016,
            "_source": {
               "first_name": "Jane",
               "last_name":  "Smith",
               "age":        32,
               "about":      "I like to collect rock albums",
               "interests":  [ "music" ]
            }
         }
      ]
   }
}
```
By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document matches the query. The first and highest-scoring result is obvious: John Smith’s `about` field clearly says “rock climbing” in it.

But why did Jane Smith come back as a result? Her document was returned because the word “rock” was mentioned in her `about` field. Because only “rock” was mentioned, and not “climbing”, her `_score` is lower than John’s.

This is a good example of how Elasticsearch can search within full-text fields and return the most relevant results first. This concept of relevance is important to Elasticsearch, and is completely foreign to traditional relational databases, in which a record either matches or it doesn’t.
Phrase Search
Finding individual words in a field is all well and good, but sometimes you want to match exact sequences of words or phrases. For instance, we could perform a query that will match only employee records that contain both “rock” and “climbing”, and that display the words next to each other in the phrase “rock climbing”.
To do this, we use a slight variation of the `match` query called the `match_phrase` query:

```
GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
```
Highlighting Our Searches
Many applications like to highlight snippets of text from each search result so the user can see why the document matched the query. Retrieving highlighted fragments is easy in Elasticsearch.
Let’s rerun our previous query, but add a new `highlight` parameter:

```
GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}
```

When we run this query, the same hit is returned as before, but now we get a new section in the response called `highlight`.
Pagination
In the same way as SQL uses the `LIMIT` keyword to return a single “page” of results, Elasticsearch accepts the `from` and `size` parameters:

- `size`: Indicates the number of results that should be returned; defaults to `10`
- `from`: Indicates the number of initial results that should be skipped; defaults to `0`

```
GET /_search?size=5&from=5
GET /_search?size=5&from=10
```
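The relationship between page numbers and `from`/`size` is simple arithmetic; here is a sketch using a 1-based page convention (the helper name is invented for the example):

```python
def page_params(page, size=10):
    """Translate a 1-based page number into Elasticsearch from/size values."""
    return {"size": size, "from": size * (page - 1)}

print(page_params(2, size=5))  # {'size': 5, 'from': 5}
print(page_params(3, size=5))  # {'size': 5, 'from': 10}
```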
Most Important Queries
Introduction to the most important queries.
The `match_all` query simply matches all documents. It is the default query that is used if no query has been specified:

```
{ "match_all": {} }
```

This query is frequently used in combination with a filter—for instance, to retrieve all emails in the inbox folder. All documents are considered to be equally relevant, so they all receive a neutral `_score` of `1`.

The `match` query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field.

If you run a `match` query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search:

```
{ "match": { "tweet": "About Search" }}
```

If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a `not_analyzed` string field, then it will search for that exact value:

```
{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}
```
The `multi_match` query allows you to run the same `match` query on multiple fields:

```
{
    "multi_match": {
        "query":  "full text search",
        "fields": [ "title", "body" ]
    }
}
```
The `range` query allows you to find numbers or dates that fall into a specified range:

```
{
    "range": {
        "age": {
            "gte": 20,
            "lt":  30
        }
    }
}
```
The operators that it accepts are as follows:
- `gt`: Greater than
- `gte`: Greater than or equal to
- `lt`: Less than
- `lte`: Less than or equal to
The `term` query is used to search by exact values, be they numbers, dates, Booleans, or `not_analyzed` exact-value string fields:

```
{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}
```
The `term` query performs no analysis on the input text, so it will look for exactly the value that is supplied.

The `terms` query is the same as the `term` query, but allows you to specify multiple values to match. If the field contains any of the specified values, the document matches:

```
{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}
```

Like the `term` query, no analysis is performed on the input text. It is looking for exact matches (including differences in case, accents, spaces, etc.).

Combining Queries Together
Real world search requests are never simple; they search multiple fields with various input text, and filter based on an array of criteria. To build sophisticated search, you will need a way to combine multiple queries together into a single search request.
To do that, you can use the `bool` query. This query combines multiple queries together in user-defined boolean combinations, and accepts the following parameters:

- `must`: Clauses that must match for the document to be included.
- `must_not`: Clauses that must not match for the document to be included.
- `should`: If these clauses match, they increase the `_score`; otherwise, they have no effect. They are simply used to refine the relevance score for each document.
- `filter`: Clauses that must match, but are run in non-scoring, filtering mode. These clauses do not contribute to the score; instead, they simply include or exclude documents based on their criteria.
Because this is the first query we’ve seen that contains other queries, we need to talk about how scores are combined. Each sub-query clause will individually calculate a relevance score for the document. Once these scores are calculated, the `bool` query will merge the scores together and return a single score representing the total score of the boolean operation.

The following query finds documents whose `title` field matches the query string `how to make millions` and that are not marked as `spam`. If any documents are `starred` or are from 2014 onward, they will rank higher than they would have otherwise. Documents that match both conditions will rank even higher:

```
{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}
```
If we don’t want the date of the document to affect scoring at all, we can re-arrange the previous example to use a `filter` clause:

```
{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
            "range": { "date": { "gte": "2014-01-01" }}
        }
    }
}
```
By moving the range query into the `filter` clause, we have converted it into a non-scoring query. It will no longer contribute a score to the document’s relevance ranking. And because it is now a non-scoring query, it can use the variety of optimizations available to filters, which should increase performance.

Any query can be used in this manner. Simply move a query into the `filter` clause of a `bool` query and it automatically converts to a non-scoring filter.