Elasticsearch: Introduction, Basics, Architecture and Usage


Most databases are astonishingly inept at extracting actionable knowledge from your data. Sure, they can filter by timestamp or exact values, but can they perform full-text search, handle synonyms, and score documents by relevance? Elasticsearch is the answer.

Elasticsearch is a real-time distributed search and analytics engine. It allows you to explore your data at a speed and at a scale never before possible. It is used for full-text search, structured search, analytics, and all three in combination:

  • A distributed real-time document store where every field is indexed and searchable
  • A distributed search engine with real-time analytics
  • Capable of scaling to hundreds of servers and petabytes of structured and unstructured data

Document Oriented

Objects in an application are seldom just a simple list of keys and values. More often than not, they are complex data structures that may contain dates, geo locations, other objects, or arrays of values.

Elasticsearch is document oriented, meaning that it stores entire objects or documents. It not only stores them, but also indexes the contents of each document in order to make them searchable. In Elasticsearch, you index, search, sort, and filter documents—not rows of columnar data. This is a fundamentally different way of thinking about data and is one of the reasons Elasticsearch can perform complex full-text search.


Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for documents. JSON serialization is supported by most programming languages, and has become the standard format used by the NoSQL movement. It is simple, concise, and easy to read.


An Elasticsearch cluster can contain multiple indices, which in turn contain multiple types. These types hold multiple documents, and each document has multiple fields.

  • Index (verb): To index a document is to store a document in an index (noun) so that it can be retrieved and queried. It is much like the INSERT keyword in SQL except that, if the document already exists, the new document replaces the old.
  • Inverted index: Relational databases add an index, such as a B-tree index, to specific columns in order to improve the speed of data retrieval. Elasticsearch and Lucene use a structure called an inverted index for exactly the same purpose. By default, every field in a document is indexed (has an inverted index) and thus is searchable. A field without an inverted index is not searchable.

So, for an employee directory example, consider the following:

  • Index a document per employee, which contains all the details of a single employee.
  • Each document will be of type employee.
  • That type will live in the megacorp index.
  • That index will reside within our Elasticsearch cluster.

In practice, this is easy (even though it looks like a lot of steps). We can perform all of those actions in a single command:

PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}


Searching—The Basic Tools

We can throw JSON documents at Elasticsearch and retrieve each one by ID. But the real power of Elasticsearch lies in its ability to make sense out of chaos — to turn Big Data into Big Information.

This is the reason that we use structured JSON documents, rather than amorphous blobs of data. Elasticsearch not only stores the document, but also indexes the content of the document in order to make it searchable.

Every field in a document is indexed and can be queried. And it’s not just that. During a single query, Elasticsearch can use all of these indices, to return results at breath-taking speed. That’s something that you could never consider doing with a traditional database.

A search can be any of the following:

  • A structured query on concrete fields like gender or age, sorted by a field like join_date, similar to the type of query that you could construct in SQL
  • A full-text query, which finds all documents matching the search keywords, and returns them sorted by relevance
  • A combination of the two

While many searches will just work out of the box, to use Elasticsearch to its full potential, you need to understand three subjects:

Mapping
How the data in each field is interpreted
Analysis
How full text is processed to make it searchable
Query DSL
The flexible, powerful query language used by Elasticsearch

Each of these is a big subject in its own right, and the sections below cover the basics of each.

Exact Values Versus Full Text

Data in Elasticsearch can be broadly divided into two types: exact values and full text.

Exact values are exactly what they sound like. Examples are a date or a user ID, but can also include exact strings such as a username or an email address. The exact value Foo is not the same as the exact value foo. The exact value 2014 is not the same as the exact value 2014-09-15.

Full text, on the other hand, refers to textual data—usually written in some human language — like the text of a tweet or the body of an email.

Full text is often referred to as unstructured data, which is a misnomer—natural language is highly structured. The problem is that the rules of natural languages are complex, which makes them difficult for computers to parse correctly. For instance, consider this sentence:

May is fun but June bores me.

Does it refer to months or to people?

Exact values are easy to query. The decision is binary; a value either matches the query, or it doesn’t. This kind of query is easy to express with SQL:

WHERE name    = "John Smith"
  AND user_id = 2
  AND date    > "2014-09-15"

Querying full-text data is much more subtle. We are not just asking, “Does this document match the query” but “How well does this document match the query?” In other words, how relevant is this document to the given query?

We seldom want to match the whole full-text field exactly. Instead, we want to search within text fields. Not only that, but we expect search to understand our intent:

  • A search for UK should also return documents mentioning the United Kingdom.
  • A search for jump should also match jumped, jumps, jumping, and perhaps even leap.
  • johnny walker should match Johnnie Walker, and johnnie depp should match Johnny Depp

To facilitate these types of queries on full-text fields, Elasticsearch first analyzes the text, and then uses the results to build an inverted index.

Inverted Index

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.

For example, let’s say we have two documents, each with a content field containing the following:

  1. The quick brown fox jumped over the lazy dog
  2. Quick brown foxes leap over lazy dogs in summer

To create an inverted index, we first split the content field of each document into separate words (which we call terms, or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this:

Term      Doc_1  Doc_2
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |

Now, if we want to search for quick brown, we just need to find the documents in which each term appears:

Term      Doc_1  Doc_2
brown   |   X   |  X
quick   |   X   |
Total   |   2   |  1

Both documents match, but the first document has more matches than the second. If we apply a naive similarity algorithm that just counts the number of matching terms, then we can say that the first document is a better match—is more relevant to our query—than the second document.
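To make the idea concrete, here is a minimal Python sketch of the table above and of the naive count-the-matching-terms scoring. It is only an illustration, not how Lucene actually stores its index, and the function names build_inverted_index and count_matches are made up for this example.

```python
from collections import defaultdict

# The two example documents from the text, keyed by document number.
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_inverted_index(docs):
    """Map each whitespace-separated term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def count_matches(index, query):
    """Naive relevance: count how many query terms appear in each document."""
    totals = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return dict(totals)

index = build_inverted_index(docs)
scores = count_matches(index, "quick brown")
print(scores)  # document 1 matches two terms, document 2 matches one
```

No normalization happens here, so Quick and quick end up as separate terms, exactly as in the table.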

But there are a few problems with our current inverted index:

  • Quick and quick appear as separate terms, while the user probably thinks of them as the same word.
  • fox and foxes are pretty similar, as are dog and dogs; they share the same root word.
  • jumped and leap, while not from the same root word, are similar in meaning. They are synonyms.

With the preceding index, a search for +Quick +fox wouldn’t match any documents. (Remember, a preceding + means that the word must be present.) Both the term Quick and the term fox have to be in the same document in order to satisfy the query, but the first doc contains quick fox and the second doc contains Quick foxes.

Our user could reasonably expect both documents to match the query. We can do better.

If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:

  • Quick can be lowercased to become quick.
  • foxes can be stemmed (reduced to its root form) to become fox. Similarly, dogs could be stemmed to dog.
  • jumped and leap are synonyms and can be indexed as just the single term jump.

Now the index looks like this:

Term      Doc_1  Doc_2
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |

But we’re not there yet. Our search for +Quick +fox would still fail, because we no longer have the exact term Quick in our index. However, if we apply the same normalization rules that we used on the content field to our query string, it would become a query for +quick +fox, which would match both documents!

This is very important. You can find only terms that exist in your index, so both the indexed text and the query string must be normalized into the same form.

This process of tokenization and normalization is called analysis.
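Here is a small Python sketch of that idea: the same analyze function is applied both when building the index and when parsing the query. The STEMS and SYNONYMS tables are hand-written assumptions that only cover the two example documents; real analyzers use proper stemming algorithms and synonym lists.

```python
from collections import defaultdict

# Hand-written rules for this example only (assumptions, not a real stemmer).
STEMS = {"foxes": "fox", "dogs": "dog"}
SYNONYMS = {"jumped": "jump", "leap": "jump"}

def normalize(term):
    """Lowercase, then stem, then map synonyms to a single term."""
    term = term.lower()
    term = STEMS.get(term, term)
    return SYNONYMS.get(term, term)

def analyze(text):
    """Tokenize on whitespace and normalize every term."""
    return [normalize(token) for token in text.split()]

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Index time: analyze the document text.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)

def search_all(index, query):
    """Query time: analyze the query the same way, then require every term."""
    term_sets = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*term_sets) if term_sets else set()

print(search_all(index, "Quick fox"))  # both documents now match
```

If the query string skipped analyze, the term Quick would be looked up as-is and find nothing, which is the failure described above.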

Analysis and Analyzers

Analysis is a process that consists of the following:

  • First, tokenizing a block of text into individual terms suitable for use in an inverted index,
  • Then normalizing these terms into a standard form to improve their “searchability,” or recall

This job is performed by analyzers. An analyzer is really just a wrapper that combines three functions into a single package:

Character filters
First, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert & characters to the word and.
Tokenizer
Next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
Token filters
Last, each term is passed through any token filters in turn, which can change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the) or add terms (for example, synonyms like jump and leap).

Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes. We discuss these in detail in Custom Analyzers.
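The three stages can be sketched as plain Python functions chained together. This is a toy model of the pipeline, not the Elasticsearch implementation; every function name and the tiny STOPWORDS set are assumptions made for the example.

```python
import re

# --- Character filters: tidy up the raw string before tokenization ---
def strip_html(text):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def ampersand_to_and(text):
    """Convert & characters to the word 'and'."""
    return text.replace("&", " and ")

# --- Tokenizer: split the string into individual terms ---
def tokenize(text):
    """Split on whitespace and punctuation (keeps letters, digits, underscores)."""
    return re.findall(r"\w+", text)

# --- Token filters: change or remove terms ---
def lowercase(tokens):
    return [token.lower() for token in tokens]

STOPWORDS = {"a", "and", "the"}  # tiny illustrative list

def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

terms = analyze(
    "<p>The Quick & Brown fox</p>",
    char_filters=[strip_html, ampersand_to_and],
    tokenizer=tokenize,
    token_filters=[lowercase, remove_stopwords],
)
print(terms)
```

Order matters: lowercase runs before remove_stopwords so that The is recognized as the stopword the.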

Built-in Analyzers

However, Elasticsearch also ships with prepackaged analyzers that you can use directly. We list the most important ones next and, to demonstrate the difference in behavior, we show what terms each would produce from this string:

"Set the shape to semi-transparent by calling set_trans(5)"

Standard analyzer

The standard analyzer is the default analyzer that Elasticsearch uses. It is the best general choice for analyzing text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce:

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

Simple analyzer

The simple analyzer splits the text on anything that isn’t a letter, and lowercases the terms. It would produce:

set, the, shape, to, semi, transparent, by, calling, set, trans

Whitespace analyzer

The whitespace analyzer splits the text on whitespace. It doesn’t lowercase. It would produce:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

Language analyzers

Language-specific analyzers are available for many languages. They are able to take the peculiarities of the specified language into account. For instance, the english analyzer comes with a set of English stopwords (common words like and or the that don’t have much impact on relevance), which it removes. This analyzer also is able to stem English words because it understands the rules of English grammar.

The english analyzer would produce the following:

set, shape, semi, transpar, call, set_tran, 5

Note how transparent, calling, and set_trans have been stemmed to their root form.
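The simple and whitespace analyzers are easy to approximate in a few lines of Python, which makes their rules easy to compare. These are rough imitations based only on the descriptions above, not the real analyzers, and the function names are made up for this sketch. The standard and english analyzers rely on Unicode word-boundary rules and stemming, so they are not reproduced here.

```python
import re

text = "Set the shape to semi-transparent by calling set_trans(5)"

def simple_analyzer(text):
    """Split on anything that isn't a letter, then lowercase each term."""
    # [^\W\d_] matches word characters that are neither digits nor underscores.
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]

def whitespace_analyzer(text):
    """Split on whitespace only; no lowercasing, punctuation is kept."""
    return text.split()

print(simple_analyzer(text))
print(whitespace_analyzer(text))
```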

When Analyzers Are Used

When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index.

Full-text queries, which we discuss later, understand how each field is defined, and so they can do the right thing:

  • When you query a full-text field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.
  • When you query an exact-value field, the query will not analyze the query string, but instead search for the exact value that you have specified.


In order to be able to treat date fields as dates, numeric fields as numbers, and string fields as full-text or exact-value strings, Elasticsearch needs to know what type of data each field contains. This information is contained in the mapping.

Every type has its own mapping, or schema definition. A mapping defines the fields within a type, the datatype for each field, and how the field should be handled by Elasticsearch. A mapping is also used to configure metadata associated with the type.

Core Simple Field Types

Elasticsearch supports the following simple field types:

  • String: string
  • Whole number: byte, short, integer, long
  • Floating-point: float, double
  • Boolean: boolean
  • Date: date

When you index a document that contains a new field—one previously not seen—Elasticsearch will use dynamic mapping to try to guess the field type from the basic datatypes available in JSON, using the following rules:

JSON type                        | Field type
Boolean: true or false           | boolean
Whole number: 123                | long
Floating point: 123.45           | double
String, valid date: 2014-09-15   | date
String: foo bar                  | string
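The guessing rules can be sketched in Python as a chain of type checks on a parsed JSON document. This is only an approximation of the table: the real date detection in Elasticsearch recognizes more formats than the plain ISO dates handled here, and guess_field_type is a made-up name for this example.

```python
import json
from datetime import date

def guess_field_type(value):
    """Guess a field type from a parsed JSON value, following the table above."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)  # simplified date detection (assumption)
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"no simple field type for {type(value).__name__}")

doc = json.loads(
    '{"public": true, "age": 123, "price": 123.45,'
    ' "joined": "2014-09-15", "about": "foo bar"}'
)
mapping = {field: guess_field_type(value) for field, value in doc.items()}
print(mapping)
```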


Search Lite

A GET is fairly simple—you get back the document that you ask for. Let’s try something a little more advanced, like a simple search!

The first search we will try is the simplest search possible. We will search for all employees, with this request:

GET /megacorp/employee/_search

Search with Query DSL

Query-string search is handy for ad hoc searches from the command line, but it has its limitations. Elasticsearch provides a rich, flexible query language called the query DSL, which allows us to build much more complicated, robust queries.

The domain-specific language (DSL) is specified using a JSON request body. We can represent a search for all employees with the last name Smith like so:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

Combining Multiple Clauses

Query clauses are simple building blocks that can be combined with each other to create complex queries. Clauses can be as follows:

  • Leaf clauses (like the match clause) that are used to compare a field (or fields) to a query string.
  • Compound clauses that are used to combine other query clauses. For instance, a bool clause allows you to combine other clauses that either must match, must_not match, or should match if possible. They can also include non-scoring filters for structured search:

{
    "bool": {
        "must":     { "match": { "tweet": "elasticsearch" }},
        "must_not": { "match": { "name":  "mary" }},
        "should":   { "match": { "tweet": "full text" }},
        "filter":   { "range": { "age" : { "gt" : 30 }} }
    }
}

It is important to note that a compound clause can combine any other query clauses, including other compound clauses. This means that compound clauses can be nested within each other, allowing the expression of very complex logic.

Queries and Filters

The DSL used by Elasticsearch has a single set of components called queries, which can be mixed and matched in endless combinations. This single set of components can be used in two contexts: filtering context and query context.

When used in filtering context, the query is said to be a “non-scoring” or “filtering” query. That is, the query simply asks the question: “Does this document match?”. The answer is always a simple, binary yes|no.

  • Is the created date in the range 2013-2014?
  • Does the status field contain the term published?
  • Is the lat_lon field within 10km of a specified point?

When used in a querying context, the query becomes a “scoring” query. Like its non-scoring sibling, it determines whether a document matches, but it also determines how well the document matches.

A typical use for a query is to find documents:

  • Best matching the words full text search
  • Containing the word run, but maybe also matching runs, running, jog, or sprint
  • Containing the words quick, brown, and fox—the closer together they are, the more relevant the document
  • Tagged with lucene, search, or java—the more tags, the more relevant the document

A scoring query calculates how relevant each document is to the query, and assigns it a relevance _score, which is later used to sort matching documents by relevance. This concept of relevance is well suited to full-text search, where there is seldom a completely “correct” answer.

Performance Differences between Queries and Filters

Filtering queries are simple checks for set inclusion/exclusion, which makes them very fast to compute. There are various optimizations that can be leveraged when at least one of your filtering queries is “sparse” (few matching documents), and frequently used non-scoring queries can be cached in memory for faster access.

In contrast, scoring queries have to not only find matching documents, but also calculate how relevant each document is, which typically makes them heavier than their non-scoring counterparts. Also, query results are not cacheable.

Thanks to the inverted index, a simple scoring query that matches just a few documents may perform as well as or better than a filter that spans millions of documents. In general, however, a filter will outperform a scoring query. And it will do so consistently.

The goal of filtering is to reduce the number of documents that have to be examined by the scoring queries.

When to Use Which between Queries and Filters

As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else.

Full-Text Search

The searches so far have been simple: single names, filtered by age. Let’s try a more advanced, full-text search, a task that traditional databases would really struggle with.

We are going to search for all employees who enjoy rock climbing:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

You can see that we use the same match query as before to search the about field for “rock climbing”. We get back two matching documents:

   "hits": {
      "total":      2,
      "max_score":  0.16273327,
      "hits": [
         {
            "_score":         0.16273327,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            "_score":         0.016878016,
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }

By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document matches the query. The first and highest-scoring result is obvious: John Smith’s about field clearly says “rock climbing” in it.

But why did Jane Smith come back as a result? The reason her document was returned is because the word “rock” was mentioned in her about field. Because only “rock” was mentioned, and not “climbing,” her _score is lower than John’s.

This is a good example of how Elasticsearch can search within full-text fields and return the most relevant results first. This concept of relevance is important to Elasticsearch, and is a concept that is completely foreign to traditional relational databases, in which a record either matches or it doesn’t.

Phrase Search

Finding individual words in a field is all well and good, but sometimes you want to match exact sequences of words or phrases. For instance, we could perform a query that will match only employee records that contain both “rock” and “climbing” and that display the words next to each other in the phrase “rock climbing.”

To do this, we use a slight variation of the match query called the match_phrase query:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}

Highlighting Our Searches

Many applications like to highlight snippets of text from each search result so the user can see why the document matched the query. Retrieving highlighted fragments is easy in Elasticsearch.

Let’s rerun our previous query, but add a new highlight parameter:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

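When we run this query, the same hit is returned as before, but now we get a new section in the response called highlight. It contains a snippet of text from the about field, with the matching words wrapped in <em></em> HTML tags by default.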


In the same way as SQL uses the LIMIT keyword to return a single “page” of results, Elasticsearch accepts the from and size parameters:

size
Indicates the number of results that should be returned, defaults to 10
from
Indicates the number of initial results that should be skipped, defaults to 0

For example, with five results per page, the second and third pages could be requested as follows:

GET /_search?size=5&from=5
GET /_search?size=5&from=10
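The relationship between a page number and the from and size parameters is simple arithmetic, sketched below in Python. The helper names page_params and search_path are hypothetical, and pages are assumed to be numbered from 1.

```python
def page_params(page, size=10):
    """Return the from/size values for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be at least 1")
    # Skip all the results that belong to the earlier pages.
    return {"from": (page - 1) * size, "size": size}

def search_path(page, size=10):
    """Build the query-string form of the search request path."""
    params = page_params(page, size)
    return f"/_search?size={params['size']}&from={params['from']}"

for page in (1, 2, 3):
    print("GET", search_path(page, size=5))
```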

Most Important Queries

This section introduces the most important queries.

  • match_all Query

The match_all query simply matches all documents. It is the default query that is used if no query has been specified:

{ "match_all": {}}

This query is frequently used in combination with a filter—for instance, to retrieve all emails in the inbox folder. All documents are considered to be equally relevant, so they all receive a neutral _score of 1.

  • match Query

The match query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field.

If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search:

{ "match": { "tweet": "About Search" }}

If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that exact value:

{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}

  • multi_match Query

The multi_match query allows you to run the same match query on multiple fields:

{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}

  • range Query

The range query allows you to find numbers or dates that fall into a specified range:

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}

The operators that it accepts are as follows:

gt
Greater than
gte
Greater than or equal to
lt
Less than
lte
Less than or equal to

  • term Query

The term query is used to search by exact values, be they numbers, dates, Booleans, or not_analyzed exact-value string fields:

{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}

The term query performs no analysis on the input text, so it will look for exactly the value that is supplied.

  • terms Query

The terms query is the same as the term query, but allows you to specify multiple values to match. If the field contains any of the specified values, the document matches:

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

Like the term query, no analysis is performed on the input text. It is looking for exact matches (including differences in case, accents, spaces, etc).

Combining queries together

Real world search requests are never simple; they search multiple fields with various input text, and filter based on an array of criteria. To build sophisticated search, you will need a way to combine multiple queries together into a single search request.

To do that, you can use the bool query. This query combines multiple queries together in user-defined boolean combinations. This query accepts the following parameters:

must
Clauses that must match for the document to be included.
must_not
Clauses that must not match for the document to be included.
should
If these clauses match, they increase the _score; otherwise, they have no effect. They are simply used to refine the relevance score for each document.
filter
Clauses that must match, but are run in non-scoring, filtering mode. These clauses do not contribute to the score; instead, they simply include or exclude documents based on their criteria.

Because this is the first query we’ve seen that contains other queries, we need to talk about how scores are combined. Each sub-query clause will individually calculate a relevance score for the document. Once these scores are calculated, the bool query will merge the scores together and return a single score representing the total score of the boolean operation.

The following query finds documents whose title field matches the query string how to make millions and that are not marked as spam. If any documents are starred or are from 2014 onward, they will rank higher than they would have otherwise. Documents that match both conditions will rank even higher:

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}

Adding a filtering query

If we don’t want the date of the document to affect scoring at all, we can re-arrange the previous example to use a filter clause:

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
          "range": { "date": { "gte": "2014-01-01" }}
        }
    }
}

By moving the range query into the filter clause, we have converted it into a non-scoring query. It will no longer contribute a score to the document’s relevance ranking. And because it is now a non-scoring query, it can use the variety of optimizations available to filters, which should increase performance.

Any query can be used in this manner. Simply move a query into the filter clause of a bool query and it automatically converts to a non-scoring filter.
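When request bodies are assembled in application code, the same move can be expressed by passing a clause to a different argument. The Python sketch below builds the two bool queries as plain dictionaries and serializes one with the standard json module. The bool_query helper is a hypothetical convenience for this example, not part of any Elasticsearch client, and it always emits each section as a list of clauses.

```python
import json

def bool_query(must=(), must_not=(), should=(), filters=()):
    """Assemble a bool query body as a dict, omitting empty sections."""
    sections = {
        "must": must,
        "must_not": must_not,
        "should": should,
        "filter": filters,
    }
    return {"bool": {name: list(clauses) for name, clauses in sections.items() if clauses}}

title = {"match": {"title": "how to make millions"}}
spam = {"match": {"tag": "spam"}}
starred = {"match": {"tag": "starred"}}
date_range = {"range": {"date": {"gte": "2014-01-01"}}}

# Scoring version: the date range raises the score of documents that match it.
scoring = bool_query(must=[title], must_not=[spam], should=[starred, date_range])

# Filtering version: the same clause, moved so it only includes or excludes documents.
filtered = bool_query(must=[title], must_not=[spam], should=[starred], filters=[date_range])

print(json.dumps({"query": filtered}, indent=2))
```

The clause itself is unchanged; only its position inside the bool query decides whether it scores or filters.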



