ELASTIC SEARCH : SEARCH DOCUMENT USING NEST IN .NET

For connecting with elastic nodes read this: CREATE INDEX USING NEST IN .NET

For inserting documents read this: INSERT DOCUMENTS IN INDEX USING NEST IN .NET

Searching—The Basic Tools

We can throw JSON documents at Elasticsearch and retrieve each one by ID. But the real power of Elasticsearch lies in its ability to make sense out of chaos — to turn Big Data into Big Information.

This is the reason that we use structured JSON documents, rather than amorphous blobs of data. Elasticsearch not only stores the document, but also indexes the content of the document in order to make it searchable.

Every field in a document is indexed and can be queried. And it’s not just that. During a single query, Elasticsearch can use all of these indices to return results at breathtaking speed. That’s something that you could never consider doing with a traditional database.

A search can be any of the following:

  • A structured query on concrete fields like gender or age, sorted by a field like join_date, similar to the type of query that you could construct in SQL
  • A full-text query, which finds all documents matching the search keywords, and returns them sorted by relevance
  • A combination of the two

While many searches will just work out of the box, to use Elasticsearch to its full potential, you need to understand three subjects:

Mapping
How the data in each field is interpreted
Analysis
How full text is processed to make it searchable
Query DSL
The flexible, powerful query language used by Elasticsearch

Inverted Index

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.

For example, let’s say we have two documents, each with a content field containing the following:

  1. The quick brown fox jumped over the lazy dog
  2. Quick brown foxes leap over lazy dogs in summer

To create an inverted index, we first split the content field of each document into separate words (which we call terms, or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

Now, if we want to search for quick brown, we just need to find the documents in which each term appears:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

Both documents match, but the first document has more matches than the second. If we apply a naive similarity algorithm that just counts the number of matching terms, then we can say that the first document is a better match—is more relevant to our query—than the second document.
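The table and the quick brown lookup above can be reproduced with a few lines of plain Python. This is only an illustrative sketch, independent of Elasticsearch and NEST: the function names build_inverted_index and count_matches are made up for the example, and tokenization is a naive whitespace split with no normalization.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return index

def count_matches(index, query):
    """Count, per document, how many of the query terms it contains."""
    totals = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return dict(totals)

index = build_inverted_index(docs)
print(sorted(index))                        # "Quick" and "quick" are separate terms
print(count_matches(index, "quick brown"))  # {1: 2, 2: 1}
```

Document 1 collects two matching terms and document 2 only one, which is the naive "count the matching terms" ranking described above.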

But there are a few problems with our current inverted index:

  • Quick and quick appear as separate terms, while the user probably thinks of them as the same word.
  • fox and foxes are pretty similar, as are dog and dogs; they share the same root word.
  • jumped and leap, while not from the same root word, are similar in meaning. They are synonyms.

With the preceding index, a search for +Quick +fox wouldn’t match any documents. (Remember, a preceding + means that the word must be present.) Both the term Quick and the term fox have to be in the same document in order to satisfy the query, but the first doc contains quick fox and the second doc contains Quick foxes.

Our user could reasonably expect both documents to match the query. We can do better.

If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:

  • Quick can be lowercased to become quick.
  • foxes can be stemmed (reduced to its root form) to become fox. Similarly, dogs could be stemmed to dog.
  • jumped and leap are synonyms and can be indexed as just the single term jump.

Now the index looks like this:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |
------------------------

But we’re not there yet. Our search for +Quick +fox would still fail, because we no longer have the exact term Quick in our index. However, if we apply the same normalization rules that we used on the content field to our query string, it would become a query for +quick +fox, which would match both documents!

This is very important. You can find only terms that exist in your index, so both the indexed text and the query string must be normalized into the same form.

This process of tokenization and normalization is called analysis.
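A small Python sketch of the idea, under stated assumptions: the STEMS and SYNONYMS tables are hand-written stand-ins for a real stemmer and synonym filter, and all function names are hypothetical. The important part is that the same normalize function runs on both the indexed text and the query terms.

```python
# Hand-written stand-ins for a real stemmer and synonym filter (assumptions).
STEMS = {"foxes": "fox", "dogs": "dog"}
SYNONYMS = {"jumped": "jump", "leap": "jump"}

def normalize(term):
    """Lowercase, then stem, then map synonyms to a single term."""
    term = term.lower()
    term = STEMS.get(term, term)
    return SYNONYMS.get(term, term)

def analyze(text):
    return [normalize(t) for t in text.split()]

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_index(documents):
    index = {}
    for doc_id, text in documents.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search_all_required(index, query):
    """Return IDs of documents that contain every +term of the query."""
    terms = [normalize(t.lstrip("+")) for t in query.split()]
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

index = build_index(docs)
print("Quick" in index)                                   # False: only "quick" is stored
print(sorted(search_all_required(index, "+Quick +fox")))  # [1, 2]
```

Looking up the raw term Quick finds nothing, while the normalized query matches both documents.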

Analysis and Analyzers

Analysis is a process that consists of the following:

  • First, tokenizing a block of text into individual terms suitable for use in an inverted index,
  • Then normalizing these terms into a standard form to improve their “searchability,” or recall

This job is performed by analyzers. An analyzer is really just a wrapper that combines three functions into a single package:

Character filters
First, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert & characters to the word and.
Tokenizer
Next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
Token filters
Last, each term is passed through any token filters in turn, which can change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the) or add terms (for example, synonyms like jump and leap).

Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes. We discuss these in detail in Custom Analyzers.
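The three stages can be pictured as a simple function pipeline. The following Python sketch is not how Elasticsearch implements analyzers; make_analyzer and the individual filter functions are hypothetical names chosen only to mirror the character filter, tokenizer, and token filter stages described above.

```python
import re

# Character filters: tidy up the raw string before tokenization.
def strip_html(text):
    return re.sub(r"<[^>]+>", "", text)

def ampersand_to_and(text):
    return text.replace("&", " and ")

# Tokenizer: split the string into individual terms.
def word_tokenizer(text):
    return re.findall(r"\w+", text)

# Token filters: change or remove terms.
def lowercase(tokens):
    return [t.lower() for t in tokens]

STOPWORDS = {"a", "and", "the"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def make_analyzer(char_filters, tokenizer, token_filters):
    """Combine the three stages into a single analyze(text) function."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

analyze = make_analyzer(
    [strip_html, ampersand_to_and],
    word_tokenizer,
    [lowercase, remove_stopwords],
)
print(analyze("<p>Quick & the brown fox</p>"))  # ['quick', 'brown', 'fox']
```

Swapping any stage for another function yields a different "custom analyzer" in this toy model.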

Built-in Analyzers

However, Elasticsearch also ships with prepackaged analyzers that you can use directly. We list the most important ones next and, to demonstrate the difference in behavior, we show what terms each would produce from this string:

"Set the shape to semi-transparent by calling set_trans(5)"

Standard analyzer

The standard analyzer is the default analyzer that Elasticsearch uses. It is the best general choice for analyzing text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

Simple analyzer

The simple analyzer splits the text on anything that isn’t a letter, and lowercases the terms. It would produce

set, the, shape, to, semi, transparent, by, calling, set, trans

Whitespace analyzer

The whitespace analyzer splits the text on whitespace. It doesn’t lowercase. It would produce

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

Language analyzers

Language-specific analyzers are available for many languages. They are able to take the peculiarities of the specified language into account. For instance, the english analyzer comes with a set of English stopwords (common words like and or the that don’t have much impact on relevance), which it removes. This analyzer also is able to stem English words because it understands the rules of English grammar.

The english analyzer would produce the following:

set, shape, semi, transpar, call, set_tran, 5

Note how transparent, calling, and set_trans have been stemmed to their root form.
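For this particular sample string, the output of the standard, simple, and whitespace analyzers can be imitated with short regular expressions. The Python functions below are rough stand-ins with made-up names, not the real analyzers: in particular, the real standard analyzer follows Unicode word-boundary rules, which a \w+ pattern only approximates.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def standard_like(text):
    # Approximation: runs of letters, digits and underscores, lowercased.
    return re.findall(r"\w+", text.lower())

def simple_like(text):
    # Split on anything that is not a letter, and lowercase.
    return re.findall(r"[a-z]+", text.lower())

def whitespace_like(text):
    # Split on whitespace only, keep the original case.
    return text.split()

for analyzer in (standard_like, simple_like, whitespace_like):
    print(analyzer.__name__, analyzer(TEXT))
```

The printed lists correspond to the three term lists shown above for this string.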

When Analyzers Are Used

When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index.

Full-text queries, which we discuss later, understand how each field is defined, and so they can do the right thing:

  • When you query a full-text field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.
  • When you query an exact-value field, the query will not analyze the query string, but instead search for the exact value that you have specified.

Search Lite

A GET is fairly simple—you get back the document that you ask for. Let’s try something a little more advanced, like a simple search!

The first search we will try is the simplest search possible. We will search for all employees, with this request:

GET /megacorp/employee/_search

Search with Query DSL

Query-string search is handy for ad hoc searches from the command line, but it has its limitations. Elasticsearch provides a rich, flexible query language called the query DSL, which allows us to build much more complicated, robust queries.

The domain-specific language (DSL) is specified using a JSON request body. For example, a search for all employees whose last name is Smith looks like this:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

Combining Multiple Clauses

Query clauses are simple building blocks that can be combined with each other to create complex queries. Clauses can be as follows:

  • Leaf clauses (like the match clause) that are used to compare a field (or fields) to a query string.
  • Compound clauses that are used to combine other query clauses. For instance, a bool clause allows you to combine other clauses that either must match, must_not match, or should match if possible. They can also include non-scoring filters for structured search:
{
    "bool": {
        "must":     { "match": { "tweet": "elasticsearch" }},
        "must_not": { "match": { "name":  "mary" }},
        "should":   { "match": { "tweet": "full text" }},
        "filter":   { "range": { "age" : { "gt" : 30 }} }
    }
}

It is important to note that a compound clause can combine any other query clauses, including other compound clauses. This means that compound clauses can be nested within each other, allowing the expression of very complex logic.

Queries and Filters

The DSL used by Elasticsearch has a single set of components called queries, which can be mixed and matched in endless combinations. This single set of components can be used in two contexts: filtering context and query context.

When used in filtering context, the query is said to be a “non-scoring” or “filtering” query. That is, the query simply asks the question: “Does this document match?”. The answer is always a simple, binary yes|no.

  • Is the created date in the range 2013-2014?
  • Does the status field contain the term published?
  • Is the lat_lon field within 10km of a specified point?

When used in a querying context, the query becomes a “scoring” query. Similar to its non-scoring sibling, this determines if a document matches and how well the document matches.

A typical use for a query is to find documents:

  • Best matching the words full text search
  • Containing the word run, but maybe also matching runs, running, jog, or sprint
  • Containing the words quick, brown, and fox—the closer together they are, the more relevant the document
  • Tagged with lucene, search, or java—the more tags, the more relevant the document

A scoring query calculates how relevant each document is to the query, and assigns it a relevance _score, which is later used to sort matching documents by relevance. This concept of relevance is well suited to full-text search, where there is seldom a completely “correct” answer.

Performance Differences between Queries and Filters

Filtering queries are simple checks for set inclusion/exclusion, which make them very fast to compute. There are various optimizations that can be leveraged when at least one of your filtering queries is “sparse” (few matching documents), and frequently used non-scoring queries can be cached in memory for faster access.

In contrast, scoring queries have to not only find matching documents, but also calculate how relevant each document is, which typically makes them heavier than their non-scoring counterparts. Also, query results are not cacheable.

Thanks to the inverted index, a simple scoring query that matches just a few documents may perform as well as or better than a filter that spans millions of documents. In general, however, a filter will outperform a scoring query. And it will do so consistently.

The goal of filtering is to reduce the number of documents that have to be examined by the scoring queries.
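The "filter first, score the survivors" idea can be sketched in Python as follows. Everything here is an assumption made for illustration: the sample documents, the passes_filter and score functions, and the naive word-count score, which is far simpler than the relevance calculation Elasticsearch performs.

```python
docs = [
    {"id": 1, "status": "published", "body": "full text search with elasticsearch"},
    {"id": 2, "status": "draft",     "body": "full text search drafts"},
    {"id": 3, "status": "published", "body": "cooking with text"},
]

def passes_filter(doc):
    """Non-scoring check: a plain yes/no answer."""
    return doc["status"] == "published"

def score(doc, query):
    """Naive scoring: number of query words present in the body."""
    words = set(doc["body"].split())
    return sum(1 for w in query.split() if w in words)

def search(documents, query):
    # Cheap yes/no step first, so fewer documents reach the scoring step.
    candidates = [d for d in documents if passes_filter(d)]
    scored = [(score(d, query), d["id"]) for d in candidates]
    return sorted(((s, i) for s, i in scored if s > 0), reverse=True)

print(search(docs, "full text search"))  # [(3, 1), (1, 3)]
```

The draft document is never scored at all, even though its body matches the query words.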

When to Use Which between Queries and Filters

As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else.

Most Important Queries

This section introduces the most important queries.

  • match_all Query

The match_all query simply matches all documents. It is the default query that is used if no query has been specified:

{ "match_all": {}}

This query is frequently used in combination with a filter—for instance, to retrieve all emails in the inbox folder. All documents are considered to be equally relevant, so they all receive a neutral _score of 1.

  • match Query

The match query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field.

If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search:

{ "match": { "tweet": "About Search" }}

If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that exact value:

{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}

  • multi_match Query

The multi_match query allows you to run the same match query on multiple fields:

{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}

  • range Query

The range query allows you to find numbers or dates that fall into a specified range:

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}
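The range clause above reads as "age is at least 20 and less than 30". As a minimal Python sketch of how the bound operators gt, gte, lt, and lte combine, consider the hypothetical in_range helper below; it only illustrates the comparison logic and has nothing to do with how Elasticsearch evaluates ranges internally.

```python
import operator

# Map each range operator name to the matching Python comparison.
RANGE_OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def in_range(value, bounds):
    """True if value satisfies every bound, e.g. {"gte": 20, "lt": 30}."""
    return all(RANGE_OPS[name](value, limit) for name, limit in bounds.items())

bounds = {"gte": 20, "lt": 30}
print([age for age in (19, 20, 29, 30) if in_range(age, bounds)])  # [20, 29]
```

The lower bound is inclusive and the upper bound exclusive, so 20 passes and 30 does not.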

The operators that it accepts are as follows:

gt
Greater than
gte
Greater than or equal to
lt
Less than
lte
Less than or equal to

  • term Query

The term query is used to search by exact values, be they numbers, dates, Booleans, or not_analyzed exact-value string fields:

{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}

The term query performs no analysis on the input text, so it will look for exactly the value that is supplied.

  • terms Query

The terms query is the same as the term query, but allows you to specify multiple values to match. If the field contains any of the specified values, the document matches:

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

Like the term query, no analysis is performed on the input text. It is looking for exact matches (including differences in case, accents, spaces, etc).
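Because no analysis happens, term and terms behave like plain lookups in the inverted index. The Python sketch below models that with a dictionary; tag_index, term_query, and terms_query are hypothetical names, and the document IDs are invented for the example.

```python
# Exact-value field: terms are stored in the index exactly as supplied.
tag_index = {
    "full_text": {1},
    "search": {1, 2},
    "nosql": {3},
}

def term_query(index, value):
    """Exact lookup: no lowercasing, stemming, or other analysis."""
    return index.get(value, set())

def terms_query(index, values):
    """A document matches if its field contains any of the given values."""
    result = set()
    for value in values:
        result |= term_query(index, value)
    return result

print(sorted(term_query(tag_index, "full_text")))   # [1]
print(sorted(term_query(tag_index, "Full_Text")))   # []  (case differs, so no match)
print(sorted(terms_query(tag_index, ["search", "full_text", "nosql"])))  # [1, 2, 3]
```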

Combining queries together

Real world search requests are never simple; they search multiple fields with various input text, and filter based on an array of criteria. To build sophisticated search, you will need a way to combine multiple queries together into a single search request.

To do that, you can use the bool query. This query combines multiple queries together in user-defined boolean combinations. This query accepts the following parameters:

must
Clauses that must match for the document to be included.
must_not
Clauses that must not match for the document to be included.
should
If these clauses match, they increase the _score; otherwise, they have no effect. They are simply used to refine the relevance score for each document.
filter
Clauses that must match, but are run in non-scoring, filtering mode. These clauses do not contribute to the score, instead they simply include/exclude documents based on their criteria.

Because this is the first query we’ve seen that contains other queries, we need to talk about how scores are combined. Each sub-query clause will individually calculate a relevance score for the document. Once these scores are calculated, the bool query will merge the scores together and return a single score representing the total score of the boolean operation.

The following query finds documents whose title field matches the query string how to make millions and that are not marked as spam. If any documents are starred or are from 2014 onward, they will rank higher than they would have otherwise. Documents that match both conditions will rank even higher:

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}
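To make the must / must_not / should behavior concrete, here is a toy evaluator in Python. It is a sketch under loud assumptions: the match, date_gte, and bool_query helpers and the sample documents are invented, each leaf score is a naive word count, and the combined score is a plain sum, which is much simpler than the real scoring used by Elasticsearch.

```python
def match(field, text):
    """Leaf clause: score = number of query words found in the field (naive)."""
    def clause(doc):
        words = str(doc.get(field, "")).lower().split()
        return sum(1 for w in text.lower().split() if w in words)
    return clause

def date_gte(field, limit):
    """Leaf clause: 1 if the ISO date string is on or after limit, else 0."""
    def clause(doc):
        return 1 if doc.get(field, "") >= limit else 0
    return clause

def bool_query(must=(), must_not=(), should=()):
    def clause(doc):
        must_scores = [c(doc) for c in must]
        if any(s == 0 for s in must_scores):
            return None                      # a required clause failed
        if any(c(doc) > 0 for c in must_not):
            return None                      # an excluded clause matched
        # should clauses only add to the score of documents that already match
        return sum(must_scores) + sum(c(doc) for c in should)
    return clause

docs = [
    {"id": 1, "title": "how to make millions", "tag": "starred", "date": "2014-05-01"},
    {"id": 2, "title": "how to make millions fast", "tag": "spam", "date": "2015-01-01"},
    {"id": 3, "title": "make millions", "tag": "", "date": "2013-01-01"},
]

query = bool_query(
    must=[match("title", "how to make millions")],
    must_not=[match("tag", "spam")],
    should=[match("tag", "starred"), date_gte("date", "2014-01-01")],
)

ranked = sorted(
    ((query(d), d["id"]) for d in docs if query(d) is not None),
    reverse=True,
)
print(ranked)  # [(6, 1), (2, 3)]
```

Document 2 is dropped by must_not, and document 1 outranks document 3 because both should clauses add to its score.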

Adding a filtering query

If we don’t want the date of the document to affect scoring at all, we can re-arrange the previous example to use a filter clause:

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
          "range": { "date": { "gte": "2014-01-01" }} 
        }
    }
}

By moving the range query into the filter clause, we have converted it into a non-scoring query. It will no longer contribute a score to the document’s relevance ranking. And because it is now a non-scoring query, it can use the variety of optimizations available to filters, which should increase performance.

Any query can be used in this manner. Simply move a query into the filter clause of a bool query and it automatically converts to a non-scoring filter.

Full-Text Search

The searches so far have been simple. Let’s try a more advanced, full-text search, a task that traditional databases would really struggle with.

We are going to search for all employees who enjoy rock climbing:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

You can see that we use the same match query as before to search the about field for “rock climbing”. We get back two matching documents:

{
   ...
   "hits": {
      "total":      2,
      "max_score":  0.16273327,
      "hits": [
         {
            ...
            "_score":         0.16273327, 
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            ...
            "_score":         0.016878016, 
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}

By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document matches the query. The first and highest-scoring result is obvious: John Smith’s about field clearly says “rock climbing” in it.

But why did Jane Smith come back as a result? The reason her document was returned is because the word “rock” was mentioned in her about field. Because only “rock” was mentioned, and not “climbing,” her _score is lower than John’s.

This is a good example of how Elasticsearch can search within full-text fields and return the most relevant results first. This concept of relevance is important to Elasticsearch, and is a concept that is completely foreign to traditional relational databases, in which a record either matches or it doesn’t.

Phrase Search

Finding individual words in a field is all well and good, but sometimes you want to match exact sequences of words or phrases. For instance, we could perform a query that will match only employee records that contain both “rock” and “climbing” and that display the words next to each other in the phrase “rock climbing.”

To do this, we use a slight variation of the match query called the match_phrase query:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
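The difference between matching any word and matching a phrase can be sketched in Python by comparing token positions. This is only an illustration with hypothetical helper names (tokens, match_any, match_phrase); Elasticsearch itself works with term positions stored in the index rather than rescanning the text.

```python
def tokens(text):
    return text.lower().split()

def match_any(field_text, query):
    """Like a plain match: at least one query word appears in the field."""
    field = set(tokens(field_text))
    return any(t in field for t in tokens(query))

def match_phrase(field_text, query):
    """All query words must appear next to each other, in the same order."""
    field, phrase = tokens(field_text), tokens(query)
    n = len(phrase)
    return n > 0 and any(
        field[i:i + n] == phrase for i in range(len(field) - n + 1)
    )

john = "I love to go rock climbing"
jane = "I like to collect rock albums"

print(match_any(john, "rock climbing"), match_any(jane, "rock climbing"))        # True True
print(match_phrase(john, "rock climbing"), match_phrase(jane, "rock climbing"))  # True False
```

Only John's about field contains the two words as an adjacent pair, so the phrase check drops Jane's document.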

Highlighting Our Searches

Many applications like to highlight snippets of text from each search result so the user can see why the document matched the query. Retrieving highlighted fragments is easy in Elasticsearch.

Let’s rerun our previous query, but add a new highlight parameter:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

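When we run this query, the same hit is returned as before, but now we get a new section in the response called highlight. It contains a snippet of text from the about field with the matching words wrapped in <em></em> HTML tags.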
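What a highlighter does to a fragment can be imitated in a few lines of Python. The highlight function below is a hypothetical sketch, not the Elasticsearch highlighter: it assumes em tags as the wrapping markup and simply wraps every word that equals one of the query words.

```python
import re

def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap each word of text that also occurs in the query."""
    terms = {t.lower() for t in query.split()}

    def wrap(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if word.lower() in terms else word

    return re.sub(r"\w+", wrap, text)

print(highlight("I love to go rock climbing", "rock climbing"))
# I love to go <em>rock</em> <em>climbing</em>
```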

Pagination

In the same way as SQL uses the LIMIT keyword to return a single “page” of results, Elasticsearch accepts the from and size parameters:

size
Indicates the number of results that should be returned, defaults to 10
from
Indicates the number of initial results that should be skipped, defaults to 0
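In terms of a plain list of hits, from and size behave like a slice. The Python sketch below uses made-up helper names (page, from_for_page) and the parameter name from_ because from is a reserved word in Python; the defaults mirror the ones listed above.

```python
def page(hits, from_=0, size=10):
    """Skip the first from_ hits, then return at most size hits."""
    return hits[from_:from_ + size]

def from_for_page(page_number, size=10):
    """Convert a 1-based page number into the matching from value."""
    return (page_number - 1) * size

hits = list(range(1, 26))  # pretend result IDs 1..25, already sorted by relevance

print(page(hits))                                   # hits 1 to 10
print(page(hits, from_=from_for_page(3), size=10))  # [21, 22, 23, 24, 25]
```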

Search Queries Example

To search for specific documents, use the Search API. Note that if you gave the document type a custom name when creating the index, you must specify that type along with the index in each request, as in the examples below. You also need to name the index explicitly if you have not set a default index with DefaultIndex on the connection settings object.

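// Term query: looks up the exact, un-analyzed value, so the input is lowercased
// to match the terms that the standard analyzer stored in the index.
public static object SearchDocumentMethod1(string qr = "IT")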
{
    var query = qr.ToLower();
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q =>
                            q.Term(t => t.Department, query)
                            )
                    );
    return response;
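}

// Match query: the query string is analyzed with the field's analyzer,
// so no manual lowercasing is needed.
public static object SearchDocumentMethod2(string qr = "IT")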
{
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q =>
                            q.Match(mq => mq.Field(f => f.Department).Query(qr))
                           )
                    );
    return response;
}

OR operation:

// Bool query with only should clauses: a document matches if at least one clause matches.
public static object SearchDocumentUsingOROperator(string dept = "IT", string name = "XYZ")
{
    var qDept = dept.ToLower();
    var qName = name.ToLower();
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q => q
                            .Bool(b => b
                                .Should(
                                    bs => bs.Term(p => p.Department, qDept),
                                    bs => bs.Term(p => p.Name, qName)
                                )
                            )
                        )
                    );
    return response;
}

AND operation:

// Bool query with must clauses: every clause has to match.
public static object SearchDocumentUsingANDOperator(string dept = "IT", string name = "XYZ")
{
    var qDept = dept.ToLower();
    var qName = name.ToLower();
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q => q
                            .Bool(b => b
                                .Must(
                                    bs => bs.Term(p => p.Department, qDept),
                                    bs => bs.Term(p => p.Name, qName)
                                )
                            )
                        )
                    );
    return response;
}

NOT operation:

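// Bool query with must_not clauses: a document is excluded if it matches any of them,
// so this returns employees who are neither in the department nor have the given ID.
public static object SearchDocumentUsingNOTOperator(string dept = "IT", int empId = 45)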
{
    var qDept = dept.ToLower();
    var qempId = empId;
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(1)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q => q
                            .Bool(b => b
                                .MustNot(
                                    bs => bs.Term(p => p.Department, qDept),
                                    bs => bs.Term(p => p.EmpId, qempId)
                                )
                            )
                        )
                    );
    return response;
}

Operator Overloading for Boolean operation:

// NEST overloads && and || on queries: && combines clauses as must, || as should.
public static object SearchDocumentUsingOperatorOverloading()
{
    var qDept = "IT".ToLower();
    var qName = "John1".ToLower();
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q => 
                            q.Term(p => p.Name, qName) && 
                            (q.Term(p => p.Department, qDept) ||
                            q.Term(p => p.Salary, 45139)) 
                        )
                    );
    return response;
}
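How an overloaded operator can turn into a bool query is easier to see in a stripped-down model. The Python sketch below is not NEST and makes several assumptions: the Query class and term helper are invented, Python's & and | stand in for the C# && and || operators, and the result is a plain nested dictionary shaped like the JSON query DSL, without the flattening a real client might apply.

```python
class Query:
    """Tiny wrapper around a query body that supports & and |."""

    def __init__(self, body):
        self.body = body

    def __and__(self, other):
        # AND: both sides become must clauses of a bool query
        return Query({"bool": {"must": [self.body, other.body]}})

    def __or__(self, other):
        # OR: both sides become should clauses of a bool query
        return Query({"bool": {"should": [self.body, other.body]}})

def term(field, value):
    return Query({"term": {field: value}})

q = term("name", "john1") & (term("department", "it") | term("salary", 45139))
print(q.body)
```

The printed dictionary has an outer bool with two must clauses, the second of which is itself a bool with two should clauses, mirroring the name AND (department OR salary) expression in the C# example above.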

Filter operation:

// Filter clause: a non-scoring range check on salary.
public static object SearchDocumentUsingFilter()
{
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q => q
                        .Bool(b => b
                            .Filter(f => f.Range(m => m.Field("salary").LessThan(45139)))
                            )
                        )
                    );
    return response;
}

Complex operation:

// All three conditions are must clauses, so the range check also takes part in scoring.
public static object SearchDocumentComplex1()
{
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q => q
                        .Bool(b => b
                            .Must(
                                bs => bs.Term(p => p.Salary, "45112"),
                                bs => bs.Term(p => p.EmpId, "112"),
                                bs => bs.Range(m => m.Field("salary").LessThanOrEquals(45112))
                                )   
                            )
                        )
                    );
    return response;
}

Complex operation:

// Two term conditions as must clauses, plus a salary range check in a non-scoring filter clause.
public static object SearchDocumentComplex2()
{
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q => q
                        .Bool(b => b
                            .Must(
                                bs => bs.Term(p => p.Salary, "45112"),
                                bs => bs.Term(p => p.EmpId, "112")
                                )
                            .Filter(f => f.Range(m => m.Field("salary").GreaterThanOrEquals(45112)))
                            )
                        )
                    );
    return response;
}

Complex operation:

// Name is john150, OR salary is 45149, OR salary lies between 45100 and 45105 (inclusive).
public static object SearchDocumentComplex3()
{
    var response = EsClient.Search<Employee>(s => s
                    .From(0)
                    .Size(10000)
                    .Index("employee")
                    .Type("myEmployee")
                    .Query(q =>
                        q.Term(p => p.Name, "john150") ||
                        q.Term(p => p.Salary, "45149") ||
                            (
                                q.TermRange(p => p.Field(f => f.Salary).GreaterThanOrEquals("45100")) &&
                                q.TermRange(p => p.Field(f => f.Salary).LessThanOrEquals("45105"))
                            )
                        )
                    );
    return response;
}

Notes: 

  • A bool query literally combines multiple queries of any type together with clauses such as must, must_not, and should.
  • A term query specifies a single field and a single term to determine if the field matches. Note that term queries are specifically for non-analyzed fields.
  • Analyzers process the text to obtain the terms that are finally indexed and searched. An analyzer of type standard is built using the Standard Tokenizer with the Standard Token Filter, Lower Case Token Filter, and Stop Token Filter. With the standard analyzer, a value such as GET becomes get when stored in the index, while the source document still holds the original “GET”. The match query applies the same standard analyzer to the search term and therefore matches what is stored in the index. The term query does not apply any analyzer to the search term, so it only looks for that exact term in the inverted index. So when using the Term API, always convert the search input to lower case (“get” rather than “GET”), or change your mapping so that the field is set to not_analyzed.
  • Query DSL: Elasticsearch provides a full Query DSL based on JSON to define queries, consisting of two types of clauses:

    Leaf query clauses
    Leaf query clauses look for a particular value in a particular field, such as the match, term or range queries. These queries can be used by themselves.
    Compound query clauses
    Compound query clauses wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion (such as the bool or dis_max query), or to alter their behaviour (such as the not or constant_score query).
