A Beginner Tutorial For Elasticsearch – 1 – The Null Pointer Exception

Introduction

Elasticsearch is a near-realtime search platform. Near-realtime means that there is a slight latency (typically around one second) between the time a document is indexed and the time it becomes searchable. We will start with some terminology related to Elasticsearch.
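
To make the idea of near-realtime concrete, here is a minimal toy model in Python. It is only a sketch and not how Elasticsearch is implemented: the class ToyIndex and its methods are hypothetical names chosen for illustration. The point it shows is that a newly indexed document sits in a buffer and only becomes visible to searches after a refresh, which Elasticsearch performs periodically in the background.

```python
class ToyIndex:
    """Toy model of near-realtime search (illustrative only)."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet searchable
        self._searchable = []  # visible to searches

    def index(self, doc):
        # A newly indexed document is accepted but not yet searchable.
        self._buffer.append(doc)

    def refresh(self):
        # Make everything indexed so far visible to searches.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, field, value):
        return [d for d in self._searchable if d.get(field) == value]


idx = ToyIndex()
idx.index({"name": "Stop N Shop"})
before = idx.search("name", "Stop N Shop")  # nothing yet: no refresh has happened
idx.refresh()
after = idx.search("name", "Stop N Shop")   # the document is now searchable
print(len(before), len(after))  # prints "0 1"
```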

Document

A document is the basic unit of information that can be indexed. A document is expressed in JSON.

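Example of a document, as returned when it is fetched by ID (the document itself is the "_source" object; the other fields are metadata):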

{
    "_index": "partners",
    "_type": "_doc",
    "_id": "1",
    "_version": 2,
    "_seq_no": 1,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "location": {
            "lat": 12.9626629,
            "lon": 77.6403188
        },
        "name": "Stop N Shop",
        "offerings": "tea, salt, sugar"
    }
}
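
As a small sketch of how such a response might be consumed, the Python snippet below parses the JSON above and separates the stored document ("_source") from the surrounding metadata. The variable names are illustrative, and treating every key that starts with an underscore as metadata is an assumption made for this example.

```python
import json

# The response shown above, as a JSON string.
response_text = """
{
    "_index": "partners",
    "_type": "_doc",
    "_id": "1",
    "_version": 2,
    "_seq_no": 1,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "location": {"lat": 12.9626629, "lon": 77.6403188},
        "name": "Stop N Shop",
        "offerings": "tea, salt, sugar"
    }
}
"""

response = json.loads(response_text)

# The document we indexed lives under "_source".
document = response["_source"]

# Assumption for this sketch: underscore-prefixed keys other than "_source" are metadata.
metadata = {k: v for k, v in response.items() if k.startswith("_") and k != "_source"}

print(document["name"])                                           # Stop N Shop
print(metadata["_index"], metadata["_id"], metadata["_version"])  # partners 1 2
```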

Index

An index is a collection of documents that have similar characteristics. For example, we can have one index for customer data and another for the product catalogue. An index is identified by a name, and this name is used to refer to it when performing index, search and update operations against the documents in it. We can define as many indexes as we want in one cluster.
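
The sketch below shows how the index name appears in the REST paths for the three operations mentioned above. It only builds the method and path and does not send anything. The helper function names are hypothetical, and the paths assume the typeless endpoints used by Elasticsearch 7.x and later; older versions may use different paths.

```python
def index_request(index, doc_id):
    # Index (create or replace) a document with a given ID.
    return ("PUT", f"/{index}/_doc/{doc_id}")


def search_request(index):
    # Search the documents of one index.
    return ("GET", f"/{index}/_search")


def update_request(index, doc_id):
    # Partially update an existing document.
    return ("POST", f"/{index}/_update/{doc_id}")


print(index_request("partners", 1))   # ('PUT', '/partners/_doc/1')
print(search_request("partners"))     # ('GET', '/partners/_search')
print(update_request("partners", 1))  # ('POST', '/partners/_update/1')
```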

Node

A node is a single server that is part of the cluster. A node stores data and participates in the cluster’s indexing and search capabilities. A node is identified by a name that is assigned to it at startup; depending on the version, the default is either a random UUID or, from Elasticsearch 7.0 onwards, the machine’s hostname.

Cluster

A cluster is a collection of one or more nodes that together hold our entire data and provide indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is “elasticsearch”, so each node joins a cluster called “elasticsearch” unless it is configured otherwise.

Shard

Each index in Elasticsearch is divided into one or more shards, and each shard can have multiple copies. These copies are known as a replication group and must be kept in sync when documents are added or removed.

Data Replication Model

The process of keeping the shard copies in sync and serving reads from them is what we call the data replication model.

Motivation Behind Shard

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. Sharding allows us to horizontally scale the content volume.

It also allows us to distribute and parallelise operations across shards, thus increasing performance and throughput.
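
As a rough illustration of how documents end up spread across shards, the sketch below picks a shard by hashing a routing value (by default the document ID) and taking it modulo the number of primary shards. This is a simplified model: the function name shard_for is hypothetical, and zlib.crc32 is only a stand-in for the hash function Elasticsearch actually uses.

```python
import zlib


def shard_for(routing, num_primary_shards):
    # Stand-in hash: crc32 is used only because it is deterministic and in the
    # standard library. It is NOT the hash Elasticsearch uses internally.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


num_shards = 5
counts = [0] * num_shards
for doc_id in range(1000):
    counts[shard_for(str(doc_id), num_shards)] += 1

# Each entry is the number of documents that landed on that shard.
print(counts)
```

Because the shard is derived from the routing value alone, the same document ID always maps to the same shard, which is what lets a read find the shard that a write went to.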

Replication

Once replicated, each index will have primary shards and replica shards. Replication provides high availability in case a shard or node fails. For this reason, a replica shard is never allocated on the same node as the primary shard that it was copied from.
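
The rule that a replica never shares a node with its primary can be sketched with a simple round-robin placement. This is a toy allocator written for illustration, not the algorithm Elasticsearch uses; the function name allocate and the node names are hypothetical. The toy raises an error when there are too few nodes, whereas a real cluster would, as far as I know, leave the extra replicas unassigned.

```python
def allocate(nodes, num_primaries, num_replicas):
    """Place each shard's copies on distinct nodes (toy round-robin)."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("not enough nodes to keep every copy on a separate node")
    placement = {}
    for shard in range(num_primaries):
        # Copy 0 is the primary; copies 1..num_replicas are its replicas.
        placement[shard] = [
            nodes[(shard + copy) % len(nodes)] for copy in range(num_replicas + 1)
        ]
    return placement


placement = allocate(["node-a", "node-b", "node-c"], num_primaries=3, num_replicas=1)
for shard, copies in placement.items():
    print(shard, "primary:", copies[0], "replicas:", copies[1:])
# 0 primary: node-a replicas: ['node-b']
# 1 primary: node-b replicas: ['node-c']
# 2 primary: node-c replicas: ['node-a']
```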

Failure Handling

If the primary fails, the node hosting the primary will send a message to the master about it. The indexing operation will wait for the master to promote one of the replicas to be the new primary, and it will then be forwarded to the new primary for processing. The master also monitors the health of the nodes and may decide to proactively demote a primary. If an operation fails on one of the replica shards, the primary asks the master to remove that shard copy from the replication group, and the indexing operation is acknowledged only after that removal has been confirmed.
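
The two failure cases above can be sketched with a small toy model. It is a simplification for illustration only: the class ReplicationGroup, its methods and the node names are hypothetical, and the master's role is collapsed into plain method calls.

```python
class ReplicationGroup:
    """Toy model of one shard's copies (illustrative only)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.in_sync = [primary] + list(replicas)  # copies considered up to date

    def fail_primary(self):
        # The failed primary is dropped and one remaining copy is promoted.
        self.in_sync.remove(self.primary)
        if not self.in_sync:
            raise RuntimeError("no in-sync copy left to promote")
        self.primary = self.in_sync[0]
        return self.primary

    def index(self, doc, failing=()):
        # A replica that fails to apply the operation is removed from the
        # group before the operation is acknowledged.
        for copy in list(self.in_sync):
            if copy != self.primary and copy in failing:
                self.in_sync.remove(copy)
        return {"acknowledged": True, "in_sync": list(self.in_sync)}


group = ReplicationGroup("node-a", ["node-b", "node-c"])
new_primary = group.fail_primary()
ack = group.index({"name": "Stop N Shop"}, failing={"node-c"})
print(new_primary)     # node-b
print(ack["in_sync"])  # ['node-b']
```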

Usage

Full text search
Analytics store
Auto completer
Spell checker
Alerting engine
General purpose document store

References and further reading:

Elasticsearch official guide
