Elasticsearch Guide [8.10]

Elasticsearch
Author

Shitao5

Published

2023-09-20

Modified

2023-09-25

Progress

Learning Progress: 34.38%.

Learning Source

1 What is Elasticsearch?

1.1 Data in: documents and indices

Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.

When a document is stored, it is indexed and fully searchable in near real time, within 1 second. Elasticsearch uses a data structure called an inverted index that supports very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
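
To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is a toy model only: it assumes whitespace tokenization and lowercasing, and the function names `build_inverted_index` and `search` are made up for this note. Elasticsearch's real analysis chain and index format are far more involved.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique lowercase word to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Three tiny example documents keyed by ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick thinking saves the day",
}
index = build_inverted_index(docs)
print(index["brown"])              # [1, 2]
print(search(index, "quick the"))  # [1, 3]
```

A lookup touches only the posting lists of the query words, never the full text of every document, which is why this kind of structure makes full-text search fast.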

An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data. By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure. For example, text fields are stored in inverted indices, and numeric and geo fields are stored in BKD trees. The ability to use the per-field data structures to assemble and return search results is what makes Elasticsearch so fast.

Elasticsearch also has the ability to be schema-less, which means that documents can be indexed without explicitly specifying how to handle each of the different fields that might occur in a document. When dynamic mapping is enabled, Elasticsearch automatically detects and adds new fields to the index. This default behavior makes it easy to index and explore your data: just start indexing documents and Elasticsearch will detect and map booleans, floating point and integer values, dates, and strings to the appropriate Elasticsearch data types.
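
The type detection described above can be imitated with a small Python function. This is a simplified sketch, not Elasticsearch's actual logic: the function name `guess_field_type` is hypothetical, date detection here only recognizes the `YYYY-MM-DD` form, and the real dynamic mapping also adds a `keyword` sub-field to strings, which is left out.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess an Elasticsearch field type for a JSON value (simplified sketch)."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            # Assumption: only plain ISO dates count as dates in this sketch.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    return "unknown"


# A sample document; the field names are made up for illustration.
doc = {
    "title": "Elasticsearch Guide",
    "published": "2023-09-20",
    "progress": 34.38,
    "chapters": 2,
    "draft": False,
}
detected = {field: guess_field_type(value) for field, value in doc.items()}
print(detected)
```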

Ultimately, however, you know more about your data and how you want to use it than Elasticsearch can. You can define rules to control dynamic mapping and explicitly define mappings to take full control of how fields are stored and indexed.
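
As a sketch of what taking control can look like, the body below defines an explicit mapping as a Python dict. The field names are hypothetical; the body would be sent when creating an index (for example `PUT /notes`, where `notes` is an assumed index name). Setting `dynamic` to `strict` is one way to reject documents that contain unmapped fields.

```python
import json

# Request body for index creation with an explicit mapping.
# All field names here are illustrative assumptions.
create_index_body = {
    "mappings": {
        "dynamic": "strict",  # reject documents that contain unmapped fields
        "properties": {
            "title": {"type": "text"},       # analyzed for full-text search
            "author": {"type": "keyword"},   # exact-match filtering and aggregations
            "published": {"type": "date", "format": "yyyy-MM-dd"},
            "progress": {"type": "float"},
        },
    }
}

print(json.dumps(create_index_body, indent=2))
```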

1.2 Information out: search and analyze

The Elasticsearch REST APIs support structured queries, full text queries, and complex queries that combine the two. Structured queries are similar to the types of queries you can construct in SQL.

Because aggregations leverage the same data structures used for search, they are also very fast. This enables you to analyze and visualize your data in real time.

What’s more, aggregations operate alongside search requests. You can search documents, filter results, and perform analytics at the same time, on the same data, in a single request.
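
A single request that searches, filters, and aggregates might look like the sketch below. It only builds the JSON body in Python; the helper name `build_search_request` and the field names `title`, `published`, and `author` are assumptions for illustration. The body would be sent to the `_search` endpoint of an index.

```python
import json


def build_search_request(text, min_date, agg_field):
    """Build a search body combining a full-text query, a filter, and an aggregation."""
    return {
        "query": {
            "bool": {
                # Full-text part: scored match on an analyzed text field.
                "must": [{"match": {"title": text}}],
                # Structured part: an unscored range filter, similar to a SQL WHERE clause.
                "filter": [{"range": {"published": {"gte": min_date}}}],
            }
        },
        # Analytics part: bucket the matching documents by a field's values.
        "aggs": {
            "by_field": {"terms": {"field": agg_field}},
        },
        "size": 10,
    }


request = build_search_request("elasticsearch", "2023-01-01", "author")
print(json.dumps(request, indent=2))
```

The aggregation runs over the same documents the query matches, so hits and statistics come back together in one response.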

1.3 Scalability and resilience: clusters, nodes, and shards

Elasticsearch is built to be always available and to scale with your needs. It does this by being distributed by nature. You can add servers (nodes) to a cluster to increase capacity and Elasticsearch automatically distributes your data and query load across all of the available nodes. There is no need to overhaul your application: Elasticsearch knows how to balance multi-node clusters to provide scale and high availability. The more nodes, the merrier.

How does this work? Under the covers, an Elasticsearch index is really just a logical grouping of one or more physical shards, where each shard is actually a self-contained index. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy, which both protects against hardware failures and increases query capacity as nodes are added to a cluster. As the cluster grows (or shrinks), Elasticsearch automatically migrates shards to rebalance the cluster.

There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard. A replica shard is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.

The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time, without interrupting indexing or query operations.
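
One way to see why the primary shard count is fixed is to look at how a document could be assigned to a shard. The sketch below assumes a simplified rule, hash of the routing value modulo the number of primary shards, and uses `zlib.crc32` purely as a stand-in hash. Elasticsearch uses its own hash function and extra routing details, and the name `pick_shard` is hypothetical.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Pick a primary shard for a document (simplified: hash modulo shard count)."""
    # crc32 is a deterministic stand-in for the real routing hash.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6"]

# Placement with 3 primary shards versus 4 primary shards.
placement_3 = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
placement_4 = {doc_id: pick_shard(doc_id, 4) for doc_id in doc_ids}

# Documents whose shard would change if the shard count changed.
moved = [doc_id for doc_id in doc_ids if placement_3[doc_id] != placement_4[doc_id]]
print(placement_3)
print(placement_4)
print("would move:", moved)
```

Because the shard is derived from the shard count, changing that count would send lookups for existing documents to the wrong shard. Replicas are plain copies of a primary, so their number can change without affecting routing.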

There are a number of performance considerations and trade-offs with respect to shard size and the number of primary shards configured for an index. The more shards, the more overhead there is simply in maintaining those indices. The larger the shard size, the longer it takes to move shards around when Elasticsearch needs to rebalance a cluster.

Querying lots of small shards makes the processing per shard faster, but more queries mean more overhead, so querying a smaller number of larger shards might be faster. In short, it depends.

As a starting point:

  • Aim to keep the average shard size between a few GB and a few tens of GB. For use cases with time-based data, it is common to see shards in the 20GB to 40GB range.

  • Avoid the gazillion shards problem. The number of shards a node can hold is proportional to the available heap space. As a general rule, the number of shards per GB of heap space should be less than 20.

  • The best way to determine the optimal configuration for your use case is through testing with your own data and queries.
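
The rules of thumb above translate into simple arithmetic. The sketch below is a back-of-the-envelope helper only: the function names are made up, the 30 GB default target is an assumed midpoint of the 20GB to 40GB range, and the results are starting points to validate through testing.

```python
import math


def estimate_primary_shards(total_data_gb, target_shard_gb=30):
    """Estimate how many primary shards keep the average shard near the target size."""
    return max(1, math.ceil(total_data_gb / target_shard_gb))


def shard_budget_for_heap(heap_gb, shards_per_heap_gb=20):
    """Upper bound on shards for a node; the guideline is to stay below this number."""
    return int(heap_gb * shards_per_heap_gb)


# Example: 600 GB of data, nodes with 8 GB of heap each.
print(estimate_primary_shards(600))  # 20
print(shard_budget_for_heap(8))      # 160
```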

A cluster’s nodes need good, reliable connections to each other. To provide better connections, you typically co-locate the nodes in the same data center or nearby data centers. However, to maintain high availability, you also need to avoid any single point of failure. In the event of a major outage in one location, servers in another location need to be able to take over. The answer? Cross-cluster replication (CCR).

CCR provides a way to automatically synchronize indices from your primary cluster to a secondary remote cluster that can serve as a hot backup. If the primary cluster fails, the secondary cluster can take over. You can also use CCR to create secondary clusters to serve read requests in geo-proximity to your users.

Cross-cluster replication is active-passive. The index on the primary cluster is the active leader index and handles all write requests. Indices replicated to secondary clusters are read-only followers.
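
As a rough sketch of the leader and follower relationship, the helper below builds the request that, as far as I understand the CCR follow API, creates a read-only follower index on the secondary cluster. The function name and the example cluster alias and index names are hypothetical, and the remote cluster must already be configured on the secondary cluster.

```python
def build_follow_request(follower_index, remote_cluster, leader_index):
    """Build the method, path, and body for creating a CCR follower index."""
    path = f"/{follower_index}/_ccr/follow"
    body = {
        "remote_cluster": remote_cluster,  # alias of the primary (leader) cluster
        "leader_index": leader_index,      # index on the primary cluster to replicate
    }
    return "PUT", path, body


# Hypothetical names: follower "logs-copy" follows "logs" on the cluster aliased "primary-dc".
method, path, body = build_follow_request("logs-copy", "primary-dc", "logs")
print(method, path, body)
```

Writes keep going to the leader index on the primary cluster; the follower only pulls changes and serves reads.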

Note

Read the sections on queries and aggregations as needed.

2 Search your data

2.1 Filter search results
