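Full-Text Search Engine Elasticsearch, Part 1

1 Elasticsearch at a Glance

Why learn Elasticsearch?

For computing over and analyzing massive data sets, we have already covered MapReduce, Hive, Spark, and Flink. These engines and tools focus on cleaning and aggregating data.
To quickly retrieve a batch of records that match some condition from a massive data set, each of these engines has to generate a job and submit it to the cluster for execution, and that round trip adds noticeable latency.

For queries that combine multiple conditions, these engines also generally end up scanning the whole table, so query efficiency is low.

Elasticsearch was created to address exactly these two needs: fast retrieval over massive data, and multi-condition queries.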

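Introduction to Elasticsearch

Elasticsearch is a distributed full-text search engine. It wraps the functionality of Lucene and offers near-real-time search that is stable, reliable, and fast.
If you already know Lucene, Elasticsearch will be easy to understand.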

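Common full-text search engines

Lucene
Lucene is the best-known open-source search library in the Java ecosystem and the de facto standard for full-text search in Java. It provides a complete query engine and indexing engine.
It does have some drawbacks:
1: It is not distributed and cannot scale out, so it becomes a bottleneck on massive data sets.
2: It only exposes low-level APIs, which are tedious to use.
3: It has no web UI, which makes it harder to manage.

Solr
Solr is a standalone, enterprise-grade search server written in Java and built on Lucene.
It addresses several of Lucene's pain points by providing a web UI and higher-level APIs.
Starting with Solr 4.0, Solr also supports distributed operation, known as SolrCloud.

Elasticsearch
Elasticsearch is an open-source, distributed search engine written in Java and built on Lucene, capable of near-real-time search.
Its most important trait is that it was distributed from day one; you could say Elasticsearch was born for distributed deployments.
It exposes a REST API that is easy to use, and a web UI for managing the cluster is available through external plugins.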

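Solr vs Elasticsearch

Elasticsearch is usually abbreviated as ES.

Solr and ES offer broadly similar functionality, so how should you choose between them at work?

Solr dates back to the mid-2000s and is still widely used in traditional enterprises. Solr 4.0, released in 2012, introduced SolrCloud and with it official support for distributed clusters.
ES was first released in 2010 and only reached version 1.0 in 2014, so it arrived several years after Solr.
However, ES was designed from the start for full-text search over massive data, so its distributed-cluster features are stronger than SolrCloud's.
Recommendations:

If your company already uses Solr heavily and now needs to search massive data, consider SolrCloud first.
If you have not used Solr before, consider ES first for massive-data scenarios.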

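MySQL vs Elasticsearch

To make ES easier to understand, here is a comparison between MySQL and ES:

Explanation:
1: MySQL has the concept of a Database; the ES counterpart is the Index.
2: MySQL has the concept of a Table; the ES counterpart is the Type. Note, however, that Types were fully supported only in ES 1.x through 5.x, where each Index could contain multiple Types.

From 6.0 on, each Index supports only one Type; this was a transitional stage.
From 7.0 on, Types are deprecated and the APIs are typeless by default (Types were removed completely in 8.0), so all data in an Index can be treated as the same kind of data.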

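Why were Types removed?

Mainly for performance reasons.
ES was originally modeled directly on relational databases, which is where the Type (table) concept came from.
But the ES search engine is built on Lucene, and that heritage makes Types redundant.
In a relational database, tables are independent of each other. In ES, data from different Types in the same Index is stored in the same underlying Lucene index.
If different Types in the same Index each have an id field, ES treats them as one and the same field, so you must give that field the same field type in every Type. Otherwise the identically named fields conflict during processing and Lucene becomes less efficient.
In addition, storing documents with different sets of fields under different Types of the same Index produces sparse data, which hurts Lucene's ability to compress documents and ultimately lowers ES query efficiency.

3: MySQL has the concept of a Row, representing one record; the ES counterpart is the Document.
4: MySQL has the concept of a Column, representing one column of a record; the ES counterpart is the Field.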

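Elasticsearch core concepts

Several core concepts in ES:
Cluster
Shard
Replica
Recovery

Let's look at each in turn:

Cluster
An ES cluster consists of multiple nodes, one of which is the master node. The master is chosen by election.

The master/non-master distinction only matters inside the cluster. One core trait of ES is decentralization: seen from outside, there is no central node. The cluster is logically a single whole, and talking to any one node is equivalent to talking to the entire cluster.

The master node is responsible for managing cluster state, including the state of shards and replicas, and for handling nodes joining and leaving the cluster.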

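Shard
A shard is a slice of an index. An ES cluster can split one index into multiple shards.

The benefit is that a large index can be partitioned horizontally into shards distributed across different nodes, forming a distributed search system that improves performance and throughput.

Note: the number of primary shards can only be set when the index is created and cannot be changed in place afterwards.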
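
Why the shard count is fixed becomes clearer from how documents are routed. In simplified form, ES picks the target shard as hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The following is a minimal Java sketch of that idea. The class name ShardRoutingSketch is hypothetical, and CRC32 is only a stand-in for the Murmur3 hash that ES actually uses, so the shard numbers printed here will not match a real cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration of hash-based shard routing (not ES source code).
class ShardRoutingSketch {

    // Maps a routing value (by default the document ID) to a shard number.
    // Assumption: CRC32 stands in for the Murmur3 hash used by real ES.
    static int shardFor(String routing, int numberOfPrimaryShards) {
        CRC32 crc = new CRC32();
        crc.update(routing.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is in [0, n).
        return (int) (crc.getValue() % numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        for (String id : new String[]{"1", "2", "3", "4"}) {
            System.out.println("id=" + id
                    + " -> shard " + shardFor(id, 3) + " (of 3)"
                    + ", shard " + shardFor(id, 5) + " (of 5)");
        }
    }
}
```

Because the result depends on the shard count, changing that count would send lookups for existing IDs to different shards, which is why it has to stay fixed for the life of the index.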

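By default an index has 1 shard.

Each shard can hold at most 2,147,483,519 documents, which is Integer.MAX_VALUE-128.
This is because every ES shard is backed by a Lucene index, and a single Lucene index can hold at most Integer.MAX_VALUE-128 documents.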
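
The arithmetic behind that limit can be checked directly. The sketch below, assuming a hypothetical class ShardDocLimit, computes the per-shard ceiling and the smallest number of primary shards that could hold a given document count. This is only the hard lower bound implied by the limit, not sizing advice.

```java
// Hypothetical helper illustrating the per-shard document limit.
class ShardDocLimit {

    // Upper bound on documents in one Lucene index, and so in one ES shard.
    static final int MAX_DOCS_PER_SHARD = Integer.MAX_VALUE - 128;

    // Smallest shard count whose combined limit covers totalDocs (ceiling division).
    static long minPrimaryShards(long totalDocs) {
        long max = MAX_DOCS_PER_SHARD;
        return (totalDocs + max - 1) / max;
    }

    public static void main(String[] args) {
        System.out.println(MAX_DOCS_PER_SHARD);                // prints 2147483519
        System.out.println(minPrimaryShards(10_000_000_000L)); // prints 5
    }
}
```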

Note: before ES 7.0, each index had 5 shards by default.

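Replica
A replica is a copy of a shard. An ES cluster can configure replicas for its shards.

The first purpose of replicas is fault tolerance: when a shard is damaged or lost, it can be recovered from a replica. The second is query performance: ES automatically load-balances search requests across the copies.

Note: the number of replicas can be changed at any time.
By default each index has 1 primary shard and 1 replica of it. The replica is only allocated when the cluster has 2 or more nodes: ES never places a primary and its replica on the same node, because that would provide no protection, so on a single-node cluster the replica stays unassigned.
For data safety and better query performance, setting the number of replicas to 2 or 3 is recommended.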
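
The relationship between primaries, replicas, and nodes comes down to two small formulas. The sketch below uses a hypothetical class ReplicaMath. It relies on the rule stated above, that a primary and its replicas never share a node.

```java
// Hypothetical helper for shard-copy arithmetic.
class ReplicaMath {

    // Every primary shard exists once, plus once per configured replica.
    static int totalShardCopies(int primaries, int replicas) {
        return primaries * (1 + replicas);
    }

    // Copies of the same shard never share a node, so assigning every copy
    // takes at least one node per copy of a single shard.
    static int minNodesForAllCopies(int replicas) {
        return replicas + 1;
    }

    public static void main(String[] args) {
        System.out.println(totalShardCopies(3, 1));  // prints 6
        System.out.println(minNodesForAllCopies(2)); // prints 3
    }
}
```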

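Recovery
Recovery means restoring data or redistributing it.

When nodes join or leave, the ES cluster reallocates shards according to machine load. Data is also recovered when a failed node restarts.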

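2 Getting Started with Elasticsearch

Installing and deploying ES

ES supports both single-node and cluster deployments, and they are used in exactly the same way.
First download the ES package. At the time of writing the latest major version is 7.x; version 7.13.4 is used here.

Download URL:
https://www.elastic.co/cn/downloads/past-releases#elasticsearch
Select the matching ES version.

Note: ES now ships with a bundled OpenJDK, so there is no need to install Oracle JDK separately.

Before installing the cluster, let's go through the core ES configuration file:
The config directory under ES_HOME contains elasticsearch.yml, a YAML file.
The content of elasticsearch.yml is as follows: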

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
# The cluster name defaults to elasticsearch. For several ES instances to form one cluster, their cluster.name must be identical.
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
# The node name can be set manually; a default is generated automatically if it is not set.
#node.name: node-1
#
# Add custom attributes to the node:
# Custom attributes to attach to the node.
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
# Directory where ES stores its data; defaults to ES_HOME/data.
#path.data: /path/to/data
#
# Path to log files:
# Directory where ES stores its logs; defaults to ES_HOME/logs.
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
# Lock the process memory so that ES memory is never swapped out, i.e. ES never uses the swap partition.
#bootstrap.memory_lock: true
# In ES 7.x the heap size is set with -Xms/-Xmx in config/jvm.options (the old ES_HEAP_SIZE variable no longer applies).
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
# Swapping severely degrades ES performance, which is why the memory lock above matters.
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# By default Elasticsearch is only accessible on localhost. Set a different
# address here to expose this node on the network:
# IP address ES binds to. The default is 127.0.0.1, so by default ES is reachable only via 127.0.0.1 or localhost.
# ES 1.x bound to 0.0.0.0 by default; since ES 2.x the default is 127.0.0.1.
#network.host: 192.168.0.1
#
# By default Elasticsearch listens for HTTP traffic on the first free port it
# finds starting at 9200. Set a specific HTTP port here:
# Port the ES HTTP service listens on; the default is 9200.
# To run several ES instances on one machine, change the port here.
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
# 
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
# 
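# When a new node starts, it uses this list of hosts to discover other nodes and form a cluster.
# Default host list:
#   127.0.0.1 is the IPv4 loopback address.
#   [::1] is the IPv6 loopback address.
# ES 1.x used multicast by default and automatically discovered ES nodes on the same network segment.
# Since ES 2.x the default is unicast, so to form a cluster the nodes to discover must be listed here.
#
# The nodes that should be assembled into one ES cluster: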
#discovery.seed_hosts: ["host1", "host2"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
# Initial set of master-eligible nodes; used only when a brand-new cluster bootstraps for the first time, to run the first master election.
#cluster.initial_master_nodes: ["node-1", "node-2"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
# Forbid deleting indices with wildcards or _all; indices to delete must be named explicitly.
#action.destructive_requires_name: true

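The bootstrap.memory_lock setting asks the JVM to lock its heap in memory so that the operating system cannot swap it out to disk. This improves the performance and stability of Elasticsearch, because garbage collection never has to touch pages that have been swapped out. With bootstrap.memory_lock enabled, the JVM reserves whatever memory it needs up front. To use this setting, the matching ulimit or sysctl parameters must also be configured in the operating system or Docker container.

You can also run several Elasticsearch nodes on one machine and form a cluster from them. Download Elasticsearch from the official site and unpack it (7.x bundles its own JDK, so a separate JDK is not required), then edit `elasticsearch.yml` for each node to set the cluster name, node name, network host, HTTP port, and so on. You also need to adjust the Linux system settings, open the required ports, and create an es user to start the nodes. Finally, use `curl` to check the status of the nodes.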

ES single node

1: Upload the ES package to the /data/soft directory on bigdata01.

2: Add a regular Linux user: es.
ES refuses to start as the root user.

[root@bigdata01 soft]# useradd -d /home/es -m es
[root@bigdata01 soft]# passwd es
Changing password for user es.
New password: bigdata1234
Retype new password: bigdata1234
passwd: all authentication tokens updated successfully.

3: Raise the Linux limits for maximum file descriptors and maximum virtual memory.
ES has minimum requirements for both, and will not start properly unless they are raised.

[root@bigdata01 soft]# vi /etc/security/limits.conf 
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
[root@bigdata01 soft]# vi /etc/sysctl.conf
vm.max_map_count=262144

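4: Reboot the Linux system.
Rebooting is the simplest way to make both changes take effect. (Alternatively, sysctl -p applies the sysctl.conf change immediately, and the limits.conf change applies to new login sessions.)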

[root@bigdata01 soft]# reboot -h now

5: Unpack the ES package.
[root@bigdata01 soft]# tar -zxvf elasticsearch-7.13.4-linux-x86_64.tar.gz

6: Set the ES_JAVA_HOME environment variable to point at the JDK bundled with ES.

[root@bigdata01 soft]# vi /etc/profile
......
export ES_JAVA_HOME=/data/soft/elasticsearch-7.13.4/jdk
......
[root@bigdata01 soft]# source /etc/profile

7: Change the permissions of the elasticsearch-7.13.4 directory.
The package was unpacked as root, so the es user has no permission on the files under elasticsearch-7.13.4.

[root@bigdata01 soft]# chmod 777 -R /data/soft/elasticsearch-7.13.4

8: Switch to the es user.

[root@bigdata01 soft]# su es

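9: Edit elasticsearch.yml.
The two parameters to change are network.host and discovery.seed_hosts.

Note: in YAML, there must be a space between the colon after the key and the value.

For example: network.host: bigdata01
There must be a space before bigdata01, otherwise ES reports an error.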

[es@bigdata01 soft]$ cd elasticsearch-7.13.4
[es@bigdata01 elasticsearch-7.13.4]$ vi config/elasticsearch.yml 
......
network.host: bigdata01
discovery.seed_hosts: ["bigdata01"]
......

10: Start ES [in the foreground].
[es@bigdata01 elasticsearch-7.13.4]$ bin/elasticsearch

Press Ctrl+C to stop the service.


11: Start ES [in the background].
In real deployments ES should run in the background.
[es@bigdata01 elasticsearch-7.13.4]$ bin/elasticsearch -d

12: Verify the ES service.
Use the jps command to check that the process exists.
[es@bigdata01 elasticsearch-7.13.4]$ jps
1849 Elasticsearch

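Check through the web interface that the service is reachable on port 9200: http://bigdata01:9200/

Note: the firewall must be turned off (or port 9200 opened).

13: Stop the ES service.
Use the kill command with the process ID reported by jps.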

[es@bigdata01 elasticsearch-7.13.4]$ jps
1849 Elasticsearch
[es@bigdata01 elasticsearch-7.13.4]$ kill 1849

14: Troubleshooting with logs.
If ES fails to start properly, check the ES logs.
All ES logs are in the logs directory under ES_HOME. The main service log is elasticsearch.log.

ES cluster

Cluster plan:
bigdata01
bigdata02
bigdata03

1: Create a regular user named es on bigdata01, bigdata02, and bigdata03.
Follow the same steps as in the single-node setup.

[root@bigdata01 soft]# useradd -d /home/es -m es
[root@bigdata01 soft]# passwd es
Changing password for user es.
New password: bigdata1234
Retype new password: bigdata1234
passwd: all authentication tokens updated successfully.

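useradd -d /home/es -m es names the home directory explicitly with -d and uses -m to make sure that directory is created. For a plain useradd es, the home directory defaults to a directory named after the user under the HOME base directory set in /etc/default/useradd (normally /home, giving /home/es). Whether that directory is actually created without -m depends on the distribution's CREATE_HOME setting in /etc/login.defs. To change the home directory of an existing user, use usermod -d, adding -m to move the existing contents to the new location.

2: Raise the maximum file descriptor and maximum virtual memory limits on bigdata01, bigdata02, and bigdata03.
Follow the same steps as in the single-node setup.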

[root@bigdata01 soft]# vi /etc/security/limits.conf 
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
[root@bigdata01 soft]# vi /etc/sysctl.conf
vm.max_map_count=262144

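3: Reboot bigdata01, bigdata02, and bigdata03 so the changed parameters take effect.
Follow the same steps as in the single-node setup.
4: Set the ES_JAVA_HOME environment variable on bigdata01, bigdata02, and bigdata03 to point at the JDK bundled with ES.
Follow the same steps as in the single-node setup.
5: On bigdata01, unpack the ES package again and change the directory permissions.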

[root@bigdata01 soft]# tar -zxvf elasticsearch-7.13.4-linux-x86_64.tar.gz
[root@bigdata01 soft]# chmod 777 -R /data/soft/elasticsearch-7.13.4

6: Edit elasticsearch.yml.
The three parameters to change are network.host, discovery.seed_hosts, and cluster.initial_master_nodes.

7: Copy the configured elasticsearch-7.13.4 directory from bigdata01 to bigdata02 and bigdata03.
[root@bigdata01 soft]# scp -rq elasticsearch-7.13.4 bigdata02:/data/soft/
[root@bigdata01 soft]# scp -rq elasticsearch-7.13.4 bigdata03:/data/soft/

8: Edit elasticsearch.yml on bigdata02 and bigdata03.
On bigdata02, change the value of network.host to the hostname of the current node.
[root@bigdata02 elasticsearch-7.13.4]# vi config/elasticsearch.yml 
......
network.host: bigdata02
......

On bigdata03, likewise change the value of network.host to the hostname of the current node.

9: Start ES on bigdata01, bigdata02, and bigdata03.
Start it on bigdata01.

[root@bigdata01 elasticsearch-7.13.4]# su es
[es@bigdata01 elasticsearch-7.13.4]$ bin/elasticsearch -d

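Start it on bigdata02.
Start it on bigdata03.

10: Verify the processes in the cluster.
Check on each of bigdata01, bigdata02, and bigdata03 that the process exists.

11: Verify that the nodes have formed one cluster.
The ES REST API makes it easy to view the nodes in the cluster.
http://bigdata01:9200/_nodes/_all?pretty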

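ES cluster monitoring and management tool: cerebro

cerebro is a convenient tool for managing and monitoring an ES cluster.
1: Download the cerebro package from GitHub.
https://github.com/lmenezes/cerebro/releases

2: Upload the downloaded cerebro-0.9.4.zip to the /data/soft directory on bigdata01 and unpack it.

Note: cerebro can be deployed on any machine that can reach the ES cluster.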

[root@bigdata01 soft]# ll cerebro-0.9.4.zip 
-rw-r--r--. 1 root root 57251010 Sep 11  2021 cerebro-0.9.4.zip
[root@bigdata01 soft]# unzip cerebro-0.9.4.zip

3: Start cerebro.
Run cerebro in the background.

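[root@bigdata01 cerebro-0.9.4]# nohup bin/cerebro >/dev/null 2>&1 &

This command runs `cerebro` in the background on Linux so that it keeps running after you close the terminal. `nohup` makes the program ignore the hangup signal. `>/dev/null` sends standard output to `/dev/null`, and `2>&1` then points standard error at the same place, so all output is discarded. The order matters: writing `2>&1` first would leave standard error attached to the original output. The trailing `&` runs the command in the background.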

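Note: cerebro listens on port 9000 by default. If that port is already in use, change the port cerebro listens on.

The port can be set at startup with the http.port parameter, for example:
bin/cerebro -Dhttp.port=1234

By default the cerebro web UI is reachable on port 9000.

4: Use cerebro. In Node address, enter the connection details of any node in the ES cluster.

5: Monitor and manage the ES cluster with cerebro.

Note: a cluster has three possible states: green, yellow, and red.

green: the cluster is healthy and fully usable.
yellow: the cluster is at risk but still usable, typically because some replicas are missing. For example, each shard should have 2 replicas but currently only has 1.
red: the cluster is in a failed state and cannot be used normally, typically because some primary shards are unavailable.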
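
The three states follow a simple rule, which can be sketched in Java as below. This is a simplified model under stated assumptions: the class ClusterHealthSketch and its parameters are hypothetical, and a real cluster evaluates health per shard and per index before rolling it up.

```java
// Hypothetical, simplified model of how cluster health is derived.
class ClusterHealthSketch {

    enum Status { GREEN, YELLOW, RED }

    static Status classify(int primaries, int assignedPrimaries,
                           int expectedReplicas, int assignedReplicas) {
        if (assignedPrimaries < primaries) {
            return Status.RED;     // some data is unavailable
        }
        if (assignedReplicas < expectedReplicas) {
            return Status.YELLOW;  // all data is served, but redundancy is reduced
        }
        return Status.GREEN;       // every primary and every replica is assigned
    }

    public static void main(String[] args) {
        System.out.println(classify(3, 3, 6, 6)); // prints GREEN
        System.out.println(classify(3, 3, 6, 3)); // prints YELLOW
        System.out.println(classify(3, 2, 6, 4)); // prints RED
    }
}
```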

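6: cerebro features.

6.1: View node information.

6.2: The rest feature, for calling the REST API from the page.

6.3: More features, including index creation, cluster settings, aliases, analysis, and index templates.

Basic ES operations

ES can be operated in many ways, as described in the official documentation: https://www.elastic.co/guide/index.html

In real projects, the REST API is the recommended choice if you want to stay language-neutral, since it has the best compatibility. In my experience, though, some operations are cumbersome because you have to assemble request strings by hand.

Java programmers can also use the Java API. It takes more code than the REST API, but the code is clearer to read.

Both approaches are used below.

Operating ES through the REST API

To use the REST API from the Linux command line, you need the curl tool.
curl is an open-source command-line tool for transferring data using URL syntax, and it makes common GET/POST requests easy.

With curl, -X specifies the request method and -d specifies the data to send.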

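Index operations (create, delete)

Common HTTP request methods include:

- **GET**: retrieves data from the server. Request parameters are appended to the URL as a query string. GET should only be used to read data and should not cause side effects.
- **POST**: submits data to the server. Request parameters go in the HTTP request body, which can carry large amounts of data. POST is typically used to submit forms or upload files.
- **HEAD**: like GET, but returns only the HTTP headers, not the body.
- **PUT**: uploads a resource to the server, typically to create or replace it.
- **DELETE**: deletes a resource on the server.
- **OPTIONS**: asks which HTTP methods the server supports.
- **TRACE**: traces the path of a request and its response.
- **CONNECT**: establishes a network tunnel, typically for SSL encryption.

Create an index:
curl -XPUT 'http://bigdata01:9200/test/'
In ES 7.x, index creation must use a PUT request.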

[root@bigdata01 soft]# curl  -XPUT 'http://bigdata01:9200/test/'
{"acknowledged":true,"shards_acknowledged":true,"index":"test"}

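Note: an index name must be all lowercase, must not start with _, -, or +, and must not contain commas.

Delete the index:
[root@bigdata01 soft]# curl -XDELETE 'http://bigdata01:9200/test/'
{"acknowledged":true}

Note: an index can be created ahead of time, or you can simply name a non-existent index when adding data later and ES will create it automatically by default.

The difference between manual and automatic creation is that manual creation lets you customize the number of shards.
The following creates an index with 3 shards.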

[root@bigdata01 soft]# curl -H "Content-Type: application/json" -XPUT 'http://bigdata01:9200/test/' -d'{"settings":{"index.number_of_shards":3}}'
{"acknowledged":true,"shards_acknowledged":true,"index":"test"}

-H sets an HTTP request header in curl. Here, -H "Content-Type: application/json" sets the Content-Type header to application/json, telling the server that the request body is JSON.
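
For comparison, the same request can be expressed in plain Java. The sketch below is a rough equivalent under stated assumptions: it uses the JDK's built-in java.net.http API (Java 11 or later), the class name CreateIndexRequestSketch is hypothetical, and it only builds the request without sending it, so it runs without a cluster.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: the curl -XPUT ... -H ... -d ... call expressed as a JDK HttpRequest.
class CreateIndexRequestSketch {

    // JSON body matching the curl -d argument above.
    static String settingsBody(int shards) {
        return "{\"settings\":{\"index.number_of_shards\":" + shards + "}}";
    }

    static HttpRequest createIndex(String baseUrl, String index, int shards) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index))                        // target URL
                .header("Content-Type", "application/json")                     // curl -H
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(shards))) // curl -XPUT with -d
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = createIndex("http://bigdata01:9200", "test", 3);
        System.out.println(request.method() + " " + request.uri());
        // prints: PUT http://bigdata01:9200/test
    }
}
```

To actually send it, pass the request to java.net.http.HttpClient's send method with a string body handler.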

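In the cerebro view, solid boxes are primary shards and dashed boxes are replica shards.
Shard numbers start at 0, and shards exist physically on disk, which you can verify on the cluster. The UI also shows that shards 1 and 2 of the test index are on the bigdata01 node.

On bigdata01, all ES data lives in the ES data directory, which defaults to the data directory under ES_HOME: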
[root@bigdata01 1IQ2r-vqRxSsicd8BzWPtg]# pwd
/data/soft/elasticsearch-7.13.4/data/nodes/0/indices/1IQ2r-vqRxSsicd8BzWPtg
[root@bigdata01 1IQ2r-vqRxSsicd8BzWPtg]# ll
total 0
drwxrwxr-x. 5 es es 49 Feb 26 18:01 1
drwxrwxr-x. 5 es es 49 Feb 26 18:01 2
drwxrwxr-x. 2 es es 24 Feb 26 18:01 _state

Here 1IQ2r-vqRxSsicd8BzWPtg is the UUID of the index.

Document operations (create, delete, update, query)

Add a document

[root@bigdata01 soft]# curl -H "Content-Type: application/json" -XPOST 'http://bigdata01:9200/emp/_doc/1' -d '{"name":"tom","age":20}'
{"_index":"emp","_type":"_doc","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

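Notes:

1. The emp index does not exist yet. ES creates it automatically on first use, with the default of 1 shard.

2. For compatibility with earlier APIs, the position of the Type in the URL is still reserved even though ES has dropped Types; the official recommendation is to use _doc everywhere.

Note: if no ID is given when adding a document, ES generates a random unique ID automatically.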

[root@bigdata01 soft]# curl -H "Content-Type: application/json" -XPOST 'http://bigdata01:9200/emp/_doc' -d '{"name":"jack","age":30}' 
{"_index":"emp","_type":"_doc","_id":"EFND8aMBpApLBooiIWda","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":1,"_primary_term":1}

Query documents:
Fetch the document with id=1.

[root@bigdata01 soft]# curl -XGET 'http://bigdata01:9200/emp/_doc/1?pretty'
{
  "_index" : "emp",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "tom",
    "age" : 20
  }
}

Fetch only some of the fields.

[root@bigdata01 soft]# curl -XGET 'http://bigdata01:9200/emp/_doc/1?_source=name&pretty'
{
  "_index" : "emp",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "tom"
  }
}
[root@bigdata01 soft]# curl -XGET 'http://bigdata01:9200/emp/_doc/1?_source=name,age&pretty'
{
  "_index" : "emp",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "tom",
    "age" : 20
  }
}

Query all data in a given index.

[root@bigdata01 soft]# curl -XGET 'http://bigdata01:9200/emp/_search?pretty'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "emp",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "tom",
          "age" : 20
        }
      },
      {
        "_index" : "emp",
        "_type" : "_doc",
        "_id" : "EVPO8aMBpApLBooib2e7",
        "_score" : 1.0,
        "_source" : {
          "name" : "jack",
          "age" : 30
        }
      }
    ]
  }
}

Note: queries like this can also be run in a browser or in cerebro, where the result is easier to read.

As an extension, here is a query executed through the REST API.

[root@bigdata01 ~]# curl -H "Content-Type: application/json" -XGET 'http://bigdata01:9200/stuinfo/_search?pretty' -d'{"query":{"match":{"address":"bj"}}}'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "stuinfo",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "sex" : "man",
          "name" : "zs",
          "age" : 20
        }
      }
    ]
  }
}

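Update a document

Updates are either full or partial.
Full update: same as adding a document. If a document with the given id already exists, it is replaced.

Note: on an update, ES first marks the old document as deleted and then adds the new document.

The old document does not disappear immediately, but it can no longer be accessed. ES cleans up documents marked as deleted in the background as you continue to add data.

Partial update: adds new fields or updates existing ones. It must use a POST request.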

[root@bigdata01 soft]# curl -H "Content-Type: application/json" -XPOST 'http://bigdata01:9200/emp/_update/1' -d '{"doc":{"age":25}}'
{"_index":"emp","_type":"_doc","_id":"1","_version":2,"result":"updated","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":2,"_primary_term":1}

Delete a document

Delete the document with id=1.

[root@bigdata01 soft]# curl -XDELETE 'http://bigdata01:9200/emp/_doc/1'
{"_index":"emp","_type":"_doc","_id":"1","_version":3,"result":"deleted","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":3,"_primary_term":1}


[root@bigdata01 soft]# curl -XDELETE 'http://bigdata01:9200/emp/_doc/1'
{"_index":"emp","_type":"_doc","_id":"1","_version":4,"result":"not_found","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":4,"_primary_term":1}

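If the document exists, the response has result set to deleted and _version incremented by 1.

If the document does not exist, the response has result set to not_found, but _version is still incremented by 1 here. This is part of ES's versioning, which ensures that the order of operations across nodes is tracked correctly.
Every write to a document, whether index, update, or delete, causes ES to increment _version by 1. The increment is atomic and is guaranteed to have happened by the time the operation returns successfully.

Note: deleting a document does not take effect on disk immediately either; the document is only marked as deleted. ES cleans up content marked as deleted in the background later, as you add more data.

Bulk operations

The Bulk API executes multiple requests in one call, which improves efficiency.
Format: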
{ action: { metadata }}
{ request body }

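Explanation:

action: index/create/update/delete
metadata: _index, _type, _id (_type is deprecated in 7.x and can be omitted)
request body: _source (not needed for delete)

Difference between create and index: if the document already exists, create fails with an error saying the document already exists, while index succeeds (it acts as an update).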
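
The format above is newline-delimited JSON: one action line, optionally followed by one body line, each terminated by a newline. As a minimal sketch of how a program might assemble such a payload, the hypothetical BulkBodySketch class below builds it with plain string concatenation. It assumes index names and IDs contain no characters that need JSON escaping, and it omits the deprecated _type.

```java
// Hypothetical builder for a Bulk request body (newline-delimited JSON).
class BulkBodySketch {
    private final StringBuilder body = new StringBuilder();

    BulkBodySketch index(String index, String id, String sourceJson) {
        return append("index", index, id, sourceJson);
    }

    BulkBodySketch create(String index, String id, String sourceJson) {
        return append("create", index, id, sourceJson);
    }

    // A partial update wraps the changed fields in a "doc" object.
    BulkBodySketch update(String index, String id, String partialDocJson) {
        return append("update", index, id, "{\"doc\":" + partialDocJson + "}");
    }

    // delete has an action line only, no body line.
    BulkBodySketch delete(String index, String id) {
        return append("delete", index, id, null);
    }

    private BulkBodySketch append(String action, String index, String id, String bodyLine) {
        body.append("{\"").append(action).append("\":{\"_index\":\"").append(index)
            .append("\",\"_id\":\"").append(id).append("\"}}\n");
        if (bodyLine != null) {
            body.append(bodyLine).append('\n');
        }
        return this;
    }

    // Every line, including the last one, ends with a newline.
    String build() {
        return body.toString();
    }

    public static void main(String[] args) {
        String payload = new BulkBodySketch()
                .index("test", "1", "{\"field1\":\"value1\"}")
                .delete("test", "2")
                .update("test", "1", "{\"field2\":\"value2\"}")
                .build();
        System.out.print(payload);
    }
}
```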

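Consider an example: suppose a batch of data lives in MySQL. You would first read the data from MySQL and then convert it into the format Bulk requires.

Here the Bulk input is written by hand.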

[root@bigdata01 elasticsearch-7.13.4]# vi request 
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }

{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "field1" : "value1" }

{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }

{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value1" }

{ "update" : {"_index" : "test", "_type" : "_doc","_id" : "1" } }
{ "doc" : {"field2" : "value2"} }

Run the Bulk API

[root@bigdata01 elasticsearch-7.13.4]# curl -H "Content-Type: application/json"  -XPUT 'http://bigdata01:9200/test/_doc/_bulk' --data-binary @request
{"took":167,"errors":false,"items":[{"index":{"_index":"test","_type":"_doc","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":0,"_primary_term":1,"status":201}},{"index":{"_index":"test","_type":"_doc","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":0,"_primary_term":1,"status":201}},{"delete":{"_index":"test","_type":"_doc","_id":"2","_version":2,"result":"deleted","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":1,"_primary_term":1,"status":200}},{"create":{"_index":"test","_type":"_doc","_id":"3","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":2,"_primary_term":1,"status":201}},{"update":{"_index":"test","_type":"_doc","_id":"1","_version":2,"result":"updated","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":1,"_primary_term":1,"status":200}}]}

[root@bigdata01 elasticsearch-7.13.4]# curl -XGET 'http://bigdata01:9200/test/_search?pretty'
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc", 
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "field1" : "value1"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "field1" : "value1",
          "field2" : "value2"
        }
      }
    ]
  }
}

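How much data can one Bulk request handle?

Bulk loads the data to be processed into memory, so the amount is limited. The optimal size is not a fixed number; it depends on the cluster hardware, document size and complexity, and the indexing and search load on the ES cluster.

A common recommendation is 1000-5000 documents per request, fewer if the documents are large, with a total payload of 5-15 MB. By default a request cannot exceed 100 MB.
The limit can be raised through http.max_content_length: 100mb in the ES configuration file, but this is not recommended, because very large Bulk requests are also slow.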
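
In practice this means splitting a large data set into batches before sending it. The sketch below shows one way to do that in plain Java. The class BulkBatcher is hypothetical, and it splits by document count only; a production version would likely also cap the payload size in bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that splits documents into Bulk-sized batches.
class BulkBatcher {

    // Splits docs into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < docs.size(); from += batchSize) {
            int to = Math.min(from + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add(i);
        }
        List<List<Integer>> batches = partition(docs, 1000);
        System.out.println(batches.size());        // prints 3
        System.out.println(batches.get(2).size()); // prints 500
    }
}
```

Each batch would then be turned into one Bulk request, so a single oversized or failing request never blocks the whole load.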

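Operating ES through the Java API

For the Java API, ES currently provides two Java REST clients:

1. Java Low Level REST Client:
A low-level REST client that talks to the cluster over HTTP. You assemble the request JSON and parse the response JSON yourself. It is compatible with all Elasticsearch versions.
This is essentially a thin Java wrapper around the REST API described above: instead of issuing requests with curl, you issue the HTTP requests from Java code.

2. Java High Level REST Client:
A high-level REST client built on top of the low-level one. It adds APIs for building request JSON and parsing response JSON. The client's major version should match the cluster's, otherwise version conflicts can occur.
It was introduced around ES 6.0 so that requests and responses can be handled in an object-oriented Java style.
The high-level client is forward compatible with later minor versions of the cluster. For example, code written against ES 7.0 can talk to any 7.x cluster.
If the cluster is later upgraded to 8.x, code written against ES 7.0 must be upgraded as well.

If long-term compatibility matters most, use the Java Low Level REST Client.
If ease of use matters most, use the Java High Level REST Client.
The Java High Level REST Client is used here.

Create a Maven project: db_elasticsearch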
Create a package: com.imooc.es
Add the ES dependency and the logging dependency to pom.xml.

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.13.4</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <!-- 2.17.1 or later: versions up to 2.14.1 are affected by CVE-2021-44228 -->
    <version>2.17.1</version>
</dependency>

Add log4j2.properties under the resources directory.

appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = [%d{ISO8601}][%-5p][%-25c] %marker%m%n

rootLogger.level = info
rootLogger.appenderRef.console.ref = console

Index operations (create, delete)

package com.imooc.es;


import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

import java.io.IOException;

/**
 * Index-level operations in ES
 * 1: create an index
 * 2: delete an index
 * Created by xuwei
 */
public class EsIndexOp {
    public static void main(String[] args) throws Exception{
        //Obtain a RestClient connection
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("bigdata01", 9200, "http"),
                        new HttpHost("bigdata02", 9200, "http"),
                        new HttpHost("bigdata03", 9200, "http")));

        //Create the index
        //createIndex(client);

        //Delete the index
        //deleteIndex(client);


        //Close the connection
        client.close();
    }


    private static void deleteIndex(RestHighLevelClient client) throws IOException {
        DeleteIndexRequest deleteRequest = new DeleteIndexRequest("java_test");
        //Execute
        client.indices().delete(deleteRequest, RequestOptions.DEFAULT);
    }


    private static void createIndex(RestHighLevelClient client) throws IOException {
        CreateIndexRequest createRequest = new CreateIndexRequest("java_test");
        //Index settings
        createRequest.settings(Settings.builder()
                .put("index.number_of_shards", 3)//number of shards
        );

        //Execute
        client.indices().create(createRequest, RequestOptions.DEFAULT);
    }

}

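Running the code prints a warning that the ES cluster has no security (authentication) enabled. In practice, controlling access to the cluster's IPs and ports at the operations level is usually considered sufficient inside a company network.

Document operations (create, delete, update, query, Bulk)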

package com.imooc.es;

import org.apache.http.HttpHost;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.search.fetch.subphase.FetchSourceContext;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/**
 * Document-level operations in ES
 * create, read, update, delete
 * Created by xuwei
 */
public class EsDataOp {
    private static Logger logger = LogManager.getLogger(EsDataOp.class);

    public static void main(String[] args) throws Exception{
        //Obtain a RestClient connection
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("bigdata01", 9200, "http"),
                        new HttpHost("bigdata02", 9200, "http"),
                        new HttpHost("bigdata03", 9200, "http")));

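        //Add documents
        //addIndexByJson(client);
        //addIndexByMap(client);

        //Query documents
        //getIndex(client);
        //getIndexByFiled(client);

        //Update documents
        //Note: indexing a document again with the same id fully replaces the existing data
        //updateIndexByPart(client);//partial update

        //Delete a document
        //deleteIndex(client);

        //Bulk operations
        //bulkIndex(client);

        //Close the connection
        client.close();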
    }

    private static void bulkIndex(RestHighLevelClient client) throws IOException {
        BulkRequest request = new BulkRequest();
        request.add(new IndexRequest("emp").id("20")
                .source(XContentType.JSON,"field1", "value1","field2","value2"));
        request.add(new DeleteRequest("emp", "10"));//deleting a document that does not exist (for example id 10) does not raise an error
        request.add(new UpdateRequest("emp", "11")
                .doc(XContentType.JSON,"age", 19));
        request.add(new UpdateRequest("emp", "12")//no document with id 12 exists, so this item fails when executed
                .doc(XContentType.JSON,"age", 19));
        //Execute
        BulkResponse bulkResponse = client.bulk(request, RequestOptions.DEFAULT);
        //A failure of individual items does not fail the whole Bulk request, so check each item for a reported failure
        for (BulkItemResponse bulkItemResponse : bulkResponse) {
            if (bulkItemResponse.isFailed()) {
                BulkItemResponse.Failure failure = bulkItemResponse.getFailure();
                logger.error("Bulk item failed: "+failure);
            }
        }
    }

    private static void deleteIndex(RestHighLevelClient client) throws IOException {
        DeleteRequest request = new DeleteRequest("emp", "10");
        //Execute
        client.delete(request, RequestOptions.DEFAULT);
    }

    private static void updateIndexByPart(RestHighLevelClient client) throws IOException {
        UpdateRequest request = new UpdateRequest("emp", "10");
        String jsonString = "{\"age\":23}";
        request.doc(jsonString, XContentType.JSON);
        //Execute
        client.update(request, RequestOptions.DEFAULT);
    }

    private static void getIndexByFiled(RestHighLevelClient client) throws IOException {
        GetRequest request = new GetRequest("emp", "10");
        //Fetch only some of the fields
        String[] includes = new String[]{"name"};//fields to include
        String[] excludes = Strings.EMPTY_ARRAY;//fields to exclude
        FetchSourceContext fetchSourceContext = new FetchSourceContext(true, includes, excludes);
        request.fetchSourceContext(fetchSourceContext);
        //Execute
        GetResponse response = client.get(request, RequestOptions.DEFAULT);
        //Read the index, id, and document content (source) from the response
        String index = response.getIndex();
        String id = response.getId();
        if(response.isExists()){//isExists returns false when no document was found
            //Document as a JSON string
            String sourceAsString = response.getSourceAsString();
            System.out.println(sourceAsString);
            //Document as a Map
            Map<String, Object> sourceAsMap = response.getSourceAsMap();
            System.out.println(sourceAsMap);
        }else{
            logger.warn("No document found in index {} with id {}!",index,id);
        }
    }

    private static void getIndex(RestHighLevelClient client) throws IOException {
        GetRequest request = new GetRequest("emp", "10");
        //Execute
        GetResponse response = client.get(request, RequestOptions.DEFAULT);
        //Read the index, id, and document content (source) from the response
        String index = response.getIndex();
        String id = response.getId();
        if(response.isExists()){//isExists returns false when no document was found
            //Document as a JSON string
            String sourceAsString = response.getSourceAsString();
            System.out.println(sourceAsString);
            //Document as a Map
            Map<String, Object> sourceAsMap = response.getSourceAsMap();
            System.out.println(sourceAsMap);
        }else{
            logger.warn("No document found in index {} with id {}!",index,id);
        }
    }

    private static void addIndexByMap(RestHighLevelClient client) throws IOException {
        IndexRequest request = new IndexRequest("emp");
        request.id("11");
        HashMap<String, Object> jsonMap = new HashMap<String, Object>();
        jsonMap.put("name", "tom");
        jsonMap.put("age", 17);
        request.source(jsonMap);
        //Execute
        client.index(request, RequestOptions.DEFAULT);
    }

    private static void addIndexByJson(RestHighLevelClient client) throws IOException {
        IndexRequest request = new IndexRequest("emp");
        request.id("10");
        String jsonString = "{" +
                "\"name\":\"jessic\"," +
                "\"age\":20" +
                "}";
        request.source(jsonString, XContentType.JSON);
        //Execute
        client.index(request, RequestOptions.DEFAULT);
    }
}
