Learning the Basic Principles and Architecture of Hive

=Start=

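Background:

I have recently been doing a full review of the architecture, logging, and security-monitoring capabilities of the major data storage and processing systems. This time it is Hive's turn. I start by working through Hive's basic principles and architecture, which makes certain problems and design decisions easier to understand and gives me something to refer back to later.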

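Main text:

Reference answer:

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like querying. In essence, it translates SQL into MapReduce or Spark jobs for execution, with HDFS providing the underlying storage. Put simply, Hive can be thought of as a tool that turns SQL into MapReduce/Spark jobs; going one step further, you could even call Hive a SQL client for MapReduce/Spark.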
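
To make the "SQL becomes a MapReduce job" idea concrete, here is a minimal sketch in plain Python of how a query such as SELECT category, COUNT(*) FROM purchases GROUP BY category could be split into a map phase and a reduce phase. The table name, the sample rows, and all function names are made up for illustration; this is not how Hive generates its operators, only the general shape of the translation.

```python
from collections import defaultdict

# Rows as they might sit in a comma-delimited text file on HDFS (made-up sample data).
LINES = ["alice,books", "bob,games", "alice,games", "carol,books"]


def map_phase(line):
    # Deserialize one line into columns and emit (group key, 1).
    user, category = line.split(",")
    yield category, 1


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Aggregate all values for one key: this plays the role of COUNT(*).
    return key, sum(values)


def run_query(lines):
    # Roughly: SELECT category, COUNT(*) FROM purchases GROUP BY category
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


result = run_query(LINES)
print(result)
```

The GROUP BY column becomes the key emitted by the mapper, and the aggregate function becomes the body of the reducer.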

Main components of Hive

UI – The user interface for users to submit queries and other operations to the system.

Driver – The component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.

Compiler – The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.

Metastore – The component that stores all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored.
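
As a rough picture of what the metastore holds and how the compiler can use it, the sketch below models one table's metadata and a simple form of partition pruning (the query-flow description later in this post mentions that partitions are pruned based on query predicates). The class names, the table, and the paths are hypothetical; the real metastore schema is far richer. The SerDe class name is Hive's LazySimpleSerDe, used here only as a plausible value.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    values: dict    # partition column -> value, e.g. {"dt": "2024-05-18"}
    location: str   # directory holding this partition's files


@dataclass
class TableMeta:
    name: str
    columns: dict   # column name -> type name
    serde: str      # class used to serialize and deserialize rows
    location: str   # table root directory on HDFS
    partitions: list = field(default_factory=list)


def prune_partitions(table, wanted):
    """Keep only partitions whose values match every key in `wanted`."""
    return [
        p for p in table.partitions
        if all(p.values.get(k) == v for k, v in wanted.items())
    ]


logs = TableMeta(
    name="access_logs",
    columns={"ip": "string", "url": "string"},
    serde="org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
    location="/warehouse/access_logs",
    partitions=[
        Partition({"dt": "2024-05-17"}, "/warehouse/access_logs/dt=2024-05-17"),
        Partition({"dt": "2024-05-18"}, "/warehouse/access_logs/dt=2024-05-18"),
    ],
)

# A predicate like WHERE dt = '2024-05-18' only needs one of the two directories.
selected = prune_partitions(logs, {"dt": "2024-05-18"})
print([p.location for p in selected])
```

Because the partition values live in metadata, the pruning decision can be made before any data file is opened.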

Execution Engine – The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.
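
The "DAG of stages" idea can be sketched with Python's standard graphlib module: each stage lists the stages it depends on, and a stage runs only after all of its dependencies have finished. The stage names and the run_plan helper are invented for this example and do not correspond to Hive's internal classes.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: stage name -> set of stages it depends on.
PLAN = {
    "stage-1-scan-orders": set(),
    "stage-2-scan-users": set(),
    "stage-3-join": {"stage-1-scan-orders", "stage-2-scan-users"},
    "stage-4-move-to-table": {"stage-3-join"},
}


def run_plan(plan, run_stage):
    """Run every stage after all of its dependencies; return the order used."""
    order = []
    for stage in TopologicalSorter(plan).static_order():
        run_stage(stage)
        order.append(stage)
    return order


executed = run_plan(PLAN, lambda stage: print("running", stage))
```

The two scan stages have no dependency on each other, so a real engine could run them in parallel; this sketch simply runs them one after the other.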

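How Hive interacts with Hadoop

The architecture diagram above (from the Hive design document listed in the reference links) also shows how a typical query flows through the system.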

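The UI calls the execute interface of the Driver (step 1 in the diagram).
The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan (step 2).
The compiler fetches the necessary metadata from the metastore (steps 3 and 4).
This metadata is used to type-check the expressions in the query tree and to prune partitions based on the query predicates. The plan generated by the compiler (step 5) is a DAG of stages, where each stage is a map/reduce job, a metadata operation, or an operation on HDFS.
For map/reduce stages, the plan contains map operator trees (operator trees executed on the mappers) and a reduce operator tree (for operations that need reducers).
The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2, and 6.3).
In each task (mapper or reducer), the deserializer associated with the table or with the intermediate output reads rows from HDFS files, and the rows are passed through the associated operator tree. Once output is produced, it is written through the serializer to a temporary HDFS file (this happens in the mapper if the operation does not need a reduce). The temporary files feed the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location. This scheme ensures that dirty data is never read, because a file rename is an atomic operation in HDFS. For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the Driver (steps 7, 8, and 9).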
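
The "write to a temporary file, then rename into place" scheme described above can be illustrated on a local filesystem. This is only an analogy: it uses Python's os.replace rather than any HDFS API, and the directory layout, file names, and the comma-joined "serializer" are assumptions made for the example.

```python
import os
import tempfile


def write_then_publish(rows, staging_dir, table_dir, file_name):
    """Serialize rows into a staging file, then rename it into the table directory."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(table_dir, exist_ok=True)
    tmp_path = os.path.join(staging_dir, file_name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as tmp:
        for row in rows:
            tmp.write(",".join(row) + "\n")   # stand-in for a serializer
    final_path = os.path.join(table_dir, file_name)
    # The rename is a single step: readers of table_dir see either no file
    # or the complete file, never a half-written one.
    os.replace(tmp_path, final_path)
    return final_path


def read_rows(path):
    # Stand-in for a deserializer: turn each line back into a list of columns.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split(",") for line in f]


base = tempfile.mkdtemp()
staging = os.path.join(base, "staging")
table = os.path.join(base, "table")
published = write_then_publish(
    [("alice", "books"), ("bob", "games")], staging, table, "part-00000"
)
rows = read_rows(published)
print(rows)
```

Keeping the staging directory and the table directory on the same filesystem is what lets the final step be a rename instead of a copy.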

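1. executeQuery: The user submits a query through a Hive interface (the CLI, the Web UI, or a JDBC/ODBC client) to the Driver for execution.

2. getPlan: The Driver hands the query to the compiler, which parses it, checks its syntax, and works out the query plan and its requirements.

3. getMetaData: The compiler sends a metadata request to the Metastore.

4. sendMetaData: The Metastore returns the metadata to the compiler.

5. sendPlan: The compiler checks the requirements and sends the plan back to the Driver. At this point, parsing and compilation of the query are complete.

6. executePlan: The Driver sends the execution plan to the execution engine.

6.1. metaDataOps for DDLs: For statements that operate directly on database objects (creating or dropping a table, for example), the execution engine performs the corresponding metadata operations by talking directly to the Metastore.

6.1. executeJob: The MapReduce job runs. The execution engine submits the job to the ResourceManager, which assigns its tasks to NodeManagers, and the NodeManagers run the MapReduce tasks in a distributed fashion.

6.2. When the job finishes, its result is returned to the execution engine; step 6.3 happens alongside it.

6.3. The execution engine contacts the NameNode to locate the data.

7. fetchResults: The execution engine receives the results from the DataNodes.

8. sendResults: The execution engine sends the resulting values to the Driver.

9. sendResults: The Driver sends the results to the Hive interface.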
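
The numbered steps above can be traced with a toy simulation. The classes below are hypothetical stand-ins named after the components in this post; they only record the order in which the components call each other, and the trace labels are shorthand for the step names. DDL handling and the separate NameNode lookup are left out to keep the sketch short.

```python
class Metastore:
    def __init__(self, tables):
        self.tables = tables            # table name -> metadata dict

    def get_metadata(self, table, trace):
        trace.append("3 getMetaData")
        meta = self.tables[table]
        trace.append("4 sendMetaData")
        return meta


class Compiler:
    def __init__(self, metastore):
        self.metastore = metastore

    def get_plan(self, table, trace):
        trace.append("2 getPlan")
        meta = self.metastore.get_metadata(table, trace)
        trace.append("5 sendPlan")
        return {"scan": meta["location"]}   # a one-stage "plan"


class ExecutionEngine:
    def __init__(self, storage):
        self.storage = storage          # stand-in for HDFS: path -> rows

    def execute(self, plan, trace):
        trace.append("6 executePlan")
        trace.append("6.1 executeJob")
        rows = list(self.storage[plan["scan"]])
        trace.append("6.2 job done")
        return rows


class Driver:
    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def execute_query(self, table):
        trace = ["1 executeQuery"]
        plan = self.compiler.get_plan(table, trace)
        rows = self.engine.execute(plan, trace)
        trace.append("7-9 fetch and return results")
        return rows, trace


metastore = Metastore({"users": {"location": "/warehouse/users"}})
engine = ExecutionEngine({"/warehouse/users": [("alice",), ("bob",)]})
driver = Driver(Compiler(metastore), engine)
rows, trace = driver.execute_query("users")
print("\n".join(trace))
```

Reading the printed trace top to bottom gives the same ordering as the list: the plan is fully compiled, including the metastore round trip, before anything is submitted for execution.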

A simplified diagram:

The MapReduce engine shown above can be replaced with Tez or Spark.
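
In Hive the engine is chosen through the hive.execution.engine property, which can be set per session with a SET statement. The helper below is a small sketch that builds that statement; the function name is made up, and which of the values mr, tez, and spark are actually usable is an assumption that depends on the Hive version and on what is installed on the cluster.

```python
# Engine names accepted by hive.execution.engine in Hive releases that ship them
# (assumption: check the documentation for the version you run).
VALID_ENGINES = ("mr", "tez", "spark")


def engine_setting(engine):
    """Build the session-level SET statement for the chosen execution engine."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {VALID_ENGINES}")
    return f"SET hive.execution.engine={engine};"


statement = engine_setting("tez")
print(statement)
```

The generated statement would be run in a Hive session (for example through Beeline) before the query whose engine you want to change.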

Reference links:

Understanding Hive's Basic Architecture and Principles in One Article
https://blog.csdn.net/oTengYue/article/details/91129850

Apache Hive
https://cwiki.apache.org/confluence/display/Hive/Home

Hive design and architecture
https://cwiki.apache.org/confluence/display/Hive/Design

10. Hive Core Concepts and Architecture Principles
https://blog.51cto.com/u_10312890/2465756

An Analysis of Hive's Architecture
https://jiamaoxiang.top/2020/06/27/Hive%E7%9A%84%E6%9E%B6%E6%9E%84%E5%89%96%E6%9E%90/

Understanding Hive's System Architecture in One Article
https://blog.csdn.net/Shockang/article/details/118035262

Hive Architecture and Table Types
https://xie.infoq.cn/article/6949b16d4ebaeef022c003834

Hadoop: Hive Architecture and Design
https://www.cnblogs.com/thesungod/p/17612231.html

Hive service, HiveServer2 & MetaStore service?
https://stackoverflow.com/questions/49799838/hive-service-hiveserver2-metastore-service

How HiveServer2 Brings Security and Concurrency to Apache Hive
https://blog.cloudera.com/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/

Securing Apache Hive
https://docs.cloudera.com/cdw-runtime/1.5.1/securing-hive/hive_securing_hive.pdf

A Deep Dive into Apache Hive Architecture: From Data Storage to Data Analysis with SQL-like Hive Query Language
https://nexocode.com/blog/posts/what-is-apache-hive/

Architecture and Working of Hive
https://www.geeksforgeeks.org/architecture-and-working-of-hive/

Differences Between Hive, Hive on Spark, and SparkSQL
https://www.cnblogs.com/lixiaochun/p/9446350.html

Technology in the Big Data Era, Hive: An Introduction to Hive
https://www.cnblogs.com/sharpxiajun/archive/2013/06/02/3114180.html

How Hive Interacts with Hadoop
https://www.cnblogs.com/MrFee/p/hive_hadoop.html

Hive Architecture Principles: Chinese Translation of the Official Documentation
https://blog.csdn.net/JacksonKing/article/details/89637131

Hive Basic Concepts, Principles, and Underlying Architecture
https://blog.csdn.net/u013129109/article/details/81453582

Hive Architecture
https://www.hadoopdoc.com/hive/hive-architecture

=END=

May 18, 2024

Docker

Database, KnowledgeBase

architecture, Beeline, Hadoop, HDFS, Hive, HiveServer2, MapReduce, Metastore, Spark, Tez, big data