Hive SQL学习整理_2

=Start=

缘由：

继续整理一下常见Hive SQL的用法，方便要用的时候参考。

正文：

参考解答：

1、explode函数的简单使用

总结起来一句话：explode就是将Hive一行中复杂的array或者map结构拆分成多行。

select explode(array('A','B','C'));
select explode(array('A','B','C')) as col;
/*
A
B
C
*/

select explode(map('A',10,'B',20,'C',30));
select explode(map('A',10,'B',20,'C',30)) as (key,value);
/*
key	value
A	10
B	20
C	30
*/

2、lateral view使用时的一些注意事项

select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
/*
sql语法解析错误, SQL在 '('A','B','C')) tf' 的附近有语法错误，出错位置: 第1行, 第74列
*/

select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col; -- 这里的 tf as col 必不可少
/*
A
B
C
*/

SELECT myCol1, myCol2 FROM (select 0)
LATERAL VIEW explode(array('A','B','C')) myTable1 AS myCol1
LATERAL VIEW explode(array('x','y','z')) myTable2 AS myCol2
/*
FAILED: ParseException line 2:8 missing EOF at 'VIEW' near 'LATERAL'
*/

SELECT myCol1, myCol2 FROM (select 0) t
LATERAL VIEW explode(array('A','B','C')) myTable1 AS myCol1
LATERAL VIEW explode(array('x','y','z')) myTable2 AS myCol2
/*
A	x
A	y
A	z
B	x
B	y
B	z
C	x
C	y
C	z

Lateral View通常和UDTF一起出现，为了解决UDTF不允许在select字段的问题。
Multiple Lateral View可以实现类似笛卡尔乘积。
Outer关键字可以把不输出的UDTF的空结果，输出成NULL，防止丢失数据。
*/

select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;
/*
tf.key	tf.value
A	10
B	20
C	30
*/

-- lateral view与explode等udtf就是天生好搭档，explode将复杂结构一行拆成多行，然后再用lateral view做各种聚合。

3、字符串截取

select substr('abcde',3,2) -- 'cd'
select substr('abcde',0,2) -- 'ab'

-- 第3个参数指的是 length 而不是结尾字符的 index
select substr('abcde',0,-2) -- ''
select substring('abcde',0,-2) -- ''

4、json字符串解析、提取

参见之前记录的「Hive中如何加载和解析(嵌套)json数据」。

5、内置的聚合函数

返回类型	聚合函数名称	描述说明
DOUBLE	sum(col), sum(DISTINCT col)	求列的和 (returns the sum of the elements in the group or the sum of the distinct values of the column in the group)
DOUBLE	avg(col), avg(DISTINCT col)	求列的平均数(returns the average of the elements in the group or the average of the distinct values of the column in the group)
DOUBLE	min(col)	求列的最小值(returns the minimum value of the column in the group)
DOUBLE	max(col)	求列的最大值(returns the maximum value of the column in the group)

参考链接：

=END=

12 1 月, 2019

Docker

Database, KnowledgeBase, Tools

explode, Hive, Lateral View, SQL, UDTF

《 “Hive SQL学习整理_2” 》有 14 条评论

hi说道：

2019-01-15 18:50

【hive】去重操作
https://blog.csdn.net/TheSnowBoy_2/article/details/80135861

Hive之distinct多字段中出现null问题
https://blog.csdn.net/cjf_wei/article/details/84704241

Counting DISTINCT over multiple columns
https://stackoverflow.com/questions/1471250/counting-distinct-over-multiple-columns

回复
hi说道：

2019-02-28 19:56

hive的统计函数
https://blog.csdn.net/haramshen/article/details/52668586

Hive常用函数大全一览——8 集合统计函数
https://www.iteblog.com/archives/2258.html#i-8
http://lxw1234.com/archives/2015/06/251.htm

回复
hi说道：

2019-03-21 19:51

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://blog.csdn.net/kancy110/article/details/77074618
https://stackoverflow.com/questions/22463444/hive-error-parseexception-missing-eof
`
在Hive中不支持 ‘create temporary view table_name as’ 这种语法，但 spark 支持，所以在测试的时候需要指定引擎为 spark 而非默认的 hive 。
`

回复
hi说道：

2019-04-01 20:30

多个Lateral View
https://stackoverflow.com/questions/20667473/hive-explode-lateral-view-multiple-arrays
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-posexplode
https://blog.csdn.net/oopsoom/article/details/26001307
`
2.2 多个Lateral View
From语句后可以跟多个Lateral View。
A FROM clause can have multiple LATERAL VIEW clauses. Subsequent LATERAL VIEWS can reference columns from any of the tables appearing to the left of the LATERAL VIEW.
`

回复
hi说道：

2019-04-04 10:37

hive sql 调优相关总结
https://zhuanlan.zhihu.com/p/23174042
`
代码规范调优

1、在代码中经常使用到一些跑出来的中间表，又不想手动创建，使用with xxxtable as ()创建临时中间表。复杂的逻辑也可以尝试建中间表。

2、少用select * 没有必要的数据列不要查询出来，消耗资源，增加io压力。尽量尽早地过滤数据，减少每个阶段的数据量，对于分区表要加分区，同时只选择需要使用到的字段

3、做一些分组统计时，经常要用到不同维度下面的汇总，可以用group by with cube，效率比union all 汇总快2 3倍。

4、Order by（全局排序，一个reducer）效率比Sort by（不是全局排序，其在数据进入reducer前完成排序）低，窗口函数使用：row_number()over(distribute BY uid sort by dt desc)row_number()over(distribute BY uid sort by dt desc)

5、尽可能减少数据集，做关联时避免出现过多的笛卡尔数据集。

6、经常反复使用的表可以考虑建中间表。

7、避免使用count(distinct)操作，解决方法：先使用group by去重，再count计算。尽可能用group by 代替distinct

8、join时小表放在左边，大表放在右边？两张数据量相差特别大的表进行join操作时，如果小表特别小（比如几百几千条记录），使用mapjoin。

9、如果union all的部分个数大于2，或者每个union部分数据量大，应该拆成多个insert into 语句，实际测试过程中，执行时间能提升50%。
`

回复
hi说道：

2019-05-01 20:33

Hive SQL 解析及应用
https://www.jianshu.com/p/7cd2afacc9bb

回复
hi说道：

2019-06-26 11:06

Lateral View用法与 Hive UDTF explode
https://blog.csdn.net/oopsoom/article/details/26001307
`
Lateral view is used in conjunction with user-defined table generatingfunctions such as explode(). As mentioned in Built-in Table-Generating Functions, a UDTF generates zero or more output rows foreach input row. A lateral view first applies the UDTF to each row of base tableand then joins resulting output rows to the input rows to form a virtual tablehaving the supplied table alias.

Lateral view 其实就是用来和像类似explode这种UDTF函数联用的。lateral view 会将UDTF生成的结果放到一个虚拟表中，然后这个虚拟表会和输入行即每个game_id进行join 来达到连接UDTF外的select字段的目的。

在要使用 LATERAL VIEW explode 功能的SQL中，如果SQL语句中包含where条件的限制，需要将 LATERAL VIEW explode 放在 from 之后，where 之前，否则会报错！

可以在2个地方用Lateral view：
1. 在 UDTF 前面用
2. 在 from tbl_name 后面用

UDTF有两种使用方法，一种直接放到select后面，一种和lateral view一起使用。
`

Hive 中的复合数据结构简介以及一些函数的用法说明
https://my.oschina.net/leejun2005/blog/120463

hive中UDF、UDAF和UDTF详解
https://blog.csdn.net/zhaoli081223/article/details/46637517

回复
hi说道：

2019-12-09 17:47

Hive与SQL零碎知识汇总
https://yongyuan.name/blog/Hive-notes.html

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

解析Hive复杂字段1–lateral view explode
https://www.jianshu.com/p/f93adc8c57d0
`
get_json_object
regexp_replace
lateral view explode
str_to_map
`

回复
hi说道：

2019-12-24 10:54

hive SQL中如何将一行中的多列转换成多行？
https://stackoverflow.com/questions/39353425/how-to-convert-columns-into-rows-in-hive
`
/*
col1, col2, col3, col4, col5
->
col1, col2, col3
col1, col2, col4
col1, col2, col5
*/

select col1, col2, value
from
orig_table
lateral view explode(array(col3,col4,col5)) orig_table_alias as value
;
`

回复
hi说道：

2020-04-15 11:43

Hive 的insert into 和 insert overwrite
https://blog.csdn.net/qq_31382921/article/details/70911108
`
insert overwrite 会覆盖已经存在的数据，假如原始表使用overwrite 上述的数据，先现将原始表的数据remove，再插入新数据。

insert into 只是简单的插入，不考虑原始表的数据，直接追加到表中。
`

hive分区表insert into vs insert overwrite
https://www.cnblogs.com/lenmom/p/11373183.html

回复
hi说道：

2020-04-15 11:44

Hive之insert into 和insert overwrite
https://www.iteye.com/blog/snv-2085453
`
两种方式的相同点：

1.两个表的维度必须一样，才能够正常写入

2.如果查询出来的数据类型和插入表格对应的列数据类型不一致，将会进行转换，但是不能保证转换一定成功，比如如果查询出来的数据类型为int，插入表格对应的列类型为string，可以通过转换将int类型转换为string类型；但是如果查询出来的数据类型为string，插入表格对应的列类型为int，转换过程可能出现错误，因为字母就不可以转换为int，转换失败的数据将会为NULL。

不同点：

1.insert into是增加数据

2.insert overwrite是删除原有数据然后在新增数据，如果有分区那么只会删除指定分区数据，其他分区数据不受影响
`

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
`
Synopsis

* INSERT OVERWRITE will overwrite any existing data in the table or partition. unless IF NOT EXISTS is provided for a partition (as of Hive 0.9. 0).

* INSERT INTO will append to the table or partition, keeping the existing data intact. (Note: INSERT INTO syntax is only available starting in version 0.8.)
`

回复
hi说道：

2020-08-03 14:39

Hive MapJoin
https://www.cnblogs.com/MOBIN/p/5702580.html
`
MapJoin是Hive的一种优化操作，其适用于小表JOIN大表的场景，由于表的JOIN操作是在Map端且在内存进行的，所以其并不需要启动Reduce任务也就不需要经过shuffle阶段，从而能在一定程度上节省资源提高JOIN效率

方法一：
在Hive 0.11前，必须使用MAPJOIN来标记显示地启动该优化操作，由于其需要将小表加载进内存所以要注意小表的大小
SELECT /*+ MAPJOIN(smalltable)*/ .key,value
FROM smalltable JOIN bigtable ON smalltable.key = bigtable.key

方法二：
在Hive 0.11后，Hive默认启动该优化，也就是不在需要显示的使用MAPJOIN标记，其会在必要的时候触发该优化操作将普通JOIN转换成MapJoin，可以通过以下两个属性来设置该优化的触发时机

hive.auto.convert.join #默认值为true，自动开户MAPJOIN优化
hive.mapjoin.smalltable.filesize #默认值为2500000(25M),通过配置该属性来确定使用该优化的表的大小，如果表的大小小于此值就会被加载进内存中

注意：使用默认启动该优化的方式如果出现默名奇妙的BUG(比如MAPJOIN并不起作用),就将以下两个属性置为fase手动使用MAPJOIN标记来启动该优化
hive.auto.convert.join=false(关闭自动MAPJOIN转换操作)
hive.ignore.mapjoin.hint=false(不忽略MAPJOIN标记)
`

MapJoin: A Simple Way to Speed Up Your Hive Queries
https://grisha.org/blog/2013/04/19/mapjoin-a-simple-way-to-speed-up-your-hive-queries/

回复
hi说道：

2020-08-03 14:40

hive使用技巧（四）——巧用MapJoin解决数据倾斜问题
https://blog.csdn.net/kwu_ganymede/article/details/51365002
`
Hive的MapJoin，在Join 操作在 Map 阶段完成，如果需要的数据在 Map 的过程中可以访问到则不再需要Reduce。
小表关联一个超大表时，容易发生数据倾斜，可以用MapJoin把小表全部加载到内存在map端进行join，避免reducer处理。
`

LanguageManual JoinOptimization
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

Map Join in Hive | Map Side Join
https://data-flair.training/blogs/map-join-in-hive/

http://www.openkb.info/2016/01/difference-between-hivemapjoinsmalltabl.html

回复
abc说道：

2020-10-20 15:24

`
/*+ MAPJOIN(smalltable)*/ 这里的smalltable字符串不能包含「.」否则会报错，可以使用自定义别名来处理这种情况。

FAILED: SemanticException failed to parse query hint: line 1:20 mismatched input ‘.’ expecting RPAREN
org.apache.hadoop.hive.ql.parse.SemanticException: failed to parse query hint: line 1:20 mismatched input ‘.’ expecting RPAREN
`

回复

ASPIRE

Hive SQL学习整理_2

缘由：

正文：

参考解答：

1、explode函数的简单使用

2、lateral view使用时的一些注意事项

3、字符串截取

4、json字符串解析、提取

5、内置的聚合函数

参考链接：

《 “Hive SQL学习整理_2” 》有 14 条评论

回复 hi 取消回复

Hive SQL学习整理_2

缘由：

正文：

参考解答：

1、explode函数的简单使用

2、lateral view使用时的一些注意事项

3、字符串截取

4、json字符串解析、提取

5、内置的聚合函数

参考链接：

《 “Hive SQL学习整理_2” 》 有 14 条评论

回复 hi 取消回复

《 “Hive SQL学习整理_2” 》有 14 条评论