Hive SQL学习整理

=Start=

缘由：

最近用Hive SQL查数据比较多，虽然HQL大体上和常用的SQL没什么区别，但是也有一些特别的地方需要注意，在此整理一下，方便要用的时候参考。

正文：

参考解答：

1、不等于

请使用 <> 而不要用 !=

2、对于NULL列的处理

column1 is not NULL AND column1 <> '' AND length(column1) > 0

3、两表联合查询

-- 在MySQL中将关联条件放在 where 中
SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;

-- 在Hive SQL中将关联条件放在 on 中
SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

4、多表联合查询

-- 两表单条件
SELECT a.* FROM a JOIN b ON (a.id = b.id)
-- 两表多条件
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)

--多表联合join
SELECT a.val, b.val, c.val
FROM
a
JOIN
b ON (a.key = b.key1)
JOIN
c ON (c.key = b.key2)

5、模糊匹配

like用于指定特定的字符串做模糊匹配（用 % 指代任意个数的任意字符，用 _ 指代一个任意字符）
select uid from dw.today where tunittype like '%wew.%'

rlike是Hive中对like的扩展，可以将原先多个like才能完成的任务，使用一个rlike就可以搞定
select uid from dw.today where tunittype rlike '.*(you|me).*'

点号(.)：表示和任意字符匹配；星号(*)：表示重复“左边的字符串”；（x|y）表示和x或者y匹配

对like取反
一般，like的语法形式是： A LIKE B，取反的语法形式：NOT A LIKE B
select uid from dw.today where not tunittype like '%wew.%'

6、当前时间&将秒格式的时间戳按特定格式展示

SELECT from_unixtime(unix_timestamp())

7、将毫秒格式的时间戳转换成可读的日期格式

select from_unixtime(1543908748678,'yyyyMMdd');	# 508940709
select from_unixtime(1543908748.678,'yyyyMMdd') # FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ''yyyyMMdd'': No matching method for class org.apache.hadoop.hive.ql.udf.UDFFromUnixTime with (double, string). Possible choices: _FUNC_(bigint)  _FUNC_(bigint, string)  _FUNC_(int)  _FUNC_(int, string)
select from_unixtime(1543908748,'yyyyMMdd') # 20181204

select 1543908748678, from_unixtime(CAST(1543908748678/1000 as BIGINT), 'yyyy-MM-dd'); # 1543908748678	2018-12-04

from_unixtime(CAST(1543908748678/1000 as BIGINT), 'yyyy-MM-dd_HH:mm:ss') # 1543908748678	2018-12-04_15:32:28

/*
$ date -r 1543908748
2018年12月 4日 星期二 15时32分28秒 CST
*/

8、如何检查逗号分隔的字符串中是否包含某个字符串？

list_ids = "abc,cde,efg";
select * from table_name where array_contains(split(list_ids,','), 'cde');

9、如何判断某个元素是否在一个array中？

方法一：
array_contains(Array<T>, value)

方法二：
先 explode 取出特定列之后，再进行常规的判等操作。

10、字符串连接函数

-- 字符串连接函数：concat
-- 语法: concat(string A, string B...)
hive> select concat('www','.ixyzero','.com');
www.ixyzero.com

-- 带分隔符字符串连接函数：concat_ws
-- 语法: concat_ws(string SEP, string A, string B...)
hive> select concat_ws('.','www','ixyzero','com');
www.ixyzero.com

参考链接：

=END=

11 1 月, 2019

Docker

Database, Programing, Tools

Hive, join, SQL

《 “Hive SQL学习整理” 》有 21 条评论

hi说道：

2019-01-11 21:51

SQL中的Case When Else End的用法
https://blog.csdn.net/ruidongliu/article/details/11735507
`
update操作：
update table_name
set 字段1=case
when 条件1 then 值1
when 条件2 then 值2
else 值3
end
where …

select操作：
select 字段1, 字段2,
case 字段3
when 值1 then 新值
when 值2 then 新值
end as 重新命名字段3的名字
from table_name
where …
order by …
`
how to write case and group by in hive query
https://stackoverflow.com/questions/37440828/how-to-write-case-and-group-by-in-hive-query

回复
hi说道：

2019-01-24 19:20

hive 正则表达式详解
https://blog.csdn.net/bitcarmanlee/article/details/51106726

regular expression in hive
https://stackoverflow.com/questions/46060475/regular-expression-in-hive
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-RelationalOperators

Hive SQL中regexp/regexp_extract的简单使用样例（主要是为了说明需要用「\\」表示常规的「\」）
`
select 13112349876 regexp ‘\\d{11}’
— true

select 13112349876 regexp ‘\d{11}’
— false

select ‘big-corporation’ regexp ‘^big-‘, ‘big-corporation’ regexp ‘^big-inc’, ‘big-corporation’ regexp ‘(.+?)corp(.+?)’
— true false true

select ‘big corporation’ regexp ‘^abc|^btw|corp$’
— false

select ‘big corp’ regexp ‘^abc|^btw|corp$’
— true

select regexp_extract(‘{“bssid”:”12:34:56:78:c4:90″,”ssid”:”wifi-guest”}’, ‘bssid\\”:\\”(.*?)\\”‘)
— 12:34:56:78:c4:90
`

回复
hi说道：

2019-01-24 19:33

Hive 临时表
https://blog.csdn.net/opensure/article/details/51378754
`
Hive从0.14.0开始提供创建临时表的功能，表只对当前session有效，session退出后，表自动删除。

语法：
CREATE TEMPORARY TABLE TABLE_NAME_HERE (key string, value string)

注意点：
1、如果创建的临时表表名已存在，那么当前session引用到该表名时实际用的是临时表，只有drop或rename临时表名才能使用原始表；
2、临时表限制：不支持分区字段和创建索引。
`

Create temporary table in Hive?
https://stackoverflow.com/questions/5385163/create-temporary-table-in-hive
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-TemporaryTables

hive 学习系列三（表格的创建create-table）
https://blog.csdn.net/eases_stone/article/details/80607109

回复
hi说道：

2019-01-29 14:30

hive:条件判断函数
https://www.cnblogs.com/kxdblog/p/4034243.html
`
• If 函数 : if
语法: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
返回值: T
说明: 当条件testCondition为TRUE时，返回valueTrue；否则返回valueFalseOrNull

hive> select if(1=2,100,200);
200
hive> select if(1=1,100,200);
100

• 非空查找函数 : COALESCE
语法: COALESCE(T v1, T v2, …)
返回值: T
说明: 返回参数中的第一个非空值；如果所有值都为NULL，那么返回NULL

hive> select COALESCE(null,’100′,’50’);
100

• 条件判断函数： CASE
语法 : CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
返回值 : T
说明：如果 a 等于 b ，那么返回 c ；如果 a 等于 d ，那么返回 e ；否则返回 f

hive> select case 100
when 50 then ‘tom’
when 100 then ‘mary’
else ‘tim’
end;
mary
`

回复
hi说道：

2019-01-29 14:33

hive 常用运算
http://www.cnblogs.com/mliudong/archive/2013/02/27/2934710.html
`
第一部分：关系运算
第二部分：逻辑运算与数学运算
第三部分：数值运算
第四部分：日期函数
第五部分：条件函数
第六部分：字符串函数
`

回复
hi说道：

2019-01-29 17:00

Hive SQL操作与函数自定义（一）
https://blog.csdn.net/u013980127/article/details/52604882
Hive SQL操作与函数自定义（二）
https://blog.csdn.net/u013980127/article/details/52606024

Hive基本操作
http://www.cnblogs.com/smartloli/p/4354291.html

回复
hi说道：

2019-02-20 20:42

Hive SQL中如何获取当前时间？
https://stackoverflow.com/questions/17905873/how-to-select-current-date-in-hive-sql
`
select unix_timestamp(), — 1550666245
from_unixtime(unix_timestamp()), — 2019-02-20 20:37:25
to_date(from_unixtime(unix_timestamp())), — 2019-02-20
CURRENT_DATE, — 2019-02-20
YEAR(CURRENT_DATE()), — 2019
CURRENT_TIMESTAMP — 2019-02-20 20:37:24.641
;
— 1550666245 2019-02-20 20:37:25 2019-02-20 2019-02-20 2019 2019-02-20 20:37:24.641

select from_unixtime(1550666245,’yyyyMMdd’); — 20190220
`

回复
hi说道：

2019-03-19 20:17

7张图学会SQL
https://zhuanlan.zhihu.com/p/57451327
`
第1周：SQL入门
第2周：查询基础
第3周：聚合与排序
第4周：数据更新
第5周：复杂查询
第6周：函数、谓词、case表达式
第7周：集合运算
`

回复
hi说道：

2019-03-26 20:31

Hive SQL中如何获取某个字符串中特定字符的出现次数？
Hive – Count number of occurences of character
https://stackoverflow.com/questions/36736022/hive-count-number-of-occurences-of-character
`
— 先将「非特定字符」替换成「空」，然后再计算字符串的剩余长度
SELECT LENGTH(regexp_replace(‘220138|251965797?AIRFR?150350161961|||||’,'[^|]’,”));
— 6
`

回复
hi说道：

2019-04-01 21:15

CONCAT_WS()函数对于参数类型的限制
Argument 2 of function CONCAT_WS must be “string or array”, but “bigint” was found.
`
# 解决办法一，使用 concat 一点点拼接
hive> select concat(‘www’,’.ixyzero’,’.com’);
http://www.ixyzero.com

# 解决办法二，强制类型转换
暂略
`
http://www.thelandbeyondspreadsheets.com/how-to-join-strings-in-sql-for-hive/
http://hadooptutorial.info/string-functions-in-hive/
https://stackoverflow.com/questions/16009608/combine-columns-from-multiple-columns-into-one-in-hive
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions

回复
hi说道：

2019-04-01 21:21

Hive数据类型转换
https://www.iteblog.com/archives/892.html
`
任何整数类型都可以隐式地转换成一个范围更大的类型。TINYINT,SMALLINT,INT,BIGINT,FLOAT和STRING都可以隐式地转换成DOUBLE；是的你没看出，STRING也可以隐式地转换成DOUBLE！但是你要记住，BOOLEAN类型不能转换为其他任何数据类型！
`

Hive – Converting a string to bigint
https://stackoverflow.com/questions/28867438/hive-converting-a-string-to-bigint
`
SELECT CAST(‘00321’ AS BIGINT);
— ‘321’
`

Hive Converting from Double to String Not in Scientific
https://stackoverflow.com/questions/32576187/hive-converting-from-double-to-string-not-in-scientific
`
select 9999999902.0, cast(9999999902.0 as BIGINT), cast(cast(9999999902.0 as BIGINT) as string);
— 9.999999902E9 9999999902 9999999902
`

回复
hi说道：

2019-04-08 19:41

https://stackoverflow.com/questions/21088865/error-message-tok-allcolref-is-not-supported-in-current-context-while-using-d
`
报错信息：
FAILED: SemanticException TOK_ALLCOLREF is not supported in current context

Hive doesn’t support DISTINCT * syntax. You can manually specify every field of the table to get the same result.
错误原因：执行distinct *时报错，hive不支持。
解决办法：手动写明每一个字段即可解决。
`
https://blog.csdn.net/ltliyue/article/details/52292819
https://stanzhai.site/blog/post/stanzhai/Hive%E6%89%A7%E8%A1%8CSQL%E6%8F%90%E7%A4%BASemanticException-TOK_ALLCOLREF-is-not-supported-in-current-context%E7%9A%84%E9%97%AE%E9%A2%98-2

回复
hi说道：

2019-04-10 18:03

HiveSQL中的类型转换
HiveSQL中如何将字符串数组转换成整型数组？ (Hive : casting array to array in query)
https://stackoverflow.com/questions/32871399/hive-casting-arraystring-to-arrayint-in-query
`
hive> select array(‘1′,’2′,’3’) string_array;
[“1″,”2″,”3”]

hive> select explode(string_array) array_element –对字符串数组进行切分
from (select array(‘1′,’2′,’3’) string_array –初始化字符串数组
)s
1
2
3

hive> select collect_list(cast(array_element as int)) int_array –cast and collect array
from( select explode(string_array) array_element –explode array
from (select array(‘1′,’2′,’3’) string_array –initial array
)s
)s;
[1,2,3]
`

How to convert an array to array in Hive
https://stackoverflow.com/questions/43462561/how-to-convert-an-arraydate-to-arraystring-in-hive
`
select concat_ws(‘,’,collect_set(date)) from table;
`

Hive: Convert String to Integer
https://stackoverflow.com/questions/12346750/hive-convert-string-to-integer
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions
`
cast(str_column as int)
`

回复
hi说道：

2019-12-23 11:31

SQL中如何选择多列进行排序
https://stackoverflow.com/questions/2051162/sql-multiple-column-ordering
`
# 针对每一列「均」需要「显式」指定它的排序方式(正序/逆序)
ORDER BY column1 DESC, column2

ORDER BY column1 DESC, column3 DESC, column2

ORDER BY column1 DESC, column3 DESC, column4 DESC, column2
`

回复
hi说道：

2019-12-28 11:53

如何写好 5000 行的 SQL 代码
https://mp.weixin.qq.com/s/p0HwSDkAdF1i4GToGMAi6A
`
上千行的 SQL 代码常见，且永不过时！

下面的领悟来自我实战中真实的想法，趟过无数次的坑，用教训总结出来的几条自认为极有用的经验。

1、理解业务
你肯定不会去写没有业务逻辑的代码。充分理解业务逻辑对你有两个好处：一是写出可执行的并且可扩展的代码；二是主动了解业务将有利于职业生涯升级。

2、快速实现
村上春树、海明威、博尔赫斯，从来写小说都是第一遍爽快的写下去了，一旦写得卡壳了怎么办，束之高阁，明儿继续。我这里想说的策略，大家都可以猜得到了。先把业务实现了再说，命名规则，变量申明，事务控制以及性能优化，统统先放起来。写好 CRUD 交上第一稿，存档，Over!

作家们要是等灵感来了再动笔写，我们哪能看到那么多有趣的故事。同样，我们写代码哪能等到全盘都考虑好了再动手呢。想到一个数据流，用到哪些表，直接就可以写了。等着等着就慌了，写着写着思路就来了。

3、重构与测试
《巴黎评论》中，村上春树提到他的小说经常修改 4 – 5 遍才交稿，而且编辑还需要修改。我们一遍过的 SQL 就免检了？这个时候才考验你 SQL 真实功底和编码素质。

再检查命名规则，变量申明，事务控制以及性能优化。你会发现还有很多事情要做。

4、版本控制
如果你的团队没有 git, SVN, TFS 这些 Source Code Version Control, 赶紧上一个。没有自动化部署工具，自己想办法整一个。都 2020 年了，别偷懒吧。

5、复盘记录
做好上面4步，对公司项目是有个交代了。但做这一步，才是对自己有交代。

就好比刚才重构的时候，提到 CTE, UNPIVOT , 代码简化的策略，可能因为一时灵感或责任心爆棚，反正你当时想到了，但你不及时记录下来，可能很久过后就忘记你曾做过这么神奇的操作。

所以，等你费尽心思写完很长的代码，一定要通过复盘记录下来，放到你的 blog， github, 等你以后碰到类似情况，却想不出来如何解，你可以随时拿出来用上。
`

回复
hi说道：

2020-02-20 10:47

SQL join的一些简略写法
`
— 常规写法1
select
x1.*
from
(select *
from
tmp.tableA
)x1
left join
(select *
from
tmp.tableB
)x2
on x1.col1 = x2.col1
where
x2.col22 is NULL
;

— 替代写法2
select
x1.*
from
tmp.tableA x1
left join
tmp.tableB x2
on x1.col1 = x2.col1
where
x2.col22 is NULL
;
/*
利用别名的方式省略了一些不必要的内容，但效果都是一样的，可以学习一下。
*/
`

回复
abc说道：

2020-09-16 11:02

Hive sql 中如何获取 array 中的最后一个元素？
https://stackoverflow.com/questions/13832500/how-to-access-the-last-element-in-an-array
`
# session 列的格式，想取最后一列 userID
ip-sessionID-userID
area-sessionID-userID

# 方法一：reverse+split+reverse
reverse(split(reverse(session), ‘-‘)[0])

# 方法二：split+size
select array[size(array)-1] as userID
from
( select split(session,’-‘) array from your_table ) s;

# 方法三：regexp_extract
regexp_extract(session, ‘([^\-]+)$’, 1)
`

https://stackoverflow.com/questions/46044654/extract-last-two-elements-of-an-array-in-hive
https://stackoverflow.com/questions/49999869/finding-the-first-last-of-array-struct

回复
abc说道：

2020-11-03 16:52

使用 WITH … AS … 语句增强 SQL 查询的可读性
https://liam.page/2020/03/13/the-WITH-AS-clause-in-Hive-SQL/
`
最近在读各种 Hive SQL，发现一个可读性问题，让我很无奈。这里记录一下，希望看到的读者能够写出可读性更好的代码。

因为业务比较复杂，所以大家经常会用到各种子查询（sub-query）。于是会写成类似这样：

SELECT
t1.foo,
t2.bar,
t3.baz
FROM (
SELECT
foo,
bar
FROM
tb1
WHERE

) AS t1 INNER JOIN (
SELECT
baz
FROM
tb2
WHERE

) AS t2 ON
WHERE
;

这样写子查询会有两个问题。一是子查询的结果如果要在多个不同地方用到，那么就要复制粘贴多次，实际执行的时候也可能执行多次。二是当子查询或/和 JOIN 特别多的时候，整个查询就会变得无比复杂，可读性极差。为解决问题，可用视图（VIEW）解决，也可用 WITH … AS … 子句来解决。

WITH AS (SELECT … FROM WHERE ) 和视图类似，可以创建一个临时表，供之后使用。按上述例子，改用 WITH … AS … 子句可以写成这样：

WITH t1 AS (
SELECT
foo,
bar
FROM
tb1
WHERE

),
t2 AS (
SELECT
baz
FROM
tb2
WHERE

)

SELECT
t1.foo,
t2.bar,
t3.baz
FROM
t1 INNER JOIN t2 ON
WHERE
;
很明显，这样写起来可读性会强很多。
`

回复
abc说道：

2020-11-04 21:13

hive中如何计算某个字符串出现的次数？
Count particular substring text within column
https://stackoverflow.com/questions/21309213/count-particular-substring-text-within-column
`
— 先按关键字 split 然后 size 计算切分出的array大小最后减一即可
SELECT
size(split(fruits,”Apple”))-1 as number_of_apples
FROM
fruits;
`

How to find the length of hive json array field using hive query
https://stackoverflow.com/questions/62589861/how-to-find-the-length-of-hive-json-array-field-using-hive-query

Hive获取array数组长度
https://blog.csdn.net/qq_31573519/article/details/77542756

回复
abc说道：

2020-11-04 21:15

hive中的字段类型
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

https://acadgild.com/blog/hive-complex-data-types-with-examples

回复
abc说道：

2021-06-16 20:55

`
# Hive SQL中如何截取 input_str 字符串的最后 25 个字符的内容
# 借助 reverse 函数进行一次逆序之后取前 25 个字符的内容，然后再次逆序即可
# Hive 中的字符串截取函数 substr/substring 中的索引是从 1 开始计算的（当然你写0其实也是1的位置）
reverse(substr(reverse(input_str),1,25))
`

回复

ASPIRE

Hive SQL学习整理

缘由：

正文：

参考解答：

1、不等于

2、对于NULL列的处理

3、两表联合查询

4、多表联合查询

5、模糊匹配

6、当前时间&将秒格式的时间戳按特定格式展示

7、将毫秒格式的时间戳转换成可读的日期格式

8、如何检查逗号分隔的字符串中是否包含某个字符串？

9、如何判断某个元素是否在一个array中？

10、字符串连接函数

参考链接：

《 “Hive SQL学习整理” 》有 21 条评论

回复 hi 取消回复

Hive SQL学习整理

缘由：

正文：

参考解答：

1、不等于

2、对于NULL列的处理

3、两表联合查询

4、多表联合查询

5、模糊匹配

6、当前时间&将秒格式的时间戳按特定格式展示

7、将毫秒格式的时间戳转换成可读的日期格式

8、如何检查逗号分隔的字符串中是否包含某个字符串？

9、如何判断某个元素是否在一个array中？

10、字符串连接函数

参考链接：

《 “Hive SQL学习整理” 》 有 21 条评论

回复 hi 取消回复

《 “Hive SQL学习整理” 》有 21 条评论