ASPIRE

Hive SQL学习整理_4

=Start=

缘由：

继续整理最近学到或是用到的Hive SQL知识，方便以后参考。

正文：

参考解答：

0、Hive SQL中的 having 子句以及它与 where 子句的区别

-- Hive在0.7.0版本中增加了对HAVING子句的支持。在Hive的老版本中，通过使用子查询可以达到同样的效果，例如：
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10
-- 老版本的实现
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10

having子句与where都是过滤语句。

where 子句的作用是在对查询结果进行分组前，将不符合where条件的行去掉，即在分组之前过滤数据，条件中不能包含聚合函数，使用where条件显示特定的行。
having 子句的作用是筛选满足条件的组，即在分组之后过滤数据，条件中经常包含聚合函数，使用having条件显示特定的组，也可以使用多个分组标准进行分组。和group by一样，having 子句中的每一个元素也必须出现在select列表中。
总之，select 用where过滤，找到符合条件的元组。而having 用在group by后，配合使用，过滤结果。

-- 当一个查询语句同时出现了where,group by,having,order by的时候，执行顺序和编写顺序
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name
HAVING aggregate_function(column_name) operator value
ORDER BY column_name

1、Hive SQL中的条件判断函数（类似于编程语言中的if..else）

• If 函数 : if
语法: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
返回值: T
说明: 当条件testCondition为TRUE时，返回valueTrue；否则返回valueFalseOrNull

hive> select if(1=2,100,200);
200
hive> select if(1=1,100,200);
100

2、Hive SQL中的非空查找函数

• 非空查找函数 : COALESCE
语法: COALESCE(T v1, T v2, …)
返回值: T
说明:  返回参数中的第一个非空值；如果所有值都为NULL，那么返回NULL

hive> select COALESCE(null,'100','50');
100

3、Hive SQL中嵌套的if..else该如何表达？

• 条件判断函数： CASE
语法 : CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
返回值 : T
说明：如果 a 等于 b ，那么返回 c ；如果 a 等于 d ，那么返回 e ；否则返回 f

hive> select case 100
        when 50 then 'tom'
        when 100 then 'mary'
        else 'tim'
      end;
-- mary

if(a, 1, if(b, 2, if(c, 3, 4)))
# 这两者等价
if (a) {
  1
} else if (b) {
  2
} else if (c) {
  3
} else {
  4
}
# 用 case 语句进行表达
CASE
  WHEN (condition1) THEN result1
  WHEN (condition2) THEN result2
  WHEN (condition3) THEN result3 
  WHEN (condition4) THEN result4
  ELSE result_default 
END AS attribute_name

4、Hive SQL中union all的使用

SQL中 union all 有什用？
简而言之：就是把2个具有相同列及数据类型的结果放到一起显示，并且不去重。

什情况下用union？什情况下用union all？
union与union all的区别是：
前者会把两个记录集中相同的记录合并，而后者不会，性能上前者优。当前Hive只支持UNION ALL(bag union)。不能消除重复行，每个select语句返回的列的数量和名字必须一样，否则会抛出语法错误。

select 'a','b','c' from (select 0) t
union all
select 'ca','cb','cc' from (select 0) t
/*
a	b	c
ca	cb	cc
*/

select 'a','b','c'
union all
select 'ca','cb','cc'
/*
a	b	c
ca	cb	cc
*/

select 'a','b','c'
union all
select 'ca','cb'
/*
sql语法解析错误, Union语句的列数不匹配
*/

5、Hive SQL中字符串切分函数split的使用示例

对于括号「)」这种特殊符号需要进行转义，而且需要多个转义符！

select split('python-requests/2.9.1', '/')
-- ["python-requests","2.9.1"]

select split('python-requests/2.9.1', ')')
select split('python-requests/2.9.1', '\)')
/*
    FAILED: PatternSyntaxException Unmatched closing ')'
*/

select split('python-requests/2.9.1', '\\)')
-- ["python-requests/2.9.1"]

select split('python-requests/2.9.1', '\\)')[0]
-- python-requests/2.9.1

参考链接：

=END=

30 1 月, 2019

Docker

KnowledgeBase, Programing, Tools

having, Hive, split, SQL, union all

《 “Hive SQL学习整理_4” 》有 19 条评论

hi说道：

2019-01-30 10:45

纠正一下，对于split中手动指定的字符串参数而言，如果其中包含分号「;」需要进行转义，而非里面的括号「)」
`
select split(‘Mozilla/5.0 (xxx) AppleWebKit/537.36 (xxx)’, ‘\\)’)
/*
[“Mozilla/5.0 (xxx”,” AppleWebKit/537.36 (xxx”,””]
*/

select split(‘Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (xxx)’, ‘\\)’)
/*
Statement Intel Mac OS X 10_13_6) AppleWebKit/537.36 (xxx)’, ‘\\)’) is not supported
*/
select split(‘Mozilla/5.0 (Macintosh\; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (xxx)’, ‘\\)’)
/*
[“Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6″,” AppleWebKit/537.36 (xxx”,””]
*/
`

回复
hi说道：

2019-02-15 14:53

Hive 的字符串UrlDecode 中文解码
http://blog.dezai.cn/article.asp?id=660
`
# 方法一
select reflect(“java.net.URLDecoder”, “decode”, trim(字段名), “UTF-8”) from 库名.表名 limit 10;

select reflect(“java.net.URLDecoder”, “decode”, trim(‘https%3A%2F%2Fixyzero.com%2Fblog%2Fhi+world’), “UTF-8”)
— https://ixyzero.com/blog/hi world

select reflect(“java.net.URLDecoder”, “decode”, ‘https%3A%2F%2Fixyzero.com%2Fblog%2Fhi+world’, “UTF-8”)
— https://ixyzero.com/blog/hi world

# 方法二
自构建一个UDF函数，需要继承UDF，实现其evaluate()方法。
`
HOW TO DECODE URLS IN HIVE
http://bigdatums.net/2016/11/13/how-to-decode-urls-in-hive/

Decode Raw URL in Hadoop Hive, prefer non-Java solutions
https://stackoverflow.com/questions/12291580/decode-raw-url-in-hadoop-hive-prefer-non-java-solutions

回复
hi说道：

2019-02-22 17:17

单步快速字符串截取
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
`
select substring_index(‘www.apache.org’, ‘.’, 2),
substring_index(‘1.2.3.4’, ‘.’, 3),
substring_index(‘1.2.3.4’, ‘x’, 3);

— ‘www.apache’ ‘1.2.3’ ‘1.2.3.4’

— substring_index(string A, string delim, int count)
— 返回字符串A中，第count个分隔符delim所在位置和字符串开头之间的子串内容；如果没有找到分隔符delim，就返回原始字符串内容。（此功能从1.3.0版本开始支持）
— Returns the substring from string A before count occurrences of the delimiter delim (as of Hive 1.3.0). If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. Substring_index performs a case-sensitive match when searching for delim. Example: substring_index(‘www.apache.org’, ‘.’, 2) = ‘www.apache’.
`

回复
hi说道：

2019-02-22 17:24

对于文章中的第5点，可以使用substring_index进行相同功能的实现，记录一下
`
select split(‘python-requests/2.9.1’, ‘\\)’)[0],
substring_index(‘python-requests/2.9.1’, ‘(‘, 1),
substring_index(‘Mozilla/5.0 (iPhone’, ‘(‘, 1);
— ‘python-requests/2.9.1’ ‘python-requests/2.9.1’ ‘Mozilla/5.0’

select split(‘python-requests/2.9.1’, ‘\\)’)[0],
substring_index(‘python-requests/2.9.1’, ‘(‘, 1),
substring_index(‘Mozilla/5.0 (iPhone)’, ‘(‘, 1),
substring_index(‘Mozilla/5.0 (iPhone)’, ‘)’, 1);
— ‘python-requests/2.9.1’ ‘python-requests/2.9.1’ ‘Mozilla/5.0’ ‘Mozilla/5.0 (iPhone’
`

回复
hi说道：

2019-02-22 19:20

hive: Using collect_set with a delimiter
https://stackoverflow.com/questions/28949914/hive-using-collect-set-with-a-delimiter

Concat multiple rows with a delimiter in hive
https://stackoverflow.com/questions/42949153/concat-multiple-rows-with-a-delimiter-in-hive

Apache Hive group_concat Alternative and Example
http://dwgeek.com/apache-hive-group_concat-alternative-example.html/

Hive中实现group concat功能（不用udf）
https://superlxw1234.iteye.com/blog/1886846

回复
hi说道：

2019-03-05 11:59

Hive SQL中的正则表达式语法及其使用
https://stackoverflow.com/questions/46060475/regular-expression-in-hive
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

hive 正则表达式详解
https://blog.csdn.net/bitcarmanlee/article/details/51106726
`
select ‘密码：ABCD-1234-EFGH-98DA ‘ as raw,
regexp_extract(‘密码：ABCD-1234-EFGH-98DA ‘, ‘密码：([a-zA-Z0-9\\-]{2,})\\s+’, 0) as passwd,
regexp_extract(‘密码：ABCD-1234-EFGH-98DA ‘, ‘密码：([a-zA-Z0-9\\-]{2,})\\s+’, 1) as passwd2,
regexp_extract(‘密码：ABCD-1234-EFGH-98DA ‘, ‘密码：([a-zA-Z0-9\\-]{2,})\\s+’) as passwd3,
regexp_extract(‘密码：ABCD-1234-EFGH-98DA ‘, ‘密码：(.+?)\\s+’) as passwd4,
regexp_extract(‘密码：ABCD-1234-EFGH-98DA ‘, ‘密码：(.*?)\\s+’, 0) as passwd5
;
/*
密码：ABCD-1234-EFGH-98DA 密码：ABCD-1234-EFGH-98DA ABCD-1234-EFGH-98DA ABCD-1234-EFGH-98DA ABCD-1234-EFGH-98DA 密码：ABCD-1234-EFGH-98DA
*/
空格要用 \\s 来表示，而不是 \s ；
数字要用 \\d 来表示，而不是 \d ，用 [0-9] 也行；

总之就是，Hive中的正则可以用，但是有所区别，区别在于原来的’\’转义，这里变成了双斜杠了’\\’
`

回复
hi说道：

2019-03-05 11:59

hive正则表达式
https://blog.csdn.net/u014078192/article/details/38388943
Hive 正则提取英文名称和中文名称
https://blog.csdn.net/mulangren1988/article/details/83994491

回复
hi说道：

2019-03-21 16:56

在 Hive SQL 中如何在 select 语句中执行「除法」操作？
https://stackoverflow.com/questions/34124522/how-to-perform-division-in-hive-with-a-select-statement
https://stackoverflow.com/questions/53075834/hive-and-presto-integer-division-truncation-problem
`
— Hive SQL
select 1 / 3;
— 0.3333333333333333

select cast(1 as double) / 3;
— 0.3333333333333333
`

回复
hi说道：

2019-03-22 12:26

Hive的正则匹配中文
https://blog.csdn.net/changzoe/article/details/80251700
`
利用Hive的正则匹配中文时需要注意：

中文的字符集合为[\u4e00-\u9fa5]
但是hive在hive执行中会被转义，因此需要增加一次java的转义字符才能够正确使用
例如：
select ‘密码：在入职时收到的短信中’ rlike ‘密码：[\\u4e00-\\u9fa5]+’,
regexp_extract(‘密码：在入职时收到的短信中’, ‘密码[：\\:\\=]{1}([\\u4e00-\\u9fa5]+).*?’, 0);
`

hive匹配全中文字段
https://superlxw1234.iteye.com/blog/1622414

回复
hi说道：

2019-03-22 12:27

HIVE笔记 #相当的全面和细致
http://tuling56.site/ref/manual/hive.html

回复
hi说道：

2019-03-26 20:51

How to get Substring in Hadoop Hive?
https://stackoverflow.com/questions/30470273/how-to-get-substring-in-hadoop-hive
`
Hive SQL中的字符串提取
— 如果是想从中间或任意位置进行提取，可以使用
regexp_extract
split
— 如果是想从头开始截取，可以使用
substring_index
substring
`

回复
hi说道：

2019-05-08 19:46

hive函数 — split 字符串分割函数
https://blog.csdn.net/lxpbs8851/article/details/18712407
https://stackoverflow.com/questions/4065999/does-hive-have-a-string-split-function
`
split(str, regex) – Splits str around occurances that match regex
要注意HQL中的 split 函数的第二个参数是「正则」字符串，可以同时指定多个分隔符进行切分操作，但要注意进行转义，比如：
split(input, ‘[\\[\\]]’) as output #表示使用 [ 或者 ] 作为分隔符
`

回复
hi说道：

2019-06-19 15:23

Hive query with multiple LIKE operators
https://stackoverflow.com/questions/33626645/hive-query-with-multiple-like-operators
`
— Hive SQL中如何查找某个字段是否存在多个可能的语句
SELECT *
FROM some_table
WHERE
some_col LIKE ‘%abc%’
OR
some_col LIKE ‘%xyz%’
OR
some_col LIKE ‘%pqr%’
OR
… (some more LIKE statements)

— 或者
WHERE some_col RLIKE ‘abc|pqr|xyz’
`

回复
hi说道：

2019-10-31 20:19

Hive SQL中如何截取「变长」「特定分隔符」字符串的除最后一段内容之外的内容？
`
/*
比如希望获取原始字符串：
“a-b-c-d-e-f-g-h”
希望提取出：
“a-b-c-d-e-f-g”
或
“a-b-c-d-e-f”
中最后一个、两个分隔符之前的内容，但分隔符的个数不确定，不适于使用 index/locate 等方式先定位再截取，这里最好用正则的方式进行提取。
*/

select
raw,
regexp_extract(raw, ‘(.+?)-([^\\-]{1,})$’, 1) as raw1
from
(select
“a-b-c-d-e-f-g-h” as raw
)tmp
;
— a-b-c-d-e-f-g-h a-b-c-d-e-f-g
`

回复
hi说道：

2020-01-07 20:03

Hive collect、explode函数详解（包括concat、Lateral View）
https://blog.csdn.net/Abysscarry/article/details/81505953
`
一、collect_set 和 collect_list 函数
二、扩展：concat / concat_ws 函数
三、Explode 函数
四、lateral view 详解
`

回复
hi说道：

2020-02-17 16:53

substring_index——按关键字截取字符串
https://blog.csdn.net/xzw_123/article/details/43233361
`
select
substring_index(substring_index(‘aa.bb.cc.dd’, ‘.’, 1), ‘.’, -1) as a1,
substring_index(substring_index(‘aa.bb.cc.dd’, ‘.’, 2), ‘.’, -1) as a2,
substring_index(substring_index(‘aa.bb.cc.dd’, ‘.’, 3), ‘.’, -1) as a3,
substring_index(substring_index(‘aa.bb.cc.dd’, ‘.’, 4), ‘.’, -1) as a4
;
/*
用两个substring_index是因为substring_index(‘aa.bb.cc.dd’, ‘.’, 2)的结果为aa.bb，所以得再从后截取一次。

aa bb cc dd
*/
`

回复
hi说道：

2020-02-23 09:37

https://stackoverflow.com/questions/6067673/urldecoder-illegal-hex-characters-in-escape-pattern-for-input-string
https://docs.oracle.com/javase/8/docs/api/java/net/URLDecoder.html
https://inneka.com/programming/java/urldecoder-illegal-hex-characters-in-escape-pattern-for-input-string/
`
Caused by: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern – For input string: “ýý”
at java.net.URLDecoder.decode(URLDecoder.java:194)
`

回复
hi说道：

2024-08-29 11:59

`
/* 使用 case when 进行数值比较从而判断区间时，需要注意的是——顺序是从上往下进行判断，如果满足，则停止比较，且最终只会落在一个区间内 */
,case
when fans_num > 10000000 then ‘>1000w’
when fans_num > 8000000 then ‘800-1000w’
when fans_num > 5000000 then ‘500-800w’
when fans_num > 3000000 then ‘300-500w’
when fans_num > 2000000 then ‘200-300w’
else ‘100-200w’
end as fans_num_range
…
where
…
and fans_num > 1000000
`
【Hive】case when 使用注意事项
https://blog.csdn.net/m0_37773338/article/details/105919799
`
1. sum（case when…then…else ..end），不要漏了else里面的情况，这个可能会导致计算错误。
2. case when 逻辑错误（case when是从上往下读条件的，但是当x=0的时候，第一种情况不满足，会走到第二种情况。）
3. 当要判断的条件在同一行数据的时候，注意不能用case when去判断了，要分开统计。
`

回复
hi说道：

2024-09-10 17:07

Hive SQL 如何对英文分号(;)做split切分？
split string that includes semicolons in Hive
https://stackoverflow.com/questions/17212511/split-string-that-includes-semicolons-in-hive
`
select split(f1,’\073′)[0] from table;

使用 ‘\073’ 来代替 ‘;’ 这种写法就OK，否则会报语法错误。
`

回复