=Start=
缘由:
简单整理一下近期在进行 Hive SQL 查询时碰到的一些问题和经验,方便后面有需要的时候参考。
正文:
参考解答:
1. Hive SQL 如何对英文分号(;)做split切分?
> select split(f1,'\073')[0] from tab_name;
-- 使用 '\073' 来代替 ';' 这种写法就OK,否则会报语法错误。
-- 验证代码片段
with t1 as (
SELECT
'1\0733' as v1
,'1 3' as v2
,'123' as v3
)
SELECT
v1,split(v1,'\073') as result1
,v2,split(v2,'\073| ') as result2
,v3,split(v3,'\073| ') as result3
FROM
t1
,split(lower(mac_address),'\073| ') as mac_addr_array --对mac地址字符串使用空格和分号作为分隔符进行切分
,array_contains(mac_addr_array,user_mac) as it_asset -- 判断特定的mac地址是否在mac地址数组中
2. Hive SQL的like如何匹配下划线?
SQL中如何使用like匹配下划线的内容?因为下划线在like中代表任意单个字符,如果需要匹配下划线字符,需要对下划线进行转义。
WHERE mycolumn LIKE '%\_%' ESCAPE '\'
WHERE mycolumn LIKE '%\_%'
WHERE mycolumn LIKE '%#_%' ESCAPE '#'
方法一:
使用 escape 对下划线进行转义(默认是反划线)
方法二:
使用 instr/locate/rlike 等函数进行判断,不使用 like 来判断
-- 验证代码片段
with t1 as (
SELECT
'1_3' as underscore_var
,'123' as no_underscore_var
)
SELECT
underscore_var, if(underscore_var like '%\_%',1,0) as has1
,no_underscore_var, if(no_underscore_var like '%\_%',1,0) as has2
,if(no_underscore_var like '%_%',1,0) as wrong_judge_method
FROM
t1
3. Hive中 布尔类型(boolean) 的变量如何做比较?
-- 简单来说就是可以用常规的等号(=),也可以不添加判等符号
Correct syntax is:
where boolean_col = True
where boolean_col
where NOT boolean_col
To check for NULL use:
where boolean_col is NULL
4. Hive中如何向array/map等复杂类型字段中添加空值?
-- 暂未实际测试
Prepare:
CREATE TABLE IF NOT EXISTS tmp.test_table (
col1 STRING,
col2 MAP<STRING, STRING>,
col3 ARRAY<BIGINT>
)
PARTITIONED BY ( ds STRING )
;
Step1: Add brickhouse jar and create temporary functions to cast array and map.
> ADD JAR s3a://airbnb-datainfra-dependencies-internal-only/teams/datainfra/projects/hive_aux_jars/brickhouse-0.7.1.jar;
> CREATE TEMPORARY FUNCTION CAST_ARRAY AS 'brickhouse.udf.collect.CastArrayUDF';
> CREATE TEMPORARY FUNCTION CAST_MAP AS 'brickhouse.udf.collect.CastMapUDF';
Step 2: Insert the data:
> INSERT OVERWRITE TABLE tmp.test_table PARTITION(ds='2020–01–01')
> SELECT
'random_text' col1
, CAST_MAP(map('', ''), 'map<string,string>') col2
, CAST_ARRAY(array(), 'bigint') col3
Step 3: Check the results:
> SELECT * FROM tmp.test_table;
参考链接:
split string that includes semicolons in Hive
https://stackoverflow.com/questions/17212511/split-string-that-includes-semicolons-in-hive
Why does using an Underscore character in a LIKE filter give me all the results?
https://stackoverflow.com/questions/19588455/why-does-using-an-underscore-character-in-a-like-filter-give-me-all-the-results
【Hive】like + 转义字符
https://blog.csdn.net/qq_34105362/article/details/80415526
HIVE Boolean Filter
https://stackoverflow.com/questions/68313045/hive-boolean-filter
Insert NULL into Hive complex columns like ARRAY, MAP etc.
https://medium.com/@rajnishkumargarg/insert-null-into-hive-complex-columns-like-array-map-etc-a76e320d3e7e
https://github.com/jeromebanks/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CastMapUDF.java
=END=
《 “Hive SQL 中的分号、下划线等tips整理” 》 有 2 条评论
一个判断hour字段是否属于自定义的工作时间的方法:
array_contains(split(‘22,23,00,01,02,03,04,05,06,07′,’,’),`hour`) as not_work_hour,
使用split对字符串进行切分之后生成array,然后再用array_contains进行判断,可以在外面再套一层if判断将取值范围限定到0和1方便后面做累加操作
if(array_contains(split(‘22,23,00,01,02,03,04,05,06,07′,’,’),`hour`),1,0) as not_work_hour,
或者直接rlike也可以达到目标,因为hour字符串就2个字符,也不存在或者说很难出现错误匹配的情况
hour rlike ’22|23|00|01|02|03|04|05|06|07′ as not_work_hour