Hive SQL 中的分号、下划线等tips整理


=Start=

缘由:

简单整理一下近期在进行 Hive SQL 查询时碰到的一些问题和经验,方便后面有需要的时候参考。

正文:

参考解答:

1. Hive SQL 如何对英文分号(;)做split切分?

> select split(f1,'\073')[0] from tab_name;

-- 使用 '\073' 来代替 ';' 这种写法就OK,否则会报语法错误。

-- 验证代码片段
with t1 as (
SELECT
'1\0733' as v1
,'1 3' as v2
,'123' as v3
)

SELECT
v1,split(v1,'\073') as result1
,v2,split(v2,'\073| ') as result2
,v3,split(v3,'\073| ') as result3
FROM
t1
,split(lower(mac_address),'\073| ') as mac_addr_array --对mac地址字符串使用空格和分号作为分隔符进行切分

,array_contains(mac_addr_array,user_mac) as it_asset -- 判断特定的mac地址是否在mac地址数组中

2. Hive SQL的like如何匹配下划线?

SQL中如何使用like匹配下划线的内容?因为下划线在like中代表任意单个字符,如果需要匹配下划线字符,需要对下划线进行转义。

WHERE mycolumn LIKE '%\_%' ESCAPE '\'
WHERE mycolumn LIKE '%\_%'

WHERE mycolumn LIKE '%#_%' ESCAPE '#'

方法一:
使用 escape 对下划线进行转义(默认是反划线)

方法二:
使用 instr/locate/rlike 等函数进行判断,不使用 like 来判断
-- 验证代码片段

with t1 as (
SELECT
'1_3' as underscore_var
,'123' as no_underscore_var
)

SELECT
underscore_var, if(underscore_var like '%\_%',1,0) as has1
,no_underscore_var, if(no_underscore_var like '%\_%',1,0) as has2
,if(no_underscore_var like '%_%',1,0) as wrong_judge_method
FROM
t1

3. Hive中 布尔类型(boolean) 的变量如何做比较?

-- 简单来说就是可以用常规的等号(=),也可以不添加判等符号
Correct syntax is:
where boolean_col = True
where boolean_col
where NOT boolean_col

To check for NULL use:
where boolean_col is NULL

4. Hive中如何向array/map等复杂类型字段中添加空值?

-- 暂未实际测试

Prepare:
CREATE TABLE IF NOT EXISTS tmp.test_table (
 col1 STRING,
 col2 MAP<STRING, STRING>,
 col3 ARRAY<BIGINT>
)
PARTITIONED BY ( ds STRING )
;

Step1: Add brickhouse jar and create temporary functions to cast array and map.

> ADD JAR s3a://airbnb-datainfra-dependencies-internal-only/teams/datainfra/projects/hive_aux_jars/brickhouse-0.7.1.jar;
> CREATE TEMPORARY FUNCTION CAST_ARRAY AS 'brickhouse.udf.collect.CastArrayUDF';
> CREATE TEMPORARY FUNCTION CAST_MAP AS 'brickhouse.udf.collect.CastMapUDF';

Step 2: Insert the data:

> INSERT OVERWRITE TABLE tmp.test_table PARTITION(ds='2020–01–01')
> SELECT 
 'random_text' col1
 , CAST_MAP(map('', ''), 'map<string,string>') col2
 , CAST_ARRAY(array(), 'bigint') col3

Step 3: Check the results:
> SELECT * FROM tmp.test_table;
参考链接:

split string that includes semicolons in Hive
https://stackoverflow.com/questions/17212511/split-string-that-includes-semicolons-in-hive

Why does using an Underscore character in a LIKE filter give me all the results?
https://stackoverflow.com/questions/19588455/why-does-using-an-underscore-character-in-a-like-filter-give-me-all-the-results

【Hive】like + 转义字符
https://blog.csdn.net/qq_34105362/article/details/80415526

HIVE Boolean Filter
https://stackoverflow.com/questions/68313045/hive-boolean-filter

Insert NULL into Hive complex columns like ARRAY, MAP etc.
https://medium.com/@rajnishkumargarg/insert-null-into-hive-complex-columns-like-array-map-etc-a76e320d3e7e
https://github.com/jeromebanks/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CastMapUDF.java

=END=


《 “Hive SQL 中的分号、下划线等tips整理” 》 有 2 条评论

  1. 一个判断hour字段是否属于自定义的工作时间的方法:
    array_contains(split(‘22,23,00,01,02,03,04,05,06,07′,’,’),`hour`) as not_work_hour,

    使用split对字符串进行切分之后生成array,然后再用array_contains进行判断,可以在外面再套一层if判断将取值范围限定到0和1方便后面做累加操作

    if(array_contains(split(‘22,23,00,01,02,03,04,05,06,07′,’,’),`hour`),1,0) as not_work_hour,

  2. 或者直接rlike也可以达到目标,因为hour字符串就2个字符,也不存在或者说很难出现错误匹配的情况

    hour rlike ’22|23|00|01|02|03|04|05|06|07′ as not_work_hour

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注