=Start=
缘由:
Hive支持原始数据类型和复杂类型,原始类型包括数值型,Boolean,字符串,时间戳。复杂类型包括array,map,struct。
实际工作中,有时会遇到 map 这种复杂数据类型,除了需要知道如何访问其内容之外,也简单整理一下它的常用方法,方便后面有需要的时候参考。
正文:
参考解答:
map 是一组键值对的组合,可以通过 key 访问 value,键值之间同样要在创建表时指定分隔符。
size(Map)
函数:可得map的长度。返回值类型:int- 获取map中某个key对应的value内容,map的元素访问则使用
[]
,例如map['key1']
map_keys(Map)
函数:可得map中所有的key; 返回值类型: arraymap_values(Map)
函数:可得map中所有的value; 返回值类型: array- 判断map中是否包含某个key值:
方法一:先通过map_keys()
函数 得到一个包含所有key的数组变量,然后通过array_contains()
函数 判定特定的key是否在刚才的数组里面;
方法二:直接用特定key去取值,返回为 NULL 则可能是不包含(还有一种可能是key存在,但是value就是NULL),不过这种方法不精确。
用临时表创建几个map变量用于测试和map类型数据有关的常用方法:
with t1 as (SELECT
str_to_map("name:zhangsan,age:25") as m1
,str_to_map("name:lisi,age:23") as m2
,map("name","wangwu","age","24") as m3
,map('K1','V1','K2','V2','K3','V3') as m4
)
SELECT
m1, size(m1) as m1_size
,m2, m2['name'] as m2_name
,m3, m3['name1'] as m3_name1
,m4
,map_keys(m4) as m4_keys
,array_contains(map_keys(m4),'k4') as m4_has_k4
,m4['k4'] as m4_k4
,map_values(m4) as m4_values
from t1
上面的SQL的查询结果如下:
{"age":"25","name":"zhangsan"}
2
{"age":"23","name":"lisi"}
lisi
{"age":"24","name":"wangwu"}
NULL
{"K1":"V1","K2":"V2","K3":"V3"}
["K1","K2","K3"]
false
NULL
["V1","V2","V3"]
一个实际的场景举例:
/*
url_params 是 map 类型的变量,现在想统计一下请求参数里面的 key 都有哪些,对应的 value 取值都是什么特点(字符串长度、取值范围等)
因此需要借助 explode 进行切分,然后统计
*/
SELECT
url_path
,map_keys(url_params) as params_key_arr
,params_key,params_value
FROM security.nginx_access_log
lateral view explode(url_params) kv as params_key,params_value
WHERE date BETWEEN '20240821' AND '20240821'
AND http_host = 'domain1.ixyzero.com'
AND size(url_params) > 0
LIMIT 100
/*
借助 lateral view explode 将 map 类型变量 url_params 切分成 params_key 和 params_value 两部分;
然后对 params_key 进行 group by 统计总次数(需要用每一条日志的logid做去重统计,否则数值会被放大很多倍),
同步统计 params_value 的字符串长度等信息
*/
SELECT
params_key
,count(DISTINCT req_logid) as logid_cnt
,count(DISTINCT params_value) as params_value_cnt
,count(DISTINCT url_path) as url_path_cnt
,sort_array(collect_set(length(params_value))) as params_value_len_set
,concat(cast(min(length(params_value)) as STRING),'-',cast(max(length(params_value)) as STRING)) as params_value_len_range
,sort_array(collect_set(length(refer_uid))) as refer_uid_len_set
,concat(cast(min(length(refer_uid)) as STRING),'-',cast(max(length(refer_uid)) as STRING)) as refer_uid_len_range
from
(select
url_path
,req_logid
,params_key,params_value
,regexp_extract(referer_url, '(\\d{8,})', 0) as refer_uid
FROM security.nginx_access_log
lateral view explode(url_params) kv as params_key,params_value
WHERE date BETWEEN '20240821' AND '20240821'
AND http_host = 'domain1.ixyzero.com'
AND size(url_params) > 0
)tmp
GROUP BY params_key
参考链接:
Hive中Map函数的应用
https://yerias.github.io/2021/06/03/hive/25/
Hive中的集合数据类型
https://www.ikeguang.com/article/1611
Hive数据类型[通俗易懂]
https://cloud.tencent.cn/developer/article/2064320
hive 使用 map 类型字段
https://cloud.tencent.com/developer/article/1846124
Hive 数据类型
https://www.hadoopdoc.com/hive/hive-data-type
Hive之Map常用方法
https://blog.csdn.net/weixin_43597208/article/details/126962317
=END=
《 “Hive 中 map 类型字段的常用方法学习” 》 有 4 条评论
Hive中 布尔类型(boolean) 的变量如何做比较?
HIVE Boolean Filter
https://stackoverflow.com/questions/68313045/hive-boolean-filter
`
Correct syntax is:
where boolean_col = True
where boolean_col
where NOT boolean_col
To check for NULL use:
where boolean_col is NULL
`
Insert NULL into Hive complex columns like ARRAY, MAP etc.
https://medium.com/@rajnishkumargarg/insert-null-into-hive-complex-columns-like-array-map-etc-a76e320d3e7e
`
Prepare:
CREATE TABLE IF NOT EXISTS tmp.test_table (
col1 STRING,
col2 MAP,
col3 ARRAY
)
PARTITIONED BY ( ds STRING )
;
Step1: Add brickhouse jar and create temporary functions to cast array and map.
> ADD JAR s3a://airbnb-datainfra-dependencies-internal-only/teams/datainfra/projects/hive_aux_jars/brickhouse-0.7.1.jar;
> CREATE TEMPORARY FUNCTION CAST_ARRAY AS ‘brickhouse.udf.collect.CastArrayUDF’;
> CREATE TEMPORARY FUNCTION CAST_MAP AS ‘brickhouse.udf.collect.CastMapUDF’;
Step 2: Insert the data:
> INSERT OVERWRITE TABLE tmp.test_table PARTITION(ds=’2020–01–01′)
> SELECT
‘random_text’ col1
, CAST_MAP(map(‘’, ‘’), ‘map’) col2
, CAST_ARRAY(array(), ‘bigint’) col3
Step 3: Check the results:
> SELECT * FROM tmp.test_table;
`
`
* Cast an Map to the string to string map
*
* Based on CastArrayUDF.
`
https://github.com/jeromebanks/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CastMapUDF.java
hivesql — array_contains()函数使用
https://blog.csdn.net/qq_41018861/article/details/115734991