Hive 中 map 类型字段的常用方法学习


=Start=

缘由:

Hive支持原始数据类型和复杂类型,原始类型包括数值型,Boolean,字符串,时间戳。复杂类型包括array,map,struct。

实际工作中,有时会遇到 map 这种复杂数据类型,除了需要知道如何访问其内容之外,也简单整理一下它的常用方法,方便后面有需要的时候参考。

正文:

参考解答:

map 是一组键值对的组合,可以通过 key 访问 value,键值之间同样要在创建表时指定分隔符。

  1. size(Map)函数:可得map的长度。返回值类型:int
  2. 获取map中某个key对应的value内容,map的元素访问则使用[],例如 map['key1']
  3. map_keys(Map)函数:可得map中所有的key; 返回值类型: array
  4. map_values(Map)函数:可得map中所有的value; 返回值类型: array
  5. 判断map中是否包含某个key值:
    方法一:先通过 map_keys()函数 得到一个包含所有key的数组变量,然后通过 array_contains()函数 判定特定的key是否在刚才的数组里面;
    方法二:直接用特定key去取值,返回为 NULL 则可能是不包含(还有一种可能是key存在,但是value就是NULL),不过这种方法不精确。

用临时表创建几个map变量用于测试和map类型数据有关的常用方法:

with t1 as (SELECT
str_to_map("name:zhangsan,age:25") as m1
,str_to_map("name:lisi,age:23") as m2
,map("name","wangwu","age","24") as m3
,map('K1','V1','K2','V2','K3','V3') as m4
)

SELECT
m1, size(m1) as m1_size 
,m2, m2['name'] as m2_name
,m3, m3['name1'] as m3_name1
,m4
,map_keys(m4) as m4_keys
,array_contains(map_keys(m4),'k4') as m4_has_k4
,m4['k4'] as m4_k4
,map_values(m4) as m4_values
from t1

上面的SQL的查询结果如下:

{"age":"25","name":"zhangsan"}
2
{"age":"23","name":"lisi"}
lisi
{"age":"24","name":"wangwu"}
NULL
{"K1":"V1","K2":"V2","K3":"V3"}
["K1","K2","K3"]
false
NULL
["V1","V2","V3"]

一个实际的场景举例:

/*
url_params 是 map 类型的变量,现在想统计一下请求参数里面的 key 都有哪些,对应的 value 取值都是什么特点(字符串长度、取值范围等)
因此需要借助 explode 进行切分,然后统计
*/
SELECT
url_path
,map_keys(url_params) as params_key_arr
,params_key,params_value
FROM    security.nginx_access_log
lateral view explode(url_params) kv as params_key,params_value
WHERE   date BETWEEN '20240821' AND '20240821'
AND http_host = 'domain1.ixyzero.com'
AND size(url_params) > 0 
LIMIT   100


/*
借助 lateral view explode 将 map 类型变量 url_params 切分成 params_key 和 params_value 两部分;
然后对 params_key 进行 group by 统计总次数(需要用每一条日志的logid做去重统计,否则数值会被放大很多倍),
同步统计 params_value 的字符串长度等信息
*/
SELECT
params_key
,count(DISTINCT req_logid) as logid_cnt
,count(DISTINCT params_value) as params_value_cnt
,count(DISTINCT url_path) as url_path_cnt
,sort_array(collect_set(length(params_value))) as params_value_len_set
,concat(cast(min(length(params_value)) as STRING),'-',cast(max(length(params_value)) as STRING)) as params_value_len_range
,sort_array(collect_set(length(refer_uid))) as refer_uid_len_set
,concat(cast(min(length(refer_uid)) as STRING),'-',cast(max(length(refer_uid)) as STRING)) as refer_uid_len_range
from
(select
url_path
,req_logid
,params_key,params_value
,regexp_extract(referer_url, '(\\d{8,})', 0) as refer_uid
FROM    security.nginx_access_log
lateral view explode(url_params) kv as params_key,params_value
WHERE   date BETWEEN '20240821' AND '20240821'
AND http_host = 'domain1.ixyzero.com'
AND size(url_params) > 0 
)tmp
GROUP BY params_key
参考链接:

Hive中Map函数的应用
https://yerias.github.io/2021/06/03/hive/25/

Hive中的集合数据类型
https://www.ikeguang.com/article/1611

Hive数据类型[通俗易懂]
https://cloud.tencent.cn/developer/article/2064320

hive 使用 map 类型字段
https://cloud.tencent.com/developer/article/1846124

Hive 数据类型
https://www.hadoopdoc.com/hive/hive-data-type

Hive之Map常用方法
https://blog.csdn.net/weixin_43597208/article/details/126962317

=END=


《 “Hive 中 map 类型字段的常用方法学习” 》 有 4 条评论

  1. Insert NULL into Hive complex columns like ARRAY, MAP etc.
    https://medium.com/@rajnishkumargarg/insert-null-into-hive-complex-columns-like-array-map-etc-a76e320d3e7e
    `
    Prepare:
    CREATE TABLE IF NOT EXISTS tmp.test_table (
    col1 STRING,
    col2 MAP,
    col3 ARRAY
    )
    PARTITIONED BY ( ds STRING )
    ;

    Step1: Add brickhouse jar and create temporary functions to cast array and map.

    > ADD JAR s3a://airbnb-datainfra-dependencies-internal-only/teams/datainfra/projects/hive_aux_jars/brickhouse-0.7.1.jar;
    > CREATE TEMPORARY FUNCTION CAST_ARRAY AS ‘brickhouse.udf.collect.CastArrayUDF’;
    > CREATE TEMPORARY FUNCTION CAST_MAP AS ‘brickhouse.udf.collect.CastMapUDF’;

    Step 2: Insert the data:

    > INSERT OVERWRITE TABLE tmp.test_table PARTITION(ds=’2020–01–01′)
    > SELECT
    ‘random_text’ col1
    , CAST_MAP(map(‘’, ‘’), ‘map’) col2
    , CAST_ARRAY(array(), ‘bigint’) col3

    Step 3: Check the results:
    > SELECT * FROM tmp.test_table;
    `

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注