Hive SQL中如何判断某个设备属于新设备？

=Start=

缘由：

简单记录一下最近在工作中用到的Hive SQL的知识点，方便以后参考。

正文：

参考解答：

将表中的一行（但包含多个字段的）数据转换成多行（每行只有一个字段的）数据；
使用 lateral view explode 进行切分，如果字段中包含特殊字符，需要提前用字符串相关函数进行处理以满足要求

lateral view explode(split(replace(replace(replace(ip_addr, '[', ''), '\"', ''),']',''),',')) ip_addr as ip

explode 函数接收的参数类型是 array/map 类型的

explode(ARRAY<T> a)
explode(MAP<Tkey,Tvalue> m)

collect_set 函数生成的返回值是 array 类型的
array_contains 函数用于判断array中是否存在某个值，返回的是 boolean 类型的 true/false

array_contains(Array<T>, value)
Returns TRUE if the array contains value.

在已有 group by 的基础上可以通过 partition by 来进行独立的 group by ，可以按需使用

select
username
,ip,device_id
,collect_set(device_id) over (partition by username) device_list
from
...
where
...
group by username,ip,device_id
;

在 group by k1,k2,k3 里面的 collect_set(device) over (partition by k1) device_list 是独立的，可以理解 device_list 是只基于 k1 的 group by ，不受外层的 group by 影响；
在 group by k1,k2,k3 里面的 collect_set(device) as device_list2 是基于外层 k1,k2,k3 的 group by ，如果外层的 group by 条件只有1个且和 partition by 的字段相同，两者的结果就是相同的，否则是不一样的，需要按需根据场景来选择。

to_json 函数

# 通过 to_json 和 map 将多个字段拼接成一个json格式的字符串，方便传递给程序做处理

SELECT
TO_JSON(
map(
"k1", 0,
"k2", "a string",
"k3", "[4, 5, 6]",
"k4", "7890"
)
)
;

对于安全来说，很多事件中的异常行为其实可以简单定义为——和之前/其它人不一样，因此就可以通过把之前的数据做一个汇总，统计出前段时间的【常用设备列表/……】，然后再拿当前的记录和统计出的列表做一个比对，如果当前的这个不在常用的设备、职场、认证方式里面，则认为是异常，需要发出告警，由运营人员跟进确认，并进行记录。

新设备、首次访问域名等【首次】的判断逻辑在于——今天(dt)使用的设备、访问的域名，没有在昨天及以前的n天(dt-n,dt-1)中出现过。可以

将今天使用的设备、访问的域名按自己的聚合需求做一个 group by 去重和统计
用前段时间的数据用基于 group by 的 collect_set 生成一个历史设备列表、访问域名列表

方法一：3-1. 然后将上面2步生成的临时表基于 username 做 inner join ，然后用 array_contains 判断今天使用的设备、访问的域名在不在上一步生成的列表里面，如果不在的话，说明设备是新设备、域名是近期首次访问的域名，可以发出提醒，然后由运营同学进行跟进确认。

方法二：3-2. 也可以基于 username,device 做 left outer join 当 x2.username is not NULL and x2.device is NULL 时，则可认为是该用户的新设备，后续处理步骤和上面写的一样。

# 方法一
select
x1.*, x2.*
from
(select
username, ip, device, domain
,count(1) as cnt
from
accesslog
where
dt = 'today'
group by username, ip, device, domain
)x1
join
(select
username
,collect_set(device) over (partition by username) device_list
,collect_set(device) as device_list2
,collect_set(domain) over (partition by username) domain_list
,count(1) as cnt
from
accesslog
where
dt between 'today-30' and 'today-1'
group by username
)x2
on x1.username = x2.username
where
! array_contains(x2.device_list, x1.device)
;

# 方法二
select
x1.*, x2.*
from
(select
username, ip, device, domain
,count(1) as cnt
from
accesslog
where
dt = 'today'
group by username, ip, device, domain
)x1
left outer join
(select
username, device
from
accesslog
where
dt between 'today-30' and 'today-1'
group by username, device
)x2
on x1.username = x2.username and x1.device = x2.device
where
x2.username is not NULL and x2.device is NULL
;

参考链接：

LanguageManual UDF
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions

Hive–map/string→json&处理Json数据的函数
https://blog.csdn.net/qq_46893497/article/details/109959763

hive查询结果转为json格式
https://blog.csdn.net/weixin_40873462/article/details/98175571

hive 将hive表数据查询出来转为json对象和json数组输出
https://www.cnblogs.com/yfb918/p/11325969.html

Spark SQL和Hive中的函数（二）：JSON函数
https://zhuanlan.zhihu.com/p/343758768

=END=

3 12 月, 2022

Docker

Database, KnowledgeBase

array_contains, collect_set, explode, Hive, SQL, to_json