Hive SQL用宏实现常用功能


=Start=

缘由:

简单整理一下近期想到的常用SQL代码片段,方便后面有需要的时候参考,不断积累不断提效。

正文:

参考解答:
1. 常用宏–将数字格式的IPv4转换成点分十进制格式的IPv4字符串
-- 将数字格式的IPv4转换成点分十进制格式的IPv4字符串
create temporary macro IPint2str (x bigint) concat_ws('.',CONV(substr(hex(x),1,2),16,10),CONV(substr(hex(x),3,2),16,10),CONV(substr(hex(x),5,2),16,10),CONV(substr(hex(x),7,2),16,10));

select
concat_ws('.',
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 1, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 9, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 17, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 25, 8), 2, 10)
)
, concat_ws('.',
CONV(substr(hex(3232267034),1,2),16,10),
CONV(substr(hex(3232267034),3,2),16,10),
CONV(substr(hex(3232267034),5,2),16,10),
CONV(substr(hex(3232267034),7,2),16,10)
)
, IPint2str(3232267034)
;
-- 192.168.123.26    192.168.123.26  192.168.123.26
2. 常用宏–判断IPv4字符串是否为内网IP
内网IP段的划分:
192.168.0.0 – 192.168.255.255 (65,536 IP addresses)
172.16.0.0 – 172.31.255.255 (1,048,576 IP addresses)
10.0.0.0 – 10.255.255.255 (16,777,216 IP addresses)

-- 判断字符串格式的IPv4是否为内网IP的宏
create temporary macro is_private_ipv4(clientip string) if(clientip like '10.%'
or clientip like '192.168.%'
or clientip rlike '^172\\.(1[6-9]|2[0-9]|3[0-1])\\.', 1, 0);

select
is_private_ipv4('172.15.20.17')
,is_private_ipv4('172.20.18.32')
,is_private_ipv4('172.35.20.17')
;
-- 0    1   0
3. 常用宏–将点分十进制格式的IPv4字符串转换成数字
create temporary macro IPstr2int (x string) cast(split(x, '\\.') [0] * 256 * 256 * 256 + split(x, '\\.') [1] * 256 * 256 + split(x, '\\.') [2] * 256 + split(x, '\\.') [3] AS bigint);

create temporary macro IPstr2int2 (x string) (shiftleft(cast(split(x,'\\.')[0] as bigint),24) 
+shiftleft(cast(split(x,'\\.')[1] as bigint),16)
+shiftleft(cast(split(x,'\\.')[2] as bigint),8)
+cast(split(x,'\\.')[3] as bigint));

SELECT
  cast(
    split('192.168.123.26', '\\.') [0] * 256 * 256 * 256 + split('192.168.123.26', '\\.') [1] * 256 * 256 + split('192.168.123.26', '\\.') [2] * 256 + split('192.168.123.26', '\\.') [3] AS bigint
  ) AS ip_bigint
, IPstr2int('192.168.123.26')
, IPstr2int2('192.168.123.26')
;
-- 3232267034    3232267034  3232267034
4. 一点想法–用SQL如何判断某个IP是否在某个网段内?
-- 在SQL里面因为不像Python/Java那样带有很多现成的库来做,一般思路就是:
-- 1. 自己写 UDF 然后加载使用,好处在于能实现很多自定义的功能,不过需要写代码而且将jar包上传等额外步骤
-- 2. 用 正则表达式 来实现,好处在于不用引入额外的依赖,但需要对正则表达式比较熟悉,且实现精准了太复杂,实现简单了不够准
-- 3. 将 IP字符串先转换成整数格式,然后判断该整数是否在网段首尾IP的整数范围内 这也是个办法
5. 常用宏–抽取域名的n级子域名
-- 抽取完整域名的几级子域名的宏
create temporary macro subdomain_4(domain string) concat(
  split(domain, '\\.') [size(split(domain, '\\.'))-4]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-3]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-2]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-1]
);
create temporary macro subdomain_3(domain string) concat(
  split(domain, '\\.') [size(split(domain, '\\.'))-3]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-2]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-1]
);

-- 推荐下面这种 reverse+substring_index 的方式,效率理论上来说应该也更高
create temporary macro subdomain_4x(domain string) reverse(substring_index(reverse(domain), '.', 4));
create temporary macro subdomain_3x(domain string) reverse(substring_index(reverse(domain), '.', 3));

with t1 as (
select
'a.b.c.d.com' as domain
,'1.2.com' as domain2
)

select
domain
,subdomain_3(domain) as x3_1
,subdomain_3x(domain) as x3_2
,subdomain_4(domain) as x4_1
,subdomain_4x(domain) as x4_2
,domain2
,subdomain_3(domain2) as y3_1
,subdomain_3x(domain2) as y3_2
,subdomain_4(domain2) as y4_1
,subdomain_4x(domain2) as y4_2
from t1
;

-- 数据格式正常的情况下都没有问题
-- domain        x3_1    x3_2    x4_1        x4_2
-- a.b.c.d.com    c.d.com c.d.com b.c.d.com   b.c.d.com

-- 数据格式不那么规范的情况下,4级子域名上面就出现了问题,用 concat 连接通过 split 切分的这种方式直接返回空,不满足需求
-- domain2    y3_1    y3_2    y4_1    y4_2
-- 1.2.com    1.2.com 1.2.com         1.2.com
6. 常用宏–进行URL编解码
-- Hive中如何进行URL编解码
create temporary macro url_decode(input string) reflect("java.net.URLDecoder", "decode",trim(input),"UTF-8");
create temporary macro url_encode(input string) reflect("java.net.URLEncoder", "encode",input,"UTF-8");
7. 日期字符串转换–特殊格式日期字符串的解析
-- 在做日志分析的时候可能会碰到各种各样格式的日期字符串,如果是常规那种 yyyy-MM-dd HH:mm:ss 格式的肯定是最好,但是也保不齐有一些特殊格式的,比如月份是用英文字母缩写的这种,之前没怎么遇到过,这次遇到解决了简单记录一下

select
'Jan 09 2023 15:30:21' as input
,unix_timestamp('Jan 09 2023 15:30:21', 'MMM dd yyyy HH:mm:ss') as parsed
;
-- input    parsed  
-- Jan 09 2023 15:30:21    1673249421


unix_timestamp(string date, string pattern)

Convert time string with given pattern to Unix time stamp (in seconds), return 0 if fail.
在第二个参数中指定要转换的字符串格式,然后将第一个参数的时间日期字符串转换成以秒为格式的unix时间戳数字,当转换失败时返回0。

Example: unix_timestamp('2009-03-20', 'uuuu-MM-dd') = 1237532400.

默认情况下传入的字符串格式为:
uuuu-MM-dd HH:mm:ss
2023-01-09 10:34:46
上面这种格式的,年/日/时/分/秒这种的字符其实还比较清楚,麻烦就麻烦在月份的3字简写该怎么表示上面,试了多次才确认是 MMM 用来表示月份的3个字符的英文缩写。

Month abbreviations consist of the first three characters of the month’s name. Months with four-character names, such as June, are not abbreviated.
只有3个字符的月份字符串才是缩写,4个字符的不是。

YYMMMDD    #Last two digits of year, three-letter abbreviation of the month, two-digit day (example: 99JAN02)
YYYYMMMDD    #Four-digit year, three-letter abbreviation of the month, two-digit day (example: 2003JUL04)
参考链接:

https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html #pattern format
https://docs.oracle.com/cd/E41183_01/DR/Date_Format_Types.html

Date/Time Formats
https://www.ibm.com/docs/en/sgfmw/5.3.1?topic=format-datetime-formats

Hive SQL学习整理_6
https://ixyzero.com/blog/archives/5096.html

Hive SQL中的日期时间处理
https://ixyzero.com/blog/archives/5362.html

Convert IPv4 string to a BIGINT representation in spark sql
https://stackoverflow.com/questions/40789834/convert-ipv4-string-to-a-bigint-representation-in-spark-sql

=END=


发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注