Hive SQL用宏实现常用功能

=Start=

缘由：

简单整理一下近期想到的常用SQL代码片段，方便后面有需要的时候参考，不断积累不断提效。

正文：

参考解答：

1. 常用宏–将数字格式的IPv4转换成点分十进制格式的IPv4字符串

-- 将数字格式的IPv4转换成点分十进制格式的IPv4字符串
create temporary macro IPint2str (x bigint) concat_ws('.',CONV(substr(hex(x),1,2),16,10),CONV(substr(hex(x),3,2),16,10),CONV(substr(hex(x),5,2),16,10),CONV(substr(hex(x),7,2),16,10));

select
concat_ws('.',
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 1, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 9, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 17, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 25, 8), 2, 10)
)
, concat_ws('.',
CONV(substr(hex(3232267034),1,2),16,10),
CONV(substr(hex(3232267034),3,2),16,10),
CONV(substr(hex(3232267034),5,2),16,10),
CONV(substr(hex(3232267034),7,2),16,10)
)
, IPint2str(3232267034)
;
-- 192.168.123.26    192.168.123.26  192.168.123.26

2. 常用宏–判断IPv4字符串是否为内网IP

内网IP段的划分：
192.168.0.0 – 192.168.255.255 (65,536 IP addresses)
172.16.0.0 – 172.31.255.255 (1,048,576 IP addresses)
10.0.0.0 – 10.255.255.255 (16,777,216 IP addresses)

-- 判断字符串格式的IPv4是否为内网IP的宏
create temporary macro is_private_ipv4(clientip string) if(clientip like '10.%'
or clientip like '192.168.%'
or clientip rlike '^172\\.(1[6-9]|2[0-9]|3[0-1])\\.', 1, 0);

select
is_private_ipv4('172.15.20.17')
,is_private_ipv4('172.20.18.32')
,is_private_ipv4('172.35.20.17')
;
-- 0    1   0

3. 常用宏–将点分十进制格式的IPv4字符串转换成数字

create temporary macro IPstr2int (x string) cast(split(x, '\\.') [0] * 256 * 256 * 256 + split(x, '\\.') [1] * 256 * 256 + split(x, '\\.') [2] * 256 + split(x, '\\.') [3] AS bigint);

create temporary macro IPstr2int2 (x string) (shiftleft(cast(split(x,'\\.')[0] as bigint),24) 
+shiftleft(cast(split(x,'\\.')[1] as bigint),16)
+shiftleft(cast(split(x,'\\.')[2] as bigint),8)
+cast(split(x,'\\.')[3] as bigint));

SELECT
  cast(
    split('192.168.123.26', '\\.') [0] * 256 * 256 * 256 + split('192.168.123.26', '\\.') [1] * 256 * 256 + split('192.168.123.26', '\\.') [2] * 256 + split('192.168.123.26', '\\.') [3] AS bigint
  ) AS ip_bigint
, IPstr2int('192.168.123.26')
, IPstr2int2('192.168.123.26')
;
-- 3232267034    3232267034  3232267034

4. 一点想法–用SQL如何判断某个IP是否在某个网段内？

-- 在SQL里面因为不像Python/Java那样带有很多现成的库来做，一般思路就是：
-- 1. 自己写 UDF 然后加载使用，好处在于能实现很多自定义的功能，不过需要写代码而且将jar包上传等额外步骤
-- 2. 用 正则表达式 来实现，好处在于不用引入额外的依赖，但需要对正则表达式比较熟悉，且实现精准了太复杂，实现简单了不够准
-- 3. 将 IP字符串先转换成整数格式，然后判断该整数是否在网段首尾IP的整数范围内 这也是个办法

5. 常用宏–抽取域名的n级子域名

-- 抽取完整域名的几级子域名的宏
create temporary macro subdomain_4(domain string) concat(
  split(domain, '\\.') [size(split(domain, '\\.'))-4]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-3]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-2]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-1]
);
create temporary macro subdomain_3(domain string) concat(
  split(domain, '\\.') [size(split(domain, '\\.'))-3]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-2]
  ,'.', split(domain, '\\.') [size(split(domain, '\\.'))-1]
);

-- 推荐下面这种 reverse+substring_index 的方式，效率理论上来说应该也更高
create temporary macro subdomain_4x(domain string) reverse(substring_index(reverse(domain), '.', 4));
create temporary macro subdomain_3x(domain string) reverse(substring_index(reverse(domain), '.', 3));

with t1 as (
select
'a.b.c.d.com' as domain
,'1.2.com' as domain2
)

select
domain
,subdomain_3(domain) as x3_1
,subdomain_3x(domain) as x3_2
,subdomain_4(domain) as x4_1
,subdomain_4x(domain) as x4_2
,domain2
,subdomain_3(domain2) as y3_1
,subdomain_3x(domain2) as y3_2
,subdomain_4(domain2) as y4_1
,subdomain_4x(domain2) as y4_2
from t1
;

-- 数据格式正常的情况下都没有问题
-- domain        x3_1    x3_2    x4_1        x4_2
-- a.b.c.d.com    c.d.com c.d.com b.c.d.com   b.c.d.com

-- 数据格式不那么规范的情况下，4级子域名上面就出现了问题，用 concat 连接通过 split 切分的这种方式直接返回空，不满足需求
-- domain2    y3_1    y3_2    y4_1    y4_2
-- 1.2.com    1.2.com 1.2.com         1.2.com

6. 常用宏–进行URL编解码

-- Hive中如何进行URL编解码
create temporary macro url_decode(input string) reflect("java.net.URLDecoder", "decode",trim(input),"UTF-8");
create temporary macro url_encode(input string) reflect("java.net.URLEncoder", "encode",input,"UTF-8");

7. 日期字符串转换–特殊格式日期字符串的解析

-- 在做日志分析的时候可能会碰到各种各样格式的日期字符串，如果是常规那种 yyyy-MM-dd HH:mm:ss 格式的肯定是最好，但是也保不齐有一些特殊格式的，比如月份是用英文字母缩写的这种，之前没怎么遇到过，这次遇到解决了简单记录一下

select
'Jan 09 2023 15:30:21' as input
,unix_timestamp('Jan 09 2023 15:30:21', 'MMM dd yyyy HH:mm:ss') as parsed
;
-- input    parsed  
-- Jan 09 2023 15:30:21    1673249421


unix_timestamp(string date, string pattern)

Convert time string with given pattern to Unix time stamp (in seconds), return 0 if fail.
在第二个参数中指定要转换的字符串格式，然后将第一个参数的时间日期字符串转换成以秒为格式的unix时间戳数字，当转换失败时返回0。

Example: unix_timestamp('2009-03-20', 'uuuu-MM-dd') = 1237532400.

默认情况下传入的字符串格式为：
uuuu-MM-dd HH:mm:ss
2023-01-09 10:34:46
上面这种格式的，年/日/时/分/秒这种的字符其实还比较清楚，麻烦就麻烦在月份的3字简写该怎么表示上面，试了多次才确认是 MMM 用来表示月份的3个字符的英文缩写。

Month abbreviations consist of the first three characters of the month’s name. Months with four-character names, such as June, are not abbreviated.
只有3个字符的月份字符串才是缩写，4个字符的不是。

YYMMMDD    #Last two digits of year, three-letter abbreviation of the month, two-digit day (example: 99JAN02)
YYYYMMMDD    #Four-digit year, three-letter abbreviation of the month, two-digit day (example: 2003JUL04)

参考链接：

https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html #pattern format
https://docs.oracle.com/cd/E41183_01/DR/Date_Format_Types.html

Date/Time Formats
https://www.ibm.com/docs/en/sgfmw/5.3.1?topic=format-datetime-formats

Hive SQL学习整理_6
https://ixyzero.com/blog/archives/5096.html

Hive SQL中的日期时间处理
https://ixyzero.com/blog/archives/5362.html

Convert IPv4 string to a BIGINT representation in spark sql
https://stackoverflow.com/questions/40789834/convert-ipv4-string-to-a-bigint-representation-in-spark-sql

=END=

ASPIRE

Hive SQL用宏实现常用功能

缘由：

正文：

参考解答：

1. 常用宏–将数字格式的IPv4转换成点分十进制格式的IPv4字符串

2. 常用宏–判断IPv4字符串是否为内网IP

3. 常用宏–将点分十进制格式的IPv4字符串转换成数字

4. 一点想法–用SQL如何判断某个IP是否在某个网段内？

5. 常用宏–抽取域名的n级子域名

6. 常用宏–进行URL编解码

7. 日期字符串转换–特殊格式日期字符串的解析

参考链接：

发表回复取消回复

Hive SQL用宏实现常用功能

缘由：

正文：

参考解答：

1. 常用宏–将数字格式的IPv4转换成点分十进制格式的IPv4字符串

2. 常用宏–判断IPv4字符串是否为内网IP

3. 常用宏–将点分十进制格式的IPv4字符串转换成数字

4. 一点想法–用SQL如何判断某个IP是否在某个网段内？

5. 常用宏–抽取域名的n级子域名

6. 常用宏–进行URL编解码

7. 日期字符串转换–特殊格式日期字符串的解析

参考链接：

发表回复 取消回复

发表回复取消回复