Hive SQL Study Notes (6)


=Start=

Why:

A write-up of the Hive SQL knowledge points I used at work recently, for future reference.

Main content:

Reference solutions:
  • Converting IP addresses between string and integer form in Hive SQL
-- Convert a numeric IP to a dotted-decimal string
-- (note: this hex()-based macro assumes hex(x) yields exactly 8 digits; for IPs
--  whose first octet is below 16, lpad the hex string to 8 characters first)
create temporary macro IPint2str (x bigint) concat_ws('.',CONV(substr(hex(x),1,2),16,10),CONV(substr(hex(x),3,2),16,10),CONV(substr(hex(x),5,2),16,10),CONV(substr(hex(x),7,2),16,10));

select
concat_ws('.',
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 1, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 9, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 17, 8), 2, 10),
conv(substr(lpad(conv(3232267034, 10, 2), 32, 0), 25, 8), 2, 10)
)
, concat_ws('.',
CONV(substr(hex(3232267034),1,2),16,10),
CONV(substr(hex(3232267034),3,2),16,10),
CONV(substr(hex(3232267034),5,2),16,10),
CONV(substr(hex(3232267034),7,2),16,10)
)
, IPint2str(3232267034)
;
-- 192.168.123.26	192.168.123.26	192.168.123.26

-- Convert a dotted-decimal IP string to an integer
create temporary macro IPstr2int (x string) cast(split(x, '\\.') [0] * 256 * 256 * 256 + split(x, '\\.') [1] * 256 * 256 + split(x, '\\.') [2] * 256 + split(x, '\\.') [3] AS bigint);

SELECT
  cast(
    split('192.168.123.26', '\\.') [0] * 256 * 256 * 256 + split('192.168.123.26', '\\.') [1] * 256 * 256 + split('192.168.123.26', '\\.') [2] * 256 + split('192.168.123.26', '\\.') [3] AS bigint
  ) AS ip_bigint
, IPstr2int('192.168.123.26')
;
-- 3232267034	3232267034
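
The two macros above are easy to cross-check outside Hive; here is a minimal Python sketch of the same arithmetic (not Hive code, just the equivalent math):

```python
def ip_str_to_int(ip: str) -> int:
    """Dotted-decimal string -> integer, same math as the IPstr2int macro."""
    a, b, c, d = (int(part) for part in ip.split('.'))
    return a * 256**3 + b * 256**2 + c * 256 + d

def ip_int_to_str(n: int) -> str:
    """Integer -> dotted-decimal string, same idea as the IPint2str macro."""
    return '.'.join(str((n >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(ip_str_to_int('192.168.123.26'))  # 3232267034
print(ip_int_to_str(3232267034))        # 192.168.123.26
```
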

# Hexadecimal function: hex
Syntax: hex(BIGINT/BINARY/STRING a)
Return type: string
Description: if the argument is of type bigint/binary, returns the hexadecimal representation of a; if the argument is a string, returns the hexadecimal representation of each character in the string.

hive> select hex(17);
11

hive> select hex('abc');
616263

If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING. (BINARY version as of Hive 0.12.0.)


# Base-conversion function: conv
Syntax: conv(BIGINT/STRING num, int from_base, int to_base)
Return type: string
Description: converts the number num from base from_base to base to_base.

hive> select conv(17,10,16);
11

hive> select conv(17,10,2);
10001

conv(BIGINT num, INT from_base, INT to_base),
conv(STRING num, INT from_base, INT to_base)

Converts a number from a given base to another (see http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_conv).


# Left-pad function: lpad
Syntax: lpad(string str, int len, string pad)
Return type: string
Description: left-pads str with pad up to a total length of len.

hive> select lpad('abc',10,'td');
tdtdtdtabc
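
For reference, the three built-ins above behave roughly like the following Python sketch (an approximation covering the simple cases shown; note that lpad repeats and truncates the pad string):

```python
def conv(num, from_base: int, to_base: int) -> str:
    """Rough equivalent of Hive's conv() for non-negative inputs."""
    n = int(str(num), from_base)
    digits = '0123456789ABCDEF'[:to_base]
    out = ''
    while True:
        n, r = divmod(n, to_base)
        out = digits[r] + out
        if n == 0:
            return out

def lpad(s: str, length: int, pad: str) -> str:
    """Rough equivalent of Hive's lpad(): left-pad, truncating to length."""
    if len(s) >= length:
        return s[:length]
    need = length - len(s)
    return (pad * need)[:need] + s

print(conv(17, 10, 16))       # 11
print(conv(17, 10, 2))        # 10001
print(lpad('abc', 10, 'td'))  # tdtdtdtabc
```
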
  • Generating a date series with SQL in Hive
-- General-purpose approach
with t as (
  select '2020-12-28' as dt_start, '2021-01-12' as dt_end
)
select
idx, date_add(to_date(dt_start), idx) as dt
from t
lateral view posexplode(split(space(datediff(t.dt_end,t.dt_start)),' ')) pi as idx, empty
;

/*
How it works:
First use datediff to compute the number of days between the two dates (the first argument should be the later date, otherwise the result is negative);
then use space to turn that day count (call it n) into a string of n spaces;
then split that string on the space character, producing an array of n+1 empty strings;
then use lateral view posexplode to expand the array into one row per element, as (position, element) pairs (here every element is the empty string);
finally, in the select, use date_add to add each position to the start date, producing the consecutive dates.

datediff(date1, date2) - Returns the number of days between date1 and date2
date1 and date2 are strings in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time parts are ignored. If date1 is earlier than date2, the result is negative.

space(n) - returns n spaces

split(str, regex) - Splits str around occurances that match regex
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFSplit

posexplode(a) - behaves like explode for arrays, but includes the position of items in the original array
Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFPosExplode
Function type:BUILTIN


idx	dt
0	2020-12-28
1	2020-12-29
2	2020-12-30
3	2020-12-31
4	2021-01-01
...
15	2021-01-12
*/
-- Note: by default, Hive SQL date functions expect dates in 'yyyy-MM-dd' format, not 'yyyyMMdd'. Watch out!
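
The posexplode trick above can be sanity-checked with a plain Python sketch of the same (idx, date) expansion:

```python
from datetime import date, timedelta

def date_series(dt_start: str, dt_end: str):
    """Enumerate (idx, dt_start + idx days), like the posexplode query above."""
    start = date.fromisoformat(dt_start)
    n = (date.fromisoformat(dt_end) - start).days  # datediff(dt_end, dt_start)
    return [(idx, (start + timedelta(days=idx)).isoformat())
            for idx in range(n + 1)]

rows = date_series('2020-12-28', '2021-01-12')
print(rows[0], rows[-1], len(rows))  # (0, '2020-12-28') (15, '2021-01-12') 16
```
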

SELECT
default.date_range('20201228','20210112',1)
;

/*
date	index
20201228	0
20201229	1
20201230	2
20201231	3
20210101	4
...
20210112	15
*/

desc function extended default.date_range;
/*
default.date_range(a,b,c) - Generates a range of integers from a to b incremented by c or the elements of a map into multiple rows and columns
Function class:brickhouse.udf.date.DateRangeUDTF
Function type:PERSISTENT
Resource:viewfs://hadoop-lt-cluster/home/system/hive/resources/brickhouse-0.7.1-SNAPSHOT-jar-with-dependencies.jar
*/
Reference links:

Converting between IP addresses and numbers in Hive SQL
https://blog.csdn.net/weixin_44034508/article/details/105813769

Converting between numeric and string IPs in Hive
https://blog.csdn.net/cxy1991xm/article/details/102505150

Converting IP addresses to numbers in Hive SQL
https://www.vzhima.com/2020/06/24/hive-sql-convert-ip-address-to-integer.html

How to generate Date Series in HIVE? (Creating table)
https://stackoverflow.com/questions/45278300/how-to-generate-date-series-in-hive-creating-table/45279528

https://www.gairuo.com/p/hive-sql-mathematical-functions

https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_hex
https://dev.mysql.com/doc/refman/8.0/en/mathematical-functions.html#function_conv

=END=


7 comments on “Hive SQL Study Notes (6)”

  1. The trunc function in Hive
    `
    In one sentence:
    when the first argument is a date, it returns the first day of the month, quarter, or year containing that date;
    when the first argument is a number, it zeroes out the digits to the left or right of the decimal point.

    desc function extended trunc
    /*
    trunc(date, fmt) / trunc(N,D) - Returns If input is date returns date with the time portion of the day truncated to the unit specified by the format model fmt. If you omit fmt, then date is truncated to the nearest day. It currently only supports 'MONTH'/'MON'/'MM', 'QUARTER'/'Q' and 'YEAR'/'YYYY'/'YY' as format. If input is a number group returns N truncated to D decimal places. If D is omitted, then N is truncated to 0 places. D can be negative to truncate (make zero) D digits left of the decimal point.

    date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored.

    Example:
    > SELECT trunc('2009-02-12', 'MM');
    OK
    '2009-02-01'
    > SELECT trunc('2017-03-15', 'Q');
    OK
    '2017-01-01'
    > SELECT trunc('2015-10-27', 'YEAR');
    OK
    '2015-01-01'
    > SELECT trunc(1234567891.1234567891,4);
    OK
    1234567891.1234
    > SELECT trunc(1234567891.1234567891,-4);
    OK
    1234560000
    > SELECT trunc(1234567891.1234567891,0);
    OK
    1234567891
    > SELECT trunc(1234567891.1234567891);
    OK
    1234567891

    Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFTrunc
    Function type:BUILTIN
    */

    with t1 as (SELECT
    '20210703' as yyyymmdd
    ,'2021-07-03' as yyyy_mm_dd
    )
    select
    yyyymmdd, yyyy_mm_dd
    -- ,trunc(yyyy_mm_dd) as trunc0 -- errors at run time when the second argument is omitted
    ,trunc(yyyy_mm_dd,'MM') as month_1st
    ,trunc(yyyy_mm_dd,'YY') as year_1st
    ,trunc(yyyy_mm_dd,'Q') as quarter_1st
    from t1

    /*
    yyyymmdd yyyy_mm_dd month_1st year_1st quarter_1st
    20210703 2021-07-03 2021-07-01 2021-01-01 2021-07-01
    */
    `
    [Hive] The trunc function in Hive 3
    https://blog.csdn.net/debimeng/article/details/101146614
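
    The trunc behaviors quoted above can be mimicked for quick checking; a Python sketch (Decimal avoids float noise for the numeric form; the date form only handles the formats the examples use):

```python
from datetime import date
from decimal import Decimal, ROUND_DOWN

def trunc_num(n: str, d: int = 0) -> Decimal:
    """trunc(N, D): truncate toward zero to D decimal places (D may be negative)."""
    return Decimal(n).quantize(Decimal(1).scaleb(-d), rounding=ROUND_DOWN)

def trunc_date(d: date, fmt: str) -> date:
    """trunc(date, fmt): first day of the month / quarter / year containing d."""
    if fmt in ('MONTH', 'MON', 'MM'):
        return d.replace(day=1)
    if fmt in ('QUARTER', 'Q'):
        return d.replace(month=3 * ((d.month - 1) // 3) + 1, day=1)
    return d.replace(month=1, day=1)  # 'YEAR'/'YYYY'/'YY'

print(trunc_num('1234567891.1234567891', 4))   # 1234567891.1234
print(trunc_num('1234567891.1234567891', -4))  # 1.23456E+9 (== 1234560000)
print(trunc_date(date(2021, 7, 3), 'Q'))       # 2021-07-01
```
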

  2. SQL queries don't start with SELECT
    https://mp.weixin.qq.com/s/5BF12rIs1QXQpX4vpKkL0Q
    `
    In everyday use, SQL queries are written starting with SELECT (note: this only covers SELECT queries, not INSERT or other SQL commands).

    Yesterday a question occurred to me: can WHERE, HAVING, or anything else filter the output of a window function?

    After some digging, my conclusion was no, because window functions can only run after WHERE and GROUP BY. But this leads to a bigger question: in what order is a SQL query executed?

    1. FROM/JOIN/ON
    2. WHERE
    3. GROUP BY
    4. HAVING
    5. SELECT (window functions run in this step)
    6. ORDER BY
    7. LIMIT

    The above describes the semantics of a SQL query. With it, you can quickly tell what a given query will return, and easily answer questions like:

    * Can a WHERE filter be applied to the result of a GROUP BY? (No! WHERE runs before GROUP BY.)
    * Can the output of a window function be filtered? (No! Window functions run in the SELECT step, which comes after WHERE and GROUP BY.)
    * Can ORDER BY be applied to the result of a GROUP BY? (Yes! ORDER BY is essentially the last step, so it can sort the result of any operation.)
    * In which step does LIMIT run? (The last one!)
    `

  3. # How to check in Hive SQL whether two date ranges overlap
    Detect overlapping date ranges from the same table
    https://stackoverflow.com/questions/4490553/detect-overlapping-date-ranges-from-the-same-table
    `
    # Answer one
    and (
    s1.start between s2.start and s2.end
    OR
    s1.end between s2.start and s2.end
    OR
    s2.start between s1.start and s1.end
    )

    # Answer two
    SELECT A.PKey, A.Start, A.End, A.Type
    FROM calendar AS A, calendar AS B
    WHERE (b.pkey <> a.pkey
    AND b.start >= a.start
    AND b.end <= a.end)
    OR (b.pkey <> a.pkey
    AND b.start <= a.start
    AND b.end >= a.end)
    `
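
    A handy equivalent of the checks above is the classic interval-overlap condition "a.start <= b.end AND b.start <= a.end"; a Python sketch of that check (string comparison is fine for 'yyyy-MM-dd' dates):

```python
def overlaps(a_start: str, a_end: str, b_start: str, b_end: str) -> bool:
    """True when closed ranges [a_start, a_end] and [b_start, b_end] share a day."""
    return a_start <= b_end and b_start <= a_end

print(overlaps('2021-01-01', '2021-01-10', '2021-01-08', '2021-01-20'))  # True
print(overlaps('2021-01-01', '2021-01-05', '2021-01-06', '2021-01-09'))  # False
```
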

  4. Advanced HiveQL, overcoming hardness with softness: pushing simple syntax to its limits
    https://zhuanlan.zhihu.com/p/508038040
    `
    HiveQL has matured considerably. As a SQL dialect it supports most query syntax and a fairly rich set of built-in functions, along with window functions, user-defined functions, reflection, and many other advanced features. Faced with a complex data scenario, a skilled practitioner might solve it with those advanced features, e.g. writing a user-defined function to extend SQL's data-processing power; others might steer clear and reach for a non-SQL solution instead. This article does not weigh the merits of either approach; it takes a different path, emphasizing not obscure syntax features or complex UDF implementations, but how flexible, divergent data-processing thinking lets the simplest syntax solve complex data scenarios.

    * Quickly generating test data

    posexplode
    split
    space
    `
