Hive中如何加载和解析(嵌套)json数据


=Start=

缘由:

最近经常使用Hive进行数据分析,其中有一个比较典型的场景就是——部分有效的数据在一个大的多层嵌套结构json字符串中,该如何进行匹配和比对,可行的方法应该有不少,我这里先记录一下我知道和实际用到的,方便以后参考。

正文:

参考解答:
/*
-- 测试json数据
{
  "store": {
    "fruit": [
      {
        "weight": 8,
        "type": "apple"
      },
      {
        "weight": 9,
        "type": "pear"
      }
    ],
    "bicycle": {
      "price": 19.951,
      "color": "red1"
    }
  },
  "email": "amy@only_for_json_udf_test.net",
  "owner": "amy1"
}

-- 测试json数据
[
  {
    "bssid": "12:34:56:78:c4:90",
    "ssid": "wifi-guest"
  },
  {
    "bssid": "12:34:56:78:c4:a9",
    "ssid": "wifi-guest"
  },
  {
    "bssid": "12:34:56:78:c4:a8",
    "ssid": "wifi-inc"
  }
]

-- 测试json数据
{
  "optional": "unknown",
  "wifiList": "[{\"bssid\":\"12:34:56:78:c4:85\",\"ssid\":\"Xiaomi_XS\"},{\"bssid\":\"12:34:56:78:c4:86\",\"ssid\":\"Xiaomi_5G\"},{\"bssid\":\"7c:03:c9:e3:95:74\",\"ssid\":\"ChinaNet-HfVw\"}]",
  "IMSI": "351036391715285",
  "dpi": "588",
  "timestamp": "1542365379014",
  "coordinates": "36.733577|125.237956",
  "startupTime": "1542391575",
  "totalMemory": "5964230656",
  "availableMemory": "1474846720",
  "serial": "abcd",
  "currentWifi": "[{\"bssid\":\"12:34:56:78:c4:85\",\"rssi\":-34,\"ssid\":\"Xiaomi_XS\"}]",
  "systemVolume": "83"
}
*/

&

select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.owner');
-- amy1

select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.store.fruit[0].weight');
-- 8

select get_json_object('[{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}]', '$.[0].ssid');
-- wifi-guest

select get_json_object('[{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}]', '$.[*].ssid');
-- ["wifi-guest","wifi-guest","wifi-inc"]

&

select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"currentWifi":"[{\"bssid\":\"12:34:56:78:c4:85\",\"rssi\":-34,\"ssid\":\"Xiaomi_XS\"}]","email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.store.fruit[*].type');
-- NULL

select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"currentWifi":[{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}],"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.store.fruit[*].type');
-- ["apple","pear"]
-- 我知道问题的原因了,它是支持嵌套json的,但必须要求value的值也是json类型,而不能是符合json格式的字符串,否则整体会报错(返回NULL)。
-- 原因是 get_json_object 的第一个参数只支持json对象,而不是符合json格式的字符串,否则会将其作为非法json来处理,在取任何字段都会返回NULL。处理办法就是嵌套使用 get_json_object 进行提取。

select get_json_object(data,'$.timestamp'),
		get_json_object(data,'$.currentWifi'), -- [{"bssid":"12:34:56:78:1f:7f","rssi":50,"ssid":"zero"}]
		get_json_object(data,'$.currentWifi[0]'), -- null
		get_json_object(data,'$.currentWifi.[0]'), -- null
		get_json_object(get_json_object(data,'$.currentWifi'), '$.[0].ssid'), -- zero
		data
from db_name.table_name
where
	dt BETWEEN $$begindatekey and $$enddatekey and
	data like '%\\"ssid\\":%' -- 这里的双引号「"」要用「\」进行转义
;

&

select json_tuple('{"optional":"unknown","wifiList":"[{\"bssid\":\"12:34:56:78:c4:85\",\"ssid\":\"Xiaomi_XS\"},{\"bssid\":\"12:34:56:78:c4:86\",\"ssid\":\"Xiaomi_5G\"},{\"bssid\":\"7c:03:c9:e3:95:74\",\"ssid\":\"ChinaNet-HfVw\"}]","IMSI":"351036391715285","dpi":"588","timestamp":"1542365379014","coordinates":"36.733577|125.237956","startupTime":"1542391575","totalMemory":"5964230656","availableMemory":"1474846720","serial":"abcd","currentWifi":"[{\"bssid\":\"12:34:56:78:c4:85\",\"rssi\":-34,\"ssid\":\"Xiaomi_XS\"}]","systemVolume":"83"}', 'currentWifi');
-- NULL
-- json_tuple 的第一个参数只支持json对象,而不是符合json格式的字符串,否则会将其作为非法json来处理,在取任何字段都会返回NULL。

select json_tuple('{"name":"jack","server":"www.qq.com"}','server','name');
-- www.qq.com	jack

select json_tuple('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"currentWifi":[{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}],"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', 'currentWifi', 'currentWifi[0]');
-- [{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}]	NULL
-- json_tuple 这个支持一次取多个key,但是不支持取嵌套的内容

 

参考链接:

=END=


《 “Hive中如何加载和解析(嵌套)json数据” 》 有 14 条评论

  1. Hive Plays Well with JSON
    https://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/
    `
    get_json_object函数第一个参数填写json对象变量,第二个参数使用$表示json变量标识,然后用 . 或 [] 读取对象或数组;但它每次只能查一个字段,因为这个方法,只能接受两个参数,多的不行,那么就导致我们对同一个json数据想要查看多个值,只能多写几个get_json_object,比较麻烦,所以另一个json_tuple方法就派上了用场。

    json_tuple函数可以一次查询多个字段,但它的缺点就是对于复杂的嵌套的json,就操作不了了(就是说使用不了”.”,“[]”这种符号来操作json对象),所以看情况选择这两个方法去使用。
    `

  2. Hive Expression not in GROUP BY key
    https://stackoverflow.com/questions/5746687/hive-expression-not-in-group-by-key

    Hive取非Group by字段数据的方法
    http://www.crazyant.net/1600.html
    `
    Hive不允许直接访问非group by字段(如果要查,可以将对应字段放入 group by 里面);
    对于非group by字段,可以用Hive的collect_set函数收集这些字段,返回一个数组;
    使用数字下标,可以直接访问数组中的元素;
    `

  3. 一例 Hive join 优化实战
    https://my.oschina.net/leejun2005/blog/307812

    Hive的严格模式
    https://www.cnblogs.com/benchen/p/5817420.html
    https://stackoverflow.com/questions/42261935/how-allow-hive-mapred-mode-nonstrict

    LanguageManual Joins
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
    `
    SELECT a.* FROM a JOIN b ON (a.id = b.id)
    SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)

    SELECT a.val, b.val, c.val
    FROM
    a
    JOIN
    b ON (a.key = b.key1)
    JOIN
    c ON (c.key = b.key2)
    `

  4. Hive中如何将微秒转换成直观的日期格式?(Hive from_unixtime for milliseconds)
    https://stackoverflow.com/questions/27942930/hive-from-unixtime-for-milliseconds
    `
    select from_unixtime(1543908748678,’yyyyMMdd’); # 508940709
    select from_unixtime(1543908748.678,’yyyyMMdd’) # FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ”yyyyMMdd”: No matching method for class org.apache.hadoop.hive.ql.udf.UDFFromUnixTime with (double, string). Possible choices: _FUNC_(bigint) _FUNC_(bigint, string) _FUNC_(int) _FUNC_(int, string)
    select from_unixtime(1543908748,’yyyyMMdd’) # 20181204

    select 1543908748678, from_unixtime(CAST(1543908748678/1000 as BIGINT), ‘yyyy-MM-dd’); # 1543908748678 2018-12-04
    select 1543908748678, from_unixtime(CAST(1543908748678/1000 as BIGINT), ‘yyyy-MM-dd_HH:mm:ss’) # 1543908748678 2018-12-04_15:32:28
    `
    Hive中如何将日期字符串转换成timestamp格式?(Convert string to timestamp in Hive)
    https://stackoverflow.com/questions/17796071/convert-string-to-timestamp-in-hive

  5. Hive常用字符串函数
    https://www.iteblog.com/archives/1639.html
    `
    1、字符串长度计算函数:length
    2、字符串反转函数:reverse
    3、字符串连接函数:concat
    4、带分隔符字符串连接函数:concat_ws
    5、字符串截取函数:substr,substring
    6、字符串截取函数:substr,substring
    7、字符串转大写函数:upper,ucase
    8、字符串转小写函数:lower,lcase
    9、去空格函数:trim
    10、左边去空格函数:ltrim
    11、右边去空格函数:rtrim
    12、正则表达式替换函数:regexp_replace
    13、正则表达式提取函数:regexp_extract
    14、URL解析函数:parse_url
    15、json解析函数:get_json_object
    16、空格字符串函数:space
    17、重复字符串函数:repeat
    18、首字符ascii函数:ascii
    19、左补足函数:lpad
    20、右补足函数:rpad
    21、分割字符串函数: split
    22、集合查找函数:find_in_set
    `
    Hive中内置的字符串处理函数
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions

  6. Hive中的「非」该怎么写?
    https://stackoverflow.com/questions/5627996/is-there-a-hive-equivalent-of-sql-not-like
    `
    试试:
    where not (Col_Name like ‘%whatever%’)
    或者:
    where not (Col_Name rlike ‘.*whatever.*’)

    和SQL里面是不一样的!!!

    在SQL中,「非」的表示方法:
    select * from table_name where column_name not like ‘%something%’;

    在Hive中,「非」的表示方法
    select * from table_name where not (column_name like ‘%something%’);
    `

  7. 如何在 Apache Hive 中解析 Json 数组
    https://www.iteblog.com/archives/2362.html
    `
    — 测试数据
    — ‘[{“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}]’

    — 方法一:直接用 get_json_object 处理
    select get_json_object(‘[{“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}]’, ‘$.[0].ssid’)
    — ChinaNet-xyz

    select regexp_replace(‘[{“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}]’, ‘\\[|\\]’, ”)
    — {“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}

    — 方法二:先将方括号去掉再作为一个单一json字符串进行处理(只适用于json数组长度为1的情况)
    select get_json_object(regexp_replace(‘[{“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}]’, ‘\\[|\\]’, ”), ‘$.ssid’)
    — ChinaNet-xyz
    `

  8. json_tuple(STRING jsonStr, STRING k1, STRING k2, …)
    A new json_tuple() UDTF is introduced in Hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string.

    json_tuple(jsonStr, p1, p2, …, pn) – like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.

    Hive lateral view ,get_json_object,json_tuple 用法
    https://blog.csdn.net/u010002184/article/details/89639422

    Hive使用json_tuple, posexplode 解析 json
    https://blog.csdn.net/lfq1532632051/article/details/63262519

    WITH t as (
    SELECT ‘{“device_info”:[{“device_id”:”iOS_idfa”,”brand”:”苹果”},{“device_id”:”android_imei”,”brand”:”三星”}],”age”:28}’ as json_info
    )
    select
    json_info
    ,tt0.device_id
    ,tt0.brand
    ,tt1.age
    from
    t
    lateral view json_tuple(get_json_object(get_json_object(json_info, ‘$.device_info’), ‘$.[0]’),’device_id’,’brand’) tt0 as device_id, brand
    LATERAL VIEW json_tuple(json_info, ‘age’) tt1 as age
    ;
    /* 返回1行结果如下(把4个字段分别放在4行里面了):
    {“device_info”:[{“device_id”:”iOS_idfa”,”brand”:”苹果”},{“device_id”:”android_imei”,”brand”:”三星”}],”age”:28}
    iOS_idfa
    苹果
    28
    */

  9. 一文学会Hive解析Json数组(好文收藏)
    https://mp.weixin.qq.com/s/FbQBAT-t_JWuwn2OK0vZ8Q
    `
    Hive自带的json解析函数

    1. get_json_object

    语法:get_json_object(json_string, ‘$.key’)
    说明:解析json的字符串json_string,返回path指定的内容。如果输入的json字符串无效,那么返回NULL。这个函数每次只能返回一个数据项。

    2. json_tuple

    语法:json_tuple(json_string, k1, k2 …)
    说明:解析json的字符串json_string,可指定多个json数据中的key,返回对应的value。如果输入的json字符串无效,那么返回NULL。

    Hive解析json数组

    一、嵌套子查询解析json数组
    explode函数
    regexp_replace函数
    1. 先将json数组中的元素解析出来,转化为每行显示:
    2. 上步已经把一个json数组转化为多个json字符串了,接下来结合json_tuple函数来解析json里面的字段:

    二、使用 lateral view 解析json数组
    lateral view通常和UDTF一起出现,为了解决UDTF不允许在select存在多个字段的问题。
    `
    SQL one trick a day: how to parse JSON through HQL
    https://programmer.group/sql-one-trick-a-day-how-to-parse-json-through-hql.html

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注