=Start=
缘由:
最近经常使用Hive进行数据分析,其中有一个比较典型的场景就是——部分有效的数据在一个大的多层嵌套结构json字符串中,该如何进行匹配和比对,可行的方法应该有不少,我这里先记录一下我知道和实际用到的,方便以后参考。
正文:
参考解答:
/* -- 测试json数据 { "store": { "fruit": [ { "weight": 8, "type": "apple" }, { "weight": 9, "type": "pear" } ], "bicycle": { "price": 19.951, "color": "red1" } }, "email": "amy@only_for_json_udf_test.net", "owner": "amy1" } -- 测试json数据 [ { "bssid": "12:34:56:78:c4:90", "ssid": "wifi-guest" }, { "bssid": "12:34:56:78:c4:a9", "ssid": "wifi-guest" }, { "bssid": "12:34:56:78:c4:a8", "ssid": "wifi-inc" } ] -- 测试json数据 { "optional": "unknown", "wifiList": "[{\"bssid\":\"12:34:56:78:c4:85\",\"ssid\":\"Xiaomi_XS\"},{\"bssid\":\"12:34:56:78:c4:86\",\"ssid\":\"Xiaomi_5G\"},{\"bssid\":\"7c:03:c9:e3:95:74\",\"ssid\":\"ChinaNet-HfVw\"}]", "IMSI": "351036391715285", "dpi": "588", "timestamp": "1542365379014", "coordinates": "36.733577|125.237956", "startupTime": "1542391575", "totalMemory": "5964230656", "availableMemory": "1474846720", "serial": "abcd", "currentWifi": "[{\"bssid\":\"12:34:56:78:c4:85\",\"rssi\":-34,\"ssid\":\"Xiaomi_XS\"}]", "systemVolume": "83" } */
&
select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.owner'); -- amy1 select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.store.fruit[0].weight'); -- 8 select get_json_object('[{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}]', '$.[0].ssid'); -- wifi-guest select get_json_object('[{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}]', '$.[*].ssid'); -- ["wifi-guest","wifi-guest","wifi-inc"]
&
select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"currentWifi":"[{\"bssid\":\"12:34:56:78:c4:85\",\"rssi\":-34,\"ssid\":\"Xiaomi_XS\"}]","email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.store.fruit[*].type'); -- NULL select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"currentWifi":[{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}],"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.store.fruit[*].type'); -- ["apple","pear"] -- 我知道问题的原因了,它是支持嵌套json的,但必须要求value的值也是json类型,而不能是符合json格式的字符串,否则整体会报错(返回NULL)。 -- 原因是 get_json_object 的第一个参数只支持json对象,而不是符合json格式的字符串,否则会将其作为非法json来处理,在取任何字段都会返回NULL。处理办法就是嵌套使用 get_json_object 进行提取。 select get_json_object(data,'$.timestamp'), get_json_object(data,'$.currentWifi'), -- [{"bssid":"12:34:56:78:1f:7f","rssi":50,"ssid":"zero"}] get_json_object(data,'$.currentWifi[0]'), -- null get_json_object(data,'$.currentWifi.[0]'), -- null get_json_object(get_json_object(data,'$.currentWifi'), '$.[0].ssid'), -- zero data from db_name.table_name where dt BETWEEN $$begindatekey and $$enddatekey and data like '%\\"ssid\\":%' -- 这里的双引号「"」要用「\」进行转义 ;
&
select json_tuple('{"optional":"unknown","wifiList":"[{\"bssid\":\"12:34:56:78:c4:85\",\"ssid\":\"Xiaomi_XS\"},{\"bssid\":\"12:34:56:78:c4:86\",\"ssid\":\"Xiaomi_5G\"},{\"bssid\":\"7c:03:c9:e3:95:74\",\"ssid\":\"ChinaNet-HfVw\"}]","IMSI":"351036391715285","dpi":"588","timestamp":"1542365379014","coordinates":"36.733577|125.237956","startupTime":"1542391575","totalMemory":"5964230656","availableMemory":"1474846720","serial":"abcd","currentWifi":"[{\"bssid\":\"12:34:56:78:c4:85\",\"rssi\":-34,\"ssid\":\"Xiaomi_XS\"}]","systemVolume":"83"}', 'currentWifi'); -- NULL -- json_tuple 的第一个参数只支持json对象,而不是符合json格式的字符串,否则会将其作为非法json来处理,在取任何字段都会返回NULL。 select json_tuple('{"name":"jack","server":"www.qq.com"}','server','name'); -- www.qq.com jack select json_tuple('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"currentWifi":[{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}],"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', 'currentWifi', 'currentWifi[0]'); -- [{"bssid":"12:34:56:78:c4:90","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a9","ssid":"wifi-guest"},{"bssid":"12:34:56:78:c4:a8","ssid":"wifi-inc"}] NULL -- json_tuple 这个支持一次取多个key,但是不支持取嵌套的内容
参考链接:
- hive加载json数据和解析json
- json_tuple
- get_json_object
- Hive处理JSON字符串
- hive中解析json数组
- Hive JSON数据处理的一点探索
- 如何在 Apache Hive 中解析 Json 数组
- Nested json in hive
- Querying JSON records via Hive
- SerDe vs UDF – parsing JSON in Hive
=END=
《 “Hive中如何加载和解析(嵌套)json数据” 》 有 14 条评论
Hive Plays Well with JSON
https://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/
`
get_json_object函数第一个参数填写json对象变量,第二个参数使用$表示json变量标识,然后用 . 或 [] 读取对象或数组;但它每次只能查一个字段,因为这个方法,只能接受两个参数,多的不行,那么就导致我们对同一个json数据想要查看多个值,只能多写几个get_json_object,比较麻烦,所以另一个json_tuple方法就派上了用场。
json_tuple函数可以一次查询多个字段,但它的缺点就是对于复杂的嵌套的json,就操作不了了(就是说使用不了”.”,“[]”这种符号来操作json对象),所以看情况选择这两个方法去使用。
`
Hive Expression not in GROUP BY key
https://stackoverflow.com/questions/5746687/hive-expression-not-in-group-by-key
Hive取非Group by字段数据的方法
http://www.crazyant.net/1600.html
`
Hive不允许直接访问非group by字段(如果要查,可以将对应字段放入 group by 里面);
对于非group by字段,可以用Hive的collect_set函数收集这些字段,返回一个数组;
使用数字下标,可以直接访问数组中的元素;
`
一例 Hive join 优化实战
https://my.oschina.net/leejun2005/blog/307812
Hive的严格模式
https://www.cnblogs.com/benchen/p/5817420.html
https://stackoverflow.com/questions/42261935/how-allow-hive-mapred-mode-nonstrict
LanguageManual Joins
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
`
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
SELECT a.val, b.val, c.val
FROM
a
JOIN
b ON (a.key = b.key1)
JOIN
c ON (c.key = b.key2)
`
Hive中如何将微秒转换成直观的日期格式?(Hive from_unixtime for milliseconds)
https://stackoverflow.com/questions/27942930/hive-from-unixtime-for-milliseconds
`
select from_unixtime(1543908748678,’yyyyMMdd’); # 508940709
select from_unixtime(1543908748.678,’yyyyMMdd’) # FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ”yyyyMMdd”: No matching method for class org.apache.hadoop.hive.ql.udf.UDFFromUnixTime with (double, string). Possible choices: _FUNC_(bigint) _FUNC_(bigint, string) _FUNC_(int) _FUNC_(int, string)
select from_unixtime(1543908748,’yyyyMMdd’) # 20181204
select 1543908748678, from_unixtime(CAST(1543908748678/1000 as BIGINT), ‘yyyy-MM-dd’); # 1543908748678 2018-12-04
select 1543908748678, from_unixtime(CAST(1543908748678/1000 as BIGINT), ‘yyyy-MM-dd_HH:mm:ss’) # 1543908748678 2018-12-04_15:32:28
`
Hive中如何将日期字符串转换成timestamp格式?(Convert string to timestamp in Hive)
https://stackoverflow.com/questions/17796071/convert-string-to-timestamp-in-hive
Hive常用字符串函数
https://www.iteblog.com/archives/1639.html
`
1、字符串长度计算函数:length
2、字符串反转函数:reverse
3、字符串连接函数:concat
4、带分隔符字符串连接函数:concat_ws
5、字符串截取函数:substr,substring
6、字符串截取函数:substr,substring
7、字符串转大写函数:upper,ucase
8、字符串转小写函数:lower,lcase
9、去空格函数:trim
10、左边去空格函数:ltrim
11、右边去空格函数:rtrim
12、正则表达式替换函数:regexp_replace
13、正则表达式提取函数:regexp_extract
14、URL解析函数:parse_url
15、json解析函数:get_json_object
16、空格字符串函数:space
17、重复字符串函数:repeat
18、首字符ascii函数:ascii
19、左补足函数:lpad
20、右补足函数:rpad
21、分割字符串函数: split
22、集合查找函数:find_in_set
`
Hive中内置的字符串处理函数
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
Hive中的「非」该怎么写?
https://stackoverflow.com/questions/5627996/is-there-a-hive-equivalent-of-sql-not-like
`
试试:
where not (Col_Name like ‘%whatever%’)
或者:
where not (Col_Name rlike ‘.*whatever.*’)
和SQL里面是不一样的!!!
在SQL中,「非」的表示方法:
select * from table_name where column_name not like ‘%something%’;
在Hive中,「非」的表示方法
select * from table_name where not (column_name like ‘%something%’);
`
在Hive中如何判断某个元素是否在一个array中?
https://stackoverflow.com/questions/22906226/how-do-i-search-for-an-item-in-an-array-in-hive
`
方法一:
array_contains(Array, value)
方法二:
先 explode 取出特定列之后,再进行常规的判等操作。
`
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-CollectionFunctions
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
Hive复合数据类型array,map,struct的使用
https://blog.csdn.net/sl1992/article/details/53894481
hive复杂格式array,map,struct使用
https://blog.csdn.net/u010670689/article/details/72885944
hive array、map、struct使用
https://blog.csdn.net/yfkiss/article/details/7842014
关于Hive中的复杂数据类型Array,Map,Structs的一些使用案例
https://blog.csdn.net/Gamer_gyt/article/details/52169441
如何在 Apache Hive 中解析 Json 数组
https://www.iteblog.com/archives/2362.html
`
— 测试数据
— ‘[{“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}]’
— 方法一:直接用 get_json_object 处理
select get_json_object(‘[{“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}]’, ‘$.[0].ssid’)
— ChinaNet-xyz
select regexp_replace(‘[{“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}]’, ‘\\[|\\]’, ”)
— {“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}
— 方法二:先将方括号去掉再作为一个单一json字符串进行处理(只适用于json数组长度为1的情况)
select get_json_object(regexp_replace(‘[{“bssid”:”12:34:56:78:11:50″,”ssid”:”ChinaNet-xyz”}]’, ‘\\[|\\]’, ”), ‘$.ssid’)
— ChinaNet-xyz
`
json_tuple(STRING jsonStr, STRING k1, STRING k2, …)
A new json_tuple() UDTF is introduced in Hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string.
json_tuple(jsonStr, p1, p2, …, pn) – like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.
Hive lateral view ,get_json_object,json_tuple 用法
https://blog.csdn.net/u010002184/article/details/89639422
Hive使用json_tuple, posexplode 解析 json
https://blog.csdn.net/lfq1532632051/article/details/63262519
WITH t as (
SELECT ‘{“device_info”:[{“device_id”:”iOS_idfa”,”brand”:”苹果”},{“device_id”:”android_imei”,”brand”:”三星”}],”age”:28}’ as json_info
)
select
json_info
,tt0.device_id
,tt0.brand
,tt1.age
from
t
lateral view json_tuple(get_json_object(get_json_object(json_info, ‘$.device_info’), ‘$.[0]’),’device_id’,’brand’) tt0 as device_id, brand
LATERAL VIEW json_tuple(json_info, ‘age’) tt1 as age
;
/* 返回1行结果如下(把4个字段分别放在4行里面了):
{“device_info”:[{“device_id”:”iOS_idfa”,”brand”:”苹果”},{“device_id”:”android_imei”,”brand”:”三星”}],”age”:28}
iOS_idfa
苹果
28
*/
一个增强版本的json解析/提取工具,可以支持json切片、筛选等高级功能(A Java DSL for reading JSON documents.)
https://github.com/json-path/JsonPath
https://goessner.net/articles/JsonPath/
一文学会Hive解析Json数组(好文收藏)
https://mp.weixin.qq.com/s/FbQBAT-t_JWuwn2OK0vZ8Q
`
Hive自带的json解析函数
1. get_json_object
语法:get_json_object(json_string, ‘$.key’)
说明:解析json的字符串json_string,返回path指定的内容。如果输入的json字符串无效,那么返回NULL。这个函数每次只能返回一个数据项。
2. json_tuple
语法:json_tuple(json_string, k1, k2 …)
说明:解析json的字符串json_string,可指定多个json数据中的key,返回对应的value。如果输入的json字符串无效,那么返回NULL。
Hive解析json数组
一、嵌套子查询解析json数组
explode函数
regexp_replace函数
1. 先将json数组中的元素解析出来,转化为每行显示:
2. 上步已经把一个json数组转化为多个json字符串了,接下来结合json_tuple函数来解析json里面的字段:
二、使用 lateral view 解析json数组
lateral view通常和UDTF一起出现,为了解决UDTF不允许在select存在多个字段的问题。
`
SQL one trick a day: how to parse JSON through HQL
https://programmer.group/sql-one-trick-a-day-how-to-parse-json-through-hql.html
hive sql parse json array item to multiple lines
how to convert jsonarray to multi column from hive
https://stackoverflow.com/questions/61004899/how-to-convert-jsonarray-to-multi-column-from-hive
Parse json arrays using HIVE
https://stackoverflow.com/questions/24447428/parse-json-arrays-using-hive
[…] Hive中如何加载和解析(嵌套)json数据https://ixyzero.com/blog/archives/4208.html […]