=Start=
缘由:
在分析日志的时候发现之前记录的那篇文章的内容好像不够严谨,可能会匹配到一些非预期的数字ID字符串,对后面的分析产生干扰,而且排查成本还挺高,所以再看看怎么能更好的在前期做好过滤,降低后期处理成本。
正文:
参考解答:
先说结论——想寄希望于单步完成过滤太过困难,且不利于优化和更新,最好放在多步中处理:
- 先做一些统一的格式化操作——移除注释,将连续的多个空白符号替换成一个空格符号,使用的是 regexp_replace 函数;
- 将一些可能引起错误提取的情况进行总结,然后尽可能多的将大部分信息都提取出来(在匹配数字的正则表达式前面添加
(c1|c2|c3)?
)方便后面做过滤; - 用 explode 函数将匹配到的内容进行切分,方便对不符合条件的内容做过滤;
- (可选)用 collect_set 函数对过滤的结果做聚合,将结果放在1行里面;
with t1 as (
select "select
1234567 as id_int ,'2345678' as id_str
,advertiser_name AS _1700007340280
,(toString((`adID`))) AS _r1700066608896
,cash_cost / 1000001 as div_test
,(sum(coalesce(round(cash_cost/100000, 2), 0))) AS _sum_1700007340301
from db1.tbl1
where
date between '20240101' and '20241201'
or dt between '2024-01-01' and '2024-12-01'
-- and comment_id = 9999999
and condition0 = 123456789
and condition1 > 3456789
and condition2 < 9876543
and condition3 - 222222 > 6666
and condition4 * 333333 > 666666
limit 100001" as multi_line_var
)
SELECT
size(num_array) as num_cnt
,num_array
,num_array2
,num_array3
-- ,num_array4
,array_except(num_array,yyyymmdd_array) as filted_num --排除掉匹配出来的yyyymmdd日期格式的数字字符串
,num1
-- ,yyyy_mm_dd_array
-- ,yyyymmdd_array
,limit_num
FROM
(select multi_line_var
,array_distinct(regexp_extract_all(regexp_replace(regexp_replace(multi_line_var,'--.+?\\n{1,}',''),'\\s{1,}',' '),'(\\+|-|\\*|/|>|>=|<|<=|limit)? ?\\b[1-9]\\d{4,}\\b',0)) as num_array
,regexp_extract_all(multi_line_var,'\\b[1-9]\\d{4,}\\b',0) as num_array2
,regexp_extract_all(regexp_replace(multi_line_var,'\\s+limit\\s+(\\d+)\\s*','') ,'[1-9]\\d{4,}',0) as num_array3
-- ,regexp_extract_all(regexp_replace(multi_line_var,'\\s+limit\\s+(\\d+)\\s*','') ,'(?<!\\+|-|\\*|/|>|>=|<|<=)\\s{0,}([1-9]\\d{4,})\\b',1) as num_array4
,regexp_extract_all(multi_line_var,'20[1-2][0-9]\\-(0[1-9]|1[0-2])\\-(0[1-9]|1[0-9]|2[0-9]|3[0-1])',0) as yyyy_mm_dd_array
,regexp_extract_all(multi_line_var,'20[1-2][0-9](0[1-9]|1[0-2])(0[1-9]|1[0-9]|2[0-9]|3[0-1])',0) as yyyymmdd_array
,regexp_extract(multi_line_var, '\\s+limit\\s+(\\d+)\\s*', 1) as limit_num
from t1
)x1
lateral view explode(array_except(num_array,yyyymmdd_array)) num as num1
WHERE
trim(num1) rlike '^\\d'
-- 下面为了看起来方便,将3行结果按块的方式进行展示
num_cnt
num_array
num_array2
num_array3
filted_num
num1
limit_num
15
[" 1234567","2345678","_1700007340280","/ 1000001","/100000","_1700007340301","20240101","20241201"," 123456789","> 3456789","< 9876543","- 222222","* 333333","> 666666","limit 100001"]
["1234567","2345678","1000001","100000","20240101","20241201","9999999","123456789","3456789","9876543","222222","333333","666666","100001"]
["1234567","2345678","1700007340280","1000001","100000","1700007340301","20240101","20241201","9999999","123456789","3456789","9876543","222222","333333","666666"]
[" 1234567","2345678","_1700007340280","/ 1000001","/100000","_1700007340301"," 123456789","> 3456789","< 9876543","- 222222","* 333333","> 666666","limit 100001"]
1234567
100001
15
[" 1234567","2345678","_1700007340280","/ 1000001","/100000","_1700007340301","20240101","20241201"," 123456789","> 3456789","< 9876543","- 222222","* 333333","> 666666","limit 100001"]
["1234567","2345678","1000001","100000","20240101","20241201","9999999","123456789","3456789","9876543","222222","333333","666666","100001"]
["1234567","2345678","1700007340280","1000001","100000","1700007340301","20240101","20241201","9999999","123456789","3456789","9876543","222222","333333","666666"]
[" 1234567","2345678","_1700007340280","/ 1000001","/100000","_1700007340301"," 123456789","> 3456789","< 9876543","- 222222","* 333333","> 666666","limit 100001"]
2345678
100001
15
[" 1234567","2345678","_1700007340280","/ 1000001","/100000","_1700007340301","20240101","20241201"," 123456789","> 3456789","< 9876543","- 222222","* 333333","> 666666","limit 100001"]
["1234567","2345678","1000001","100000","20240101","20241201","9999999","123456789","3456789","9876543","222222","333333","666666","100001"]
["1234567","2345678","1700007340280","1000001","100000","1700007340301","20240101","20241201","9999999","123456789","3456789","9876543","222222","333333","666666"]
[" 1234567","2345678","_1700007340280","/ 1000001","/100000","_1700007340301"," 123456789","> 3456789","< 9876543","- 222222","* 333333","> 666666","limit 100001"]
123456789
100001
,(toString((`adID`))) AS _r1700066608896 --本来以为快要写完了,发现这里面的 1700066608896 这个数字会被提取出来,又发现了一个bug
解决方法也很简单,在数字匹配那里使用正则表达式的单词边界锚定符 \b。这个特殊字符用来匹配一个单词边界,即它确保匹配的是一个完整的单词,而不会是另一个单词的一部分。
,array_distinct(regexp_extract_all(regexp_replace(regexp_replace(multi_line_var,'--.+?\\n{1,}',''),'\\s{1,}',' '),'(\\+|-|\\*|/|>|>=|<|<=|limit)? ?\\b[1-9]\\d{4,}\\b',0)) as num_array
参考链接:
array_except 函数
https://learn.microsoft.com/zh-cn/azure/databricks/sql/language-manual/functions/array_except
array_distinct 函数
https://learn.microsoft.com/zh-cn/azure/databricks/sql/language-manual/functions/array_distinct
=END=