=Start=
缘由:
简单整理一下近期在做数据分析时遇到的几个和IP匹配的问题以及处理办法,方便后面需要的时候参考。
正文:
参考解答:
- 最常规的匹配方法
ip = '1.2.3.4'
ip in ('1.2.3.4', '2.3.4.5')
ip like '1.2.3.%'
- 如何匹配较大的IP段?
请将CIDR格式的IP地址段转换成C段IP的格式。比如:将“106.11.34.217/24”转换成“106.11.34.%”的形式。请将下面3个CIDR格式的IP地址段转换成C段IP的格式:
59.82.56.0/22
106.11.34.217/24
140.205.0.0/19
解决办法一(不推荐):
对于 /24 这种恰好对应一个 C 段的,可以使用 ip like '1.2.3.%' 这种方式;
但是如果是 /22 这种有 4 个 C 段的,写起来已经有点麻烦了;
对于 /19 这种有 32 个 C 段的,已经是无法接受了(一方面不美观,另一方面数量太多了性能很差)。
推荐方案1:将 IP 转为整数后进行子网匹配
- 将 IP 地址(如 140.205.5.10)转换为整数。
- 将 CIDR 网段(如 140.205.0.0/19)的网络地址和子网掩码也转换为整数范围。
- 判断 IP 整数是否落在该范围内。
-- 创建临时表,添加 ip_num 列,方便后面直接使用
with extend_data as (
SELECT
date,hour,user_name,url,object_id,keyword,ua,ip
,CAST(SPLIT(ip, '\\.')[0] AS BIGINT) * 16777216 +
CAST(SPLIT(ip, '\\.')[1] AS BIGINT) * 65536 +
CAST(SPLIT(ip, '\\.')[2] AS BIGINT) * 256 +
CAST(SPLIT(ip, '\\.')[3] AS BIGINT) as ip_num
FROM
security.access_logs
WHERE
date BETWEEN '20250921' AND '${date}'
AND ip rlike '^\\d+\\.\\d+\\.\\d+\\.\\d+$'
)
SELECT
user_name,date
,count(1) as cnt
,COUNT(DISTINCT object_id) as object_id_cnt
,COUNT(DISTINCT keyword) as keyword_cnt
,sort_array(COLLECT_SET(keyword)) as keyword_set
,sort_array(COLLECT_SET(hour)) as hour_set
,COUNT(DISTINCT ip) as ip_cnt
,sort_array(COLLECT_SET(ip)) as ip_set
,sort_array(COLLECT_SET(if(ua like '%Mobile%', 1, 0))) as is_mobile
,sort_array(COLLECT_SET(split(replace(ua, ')', '('),'\\(')[1])) as ua_set
FROM
extend_data
WHERE
(ip_num BETWEEN 142614528 AND 142622719
OR ip_num BETWEEN 142622720 AND 142630911
-- ...
OR ip_num BETWEEN 3673456640 AND 3673464831
OR ip_num BETWEEN 3741581312 AND 3741843455
)
GROUP BY user_name,date
推荐方案2:结合使用(适合匹配CIDR数量不多的时候使用)
AND (ip like '59.82.56.%'
or ip like '59.82.57.%'
or ip like '59.82.58.%'
or ip like '59.82.59.%'
or ip like '106.11.34.%'
or (ip like '140.205.%' and CAST(SPLIT(ip, '\\.')[2] AS INT) BETWEEN 0 AND 31)
)
- 如何查找所需的IP段信息?
哈哈哈,根据实际需求来……我只能告诉你那以前要多想。
- 如何合并要匹配的IP段信息,并处理成上面的格式方便直接使用?
#!/usr/bin/env python3
# coding=utf-8
from ipaddress import ip_network, collapse_addresses
import sys
def read_cidr_from_file(file_path):
with open(file_path, 'r') as f:
return [line.strip() for line in f if line.strip() and not line.startswith('#')]
def merge_cidr(cidr_list):
"""
将一组 CIDR 地址段合并为最简形式。
:param cidr_list: List[str] - CIDR 字符串列表,如 ["1.1.1.0/24", ...]
:return: List[str] - 合并后的最简 CIDR 列表
"""
# 转换为 ip_network 对象
networks = []
for cidr in cidr_list:
try:
networks.append(ip_network(cidr.strip()))
except Exception as e:
print(f"无效 CIDR: {cidr}, 错误: {e}")
continue
# 合并重叠或连续的网络
merged = list(collapse_addresses(sorted(networks)))
# 转回字符串
return [str(net) for net in merged]
def cidr_to_hive_sql_fallback(merged_cidr_list, ip_column='ipv4'):
"""
生成不依赖 inet_aton 的 Hive SQL 匹配逻辑
"""
conditions = []
for cidr in merged_cidr_list:
net = ip_network(cidr)
start_int = int(net.network_address)
end_int = int(net.broadcast_address)
# 构建 Hive 中的 IP 转整数表达式
ip_to_int_expr = f"""
CAST(SPLIT({ip_column}, '\\\\.')[0] AS BIGINT) * 16777216 +
CAST(SPLIT({ip_column}, '\\\\.')[1] AS BIGINT) * 65536 +
CAST(SPLIT({ip_column}, '\\\\.')[2] AS BIGINT) * 256 +
CAST(SPLIT({ip_column}, '\\\\.')[3] AS BIGINT)
""".strip()
# 完全不做任何预处理,直接用原始的IP字段进行转换后拼接
cond = f"""(
{ip_to_int_expr} BETWEEN {start_int} AND {end_int}
)"""
# 提前将IP字符串转换成数字,然后对数字进行比较
cond = f"""ip_num BETWEEN {start_int} AND {end_int}
"""
conditions.append(cond)
return " OR ".join(conditions)
# === 使用示例 ===
if __name__ == "__main__":
cidr_list = [
"180.184.0.0/24",
"180.184.1.0/24",
"180.184.2.0/24",
"180.184.3.0/24"
]
# 使用
cidr_list = read_cidr_from_file(sys.argv[1]) # 从命令行的第1个参数指定的文件中读取
merged = merge_cidr(cidr_list)
for cidr in merged:
print(cidr)
hive_sql = cidr_to_hive_sql_fallback(merged, 'ip')
print("Hive SQL (无 inet_aton):")
print(hive_sql)
参考链接:
=END=