IP匹配的一些小tips


=Start=

缘由:

简单整理一下近期在做数据分析时遇到的几个和IP匹配的问题以及处理办法,方便后面需要的时候参考。

正文:

参考解答:
  1. 最常规的匹配方法
ip = '1.2.3.4'
ip in ('1.2.3.4', '2.3.4.5')
ip like '1.2.3.%'
  1. 如何匹配较大的IP段?
请将CIDR格式的IP地址段转换成C段IP的格式。比如:将“106.11.34.217/24”转换成“106.11.34.%”的形式。请将下面3个CIDR格式的IP地址段转换成C段IP的格式:
59.82.56.0/22
106.11.34.217/24
140.205.0.0/19
解决办法一(不推荐):

对于 /24 这种恰好对应一个 C 段的,可以使用 ip like '1.2.3.%' 这种方式;
但是如果是 /22 这种有 4 个 C 段的,写起来已经有点麻烦了;
对于 /19 这种有 32 个 C 段的,已经是无法接受了(一方面不美观,另一方面数量太多了性能很差)。
  1. 将 IP 地址(如 140.205.5.10)转换为整数。
  2. 将 CIDR 网段(如 140.205.0.0/19)的网络地址和子网掩码也转换为整数范围。
  3. 判断 IP 整数是否落在该范围内。
-- 创建临时表,添加 ip_num 列,方便后面直接使用
with extend_data as (
SELECT
date,hour,user_name,url,object_id,keyword,ua,ip
,CAST(SPLIT(ip, '\\.')[0] AS BIGINT) * 16777216 +
CAST(SPLIT(ip, '\\.')[1] AS BIGINT) * 65536 +
CAST(SPLIT(ip, '\\.')[2] AS BIGINT) * 256 +
CAST(SPLIT(ip, '\\.')[3] AS BIGINT) as ip_num
FROM
  security.access_logs
WHERE
  date BETWEEN '20250921' AND '${date}'
AND ip rlike '^\\d+\\.\\d+\\.\\d+\\.\\d+$'
)

SELECT
user_name,date
,count(1) as cnt
,COUNT(DISTINCT object_id) as object_id_cnt
,COUNT(DISTINCT keyword) as keyword_cnt
,sort_array(COLLECT_SET(keyword)) as keyword_set
,sort_array(COLLECT_SET(hour)) as hour_set
,COUNT(DISTINCT ip) as ip_cnt
,sort_array(COLLECT_SET(ip)) as ip_set
,sort_array(COLLECT_SET(if(ua like '%Mobile%', 1, 0))) as is_mobile
,sort_array(COLLECT_SET(split(replace(ua, ')', '('),'\\(')[1])) as ua_set
FROM
  extend_data
WHERE
(ip_num BETWEEN 142614528 AND 142622719
 OR ip_num BETWEEN 142622720 AND 142630911
 -- ...
 OR ip_num BETWEEN 3673456640 AND 3673464831
 OR ip_num BETWEEN 3741581312 AND 3741843455
)
GROUP BY user_name,date
AND (ip like '59.82.56.%'
  or ip like '59.82.57.%'
  or ip like '59.82.58.%'
  or ip like '59.82.59.%'
  or ip like '106.11.34.%'
  or (ip like '140.205.%' and CAST(SPLIT(ip, '\\.')[2] AS INT) BETWEEN 0 AND 31)
)
  1. 如何查找所需的IP段信息?

哈哈哈,根据实际需求来……我只能告诉你那以前要多想。

  1. 如何合并要匹配的IP段信息,并处理成上面的格式方便直接使用?
#!/usr/bin/env python3
# coding=utf-8

from ipaddress import ip_network, collapse_addresses
import sys

def read_cidr_from_file(file_path):
    with open(file_path, 'r') as f:
        return [line.strip() for line in f if line.strip() and not line.startswith('#')]

def merge_cidr(cidr_list):
    """
    将一组 CIDR 地址段合并为最简形式。

    :param cidr_list: List[str] - CIDR 字符串列表,如 ["1.1.1.0/24", ...]
    :return: List[str] - 合并后的最简 CIDR 列表
    """
    # 转换为 ip_network 对象
    networks = []
    for cidr in cidr_list:
        try:
            networks.append(ip_network(cidr.strip()))
        except Exception as e:
            print(f"无效 CIDR: {cidr}, 错误: {e}")
            continue

    # 合并重叠或连续的网络
    merged = list(collapse_addresses(sorted(networks)))

    # 转回字符串
    return [str(net) for net in merged]

def cidr_to_hive_sql_fallback(merged_cidr_list, ip_column='ipv4'):
    """
    生成不依赖 inet_aton 的 Hive SQL 匹配逻辑
    """
    conditions = []
    for cidr in merged_cidr_list:
        net = ip_network(cidr)
        start_int = int(net.network_address)
        end_int = int(net.broadcast_address)

        # 构建 Hive 中的 IP 转整数表达式
        ip_to_int_expr = f"""
        CAST(SPLIT({ip_column}, '\\\\.')[0] AS BIGINT) * 16777216 +
        CAST(SPLIT({ip_column}, '\\\\.')[1] AS BIGINT) * 65536 +
        CAST(SPLIT({ip_column}, '\\\\.')[2] AS BIGINT) * 256 +
        CAST(SPLIT({ip_column}, '\\\\.')[3] AS BIGINT)
        """.strip()

        # 完全不做任何预处理,直接用原始的IP字段进行转换后拼接
        cond = f"""(
  {ip_to_int_expr} BETWEEN {start_int} AND {end_int}
)"""
        # 提前将IP字符串转换成数字,然后对数字进行比较
        cond = f"""ip_num BETWEEN {start_int} AND {end_int}
"""

        conditions.append(cond)

    return " OR ".join(conditions)

# === 使用示例 ===
if __name__ == "__main__":
    cidr_list = [
        "180.184.0.0/24",
        "180.184.1.0/24",
        "180.184.2.0/24",
        "180.184.3.0/24"
    ]

    # 使用
    cidr_list = read_cidr_from_file(sys.argv[1]) # 从命令行的第1个参数指定的文件中读取

    merged = merge_cidr(cidr_list)
    for cidr in merged:
        print(cidr)

    hive_sql = cidr_to_hive_sql_fallback(merged, 'ip')
    print("Hive SQL (无 inet_aton):")
    print(hive_sql)
参考链接:

通义千问
https://www.tongyi.com/

=END=


发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注