Viewing and Analyzing Squid Logs

=Start=

Background:

A quick set of notes on Squid logs, kept for reference when needed.

Body:

Reference answer:

Default binary: /usr/sbin/squid
Default main configuration file: /etc/squid/squid.conf
Default listening port: TCP 3128
Default access log: /var/log/squid/access.log
Default cache log: /var/log/squid/cache.log

# Squid's default access.log format
"%9d.%03d %6d %s %s/%03d %d %s %s %s %s%s/%s %s"

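# Accordingly, an access.log entry usually consists of (at least) 10 columns separated by one or more spaces:

1. time             # Unix timestamp in UTC seconds with millisecond resolution. This is the time when Squid started to log the transaction.
2. duration         # Elapsed time: how many milliseconds the transaction kept the cache busy. TCP and UDP interpret it differently.
3. client_ip        # Client IP address.
4. result_codes     # Result codes. This column holds two entries separated by a slash: the Squid result code (e.g. TCP_MISS) and the HTTP status code.
5. bytes            # Size in bytes of the data delivered to the client. This is not the net size of the object, because headers are counted too. A failed request may deliver an error page, whose size is also logged here.
6. req_method       # Request method.
7. url              # Requested URL.
8. user             # Identity of the requesting client. It can come from HTTP authentication, an external ACL helper, TLS authentication, or an IDENT lookup (RFC 931), checked in that order; the first source that yields an identity is logged. If no identity is available, "-" is logged.
9. hierarchy_code   # Hierarchy code (Hierarchy/From): how the object was fetched and from where.
10. type            # Content type of the object as seen in the HTTP reply header. ICP exchanges usually carry no content type and are logged as "-". Some odd replies have a content type of ":" or even an empty one.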

# Examples
1715135750.962     26 172.12.13.48 TCP_MISS/200 671 GET http://172.12.13.57:8080/ - HIER_DIRECT/172.12.13.57 text/html

1715140512.772      1 172.12.13.57 TCP_DENIED/407 4136 GET http://172.12.13.57:8080/1.txt - HIER_NONE/- text/html
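
As a rough sketch of how those ten columns map onto a real line, the awk wrapper below splits one native-format entry and labels each part. The function name squid_fields and the output labels are made up for illustration; it assumes the default squid logformat and URLs without embedded spaces.

```shell
# Label the columns of native-format access.log lines read from stdin.
# Assumption: default "squid" logformat, so columns 4 and 9 are "a/b" pairs.
squid_fields() {
    awk '{
        split($4, rc, "/")      # Squid result code / HTTP status
        split($9, hc, "/")      # hierarchy code / peer address
        printf "time=%s duration_ms=%s client=%s squid_status=%s http_status=%s bytes=%s method=%s url=%s user=%s hier=%s peer=%s type=%s\n",
            $1, $2, $3, rc[1], rc[2], $5, $6, $7, $8, hc[1], hc[2], $10
    }'
}

# Feed it the first sample line from above.
echo '1715135750.962     26 172.12.13.48 TCP_MISS/200 671 GET http://172.12.13.57:8080/ - HIER_DIRECT/172.12.13.57 text/html' | squid_fields
```

Splitting columns 4 and 9 on the slash is the only non-obvious step; everything else is plain whitespace-separated fields.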

question:
How should the logs of a Squid proxy server be viewed and analyzed?

answer:
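Squid is a widely used proxy server for caching and for controlling users' access to Internet resources. Its logs record users' requests, the responses, and other important data, which makes them valuable for network management and security monitoring. The general steps for viewing and analyzing Squid proxy logs are as follows:

Viewing the Squid log files:

Locate the log files: Squid's logs are usually under /var/log/squid/; the access log is typically named access.log or similar.
View them from the command line: tools such as cat, tail, or less can display the log contents. For example, the following command shows the last 100 lines: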

   tail -n 100 /var/log/squid/access.log

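Analyzing the Squid logs:

Log format: know which format the log uses. It is set in Squid's configuration file through the logformat and access_log directives. The default is Squid's native squid format; the built-in common and combined formats follow the Common Log Format and the Combined Log Format.
Use analysis tools: log analyzers such as AWStats or SARG (Squid Analysis Report Generator) can turn Squid logs into reports and charts that make user access patterns easier to see.
Focus on the important information: when analyzing the logs, pay attention to:

URLs requested by users
Access time
Result status codes
Client IP addresses
Traffic volume
Metrics such as the cache hit ratio

Monitor access behavior: log analysis makes it possible to monitor how users access the network and to spot abnormal access and security threats, which supports network management and security monitoring.
Analyze and report regularly: analyze the Squid logs on a regular schedule and generate reports, so that problems are found early and acted on.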
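
Two of the metrics listed above, the cache hit ratio and the traffic per client, can be approximated straight from access.log. This is only a sketch: the function names hit_ratio and bytes_per_client are invented here, the native squid format is assumed, and "hit" is taken loosely to mean any Squid result code containing HIT, so revalidated hits logged under other codes are not counted.

```shell
# Share of requests whose Squid result code (column 4) contains "HIT".
# Lines with fewer than 10 columns are skipped as malformed.
hit_ratio() {
    awk 'NF >= 10 { total++; if ($4 ~ /HIT/) hits++ }
         END { if (total > 0) printf "requests=%d hits=%d hit_ratio=%.1f%%\n", total, hits, 100 * hits / total }'
}

# Bytes delivered (column 5) summed per client IP (column 3), largest first.
bytes_per_client() {
    awk 'NF >= 10 { bytes[$3] += $5 }
         END { for (ip in bytes) printf "%d %s\n", bytes[ip], ip }' | sort -rn
}

# Usage sketch (not run here):
#   hit_ratio < /var/log/squid/access.log
#   bytes_per_client < /var/log/squid/access.log | head -n 10
```

Both functions read from stdin, so they work equally well on a rotated log or on the output of grep.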

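Viewing and analyzing the Squid proxy logs gives a clearer picture of user access behavior and helps optimize network performance and strengthen network security.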

question:
Which fields does Squid's access.log contain, and what does each one mean?

answer:
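Squid is a popular caching proxy server used to improve network performance and security. With the default (native) log format, an access.log entry contains the following fields, in this order:

Timestamp: when the request was logged, as Unix epoch seconds with milliseconds.
Response time: how long the request took to complete, in milliseconds.
Client IP address: the IP address of the client that made the request.
Cache status and HTTP status code: logged together as one column such as TCP_MISS/200. The first part shows how Squid handled the request (including whether it was a cache hit); the second is the HTTP status code sent to the client.
Size: the number of bytes sent to the client, headers included.
Request method: the HTTP method used, such as GET or POST.
Requested URL: the URL the client asked for.
User identity: the authenticated user name if one is available, otherwise "-".
Hierarchy code: how and from where the object was fetched.
Content type: the MIME type of the reply.

The user agent is not part of the default format. It is logged only by the combined format or by a custom logformat that includes %{User-Agent}>h.

Together these fields describe each request and response in detail, which helps administrators monitor and analyze the performance and usage of the Squid server.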

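/var/log/squid/access.log # Most log file analysis programs are based on the contents of access.log. Use this file to find out who is using the Squid server, what they are doing, and so on.

/var/log/squid/cache.log # cache.log holds the debug and error messages that Squid generates. If Squid is started with the -s command-line option, a copy of certain messages goes to the syslog facility. Whether to keep Squid's log data in a separate file is a matter of personal preference.

/var/log/squid/store.log # store.log records objects that are currently kept on disk or that have been removed. As a kind of transaction log, it is mostly used for debugging. Whether an object resides on disk can be determined only after analyzing the complete log file, because the release (deletion) of an object may be logged after its swap-out (save to disk).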

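# A rough rule of thumb: for connection and request codes, a leading ">" means "from the client" and a leading "<" means "sent to server or peer". Reply-side codes do not always follow it: %>Hs is the status sent to the client, and %<st is the size of the reply sent to the client.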

Connection related format codes:

    >a  Client source IP address
    >A  Client FQDN
    >p  Client source port
    >eui    Client source EUI (MAC address, EUI-48 or EUI-64 identifier)
    >la Local IP address the client connected to
    >lp Local port number the client connected to
    >qos    Client connection TOS/DSCP value set by Squid
    >nfmark Client connection netfilter packet MARK set by Squid

    transport::>connection_id Identifies a transport connection
        accepted by Squid (e.g., a connection carrying the
        logged HTTP request). Currently, Squid only supports
        TCP transport connections.

        The logged identifier is an unsigned integer. These
        IDs are guaranteed to monotonically increase within a
        single worker process lifetime, with higher values
        corresponding to connections that were accepted later.
        Many IDs are skipped (i.e. never logged). Concurrent
        workers and restarted workers use similar, partially
        overlapping sequences of IDs.

    la  Local listening IP address the client connection was connected to.
    lp  Local listening port number the client connection was connected to.

    <a  Server IP address of the last server or peer connection
    <A  Server FQDN or peer name
    <p  Server port number of the last server or peer connection
    <la Local IP address of the last server or peer connection
    <lp     Local port number of the last server or peer connection
    <qos    Server connection TOS/DSCP value set by Squid
    <nfmark Server connection netfilter packet MARK set by Squid

    >handshake Raw client handshake
        Initial client bytes received by Squid on a newly
        accepted TCP connection or inside a just established
        CONNECT tunnel. Squid stops accumulating handshake
        bytes as soon as the handshake parser succeeds or
        fails (determining whether the client is using the
        expected protocol).

        For HTTP clients, the handshake is the request line.
        For TLS clients, the handshake consists of all TLS
        records up to and including the TLS record that
        contains the last byte of the first ClientHello
        message. For clients using an unsupported protocol,
        this field contains the bytes received by Squid at the
        time of the handshake parsing failure.

        See the on_unsupported_protocol directive for more
        information on Squid handshake traffic expectations.

        Current support is limited to these contexts:
        - http_port connections, but only when the
          on_unsupported_protocol directive is in use.
        - https_port connections (and CONNECT tunnels) that
          are subject to the ssl_bump peek or stare action.

        To protect binary handshake data, this field is always
        base64-encoded (RFC 4648 Section 4). If logformat
        field encoding is configured, that encoding is applied
        on top of base64. Otherwise, the computed base64 value
        is recorded as is.

Time related format codes:

    ts  Seconds since epoch
    tu  subsecond time (milliseconds)
    tl  Local time. Optional strftime format argument
            default %d/%b/%Y:%H:%M:%S %z
    tg  GMT time. Optional strftime format argument
            default %d/%b/%Y:%H:%M:%S %z
    tr  Response time (milliseconds)
    dt  Total time spent making DNS lookups (milliseconds)
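
The default squid format logs %ts.%03tu, that is, epoch seconds plus milliseconds, which is hard to read by eye. One option is a custom logformat that uses %tl or %tg instead; another is to convert after the fact. The helper below is a minimal sketch of the second approach. The name epoch_to_utc is invented, and it assumes GNU date, whose -d option accepts the @SECONDS form.

```shell
# Convert a "seconds.milliseconds" timestamp from access.log to UTC.
# Assumption: GNU date (date -d "@N"); BSD date uses a different option.
epoch_to_utc() {
    secs=${1%%.*}                             # drop the millisecond part
    date -u -d "@$secs" '+%Y-%m-%d %H:%M:%S'
}

# Timestamp taken from the first sample line above.
epoch_to_utc 1715135750.962
```

With GNU date this prints 2024-05-08 02:35:50 for the sample timestamp.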

HTTP related format codes:

    REQUEST

    [http::]rm  Request method (GET/POST etc)
    [http::]>rm Request method from client
    [http::]<rm Request method sent to server or peer

    [http::]ru  Request URL received (or computed) and sanitized

            Logs request URI received from the client, a
            request adaptation service, or a request
            redirector (whichever was applied last).

            Computed URLs are URIs of internally generated
            requests and various "error:..." URIs.

            Honors strip_query_terms and uri_whitespace.

            This field is not encoded by default. Encoding
            this field using variants of %-encoding will
            clash with uri_whitespace modifications that
            also use %-encoding.

    [http::]>ru Request URL received from the client (or computed)

            Computed URLs are URIs of internally generated
            requests and various "error:..." URIs.

            Unlike %ru, this request URI is not affected
            by request adaptation, URL rewriting services,
            and strip_query_terms.

            Honors uri_whitespace.

            This field is using pass-through URL encoding
            by default. Encoding this field using other
            variants of %-encoding will clash with
            uri_whitespace modifications that also use
            %-encoding.

    [http::]<ru Request URL sent to server or peer
    [http::]>rs Request URL scheme from client
    [http::]<rs Request URL scheme sent to server or peer
    [http::]>rd Request URL domain from client
    [http::]<rd Request URL domain sent to server or peer
    [http::]>rP Request URL port from client
    [http::]<rP Request URL port sent to server or peer
    [http::]rp  Request URL path excluding hostname
    [http::]>rp Request URL path excluding hostname from client
    [http::]<rp Request URL path excluding hostname sent to server or peer
    [http::]rv  Request protocol version
    [http::]>rv Request protocol version from client
    [http::]<rv Request protocol version sent to server or peer

    [http::]>h  Original received request header.
            Usually differs from the request header sent by
            Squid, although most fields are often preserved.
            Accepts optional header field name/value filter
            argument using name[:[separator]element] format.
    [http::]>ha Received request header after adaptation and
            redirection (pre-cache REQMOD vectoring point).
            Usually differs from the request header sent by
            Squid, although most fields are often preserved.
            Optional header name argument as for >h

    RESPONSE

    [http::]<Hs HTTP status code received from the next hop
    [http::]>Hs HTTP status code sent to the client

    [http::]<h  Reply header. Optional header name argument
            as for >h

    [http::]mt  MIME content type


    SIZE COUNTERS

    [http::]st  Total size of request + reply traffic with client
    [http::]>st Total size of request received from client. Excluding chunked encoding bytes.
    [http::]<st Total size of reply sent to client (after adaptation)

    [http::]>sh Size of request headers received from client
    [http::]<sh Size of reply headers sent to client (after adaptation)

    [http::]<sH Reply high offset sent
    [http::]<sS Upstream object size

    [http::]<bs Number of HTTP-equivalent message body bytes
            received from the next hop, excluding chunked
            transfer encoding and control messages.
            Generated FTP listings are treated as
            received bodies.

    TIMING

    [http::]<pt Peer response time in milliseconds. The timer starts
            when the last request byte is sent to the next hop
            and stops when the last response byte is received.
    [http::]<tt Total time in milliseconds. The timer
            starts with the first connect request (or write I/O)
            sent to the first selected peer. The timer stops
            with the last I/O with the last peer.

Squid handling related format codes:

    Ss  Squid request status (TCP_MISS etc)
    Sh  Squid hierarchy status (DEFAULT_PARENT etc)

    [http::]request_attempts    Number of request forwarding attempts

        See forward_max_tries documentation that details what Squid counts
        as a forwarding attempt. Pure cache hits log zero, but cache hits
        that triggered HTTP cache revalidation log the number of attempts
        made when sending an internal revalidation request. DNS, ICMP,
        ICP, HTCP, ESI, ICAP, eCAP, helper, and other secondary requests
        sent by Squid as a part of a master transaction do not increment
        the counter logged for the received request.


The default formats available (which do not need re-defining) are:

logformat squid      %ts.%03tu %6tr %>a %Ss/%03>Hs %<st %rm %ru %[un %Sh/%<a %mt
logformat common     %>a %[ui %[un [%tl] "%rm %ru HTTP/%rv" %>Hs %<st %Ss:%Sh
logformat combined   %>a %[ui %[un [%tl] "%rm %ru HTTP/%rv" %>Hs %<st "%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh
logformat referrer   %ts.%03tu %>a %{Referer}>h %ru
logformat useragent  %>a [%tl] "%{User-Agent}>h"

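When a Squid proxy is used to funnel the outbound Internet access of internal servers, data security calls for watching the log for upload-related activity. The simplest case is an HTTP POST that sends a local file to the Internet: the request method is POST, the request size is large (the relevant field should be [http::]>st, "Total size of request received from client. Excluding chunked encoding bytes."), and the requested domain is a platform for sharing code, text, and files, such as pastebin.com.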
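
The default squid format does not log the request size, so spotting large uploads needs a custom format that includes %>st. The sketch below shows one possible layout as comments, followed by an awk filter for it. The format name upload, the log path, the column order, and the function name large_uploads are all assumptions made for illustration, and the access_log syntax should be checked against the documentation for the Squid version in use. HTTPS traffic that is not bumped appears only as CONNECT, so this method filter would not see it.

```shell
# Hypothetical squid.conf lines (format name and path are assumptions):
#   logformat upload %ts.%03tu %>a %rm %>st %<st %>Hs %ru
#   access_log daemon:/var/log/squid/upload.log upload
#
# Columns under that layout:
#   $1 time  $2 client IP  $3 method  $4 request bytes  $5 reply bytes  $6 status  $7 URL

# Print client, method, request size and URL for POST/PUT requests
# whose request size is at least the threshold given as $1 (bytes).
large_uploads() {
    awk -v min="$1" '($3 == "POST" || $3 == "PUT") && $4 + 0 >= min + 0 { print $2, $3, $4, $7 }'
}

# Usage sketch (not run here): uploads of 1 MiB or more to pastebin.com
#   large_uploads 1048576 < /var/log/squid/upload.log | grep 'pastebin\.com'
```

Keeping the size threshold as an argument makes it easy to tune once the normal upload volume of the environment is known.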

##### Step 1: Locating the Squid Logs

Squid logs are typically located in the /var/log/squid/ directory. The main log files are access.log, cache.log, and store.log.

$ cd /var/log/squid/
$ ls

##### Step 2: Understanding the Squid Logs

Each of the log files serves a different purpose:
* access.log: This file records all the requests processed by the Squid proxy server.
* cache.log: This is the main Squid log file where general information, warnings, and error messages are logged.
* store.log: This file records objects as they are saved to and released from the Squid cache; it is mainly useful for debugging.

##### Step 3: Monitoring the Squid Logs

You can use the tail command to monitor the logs in real-time:

$ tail -f /var/log/squid/access.log
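
When watching a busy proxy in real time, it often helps to narrow the stream to one kind of event. As a small sketch, the filter below keeps only denied requests; the function name only_denied is invented, and the native squid format is assumed, so the Squid result code is the first part of column 4.

```shell
# Pass through only the lines whose Squid result code starts with TCP_DENIED.
only_denied() {
    awk '$4 ~ /^TCP_DENIED/'
}

# Usage sketch (not run here):
#   tail -f /var/log/squid/access.log | only_denied
```

The same pattern works for any other result code, for example TCP_MISS, by changing the regular expression.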

##### Step 4: Analyzing the Squid Logs

To analyze the logs, you can use various command-line tools like grep, awk, cut, sort, uniq, etc. For example, to find the 10 most requested URLs (column 7 holds the full URL, not just the site), you can use the following command:

$ awk '{print $7}' /var/log/squid/access.log | sort | uniq -c | sort -nr | head -10
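
Counting per host rather than per full URL usually gives a more useful "top sites" list. The function below is one way to sketch that; the name top_hosts is invented, and it assumes the native squid format with the URL in column 7, either as scheme://host[:port]/path or, for CONNECT requests, as host:port.

```shell
# Count requests per host and print the N busiest (default 10).
top_hosts() {
    awk '{
        host = $7
        sub(/^[a-z]+:\/\//, "", host)   # strip the scheme, if any
        sub(/\/.*$/, "", host)          # strip the path
        sub(/:[0-9]+$/, "", host)       # strip the port
        count[host]++
    }
    END { for (h in count) print count[h], h }' | sort -k1,1nr -k2,2 | head -n "${1:-10}"
}

# Usage sketch (not run here):
#   top_hosts 10 < /var/log/squid/access.log
```

The second sort key only breaks ties by host name so that the output order is stable between runs.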

##### Step 5: Setting Up Log Rotation

To prevent the log files from growing too large, you can set up log rotation using the logrotate utility. You can create a new configuration file for Squid in the /etc/logrotate.d/ directory:

$ nano /etc/logrotate.d/squid

And add the following content:

/var/log/squid/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        /usr/sbin/squid -k rotate
    endscript
}

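This configuration rotates the logs daily, keeps 7 rotated copies, compresses the old ones, and runs squid -k rotate so that Squid closes and reopens its log files. When logrotate does the renaming, set logfile_rotate 0 in squid.conf; otherwise squid -k rotate also renames the files itself according to its own logfile_rotate count.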