ClickHouse中的一些功能点学习


=Start=

缘由:

整理记录一下最近使用ClickHouse的经验,方便以后有需要的时候参考。

正文:

参考解答:
0. 借助 clickhouse-cli 这个工具进行命令行操作
$ pip3 install clickhouse-cli

$ vim ~/.clickhouse-cli.rc

$ clickhouse-cli
command not found: clickhouse-cli

# 我的环境是 macOS Catalina 10.15.6 ,用pip3安装成功,但是在 $PATH 里面找不到这个程序
# 后来发现是在 $HOME 目录下的 Library 的 Python 目录中
$ ls -lt ./Library/Python/3.8/bin

$ ~/Library/Python/3.8/bin/clickhouse-cli
1. 一些环境熟悉操作/命令
$ ~/Library/Python/3.8/bin/clickhouse-cli
clickhouse-cli version: 0.3.6
Connecting to x.x.x.x:80
Connected to ClickHouse server v19.13.1.

 :) help

clickhouse-cli's custom commands:
---------------------------------
USE     Change the current database.
SET     Set an option for the current CLI session.
QUIT    Exit clickhouse-cli.
HELP    Show this help message.

PostgreSQL-like custom commands:
--------------------------------
\l      Show databases.
\c      Change the current database.
\d, \dt Show tables in the current database.
\d+     Show table's schema.
\ps     Show current queries.
\kill   Kill query by its ID.

Query suffixes:
---------------
\g, \G  Use the Vertical format.
\p      Enable the pager.
 :)

# 查看有哪些数据库
show databases
\l

# 切换数据库
use db_name

# 查看当前数据库有哪些表
show tables
\d
\dt

# 查看特定数据表的表结构
describe table table_name
\d+ table_name
2. 时间日期函数
WITH
    toDate('2019-01-01') AS date,
    toDateTime('2019-01-01 00:00:00') AS date_time
SELECT
    date,
    subtractYears(date, 1) AS subtract_years_with_date,
    date_time,
    subtractYears(date_time, 1) AS subtract_years_with_date_time
;
3. 字符串处理函数
SELECT splitByChar(',', '1,2,3,abcde');
SELECT splitByString(', ', '1, 2 3, 4,5, abcde');

-- arrayStringConcat() 数组元素拼接的功能和Hive中的 concat_ws() 类似。

WITH
	splitByChar(',', '1,2,3,abcde') as arr1
select
	arr1,
	arrayStringConcat(arr1, '#') as arr1str
;

┌─arr1──────────────────┬─arr1str─────┐
│ ['1','2','3','abcde'] │ 1#2#3#abcde │
└───────────────────────┴─────────────┘
-- 字符串查找
position() / locate()

-- 字符串替换
SELECT replaceRegexpAll('Hello, World!', '.', '\\0\\0') AS res

┌─res────────────────────────┐
│ HHeelllloo,,  WWoorrlldd!! │
└────────────────────────────┘

-- 字符串提取
substring() / extractallgroups()
4. 常用的一些聚合功能
-- Hive中的 collect_set 功能,将某一列中的value取值转换成一个 list/set 在 ClickHouse 中可以通过 groupUniqArray() 来实现。
select user_id,
       count(1) as cnt,
       count(distinct user_agent) as ua_cnt,
       groupUniqArray(toDate(timestamp))
  from table_name
 where status = 200
   and http_host = 'domain_name'
   and uri = '/path/'
   and args like '%keyword%'
   and user_id in ('user1', 'user2', 'user3', 'user4', 'user5')
   and timestamp between '2020-06-01 00:00:00' and '2020-07-31 23:59:59'
 group by user_id
 limit 20000
;

┌─user_id─────┬─cnt─┬─ua_cnt─┬─groupUniqArray(toDate(timestamp))─────────────────────┐
│ user1       │  32 │      1 │ ['2020-07-03','2020-07-09','2020-07-02','2020-07-08'] │
└─────────────┴─────┴────────┴───────────────────────────────────────────────────────┘
┌─user_id─┬─cnt─┬─ua_cnt─┬─groupUniqArray(toDate(timestamp))────────┐
│ user4   │  18 │      1 │ ['2020-07-06','2020-07-03','2020-07-09'] │
└─────────┴─────┴────────┴──────────────────────────────────────────┘
┌─user_id─┬─cnt─┬─ua_cnt─┬─groupUniqArray(toDate(timestamp))─┐
│ user2   │   4 │      1 │ ['2020-07-13']                    │
└─────────┴─────┴────────┴───────────────────────────────────┘


-- Hive中的 count(distinct col_name) 在 ClickHouse 中可以使用 uniq(col_name) 或者 uniqExact(col_name) 来实现。
5. 其他功能
  • URL相关函数
  • JSON处理相关函数
  • IP处理相关函数

参考链接:

https://github.com/hatarist/clickhouse-cli

https://clickhouse.tech/docs/en/sql-reference/functions/date-time-functions/

Is there any function like hive’s concat_ws or explode or collect_list/collect_array #6664
https://github.com/ClickHouse/ClickHouse/issues/6664

https://clickhouse.tech/docs/en/sql-reference/functions/splitting-merging-functions/

https://clickhouse.tech/docs/en/sql-reference/functions/string-search-functions/#position

https://clickhouse.tech/docs/en/sql-reference/functions/string-functions/#substring

https://clickhouse.tech/docs/en/sql-reference/functions/splitting-merging-functions/#extractallgroups

https://clickhouse.tech/docs/en/sql-reference/functions/url-functions/

https://clickhouse.tech/docs/en/sql-reference/functions/json-functions/

https://clickhouse.tech/docs/en/sql-reference/functions/ip-address-functions/#ipv4stringtonums

https://clickhouse.tech/docs/zh/sql-reference/statements/select/array-join/

=END=


《 “ClickHouse中的一些功能点学习” 》 有 7 条评论

  1. — base64 编解码
    `
    with
    ‘/search/q-5Y+I5piv5LiA5Liq5LiL6Zuo5aSp’ as args
    select
    args,
    replaceOne(args, ‘/search/q-‘, ”) as base64ed_query,
    tryBase64Decode(replaceOne(args, ‘/search/q-‘, ”)) as query_str
    ;

    /*
    — 输出
    ┌─args───────────────────────────────────┬─base64ed_query───────────────┬─query_str──────┐
    │ /search/q-5Y+I5piv5LiA5Liq5LiL6Zuo5aSp │ 5Y+I5piv5LiA5Liq5LiL6Zuo5aSp │ 又是一个下雨天 │
    └────────────────────────────────────────┴──────────────────────────────┴────────────────┘
    */

    select
    base64Encode(‘又是一个下雨天’) — 5Y+I5piv5LiA5Liq5LiL6Zuo5aSp
    ;
    `

  2. ClickHouse 中如何获取array中的倒数第1个元素
    `
    — 获取 arr 中的最后1个元素
    arr[-1]
    arr[length(arr)]
    `

    https://clickhouse.tech/docs/en/sql-reference/functions/splitting-merging-functions/
    `
    ClickHouse 里面有字符串切分函数:
    splitByChar($separator, $s)
    splitByString($separator, $s)

    但是这里的 $separator 就是一个 字符/字符串 ,不像 Hive 里的 split 函数中可以是一个正则表达式(从而支持多分隔符的切分),但是也有变通的办法来支持,先 replace 再 splitByChar 实现。
    `

    https://clickhouse.tech/docs/en/sql-reference/functions/array-functions/
    `
    –返回arr_name这个数组的长度
    length(arr_name)

    — 获取 arr 中的第 n 个元素(数组下标是从 1 开始计算的)
    arrayElement(arr, n)
    arr[n]
    — 获取 arr 中的最后1个元素
    arr[-1]
    arr[length(arr)]
    `

  3. https://clickhouse.tech/docs/en/sql-reference/functions/string-replace-functions/
    `
    — 替换 haystack 字符串中的 常量字符串-pattern 为 常量字符串-replacement
    replace(haystack, pattern, replacement)

    — 支持 正则表达式 的字符串替换(pattern可以用正则表达式来指定)
    replaceRegexpAll(haystack, pattern, replacement)

    // UserAgent解析
    select
    ua
    , ua_array[2] as os_version // get the 2nd item of array
    , ua_array[-1] as browser_ver // get the last item of array
    from
    (select
    timestamp
    , http_user_agent as ua
    , splitByChar(‘(‘, replace(http_user_agent, ‘)’, ‘(‘)) as ua_array
    from
    table_name
    where
    status = 200
    and timestamp between ‘2020-11-16 00:00:00’ and ‘2020-11-16 23:59:59’
    limit 10
    )x1
    ;
    `

  4. ClickHouse中如何删除首尾的特殊字符
    https://clickhouse.com/docs/en/sql-reference/functions/string-functions/#trim
    `
    # 方法一
    # trim
    Removes all specified characters from the start or end of a string. By default removes all consecutive occurrences of common whitespace (ASCII character 32) from both ends of a string.

    语法:
    trim([[LEADING|TRAILING|BOTH] trim_character FROM] input_string)

    样例:
    ┌─trim(BOTH ‘ ()’ FROM ‘( Hello, world! )’)─┐
    │ Hello, world! │
    └───────────────────────────────────────────────┘

    # 方法二
    # replaceRegexpAll(haystack, pattern, replacement)
    # 正则替换
    # username 字段长这样 [“user123”] 现在想去掉双引号和空格
    ,trim(BOTH ‘ [“]’ FROM username) as user1
    ,replaceRegexpAll(username, ‘\\[|”|\\]’, ”) as user2
    `

  5. ClickHouse中 对日期时间做操作的一些样例
    https://clickhouse.com/docs/en/sql-reference/functions/date-time-functions/
    `
    # clickhouse中 如何获取当前的日期时间
    now()

    ## 样例
    SELECT now();
    ┌───────────────now()─┐
    │ 2020-10-17 07:42:09 │
    └─────────────────────┘

    # clickhouse中 如何获取当前的日期时间对应的时间戳

    toUnixTimestamp(datetime)
    toUnixTimestamp(str, [timezone])

    ## 样例
    toUnixTimestamp(now())
    SELECT toUnixTimestamp(‘2017-11-05 08:07:47’, ‘Asia/Tokyo’) AS unix_timestamp

    # clickhouse中 如何获取 今天/昨天/明天 的日期
    ,today() –今天
    ,toDate(now()) –今天
    ,today()-1 –昨天
    ,today()+1 –明天

    # clickhouse中 如何将时间戳转换成日期时间的字符串

    SELECT FROM_UNIXTIME(423543535); #1983-06-04 10:58:55
    SELECT FROM_UNIXTIME(1234334543, ‘%Y-%m-%d %R:%S’) AS DateTime; #2009-02-11 14:42:23

    ## 如果时间戳是毫秒级别的,需要先除1000然后做个类型转换才行
    FROM_UNIXTIME(toInt32(ck_timestamp/1000)) as ck_time
    `

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注