=Start=
缘由:
简单记录一下,方便后面有需要的时候参考。
正文:
参考解答:
思路整理:
* 先将「非特定字符」替换成「空」,然后再计算字符串的剩余长度:
* 对于单个字符来说,原始字符串长度和替换后的字符串长度的差值就是单个字符出现的次数;
* 对于子串来说,原理一样只是最后用差值除以子串的长度得到子串出现的次数,还有就是对于 Hive SQL 来说,计算字符串长度的函数是 length 而不是 len
# Counting the occurrences of a character
SELECT
LEN(summary) - LEN(REPLACE(summary, 'x', '')) AS occurrences
FROM article
# Counting the occurrences of a substring of any length
If we don’t know the length of the substring, the expression to use is a bit more complex:
SELECT
(LEN(summary) - LEN(REPLACE(summary, 'France',''))) / LEN('France') AS occurrences
FROM article
# Case-sensitivity
Single character, case insensitive:
SELECT
LEN(summary) - LEN(REPLACE(LOWER(summary), LOWER('x'), '')) AS occurrences
FROM article
Any number of characters, case insensitive:
SELECT
(LEN(summary) - LEN(REPLACE(LOWER(summary), LOWER('France','')))) / LEN('France') AS occurrences
FROM article
# Performance
In the examples above, the expressions should be relatively fast (though, of course, it can be a problem when applied to a huge number of rows).
Nothing prevents us to use those expressions in a WHERE clause or in an ORDER BY clause, but this may heavily affect performance. The reason is that, if there is an index on the summary column, it cannot be used because the query optimizer cannot make assumptions on the results of the expressions.
If it is really necessary, this problem can be solved by building a computer column and an index on it, but this is out of the scope of this article.
Hive SQL中的字符串处理函数说明
replace(string A, string OLD, string NEW)
Returns the string A with all non-overlapping occurrences of OLD replaced with NEW (as of Hive 1.3.0 and 2.1.0).
Example:
select replace("ababab", "abab", "Z");
returns "Zab".
==
length(string A)
Returns the length of the string.
参考链接:
SQL Server: How to count occurrences of a substring in a string
https://sql-bits.com/sql-server-how-to-count-occurrences-of-a-substring-in-a-string/
Hive – Count number of occurences of character
https://stackoverflow.com/questions/36736022/hive-count-number-of-occurences-of-character
Hive LanguageManual UDF
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
=END=
《 “如何用SQL查看某个字符串中某个字符出现的次数” 》 有 2 条评论
Hive SQL-字符串替换函数
https://www.dataisfun.cn/?p=22
`
1. replace()
**替换单个字符或子字符串**,replace(original-string, search-string, replace-string),实例:
–将字符串中所有的 e 全部替换为 x
select
‘show me the code’ as string1,
replace(‘show me the code’, ‘e’, ‘x’) as string2
–将字符串中的子字符串 code 删除
select
‘show me the code’ as string1,
replace(‘show me the code’, ‘code’,”) as string2
2. translate()
**分别对应替换单个字符**,translate(string input, string from, string to),实例:
–将字符串中的 s m t c 分别替换为 a b c d
select
‘show me the code’ as string1,
translate(‘show me the code’, ‘smtc’,’abcd’) as string2
–将字符串中的 s m t c 都删除
select
‘show me the code’ as string1,
translate(‘show me the code’, ‘smtc’,”) as string2
3. regexp_replace()
**替换符合正则表达式的字符或子字符串**,regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT),实例:
–将字符串中的数字及多余空格都删除
select
‘show me the 404 code’ as string1,
regexp_replace(‘show me the 404 code’, ‘\\d+ ‘, ”) as string2
–将字符串中的换行符删除
select
address as address1,
regexp_replace(address, ‘\\n’, ”) as address2
from customer_info
`
`
除了用replace特定字符之后再计算字符串长度的差值来得到特定字符的个数之外,还可以通过使用split字符串切分的方式,然后用size获取切分后数组大小来得到特定字符串的个数。比如:
,size(split(query_tables,’,’)) as query_table_cnt
`