Using the sort command to (quickly) sort/deduplicate (large) files


=Start=

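In my blog subscriptions I came across a post titled "Deduplicating large text files". The content was good, but the post did not link to the source of its code, so I searched for it myself, dug up more along the way, and am recording it here.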


Search keywords / reference links:

site:stackoverflow.com MAX_LINES_PER_CHUNK
http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file

http://vkundeti.blogspot.com/2008/03/tech-algorithmic-details-of-unix-sort.html

Search keywords / reference links:

sorting large files faster with a shell script

Reference answers:

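(Read the manual.) Use the -T option to specify the temporary directory manually; use the -S option to set how much memory sort may use; on a multi-core server, you can also use the --parallel option to set the number of concurrent sorts and speed things up. In some special cases you can even set the environment variable "LC_ALL=C" to improve speed (it avoids parsing UTF-8 text and applying complex collation rules).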

Look carefully at the options of sort to improve performance, and understand its impact on your machine and problem. Key parameters on Ubuntu are:

  • Location of temporary files -T directory_name
  • Amount of memory to use -S N% (N% of physical memory; the more the better, but avoid oversubscription that causes swapping to disk. For example, “-S 80%” uses up to 80% of RAM, and “-S 2G” uses 2 GB of RAM.)

The questioner asks “Why no high memory usage?” The answer comes from history: older Unix machines were small, so the default memory size is set small. Make it as large as your workload allows to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 times the size of the file being sorted.
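The two options above can be combined in a small wrapper. This is only a sketch, assuming GNU sort; the function name sort_with_scratch and the default scratch location are made up for illustration.

```shell
#!/bin/sh
# sort_with_scratch is a hypothetical helper name, not part of sort itself.
# Usage: sort_with_scratch INPUT OUTPUT [SCRATCH_PARENT]
sort_with_scratch() {
    input=$1
    output=$2
    # Assumption: point this at your fastest disk with enough free space.
    scratch_parent=${3:-${TMPDIR:-/tmp}}

    # Private scratch directory, so leftover temporary files are easy to remove.
    scratch=$(mktemp -d "$scratch_parent/sort.XXXXXX") || return 1

    # -T: where temporary files go; -S: buffer size (here 50% of physical memory).
    sort -T "$scratch" -S 50% -o "$output" "$input"
    status=$?

    rm -rf "$scratch"
    return "$status"
}
```

Calling `sort_with_scratch big.txt big.sorted /mnt/fast` (hypothetical paths) would keep the temporary files off the disk that holds the input.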

==

Buffer it in memory using -S. For example, to use (up to) 50% of your memory as a sorting buffer do:

sort -S 50% file

Note that modern Unix sort can sort in parallel. My experience is that it automatically uses as many cores as possible. You can set it directly using --parallel. To sort using 4 threads:

sort --parallel=4 file

So all in all, you should put everything into one file and execute something like:

sort -S 50% --parallel=4 file
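Rather than hard-coding 4, the thread count can be derived from the machine. The sketch below assumes GNU sort and the nproc utility from GNU coreutils; pick_threads and parallel_sort are hypothetical names, and the cap of 8 is a choice made for this example (the GNU documentation describes the default as the number of available processors, limited to 8).

```shell
#!/bin/sh
# Pick a thread count: use nproc when it exists, otherwise fall back to 1.
pick_threads() {
    n=$(nproc 2>/dev/null || echo 1)
    # Cap at 8; an arbitrary limit chosen for this sketch.
    if [ "$n" -gt 8 ]; then
        n=8
    fi
    echo "$n"
}

# parallel_sort is a hypothetical wrapper; extra arguments are passed to sort.
parallel_sort() {
    sort -S 50% --parallel="$(pick_threads)" "$@"
}
```

For example, `parallel_sort -o out.txt in.txt` sorts in.txt into out.txt; with no file argument it reads standard input.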

==

Using the sort command will probably be the fastest option.

But you’ll probably want to fix the locale to C.

sort -u doesn’t report unique lines, but one of each set of lines that sort the same. In the C locale, two different lines never sort the same, but that’s not the case in most UTF-8 based locales on GNU systems.

Also, using the C locale avoids the overhead of parsing UTF-8 and processing complex sort orders, so it can improve performance dramatically.

So:

LC_ALL=C sort -u file
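A small illustration of what byte-order sorting means in practice. The helper name dedup_bytes is made up for this sketch; the behaviour shown is simply that of sort -u in the C locale.

```shell
#!/bin/sh
# dedup_bytes is a hypothetical helper: sort and deduplicate by raw byte value.
dedup_bytes() {
    LC_ALL=C sort -u "$@"
}

# In the C locale, uppercase ASCII letters sort before lowercase ones,
# and "B" and "b" are distinct lines.
printf 'b\na\nB\na\n' | dedup_bytes
# prints:
# B
# a
# b
```

The output order differs from what a dictionary-style locale would give, which matters if a later step expects locale-aware ordering.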

You can also improve performance by using a faster drive (or a different drive from the one where the input and/or output files are) for the temporary files (using -T or the $TMPDIR environment variable), or by fiddling with the -S option supported by some sort implementations.

For some types of input or for slow storage, using the --compress-program option of GNU sort (for instance with lzop) might improve performance in addition to storage usage.
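As a rough sketch of the compression option, assuming GNU sort: the wrapper name compressed_sort is hypothetical, and gzip is swapped in for lzop only because it is more commonly installed. GNU sort runs the named program with no arguments to compress and with -d to decompress.

```shell
#!/bin/sh
# compressed_sort is a hypothetical wrapper name.
# Temporary files are compressed with gzip (substituted here for lzop).
compressed_sort() {
    LC_ALL=C sort --compress-program=gzip "$@"
}
```

Compression only comes into play once the data no longer fits in the sort buffer and temporary files are written, so small inputs will show no difference.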

=EOF=


2 comments on “Using the sort command to (quickly) sort/deduplicate (large) files”

  1. SHELL MAGIC: SET OPERATIONS WITH UNIQ
    http://blog.deadvax.net/2018/05/29/shell-magic-set-operations-with-uniq/
    `
    Implementing various set operations with the uniq command

    $ more *
    ::::::::::::::
    a_list
    ::::::::::::::
    a
    b
    c
    d
    ::::::::::::::
    b_list
    ::::::::::::::
    a
    c
    d
    e

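    # Union
    $ cat a_list b_list | sort | uniq
    # Intersection: lines with count = 2 are in both files (uniq -d prints only those)
    $ cat a_list b_list | sort | uniq -c
    # Difference: b_list is listed twice so the counts tell the two files apart
    $ cat a_list b_list b_list | sort | uniq -c
    # count = 3: present in both files
    # count = 2: present in b_list only
    # count = 1: present in a_list only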
    `
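The same set operations can be made to print only the resulting lines, without counts, by using uniq -d and uniq -u. This is a sketch under one assumption: neither file contains duplicate lines of its own. The function names are invented for illustration.

```shell
#!/bin/sh
# Assumes each input file has no internal duplicates; run sort -u on it first if unsure.

# Union: every line that appears in either file.
set_union() { cat "$1" "$2" | LC_ALL=C sort -u; }

# Intersection: lines that appear twice in the combined stream.
set_intersection() { cat "$1" "$2" | LC_ALL=C sort | uniq -d; }

# Difference ($1 minus $2): $2 is listed twice so none of its lines can be unique.
set_difference() { cat "$1" "$2" "$2" | LC_ALL=C sort | uniq -u; }
```

With the a_list and b_list files shown above, `set_intersection a_list b_list` prints a, c and d, and `set_difference a_list b_list` prints b.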
