Sorting large files faster with a shell script
Look carefully at the options of sort to improve performance, and understand their impact on your machine and your problem. The key parameters on Ubuntu are:
- Location of temporary files: -T directory_name
- Amount of memory to use: -S SIZE. SIZE can be a percentage of physical memory, as in "-S 80%", or an absolute amount, as in "-S 2G" for 2 GB of RAM. More is generally better, but avoid oversubscribing memory, which causes swapping to disk.
The questioner asks, "Why no high memory usage?" The answer is historical: older Unix machines were small, so the default memory size is set small. Raise it as far as your workload allows to greatly improve sort performance. Point the temporary directory (-T) at a location on your fastest device that has enough free space to hold at least 1.25 times the size of the file being sorted.
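As a rough illustration of the two settings together, here is a minimal sketch that checks for about 1.25 times the input size in free space before sorting. The directory and file names are made up for the demonstration (a throwaway directory from mktemp stands in for your fastest drive), and it assumes a sort that accepts -S with a percentage, such as GNU sort.

```shell
set -e

# Hypothetical locations: a throwaway directory stands in for a fast drive.
workdir=$(mktemp -d)
input="$workdir/input.txt"
output="$workdir/sorted.txt"

# Small sample input so the sketch runs anywhere.
printf '%s\n' pear apple fig banana > "$input"

# Require roughly 1.25 x the input size (in KiB) free in the temp directory.
input_kb=$(( ( $(wc -c < "$input") + 1023 ) / 1024 ))
needed_kb=$(( input_kb * 5 / 4 + 1 ))
free_kb=$(df -Pk "$workdir" | awk 'NR == 2 { print $4 }')

if [ "$free_kb" -lt "$needed_kb" ]; then
    echo "not enough free space in $workdir" >&2
    exit 1
fi

# -S sets the buffer size, -T the temp directory, -o the output file.
sort -S 50% -T "$workdir" -o "$output" "$input"
```

For a real workload, replace the sample input with your file and set workdir to a directory on the fast device.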
Buffer it in memory using -S. For example, to use (up to) 50% of your memory as a sorting buffer do:
sort -S 50% file
Note that modern GNU sort can sort in parallel. In my experience it uses multiple cores automatically (by default, GNU sort uses the number of available processors, capped at 8). You can set the thread count directly with --parallel. To sort using 4 threads:
sort --parallel=4 file
So all in all, you should put everything into one file and execute something like:
sort -S 50% --parallel=4 file
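If you would rather not hard-code the thread count, one possible variation is to ask the system for it. The sketch below assumes GNU coreutils (nproc and sort --parallel); the part1.txt and part2.txt files are hypothetical stand-ins for the pieces being combined.

```shell
set -e

# Hypothetical input pieces in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' 3 1 2 > "$workdir/part1.txt"
printf '%s\n' 6 5 4 > "$workdir/part2.txt"

# Put everything into one file first.
cat "$workdir/part1.txt" "$workdir/part2.txt" > "$workdir/all.txt"

# nproc reports the usable cores; fall back to 1 thread if it is missing.
threads=$(nproc 2>/dev/null || echo 1)

# Same command as above, with the thread count taken from the machine.
sort -S 50% --parallel="$threads" -o "$workdir/all.sorted.txt" "$workdir/all.txt"
```

On small inputs like this one the extra threads make no visible difference; the gain only shows up on large files.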
Using the sort command will probably be the fastest option.
But you’ll probably want to fix the locale to C.
sort -u doesn't report unique lines, but one of each set of lines that sort the same. In the C locale, two different lines never sort the same, but that's not the case in most UTF-8 based locales on GNU systems.
Also, using the C locale avoids the overhead of parsing UTF-8 and processing complex sort orders, so it can improve performance dramatically.
LC_ALL=C sort -u file
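To see what the C locale does to ordering, here is a small sketch on a made-up sample file (letters.txt is a hypothetical name). In the C locale lines are compared byte by byte, so uppercase ASCII letters sort before lowercase ones and only byte-identical lines are collapsed by -u.

```shell
set -e

# Hypothetical sample with mixed case and one duplicate line.
workdir=$(mktemp -d)
printf '%s\n' b A a B a > "$workdir/letters.txt"

# Byte-wise comparison: 'A' and 'B' come before 'a' and 'b'.
unique_lines=$(LC_ALL=C sort -u "$workdir/letters.txt")
printf '%s\n' "$unique_lines"
# prints:
# A
# B
# a
# b
```

Setting LC_ALL only on the sort command line, as here, leaves the locale of the rest of the script untouched.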
You can also improve performance by using a faster drive (or a different drive from the one holding the input and/or output files) for the temporary files (using -T or the $TMPDIR environment variable), or by tuning the -S option supported by some sort implementations.
For some types of input, or for slow storage, using the --compress-program option of GNU sort (for instance with lzop) might improve performance as well as reduce temporary storage usage.
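A minimal sketch of the compression option, assuming GNU sort. It prefers lzop when it is installed and falls back to gzip, which is my choice for the example rather than anything the answers above prescribe; the file names are also made up. GNU sort only writes temporary files when the input does not fit in its buffer, so on a tiny sample like this the compressor is never actually run.

```shell
set -e

# Hypothetical sample input in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' delta alpha charlie bravo > "$workdir/input.txt"

# Prefer lzop if available; otherwise fall back to gzip (an assumption here).
if command -v lzop >/dev/null 2>&1; then
    compressor=lzop
else
    compressor=gzip
fi

# Temporary files, if any are needed, go to $workdir and are compressed.
LC_ALL=C sort -T "$workdir" --compress-program="$compressor" \
    -o "$workdir/sorted.txt" "$workdir/input.txt"
```

Whether compression helps depends on how compressible the data is and how slow the temporary storage is, so it is worth timing both variants on a sample of the real input.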