Created: 2022-04-11 Mon Last modified: 2022-04-12 Tue
Split a File But Keep Headers
split is a powerful utility. Its -n (--number) option supports several chunking modes:
n | generate n files based on current size of input
k/n | output only kth of n to standard output
l/n | generate n files without splitting lines or records
l/k/n | likewise but output only kth of n to stdout
r/n | like ‘l’ but use round robin distribution
r/k/n | likewise but output only kth of n to stdout
The l/n option is almost suitable for splitting files in DSV (delimiter-separated values) formats (CSV, TSV, etc.). The only problem is preserving the header: it ends up only in the first chunk.
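To make the problem concrete, here is a minimal sketch, assuming GNU split and a made-up sample.csv (the file name, its contents, and the temporary directory are illustrative only). It splits the file into three line-aligned chunks and prints the first line of each.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample: one header line plus nine data rows.
{
    echo 'id,name'
    for i in $(seq 1 9); do echo "$i,row$i"; done
} > sample.csv

# Three chunks, never cutting a line in half.
split -d -n l/3 sample.csv sample.csv.part

# Show the first line of every chunk.
for f in sample.csv.part*; do
    printf '%s: %s\n' "$f" "$(head -n 1 "$f")"
done
```

Only sample.csv.part00 starts with the header line; the other chunks begin with a data row.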
There are several good ways to handle it:
Baeldung: Split a File With the Header Line. But the l/n option doesn't work with it: cannot determine file size.
StackOverflow: Split CSV files into smaller files but keeping the headers? Cumbersome: you need to take care of the first split filename.
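For comparison, here is a sketch of the line-count-based idea behind such solutions. It is not copied from either link and assumes GNU split with --filter; sample.csv and the chunk size of 4 lines are made up for illustration. The data rows are piped into split, and the filter writes the header before each chunk.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample: one header line plus nine data rows.
{
    echo 'id,name'
    for i in $(seq 1 9); do echo "$i,row$i"; done
} > sample.csv

# Feed only the data rows to split. -l counts lines, so reading from a pipe works.
# The filter runs once per chunk and writes the header before the chunk body.
tail -n +2 sample.csv |
    split -d -l 4 --filter='{ head -n 1 sample.csv; cat; } > "$FILE"' - sample.csv.part

# Show the first line of every chunk.
head -n 1 sample.csv.part*
```

The trade-off is that the chunk size is given as a number of lines rather than as a number of pieces.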
I generalized the idea a little.
The question is how to provide the header for all splits except the first one.
Here is a bash script called split_filter_header that should be in $PATH (I use ~/bin):
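#!/usr/bin/env bash
set -o errexit
set -o pipefail
set -o nounset

# Marker file: its presence means the first chunk has already been handled.
flag="$XDG_RUNTIME_DIR/.split_filter_header_was_called"

if [[ $# -ne 1 ]]; then
    echo "usage: split_filter_header --reset | <file>" >&2
    exit 1
fi

if [[ $1 == '--reset' ]]; then
    # -f: do not fail under errexit when the marker does not exist yet
    command rm -f -- "$flag"
    exit
fi

# Every call after the first one prints the header line of the input file.
if [[ -f $flag ]]; then
    head -n 1 "$1"
fi

# Record that the first chunk has been handled.
: > "$flag"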
Splitting with any options can then be performed in the following way:
$ split_filter_header --reset ; split http.csv -d -n l/9 --filter '(split_filter_header http.csv; cat) > $FILE' http.csv.part
split_filter_header --reset - no header for the first split
http.csv - input file
-d - numeric suffixes
-n l/9 - split file into 9 pieces without splitting lines
--filter '(split_filter_header http.csv; cat) > $FILE' - main part, prepend output with the header
http.csv.part - split prefix
The restriction is that you shouldn't run several such splits in parallel, because a split_filter_header --reset call affects the next split_filter_header <filename> call, and all runs share the same flag file.
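One possible way around that restriction is to give every run its own flag file. The sketch below is an assumption-laden variant, not the script above: SPLIT_SRC and SPLIT_FLAG are hypothetical environment variable names, sample.csv is made-up data, and the header logic is inlined into the filter instead of living in a helper on $PATH. It relies on split exporting the environment to the filter command and on GNU split running the filters one after another in l/n mode.

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample: one header line plus nine data rows.
{
    echo 'id,name'
    for i in $(seq 1 9); do echo "$i,row$i"; done
} > sample.csv

# Per-run state: each invocation gets a private directory for its flag,
# so concurrent runs cannot see each other's marker.
run_dir=$(mktemp -d)
export SPLIT_SRC="$workdir/sample.csv"   # hypothetical variable name
export SPLIT_FLAG="$run_dir/seen"        # hypothetical variable name

# First filter call: no flag yet, so no extra header (the chunk already has it).
# Later calls: the flag exists, so the header is written before the chunk body.
split -d -n l/3 --filter='{
    if [ -e "$SPLIT_FLAG" ]; then head -n 1 "$SPLIT_SRC"; fi
    : > "$SPLIT_FLAG"
    cat
} > "$FILE"' sample.csv sample.csv.part

rm -r -- "$run_dir"

# Show the first line of every chunk.
head -n 1 sample.csv.part*
```

Because the flag path is created per run, no --reset step is needed and two such splits can run side by side.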