Split a file with headers

I have a bunch of files with Arabic content that I need to split into chunks so they can be processed in parallel[1]. But by default, when I open them in a text editor, the encoding changes from windows-1256 to utf-8[2]. I could use the Unix split command to break them into chunks, but then only the first chunk would keep the header row, and I need the headers in every chunk. So… how do I fix all this?

Write a script!

I started with this answer from Stack Overflow, cleaned it up, and added a few features I need:


#!/usr/bin/env bash

# Based on:
# https://stackoverflow.com/questions/37386246/split-large-csv-file-and-keep-header-in-each-part/45384974#45384974

# Usage: split-with-headers FILE LINES_PER_CHUNK
# Note: the arguments are not validated, so this is not secure
file=$1
size=$2

tempdir=$(mktemp -d)

# Split header and data
head -n 1 "$file" > "$tempdir/header"
tail -n +2 "$file" > "$tempdir/data"

pushd "$tempdir" > /dev/null
    # Break into chunks
    split -l "$size" data chunk
    rm data

    # Put them back together, prepending the header to each chunk
    for part in "$tempdir"/chunk*
    do
        cat "$tempdir/header" "$part" > "$part-$file"
        rm "$part"
    done
popd > /dev/null

# Pull them here
mv "$tempdir"/chunk*"$file" .

rm -rf "$tempdir"

I use mktemp so the intermediate files don’t clutter my working directory. It’s not perfect, since the output files will always be named chunk[a-z]{2}-... but that’s fine. It does what I need it to do.
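To make the naming concrete, here is a minimal sketch of the same head / tail / split steps wrapped in a function and run on a throwaway file. The function name split_with_headers_demo and the sample.csv contents are made up for illustration, and the sketch pipes the data straight into split rather than writing an intermediate data file.

```bash
#!/usr/bin/env bash
# Sketch only: the same head/tail/split steps as the script, in a function.
# split_with_headers_demo and sample.csv are hypothetical names.
split_with_headers_demo() {
    local file=$1 size=$2
    local tempdir part
    tempdir=$(mktemp -d)

    # First line is the header; everything else is data to be chunked
    head -n 1 "$file" > "$tempdir/header"
    tail -n +2 "$file" | split -l "$size" - "$tempdir/chunk"

    # Prepend the header to each chunk and write it to the current directory
    for part in "$tempdir"/chunk*; do
        cat "$tempdir/header" "$part" > "./$(basename "$part")-$file"
    done
    rm -rf "$tempdir"
}

# Throwaway working directory with a header plus five data rows
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n' > sample.csv

split_with_headers_demo sample.csv 2
ls chunk*
```

With five data rows and a chunk size of 2, this should leave chunkaa-sample.csv, chunkab-sample.csv and chunkac-sample.csv, each starting with the id,name header. Since head, tail, split and cat only look for newline bytes, the windows-1256 content passes through without being re-encoded.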

><> for f in *; split-with-headers $f 5000; end
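That loop is fish syntax. For bash users, a rough equivalent might look like the sketch below. The helper name split_all is hypothetical, and it assumes split-with-headers is on the PATH; the optional third argument exists only so a different command can be swapped in.

```bash
#!/usr/bin/env bash
# Sketch only: bash version of the fish loop above.
# split_all is a hypothetical helper name; the splitter defaults to
# split-with-headers, which is assumed to be on PATH.
split_all() {
    local dir=$1 size=$2 splitter=${3:-split-with-headers}
    (
        # Run inside the directory so the splitter sees bare file names
        cd "$dir" || exit 1
        for f in *; do
            # Skip directories and anything else that is not a regular file
            [ -f "$f" ] || continue
            "$splitter" "$f" "$size"
        done
    )
}

# Example: split_all . 5000
```

The glob is expanded once before the loop starts, so the freshly written chunk files are not picked up and split again during the same run. Running it a second time in the same directory would re-split them, though.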

Source on GitHub

Hopefully someone else will find this useful.

  1. I know that this should probably be done at the app level.

  2. I also know I could just turn this off.