CSV to JSON

Today at work, I had to process a bunch of CSV data. Realizing that I don’t have any particularly nice tools to work with streaming CSV data (although I did write about querying CSV files with SQL), I decided to write one:

$ cat users.csv

"user_id","name","email","password"
"1","Luke Skywalker","[email protected]","$2b$12$XQ1zDvl5PLS6g.K64H27xewPQMnkELa3LvzFSyay8p9kz0XXHVOFq"
"2","Han Solo","[email protected]","$2b$12$eKJGP.tt9u77PeXgMMFmlOyFWSuRZBUZLvmzuLlrum3vWPoRYgr92"

$ cat users.csv | csv2json | jq '.'

{
  "password": "$2b$12$XQ1zDvl5PLS6g.K64H27xewPQMnkELa3LvzFSyay8p9kz0XXHVOFq",
  "name": "Luke Skywalker",
  "user_id": "1",
  "email": "[email protected]"
}
{
  "password": "$2b$12$eKJGP.tt9u77PeXgMMFmlOyFWSuRZBUZLvmzuLlrum3vWPoRYgr92",
  "name": "Han Solo",
  "user_id": "2",
  "email": "[email protected]"
}

Luckily, Python has nice CSV and JSON libraries built in:

# If --parse is set, try to parse each entry as json
def parse(thing):
    try:
        return json.loads(thing)
    except:
        return thing

# Assume that headers are the first row
reader = csv.reader(sys.stdin)
headers = None
for row in reader:
    if not headers:
        headers = row
        continue

    if '--parse' in sys.argv:
        row = [parse(thing) for thing in row]

    # Recombine the headers with each row (no one said this was saving space)
    row = dict(zip(headers, row))
    print(json.dumps(row, default = str))

Basically, assume that the first row of the CSV data is headers (converting to a JSON dict doesn’t make much sense if it isn’t) and then combine that with each additional row to write out a dictionary. In addition, I put a bit of code in there to assume that you might be storing JSON in your CSV fields. If so, you can --parse the data automagically:

$ cat users-with-preferences.csv | csv2json --parse | jq '.'

{
  "preferences": {
    "force-user": true
  },
  "name": "Luke Skywalker",
  "user_id": 1,
  "password": "$2b$12$XQ1zDvl5PLS6g.K64H27xewPQMnkELa3LvzFSyay8p9kz0XXHVOFq",
  "email": "[email protected]"
}
{
  "preferences": {
    "ship": "Millennium Falcon"
  },
  "name": "Han Solo",
  "user_id": 2,
  "password": "$2b$12$eKJGP.tt9u77PeXgMMFmlOyFWSuRZBUZLvmzuLlrum3vWPoRYgr92",
  "email": "[email protected]"
}

Note that the user_ids are actually numbers, the preferences field has been unpacked, and Luke’s force-user status is a boolean. It’s neat how you get all of that more or less for free.

Also, have I mentioned how nice jq is for working with JSON?

And that’s it. The full source is part of my dotfiles now: github:jpverkamp/dotfiles (although all that’s missing above is the shebang and the imports). Enjoy!