Today at work, I had to process a bunch of CSV data. Realizing that I don’t have any particularly nice tools to work with streaming CSV data (although I did write about querying CSV files with SQL), I decided to write one:
$ cat users.csv
"user_id","name","email","password"
"1","Luke Skywalker","[email protected]","$2b$12$XQ1zDvl5PLS6g.K64H27xewPQMnkELa3LvzFSyay8p9kz0XXHVOFq"
"2","Han Solo","[email protected]","$2b$12$eKJGP.tt9u77PeXgMMFmlOyFWSuRZBUZLvmzuLlrum3vWPoRYgr92"
$ cat users.csv | csv2json | jq '.'
{
  "password": "$2b$12$XQ1zDvl5PLS6g.K64H27xewPQMnkELa3LvzFSyay8p9kz0XXHVOFq",
  "name": "Luke Skywalker",
  "user_id": "1",
  "email": "[email protected]"
}
{
  "password": "$2b$12$eKJGP.tt9u77PeXgMMFmlOyFWSuRZBUZLvmzuLlrum3vWPoRYgr92",
  "name": "Han Solo",
  "user_id": "2",
  "email": "[email protected]"
}
Luckily, Python has nice CSV and JSON libraries built in:
import csv
import json
import sys

# If --parse is set, try to parse each entry as JSON
def parse(thing):
    try:
        return json.loads(thing)
    except ValueError:
        return thing

# Assume that the headers are the first row
reader = csv.reader(sys.stdin)
headers = None
for row in reader:
    if not headers:
        headers = row
        continue

    if '--parse' in sys.argv:
        row = [parse(thing) for thing in row]

    # Recombine the headers with each row (no one said this was saving space)
    row = dict(zip(headers, row))
    print(json.dumps(row, default=str))
Basically, assume that the first row of the CSV data is the headers (converting to a JSON dict doesn’t make much sense otherwise) and then zip that together with each additional row to produce a dictionary. In addition, I put in a bit of code for the case where you might be storing JSON in your CSV fields. If so, you can --parse the data automagically:
$ cat users-with-preferences.csv | csv2json --parse | jq '.'
{
  "preferences": {
    "force-user": true
  },
  "name": "Luke Skywalker",
  "user_id": 1,
  "password": "$2b$12$XQ1zDvl5PLS6g.K64H27xewPQMnkELa3LvzFSyay8p9kz0XXHVOFq",
  "email": "[email protected]"
}
{
  "preferences": {
    "ship": "Millennium Falcon"
  },
  "name": "Han Solo",
  "user_id": 2,
  "password": "$2b$12$eKJGP.tt9u77PeXgMMFmlOyFWSuRZBUZLvmzuLlrum3vWPoRYgr92",
  "email": "[email protected]"
}
Note that the user_ids are actually numbers, the preferences field has been unpacked, and Luke’s force-user status is a boolean. It’s neat how you get all of that more or less for free.
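All of that comes from the parse helper’s fallback: anything that is valid JSON comes back typed, and anything else stays a string. A quick standalone illustration, mirroring the helper above:

```python
import json

def parse(thing):
    # Try to interpret the field as JSON; otherwise keep the raw string
    try:
        return json.loads(thing)
    except ValueError:
        return thing

print(parse('1'))                     # 1, an int
print(parse('true'))                  # True, a bool
print(parse('{"force-user": true}'))  # {'force-user': True}
print(parse('Luke Skywalker'))        # not valid JSON, stays a string
```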
Also, have I mentioned how nice jq is for working with JSON?
And that’s it. The full source is part of my dotfiles now: github:jpverkamp/dotfiles (although all that’s missing above is the shebang). Enjoy!