Go is faster than Python? (an example parsing huge JSON logs)

Recently at work I came across a problem where I had to go through a year’s worth of logs and correlate two different fields across all of our requests. On the plus side, we have the logs stored as JSON objects (archived from Datadog, which collects them). On the down side… it’s kind of a huge amount of data. Not as much as I’ve dealt with at previous jobs or in some academic problems, but we’re still talking on the order of terabytes.

On one hand, I could just write up a quick Python script, fire and forget. It takes maybe ten minutes to write the code and (for this specific example) half an hour to run it on the cloud instance the logs lived on. So we’ll start with that. But then I got to thinking… Python is supposed to be super slow, right? Can I do better?

(Note: this problem is mostly disk bound, so for the most part Python actually does just fine.)
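
For a rough sense of the Python approach, here is a minimal sketch of the kind of script I mean: stream through JSON-lines log files and collect the values of two fields per request. (The field names request_id and user_id are just stand-ins here; the real fields are specific to our logs.)

import gzip
import json
import sys
from collections import defaultdict

def correlate(paths, field_a="request_id", field_b="user_id"):
    """Collect which values of field_b show up alongside each value of field_a."""
    pairs = defaultdict(set)
    for path in paths:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip malformed lines instead of dying mid-run
                a, b = record.get(field_a), record.get(field_b)
                if a is not None and b is not None:
                    pairs[a].add(b)
    return pairs

if __name__ == "__main__":
    for key, values in correlate(sys.argv[1:]).items():
        print(key, sorted(values))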

read more...


A simple Flask Logging/Echo Server

A very simple server that can be used to catch all incoming HTTP requests and just echo them back + log their contents. I needed it to test what a webhook actually returned to me, but I’m sure that there are a number of other things it could be dropped in for.

It will take in any GET/POST/PATCH/DELETE HTTP request with any path/params/data (optionally JSON), pack that data into a JSON object, log it to a file (with a UUID1-based name), and return the same object to the requester.
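
The core of it comes down to a single catch-all Flask route. Something along these lines (a simplified sketch of the idea, not necessarily the exact code from the post) covers most of it:

import json
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PATCH", "DELETE"])
@app.route("/<path:path>", methods=["GET", "POST", "PATCH", "DELETE"])
def echo(path):
    # Pack everything interesting about the request into one JSON-able dict
    record = {
        "method": request.method,
        "path": path,
        "args": request.args.to_dict(),
        "json": request.get_json(silent=True),
        "data": request.get_data(as_text=True),
        "headers": dict(request.headers),
    }
    # Log it to a uuid1-named file and echo it back to the caller
    with open(f"{uuid.uuid1()}.json", "w") as f:
        json.dump(record, f)
    return jsonify(record)

if __name__ == "__main__":
    app.run()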

Warning: Offhand, there is already a potential security problem here in terms of DoS. It will happily try to log anything you throw at it, no matter how big, and it stores everything in memory first. So long-running requests, large requests, or just many requests will quickly eat up your RAM and disk. So… don’t leave this running unattended? At least not without additional configuration.

That’s it! Hope it’s helpful.

read more...


CSV to JSON

Today at work, I had to process a bunch of CSV data. Realizing that I don’t have any particularly nice tools to work with streaming CSV data (although I did write about querying CSV files with SQL), I decided to write one:

$ cat users.csv

"user_id","name","email","password"
"1","Luke Skywalker","[email protected]","$2b$12$XQ1zDvl5PLS6g.K64H27xewPQMnkELa3LvzFSyay8p9kz0XXHVOFq"
"2","Han Solo","[email protected]","$2b$12$eKJGP.tt9u77PeXgMMFmlOyFWSuRZBUZLvmzuLlrum3vWPoRYgr92"

$ cat users.csv | csv2json | jq '.'

{
  "password": "$2b$12$XQ1zDvl5PLS6g.K64H27xewPQMnkELa3LvzFSyay8p9kz0XXHVOFq",
  "name": "Luke Skywalker",
  "user_id": "1",
  "email": "[email protected]"
}
{
  "password": "$2b$12$eKJGP.tt9u77PeXgMMFmlOyFWSuRZBUZLvmzuLlrum3vWPoRYgr92",
  "name": "Han Solo",
  "user_id": "2",
  "email": "[email protected]"
}
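
The tool itself doesn’t have to be much. A minimal version (a sketch of the idea, not necessarily the exact implementation) can just wrap csv.DictReader and stream one JSON object per row:

#!/usr/bin/env python3
import csv
import json
import sys

def main():
    # The first row is treated as the header; each following row becomes one JSON object
    for row in csv.DictReader(sys.stdin):
        json.dump(row, sys.stdout)
        sys.stdout.write("\n")

if __name__ == "__main__":
    main()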

read more...


Decoding escaped Unicode strings

In one of my current research projects involving large amounts of Twitter data from a variety of countries, I came across an interesting problem. The Twitter stream is encoded as a series of JSON objects, each of which has been written out using only ASCII characters. But not all of the Tweets (or even a majority, in this case) can be represented with only ASCII. So what happens?

Well, it turns out that they encode the data as JSON strings with Unicode escape sequences. So if we had the Russian hashtag #победазанами (victory is ours), it would be encoded like this:

"#\u043f\u043e\u0431\u0435\u0434\u0430\u0437\u0430\u043d\u0430\u043c\u0438"

read more...