In one of my current research projects involving large amounts of Twitter data from a variety of countries, I came across an interesting problem. The Twitter stream is encoded as a series of JSON objects, each of which has been written out using only ASCII characters. But not all of the Tweets (or even a majority in this case) can be represented with only ASCII. So what happens?
Well, it turns out that Twitter encodes the data as JSON strings with Unicode escape sequences: each non-ASCII character is written as \u followed by four hexadecimal digits. So if we had the Russian hashtag #победазанами (victory is ours), it would be encoded like this:
"#\u043f\u043e\u0431\u0435\u0434\u0430\u0437\u0430\u043d\u0430\u043c\u0438"