Command line Unicode search

As with Monday’s post about command line emoji search, I often find myself wanting to look up Unicode characters. I have a custom search engine / bookmark set up in Chrome / Firefox (uni %s maps to http://unicode-search.net/unicode-namesearch.pl?term=%s&.submit=Submit&subs=1). That works great, but given how much of my day I spend on the command line, I thought it would be interesting to do something similar there:

$ uni delta
⍋	apl functional symbol delta stile
⍙	apl functional symbol delta underbar
⍍	apl functional symbol quad delta
≜	delta equal to
Δ	greek capital letter delta
δ	greek small letter delta
ẟ	latin small letter delta
ƍ	latin small letter turned delta
𝚫	mathematical bold capital delta
𝜟	mathematical bold italic capital delta
𝜹	mathematical bold italic small delta
𝛅	mathematical bold small delta
𝛥	mathematical italic capital delta
𝛿	mathematical italic small delta
𝝙	mathematical sans-serif bold capital delta
𝞓	mathematical sans-serif bold italic capital delta
𝞭	mathematical sans-serif bold italic small delta
𝝳	mathematical sans-serif bold small delta
ᵟ	modifier letter small delta
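
A lookup like this can be approximated with nothing but Python's standard unicodedata module. The sketch below is only an illustration of the idea, not necessarily how the uni command itself is written: the function name search_unicode, its limit parameter, and the "every word must appear in the name" matching rule are all assumptions made for the example, and the exact matches depend on the Unicode version bundled with your Python.

```python
import sys
import unicodedata


def search_unicode(term, limit=sys.maxunicode + 1):
    """Return (char, name) pairs whose Unicode name contains every word in term.

    Hypothetical helper for illustration; `limit` caps the scanned code points.
    """
    words = term.upper().split()
    matches = []
    for codepoint in range(limit):
        char = chr(codepoint)
        # Unnamed code points (controls, surrogates, unassigned) yield the
        # default "" instead of raising ValueError, and are skipped.
        name = unicodedata.name(char, "")
        if name and all(word in name for word in words):
            matches.append((char, name.lower()))
    # Sort by name to mirror the alphabetical listing above.
    return sorted(matches, key=lambda pair: pair[1])


def main(argv):
    for char, name in search_unicode(" ".join(argv)):
        print(f"{char}\t{name}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Saved as an executable script named uni (a hypothetical file name), it would be called the same way as above: uni delta. Scanning every code point on each call is slow but simple; caching the name table would be the obvious next step.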



Decoding escaped Unicode strings

In one of my current research projects, which involves large amounts of Twitter data from a variety of countries, I came across an interesting problem. The Twitter stream is encoded as a series of JSON objects, each of which has been written out using only ASCII characters. But not all of the tweets (or even a majority, in this case) can be represented with only ASCII. So what happens?

Well, it turns out that the non-ASCII characters are written as Unicode escape sequences inside the JSON strings. So if we had the Russian hashtag #победазанами (victory is ours), it would be encoded like this:

"#\u043f\u043e\u0431\u0435\u0434\u0430\u0437\u0430\u043d\u0430\u043c\u0438"

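Since the escaped text is just a JSON string literal, one way to get the readable hashtag back is to hand it to a JSON parser. The sketch below assumes Python and its standard json module, which may or may not be what the full post uses; the variable names are made up for the example.

```python
import json

# The escaped hashtag exactly as it appears in the stream, quotes included.
# A raw string keeps the backslashes literal so the JSON parser sees them.
escaped = r'"#\u043f\u043e\u0431\u0435\u0434\u0430\u0437\u0430\u043d\u0430\u043c\u0438"'

# json.loads interprets each \uXXXX escape as the corresponding code point.
decoded = json.loads(escaped)
print(decoded)  # → #победазанами

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which reproduces the ASCII-only form seen in the stream.
assert json.dumps(decoded) == escaped
```

Passing ensure_ascii=False to json.dumps would write the Cyrillic characters out directly instead of escaping them.
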