Decoding escaped Unicode strings

In one of my current research projects involving large amounts of Twitter data from a variety of countries, I came across an interesting problem. The Twitter stream is encoded as a series of JSON objects, each of which has been written out using only ASCII characters. But not all of the Tweets (or even a majority in this case) can be represented with only ASCII. So what happens?

Well, it turns out that they encode the data as JSON strings with Unicode escape sequences. So if we had the Russian hashtag #победазанами (victory is ours), it would be encoded like this:

"#\u043f\u043e\u0431\u0435\u0434\u0430\u0437\u0430\u043d\u0430\u043c\u0438"
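Happily, those `\uXXXX` sequences are just standard JSON string escapes, so Python's `json` module decodes them for free. A minimal sketch:

```python
import json

# The raw stream is pure ASCII, but the \uXXXX escapes inside the
# JSON string encode the full Unicode text.
raw = '"#\\u043f\\u043e\\u0431\\u0435\\u0434\\u0430\\u0437\\u0430\\u043d\\u0430\\u043c\\u0438"'

decoded = json.loads(raw)
print(decoded)  # → #победазанами
```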

read more...


Generating non-repeating strings

Based on this post from Programming Praxis, today’s goal is to write an algorithm that, given a number N and an alphabet A, will generate all strings of length N made of letters from A with no adjacent substrings that repeat.

So for example, given N = 5 and A = {a, b, c} the string abcba will be allowed, but none of abcbc, ababc, nor even aabcb will be allowed (the bc, ab, and a repeat).

It’s a little more general than the version Programming Praxis specifies (they fix the alphabet to exactly A = {1, 2, 3}), and more general still than their original source (which asks for only a single valid string), but I think it’s worth the extra complication.
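The natural approach is backtracking: extend the string one letter at a time, and prune as soon as any suffix repeats its immediate neighbor (earlier positions were already checked when they were the end of the string). Here's a sketch of that idea — the function name is mine, not from the original post:

```python
def squarefree_strings(n, alphabet):
    """Yield all length-n strings over `alphabet` with no two
    adjacent identical substrings, via backtracking."""
    def has_repeat(s):
        # Only suffixes can be newly repeating: check whether the last
        # k characters equal the k characters before them, for each k.
        for k in range(1, len(s) // 2 + 1):
            if s[-k:] == s[-2 * k:-k]:
                return True
        return False

    def extend(s):
        if len(s) == n:
            yield s
            return
        for c in alphabet:
            t = s + c
            if not has_repeat(t):
                yield from extend(t)

    yield from extend("")
```

Running it with N = 5 and A = {a, b, c} admits abcba but rejects abcbc, ababc, and aabcb, matching the examples above.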

read more...


Numbers of Wirth

Niklaus Wirth gave the following problem back in 1973:

Develop a program that generates in ascending order the least 100 numbers of the set M, where M is defined as follows:

a) The number 1 is in M.

b) If x is in M, then y = 2 * x + 1 and z = 3 * x + 1 are also in M.

c) No other numbers are in M.

(via Programming Praxis)

It’s an interesting enough problem, so let’s work out a few different ways of doing it.
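One of the cleaner ways (a sketch, not necessarily the approach the full post settles on) is a min-heap: pop the smallest member of M found so far, emit it, and push its two children, skipping duplicates (note that 2x + 1 and 3x + 1 can collide for different x):

```python
import heapq

def wirth_numbers(count):
    """Return the least `count` members of M in ascending order."""
    heap, seen, out = [1], {1}, []
    while len(out) < count:
        x = heapq.heappop(heap)       # smallest not-yet-emitted member
        out.append(x)
        for y in (2 * x + 1, 3 * x + 1):
            if y not in seen:         # suppress duplicates
                seen.add(y)
                heapq.heappush(heap, y)
    return out
```

The first few members come out as 1, 3, 4, 7, 9, 10, 13, 15, 19, 21, ...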

read more...


List algorithms and efficiency

Programming Praxis’ new challenge(s) are to write three different list algorithms three times, each with a different runtime complexity. From their first post last week we have list intersection and union, and from a newer post yesterday we have the difference of two lists. For each of those, we want to write an algorithm that runs in O(n²) time, one that runs in O(n log n), and finally one that runs in O(n). It turns out to be more of an exercise in data structures than anything (although they’re all still technically ‘list’ algorithms), but it’s still interesting to see how you can achieve the same goal in different ways that may be far more efficient.
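To give away the data-structure punchline a little: the (expected) O(n) versions all fall out of putting one list into a hash set and scanning the other. A sketch, assuming hashable elements with no duplicates within each list:

```python
def intersection(xs, ys):
    # Build a hash set of one list, then filter the other: O(n) expected.
    s = set(ys)
    return [x for x in xs if x in s]

def union(xs, ys):
    # Keep all of xs, then append the members of ys not already present.
    s = set(xs)
    return xs + [y for y in ys if y not in s]

def difference(xs, ys):
    # Everything in xs that does not appear in ys.
    s = set(ys)
    return [x for x in xs if x not in s]
```

The O(n log n) versions replace the hash set with a sort and a merge-style scan, and the O(n²) versions are the naive nested loops.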

read more...


Determining country by IP

In my line of research, it’s often useful to be able to identify which country a host is in using just its IP address. I’ve done it a few different ways over the years, but the simplest I’ve found is using the MaxMind GeoLite Country database directly. To speed up such lookups, I’ve written a simple Python script that can run a whole series of such queries for you.
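The core idea is simple: the database is a sorted table of (start, end, country) IP ranges, so each lookup is a binary search. Here's a sketch of that lookup against a range table you've already loaded — the class and the tuple layout are my illustration, not MaxMind's actual file format or my full script:

```python
import bisect
import socket
import struct

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 address to its integer form."""
    return struct.unpack(">I", socket.inet_aton(ip))[0]

class CountryLookup:
    """Binary search over (start_int, end_int, country_code) ranges."""
    def __init__(self, ranges):
        self.ranges = sorted(ranges)
        self.starts = [r[0] for r in self.ranges]

    def country(self, ip):
        n = ip_to_int(ip)
        # Find the last range whose start is <= n, then confirm n is inside.
        i = bisect.bisect_right(self.starts, n) - 1
        if i >= 0 and n <= self.ranges[i][1]:
            return self.ranges[i][2]
        return None
```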

read more...


Generated HTML index

A simple script today to generate an HTML index listing all of the files in a given directory. This has come in handy in the past when Apache has had Options -Indexes set (disabling their automatically generated indexes) and I didn’t have the permissions to override it.
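The heart of such a script fits in a few lines: list the directory, escape the names, and emit a list of links. A sketch (not the exact script from the post):

```python
import html
import os

def make_index(directory):
    """Render a minimal index.html body listing the entries in
    `directory` as links, sorted by name."""
    items = []
    for name in sorted(os.listdir(directory)):
        escaped = html.escape(name, quote=True)  # keep odd filenames safe
        items.append('<li><a href="{0}">{0}</a></li>'.format(escaped))
    return "<html><body><ul>\n{}\n</ul></body></html>".format("\n".join(items))
```

Write the result out as index.html in the same directory and Apache will serve it like any other page, no Options override needed.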

read more...


Querying CSV files with SQL

Some time ago, I had a bunch of CSV files that I needed to extract some data from. They were all organized into tables, with related columns between them, in a way that made me think of a relational database; and relational databases are really easy to query: just use SQL. So what did I do? I wrote a script that lets me query CSV files as if they were a relational database.
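One way to get this (a sketch of the general technique, not necessarily how my script does it) is to lean on Python's built-in sqlite3: load each CSV into an in-memory table, then run ordinary SQL against it. This sketch assumes the first row of each file holds the column names:

```python
import csv
import sqlite3

def load_csv(conn, table, path):
    """Load a CSV file (header row = column names) into a SQLite table.
    Note: table/column names are interpolated directly, so this is only
    safe for trusted input -- fine for a personal query tool."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join('"{}"'.format(c) for c in header)
        conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
        marks = ", ".join("?" * len(header))
        conn.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, marks), reader
        )

conn = sqlite3.connect(":memory:")
```

After loading, joins across files are just joins across tables: `conn.execute("SELECT ... FROM a JOIN b ON ...")`.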

read more...


Pickles and memoization

Memoization (yes, that’s the correct word) is a method of optimization in computer science where the first time you run a function you cache the result. After that, rather than re-doing the work, you can simply return the previous result. It’s not always a win, though, as you’re trading space for time.
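The pickle twist is that the cache survives between runs: store the results dict in a pickle file and reload it on startup. A sketch of the idea as a decorator (my names, and simplified — it rewrites the whole file on every new result):

```python
import functools
import os
import pickle

def disk_memoize(cache_path):
    """Memoize a function, persisting the cache to `cache_path`."""
    def decorator(fn):
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                cache = pickle.load(f)    # warm start from a previous run
        else:
            cache = {}

        @functools.wraps(fn)
        def wrapper(*args):
            if args not in cache:
                cache[args] = fn(*args)   # do the work exactly once
                with open(cache_path, "wb") as f:
                    pickle.dump(cache, f)
            return cache[args]
        return wrapper
    return decorator
```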

read more...


Sampling stdin

A relatively simple script today. When I was working with Twitter data, it quickly became apparent that it’s a lot of data. So I needed some way that I could reduce the amount of data that I was dealing with while still keeping many of the same properties. To that end, I wrote a really simple script that would forward lines from stdin to stdout but would only do so a given percentage of the time.
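The core of the idea is a Bernoulli sample of the stream: flip a (biased) coin per line and forward the line only on heads. A sketch along those lines (the names and CLI are my illustration, not the exact script):

```python
import random
import sys

def sample_lines(lines, fraction, rng=random):
    """Yield each input line with probability `fraction`."""
    for line in lines:
        if rng.random() < fraction:
            yield line

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g.  cat tweets.json | python sample.py 0.01
    for line in sample_lines(sys.stdin, float(sys.argv[1])):
        sys.stdout.write(line)
```

Each line is kept or dropped independently, so the output is a uniform sample that preserves the stream's statistical properties (though the exact output size varies run to run).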

read more...