Command line user agent parsing

2014-02-07 in Programming

Quite often when working with internet data, you will find yourself wanting to figure out what sort of device users are using to access your content. Luckily, if you’re using HTTP, there is a standard for that: the User-Agent header.

Since I’m in exactly that position, I’ve added a new script to my Dotfiles that reads user agents on stdin, parses them, and writes them back out in a given format.
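As a rough illustration of the parse-then-format idea (not the actual script from the Dotfiles), here is a minimal sketch. The `PRODUCTS` priority list, the function names, and the template fields are all assumptions made for this example, and real user agents need far more careful handling than one regular expression per browser.

```python
import re

# Hypothetical priority list: user agents often name several products
# (Chrome also advertises "Safari"), so the first match in this order wins.
PRODUCTS = ["Firefox", "Chrome", "Safari", "MSIE"]

def parse_user_agent(ua):
    """Return a dict with a best-guess 'browser' and 'version'."""
    for product in PRODUCTS:
        match = re.search(re.escape(product) + r"[/ ]([\d.]+)", ua)
        if match:
            return {"browser": product, "version": match.group(1)}
    return {"browser": "unknown", "version": ""}

def format_user_agents(lines, template="{browser}\t{version}"):
    """Parse each line and render it with a str.format template."""
    return [template.format(**parse_user_agent(line.strip())) for line in lines]

# Made-up sample input; a command line version could pass sys.stdin instead.
sample = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/32.0.1700.107 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0",
]
for row in format_user_agents(sample, "{browser} {version}"):
    print(row)
```

Because `format_user_agents` accepts any iterable of lines, the same function would work unchanged on a file object or on standard input.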

Combining sort and uniq

2014-02-07 in Programming

A fairly common pattern on the command line (at least for me) is combining sort and uniq to count the unique items in a list of unsorted data. Something like this:

$ find . -type 'f' | rev | cut -d "." -f "1" | rev | sort | uniq -c | sort -nr | head

2649 htm
1458 png
 993 cache
 612 jpg
 135 css
 102 zip
  99 svg
  60 gif
  45 js
  27 pdf
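For comparison, the same count-and-rank step can be sketched in Python with collections.Counter. The function names and the sample paths below are made up for illustration; `extension` mimics the `rev | cut | rev` trick by keeping whatever follows the last dot.

```python
from collections import Counter

def count_unique(items, limit=10):
    """Count occurrences and return the `limit` most common (item, count) pairs."""
    return Counter(items).most_common(limit)

def extension(path):
    """Text after the last '.', or the whole path if it contains no dot."""
    return path.rpartition(".")[2]

# Made-up sample input standing in for the output of find.
paths = ["./index.htm", "./about.htm", "./logo.png"]
for ext, count in count_unique(extension(p) for p in paths):
    print(f"{count:>4} {ext}")
```

Unlike `sort | uniq -c`, the Counter does not need sorted input, since it tallies items in a dictionary as it goes.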

Scanning for DNS resolvers

2013-01-31 in Research

For a research project I’m working on, it has become necessary to scan potentially large IPv4 prefixes in order to find any DNS resolvers that I can and classify them as either open (accepting queries from anyone) or closed.

Disclaimer: This is a form of port scanning and thus has associated ethical and legal considerations. Use it at your own risk.

This project is available on GitHub: jpverkamp/dnsscan

Decoding escaped Unicode strings

2013-01-17 in Programming

In one of my current research projects involving large amounts of Twitter data from a variety of countries, I came across an interesting problem. The Twitter stream is encoded as a series of JSON objects–each of which has been written out using ASCII characters. But not all of the Tweets (or even a majority in this case) can be represented with only ASCII. So what happens?

Well, it turns out that they encode the data as JSON strings with Unicode escape characters. So if we had the Russian hashtag #победазанами (victory is ours), that would be encoded as such:

"#\u043f\u043e\u0431\u0435\u0434\u0430\u0437\u0430\u043d\u0430\u043c\u0438"
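One way to turn such a string back into readable text, assuming Python and its standard json module (which may or may not be the approach the full post takes), is simply to parse it as JSON:

```python
import json

# The escaped hashtag from above, kept as a raw string so the backslashes survive.
encoded = r'"#\u043f\u043e\u0431\u0435\u0434\u0430\u0437\u0430\u043d\u0430\u043c\u0438"'

decoded = json.loads(encoded)
print(decoded)  # #победазанами

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which reproduces the ASCII-only form.
print(json.dumps(decoded) == encoded)  # True
```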

Generating non-repeating strings

2012-12-12 in Programming

Based on this post from Programming Praxis, today’s goal is to write an algorithm that, given a number N and an alphabet A, will generate all strings of length N made of letters from A with no adjacent substrings that repeat.

So for example, given N = 5 and A = {a, b, c} the string abcba will be allowed, but none of abcbc, ababc, nor even aabcb will be allowed (the bc, ab, and a repeat).

It’s a little more general than the version Programming Praxis specifies (they limit the alphabet to exactly A = {1, 2, 3}), and more general still than their original source, which requires only one possible string, but I think it’s worth the extra complications.
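A minimal backtracking sketch of the idea, in Python, might look like the following. The function names are my own for this example, and this is only one possible approach: build strings one letter at a time and abandon a prefix as soon as it ends in two identical adjacent substrings.

```python
def has_repeated_suffix(s):
    """True if s ends in two identical adjacent substrings, e.g. 'abcbc' ends in 'bc' + 'bc'."""
    for size in range(1, len(s) // 2 + 1):
        if s[-size:] == s[-2 * size:-size]:
            return True
    return False

def non_repeating(n, alphabet):
    """Yield every string of length n over alphabet with no adjacent repeated substrings."""
    def extend(prefix):
        if len(prefix) == n:
            yield prefix
            return
        for letter in alphabet:
            candidate = prefix + letter
            # Only the newest suffix needs checking: every shorter prefix
            # was already verified before it was extended.
            if not has_repeated_suffix(candidate):
                yield from extend(candidate)
    yield from extend("")

for s in non_repeating(3, "abc"):
    print(s)
```

Checking only the suffix works because any repeat that does not touch the last letter would already have been caught when that earlier prefix was built.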

Numbers of Wirth

2012-12-10 in Programming

Niklaus Wirth gave the following problem back in 1973:

Develop a program that generates in ascending order the least 100 numbers of the set M, where M is defined as follows:

a) The number 1 is in M.

b) If x is in M, then y = 2 * x + 1 and z = 3 * x + 1 are also in M.

c) No other numbers are in M.

(via Programming Praxis)

It’s an interesting enough problem, so let’s work out a few different ways of doing it.
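As one possible starting point (a sketch, not necessarily one of the solutions in the full post), a min-heap plus a set of already-seen values generates the numbers in ascending order. The function name `wirth` is just a label chosen for this example.

```python
import heapq

def wirth(count):
    """Return the `count` smallest members of M in ascending order."""
    heap = [1]   # candidates not yet emitted
    seen = {1}   # everything ever pushed, to avoid duplicates such as 31
    result = []
    while len(result) < count:
        x = heapq.heappop(heap)
        result.append(x)
        for y in (2 * x + 1, 3 * x + 1):
            if y not in seen:
                seen.add(y)
                heapq.heappush(heap, y)
    return result

print(wirth(10))  # [1, 3, 4, 7, 9, 10, 13, 15, 19, 21]
```

Popping the smallest candidate is safe because both rules produce numbers larger than x, so nothing smaller can appear later. The set matters because some numbers are reachable two ways: 31 is both 2 * 15 + 1 and 3 * 10 + 1.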

List algorithms and efficiency

2012-11-21 in Programming

Programming Praxis’ new challenge(s) are to write three different list algorithms three times, each with a different runtime complexity. From their first post last week we have list intersection and union and from a newer post yesterday we have the difference of two lists. For each of those, we want to be able to write an algorithm that runs in O(n²) time, one that runs in O(n log n), and finally one that runs in O(n). It turns out that it’s more of an exercise in data structures than anything (although they’re all still technically ’list’ algorithms), but it’s still interesting to see how you can achieve the same goal in different ways that may be far more efficient.
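To make the three complexity classes concrete, here is a hedged Python sketch of list intersection done three ways. It assumes neither input list contains duplicates, the function names are invented for this example, and the O(n) bound for the set version is an expected-time bound that relies on hashing.

```python
from bisect import bisect_left

def intersect_quadratic(xs, ys):
    """O(n^2): each membership test scans the whole of ys."""
    return [x for x in xs if x in ys]

def intersect_nlogn(xs, ys):
    """O(n log n): sort ys once, then binary search for each x."""
    sorted_ys = sorted(ys)
    result = []
    for x in xs:
        i = bisect_left(sorted_ys, x)
        if i < len(sorted_ys) and sorted_ys[i] == x:
            result.append(x)
    return result

def intersect_linear(xs, ys):
    """Expected O(n): a hash set gives constant-time membership tests."""
    lookup = set(ys)
    return [x for x in xs if x in lookup]

xs, ys = [4, 7, 12, 6, 17, 5, 13], [7, 19, 4, 11, 13, 2, 15]
print(intersect_quadratic(xs, ys))
print(intersect_nlogn(xs, ys))
print(intersect_linear(xs, ys))
```

All three return the common elements in the order they appear in the first list; only the data structure used for the membership test changes, which is where the difference in running time comes from.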

Determining country by latitude/longitude

2012-10-25 in Programming

Yesterday I talked about determining country by IP address. Today’s problem is related: I have data (from my censorship research) where the API will report longitude and latitude but not always country.

Determining country by IP

2012-10-24 in Programming

In my line of research, it’s often useful to be able to identify which country a host is in using just its IP address. I’ve done it a few different ways over the years, but the simplest I’ve found is using the MaxMind GeoLite Country database directly. To speed such lookups, I’ve written a simple Python script that can run a whole series of such queries for you.

Generated HTML index

2012-10-06 in Programming

A simple script today to generate an HTML index listing all of the files in a given directory. This has come in handy in the past when Apache has had Options -Indexes set (disabling its automatically generated indexes) and I didn’t have the permissions to override it.
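A minimal sketch of such a generator, using only the Python standard library, could look like this. The function name `build_index` and the exact markup are assumptions for illustration and are not taken from the actual script.

```python
import html
import os
from urllib.parse import quote

def build_index(directory):
    """Return an HTML page linking to every regular file in `directory`."""
    names = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    # quote() makes the name safe inside a URL; html.escape() makes it safe as text.
    items = "\n".join(
        f'<li><a href="{quote(name)}">{html.escape(name)}</a></li>'
        for name in names
    )
    title = html.escape(os.path.basename(os.path.abspath(directory)))
    return (
        f"<html><head><title>Index of {title}</title></head>\n"
        f"<body><ul>\n{items}\n</ul></body></html>\n"
    )

print(build_index("."))
```

Writing the result to index.html in the same directory means the index will list itself on the next run, which may or may not be what you want.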

JP's Blog

Programming, Language: Python
