A relatively simple script today. When I was working with Twitter data, it quickly became apparent that there is a lot of it. I needed a way to cut down the amount of data I was dealing with while still keeping many of the same properties. To that end, I wrote a really simple script that forwards lines from stdin to stdout, but only with a given probability.
That’s really all there is to it. I present to you sample:
    import random
    import sys

    # get the chance
    # use exceptions for sanity checks
    try:
        if len(sys.argv) != 2:
            raise Exception()

        # sys.argv[1] is the chance; float() raises on non-numeric input
        chance = float(sys.argv[1])
        if chance < 0 or chance > 1:
            raise Exception()

    except Exception:
        print('''Usage: sample [chance]
    Forward [chance] percentage of stdin to stdout
    [chance] must be in the range [0,1]''')
        # non-zero exit status signals the usage error
        sys.exit(1)

    # now just read line by line and output
    # based on that random chance
    for line in sys.stdin:
        if random.random() < chance:
            # strip only the trailing newline; print adds it back
            print(line.rstrip('\n'))
Generally, you’d pipe the output of another command into it. Unlike most such scripts, it can’t read from files named on the command line as well as from piped input, but that could easily be added with something like this:
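    # sys.argv[0] is the program name, sys.argv[1] is the chance;
    # any remaining arguments are files to sample from
    read = False
    for arg in sys.argv[2:]:
        with open(arg, 'r') as fin:
            for line in fin:
                if random.random() < chance:
                    print(line.rstrip('\n'))
        read = True

    # no files were given, so fall back to stdin
    if not read:
        for line in sys.stdin:
            if random.random() < chance:
                print(line.rstrip('\n'))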
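Another way to get the same file-or-pipe behavior is Python's standard fileinput module, which reads the files it is given in order and falls back to stdin when the list is empty. What follows is only a sketch of that idea, not the downloadable script: the names sample_lines and main, and the extended usage message, are made up for illustration.

```python
import fileinput
import random
import sys


def sample_lines(lines, chance, rng=random):
    """Yield each line with probability `chance` (hypothetical helper)."""
    for line in lines:
        if rng.random() < chance:
            yield line


def main(argv):
    """argv[1] is the chance; argv[2:] are optional file names (assumed layout)."""
    try:
        chance = float(argv[1])
        if not 0 <= chance <= 1:
            raise ValueError
    except (IndexError, ValueError):
        print('Usage: sample [chance] [file ...]', file=sys.stderr)
        return 1

    # fileinput.input reads the named files in order,
    # or stdin when the list of files is empty
    with fileinput.input(files=argv[2:]) as lines:
        for line in sample_lines(lines, chance):
            # lines keep their own newline, so write them unchanged
            sys.stdout.write(line)
    return 0


# To use this as a script, finish with: sys.exit(main(sys.argv))
```

With that in place, both `sample 0.1 tweets.txt` and `cat tweets.txt | sample 0.1` should behave the same way (tweets.txt is just a placeholder name).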
You will need to tweak the error checking, since the original check insists on exactly one argument, but that’s not so hard to do. Some testing (I have the simple version of this program in my path as ‘sample’):
    ┌ ☺ verkampj@minty ~
    └ cat sample.py | sample 0.1
    for line in sys.stdin:

    ┌ ☺ verkampj@minty ~
    └ cat sample.py | sample 0.1
    import sys
    chance = float(sys.argv[1])
    print('''Usage: sample [chance]
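The two runs print different numbers of lines because every line is kept or dropped independently, so the output size only averages out to chance times the input size. A quick way to see that is to simulate many runs. This is a sketch under a few assumptions: count_kept is a hypothetical helper, the 20-line input is a stand-in for a script of about this size, and the seed is fixed only so the demo is repeatable.

```python
import random


def count_kept(n_lines, chance, rng):
    """Count how many of n_lines survive sampling (hypothetical helper)."""
    return sum(1 for _ in range(n_lines) if rng.random() < chance)


rng = random.Random(42)  # fixed seed only so the demo is repeatable
counts = [count_kept(20, 0.1, rng) for _ in range(1000)]

# The average should land near 20 * 0.1 = 2 lines, but single runs vary
print('mean kept:', sum(counts) / len(counts))
print('min/max kept:', min(counts), max(counts))
```

On small inputs a single run can easily return nothing at all; on something the size of a Twitter dump the kept fraction will sit much closer to the requested chance.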
If you find it useful, let me know. I’m sure there are all sorts of times when you want to look at just a representative sample, and a script like this would come in handy.
You can download the full source here: sample source code