A relatively simple script today. When I was working with Twitter data, it quickly became apparent that it’s a lot of data. So I needed some way that I could reduce the amount of data that I was dealing with while still keeping many of the same properties. To that end, I wrote a really simple script that would forward lines from stdin to stdout but would only do so a given percentage of the time.
That’s really all there is to it. I present to you sample:
import random, sys
# get the chance
# use exceptions for sanity checks
try:
if len(sys.argv) != 2: raise Exception()
chance = float(sys.argv[1])
if chance < 0 or chance > 1: raise Exception()
except:
print('''Usage: sample [chance]
Forward [chance] percentage of stdin to stdout
[chance] must be in the range [0,1]''')
sys.exit(0)
# now just read line by line and output
# based on that random chance
for line in sys.stdin:
if random.random() < chance:
print(line[:-1])
Generally, you’d pipe the output of another command to it. It’s not smart enough to act like most scripts where you can either read a file or pipe input, but that’s something that could easily be added with something like this:
# arg[0] is the program name, arg[1] is the chance
read = False
for arg in sys.argv[2:]:
with open(arg, 'r') as fin:
for line in fin:
if random.random() < chance:
print(line[:-1])
You will need to tweak the error checking, but that’s not so hard to do. Some testing (I have the simple version of this program in my path as ‘sample’):
┌ ☺ <span style="color: purple;">verkampj</span>@<span style="color: orange;">minty</span> <span style="color: green;">~</span>
└ cat sample.py | sample 0.1
for line in sys.stdin:
┌ ☺ <span style="color: purple;">verkampj</span>@<span style="color: orange;">minty</span> <span style="color: green;">~</span>
└ cat sample.py | sample 0.1
import sys
chance = float(sys.argv[1])
print('''Usage: sample [chance]
If you find it useful, let me know. I’m sure there are all sorts of time when you may want to look at just a representative sample where such a script would come in handy.
You can download the full source here: sample source code