ASA Sections on: Statistical Computing
Dealing with this amount of data is definitely a challenge and we hope that this data expo will inspire you to learn more about dealing with large volumes of data. To make sure you don't get overwhelmed, this page describes some simple command line tools to sort, filter and tabulate.
All of these tools are available on a default install of Linux or Mac OS X. If you want to use them on Windows, you will need to install Cygwin or similar.
sort
Sort by the 10th column (flightnum):
sort -t, -k 10,10 2008.csv
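Two details of the command above are worth knowing: the header line is sorted along with the data, and the comparison is textual, so 3231 sorts before 335. The following is a minimal sketch of one way to handle both, assuming a tiny made-up file named sample.csv that copies only the first ten columns of 2008.csv. The file name and its rows are illustrative and are not part of the expo data.

```shell
# Build a small stand-in file (hypothetical rows, first 10 columns only).
cat > sample.csv <<'EOF'
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum
2008,1,3,4,2003,1955,2211,2225,WN,335
2008,1,3,4,754,735,1002,1000,WN,3231
2008,1,3,4,628,620,804,750,WN,448
EOF

# Emit the header unchanged, then sort the remaining rows
# numerically (the trailing "n") on column 10 only.
{
  head -n 1 sample.csv
  tail -n +2 sample.csv | sort -t, -k 10,10n
} > sorted.csv

cat sorted.csv
```

With the sample rows, the flight numbers come out in the order 335, 448, 3231, with the header still on the first line. The same pattern should apply to 2008.csv by swapping in that file name.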
awk
Remove the header row:
awk -F, 'NR != 1' 2008.csv
Show flights from Des Moines to Chicago O'Hare:
awk -F, '$17 == "DSM" && $18 == "ORD"' 2008.csv
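The row filter can be combined with a header test and a column selection in a single awk call. The sketch below assumes a small made-up file, sample-routes.csv, that copies the first 18 columns of 2008.csv. The file name, the rows and the output name dsm-ord.csv are all illustrative.

```shell
# Build a small stand-in file (hypothetical rows, first 18 columns only).
cat > sample-routes.csv <<'EOF'
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest
2008,1,3,4,700,700,810,815,UA,1201,N101UA,70,75,50,-5,0,DSM,ORD
2008,1,3,4,900,855,1130,1120,WN,335,N712SW,150,145,130,10,5,MDW,DSM
2008,1,4,5,1300,1300,1412,1415,AA,2310,N4XCAA,72,75,52,-3,0,DSM,ORD
EOF

# Keep the header (NR == 1) plus every DSM -> ORD row, and print only
# carrier, flight number, origin and destination, comma-separated.
awk -F, -v OFS=, \
  'NR == 1 || ($17 == "DSM" && $18 == "ORD") { print $9, $10, $17, $18 }' \
  sample-routes.csv > dsm-ord.csv

cat dsm-ord.csv
```

Setting OFS to a comma keeps the output in CSV form. Without it, awk would join the printed fields with spaces.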
cut
Select only columns 9 (carrier) and 10 (flight num):
cut -f9,10 -d, 2008.csv
Count the number of flights for each carrier and flight number combination and save to 2008-flights.csv:
cut -f9,10 -d, 2008.csv | sort | uniq -c > 2008-flights.csv
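Note that uniq -c writes each count as a space-padded prefix, so the saved file is not strictly comma-separated, and the header line is counted as if it were a flight. Here is a hedged sketch of one possible variant that skips the header and moves the count into a third CSV column. The file names sample-counts.csv and sample-flights.csv, the sample rows and the "Flights" column name are assumptions made for illustration.

```shell
# Build a small stand-in file (hypothetical rows, first 10 columns only).
cat > sample-counts.csv <<'EOF'
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum
2008,1,3,4,2003,1955,2211,2225,WN,335
2008,1,4,5,2001,1955,2210,2225,WN,335
2008,1,3,4,754,735,1002,1000,UA,1201
EOF

{
  # Write a header for the summary file.
  echo 'UniqueCarrier,FlightNum,Flights'
  # Drop the input header, keep columns 9 and 10, count duplicates,
  # then turn "  2 WN,335" into "WN,335,2".
  tail -n +2 sample-counts.csv |
    cut -d, -f9,10 |
    sort |
    uniq -c |
    awk -v OFS=, '{ print $2, $1 }'
} > sample-flights.csv

cat sample-flights.csv
```

The final awk step relies on the carrier and flight number containing no spaces, which holds for the sample rows here.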