The Amazing Unix Shell
Hilary Mason and Chris Wiggins wrote about the importance of any data scientist being familiar with the command line. So… it’s enough to you start it! Often people use Hadoop and other so-called Big Data tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.
One especially under-used approach for data processing is using standard shell tools and commands. The benefits of this approach can be massive, since creating a data pipeline out of shell commands means that all the processing steps can be done in parallel. You can pretty easily construct a stream processing pipeline with basic commands that will have extremely good performance compared to many modern Big Data tools.
An additional point is the batch versus streaming analysis approach. Getting short on memory. This is because all data is loaded into RAM for the analysis. However, considering the problem for a bit, it can be easily solved with streaming analysis that requires basically no memory at all. The resulting stream processing pipeline we will create will be over faster than the Hadoop implementation and use virtually no memory.
The command line is essential to my daily work, so I wanted to share some of the commands I’ve found most useful.
Here you’ll see some nice commands on the Unix shell from basic information on the data to analysis.
General syntax
Pipe | result | use result |
The next command will use the previous result |
Save | result > file.txt |
Save the result in a file |
Stop | Control+C | Stops running the command |
Help | head --help |
Explanation and parameters overview |
Data
The dataset is the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.
To reproduce this post download the file and unzip it using the commands below.
$ cd ~/Downloads/
$ wget 'https://vincentarelbundock.github.io/Rdatasets/csv/car/Salaries.csv'
Head & Tail - First look
Sometimes you just need to inspect the structure of a huge file. That’s where head and tail come in. Head prints the first ten lines of a file, while tail prints the last ten lines. Optionally, you can include the -n parameter to change the number of lines displayed.
Cut - Column extracter
Unix command cut is used for text processing. You can use this command to extract portion of text from a file by selecting columns.
Iconv - File encoding
The iconv program converts the encoding of characters in inputfile from one coded character set to another. The file encoding is all right so let’s create a problem:
$ echo 'pão' > new.txt
$ cat new.txt
pão
UTF-8 handles the ã character well. But if we set as ISO-8859-9?
$ cat new.txt | iconv -f UTF-8 -t ISO-8859-9
p�o
A wild � appears! For the sake of curiosity, pão is bread in Portuguese.
If I have a bunch of text files that I’d like to convert from any given charset to UTF-8 encoding. Just use this bash magic.
Sort - Ordering
Sort is a simple and very useful command which will rearrange the lines in a text file so that they are sorted, numerically and alphabetically. Here is the top five salaries.
WC - word count
By default, wc will quickly tell you how many lines, words, and bytes are in a file. If you’re looking for just the line count, you can pass the -l parameter in. I use it most often to verify record counts between files or database tables throughout an analysis.
Unique - Distinct entries
Sometimes you want to check for duplicate records in a large text file - that’s when uniq comes in handy. By using the -c parameter, uniq will output the count of occurrences along with the line. You can also use the -d and -u parameters to output only duplicated or unique records.
Count the number of unique lines.
Grep - Find pattern
Grep allows you to search through plain text files using regular expressions. There’s an assortment of extra parameters you can use with grep, but the ones I tend to use the most are -i (ignore case), -r (recursively search directories), -B N (N lines before), -A N (N lines after).
In case I want to find a entry in the first column that don’t starts with “Prof:
Sed
Sed is similar to grep and awk (next) in many ways, however I find that I most often use it when needing to do some find and replace magic on a very large file.
s
is used to replace the found expression.g
stands for global, meaning to do this for the whole line. If you leave off the g and the expression appears twice on the same line, only the first is changed.-i
option is used to edit in place on filename.-e
option indicates a command to run.
I will substitute the Male entries for a single char M:
AWK - Powerful
Sum
How much is the annual cost of teachers in this school?
The above line says:
- Use the cat command to stream (print) the contents of the file to stdout.
- Pipe the streaming contents from our cat command to the next one - awk.
- With awk:
- Set the field separator to the pipe character (-F “,”).
- Increment the variable sum with the value in the twelfth column ($7). Since we used a pipeline in point #2, the contents of each line are being streamed to this statement.
- Once the stream is done, print out the value of sum, using printf to format the value with none decimal places.
List the column number
The next command is essential to manipulate a data with many columns.
Char count
This command counts the maximum character length of each column in Salaries.csv file. Very nice command to look at your data and parse correctly the text file to a database:
Mean
It makes possible to calculate the mean of the whole file. In the below example I want the average salary for the female professors.
Standard Deviation - population
Also, awk can bring some position measures like the standard deviation.
Split file by some rule
You can also split your file filtered by a rule.