Text processing in the shell

This article is part of a self-published book project by Balthazar Rouberol and Etienne Brodu, ex-roommates, friends and colleagues, aiming at empowering the up and coming generation of developers. We are currently hard at work on it!

If you are interested in the project, we invite you to join the mailing list!

One of the things that makes the shell an invaluable tool is the amount of available text processing commands, and the ability to easily pipe them into each other to build complex text processing workflows. These commands can make it trivial to perform text and data analysis, convert data between different formats, filter lines, etc.

When working with text data, the philosophy is to break any complex problem you have into a set of smaller ones, and to solve each of them with a specialized tool.

Make each program do one thing well.

The examples in this chapter might seem a little contrived at first, but this is by design. Each of these tools was designed to solve one small problem. They become extremely powerful, however, when combined.

We will go over some of the most common and useful text processing commands the shell has to offer, and will demonstrate real-life workflows piping them together. I suggest you take a look at the man page of each of these commands to see the full breadth of options at your disposal.

The example CSV (comma-separated values) file is available online. Feel free to download it yourself to test these commands.

cat

As seen in the previous chapter, cat is used to concatenate a list of one or more files and display their content on screen.

$ cat Documents/readme
Thanks again for reading this book!
I hope you are following so far!

$ cat Documents/computers
Computers are not intelligent
They're just fast at making dumb things.
$ cat Documents/readme Documents/computers
Thanks again for reading this book!
I hope you are following so far!

Computers are not intelligent
They're just fast at making dumb things.

head

head prints the first n lines in a file. It can be very useful to peek into a file of unknown structure and format without burying your shell under a wall of text.

$ head -n 2 metadata.csv
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
mysql.galera.wsrep_cluster_size,gauge,,node,,The current number of nodes in the Galera cluster.,0,mysql,galera cluster size

If -n is unspecified, head will print the first 10 lines in its argument file or input stream.

tail

tail is head's counterpart. It prints the last n lines in a file.

$ tail -n 1 metadata.csv
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries

If you want to print all lines in a file starting from the nth line (inclusive), you can use the -n +n argument.

$ tail -n +42 metadata.csv
mysql.replication.slaves_connected,gauge,,,,Number of slaves connected to a replication master.,0,mysql,slaves connected
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries

Our file has 43 lines, so tail -n +42 only prints the 42nd and 43rd lines of our file.

If -n is unspecified, tail will print the last 10 lines in its argument file or input stream.
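
head and tail combine naturally when we want a range of lines from the middle of a file, or everything but the first line. The following is a minimal sketch: the lines.txt file and its content are made up for the illustration.

```shell
# Create a hypothetical 10-line sample file in a temporary directory.
workdir=$(mktemp -d)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/lines.txt"

# Keep the first 5 lines, then only the last 2 of those: lines 4 and 5.
middle=$(head -n 5 "$workdir/lines.txt" | tail -n 2)
echo "$middle"

# Start at line 2 (skipping, for example, a CSV header), then keep one line.
without_first=$(tail -n +2 "$workdir/lines.txt" | head -n 1)
echo "$without_first"

rm -r "$workdir"
```

The first command prints line 4 and line 5, the second prints line 2.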

tail -f or tail --follow displays the last lines in a file, then displays each new line as it is written to the file. It is very useful to watch real-time activity being written to a log file, for example a web server log.

wc

wc (for word count) prints the number of bytes (when using -c), words (when using -w) or lines (when using -l) in its argument files or input stream. To count characters rather than bytes, use -m.

$ wc -l metadata.csv
43  metadata.csv
$ wc -w metadata.csv
405 metadata.csv
$ wc -c metadata.csv
5094 metadata.csv

By default, wc prints all of the above.

$ wc metadata.csv
43     405    5094 metadata.csv

Only the count will be printed out if the text data is piped in or redirected into stdin.

$ cat metadata.csv | wc
43     405    5094
$ cat metadata.csv | wc -l
43
$ wc -w < metadata.csv
405

grep

grep is the Swiss Army knife of line filtering. It allows you to filter lines matching a given pattern.

For example, we can use grep to find all occurrences of the word mutex in our metadata.csv file.

$ grep mutex metadata.csv
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.innodb.mutex_spin_rounds,gauge,,event,second,The rate of mutex spin rounds.,0,mysql,mutex spin rounds
mysql.innodb.mutex_spin_waits,gauge,,event,second,The rate of mutex spin waits.,0,mysql,mutex spin waits

grep can process either files passed as arguments, or a stream of text passed to its stdin. We can thus chain multiple grep commands to further filter our text. In the next example, we filter lines in our metadata.csv file that contain both the mutex and OS words.

$ grep mutex metadata.csv | grep OS
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits

Let’s go over some of the options you can pass to grep and their associated behavior.

grep -v performs inverted matching: it keeps only the lines that do not match the argument pattern.

$ grep -v gauge metadata.csv
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name

grep -i performs case-insensitive matching. In the next example, grep -i os matches both OS and os.

$ grep -i os metadata.csv
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.innodb.os_log_fsyncs,gauge,,write,second,The rate of fsync writes to the log file.,0,mysql,log fsyncs

grep -l only lists files containing a match.

$ grep -l mysql metadata.csv
metadata.csv

grep -c counts the number of lines matching the pattern.

$ grep -c select metadata.csv
3

grep -r recursively searches files in the given directory and all subdirectories below it.

$ grep -r are ~/Documents
/home/br/Documents/computers:Computers are not intelligent
/home/br/Documents/readme:I hope you are following so far!

grep -w only matches whole words.

$ grep follow ~/Documents/readme
I hope you are following so far!
$ grep -w follow ~/Documents/readme
$
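
These options can be combined in a single grep call. Here is a small sketch, assuming a made-up app.log file created just for the example, that counts the lines containing the whole word error in any letter case, then lists the lines that do not contain it.

```shell
# Hypothetical log file, created only for this example.
workdir=$(mktemp -d)
printf '%s\n' 'Error: disk full' 'warning: low memory' \
    'error: timeout' 'errors were found' > "$workdir/app.log"

# -c counts matching lines, -i ignores case, -w only matches whole words,
# so "errors" on the last line is not counted.
error_lines=$(grep -c -i -w error "$workdir/app.log")
echo "$error_lines"

# -v inverts the same match and keeps the two remaining lines.
others=$(grep -v -i -w error "$workdir/app.log")
echo "$others"

rm -r "$workdir"
```

The count is 2, and the remaining lines are the warning and the one containing errors.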

cut

cut cuts out a portion of a file (or, as always, its input stream). cut works by defining a field delimiter (what separates two columns) with the -d option, and which column(s) should be extracted with the -f option.

For example, the following command extracts the first column of the last 5 lines of our CSV file.

$ tail -n 5 metadata.csv | cut -d , -f 1
mysql.performance.user_time
mysql.replication.seconds_behind_master
mysql.replication.slave_running
mysql.replication.slaves_connected
mysql.performance.queries

As we are dealing with a CSV file, we can extract each column by cutting over the , character, and extract the first column with -f 1.

We could also select both the first and second columns by using the -f 1,2 option.

$ tail -n 5 metadata.csv | cut -d , -f 1,2
mysql.performance.user_time,gauge
mysql.replication.seconds_behind_master,gauge
mysql.replication.slave_running,gauge
mysql.replication.slaves_connected,gauge
mysql.performance.queries,gauge
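
The -f option also accepts ranges of columns, which the examples above do not cover. The sketch below uses a shortened version of the last line of our CSV file, stored in a variable named record for the sake of the example.

```shell
# First five columns of the last line of metadata.csv.
record='mysql.performance.queries,gauge,,query,second'

# -f 1-2 selects a range of columns, here the first through the second.
first_two=$(echo "$record" | cut -d , -f 1-2)
echo "$first_two"

# -f 4- selects the fourth column and every column after it.
from_fourth=$(echo "$record" | cut -d , -f 4-)
echo "$from_fourth"
```

Note that the empty third column still counts as a column, which is why query is column 4.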

paste

paste can merge together two different files into one multi-column file.

$ cat ingredients
eggs
milk
butter
tomatoes
$ cat prices
1$
1.99$
1.50$
2$/kg
$ paste ingredients prices
eggs    1$
milk    1.99$
butter  1.50$
tomatoes    2$/kg

By default, paste uses a tab delimiter, but you can change that using the -d option.

$ paste -d: ingredients prices
eggs:1$
milk:1.99$
butter:1.50$
tomatoes:2$/kg

Another common use of paste is to join all lines within a stream or a file using a given delimiter, using a combination of the -s and -d arguments.

$ paste -s -d, ingredients
eggs,milk,butter,tomatoes

If - is specified as an input file, stdin will be read instead.

$ cat ingredients | paste -s -d, -
eggs,milk,butter,tomatoes
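
Joining lines with paste -s is handy to build an expression out of a column of numbers. In this sketch, the numbers are arbitrary and the shell's arithmetic expansion is used to evaluate the result; both are choices made for the illustration.

```shell
# Join one number per line into a single "a+b+c" expression.
expression=$(printf '%s\n' 4 8 15 16 23 42 | paste -s -d+ -)
echo "$expression"

# Let the shell evaluate the expression to get the sum.
total=$(( $expression ))
echo "$total"
```

The expression is 4+8+15+16+23+42 and its total is 108.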

sort

sort, well, sorts its argument files or input.

$ cat ingredients
eggs
milk
butter
tomatoes
salt
$ sort ingredients
butter
eggs
milk
salt
tomatoes

sort -r performs a reverse sort.

$ sort -r ingredients
tomatoes
salt
milk
eggs
butter

sort -n performs a numerical sort, by sorting fields by their arithmetic value.

$ cat numbers
0
2
1
10
3
$ cat numbers | sort
0
1
10
2
3
$ cat numbers | sort -n
0
1
2
3
10
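
sort can also sort on a given column instead of the whole line, using the -t (field separator) and -k (sort key) options, which are not covered above. A minimal sketch, assuming a made-up stock.csv file:

```shell
# Hypothetical two-column file: ingredient name, quantity in stock.
workdir=$(mktemp -d)
printf '%s\n' 'butter,3' 'eggs,12' 'milk,1' > "$workdir/stock.csv"

# -t sets the field separator, -k 2,2 restricts the sort key to the
# second field, and -n compares that key numerically.
by_quantity=$(sort -t , -k 2,2 -n "$workdir/stock.csv")
echo "$by_quantity"

rm -r "$workdir"
```

The lines come out ordered by quantity: milk, then butter, then eggs.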

uniq

uniq detects or filters out adjacent identical lines in its argument file or input stream.

$ cat duplicates
and one
and one
and two
and one
and two
and one, two, three
$ uniq duplicates
and one
and two
and one
and two
and one, two, three

As uniq only filters out adjacent identical lines, we can still see duplicate lines in its output. To filter out all identical lines from our duplicates file, we need to sort its content first.

$ sort duplicates | uniq
and one
and one, two, three
and two

uniq -c prefixes each line with its number of occurrences.

$ sort duplicates | uniq -c
   3 and one
   1 and one, two, three
   2 and two

uniq -u only displays the unique lines within its input.

$ sort duplicates | uniq -u
and one, two, three

uniq is particularly useful in conjunction with sort, as | sort | uniq allows you to remove any duplicate line in a file or a stream.
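
A very common variation is to rank lines by frequency, by feeding the output of uniq -c back into a reverse numerical sort. The sketch below recreates the duplicates file from this section in a temporary directory.

```shell
# Recreate the duplicates file used in this section.
workdir=$(mktemp -d)
printf '%s\n' 'and one' 'and one' 'and two' 'and one' 'and two' \
    'and one, two, three' > "$workdir/duplicates"

# Count each distinct line, then sort the counts from highest to lowest.
ranking=$(sort "$workdir/duplicates" | uniq -c | sort -r -n)
echo "$ranking"

# The first line of the ranking is the most frequent line in the file.
top=$(echo "$ranking" | head -n 1)
echo "$top"

rm -r "$workdir"
```

The amount of padding in front of each count depends on the uniq implementation, but the most frequent line here is "and one", with 3 occurrences.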

awk

awk is a little more than a text processing tool: it’s actually a whole programming language of its own. One thing awk is really good at is splitting files into columns, and it especially shines when these files contain a mix and match of spaces and tabs.

$ cat -t multi-columns
John Smith    Doctor^ITardis
Sarah-James Smith^I    Companion^ILondon
Rose Tyler   Companion^ILondon

cat -t displays tabs as ^I.

We can see that these columns are separated by either spaces or tabs, and not always by the same number of spaces. cut would be of no use here, because it only works with a single-character delimiter. awk, however, can easily make sense of that file.

awk '{ print $n }' prints the nth column in the text.

$ cat multi-columns | awk '{ print $1 }'
John
Sarah-James
Rose
$ cat multi-columns | awk '{ print $3 }'
Doctor
Companion
Companion
$ cat multi-columns | awk '{ print $1,$2 }'
John Smith
Sarah-James Smith
Rose Tyler

There is so much more we can do with awk; however, printing columns probably accounts for 99% of my personal usage.

awk '{ print $NF }' prints the last column in the line.
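
awk splits on spaces and tabs by default, but the -F option lets us pick another separator, such as the comma of our CSV file. The sketch below runs on the last line of metadata.csv, stored in a variable named record for the example.

```shell
# Last line of metadata.csv, as shown in the tail section.
record='mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries'

# NF holds the number of columns in the current line...
column_count=$(echo "$record" | awk -F, '{ print NF }')
echo "$column_count"

# ...so $NF is the content of the last column.
last_column=$(echo "$record" | awk -F, '{ print $NF }')
echo "$last_column"
```

The line has 9 columns, and the last one contains queries.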

tr

tr stands for translate, and it replaces characters with others. It works either on characters or on character classes, such as lowercase, printable, spaces, alphanumeric, etc.

tr <char1> <char2> translates all occurrences of <char1> from its standard input into <char2>.

$ echo "Computers are fast" | tr a A
Computers Are fAst

tr can also translate character classes by using the [:class:] notation. The full list of available classes is described in the tr man page, but we’ll demonstrate some of them here.

[:space:] represents all types of whitespace, from a simple space to a tab or a newline.

$ echo "computers are fast" | tr '[:space:]' ','
computers,are,fast,%

All space-like characters were translated into commas. Note that the % character at the end of the output is printed by the shell (zsh does this) to signal the lack of a trailing newline. Indeed, that newline was translated to a comma as well.

[:lower:] represents all lowercase characters, and [:upper:] represents all uppercase characters. Converting between cases is thus made very easy.

$ echo "computers are fast" | tr '[:lower:]' '[:upper:]'
COMPUTERS ARE FAST
$ echo "COMPUTERS ARE FAST" | tr '[:upper:]' '[:lower:]'
computers are fast

tr -c SET1 SET2 transforms any character not in SET1 into the characters in SET2. The following example replaces every character that is not a vowel with a space.

$ echo "computers are fast" | tr -c 'aeiouy' ' '
 o  u e   a e  a

tr -d deletes the matched characters instead of replacing them. It is as if each matched character were replaced by nothing.

$ echo "Computers Are Fast" | tr -d '[:lower:]'
C A F

tr can also replace character ranges, for example all letters between a and e, or all numbers between 1 and 8, by using the notation s-e, where s is the start character and e is the end one.

$ echo "computers are fast" | tr 'a-e' 'x'
xomputxrs xrx fxst
$ echo "5uch l337 5p34k" | tr '1-4' 'x'
5uch lxx7 5pxxk
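
A practical use of tr -d is to strip the carriage return characters found in files that use Windows-style line endings. This is a sketch on made-up input, relying on the \r escape sequence that tr understands.

```shell
# Two lines ending with a carriage return followed by a newline.
# tr -d '\r' deletes every carriage return and leaves the newlines alone.
clean=$(printf 'eggs\r\nmilk\r\n' | tr -d '\r')
echo "$clean"

# Without tr, the carriage returns stay in the text.
raw=$(printf 'eggs\r\nmilk\r\n')
```

Both variables look the same when printed on a terminal, but only clean is free of the invisible carriage returns.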

fold

fold wraps each input line to fit in a specified width. It can be useful to make sure an argument text fits in a small display size for example. fold -w n folds the lines at n characters.

$ cat ~/Documents/readme | fold -w 16
Thanks again for
 reading this bo
ok!
I hope you're fo
llowing so far!

fold -s only breaks lines on a space character, and can be combined with -w to fold up to a given number of characters.

$ cat ~/Documents/readme | fold -w 16 -s
Thanks again
for reading
this book!
I hope you're
following so
far!

sed

sed is a non-interactive stream editor, used to perform text transformations on its input stream, on a line-by-line basis. It can take its input from a file or its stdin and will output its result either in a file or its stdout.

It works by taking one or many optional addresses, a function and parameters. A sed command thus looks like this:

[address[,address]]function[arguments]

While sed can perform many functions, we will cover only substitution, as it is probably sed's most common use.

Substituting text

A sed substitution command looks like this:

s/PATTERN/REPLACEMENT/[options]

Example: replacing the first instance of a word for each line in a file

$ cat hello
hello hello
hello world
hi
$ cat hello | sed 's/hello/Hey I just met you/'
Hey I just met you hello
Hey I just met you world
hi

We can see that only the first occurrence of hello was replaced in the first line. To replace all occurrences of hello in each line, we can use the g (for global) option.

$ cat hello | sed 's/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
Hey I just met you world
hi

sed allows you to specify a separator other than /, which is especially useful to keep the command readable if the search or replacement pattern contains forward slashes.

$ cat hello | sed 's@hello@Hey I just met you@g'
Hey I just met you Hey I just met you
Hey I just met you world
hi
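
The benefit of a custom separator is most visible on file paths. In the following sketch, the path is a made-up example and both commands perform the same substitution.

```shell
# Hypothetical path, used only for this example.
path='/usr/local/bin/python'

# With / as separator, every slash in the patterns must be escaped.
escaped=$(echo "$path" | sed 's/\/usr\/local/\/opt/')
echo "$escaped"

# With @ as separator, the same substitution needs no escaping.
readable=$(echo "$path" | sed 's@/usr/local@/opt@')
echo "$readable"
```

Both commands print /opt/bin/python, but the second one is much easier to read.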

By specifying an address, we can tell sed on which line or line range to actually perform the substitution.

$ cat hello | sed '1s/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
hello world
hi
$ cat hello | sed '2s/hello/Hey I just met you/g'
hello hello
Hey I just met you world
hi

The address 1 tells sed to only replace hello by Hey I just met you on line 1. We can specify an address range with the notation <start>,<end> where <end> can either be a line number or $, meaning the last line in the file.

$ cat hello | sed '1,2s/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
Hey I just met you world
hi
$ cat hello | sed '2,3s/hello/Hey I just met you/g'
hello hello
Hey I just met you world
hi
$ cat hello | sed '2,$s/hello/Hey I just met you/g'
hello hello
Hey I just met you world
hi

By default, sed displays its result in its stdout, but it can also edit the initial file in-place, with the use of the -i option.

$ sed -i '' 's/hello/Bonjour/' sed-data
$ cat sed-data
Bonjour hello
Bonjour world
hi

On Linux, only -i needs to be specified. However, the sed shipped with macOS requires -i to be followed by a backup file suffix, which is why an empty string ('') is added right after -i to skip the backup.
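
If a command needs to run on both Linux and macOS, one option is to attach a backup suffix directly to -i, a form that both flavors of sed accept. This is a sketch, assuming sed-data has the same content as the hello file used earlier; the .bak suffix is an arbitrary choice.

```shell
# Recreate a sed-data file in a temporary directory.
workdir=$(mktemp -d)
printf 'hello hello\nhello world\nhi\n' > "$workdir/sed-data"

# -i.bak edits the file in place and keeps the original as sed-data.bak.
sed -i.bak 's/hello/Bonjour/' "$workdir/sed-data"

edited=$(cat "$workdir/sed-data")
backup=$(cat "$workdir/sed-data.bak")
echo "$edited"

rm -r "$workdir"
```

The price to pay is an extra backup file, which can be deleted once the result has been checked.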

Real-life examples

Filtering a CSV using grep and awk

$ grep -w gauge metadata.csv | awk -F, '{ if ($4 == "query") { print $1, "per", $5 } }'
mysql.performance.com_delete per second
mysql.performance.com_delete_multi per second
mysql.performance.com_insert per second
mysql.performance.com_insert_select per second
mysql.performance.com_replace_select per second
mysql.performance.com_select per second
mysql.performance.com_update per second
mysql.performance.com_update_multi per second
mysql.performance.questions per second
mysql.performance.slow_queries per second
mysql.performance.queries per second
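
The same kind of filtering can be expressed with awk alone, by putting the conditions in front of the braces instead of using grep and an explicit if. The sketch below runs on a three-line extract of metadata.csv saved under the made-up name sample.csv. It is slightly stricter than the command above, as it only looks for gauge in the second column whereas grep -w matches it anywhere in the line.

```shell
# A small extract of metadata.csv: the header and three of its lines.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
mysql.galera.wsrep_cluster_size,gauge,,node,,The current number of nodes in the Galera cluster.,0,mysql,galera cluster size
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries
EOF

# The action between braces only runs on lines for which the condition holds.
result=$(awk -F, '$2 == "gauge" && $4 == "query" { print $1, "per", $5 }' "$workdir/sample.csv")
echo "$result"

rm -r "$workdir"
```

Only the last line of the extract has query as its unit name, so the command prints mysql.performance.queries per second.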