Here are some hints on how to write more idiomatic (and usually shorter and more efficient) awk programs. Many awk programs you're likely to encounter, especially short ones, make heavy use of these notions.
The power of conditions
As a starting example, suppose you want to print all the records (normally lines) in a file that match some pattern (a kind of awk-grep, if you like). A reasonable first shot is usually something like
awk '{if ($0 ~ /pattern/) print $0}'
That works, but there are some things to note.
The first thing to note is that it is not structured according to awk's definition of a program, which is a sequence of rules of the form
condition { actions }
Our program can clearly be rewritten using this form, since both the condition and the action are very clear:
awk '$0 ~ /pattern/ {print $0}'
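Our next step in the perfect awk-ification of this program is to note that /pattern/ on its own is the same as $0 ~ /pattern/: when a regular expression is used alone as a condition, awk implicitly matches it against $0. So we can write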
awk '/pattern/ {print $0}'
Now, let's turn our attention to the action part (the stuff inside braces). With no arguments, print prints $0 by default, so print $0 can be shortened to print:
awk '/pattern/ {print}'
But let's make another step. When it finds that a condition is true, and there are no associated actions, awk performs a default action, and that action (you guessed it) is print (which we already know is equivalent to print $0). So the program becomes
awk '/pattern/'
Now we have reduced the initial program to its simplest (and most idiomatic) form. In many cases, if all you want to do is print some records (lines) according to a condition, you can write awk programs composed only of a condition (however complex):
awk '(NR%2 && /pattern/) || (!(NR%2) && /anotherpattern/)'
That prints odd lines that match /pattern/ and even lines that match /anotherpattern/. Naturally, if you don’t want to print $0 but instead do something else, then you’ll have to manually add a specific action to do what you want.
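As a quick sanity check, the four forms of the program discussed above can be run side by side on the same input. This is only a minimal sketch: the sample function and the pattern /an/ are made up for illustration and are not part of the original examples.

```shell
# Hypothetical three-line input used only for this illustration.
sample() { printf 'apple\nbanana\ncherry\n'; }

sample | awk '{ if ($0 ~ /an/) print $0 }'   # explicit if inside an action
sample | awk '$0 ~ /an/ { print $0 }'        # condition { action }
sample | awk '/an/ { print }'                # bare regex, bare print
sample | awk '/an/'                          # condition only, default action
```

Each of the four commands should print the single line banana.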
From the above, it follows that
awk 1
awk '"a"' # single quotes are important!
are two awk programs that just print their input unchanged, both "1" and the string "a" obviously being always-true conditions. This is not terribly useful by itself, but it can be used in combination with other code in a number of circumstances.
For example, sometimes you want to operate only on some records of the input (according to some condition), but also want to print all records, regardless of whether they were affected by your operation or not. A typical example is a program like this:
awk '{sub(/pattern/, "foobar")} 1'
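This tries to replace whatever matches /pattern/ with "foobar" in each record. Whether or not the substitution succeeds, the always-true condition 1 prints the record, so the program does the same job as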
sed 's/pattern/foobar/'
Here are some examples of typical awk programs, using only conditions:
awk 'NR % 6' # prints all lines except lines 6,12,18...
awk 'NR > 5' # prints from line 6 onwards (like tail -n +6, or sed '1,5d')
awk '$2 == "foo"' # prints lines where the second field is "foo"
awk 'NF >= 6' # prints lines with 6 or more fields
awk '/foo/ && /bar/' # prints lines that match /foo/ and /bar/, in any order
awk '/foo/ && !/bar/' # prints lines that match /foo/ but not /bar/
awk '/foo/ || /bar/' # prints lines that match /foo/ or /bar/ (like grep -e 'foo' -e 'bar')
awk '/foo/,/bar/' # prints from line matching /foo/ to line matching /bar/, inclusive
awk 'NF' # prints only nonempty lines (or: do not print empty lines, where NF==0)
awk 'NF--' # removes last field and prints the line
awk '$0 = NR" "$0' # prepends line numbers (assignments are valid in conditions)
As an extreme example of the power of conditions, let's examine the following code:
awk 'ORS = NR % 5 ? FS : RS'
You might also find it written with no spaces at all, especially in golf-ish contexts:
awk 'ORS=NR%5?FS:RS'
Let's run it using some simple input:
$ seq 1 30 | awk 'ORS=NR%5?FS:RS'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
So what it does is columnate the input (in groups of five columns here, but just change the number 5 in the code to the number you want, or use a variable and pass the value using -v). Why does it work? Well, when awk prints an output record (line), the special variable ORS, as the name suggests, contains the separator to append to the line. By default, ORS is "\n", i.e. a newline character. But here we are explicitly assigning a value to ORS depending on the outcome of the ternary operator. So if NR%5 is nonzero (true), ORS is set to FS (a single space by default); otherwise it is set to RS (a newline by default). With the default values of FS and RS, the code is therefore equivalent to
awk 'ORS=NR%5?" ":"\n"'
Overall, the whole program is a single assignment which, as we've seen, evaluates to the assigned value (here either FS or RS, both of which are non-empty strings and therefore true). Since the condition is always true, the default action (print) is executed for every line, and each time ORS supplies either a space or a newline, depending on whether we are at column 5 or not.
An issue with this code is that if the number of lines in the input is not a multiple of the number of columns (5 in this case), the output ends with a space rather than a newline. You can add an END block to correct that.
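One possible shape for that END block is sketched below. The wrapper function columnate and the variable n are illustrative names, not something the original code defines; the number of columns is passed with -v as suggested earlier.

```shell
# columnate N: read lines on stdin and print them N per row.
# The function name and the awk variable n are illustrative choices.
columnate() {
  awk -v n="$1" '
    ORS = NR % n ? FS : RS              # always true, so every record is printed
    END { if (NR % n) printf "\n" }     # last row incomplete: terminate it
  '
}

seq 1 7 | columnate 3
```

With seven input lines and three columns, the last row holds only "7" followed by the space that ORS added, and the END block then supplies the missing newline. printf is used rather than print because ORS still holds a space at that point.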
Note that the ternary operator can be changed to yield different results; for example, this code
awk 'ORS=/foo$/?FS:RS'
joins the following line to the current one if the current one ends in foo. You are encouraged to find your own variations.
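A small run makes the effect easier to see. The three input lines here are invented for the illustration.

```shell
# The first line ends in "foo", so the line after it is joined to it.
printf 'one foo\ntwo\nthree\n' | awk 'ORS = /foo$/ ? FS : RS'
```

This should print "one foo two" on the first output line and "three" on the second.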
Self-assignments
Let's take another relatively common task: replacing the delimiters. Your input fields are delimited, for example, by semicolons (or any other arbitrarily complex separator), and you want to change that to, say, commas. Armed with the knowledge you gathered in the first part of this article, you do
awk -v FS=';' -v OFS=',' 1 # doesn't work!
but that doesn't work, and it outputs the input unchanged. The reason is that awk does not rebuild $0 (that is, replace FS with OFS) until some field is modified. That might seem strange at first, but it makes sense (and is even useful) in many circumstances. One reason is efficiency: if the input and output separators are the same, as is often the case, replacing FS with OFS is pointless. And if you think about it, if awk really always replaced FS with OFS, a program as simple as
$ echo 'foo;bar' | awk -v FS=';' -v OFS=',' '/foo/'
would output
foo,bar # ????
which violates the principle of least surprise and most certainly is not what one expects here.
On the other hand, if replacing FS with OFS is needed, then obviously awk has to do that at some point. So the question is: when is it that awk thinks FS has to be replaced with OFS? As mentioned above, awk assumes that it's time to replace FS with OFS (and thus recompute $0) when a field is modified, which is a sensible assumption, and almost always produces the output that one would expect.
So back to the original problem, you can now see a solution: to force recomputation of $0, just let awk think you've changed a field, but without changing it:
awk -v FS=';' -v OFS=',' '{$1=$1}1'
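The $1=$1 assignment leaves the value of the field unchanged, but it is enough to make awk rebuild $0 using OFS; the trailing 1 then prints the rebuilt record.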
Warning: useless information follows.
If you're sure that $1 will never be an empty string or a number equal to zero (either of which makes the condition false), you can even golf the code a bit more and use the assignment as the condition, so
awk -v FS=';' -v OFS=',' '$1=$1'
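The difference between the two forms only shows up on records whose first field is empty or zero. The following sketch uses a made-up three-record input (the sample function is hypothetical) to compare them.

```shell
# Hypothetical input: the second record has an empty first field,
# the third has a first field equal to 0.
sample() { printf 'foo;bar\n;baz\n0;qux\n'; }

# Safe form: the assignment sits in an action, and 1 prints every record.
sample | awk -v FS=';' -v OFS=',' '{ $1 = $1 } 1'

# Golfed form: the assignment is the condition, so a record is printed
# only when the value of $1 is true (non-empty and not zero).
sample | awk -v FS=';' -v OFS=',' '$1 = $1'
```

The safe form should print all three records with commas; the golfed form should print only foo,bar.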
If you want to shorten it even more (and make it more cryptic) to impress your friends or for whatever reason, you can move the assignments to the end to save the -v and remove some quotes:
awk '$1=$1' FS=\; OFS=,
That exploits an obscure feature of awk: any command-line argument of the form var=value is treated as a variable assignment instead of a file to read (which, by the way, makes it hard to persuade awk to operate on files whose names contain "="; I'm sure your life will never be the same now that you know this). Well, I did say that this was useless information, so let's get back to something more practical...
Build strings with separators
This is similar to the so-called fencepost problem. On many occasions you need to build a string using concatenation, starting from an empty string, and adding values as you go. The values should be separated by some separator (let's say, a semicolon for these examples). One might do this, in some loop:
string = string ";" somedata
but then string has an unwanted leading semicolon. Putting the semicolon after the variable has a similar problem. So a typical way to do this with awk is this:
string = string sep somedata; sep = ";"
This exploits the fact that awk variables start out containing the dual value empty string or zero, so the first time the code is executed, sep is empty (you can explicitly initialize it to the empty string in a BEGIN block, if you like, but it's redundant). Then it's set to a semicolon, and it will have that value from the second time the code is executed onwards. The result is that at the end string will have a neat list of values with the semicolons only where they should be.
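As a minimal sketch of the idiom in a complete program, the following joins all input lines into a single semicolon-separated line. The function name join_lines is an illustrative choice.

```shell
# join_lines: join every input line into one line, separated by semicolons.
join_lines() {
  awk '
    { string = string sep $0; sep = ";" }   # sep is empty for the first record only
    END { print string }
  '
}

printf 'a\nb\nc\n' | join_lines
```

This should print a;b;c, with no leading or trailing semicolon.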
As a practical example of this idiom, let's see how to transpose a matrix using awk:
$ cat matrix.txt
a1;a2;a3;a4;a5
b1;b2;b3;b4;b5
c1;c2;c3;c4;c5
d1;d2;d3;d4;d5
$ awk -F\; '{for(i=1;i<=NF;i++)r[i]=r[i] sep $i;sep=FS}END{for(i=1;i<=NF;i++)print r[i]}' matrix.txt
a1;b1;c1;d1
a2;b2;c2;d2
a3;b3;c3;d3
a4;b4;c4;d4
a5;b5;c5;d5
The idea here is to build an array r with NF elements (the number of columns in the original input), each of which will hold a line of the output. For each input line, every element of r has another "column" added. The variable sep is initially empty, then (after the first input line has been processed) it's set to semicolon. Of course, for more complex processing, an array of separators could be used.
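The one-liner above relies on NF still holding the field count of the last record inside END. A slightly more defensive variant, sketched here under the assumption that you prefer not to depend on that, records the column count explicitly. The names transpose, row and ncols are illustrative.

```shell
# transpose: transpose a semicolon-separated matrix read from stdin.
transpose() {
  awk -F';' '
    {
      for (i = 1; i <= NF; i++)
        row[i] = row[i] sep $i          # append this record as a new column
      sep = FS                          # empty for the first record only
      if (NF > ncols) ncols = NF        # remember the widest record seen
    }
    END { for (i = 1; i <= ncols; i++) print row[i] }
  '
}

printf 'a1;a2;a3\nb1;b2;b3\n' | transpose
```

For the two-row input shown, the output should be the three lines a1;b1, a2;b2 and a3;b3.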
Two-file processing
Another construct that is often used in awk is as follows:
$ awk '
    NR == FNR {
        # some actions
        next
    }
    # other condition
    {
        # other actions
    }
' file1.txt file2.txt
This is used when processing two files. When processing more than one file, awk reads each file sequentially, one after another, in the order they are specified on the command line. The special variable NR stores the total number of input records read so far, regardless of how many files have been read. The value of NR starts at 1 and always increases until the program terminates. Another variable, FNR, stores the number of records read from the current file being processed. The value of FNR starts at 1, increases until the end of the current file is reached, then is set again to 1 as soon as the first line of the next file is read, and so on. So, the condition NR == FNR is only true while awk is reading the first file. Thus, in the program above, the actions indicated by "# some actions" are executed while awk is reading the first file, and the actions indicated by "# other actions" are executed while awk is reading the second file, if the condition indicated by "# other condition" is met. The next at the end of the first block is needed to prevent the second rule from being evaluated while awk is reading the first file.
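To watch NR and FNR evolve across two files, you can simply print them next to each record. The sketch below creates two small throwaway files; the function name and file names are invented for the illustration.

```shell
# nr_fnr_demo: print NR, FNR and the record for two small temporary files.
nr_fnr_demo() {
  tmp=$(mktemp -d)
  printf 'a\nb\n'    > "$tmp/one.txt"
  printf 'c\nd\ne\n' > "$tmp/two.txt"
  awk '{ print NR, FNR, $0 }' "$tmp/one.txt" "$tmp/two.txt"
  rm -rf "$tmp"
}

nr_fnr_demo
# 1 1 a
# 2 2 b
# 3 1 c
# 4 2 d
# 5 3 e
```

NR and FNR are equal only on the first two lines of output, that is, only for the records of the first file.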
It probably all becomes much clearer with some examples. Many problems that involve two files can be solved using this technique. Let's look at this:
# prints lines that are both in file1.txt and file2.txt (intersection)
$ awk 'NR == FNR{a[$0];next} $0 in a' file1.txt file2.txt
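Here we see another typical idiom: a[$0] does nothing except create an array element indexed by $0. While the first file is read, every line is thus remembered as an index of the array a. While the second file is read, the condition $0 in a checks whether the current line exists as an index of a; if it does, the line is printed. A closely related program: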
# prints lines that are only in file1.txt and not in file2.txt
$ awk 'NR == FNR{a[$0];next} !($0 in a)' file2.txt file1.txt
Note the order of the arguments.
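Both programs can be tried on two tiny files. This is a sketch with invented file contents; the function name set_ops_demo and the array name seen are illustrative.

```shell
# set_ops_demo: intersection and difference of two small temporary files.
set_ops_demo() {
  tmp=$(mktemp -d)
  printf 'a\nb\nc\n' > "$tmp/file1.txt"
  printf 'b\nc\nd\n' > "$tmp/file2.txt"

  echo "in both:"
  awk 'NR == FNR { seen[$0]; next } $0 in seen' "$tmp/file1.txt" "$tmp/file2.txt"

  echo "only in file1:"
  # file2.txt goes first: its lines are the ones to remember and exclude.
  awk 'NR == FNR { seen[$0]; next } !($0 in seen)' "$tmp/file2.txt" "$tmp/file1.txt"

  rm -rf "$tmp"
}

set_ops_demo
```

The first program should print b and c; the second should print only a.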
Another example. Suppose we have a data file like this
$ cat data.txt
20081010 1123 xxx
20081011 1234 def
20081012 0933 xyz
20081013 0512 abc
20081013 0717 def
...thousands of lines...
where "xxx", "def", etc. are some kind of operation codes. We want to replace each operation code with its description. We have another file that maps operation codes to human readable descriptions, like this:
$ cat map.txt
abc withdrawal
def payment
xyz deposit
xxx balance
...other codes...
We can easily replace the opcodes in the data file with this simple awk program, which again uses the two-files idiom (and other idioms that were already introduced):
# use information from a map file to modify a data file
$ awk 'NR == FNR{a[$1]=$2;next} {$3=a[$3]}1' map.txt data.txt
First, the array a, indexed by opcode, is populated with the human readable descriptions, read from the map file. Then, it is used during the reading of the data file to do the replacements. Each line of the data file is then printed after the substitution has been made.
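One thing to keep in mind is that an opcode missing from the map file would be replaced with an empty string. A possible variation, sketched here with invented sample data, only performs the replacement when the code is known. The names map_demo and desc are illustrative.

```shell
# map_demo: replace known opcodes, leave unknown ones untouched.
map_demo() {
  tmp=$(mktemp -d)
  printf 'abc withdrawal\ndef payment\n'             > "$tmp/map.txt"
  printf '20081010 1123 abc\n20081011 1234 zzz\n'    > "$tmp/data.txt"
  awk '
    NR == FNR { desc[$1] = $2; next }   # first file: load the map
    $3 in desc { $3 = desc[$3] }        # second file: replace known codes only
    1                                   # print every data record
  ' "$tmp/map.txt" "$tmp/data.txt"
  rm -rf "$tmp"
}

map_demo
```

Here the first data line should end in withdrawal, while the second keeps its unknown code zzz.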
Another case where the two-files idiom is useful is when you have to read the same file twice, the first time to get some information that can be correctly defined only by reading the whole file, and the second time to process the file using that information. For example, you want to replace each number in a list of numbers with its difference from the largest number in the list:
# replace each number with its difference from the maximum
# (NR == 1 seeds max with the first value, so all-negative input also works)
$ awk 'NR == FNR{if(NR == 1 || $0 > max) max = $0;next} {$0 = max - $0}1' file.txt file.txt
Note that we specify file.txt twice on the command line, so that awk reads it twice.
As with all other idioms, you are encouraged to find your own uses and variations.
Caveat: all the programs that use the two-files idiom will not work correctly if the first file is empty (in that case, awk will execute the actions associated with NR == FNR while reading the second file).
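One common workaround for the empty-first-file problem is to test FILENAME == ARGV[1] instead of NR == FNR. The sketch below compares the two conditions with an empty first file; the function and file names are invented. Note that this guard assumes the two file arguments have different names and that the first file is the first command-line operand, so it does not fit the case where the same file is given twice.

```shell
# difference_demo: "lines of all.txt not in exclude.txt", with exclude.txt empty.
difference_demo() {
  tmp=$(mktemp -d)
  : > "$tmp/exclude.txt"                 # empty first file
  printf 'a\nb\n' > "$tmp/all.txt"

  echo "NR == FNR:"
  # Wrong here: NR == FNR is true for all.txt, so nothing is printed.
  awk 'NR == FNR { seen[$0]; next } !($0 in seen)' "$tmp/exclude.txt" "$tmp/all.txt"

  echo "FILENAME == ARGV[1]:"
  # The first rule never fires for all.txt, so both lines are printed.
  awk 'FILENAME == ARGV[1] { seen[$0]; next } !($0 in seen)' "$tmp/exclude.txt" "$tmp/all.txt"

  rm -rf "$tmp"
}

difference_demo
```

With the NR == FNR version nothing should appear under the first heading, while the FILENAME version should print a and b.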









