
Idiomatic awk

Here we'll see some hints on how to write more idiomatic (and usually shorter and more efficient) awk programs. Many awk programs you’re likely to encounter, especially short ones, make heavy use of these notions.

The power of conditions

As a starting example, suppose you want to print all the records (normally lines) in a file that match some pattern (a kind of awk-grep, if you like). A reasonable first shot is usually something like

awk '{if ($0 ~ /pattern/) print $0}'

That works, but there are some things to note.

The first thing to note is that it is not structured according to awk’s own definition of a program, which is

condition { actions }

Our program can clearly be rewritten using this form, since both the condition and the action are very clear:

awk '$0 ~ /pattern/ {print $0}'

Our next step in the perfect awk-ification of this program is to note that the /pattern/ syntax is the same as $0 ~ /pattern/. That is, when awk sees a regular expression literal used as an expression, it implicitly applies it to $0, and returns true if there is a match. So now we have:

awk '/pattern/ {print $0}'

Now, let’s turn our attention to the action part (the stuff inside braces). print $0 is redundant, since print alone, by default, prints $0.

awk '/pattern/ {print}'

But let's make another step. When it finds that a condition is true, and there are no associated actions, awk performs a default action, and that action (you guessed it) is print (which we already know is equivalent to print $0). Thus we can finally do this:

awk '/pattern/'

Now we have reduced the initial program to its simplest (and most idiomatic) form. In many cases, if all you want to do is print some records (lines) according to a condition, you can write awk programs composed only of a condition (however complex):

awk '(NR%2 && /pattern/) || (!(NR%2) && /anotherpattern/)'

That prints odd lines that match /pattern/ and even lines that match /anotherpattern/. Naturally, if you don’t want to print $0 but instead do something else, then you’ll have to manually add a specific action to do what you want.
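
A quick way to convince yourself that the rewrites above really are equivalent is to run the first and the last version on the same input and compare the results. This is only a sketch: the three sample lines and the shell variable names are made up for illustration.

```shell
# Made-up sample input: two of the three lines contain "pattern".
input='first pattern line
nothing here
another pattern'

# The verbose starting point and the fully reduced form.
long=$(printf '%s\n' "$input" | awk '{if ($0 ~ /pattern/) print $0}')
short=$(printf '%s\n' "$input" | awk '/pattern/')

printf '%s\n' "$short"
[ "$long" = "$short" ] && echo "same output"
```

Both commands print the first and third line; the second line is dropped.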

From the above, it follows that

awk 1
awk '"a"'   # single quotes are important!

are two awk programs that just print their input unchanged, both "1" and the string "a" obviously being always-true conditions. This is not terribly useful by itself, but it can be used in combination with other code in a number of circumstances.
For example, sometimes you want to operate only on some records of the input (according to some condition), but also want to print all records, regardless of whether they were affected by your operation or not. A typical example is a program like this:

awk '{sub(/pattern/, "foobar")} 1'

This tries to replace whatever matches /pattern/ with "foobar". But whether or not the substitution succeeds, the always-true condition "1" prints each line (you could even use 42, or 19, or any other nonzero value if you so prefer; 1 is just what people traditionally use). This results in a program that does the same job as

sed 's/pattern/foobar/'
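
To check the claim, the two commands can be run side by side. The input lines here are invented for the purpose; note that sub(), like sed's s command without the g flag, replaces only the first match on each line.

```shell
# Made-up input: one match, no match, two matches.
input='a pattern here
no match on this line
pattern twice: pattern'

with_awk=$(printf '%s\n' "$input" | awk '{sub(/pattern/, "foobar")} 1')
with_sed=$(printf '%s\n' "$input" | sed 's/pattern/foobar/')

printf '%s\n' "$with_awk"
[ "$with_awk" = "$with_sed" ] && echo "awk and sed agree"
```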

Here are some examples of typical awk programs, using only conditions:

awk 'NR % 6'            # prints all lines except lines 6,12,18...
awk 'NR > 5'            # prints from line 6 onwards (like tail -n +6, or sed '1,5d')
awk '$2 == "foo"'       # prints lines where the second field is "foo"
awk 'NF >= 6'           # prints lines with 6 or more fields
awk '/foo/ && /bar/'    # prints lines that match /foo/ and /bar/, in any order
awk '/foo/ && !/bar/'   # prints lines that match /foo/ but not /bar/
awk '/foo/ || /bar/'    # prints lines that match /foo/ or /bar/ (like grep -e 'foo' -e 'bar')
awk '/foo/,/bar/'       # prints from line matching /foo/ to line matching /bar/, inclusive
awk 'NF'                # prints only nonempty lines (or: do not print empty lines, where NF==0)
awk 'NF--'              # removes last field and prints the line
awk '$0 = NR" "$0'      # prepends line numbers (assignments are valid in conditions)
awk '!a[$0]++'          # suppresses duplicated lines! (figure out how it works)
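
If you want to check your answer for the last one, here is one possible spelled-out equivalent (a sketch, not the only way to read it): a[$0] counts how many times the current line has been seen before, so the condition is true only the first time. The sample input is made up.

```shell
# Made-up input with repeated lines.
input='x
y
x
z
y'

terse=$(printf '%s\n' "$input" | awk '!a[$0]++')

# Same logic written as an explicit action.
verbose=$(printf '%s\n' "$input" | awk '{
  if (a[$0] == 0) print    # never seen before: print it
  a[$0] = a[$0] + 1        # remember that we have seen it
}')

printf '%s\n' "$terse"
```

Both versions keep the first occurrence of each line (x, y, z) in the original order.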

As an extreme example of the power of conditions, let's examine the following code:

awk 'ORS = NR % 5 ? FS : RS'

You might also find it written with no spaces at all, especially in golf-ish contexts:

awk 'ORS=NR%5?FS:RS'

Let's run it using some simple input:

$ seq 1 30 | awk 'ORS=NR%5?FS:RS'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30

So what it does is arrange the input in columns (five per row here, but just change the number 5 in the code to the number you want, or use a variable and pass the value using -v). Why does it work? Well, when awk prints an output record (line), the special variable ORS, as the name suggests, contains the separator to append to the line. By default, ORS is "\n", i.e. a newline character. But here we are explicitly assigning a value to ORS depending on the outcome of the ternary operator. So if NR%5 is zero (i.e. we are at line 5, 10, 15, etc.), ORS gets the value of RS (which by default contains "\n"); otherwise ORS gets the value of FS (by default, a space). If RS and FS have their standard values, it can be rewritten like this:

awk 'ORS=NR%5?" ":"\n"'

Overall, the whole program is a single assignment which, as we've seen, is evaluated and returns the assigned value (so either the value of RS or that of FS here, both of which are nonempty strings and therefore true). Since the condition is always true, the default action (print) is executed for every line, and as that is done, ORS appends either a newline or a space, depending on whether we are at column 5 or not.
An issue with this code is that if the number of lines in the input is not a multiple of the number of columns (5 in this case), the output ends with a space rather than a newline. You can add an END block to correct that.
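
One possible shape for that END block is sketched below. The variable n (passed with -v) is introduced here for the number of columns, and seven input lines are used so that the last row is incomplete. Note that this only terminates the last row with a newline; the space printed after the last value is still there.

```shell
# Seven lines, five columns: the second row is incomplete.
numbers() { printf '%s\n' 1 2 3 4 5 6 7; }

# Without the END block the output contains a single newline.
rows_plain=$(numbers | awk -v n=5 'ORS = NR % n ? FS : RS' | wc -l)

# With the END block the incomplete last row is terminated too.
rows_fixed=$(numbers | awk -v n=5 '
  ORS = NR % n ? FS : RS             # condition only: print using the chosen ORS
  END { if (NR % n) printf "\n" }    # incomplete last row: close it
' | wc -l)

echo "newlines without END: $((rows_plain)), with END: $((rows_fixed))"
```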

Note that the ternary operator can be changed to yield different results; for example, this code

awk 'ORS=/foo$/?FS:RS'

joins the following line to the current one if the current one ends in foo. You are encouraged to find your own variations.

Self-assignments

Let's take another relatively common task: replacing the delimiters. Your input fields are delimited, for example, by semicolons (or any other arbitrarily complex separator), and you want to change that to, say, commas. Armed with the knowledge you gathered in the first part of this article, you do

awk -v FS=';' -v OFS=',' 1   # doesn't work!

but that doesn't work: it outputs the input unchanged. The reason is that awk does not rebuild $0 (that is, replace FS with OFS) until some field is modified. That might seem strange at first, but it makes sense (and is even useful) in many circumstances. One reason for this behavior is efficiency: if the input and output separators are the same, as is often the case, replacing FS with OFS is pointless. And if you think about it, if awk really always replaced FS with OFS, a program as simple as

$ echo 'foo;bar' | awk -v FS=';' -v OFS=',' '/foo/'

would output

foo,bar   # ????

which violates the principle of least surprise and most certainly is not what one expects here.
On the other hand, if replacing FS with OFS is needed, then obviously awk has to do that at some point. So the question is: when is it that awk thinks FS has to be replaced with OFS? As mentioned above, awk assumes that it's time to replace FS with OFS (and thus recompute $0) when a field is modified, which is a sensible assumption, and almost always produces the output that one would expect.

So back to the original problem, you can now see a solution: to force recomputation of $0, just let awk think you've changed a field, but without changing it:

awk -v FS=';' -v OFS=',' '{$1=$1}1'

The $1=$1 bit is what confuses many people, who wonder what it's for. It's a typical awk idiom to force awk to rebuild $0 (usually to apply some OFS).
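
Here are the two programs side by side on a single made-up record, to show the difference the self-assignment makes:

```shell
# No field is touched: $0 is printed exactly as it was read.
unchanged=$(echo 'foo;bar;baz' | awk -v FS=';' -v OFS=',' 1)

# $1=$1 "modifies" a field, so $0 is rebuilt using OFS.
rebuilt=$(echo 'foo;bar;baz' | awk -v FS=';' -v OFS=',' '{$1=$1}1')

echo "$unchanged"   # foo;bar;baz
echo "$rebuilt"     # foo,bar,baz
```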

Warning: useless information follows.

If you're sure that $1 will never be an empty string, you can even golf the code a bit more and use the assignment as the condition, so

awk -v FS=';' -v OFS=',' '$1=$1'

If you want to shorten it even more (and make it more cryptic) to impress your friends or for whatever reason, you can move the assignments to the end to save the -v and remove some quotes:

awk '$1=$1' FS=\; OFS=,

That exploits an obscure feature of awk where any command-line argument of the form var=value is treated as a variable assignment instead of a file to read (which, btw, makes it awkward to get awk to operate on files whose name looks like that. I'm sure your life will never be the same now that you know this). Well, I did say that this was useless information, so let's get back to something more practical...
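
In case you ever do meet such a file, a common workaround is to prefix the name with ./ so that it no longer looks like an assignment. The sketch below uses a throwaway directory and a hypothetical file called data=1.txt.

```shell
dir=$(mktemp -d)
printf 'foo;bar\n' > "$dir/data=1.txt"    # hypothetical file name containing "="

# Taken as the assignment data="1.txt": no file operand is left, so awk
# reads standard input (empty here) and prints nothing.
as_assignment=$(cd "$dir" && awk '{ print }' data=1.txt </dev/null)

# The leading ./ makes the operand a plain file name again.
as_file=$(cd "$dir" && awk '{ print }' ./data=1.txt)

echo "as assignment: [$as_assignment]"
echo "as file:       [$as_file]"
rm -r "$dir"
```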

Build strings with separators

This is similar to the so-called fencepost problem. On many occasions you need to build a string using concatenation, starting from an empty string, and adding values as you go. The values should be separated by some separator (let's say, a semicolon for these examples). One might do this, in some loop:

string = string ";" somedata

but then string has an unwanted leading semicolon. Putting the semicolon after the variable has a similar problem. So a typical way to do this with awk is this:

string = string sep somedata; sep = ";"

This exploits the fact that awk variables start out containing the dual value empty string or zero, so the first time the code is executed, sep is empty (you can explicitly initialize it to the empty string in a BEGIN block, if you like, but it's redundant). Then it's set to a semicolon, and it will have that value from the second time the code is executed onwards. The result is that at the end string will have a neat list of values with the semicolons only where they should be.
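
As a minimal illustration (with made-up input), here the "loop" is simply awk's own loop over the input lines, and the string is printed at the end:

```shell
joined=$(printf '%s\n' apple banana cherry | awk '
  { string = string sep $0; sep = ";" }   # sep is empty only the first time
  END { print string }
')

echo "$joined"    # apple;banana;cherry
```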

As a practical example of this idiom, let's see how to transpose a matrix using awk:

$ cat matrix.txt
a1;a2;a3;a4;a5
b1;b2;b3;b4;b5
c1;c2;c3;c4;c5
d1;d2;d3;d4;d5
$ awk -F\; '{for(i=1;i<=NF;i++)r[i]=r[i] sep $i;sep=FS}END{for(i=1;i<=NF;i++)print r[i]}' matrix.txt
a1;b1;c1;d1
a2;b2;c2;d2
a3;b3;c3;d3
a4;b4;c4;d4
a5;b5;c5;d5

The idea here is to build an array r with NF elements (the number of columns in the original input), each of which will hold a line of the output. For each input line, every element of r has another "column" added. The variable sep is initially empty, then (after the first input line has been processed) it's set to semicolon. Of course, for more complex processing, an array of separators could be used.

Two-file processing

Another construct that is often used in awk is as follows:

$ awk 'NR == FNR { # some actions; next} # other condition {# other actions}' file1.txt file2.txt

This is used when processing two files. When processing more than one file, awk reads each file sequentially, one after another, in the order they are specified on the command line. The special variable NR stores the total number of input records read so far, regardless of how many files have been read. The value of NR starts at 1 and always increases until the program terminates. Another variable, FNR, stores the number of records read from the current file being processed. The value of FNR starts at 1, increases until the end of the current file is reached, then is set again to 1 as soon as the first line of the next file is read, and so on. So, the condition NR == FNR is only true while awk is reading the first file. Thus, in the program above, the actions indicated by # some actions are executed when awk is reading the first file; the actions indicated by # other actions are executed when awk is reading the second file, if the condition in # other condition is met. The next at the end of the first action block is needed to prevent the condition in # other condition from being evaluated, and the actions in # other actions from being executed, while awk is reading the first file.
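
To watch NR and FNR diverge, you can print them for every record. This sketch creates two small throwaway files (their names and contents are made up) and labels each record according to the outcome of NR == FNR:

```shell
dir=$(mktemp -d)
printf 'a\nb\n'    > "$dir/file1.txt"
printf 'c\nd\ne\n' > "$dir/file2.txt"

# Columns: NR, FNR, outcome of NR == FNR, the record itself.
trace=$(awk '{ print NR, FNR, (NR == FNR ? "first" : "later"), $0 }' \
            "$dir/file1.txt" "$dir/file2.txt")

printf '%s\n' "$trace"
rm -r "$dir"
```

The first two records are labelled "first"; from the third record on, FNR restarts from 1 while NR keeps growing, so the label becomes "later".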

Probably, it all becomes much clearer with some examples. There are really many problems that involve two files that can be solved using this technique. Let's look at this:

# prints lines that are both in file1.txt and file2.txt (intersection)
$ awk 'NR == FNR{a[$0];next} $0 in a' file1.txt file2.txt

Here we see another typical idiom: a[$0] alone has the only purpose of creating the array element indexed by $0, even if we don't assign any value to it. During the pass over the first file, all the lines seen are remembered as indexes of the array a. The pass over the second file just needs to check whether each line being read exists as an index in the array a (that's what the condition $0 in a does). If the condition is true, the line being read from file2.txt is printed (as we already know). In a very similar way, we can easily write the code to print the lines that appear in only one of the two files:

# prints lines that are only in file1.txt and not in file2.txt
$ awk 'NR == FNR{a[$0];next} !($0 in a)' file2.txt file1.txt

Note the order of the arguments. file2.txt is given first. To print lines that are only in file2.txt and not in file1.txt, just reverse the order of the arguments.
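
The three variations can be tried on two small made-up files (created in a throwaway directory here):

```shell
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/file1.txt"
printf '%s\n' banana cherry date  > "$dir/file2.txt"

# Lines present in both files.
both=$(awk 'NR == FNR{a[$0];next} $0 in a' "$dir/file1.txt" "$dir/file2.txt")
# Lines only in file1.txt: file2.txt is read first and remembered.
only1=$(awk 'NR == FNR{a[$0];next} !($0 in a)' "$dir/file2.txt" "$dir/file1.txt")
# Lines only in file2.txt: same program, arguments reversed.
only2=$(awk 'NR == FNR{a[$0];next} !($0 in a)' "$dir/file1.txt" "$dir/file2.txt")

echo "both:  $(echo $both)"     # banana cherry
echo "only1: $only1"            # apple
echo "only2: $only2"            # date
rm -r "$dir"
```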

Another example. Suppose we have a data file like this

$ cat data.txt
20081010 1123 xxx
20081011 1234 def
20081012 0933 xyz
20081013 0512 abc
20081013 0717 def
...thousands of lines...

where "xxx", "def", etc. are some kind of operation codes. We want to replace each operation code with its description. We have another file that maps operation codes to human readable descriptions, like this:

$ cat map.txt
abc withdrawal
def payment
xyz deposit
xxx balance
...other codes...

We can easily replace the opcodes in the data file with this simple awk program, that again uses the two-files idiom (and other idioms that were already introduced):

# use information from a map file to modify a data file
$ awk 'NR == FNR{a[$1]=$2;next} {$3=a[$3]}1' map.txt data.txt

First, the array a, indexed by opcode, is populated with the human readable descriptions, read from the map file. Then, it is used during the reading of the data file to do the replacements. Each line of the data file is then printed after the substitution has been made.
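
One thing to be aware of is that a code missing from the map file is replaced by an empty string. If you would rather leave unknown codes untouched, a possible variation (an assumption about what you want, not part of the program above) is to turn the second action into a conditional one. The sample files below are made up, and zzz is a code deliberately left out of the map.

```shell
dir=$(mktemp -d)
printf '%s\n' 'def payment' 'xxx balance' > "$dir/map.txt"
printf '%s\n' '20081010 1123 xxx' '20081011 1234 def' '20081012 0933 zzz' > "$dir/data.txt"

# Replace $3 only when the code is known; every line is printed anyway.
mapped=$(awk 'NR == FNR{a[$1]=$2;next} ($3 in a){$3=a[$3]} 1' "$dir/map.txt" "$dir/data.txt")

printf '%s\n' "$mapped"
rm -r "$dir"
```

The first two lines get "balance" and "payment"; the third keeps its unknown code zzz.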

Another case where the two-files idiom is useful is when you have to read the same file twice, the first time to get some information that can be correctly defined only by reading the whole file, and the second time to process the file using that information. For example, you want to replace each number in a list of numbers with its difference from the largest number in the list:

# replace each number with its difference from the maximum
$ awk 'NR == FNR{if(NR == 1 || $0 > max) max = $0;next} {$0 = max - $0}1' file.txt file.txt

Note that we specify file.txt file.txt on the command line, so the file will be read twice. This makes no difference to awk, which just thinks it has two files to read.
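
Here is a small run on made-up numbers. This sketch seeds max from the first record and forces numeric values with + 0, so that it also copes with input where all the numbers are negative; both are choices made for this example.

```shell
dir=$(mktemp -d)
printf '%s\n' -3 -10 -7 > "$dir/file.txt"

diffs=$(awk '
  NR == FNR { if (NR == 1 || $0 + 0 > max) max = $0 + 0; next }   # first pass: find the maximum
  { $0 = max - $0 } 1                                             # second pass: print the differences
' "$dir/file.txt" "$dir/file.txt")

printf '%s\n' "$diffs"
rm -r "$dir"
```

The maximum is -3, so the output is 0, 7 and 4.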

As with all other idioms, you are encouraged to find your own uses and variations.

Caveat: none of the programs that use the two-files idiom will work correctly if the first file is empty (in that case, awk will execute the actions associated with NR == FNR while reading the second file). To correct that, you can reinforce the NR == FNR condition by adding a test that, for example, also checks that FILENAME is equal to ARGV[1].
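
The difference is easy to see with the "lines only in the second file" program and an empty first file, where every line of the second file should be printed. The file names below are made up; the reinforced condition assumes that the first file really is ARGV[1], i.e. that no variable assignments precede it on the command line.

```shell
dir=$(mktemp -d)
: > "$dir/empty.txt"                      # empty first file
printf '%s\n' x y > "$dir/file2.txt"

# Plain idiom: file2.txt is mistaken for the first file, nothing is printed.
fragile=$(awk 'NR == FNR{a[$0];next} !($0 in a)' "$dir/empty.txt" "$dir/file2.txt")

# Reinforced condition: file2.txt is correctly treated as the second file.
robust=$(awk 'NR == FNR && FILENAME == ARGV[1]{a[$0];next} !($0 in a)' \
             "$dir/empty.txt" "$dir/file2.txt")

echo "fragile: [$(echo $fragile)]"    # []
echo "robust:  [$(echo $robust)]"     # [x y]
rm -r "$dir"
```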


10 Comments

  1. Saurabh says:

    I want following format for my program:
    awk
    1. {Body (processing file_1)}
    2. END{ computation from previous data}
    3. {Body update fields in file_1 using computed result}

    The problem is with part 3, as I am unable to read file_1 from the beginning after END.

    The problem is difficult for me as I am new to awk.
    Thanks in advance.

    • waldner says:

      So you can read the same file twice, and do your computations just before you process the first line for the second time.
      For example:

      awk '
      NR == FNR { # first pass over file_1
                  process file_1; next }
      
      FNR == 1 { # this must be the first line of the second pass over file_1
                 computation from previous data (stored in arrays or whatever) }
      
      { normal processing of all lines of file_1 (second pass) }' file_1 file_1
      
  2. Mike says:

    Don't see how the 1 in this example matches the template of pattern{action}
    awk '{sub(/pattern/, "foobar")} 1'

    • waldner says:

      It's explained in there. "1" is an always-true pattern; the action is missing, which means that it is {print} (the default action that is executed if the pattern is true). So essentially the lone "1" is used to print all lines. The example you cite can thus be rewritten as either awk '{sub(/pattern/, "foobar")} 1 {print}' or simply awk '{sub(/pattern/, "foobar")} {print}'.

  3. Murpholinox Peligro says:

    Is there a way to do this with five files instead of two?

    # prints lines that are both in file1.txt and file2.txt (intersection)
    $ awk 'NR == FNR{a[$0];next} $0 in a' file1.txt file2.txt

    • waldner says:

      I assume you want to print the lines that appear in all five files (or N files, for that matter). This should do it, assuming no file is empty (a bit reformatted for readability, but can be written all on a single line):

      awk 'BEGIN{
        for(i = 1; i<ARGC; i++){
          ref = ref s i
          s = "."
        }
      } 
      FNR == 1 {count++} 
      {a[$0] = a[$0] sep[$0] count; sep[$0] = "."}
      a[$0] == ref' file1 file2 file3 ... fileN
  4. [...] http://backreference.org/2010/02/10/idiomatic-awk/ [...]
