Idiomatic awk

Here we'll see some hints on how to write more idiomatic (and usually shorter and more efficient) awk programs. Many awk programs you're likely to encounter, especially short ones, make heavy use of these idioms.

The power of conditions

As a starting example, suppose you want to print all the records (normally lines) in a file that match some pattern (a kind of awk-grep, if you like). A reasonable first shot is usually something like

awk '{if ($0 ~ /pattern/) print $0}'

That works, but there are some things to note.

The first thing to note is that it is not structured according to awk’s definition of a program, which is

condition { actions }

Our program can easily be rewritten in this form, since both the condition and the action are clear:

awk '$0 ~ /pattern/ {print $0}'

Our next step in the perfect awk-ification of this program is to note that the /pattern/ syntax is the same as $0 ~ /pattern/. That is, when awk sees a regular expression literal used as an expression, it implicitly applies it to $0, and returns true if there is a match. So now we have:

awk '/pattern/ {print $0}'

Now, let’s turn our attention to the action part (the stuff inside braces). print $0 is redundant, since print alone, by default, prints $0.

awk '/pattern/ {print}'

But let's take another step. When a condition is true and there are no associated actions, awk performs a default action, and that action (you guessed it) is print (which we already know is equivalent to print $0). Thus we can finally do this:

awk '/pattern/'

Now we have reduced the initial program to its simplest (and most idiomatic) form. In many cases, if all you want to do is print some records (lines) according to a condition, you can write awk programs composed only of a condition (however complex):

awk '(NR%2 && /pattern/) || (!(NR%2) && /anotherpattern/)'

That prints odd lines that match /pattern/ and even lines that match /anotherpattern/. Naturally, if you don’t want to print $0 but instead do something else, then you’ll have to manually add a specific action to do what you want.
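
For instance, to print only the second field of the matching records (just a made-up example), you would write the action explicitly:

awk '/pattern/ {print $2}'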

From the above, it follows that

awk 1
awk '"a"'   # single quotes are important!

are two awk programs that just print their input unchanged, both "1" and the string "a" obviously being always-true conditions. This is not terribly useful by itself, but it can be used in combination with other code in a number of circumstances.
For example, sometimes you want to operate only on some records of the input (according to some condition), but also want to print all records, regardless of whether they were affected by your operation or not. A typical example is a program like this:

awk '{sub(/pattern/, "foobar")} 1'

This tries to replace whatever matches /pattern/ with "foobar". But whether or not the substitution succeeds, the always-true condition "1" prints each line (you could even use 42, or 19, or any other nonzero value if you so prefer; 1 is just what people traditionally use). This results in a program that does the same job as

sed 's/pattern/foobar/'
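
Note that sub() replaces only the first match on each line, just like the sed command above; to replace all matches on each line, as sed 's/pattern/foobar/g' would, use gsub() instead:

awk '{gsub(/pattern/, "foobar")} 1'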

Here are some examples of typical awk programs, using only conditions:

awk 'NR % 6'            # prints all lines except lines 6,12,18...
awk 'NR > 5'            # prints from line 6 onwards (like tail -n +6, or sed '1,5d')
awk '$2 == "foo"'       # prints lines where the second field is "foo"
awk 'NF >= 6'           # prints lines with 6 or more fields
awk '/foo/ && /bar/'    # prints lines that match /foo/ and /bar/, in any order
awk '/foo/ && !/bar/'   # prints lines that match /foo/ but not /bar/
awk '/foo/ || /bar/'    # prints lines that match /foo/ or /bar/ (like grep -e 'foo' -e 'bar')
awk '/foo/,/bar/'       # prints from line matching /foo/ to line matching /bar/, inclusive
awk 'NF'                # prints only nonempty lines (or: does not print empty lines, where NF==0)
awk 'NF--'              # removes last field and prints the line
awk '$0 = NR" "$0'      # prepends line numbers (assignments are valid in conditions)
awk '!a[$0]++'          # suppresses duplicated lines! (see below for how it works)
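
About that last one: a[$0]++ evaluates to the current value of a[$0] (zero, and thus false, the first time a given line is seen) and then increments it, so !a[$0]++ is true only the first time each line appears. A more verbose equivalent, just as a sketch (with a more descriptive array name):

awk '{ if (!seen[$0]) print; seen[$0]++ }'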

As an extreme example of the power of conditions, let's examine the following code:

awk 'ORS = NR % 5 ? FS : RS'

You might also find it written with no spaces at all, especially in golf-ish contexts:

awk 'ORS=NR%5?FS:RS'

Let's run it using some simple input:

$ seq 1 30 | awk 'ORS=NR%5?FS:RS'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30

So what it does is columnate the input (in groups of five columns here, but just change the number 5 in the code to the number you want, or use a variable and pass the value using -v).

Why does it work? Well, when awk prints an output record (line), the special variable ORS, as the name suggests, contains the separator to append to the line. By default, ORS is "\n", ie a newline character. But here we are explicitly assigning a value to ORS depending on the outcome of the ternary operator. So if NR%5 is zero (ie, we are at line 5, 10, 15, etc.), ORS gets the value of RS (which by default contains "\n"); otherwise ORS gets the value of FS (by default, a space). If RS and FS have their default values, the program can be rewritten like this:

awk 'ORS=NR%5?" ":"\n"'

Overall, the whole code is an assignment which, as we've seen, is evaluated and returns the assigned value (so either the value of RS or FS here, both of which are non-empty strings and thus true). Since the condition is always true, the default action (print) is executed for every line, and as that is done, ORS appends either a newline or a space, depending on whether we are at column 5 or not.
An issue with this code is that if the number of lines in the input is not a multiple of the number of columns (5 in this case), the output ends with a space rather than a newline. You can add an END block to correct that.
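
For example, the following END block adds the missing final newline when the last output line is incomplete (a sketch; note that it cannot remove the trailing space, which has already been printed by then; see the comments below for a discussion):

seq 1 28 | awk 'ORS = NR % 5 ? FS : RS; END { if (NR % 5) printf "\n" }'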

Note that the ternary operator can be changed to yield different results; for example, this code

awk 'ORS=/foo$/?FS:RS'

joins the following line to the current one if the current one ends in foo. For example (with some made-up input):
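
$ printf 'a foo\nb\nc\n' | awk 'ORS=/foo$/?FS:RS'
a foo b
c

You are encouraged to find your own variations.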

Self-assignments

Let's take another relatively common task: replacing the delimiters. Your input fields are delimited, for example, by semicolons (or any other arbitrarily complex separator), and you want to change that to, say, commas. Armed with the knowledge you gathered in the first part of this article, you do

awk -v FS=';' -v OFS=',' 1   # doesn't work!

but that doesn't work: it outputs the input unchanged. The reason for this behavior is that awk does not rebuild $0 (that is, replace FS with OFS) until some field is modified. That might seem strange at first, but it makes sense (and is even useful) in many circumstances. One reason for this behavior is efficiency: if the input and output separators are the same, as often happens, replacing FS with OFS is pointless. And if you think about it, if awk really always replaced FS with OFS, a program as simple as

$ echo 'foo;bar' | awk -v FS=';' -v OFS=',' '/foo/'

would output

foo,bar   # ????

which violates the principle of least surprise and most certainly is not what one expects here.
On the other hand, if replacing FS with OFS is needed, then obviously awk has to do that at some point. So the question is: when is it that awk thinks FS has to be replaced with OFS? As mentioned above, awk assumes that it's time to replace FS with OFS (and thus recompute $0) when a field is modified, which is a sensible assumption, and almost always produces the output that one would expect.

So back to the original problem, you can now see a solution: to force recomputation of $0, just let awk think you've changed a field, but without changing it:

awk -v FS=';' -v OFS=',' '{$1=$1}1'

The $1=$1 bit is what confuses many people, who wonder what it's for. It's a typical awk idiom to force awk to rebuild $0 (usually to apply some OFS).
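
To see the idiom in action:

$ echo 'foo;bar' | awk -v FS=';' -v OFS=',' '{$1=$1}1'
foo,bar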

Warning: useless information follows.

If you're sure that $1 will never be an empty string, you can even golf the code a bit more and use the assignment as the condition, so

awk -v FS=';' -v OFS=',' '$1=$1'

If you want to shorten it even more (and make it more cryptic) to impress your friends or for whatever other reason, you can move the assignments to the end of the command line, which saves the -v options and removes some quotes:

awk '$1=$1' FS=\; OFS=,

That exploits an obscure feature of awk: any command line argument that contains a "=" is treated as a variable assignment rather than as a file to read (which, btw, makes it hard to persuade awk to operate on files whose names contain "=". I'm sure your life will never be the same now that you know this).
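Incidentally, if you ever do need to read such a file, prefixing its name with ./ is a common workaround: the part before the "=" is then not a valid variable name, so the argument is treated as a filename again (the file name below is, of course, made up):

awk 1 ./foo=bar.txt

Well, I did say that this was useless information, so let's get back to something more practical...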

Build strings with separators

This is similar to the so-called fencepost problem. On many occasions you need to build a string using concatenation, starting from an empty string and adding values as you go. The values should be separated by some separator (let's say, a semicolon for these examples). One might do this, in some loop:

string = string ";" somedata

but then string has an unwanted leading semicolon. Putting the semicolon after the data instead leaves an unwanted trailing one. So a typical way to do this with awk is this:

string = string sep somedata; sep = ";"

This exploits the fact that awk variables start out containing the dual value empty string or zero, so the first time the code is executed, sep is empty (you can explicitly initialize it to the empty string in a BEGIN block, if you like, but it's redundant). Then it's set to a semicolon, and it will have that value from the second time the code is executed onwards. The result is that at the end string will have a neat list of values with the semicolons only where they should be.
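
For instance, here is a minimal, self-contained sketch of the idiom that joins the fields of each input line with semicolons:

$ echo 'a b c' | awk '{ s = sep = ""; for (i = 1; i <= NF; i++) { s = s sep $i; sep = ";" } print s }'
a;b;c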

As a practical example of this idiom, let's see how to transpose a matrix using awk:

$ cat matrix.txt
a1;a2;a3;a4;a5
b1;b2;b3;b4;b5
c1;c2;c3;c4;c5
d1;d2;d3;d4;d5
$ awk -F\; '{for(i=1;i<=NF;i++)r[i]=r[i] sep $i;sep=FS}END{for(i=1;i<=NF;i++)print r[i]}' matrix.txt
a1;b1;c1;d1
a2;b2;c2;d2
a3;b3;c3;d3
a4;b4;c4;d4
a5;b5;c5;d5

The idea here is to build an array r with NF elements (the number of columns in the original input), each of which will hold a line of the output. For each input line, every element of r has another "column" added. The variable sep is initially empty, then (after the first input line has been processed) it's set to semicolon. Of course, for more complex processing, an array of separators could be used.
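
If the rows could have a varying number of fields, a single sep variable would no longer do the right thing for every column; a sketch using an array of separators (and tracking the maximum NF, since in the END block NF refers to the last line read) might look like this:

awk -F\; '{for(i=1;i<=NF;i++){r[i]=r[i] sep[i] $i;sep[i]=FS};if(NF>max)max=NF}END{for(i=1;i<=max;i++)print r[i]}'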

Two-file processing

Another construct that is often used in awk is as follows:

$ awk 'NR == FNR { # some actions; next} # other condition {# other actions}' file1.txt file2.txt

This is used when processing two files. When processing more than one file, awk reads each file sequentially, one after another, in the order they are specified on the command line. The special variable NR stores the total number of input records read so far, regardless of how many files have been read; its value starts at 1 and always increases until the program terminates. Another variable, FNR, stores the number of records read from the current file being processed; its value starts at 1, increases until the end of the current file is reached, and is reset to 1 as soon as the first line of the next file is read, and so on.

So, the condition NR == FNR is only true while awk is reading the first file. Thus, in the program above, the actions indicated by # some actions are executed while awk is reading the first file; the actions indicated by # other actions are executed while awk is reading the second file, if the condition in # other condition is met. The next at the end of the first action block is needed to prevent the condition in # other condition from being evaluated, and the actions in # other actions from being executed, while awk is reading the first file.

It probably all becomes much clearer with some examples. Many problems that involve two files can be solved using this technique. Let's look at this:

# prints lines that are both in file1.txt and file2.txt (intersection)
$ awk 'NR == FNR{a[$0];next} $0 in a' file1.txt file2.txt

Here we see another typical idiom: a[$0] alone has the only purpose of creating the array element indexed by $0, even if we don't assign any value to it. During the pass over the first file, all the lines seen are remembered as indexes of the array a. The pass over the second file just needs to check whether each line being read exists as an index in the array a (that's what the condition $0 in a does). If the condition is true, the line being read from file2.txt is printed (as we already know). In a very similar way, we can easily write the code to print the lines that appear in only one of the two files:

# prints lines that are only in file1.txt and not in file2.txt
$ awk 'NR == FNR{a[$0];next} !($0 in a)' file2.txt file1.txt

Note the order of the arguments. file2.txt is given first. To print lines that are only in file2.txt and not in file1.txt, just reverse the order of the arguments.
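
That is:

# prints lines that are only in file2.txt and not in file1.txt
$ awk 'NR == FNR{a[$0];next} !($0 in a)' file1.txt file2.txt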

Another example. Suppose we have a data file like this

$ cat data.txt
20081010 1123 xxx
20081011 1234 def
20081012 0933 xyz
20081013 0512 abc
20081013 0717 def
...thousands of lines...

where "xxx", "def", etc. are some kind of operation codes. We want to replace each operation code with its description. We have another file that maps operation codes to human readable descriptions, like this:

$ cat map.txt
abc withdrawal
def payment
xyz deposit
xxx balance
...other codes...

We can easily replace the opcodes in the data file with this simple awk program, which again uses the two-files idiom (and other idioms that were already introduced):

# use information from a map file to modify a data file
$ awk 'NR == FNR{a[$1]=$2;next} {$3=a[$3]}1' map.txt data.txt

First, the array a, indexed by opcode, is populated with the human readable descriptions, read from the map file. Then, it is used during the reading of the data file to do the replacements. Each line of the data file is then printed after the substitution has been made.
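
As written, an opcode that is missing from the map file would be replaced with an empty string. If that can happen with your data, a slightly more defensive variant performs the replacement only when a mapping exists:

$ awk 'NR == FNR{a[$1]=$2;next} $3 in a{$3=a[$3]}1' map.txt data.txt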

Another case where the two-files idiom is useful is when you have to read the same file twice, the first time to get some information that can be correctly defined only by reading the whole file, and the second time to process the file using that information. For example, you want to replace each number in a list of numbers with its difference from the largest number in the list:

# replace each number with its difference from the maximum
$ awk 'NR == FNR{if($0 > max) max = $0;next} {$0 = max - $0}1' file.txt file.txt

Note that we specify file.txt file.txt on the command line, so the file will be read twice. This makes no difference to awk, which just thinks it has two files to read.

As with all other idioms, you are encouraged to find your own uses and variations.

Caveat: all the programs that use the two-files idiom will not work correctly if the first file is empty (in that case, awk will execute the actions associated with NR == FNR while reading the second file). To correct that, you can reinforce the NR == FNR condition by adding a test that, for example, also checks that FILENAME is equal to ARGV[1].
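
For example, a sketch of the intersection program from above with the reinforced condition (this assumes the first file is a real named file, not standard input):

$ awk 'NR == FNR && FILENAME == ARGV[1] {a[$0]; next} $0 in a' file1.txt file2.txt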

26 Comments

  1. Iulian says:

    Hi,

    Let's say I have file1 and file2. I want to match a sequence in file1 with a sequence in file2 and return the output with some other sequence from file 2. For example, in file1 I have the line:
    Jessica Robert 123 USA association

    In file 2 I have:
    Mike association UK 567

    And I want to return Mike from file2.

    Thank you

    • Iulian says:

      Actually, I have used the command:

      awk 'NR == FNR{a[$5];next} $2 in a' Route_guide_1.txt Carrier_associations_1.txt

      but it displays the entire line of file2, and I need only the first sequence.

      Thanks

    • Iulian says:

      Ok, I have managed to do this with:

      awk 'NR == FNR {a[$5];next} $2 in a {print $1} ' file1 file2

      If you have any advice please let me know. And sorry for the spam, but I am new to this and really wanted some help :)

      Thank you

  2. Akshat says:

    Hello Guys!

    I need help with a requirement where I need to perform a join operation on 3 CSV files, where 1 column is common to the 3 files.
    I also want to print the resulting values to another CSV file. Printing the columns in a specific order is also required.

    Best Regards
    Akshat

  3. Shilpi says:

    I have two files that I want to merge based on the first column as the key in both files.
    So if the first column matches, I want to merge the two records (one from each file) based on a condition on the value of the second column (if the second column = "abc" in the second file, take the value of the second column from the first file, provided it is not null in the first file).
    If the first column does not exist in both files, I want the record in my output as it is, from whichever file it is in.

    First file record => alphabeta,xyz,Thu Apr 05 09:30:50 AM 18,sss,ttt,uuu,0
    Second file record => alphabeta,abc, Tue Apr 10 12:40:50 EDT 2018 , sss ,ttr , xxe, 95

    Merged file record => alphabeta,xyz, Tue Apr 10 12:40:50 EDT 2018 , sss ,ttr , xxe, 95

    • waldner says:

      You don't say what you want to do when second column in file2 is "abc" but the corresponding column in file1 is null (I suppose you mean "empty" rather); I'm assuming that in this case you keep the "abc" value from file2.
      Also it's not clear whether order matters in the output (I'm assuming it doesn't).
      I'm also assuming that each key (ie, column 1) cannot appear more than once in a file.

      Try with

      awk -F',' -v OFS=',' '
        # first file, save 2nd column and whole line
        NR==FNR{
          col2[$1]=$2
          a[$1]=$0
          next
        }
      
        # second file: if column 1 was also in file1...
        $1 in a {
          # take col2 for current line from there
          $2=col2[$1]
          # delete key from arrays
          delete a[$1]
          delete col2[$1]
        }
        {
          print
        }
        END{
          # keys left in a were not in file2, so print their lines unchanged
          for (k in a){print a[k]}
        }' file1 file2
      
  4. Krishna says:

    Thanks for this useful stuff waldner, really helpful. If I want to do the following with awk, could you please help?

    cat file1.txt

    8888234234224234
    hello world - wonderful
    8888824545534334
    hellow rold - nice
    8888334323234234
    hello world - amazing
    hello world - excellent

    cat file2.txt

    some random text
    some other random text
    andmorerandom:{'and: somemore text','text: and more/du0038888334323234234','name:''hello world'}

    some random text
    some other random text
    andmorerandom:{'and: somemore text','text: and more/du0038888234234224234','name:''hello world'}

    Now I want to replace 'hello world' in file2 with the 'hello world - xxxx' found on the line following the matching 8888xxxxxx in file1.

    Note: if there are 2 'hello world' lines for the matching entry in file1, then I want to duplicate the entire matching line in file2 and replace the second 'hello world' entry there. (I am not sure if this is possible though.)

    So my output should look like this..

    cat file2.txt

    some random text
    some other random text
    andmorerandom:{'and: somemore text','text: and more/du0038888334323234234','name:''hello world - amazing'}

    some random text
    some other random text
    andmorerandom:{'and: somemore text','text: and more/du0038888234234224234','name:''hello world - wonderful'}

    Thank you so much in advance for your precious time and help on this.

    • waldner says:

      Your problem isn't well specified and in many places there could be many possibilities. I've made some assumptions and come up with the following code (mind you, needs GNU awk):

      NR == FNR {
        if (/^[0-9]+$/) {
          code = $0
          i = 0
        } else {
          i++
          a[code][i] = $0
        }
        next
      }
      
      {
        if (/'name:'/) {
          match($0, /'text: .*\/du003([0-9]+)'/, m)
          code = m[1]
          for (i=1; i<= length(a[code]); i++) {
            line = $0
            sub(/'name:''[^']+'/, "'name:''" a[code][i] "'", line)
            print line
          }
        } else {
          print
        }
      }
      

      With that code in a file, you can do

      awk -f code.awk file1.txt file2.txt

      and get the output you describe (or at least my understanding of it). If not, I hope you can do the necessary adjustments yourself.

  5. Fatmice says:

    In your columnation example, you said adding an END block would eliminate the trailing space for sequences that don't divide evenly. Do you have an example of that END block?

    My solution requires two passes.
    seq 1 29 | awk 'ORS=NR%5?FS:RS;END{printf ""}' | awk 'NR>1{print prev}{prev=$0}END{sub(/ $/,"",$0); print}'

    • waldner says:

      In the END block you just put the printing of the newline, eg

      seq 1 29 | awk 'ORS=NR%5?FS:RS;END{printf "\n"}'
      • Fatmice says:

        That END block would terminate the output as " \n", whereas all other lines end without a space. I didn't see any way of doing that in the END block while keeping the same code, without piping to another awk for post-processing.

        • waldner says:

          I see what you mean. Well in that case it's not possible to remove the space, since it has already been printed. You can change the code and lose a lot of idiomaticity to accumulate the line instead of printing it so you have the chance to change it before printing, eg something like

          seq 1 29 | awk '{line = line $0; if (NR%5) { line = line FS } else { print line; line = "" } } END{ if (line"") { sub(/ $/, "", line); print line } }'
  6. Saurabh says:

    I want the following format for my program:
    awk
    1. {Body (processing file_1)}
    2. END{ computation from previous data}
    3. {Body: update fields in file_1 using computed result}

    The problem is with part 3, as I am unable to read file_1 from the beginning after END.

    The problem is difficult for me as I am new to awk.
    Thanks in advance.

    • waldner says:

      So you can read the same file twice, and do your computations just before you process the first line for the second time.
      For example:

      awk '
      NR == FNR { # first pass over file_1
                  process file_1; next }
      
      FNR == 1 { # this must be the first line of the second pass over file_1
                 computation from previous data (stored in arrays or whatever) }
      
      { normal processing of all lines of file_1 (second pass) }' file_1 file_1
      
  7. Mike says:

    Don't see how the 1 in this example matches the template of pattern{action}
    awk '{sub(/pattern/, "foobar")} 1'

    • waldner says:

      It's explained in there. "1" is an always-true pattern; the action is missing, which means that it is {print} (the default action that is executed if the pattern is true). So essentially the lone "1" is used to print all lines. The example you cite can thus be rewritten as either awk '{sub(/pattern/, "foobar")} 1 {print}' or simply awk '{sub(/pattern/, "foobar")} {print}'.

      • Mike says:

        Follow up question:
        Are pattern{action} normally separated by a ; like
        pattern1{action1};pattern2{action2}

        Mike

  8. Murpholinox Peligro says:

    Is there a way to do this with five files instead of two?

    # prints lines that are both in file1.txt and file2.txt (intersection)
    $ awk 'NR == FNR{a[$0];next} $0 in a' file1.txt file2.txt

    • waldner says:

      I assume you want to print the lines that appear in all five files (or N files, for that matter). This should do it, assuming no file is empty (a bit reformatted for readability, but can be written all on a single line):

      awk 'BEGIN{
        for(i = 1; i<ARGC; i++){
          ref = ref s i
          s = "."
        }
      } 
      FNR == 1 {count++} 
      {a[$0] = a[$0] sep[$0] count; sep[$0] = "."}
      a[$0] == ref' file1 file2 file3 ... fileN
  9. Ali says:

    I have two files with different numbers of rows and columns, but with a common column: the first column in file1 is the same as the last column in file2. I'm struggling to merge them; it doesn't work as expected. Any help would be appreciated.

    I'm using this command but it doesn't work as expected. I'm putting file1 in an array, and for the second file I'm trying to compare on the column -> awk 'NR==FNR{a[$1]=$2;next} $NF=a[$1]{print $0, a[$NF]}' file1 file2

      File1
      -----
      /g01/ffb /systst/tst.ds

      File2
      -----
      21 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffb
      22 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffa
      ...

      output - I'm looking for
      ==========================
      21 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffb /systst/tst.ds

      Thanks