Sorting with awk

Posted by waldner on 16 May 2010, 8:14 pm

Arrays in awk are associative, which is a fancy word to mean that indexes can be non-numeric (eg strings). Think of them as a dictionary, or lookup table: you lookup a key, and it gives you the corresponding value.
In the most typical uses of associative arrays, sorting is not required, or even meaningful for that matter. Nonetheless, it happens every so often that array sorting is needed in awk. Traditional awk does not support that, so you have to write a function to implement it (which may be a good thing anyway, especially if you put it in its own source file and plan to reuse it in other programs).

Here we'll look at sorting in awk "for the lazy", that is, how to do sorting without having to implement it yourself.

GNU awk

Update 17/01/2013: Note that most of what's written in this section has been rendered obsolete or partially incorrect by the new sorting features introduced with GNU awk version 4. For more information see here and here. Most notably, GNU awk version 4 removed the (already undocumented) WHINY_USERS feature described below (but the alternatives are far superior).

GNU awk has built-in functions for sorting (unless it's running in POSIX mode, like when using --posix). asort() sorts an array based on its values, and asorti() sorts based on keys (or indexes, or indices, hence the "i"). In both cases, the array is changed "in place" by default, and the result is the same array, with the data sorted, and whole new indexes (starting from 1). With asorti(), the values of the elements are the old keys, so the original values are lost.
To protect against these problems, both functions accept an optional second argument, which should be an array where the original array is copied (keys and values) prior to being sorted. That way, the sorted array will be the second argument, and the original array (first argument) will still be available untouched. Thanks to this, a common trick to preserve values when sorting by keys is as follows:

# our array is a; sort by key,
# and output the corresponding
# values
n = asorti(a,copy)
for(i=1;i<=n;i++){
  print a[copy[i]]
}

Also note that GNU awk has a (poorly documented) feature that helps when sorting by keys. Setting the environment variable WHINY_USERS to any value has the effect of returning keys in sorted order when using the for (key in array) construct, so in most cases you can avoid asorti() entirely. See this example:

$ gawk 'BEGIN{a["foo"];a["bar"];a["baz"];a["xxx"];a["a"];a["b"];for(key in a)print key}'
baz
a
b
xxx
bar
foo
$ WHINY_USERS=1 gawk 'BEGIN{a["foo"];a["bar"];a["baz"];a["xxx"];a["a"];a["b"];for(key in a)print key}'
a
b
bar
baz
foo
xxx

There's not much to be said about asort() and asorti(), except that one should be aware of the usual awk typing rules. In particular:

Keys are always strings, so expect orderings like "1 10 2" when sorting keys
Awk "knows" the type of the variables, and usually does the right thing depending on the context; however, if an explicit type has been given to a variable, awk remembers that, which can lead to unexpected results. So remember to explicitly convert values to the desired type (usually this is done by adding 0 to force numeric type, and concatenating the empty string "" to force string) if you want them treated as such.

The last thing to mention about GNU awk sorting functions is that they sort on the whole value, so it's not quite easy to sort on a specific field/string somewhere in the middle of the value; this usually involves prepending the field or string in question to the original value in each element, sorting the array, and finally removing the stuff we prepended previously to restore the original values, which is a lot of work. (Alternatively, we can avoid the last step and pretend that the prepended data does not exist by always reading values from a certain position, but this is not pretty either and may not always be possible). If you are in this situation, probably the methods explained below are appealing also when using GNU awk.

Standard awk

With standard awk, the usual way to sort (short of implementing a sort function yourself in awk, which is of course perfectly possible if you want) is to use the external shell command sort. If the sorted values are the last result you need, you can just print the values from awk using a loop and pipe the output to sort. This looks like the following code:

$ awk '{.... populate the array ....}END{for(i in array)print array[i]}' | sort ....

With the suitable options to sort, you can get pretty much any result you want, including subfield/substring sort, which as we saw is not straightforward with asort()/asorti(). If you want to sort by index (like asorti()), you can use the same code as above but instead of printing array[i], just print "i".

More interesting is the case where you want to have the sorted data available again in awk for further processing. There are two approaches to this.
The quick and dirty way is to just add another pipe after sort and send the result to another awk instance, where it will be reread sorted and further processed.
The more elaborate way is to pass the values to sort and write the output to a temporary file, and then read the sorted values back from the file. Here's some sample code that does the latter:

BEGIN{
  n=split("f3o b5r b8z x9x a1 b2", a)
  tempfile = "/tmp/XojKlu2"             # of course use a good name for it in real code
  command = "sort -k1.2,1.2n > " tempfile
  # sort the array...
  for(i=1;i<=n;i++)print a[i] | command
  close(command)
  i=0
  # ...and read it back
  while((getline a[++i] < tempfile) > 0)
    ;
  close(tempfile)
  for(i=1;i<=n;i++)print a[i]
  # ... or do whatever with "a"...
}

It is always good practice to close any file descriptor that we opened (this happens automatically whenever a pipe or redirection is used) when it's no longer needed. In this case it wouldn't probably matter, but often it does, so no harm in picking up the good habit and doing it even when it doesn't seem necessary.

GNU awk users, again, have an advantage here as they can avoid using a temporary file. GNU awk supports coprocesses, which is another fancy name to indicate a bidirectional pipe (it can be read from and written to at the same time). So in this case, the sort command can be opened as a coprocess:

BEGIN{
  n=split("f3o b5r b8z x9x a1 b2", a)
  command = "sort -k1.2,1.2n"
  # Write to the coprocess...
  for(i=1;i<=n;i++)print a[i] |& command
  close(command, "to")
  i=0
  # ...and read back from it
  while((command |& getline a[++i]) > 0)
    ;
  close(command)
  for(i=1;i<=n;i++)print a[i]
  # ... or do whatever with "a"...
}

Being bidirectional pipes, coprocesses can also be half closed, that is, closed in one direction only. In this specific case, it is necessary to close the pipe in the "to" direction before reading, otherwise the "from" direction would not read anything. For an in-depth explanation of coprocesses (and of this case in particular), refer to the relevant section of the GNU awk manual.

Filed under awk, shell, tips Tagged awk, shell, sorting, text processing

Comments are closed | Permalink

\1

Sorting with awk

GNU awk

Standard awk

BTC

Recent Posts

Categories

Archives

\1

Sorting with awk

GNU awk

Standard awk

BTC

Recent Posts

Categories

Tags

Archives