Argument juggling with awk

Posted by waldner on 22 October 2013, 2:06 pm

This seems to be something of a FAQ. A typical formulation goes: "I have a bash array; how do I pass it to awk so that it becomes an awk array?"

Leaving aside the fact that it may be possible to extend the awk code to do whatever one is doing with the shell array (in which case the problem goes away), let's focus on how to do strictly what is requested (and more).

ARGC and ARGV

Like many other languages, awk has two special variables, ARGC and ARGV, that give information on the arguments passed to the awk program. ARGC contains the total number of arguments (including the awk interpreter or script), and ARGV is an array of ARGC elements (indexed from 0 to ARGC - 1) that contains all the arguments (ARGV[0] is always the name of the awk interpreter or script).
Let's demonstrate this with a simple example:

awk 'BEGIN{print "ARGC is " ARGC; for(i = 0; i < ARGC; i++) print "ARGV["i"] is " ARGV[i]}' foo bar
ARGC is 3
ARGV[0] is awk
ARGV[1] is foo
ARGV[2] is bar

There are two important things to know:

Unlike in some other languages, in awk ARGC and ARGV can be modified.
When awk's main loop starts (and only then), awk processes whatever it finds in ARGV, starting from ARGV[1] up to ARGV[ARGC - 1].

Of course, these should normally be file names or variable assignments. But this is only relevant when the main loop starts; before then, in the BEGIN block we can manipulate ARGC and ARGV to our taste, and as long as what's left afterwards in ARGV is a list of files to process (or variable assignments), awk doesn't really care how those values got there.
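
As a quick illustration of the point above, here is a minimal sketch in which the BEGIN block swaps the two operands before the main loop starts. The scratch directory and the file names a.txt and b.txt are made up for the demonstration.

```bash
# Hypothetical scratch files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'first\n'  > "$tmpdir/a.txt"
printf 'second\n' > "$tmpdir/b.txt"

# BEGIN swaps ARGV[1] and ARGV[2], so the main loop reads b.txt before a.txt.
swapped=$(awk 'BEGIN { t = ARGV[1]; ARGV[1] = ARGV[2]; ARGV[2] = t }
               { print }' "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$swapped"

rm -rf "$tmpdir"
```

The lines come out as "second" and then "first", even though a.txt was given first on the command line.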

So let's see some use cases for ARGC/ARGV manipulation.

Double pass over a file

Some code uses the two-file idiom to process the same file twice. So instead of doing

awk .... file.txt file.txt

we could just specify the file name once and double it in the BEGIN block so awk sees it twice:

# this is as if we said awk ..... file.txt file.txt
awk 'BEGIN{ARGV[ARGC++] = ARGV[1]} { ... }' file.txt
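
To make this concrete, here is a small sketch that uses the doubled argument together with the usual NR == FNR test: the first pass sums a column, the second prints each value as a percentage of the total. The sample data and the scratch file are assumptions made for the example.

```bash
# Hypothetical sample data: one number per line.
tmpdir=$(mktemp -d)
printf '10\n30\n60\n' > "$tmpdir/file.txt"

percentages=$(awk '
  BEGIN { ARGV[ARGC++] = ARGV[1] }      # queue the same file a second time
  NR == FNR { total += $1; next }       # first pass: accumulate the total
  { printf "%s %d%%\n", $1, $1 * 100 / total }   # second pass: report
' "$tmpdir/file.txt")
printf '%s\n' "$percentages"

rm -rf "$tmpdir"
```

With this input the total is 100, so the three lines printed are "10 10%", "30 30%" and "60 60%".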

Fixed arguments

Let's assume that our awk code always has to process one or more files, whose names do not change. Of course, we could specify those names at each invocation of awk; nothing new here. However, for some reason we don't want to specify those names at each invocation, since they never change anyway; we only want to specify the variable file names. So if we have two never-changing files ("fixed1.txt" and "fixed2.txt"), we want to invoke our code with

process.awk file1 file2 file3 ...

but in fact we want awk to run as if we said

process.awk fixed1.txt fixed2.txt file1 file2 file3 ...

Let's see what the code to accomplish this might look like (of course it has to be adapted to the specific situation):

awk 'BEGIN {
  for(i = ARGC+1; i > 2; i--)
    ARGV[i] = ARGV[i - 2]
  ARGC += 2
  ARGV[1] = "fixed1.txt"
  ARGV[2] = "fixed2.txt"
}
# now awk processes fixed1.txt and fixed2.txt first, then whatever was specified on the command line
{
  ...
}' file1 file2 file3 ...
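
A runnable variant of the same idea, as a sketch: it creates four throwaway files in a scratch directory (an assumption for the demo) and prints FILENAME at the first record of each file, so we can see the order in which awk actually reads them.

```bash
# Hypothetical scratch directory holding the fixed and the variable files.
tmpdir=$(mktemp -d)
for f in fixed1.txt fixed2.txt file1 file2; do
  printf 'x\n' > "$tmpdir/$f"
done

order=$(cd "$tmpdir" && awk 'BEGIN {
  # shift the command line operands up by two positions...
  for (i = ARGC + 1; i > 2; i--)
    ARGV[i] = ARGV[i - 2]
  ARGC += 2
  # ...and put the fixed names in front
  ARGV[1] = "fixed1.txt"
  ARGV[2] = "fixed2.txt"
}
FNR == 1 { print FILENAME }' file1 file2)
printf '%s\n' "$order"

rm -rf "$tmpdir"
```

The output lists fixed1.txt, fixed2.txt, file1 and file2, in that order.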

Passing a shell array (and more or less arbitrary data)

So, back to the original question: how can we take advantage of this juggling to pass in an array? A simple way would be to pass all the array elements as normal awk arguments, process them in the BEGIN block, then remove them so that when the main loop starts awk is unaware of what happened. Let's see an example:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk 'BEGIN{
 
  # ARGV[1] is the number of elements we have
  arrlen = ARGV[1]
 
  for(i = 2; i <= arrlen + 1; i++)
    awkarr[i - 1] = ARGV[i]
 
  # clean up
  j = 1
  for(i = arrlen + 2; i < ARGC; i++)
    ARGV[j++] = ARGV[i]
  ARGC = j
}
 
# here awk starts processing from file1, unaware of what we did earlier
# but we have awkarr[] populated with the values from shellarr (and arrlen is its length)
{
  ...
}
 
' "${#shellarr[@]}" "${shellarr[@]}" file1 file2

awkarr has its elements indexed starting from 1, as is customary in awk; it's easy to adapt the code to start from 0 or any other index.
We could also pass the number of elements in the array as a normal value using -v, which simplifies processing somewhat:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk -v arrlen="${#shellarr[@]}" 'BEGIN{
 
  for(i = 1; i <= arrlen; i++)
    awkarr[i] = ARGV[i]
 
  # clean up
  for(i = arrlen + 1; i < ARGC; i++)
    ARGV[i - arrlen] = ARGV[i]
  ARGC -= arrlen
}
# ... as before
 
' "${shellarr[@]}" file1 file2
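
To check that the array really arrives intact, here is a sketch of the -v variant with an END block that dumps awkarr and reports how many records were read. It assumes bash (for the array syntax); the scratch file and the element containing a space are made up for the test.

```bash
# Hypothetical input file with a single record.
tmpdir=$(mktemp -d)
printf 'line\n' > "$tmpdir/file1"

shellarr=( 'foo' 'bar baz' 'xxx' )

dump=$(awk -v arrlen="${#shellarr[@]}" 'BEGIN {
  for (i = 1; i <= arrlen; i++)
    awkarr[i] = ARGV[i]
  # clean up: move the real file names down and shrink ARGC
  for (i = arrlen + 1; i < ARGC; i++)
    ARGV[i - arrlen] = ARGV[i]
  ARGC -= arrlen
}
END {
  for (i = 1; i <= arrlen; i++)
    print i ": " awkarr[i]
  print "records read: " NR
}' "${shellarr[@]}" "$tmpdir/file1")
printf '%s\n' "$dump"

rm -rf "$tmpdir"
```

Because "${shellarr[@]}" is quoted, the element with the embedded space stays a single awk array element, and only file1 is read as input.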

If the number of files to process is known (which should be the most common case), then it's even easier as we can specify them first and the array elements afterwards. Let's assume we know that we always process two files:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk -v nfiles=2 'BEGIN{
  for(i = nfiles + 1; i < ARGC; i++)
    awkarr[i - nfiles] = ARGV[i]
  arrlen = ARGC - (nfiles + 1)
  ARGC = nfiles + 1
}
# ... as before
 
' file1 file2 "${shellarr[@]}"

Finally, if we want to "pass" a shell associative array to awk (such that it exists with the same keys and values in the awk code), we could do this:

declare -A shellarr
shellarr=( [fook]='foov' [bark]='barv' [bazk]='bazv' [xxxk]='xxxv' [yyyk]='yyyv' )
 
awk -v nfiles=2 'BEGIN{
  arrlen = ( ARGC - (nfiles + 1) ) / 2
  for(i = nfiles + 1; i < nfiles + 1 + arrlen; i++)
    awkarr[ARGV[i]] = ARGV[i + arrlen]
  ARGC = nfiles + 1
}
# ... as before
 
' file1 file2 "${!shellarr[@]}" "${shellarr[@]}"

This works because in bash, "${!shellarr[@]}" and "${shellarr[@]}" expand in the same order (currently, at least). To be 100% sure, however, we could of course copy all the key/value pairs to another array and pass that one, as in the following example:

declare -A shellarr
shellarr=( [fook]='foov' [bark]='barv' [bazk]='bazv' [xxxk]='xxxv' [yyyk]='yyyv' )
 
declare -a temp
for key in "${!shellarr[@]}"; do
  temp+=( "$key" "${shellarr[$key]}" )
done
 
awk -v nfiles=2 'BEGIN{
  arrlen = ( ARGC - (nfiles + 1) ) / 2
  for(i = nfiles + 1; i < ARGC; i += 2)
    awkarr[ARGV[i]] = ARGV[i + 1]
  ARGC = nfiles + 1
}
# ... as before
 
' file1 file2 "${temp[@]}"
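
A quick way to verify the result is to dump awkarr from the BEGIN block. The sketch below assumes bash 4 or later (for declare -A) and, to keep it self-contained, passes no input files at all, so ARGC is simply reset to 1. Since the traversal order of for (k in awkarr) is unspecified, the output is piped through sort.

```bash
declare -A shellarr=( [fook]='foov' [bark]='barv' [bazk]='bazv' )

# flatten into key value key value ...
temp=()
for key in "${!shellarr[@]}"; do
  temp+=( "$key" "${shellarr[$key]}" )
done

pairs=$(awk 'BEGIN {
  for (i = 1; i < ARGC; i += 2)
    awkarr[ARGV[i]] = ARGV[i + 1]
  ARGC = 1                       # no files in this demo, nothing left to read
  for (k in awkarr)
    print k "=" awkarr[k]
}' "${temp[@]}" | sort)
printf '%s\n' "$pairs"
```

Each key ends up paired with its own value regardless of the order in which bash enumerates the keys.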

In the last two examples, it should be noted that, as usual with associative arrays, the concept of array "length" doesn't make much sense; it's just an indication of how many elements the hash has, and nothing more (in awk, all arrays are associative regardless, though they can be used as "normal" ones as we did in the first examples).

Update 31/10/2013: So there's always something new to learn, and in my case it was that if an element of ARGV is the empty string, awk just skips it. This simplifies the examples where the ARGV elements are moved down to fill the positions where the shell array elements were. In fact, all that's needed is to set those elements to "", and awk will naturally skip them. So the first two examples above become:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk 'BEGIN{
 
  # ARGV[1] is the number of elements we have
  arrlen = ARGV[1]
  ARGV[1] = ""
 
  for(i = 2; i <= arrlen + 1; i++) {
    awkarr[i - 1] = ARGV[i]
    ARGV[i] = ""
  }
}
...' "${#shellarr[@]}" "${shellarr[@]}" file1 file2

Second example:

shellarr=( 'foo' 'bar' 'baz' 'xxx' 'yyy' )
 
awk -v arrlen="${#shellarr[@]}" 'BEGIN{
  for(i = 1; i <= arrlen; i++) {
    awkarr[i] = ARGV[i]
    ARGV[i] = ""
  }
}
...' "${shellarr[@]}" file1 file2
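
The skipping behaviour itself is easy to observe in isolation. In this sketch the first operand is a name that does not exist as a file (not-a-file is purely hypothetical); since BEGIN blanks it out, awk never tries to open it and goes straight to the two real files, which are created in a scratch directory for the demo.

```bash
# Hypothetical scratch files.
tmpdir=$(mktemp -d)
printf 'one\n' > "$tmpdir/file1"
printf 'two\n' > "$tmpdir/file2"

seen=$(awk 'BEGIN { ARGV[1] = "" }     # blank out the first operand
            { print }' not-a-file "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$seen"

rm -rf "$tmpdir"
```

Only "one" and "two" are printed, and there is no complaint about the missing file.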
