We have an awk script that can process multiple files, and we want to do some special task at the beginning of each file (for this example we just print the file name, but it can be anything of course). The classic awk idiom to do this is something like
    function process_file() {
        print "Processing " FILENAME "..."
    }

    FNR == 1 { process_file() }

    # rest of code
So we call our script with three files and get:
    $ awk -f script.awk file1 file2 file3
    Processing file1...
    Processing file2...
    Processing file3...
Alright. But what happens if some file is empty? Let's try it (we use /dev/null to simulate an empty file):
    $ awk -f script.awk file1 /dev/null file3
    Processing file1...
    Processing file3...

Right: since an empty file has no lines, it can never match FNR == 1, so our function is never called for it.
GNU awk
If we have GNU awk and can assume it's available wherever our script will run (or can make it a prerequisite for users), then it's easy: just use the special BEGINFILE block instead of FNR == 1:

    function process_file() {
        print "Processing " FILENAME "..."
    }

    BEGINFILE { process_file() }
(Btw, GNU awk also has a corresponding ENDFILE special block.)
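As a rough illustration of how the two special blocks pair up, here is a minimal sketch that reports a record count per file. It assumes gawk is installed (it falls back to a message otherwise); the file names file1 and empty and the wording of the messages are made up for this example.

```shell
#!/bin/sh
# Sketch: pair gawk's BEGINFILE and ENDFILE to report per-file record counts.
# Assumption: gawk is on PATH. The sample files are created only for this demo.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/file1"   # a file with two records
: > "$tmpdir/empty"                 # an empty file

if command -v gawk >/dev/null 2>&1; then
    out=$(cd "$tmpdir" && gawk '
        BEGINFILE { print "Begin " FILENAME }
        # FNR still holds the number of records read from the file just finished
        ENDFILE   { print "End " FILENAME ": " FNR " record(s)" }
    ' file1 empty)
else
    out="gawk not available"
fi
printf '%s\n' "$out"
rm -rf "$tmpdir"
```

With gawk present, both blocks run for the empty file as well, and it is reported with 0 records.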
And there we have it:
    $ gawk -f script.awk file1 /dev/null file3
    Processing file1...
    Processing /dev/null...
    Processing file3...
But alas, for the time being this is not standard, so it can only run with GNU awk.
Standard awk
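With standard awk, we have to stick to what is available, namely the FNR == 1 condition and the ARGV array. When our function is called, we know we're at the start of a non-empty file, so it compares FILENAME with the arguments in ARGV to find out whether any earlier files were skipped and, if so, catches up on them.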
Here is a first version of our processing function (we choose to ignore the lint/style issue that the function parameter argind shadows the global variable of the same name, as it's totally harmless here and improves code readability):
    function process_it(filename, is_empty) {
        print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
    }

    function process_file(argind) {
        argind++
        # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
        while (ARGV[argind] != FILENAME) {
            process_it(ARGV[argind], 1)
            argind++
        }
        # finally, process the current file
        process_it(ARGV[argind], 0)
        return argind
    }

    BEGIN { argind = 0 }

    FNR == 1 { argind = process_file(argind) }

    # rest of code here
(The index variable is named argind. The name is not random: GNU awk has an equivalent built-in variable called ARGIND.)
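To make the role of the index more concrete, the following minimal sketch prints the operands as awk stores them in ARGV. The file names are placeholders and do not need to exist, because a program consisting only of a BEGIN block exits without reading any input.

```shell
#!/bin/sh
# Sketch: list the operands awk stores in ARGV[1] .. ARGV[ARGC-1].
# ARGV[0] (the command name) is skipped because its value varies by implementation.
# The operands are never opened: a BEGIN-only program reads no input.
out=$(awk 'BEGIN {
    for (i = 1; i < ARGC; i++)
        print i ": " ARGV[i]
}' file1 /dev/null file3)
printf '%s\n' "$out"
```

This prints "1: file1", "2: /dev/null" and "3: file3", one per line. An empty file still occupies its slot in ARGV, which is what lets the catch-up loop find it by walking the index forward until ARGV[argind] equals FILENAME.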
Let's test it:
    $ awk -f script.awk file1 /dev/null file3
    Processing file1 (nonempty)...
    Processing /dev/null (empty)...
    Processing file3 (nonempty)...
    $ awk -f script.awk /dev/null /dev/null file3
    Processing /dev/null (empty)...
    Processing /dev/null (empty)...
    Processing file3 (nonempty)...
    $ awk -f script.awk file1 /dev/null /dev/null
    Processing file1 (nonempty)...
    $ # Oops...
So there's a corner case where it doesn't work, namely when the last file(s) are all empty: since there's no later non-empty file, our function doesn't get any further chance to be called to catch up. This can be fixed: we just call our function from the END block as well. When we're called from the END block, we simply process all the arguments that haven't been processed yet (that is, from argind to ARGC - 1):
    function process_it(filename, is_empty) {
        print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
    }

    function process_file(argind, end) {
        argind++
        if (end) {
            for (; argind <= ARGC - 1; argind++)   # we had empty files at the end of arguments
                process_it(ARGV[argind], 1)
            return argind
        } else {
            # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
            while (ARGV[argind] != FILENAME) {
                process_it(ARGV[argind], 1)
                argind++
            }
            # finally, process the current file
            process_it(ARGV[argind], 0)
            return argind
        }
    }

    BEGIN { argind = 0 }

    FNR == 1 { argind = process_file(argind, 0) }

    # rest of code here...

    END {
        argind = process_file(argind, 1)
        # here argind == ARGC
    }
Let's test it again:
    $ awk -f script.awk file1 /dev/null file3
    Processing file1 (nonempty)...
    Processing /dev/null (empty)...
    Processing file3 (nonempty)...
    $ awk -f script.awk /dev/null /dev/null file3
    Processing /dev/null (empty)...
    Processing /dev/null (empty)...
    Processing file3 (nonempty)...
    $ awk -f script.awk file1 /dev/null /dev/null
    Processing file1 (nonempty)...
    Processing /dev/null (empty)...
    Processing /dev/null (empty)...
    $ awk -f script.awk /dev/null /dev/null /dev/null
    Processing /dev/null (empty)...
    Processing /dev/null (empty)...
    Processing /dev/null (empty)...
But wait, we aren't done yet!
    $ awk -f script.awk file1 /dev/null a=10 file3
    Processing file1 (nonempty)...
    Processing /dev/null (empty)...
    Processing a=10 (empty)...
    Processing file3 (nonempty)...
That is, awk allows mixing filenames and variable assignments in the argument list. This is really a feature, as it allows, for example, changing FS between files. Here's the relevant text from the standard:
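As a small illustration of changing FS between files, consider the following sketch. The file names colon.txt and comma.txt and their contents are invented for this example.

```shell
#!/bin/sh
# Sketch: change FS between two input files via command-line assignments.
# Assumption: colon.txt and comma.txt are hypothetical sample files created here.
tmpdir=$(mktemp -d)
printf 'a:b:c\n' > "$tmpdir/colon.txt"
printf 'x,y,z\n' > "$tmpdir/comma.txt"

# Each assignment takes effect just before the file that follows it is read.
out=$(cd "$tmpdir" && awk '{ print FILENAME ": " $2 }' 'FS=:' colon.txt 'FS=,' comma.txt)
printf '%s\n' "$out"
rm -rf "$tmpdir"
```

This prints "colon.txt: b" followed by "comma.txt: y": the second field is extracted correctly from both files even though they use different separators.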
An operand that begins with an <underscore> or alphabetic character from the portable character set [...], followed by a sequence of underscores, digits, and alphabetics from the portable character set, followed by the '=' character, shall specify a variable assignment rather than a pathname.
But this also means that, in our processing, we should detect assignments and not treat them as if they were filenames. Based on the above rule, we can write a function that checks whether its argument is an assignment or not, and use it to decide whether an argument should be processed.
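One possible shape for such a check, exercised in isolation, is sketched below. The regular expression is a direct transcription of the quoted rule, and the sample strings are arbitrary test inputs chosen for this example.

```shell
#!/bin/sh
# Sketch: classify sample operands as assignments or filenames.
# The regex mirrors the quoted rule: a leading underscore or letter,
# then underscores, digits and letters, then '='.
out=$(awk '
    function is_assignment(s) { return (s ~ /^[_a-zA-Z][_a-zA-Z0-9]*=/) }
    BEGIN {
        n = split("a=10 _x=1 FS=: ./a=10 1a=3 file1", sample, " ")
        for (i = 1; i <= n; i++)
            print sample[i] " -> " (is_assignment(sample[i]) ? "assignment" : "filename")
    }
')
printf '%s\n' "$out"
```

The first three samples are reported as assignments and the last three as filenames. Note that ./a=10 counts as a filename because it does not start with a letter or underscore, which is the usual way to pass a file whose name looks like an assignment.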
Final code that includes this check:
    function is_assignment(s) {
        return (s ~ /^[_a-zA-Z][_a-zA-Z0-9]*=/)
    }

    function process_it(filename, is_empty) {
        if (! is_assignment(filename))
            print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
    }

    function process_file(argind, end) {
        argind++
        if (end) {
            for (; argind <= ARGC - 1; argind++)   # we had empty files at the end of arguments
                process_it(ARGV[argind], 1)
            return argind
        } else {
            # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
            while (ARGV[argind] != FILENAME) {
                process_it(ARGV[argind], 1)
                argind++
            }
            # finally, process the current file
            process_it(ARGV[argind], 0)
            return argind
        }
    }

    BEGIN { argind = 0 }

    FNR == 1 { argind = process_file(argind, 0) }

    # rest of code here...

    END {
        argind = process_file(argind, 1)
        # here argind == ARGC
    }
Final tests:
    $ awk -f script.awk file1 /dev/null a=10 file3
    Processing file1 (nonempty)...
    Processing /dev/null (empty)...
    Processing file3 (nonempty)...
    $ awk -f script.awk file1 /dev/null a=10 /dev/null
    Processing file1 (nonempty)...
    Processing /dev/null (empty)...
    Processing /dev/null (empty)...
    $ awk -f script.awk /dev/null a=10 /dev/null file1
    Processing /dev/null (empty)...
    Processing /dev/null (empty)...
    Processing file1 (nonempty)...
    # now we have an actual file called a=10
    $ awk -f script.awk /dev/null ./a=10 /dev/null file1
    Processing /dev/null (empty)...
    Processing ./a=10 (nonempty)...
    Processing /dev/null (empty)...
    Processing file1 (nonempty)...