Detecting empty files in awk

Posted by waldner on 7 April 2015, 1:47 pm

We have an awk script that can process multiple files, and we want to do some special task at the beginning of each file (for this example we just print the file name, but it can be anything of course). The classic awk idiom to do this is something like

function process_file(){
  print "Processing " FILENAME "..."
}

FNR == 1 { process_file() }
# rest of code

So we call our script with three files and get:

$ awk -f script.awk file1 file2 file3
Processing file1...
Processing file2...
Processing file3...

Alright. But what happens if some file is empty? Let's try it (we use /dev/null to simulate an empty file):

$ awk -f script.awk file1 /dev/null file3
Processing file1...
Processing file3...

Right, since an empty file has no lines, it can never match FNR == 1, so for the purposes of our per-file processing task it's effectively skipped. Depending on the exact needs, this may or may not be acceptable. Usually it is, but what if we want to be sure that we run our code for each file, regardless of whether it's empty or not?

GNU awk

If we have GNU awk and can assume it's available anywhere our script will run (or can force it as a prerequisite for users), then it's easy: just use the special BEGINFILE block instead of FNR == 1.

function process_file(){
  print "Processing " FILENAME "..."
}

BEGINFILE { process_file() }

(Btw, GNU awk also has a corresponding ENDFILE special block.)

And there we have it:

$ gawk -f script.awk file1 /dev/null file3
Processing file1...
Processing /dev/null...
Processing file3...

But alas, for the time being this is not standard, so it can only run with GNU awk.
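As a small illustration of the ENDFILE block mentioned above, here is a sketch that prints a per-file line count, including a count of 0 for an empty file. It assumes gawk is installed; the temporary directory and the file names (file1, empty, file3) are made up for the example.

```shell
# Sketch only: ENDFILE is a GNU awk extension, so this needs gawk.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/file1"   # two lines
: > "$tmp/empty"                 # zero bytes
printf 'c\n' > "$tmp/file3"      # one line

if command -v gawk >/dev/null 2>&1; then
  # FNR is reset for every input file, so it is 0 for the empty one
  out=$(cd "$tmp" && gawk 'ENDFILE { print FILENAME ": " FNR " line(s)" }' file1 empty file3)
else
  out="gawk not available"
fi
printf '%s\n' "$out"
rm -rf "$tmp"
```

With gawk present, this should report 2, 0 and 1 line(s) for the three files, in that order.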

Standard awk

With standard awk, we have to stick to what is available, namely the FNR == 1 condition. If our process_file function is executed, then we know we're seeing a non-empty file. So our only option is, within this function, to check whether any previous files have been skipped and, if so, catch up on their processing. How do we do this check? Well, awk stores all the arguments to the program in the ARGV[] array, so we can keep our own index of the argument we expect to be the "current" file and compare ARGV at that index with FILENAME (which is set by awk and always names the current file); if they differ, some previous file was skipped, so we catch up.
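To see what awk actually stores in ARGV[], a quick sketch like the following can help. It only has a BEGIN block, so awk never opens the operands and the file names do not need to exist; the loop starts at 1 because ARGV[0] holds the command name, which varies between implementations.

```shell
# BEGIN-only program: no input is read, the operands are just listed.
out=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i ": " ARGV[i] }' file1 /dev/null file3)
printf '%s\n' "$out"
# prints:
# 1: file1
# 2: /dev/null
# 3: file3
```

These are the indexes that our own pointer will walk through.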

First version of our processing function (the argind parameter shadows the global variable of the same name that we pass in; we choose to ignore this lint/style issue, as it's totally harmless here and improves code readability):

function process_it(filename, is_empty) {
  print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
}

function process_file(argind) {
  argind++

  # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
  while (ARGV[argind] != FILENAME) {
    process_it(ARGV[argind], 1)
    argind++
  }
  # finally, process the current file
  process_it(ARGV[argind], 0)
  return argind
}

BEGIN {
  argind = 0
}

FNR == 1 {
  argind = process_file(argind)
}
# rest of code here

(The index variable is named argind. The name is not random: GNU awk has an equivalent built-in variable called ARGIND.)

Let's test it:

$ awk -f script.awk file1 /dev/null file3
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk /dev/null /dev/null file3
Processing /dev/null (empty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk file1 /dev/null /dev/null
Processing file1 (nonempty)...
$
# Oops...

So there's a corner case where it doesn't work, namely when the last file(s) are all empty: since there's no later non-empty file, our function never gets another chance to be called and catch up. This can be fixed by also calling our function from the END block. When called from the END block, it simply processes all the arguments that haven't been processed yet, if any (that is, those after the last processed index, up to ARGC - 1); these would all be empty files. Revised code:

function process_it(filename, is_empty) {
  print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
}

function process_file(argind, end) {
  argind++

  if (end) {
    for(; argind <= ARGC - 1; argind++)
      # we had empty files at the end of arguments
      process_it(ARGV[argind], 1)
    return argind
  } else {
    # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
    while (ARGV[argind] != FILENAME) {
      process_it(ARGV[argind], 1)
      argind++
    }
    # finally, process the current file
    process_it(ARGV[argind], 0)
    return argind
  }
}

BEGIN {
  argind = 0
}

FNR == 1 {
  argind = process_file(argind, 0)
}

# rest of code here...

END {
  argind = process_file(argind, 1)
  # here argind == ARGC
}

Let's test it again:

$ awk -f script.awk file1 /dev/null file3
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk /dev/null /dev/null file3
Processing /dev/null (empty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk file1 /dev/null /dev/null
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing /dev/null (empty)...
$ awk -f script.awk /dev/null /dev/null /dev/null
Processing /dev/null (empty)...
Processing /dev/null (empty)...
Processing /dev/null (empty)...

But wait, we aren't done yet!

$ awk -f script.awk file1 /dev/null a=10 file3
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing a=10 (empty)...
Processing file3 (nonempty)...

That is, awk allows mixing filenames and variable assignments in the argument list. This is really a feature, as it makes it possible, for example, to change FS between files. Here's the relevant text from the standard:

An operand that begins with an <underscore> or alphabetic character from the portable character set [...], followed by a sequence of underscores, digits, and alphabetics from the portable character set, followed by the '=' character, shall specify a variable assignment rather than a pathname.
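For instance, changing FS between two files might look like the sketch below. The file names (spaces.txt, colons.txt) and their contents are invented for the example; the point is only that the FS=: operand is applied after the first file has been read and before the second one is opened.

```shell
tmp=$(mktemp -d)
printf 'alpha beta\n'  > "$tmp/spaces.txt"   # whitespace-separated
printf 'gamma:delta\n' > "$tmp/colons.txt"   # colon-separated

# The first file is split with the default FS, the second with FS=":"
out=$(cd "$tmp" && awk '{ print FILENAME ": " $2 }' spaces.txt FS=: colons.txt)
printf '%s\n' "$out"
# prints:
# spaces.txt: beta
# colons.txt: delta
rm -rf "$tmp"
```

Note that FS=: never shows up as FILENAME; awk consumes it as an assignment, but it is still sitting in ARGV[] like any other operand.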

But this also means that our processing should detect assignments and not treat them as filenames. Based on the rule above, we can write a function that checks whether its argument is an assignment, and use it to decide whether an argument should be processed.
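Before wiring such a check into the script, it can be tried on its own against a few sample operands. The sketch below uses a regular expression that follows the quoted rule; the sample strings are arbitrary and only meant to cover the obvious cases (a leading digit, a leading "=", a path prefix).

```shell
out=$(awk '
function is_assignment(s) {
  # letter or underscore, then letters/digits/underscores, then "="
  return (s ~ /^[_a-zA-Z][_a-zA-Z0-9]*=/)
}
BEGIN {
  n = split("a=10 _x=1 FS=: ./a=10 1a=2 file1 =5", args, " ")
  for (i = 1; i <= n; i++)
    print args[i] " -> " (is_assignment(args[i]) ? "assignment" : "pathname")
}')
printf '%s\n' "$out"
# prints:
# a=10 -> assignment
# _x=1 -> assignment
# FS=: -> assignment
# ./a=10 -> pathname
# 1a=2 -> pathname
# file1 -> pathname
# =5 -> pathname
```

The ./a=10 case is the usual way to pass a file whose name would otherwise look like an assignment.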

Final code that includes this check:

function is_assignment(s) {
  return (s ~ /^[_a-zA-Z][_a-zA-Z0-9]*=/)
}

function process_it(filename, is_empty) {
  if (! is_assignment(filename))
    print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
}

function process_file(argind, end) {
  argind++

  if (end) {
    for(; argind <= ARGC - 1; argind++)
      # we had empty files at the end of arguments
      process_it(ARGV[argind], 1)
    return argind
  } else {
    # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
    while (ARGV[argind] != FILENAME) {
      process_it(ARGV[argind], 1)
      argind++
    }
    # finally, process the current file
    process_it(ARGV[argind], 0)
    return argind
  }
}

BEGIN {
  argind = 0
}

FNR == 1 {
  argind = process_file(argind, 0)
}

# rest of code here...

END {
  argind = process_file(argind, 1)
  # here argind == ARGC
}

Final tests:

$ awk -f script.awk file1 /dev/null a=10 file3
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk file1 /dev/null a=10 /dev/null
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing /dev/null (empty)...
$ awk -f script.awk /dev/null a=10 /dev/null file1
Processing /dev/null (empty)...
Processing /dev/null (empty)...
Processing file1 (nonempty)...

# now we have an actual file called a=10
$ awk -f script.awk /dev/null ./a=10 /dev/null file1
Processing /dev/null (empty)...
Processing ./a=10 (nonempty)...
Processing /dev/null (empty)...
Processing file1 (nonempty)...

