“Range of fields” in awk

Posted by waldner on 13 October 2014, 6:03 pm

This is an all-time awk FAQ. It can be stated in various ways. A typical way is:

"How can I print the whole line except the first (or the first N, or the Nth) field?"

Or also:

"How can I print only from field N to field M?"

The underlying general question is:

"How can I print a range of fields with awk?"

There are actually quite a few ways to accomplish the task, each has its applicability scenario(s) and its pros and cons. Let's start with methods that only use standard Awk features, then we'll get to GNU awk.

Use a loop

This is the most obvious way: just loop from N to M and print the corresponding fields.

sep = ""
for (i = 3; i<=NF; i++) {
  printf "%s%s", sep, $i
  sep = FS
}
print ""

This is easy, but has some issues: first, the original record spacing is lost. If the input record (line) was, say,

  abc  def   ghi    jkl     mno

the above code will print

ghi jkl mno

instead. This might or might not be a problem. For the same reason, if FS is a complex regular expression, whatever separated the fields in the original input is lost.
On the other hand, if FS is exactly a single-char expression (except space, which is the default and special cased), the above code works just fine.

Assign the empty string to the unwanted fields

So for example one might do:

$1 = $2 = ""; print substr($0, 3)

That presents the same problems as the first solution (formatting is lost), although for different reasons (here it's because awk rebuilds the line with OFS between fields), and introduces empty fields, which have to be skipped when printing the line (in the above example, the default OFS of space is assumed, so we must print starting from the third character; adapt accordingly if OFS is something else).

Delete the unwanted fields

Ok, so it's not possible to delete a field by assigning the empty string to it, but if we modify $0 directly we can indeed remove parts of it and thus fields. We can use sub() for the task:

# default FS
# removes first 2 fields
sub(/^[[:blank:]]*([^[:blank:]]+[[:blank:]]+){2}/,""); print
# removes last 3 fields
sub(/([[:blank:]]+[^[:blank:]]+){3}[[:blank:]]*$/,""); print

# one-char FS, for example ";"
# removes first 2 fields
sub(/^([^;]+;){2}/,""); print
# removes last 3 fields
sub(/(;[^;]+){3}$/,""); print

While this approach has the advantage that it preserves the original formatting (this is especially important if FS is the default, which in awk is slightly special-cased, as can be seen from the first example), it has the problem that it's not applicable at all if FS is a regular expression (that is, when it's not the default and is longer than one character).
It also requires that the Awk implementation in use understands the regex {} quantifier operator, something many awks don't do (although this can be worked around by "expanding" the expression, that is, for example, using "[^;]+;[^;]+;[^;]+;" instead of "([^;]+;){3}". However, the resulting expression might be quite long and awkward - pun intended).

Manually find start and end of fields

Let's now try to find a method that works regardless of FS or OFS. We observe that we can use index($0, $1) to find where $1 begins. We also know the length of $1, so we know where it ends within $0. Now, we can use again index() starting from the next character to find where $2 begins, and so on for all fields of $0. so we can discover the starting positions within $0 for all fields. Sample code:

pos = 0
for (i=1; i<= NF; i++) {
  start[i] = index(substr($0, pos + 1), $i) + pos
  pos = start[i] + length($i)
}

Now, start[1] contains the starting position of field 1 ($1), start[2] the starting position of $2, etc. (As customary in awk, the first character of a string is at position 1.) With this information, printing field 3 to NF without losing information is as simple as doing

first = 3
last = NF
print substr($0, start[first], start[last] - start[first] + length($last))

Seems easy right? Well, this approach has a problem: it assumes that the input has no empty fields, which however are perfectly fine in awk. If some of the fields in the desired range are empty, it may or may not work. So let's see if we can do better.

Manually find the separators

By design, FS can never match the empty string (more on this later), so perhaps we can look for matches of FS (using match()) and use those offsets to extract the needed fields. The idea is the same as in the previous approach, each match is attempted starting from where the previous one left off plus the length of the following field.
If we go this route, however, we must keep in mind that the default FS in awk is special-cased, in that leading and trailing blanks (spaces + tabs) in the record are not counted for the purpose of field splitting, and furthermore fields are separated by runs of blanks despite FS being just a single space. This only happens with the default FS; with any other value, each match terminates exactly one field. Fortunately, it is possible to check whether FS is the default by comparing it to the string " " (a space). If we detect the default FS, we remove leading and trailing blanks from the record, and, for the purpose of matching, change it to its effectively equivalent pattern, that is, "[[:blank:]]+".
If FS is not the default, there is still another special case we should check. The awk specification says that if FS is exactly one character (and is not a space), it must NOT be treated as a regular expression. Since we want to use match() and FS as a pattern, this is especially important, for example if FS is ".", or "+", or "*", which are special regular expression metacharacters but should be treated literally in this case.
All that being said, here's some code that finds and saves all matches of FS:

BEGIN {
  # sep_re is the "effective" FS, so to speak, to be
  # used to find where separators are
  sep_re = FS
  defaultfs = 0

  # ...but check for special cases
  if (FS == " ") {
    defaultfs = 1
    sep_re = "[[:blank:]]+"
  } else if (length(FS) == 1) {
    if (FS ~ /[][^$.*?+{}\\()|]/) {
      sep_re = "\\" FS
    }
  }
}

{
  # save $0 and work on the copy
  record = $0

  if (defaultfs) {
    gsub(/^[[:blank:]]+|[[:blank:]]+$/, "", record)
  }

  # find separators
  i = 0
  while(1) {
    if (match(record, sep_re)) {
      i++
      seps[i] = substr(record, RSTART, RLENGTH)
      record = substr(record, RSTART + RLENGTH)
    } else {
      break
    }
  }

  # ...continued below

With the above code seps[i] contains the string that matched FS between field i and i + 1. We of course also have the fields themselves in $1...$NF, so we can finally write the code that extracts a range of fields from the line:

  # ...continued from above

  result = ""

  first = 3
  last = NF
  for (i = first; i < last; i++) {
    result = result $i seps[i]
  }
  result = result $last
  print result
}

Are we still overlooking something? Unfortunately, yes.
We said earlier that FS can't match the empty string; however, technically we can obviously set it to a value that would ordinarily match the empty string, for example

FS="a*"

That matches zero or more a's, so in particular it will produce a zero-length match if it can't find an "a".
But, just as obviously, an FS that can match a zero-length string is useless as field "separator", so what happens in these cases is that awk just does not allow it to match:

$ echo 'XXXaYYYaaZZZ' | awk -F 'a*' '{for (i=1; i<=NF; i++) print i, $i}'
1 XXX
2 YYY
3 ZZZ

In other words, if awk finds a match of length zero it just ignores it and skips to the next character until it can find a match of length at least 1 for FS.

(Let's leave aside the fact that setting FS to "a*" makes no sense, as in that case what's really wanted is "a+" instead and let's try to make the code handle the worst case.)

In our sample code, we're using match(), which can indeed produce zero-length matches, but we are not checking for those cases; the result is that running it with an FS that can produce zero-length matches will loop forever.

Thus we need to mimic awk's field splitting a little bit more, in that if we find a zero-length match, we just ignore it and try to match again starting from the next character.
So here's the full code to print a range of fields preserving format and separators, with the revised loop to find separators skipping zero-length matches:

BEGIN {
  # sep_re is the "effective" FS, so to speak, to be
  # used to find where separators are
  sep_re = FS
  defaultfs = 0

  # ...but check for special cases
  if (FS == " ") {
    defaultfs = 1
    sep_re = "[[:blank:]]+"
  } else if (length(FS) == 1) {
    if (FS ~ /[][^$.*?+{}\\()|]/) {
      sep_re = "\\" FS
    }
  }
}

{
  # save $0 and work on the copy
  record = $0

  if (defaultfs) {
    gsub(/^[[:blank:]]+|[[:blank:]]+$/, "", record)
  }

  # find separators
  i = 0
  while(1) {
    if (length(record) == 0) break;
    if (match(record, sep_re)) {
      if (RLENGTH > 0) {
        i++
        seps[i] = substr(record, RSTART, RLENGTH)
        record = substr(record, RSTART + RLENGTH)
      } else {
        # ignore zero-length match: go to next char
        record = substr(record, 2)
      }
    } else {
      break
    }
  }

  result = ""

  first = 3
  last = NF
  for (i = first; i < last; i++) {
    result = result $i seps[i]
  }
  result = result $last
  print result
}

A simple optimization of the above code would be to directly skip the next field upon finding a match for FS, eg

# attempt next match after the field that begins here
record = substr(record, RSTART + RLENGTH + length($i))

since, by definition, a field can never match FS, so it can be skipped entirely for the purpose of finding matches of FS.

GNU awk

As it often happens, life is easier for GNU awk users. In this case, thanks to the optional fourth argument to the split() function (a GNU awk extension present at least since 4.0), which is an array where the separators are saved. So all that is needed is something like:

# this does all the hard work, as split() is
# guaranteed to behave like field splitting
nf = split($0, fields, FS, seps)

first = 3
last = NF
for (i = first; i < last; i++) {
  result = result fields[i] seps[i]
}
result = result $last
print result

For more and a slightly different take on the subject, see also this page on the awk.freeshell.org wiki.

Filed under awk, faq Tagged awk, ranges

Comments are closed | Permalink

One Comment

galaxywatcher says:

October 23, 2014 at 15:50

Thanks for your thorough explanation for printing ranges in awk. I find your posts exceptionally informative.

\1