
Awk pitfall: string concatenation

String concatenation is a bit of a dark corner of awk.
There is a classic example in the awk FAQ (number 28):

$ awk 'BEGIN { print 6 " " -22 }'
6-22

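Where did the space go? Subtraction binds more tightly than concatenation, so the minus sign is taken as a binary operator and " " -22 is evaluated first, in numeric context: the string " " converts to 0, yielding 0-22 = -22. This is then concatenated with 6, producing the output that we see.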
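To make that parse visible, here is a small sketch that puts the original expression next to the same expression with the grouping written out explicitly. The second command is only an illustration of how awk reads the first one; both should print the same thing.

```shell
# The original expression from the FAQ.
awk 'BEGIN { print 6 " " -22 }'      # prints 6-22

# The same expression with the effective grouping spelled out:
# " " - 22 is a numeric subtraction (0 - 22), then 6 is concatenated.
awk 'BEGIN { print 6 (" " - 22) }'   # prints 6-22
```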

But recently another case was presented on the mailing list. Here is a minimal example that demonstrates the problem:

$ awk 'BEGIN { print "Hello " ++count }'
Hello 1
$ awk 'BEGIN{ msg = "Hello "; print msg ++count }'
0

These look like they should do the same thing, but what's going on in the second example? It turns out that the culprit is the seemingly innocuous string concatenation:

print msg ++count

Believe it or not, that is parsed by awk as

print (msg++) count

Obviously, "msg" is a string, so to apply the postincrement operator it must be converted to a number, and that number is 0. "count" is not touched at all; to awk, it's still in the default state (dual empty string/numeric 0; in string context like here, the empty string value is used). The concatenation of 0 with an empty string gives 0, which is the result we see.
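One way to convince yourself of this is to look at the variables after the expression has been evaluated. The following is a minimal sketch (the variable "out" and the bracketed labels are only there for illustration): if the "++" really was applied to "msg", then "msg" should now hold 1 and "count" should still be empty.

```shell
awk 'BEGIN {
    msg = "Hello "
    out = msg ++count            # parsed as (msg++) count
    print "out=[" out "]"        # out=[0]   (0 concatenated with "")
    print "msg=[" msg "]"        # msg=[1]   (msg was postincremented)
    print "count=[" count "]"    # count=[]  (count was never touched)
}'
```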

But...but...why does the first version with the literal string work then? Simple: because a literal string can't be postincremented, so the "++" is parsed as preincrement for "count". (Well, simple, yes, even obvious, once somebody tells you.)

So here's the test case again:

$ awk 'BEGIN{ msg = "Hello "; print msg             ++count }'
0

This is purposely exaggerated to show that the number of spaces before the "++" is completely irrelevant; it will still be applied to "msg". As I've been reminded, the awk grammar states that "A <blank> shall have no effect, except to delimit lexical tokens or within STRING or ERE tokens", which is exactly the point.

It's not difficult to make up other cases involving string concatenation where the results differ from what one may expect.
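A couple of such cases, as a sketch. These are illustrative additions, not taken from the FAQ or from the mailing list thread, and the variable names x and y are arbitrary. The first uses "--" instead of "++"; the second puts a unary minus right after a variable, where it is read as a plain subtraction.

```shell
# Decrement behaves like increment: parsed as (msg--) count.
awk 'BEGIN { msg = "Hello "; print msg --count }'     # prints 0

# The minus is taken as binary: x - y, i.e. 0 - 3.
awk 'BEGIN { x = "a"; y = 3; print x -y }'            # prints -3

# With parentheses, both produce the concatenation that was probably meant.
awk 'BEGIN { msg = "Hello "; print msg (--count) }'   # prints Hello -1
awk 'BEGIN { x = "a"; y = 3; print x (-y) }'          # prints a-3
```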

Now it should be apparent that the problem is not something that is likely to happen all the time; in fact, depending on the programmer's coding style, the specific task to be solved, and other elements, it may even remain unknown to many and never show up. But when one happens to trigger it, it may be difficult to understand what's going on.

How to avoid this problem then? To quote the GNU awk manual:

when doing concatenation, parenthesize. Otherwise, you're never quite sure what you'll get.

In our examples, parentheses will indeed produce the intended results:

$ awk 'BEGIN { print 6 " " (-22) }'
6 -22
$ awk 'BEGIN { msg = "Hello "; print msg (++count) }'
Hello 1

And so on. Better to clutter the code with a few extra parentheses than to leave the outcome at the mercy of awk's grammar.
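If parentheses feel too noisy, there are other ways to keep the operator away from the concatenation. The sketch below shows two possibilities that are not from the original discussion: passing the pieces to printf as separate arguments, and doing the increment in its own statement first (the variable "n" is made up for the example). Either should give the same output as the parenthesized version.

```shell
# printf takes the pieces as separate, comma-delimited arguments,
# so ++count can only be a preincrement of count.
awk 'BEGIN { msg = "Hello "; printf "%s%d\n", msg, ++count }'   # prints Hello 1

# Increment first, concatenate afterwards.
awk 'BEGIN { msg = "Hello "; n = ++count; print msg n }'        # prints Hello 1
```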

Here is the whole thread where this was discussed.