Working with blocks in awk

Posted by waldner on 24 July 2010, 6:07 pm

In a previous article, we looked at how to recognize and process multi-line "blocks" using sed. Here we will perform the same tasks using awk, which is probably more straightforward and understandable (and usually more efficient too).

Again, we're going to recognize and isolate blocks, and print them if they match the pattern /^ERROR: .*Error code is 12/.

Fixed length blocks

This is the easiest one. If we know that each block is, say, 4 lines long:

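{
  # append the current record to the block; sep is empty for the first record
  block = block sep $0
  sep = RS
}
!(NR%4){
  # every 4th record closes a block: test it, then start afresh
  if (block ~ /^ERROR: .*Error code is 12/) print block
  block = sep = ""
}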

The lines (well, records) that make up the block are accumulated in the variable block (doh) using the concatenation idiom, and when the current record number is a multiple of 4 (4, 8, etc.) the block is matched against the pattern, and printed if there is a match. Regardless of the outcome, the variables block and sep are reset to the empty string, so accumulation of the next block can start.
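
If the concatenation idiom is new to you, here is a minimal sketch of it in isolation. The variable names s and joined, and the comma separator, are made up for the illustration; the point is only that sep is empty the first time around, so the result has no leading separator.

```shell
# Join three records with a comma; sep is still empty when the first
# record is appended, so no separator ends up in front of "a".
joined=$(printf '%s\n' a b c | awk '
  { s = s sep $0; sep = "," }
  END { print s }
')
echo "$joined"   # prints a,b,c
```

Using RS instead of "," as the separator, as the block code does, rebuilds the original lines with a newline between them and none at the end.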

However, the 4-line program assumes that the number of input records is an exact multiple of 4. If that is not guaranteed, we need to add an END section to process the final block (if any):

{
  block = block sep $0
  sep=RS
}
!(NR%4){
  if (block ~ /^ERROR: .*Error code is 12/) print block
  block = sep = ""
}
END{
  if(block"" && block ~ /^ERROR: .*Error code is 12/) print block
}
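
A quick way to see what the END section buys us is to feed both variants an input whose length is not a multiple of 4. This is only a sketch on made-up data: the six sample records and the shell variable names input, no_end and with_end are assumptions for the demonstration.

```shell
# Six records: one full 4-line block, plus a 2-line leftover that matches.
input=$(printf '%s\n' 'INFO: a' 'b' 'c' 'd' 'ERROR: late' 'Error code is 12')

# Without END: the leftover block is never tested, so nothing is printed.
no_end=$(printf '%s\n' "$input" | awk '
  { block = block sep $0; sep = RS }
  !(NR % 4) { if (block ~ /^ERROR: .*Error code is 12/) print block; block = sep = "" }
')

# With END: the leftover block is tested once the input is exhausted.
with_end=$(printf '%s\n' "$input" | awk '
  { block = block sep $0; sep = RS }
  !(NR % 4) { if (block ~ /^ERROR: .*Error code is 12/) print block; block = sep = "" }
  END { if (block "" && block ~ /^ERROR: .*Error code is 12/) print block }
')

printf 'without END: [%s]\nwith END: [%s]\n' "$no_end" "$with_end"
```

The first pair of brackets comes out empty, while the second contains the two leftover lines.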

Since the END section checks that the final block is not empty before attempting the match, this version can be used generally, even when the number of input records is an exact multiple of 4. The only downside (in most cases unavoidable when doing this kind of processing with awk) is that the pattern and the match logic have to be written twice in the code, which is not pretty. To avoid that, we can put the match into a function and only call the function:

function matches(block) {
  return (block ~ /^ERROR: .*Error code is 12/)
}

{
  block = block sep $0
  sep=RS
}
!(NR%4){
  if (matches(block)) print block
  block = sep = ""
}
END{
  if(block"" && matches(block)) print block
}

To make it more general and able to manage blocks of any length, the number of records in a block can be turned into a variable and the value passed from the outside using -v. For even more generality, the pattern to match against can also be passed, but then you need to make sure that you escape it properly for awk to recognize it as intended. This will be discussed in a future article (it was partly covered in an earlier article, but the topic is more complex than what was covered there).
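
As a rough sketch of the first generalization, the block length can be handed in with -v and used in place of the hardcoded 4. The wrapper name print_blocks and the awk variable name n are made up here, and the pattern stays hardcoded precisely to sidestep the escaping issue.

```shell
# print_blocks N [FILE...]: hypothetical wrapper; N is the block length.
# N must be a positive integer, since NR % 0 is a division by zero.
print_blocks() {
  n=$1
  shift
  awk -v n="$n" '
    function matches(block) {
      return (block ~ /^ERROR: .*Error code is 12/)
    }
    {
      block = block sep $0
      sep = RS
    }
    !(NR % n) {
      # every n-th record closes a block
      if (matches(block)) print block
      block = sep = ""
    }
    END {
      # leftover partial block, if any
      if (block "" && matches(block)) print block
    }
  ' "$@"
}

# Example call (app.log is a placeholder file name):
#   print_blocks 4 app.log
```

With no file argument the function reads standard input, like awk itself.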

Variable length blocks

Here's a sample input with variable length blocks (it's the same data we used for the sed case):

INFO: nothing here
WARNING: and neither here
but this is multiline
ERROR: here is
an interesting error
Error code is 12
and this is a block to print
ERROR: here is
another block to print
Error code is 12
the failure is not serious
INFO: nothing here
but a multiline nothing
INFO: nothing here
WARNING: and neither here
ERROR: this is an error
Error code is 18
this block should not be printed
INFO: nothing here

Turns out that processing variable-length blocks of this kind in awk is much easier than with sed; we can in fact use the same ideas we used for the fixed length case, only deferring the processing of a block to the time the start of a new block is detected (or the end of input is reached). So:

function matches(block) {
  return (block ~ /^ERROR: .*Error code is 12/)
}

/^(INFO|WARNING|ERROR): /{
  # start of a new block, process the previous
  if (prevblock"" && matches(prevblock)) print prevblock
  prevblock = sep = ""
}

{
  prevblock = prevblock sep $0
  sep = RS
}

END {
  if (prevblock"" && matches(prevblock)) print prevblock
}

Save it in a file, say block.awk, and run it on the sample input:

$ awk -f block.awk sample.txt
ERROR: here is
an interesting error
Error code is 12
and this is a block to print
ERROR: here is
another block to print
Error code is 12
the failure is not serious

As with the sed solution, this last method can handle "blocks" of any length (including a single line), as long as the pattern that identifies the start of a block is correct.
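
To check the single-line case, here is a small sketch that wraps the same program in a shell function and feeds it a block that starts and ends on one line. The function name print_error_blocks and the three input lines are invented for the test.

```shell
# Hypothetical wrapper around the variable-length program shown earlier;
# reads the named files, or standard input when none are given.
print_error_blocks() {
  awk '
    function matches(block) {
      return (block ~ /^ERROR: .*Error code is 12/)
    }
    /^(INFO|WARNING|ERROR): / {
      # start of a new block, process the previous
      if (prevblock "" && matches(prevblock)) print prevblock
      prevblock = sep = ""
    }
    {
      prevblock = prevblock sep $0
      sep = RS
    }
    END {
      if (prevblock "" && matches(prevblock)) print prevblock
    }
  ' "$@"
}

# The middle record is a complete one-line block.
printf '%s\n' 'INFO: ok' 'ERROR: disk full. Error code is 12' 'INFO: done' |
  print_error_blocks
```

Only the ERROR line is printed, since each of the three records starts a block of its own.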

