
Safely escape variables in awk

In "using shell variables in sed" we saw how to make shell variables "sed-safe"; here we'll see how to make awk variables safe to be used as regular expression patterns.

The problem

Some awk functions and operators expect some of their arguments to be regular expressions; sub() and gsub() are typical examples, and the same goes for match(), split() and the ~ operator. Sometimes you need to use those functions and operators, and your pattern is in a variable (i.e., NOT in a regular expression literal like /pattern/).
The risk in doing

gsub(patternvar, "something")

is that awk always treats the contents of patternvar as a regular expression, so any regex metacharacters it contains are interpreted instead of being matched literally. This may or may not be what you want. If this is NOT what you want, and patternvar might contain arbitrary text, then you have a problem.
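To see the problem concretely, here is a minimal sketch, assuming a POSIX shell and a reasonably standard awk. The input line and the replacement "X" are made up for illustration; patternvar holds the string a.b, which the user presumably means literally.

```shell
# Hypothetical input: a line containing the literal text "a.b" and also "axb".
result=$(printf 'a.b axb\n' | awk -v patternvar='a.b' '{
  # patternvar is used as a dynamic regex, so its "." matches ANY character
  gsub(patternvar, "X")
  print
}')
echo "$result"   # prints: X X
```

Both words are replaced, although only the first one contains the text a.b verbatim.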

The solution

Here is a simple function that can be used to sanitize the string contained in the variable by escaping the regular expression metacharacters, and make it safe to be used wherever awk expects a regular expression:

function escape_pattern(pat,   safe) {
  safe = pat
  gsub(/[][^$.*?+{}\\()|]/, "\\\\&", safe)
  return safe
}

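This makes the pattern "safe" in the sense that, whatever it contains, each of its characters will only ever match itself. Awk still treats the escaped string as a regular expression wherever it expects one, of course, but the result is the same as if you had matched against the verbatim string.
See the GNU awk manual for an explanation of why we need "\\\\&" to get a literal "\" followed by the matched text (yes, it's weird). In short: awk's string parsing first reduces "\\\\&" to \\&, and gsub() then reads \\& as a literal backslash followed by the matched text.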
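As a quick check of the function, the sketch below prints an escaped string and then uses the function in gsub(). The assumptions are the same as before: a POSIX shell, a standard awk that accepts the bracket syntax used in the function, and sample strings invented for illustration.

```shell
out=$(printf 'a.b axb\n' | awk -v patternvar='a.b' '
function escape_pattern(pat,   safe) {
  safe = pat
  gsub(/[][^$.*?+{}\\()|]/, "\\\\&", safe)
  return safe
}
NR == 1 {
  # Show what the escaped form of a metacharacter-heavy string looks like
  print escape_pattern("a.b*[c]")
  # Only the literal text "a.b" is replaced now; "axb" is left alone
  gsub(escape_pattern(patternvar), "X")
  print
}')
echo "$out"
# prints:
# a\.b\*\[c\]
# X axb
```

One thing to keep in mind when feeding arbitrary text to awk: values assigned with -v have their backslash escape sequences interpreted by awk, so text that may contain backslashes is probably better passed through the environment and read from ENVIRON.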

Note that the above code doesn't seem to work with older versions of mawk, which instead require the brackets in the character class to be escaped:

gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", safe)

This seems to be a mawk-specific problem, as all the other awks I've tested (including GNU awk, busybox awk and the one true awk) accept the original syntax, which also appears to be the POSIX-compliant form.

Update 30/03/2010: The latest mawk snapshot (1.3.4-20100224) fixes the above issue and finally accepts the standard syntax.
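To find out whether a given awk accepts the standard bracket syntax, a small check along these lines may help. It is only a sketch: the test string is arbitrary, and how a non-conforming awk misbehaves (an error message, or simply different output) is an assumption that will vary between implementations.

```shell
# Escape a string containing both bracket characters, using the standard syntax.
check=$(awk 'BEGIN {
  s = "a]b[c"
  gsub(/[][^$.*?+{}\\()|]/, "\\\\&", s)
  print s
}')
if [ "$check" = 'a\]b\[c' ]; then
  echo "standard syntax accepted"
else
  echo "standard syntax NOT accepted, try the escaped-bracket form"
fi
```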


One Comment

  1. Steven B. says:

    Thanks for this. Based on tests with GNU Awk 4.1 this didn't work well for me when passing the regex string to an external program (tre-agrep) which supports regex statements. Here is a similar solution:

    #
    # Escape regex symbols
    #
    function regesc(str,   safe) {
      safe = str
      gsub(/[][^$*?+{}\\()|]/, "[&]", safe)
      gsub("[\\^]","\\^",safe) # replace "[^]" with "[\^]"
      return safe
    }

    This uses square brackets instead of \ to escape, with a special case for ^ since it has its own meaning inside a bracket expression.
