In "using shell variables in sed" we saw how to make shell variables "sed-safe"; here we'll see how to make awk variables safe to be used as regular expression patterns.
The problem
Some awk functions and operators expect their arguments to be regular expressions; think, for example, of sub()/gsub(). Sometimes you need to use those functions and operators, and your pattern is in a variable (i.e., NOT in a regular expression literal).
The risk in doing
gsub(patternvar, "something")
is that if patternvar contains special regex metacharacters, these will be interpreted and the contents will thus be taken as a regular expression. This may or may not be what you want. If this is NOT what you want, and patternvar might contain arbitrary text, then you have a problem.
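To make the risk concrete, here is a minimal sketch run from the shell. The sample input and the replacement text "X" are made up for illustration; only patternvar comes from the snippet above.

```sh
# Hypothetical demo: "a.c" is meant as literal text, but gsub() compiles
# the contents of patternvar as a regex, so "." matches any character.
result=$(printf '%s\n' 'abc a.c' | awk -v patternvar='a.c' '{
    n = gsub(patternvar, "X")   # both "abc" and "a.c" match
    print n, $0
}')
printf '%s\n' "$result"   # prints: 2 X X
```

Two replacements happen even though the literal text "a.c" appears only once in the input.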
The solution
Here is a simple function that can be used to sanitize the string contained in the variable by escaping the regular expression metacharacters, and make it safe to be used wherever awk expects a regular expression:
function escape_pattern(pat,   safe) {
    safe = pat
    gsub(/[][^$.*?+{}\\()|]/, "\\\\&", safe)
    return safe
}
This makes the pattern "safe" in the sense that, whatever it contains, none of its characters will act as regex metacharacters. More precisely, awk still treats it as a regular expression wherever it expects one, but the result is the same as if you had matched against the verbatim string.
See the GNU awk manual for an explanation of the reason we need "\\\\&" as the replacement string.
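As a quick check, this sketch runs escape_pattern() on a sample pattern from the shell. The input line, the "X" replacement and the variable names other than the function itself are assumptions for illustration, and it assumes an awk that accepts the character class as written above.

```sh
# Same made-up input as before, but the pattern is escaped first.
result=$(printf '%s\n' 'abc a.c' | awk -v patternvar='a.c' '
function escape_pattern(pat,   safe) {
    safe = pat
    gsub(/[][^$.*?+{}\\()|]/, "\\\\&", safe)   # prefix each metacharacter with a backslash
    return safe
}
{
    safe = escape_pattern(patternvar)   # "a.c" becomes "a\.c"
    n = gsub(safe, "X")                 # now only the literal "a.c" matches
    print safe, n, $0
}')
printf '%s\n' "$result"   # prints: a\.c 1 abc X
```

Note that the pattern is passed with -v only because it contains no backslashes; awk processes escape sequences in -v assignments, so arbitrary text is better read from input or from ENVIRON.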
Note that the above code doesn't seem to work with mawk, which instead requires the brackets in the character class to be escaped:
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", safe)
This seems to be a mawk-specific problem, as all the other awks I've tested (including GNU awk, busybox awk and the one true awk) accept the original syntax (which btw is also what seems to be the POSIX-compliant form).
Update 30/03/2010: The latest mawk snapshot (1.3.4-20100224) fixes the above issue and finally accepts the standard syntax.
Thanks for this. Based on tests with GNU Awk 4.1 this didn't work well for me when passing the regex string to an external program (tre-agrep) which supports regex statements. Here is a similar solution:
#
# Escape regex symbols
#
function regesc(str, safe) {
    safe = str
    gsub(/[][^$*?+{}\\()|]/, "[&]", safe)
    gsub("[\\^]", "\\^", safe)   # replace "[^]" with "[\^]"
    return safe
}
This uses square brackets instead of \ to escape, with a special case for ^, since a leading ^ has its own meaning (negation) inside a bracket expression.
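For reference, here is a small sketch of what regesc() produces for a sample string. The input "1+1=2^n" is made up, and the output was reasoned through for GNU awk; other awks may treat the backslash inside "[\\^]" differently.

```sh
# Run the function above on a hypothetical input and capture the result.
result=$(awk '
function regesc(str,   safe) {
    safe = str
    gsub(/[][^$*?+{}\\()|]/, "[&]", safe)   # wrap each metacharacter in brackets
    gsub("[\\^]", "\\^", safe)              # turn "[^]" into "[\^]"
    return safe
}
BEGIN { print regesc("1+1=2^n") }
')
printf '%s\n' "$result"   # prints: 1[+]1=2[\^]n
```

Note that, unlike escape_pattern(), the character class here does not include ".", so a dot in the input is left unescaped.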