This is definitely a FAQ. You have a pattern in a shell variable, and it can be an arbitrary string. How do you make sure that it is safe to use as the LHS, the RHS, or an address regex in sed?
First, there is a distinction to make. Things are different depending on whether you want to use your variable in the LHS of an s command or in an address regex, or in the RHS of an s command. The first case has two sub-cases, depending on whether you're using BREs (Basic Regular Expressions, the default in sed) or EREs (Extended Regular Expressions, for those seds that support them, like GNU sed). Each case is analyzed separately below.
Use in the LHS or in a regex address
So, you have your variable and want to be able to safely do something like
$ sed "s/${pattern}/FOOBAR/" file.txt
# or
$ sed "/${pattern}/{do something;}" file.txt
# Note double quotes rather than single quotes, so the shell can expand the variables
but of course the risk is that the variable contains slashes, or special regular expression characters that you don't want sed to interpret. In other words, you want sed to take your variable literally, whatever it might contain. You can do that by escaping all the problematic characters in the variable, and this escaping can itself be done with sed. If you're using BREs (Basic Regular Expressions, the default in sed), you can do something like this:
$ safe_pattern=$(printf "%s\n" "$pattern" | sed 's/[][\.*^$/]/\\&/g')
# now you can safely do
$ sed "s/${safe_pattern}/FOOBAR/g" file.txt
$ sed -n "/${safe_pattern}/p" file.txt
# these and the following are just examples, of course
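As a worked illustration of the BRE escaping above, here is a minimal sketch. The sample pattern and the input line are made up for the example, and the expected values in the comments assume a sed that treats an escaped ordinary character as a literal (GNU sed does).

```shell
#!/bin/sh
# Hypothetical sample value containing several BRE-special characters.
pattern='a.b*c[1]/d\e^f$'

# Escape every character in the list so that sed takes it literally.
safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[][\.*^$/]/\\&/g')
printf '%s\n' "$safe_pattern"    # a\.b\*c\[1\]\/d\\e\^f\$

# Replace the literal string inside a sample input line.
result=$(printf '%s\n' 'x a.b*c[1]/d\e^f$ y' | sed "s/${safe_pattern}/FOOBAR/")
printf '%s\n' "$result"          # x FOOBAR y
```

Note that printf is used with a '%s\n' format throughout, so backslashes in the data are never interpreted by the shell's own escape handling.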
If you're using EREs (Extended Regular Expressions), which are supported by GNU sed and some other implementations, then you need to include more characters in the list of the characters to escape:
$ safe_pattern=$(printf "%s\n" "$pattern" | sed 's/[][\.*^$(){}?+|/]/\\&/g')
# now you can safely do (GNU sed)
$ sed -r "s/${safe_pattern}/FOOBAR/g" file.txt
$ sed -n -r "/${safe_pattern}/p" file.txt
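The same idea can be sketched for EREs. The sample pattern below is invented for the example, and the sketch assumes a sed that accepts -E for extended regular expressions (GNU sed and the BSD seds do; -r, as used above, is the older GNU spelling).

```shell
#!/bin/sh
# Hypothetical sample value full of characters that are special in EREs.
pattern='(a+b)|{c}?'

# Same technique as for BREs, with the ERE-only characters added to the list.
safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[][\.*^$(){}?+|/]/\\&/g')
printf '%s\n' "$safe_pattern"    # \(a\+b\)\|\{c\}\?

# With -E, each escaped character is matched literally.
result=$(printf '%s\n' 'match (a+b)|{c}? here' | sed -E "s/${safe_pattern}/FOOBAR/")
printf '%s\n' "$result"          # match FOOBAR here
```

The escaping command itself still runs in BRE mode; only the command that consumes the escaped pattern needs the ERE switch.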
Use in the RHS
In the RHS of an s command, fewer characters are special. The slash is still special, of course (although it is special to sed as the delimiter, not to regexps). In addition, & is special because for sed it means "the entire substring that matched the LHS". Finally, backslashes are special, because they usually introduce backreferences or special escape sequences. So we need to escape all of these:
$ safe_replacement=$(printf "%s\n" "$replacement" | sed 's/[\&/]/\\&/g')
# now you can safely do
$ sed "s/something/${safe_replacement}/g" file.txt
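To see the RHS escaping at work, here is a small sketch. The replacement string is a made-up value chosen to contain a slash, an ampersand, and backslash sequences that would otherwise be read as an escape or a backreference.

```shell
#!/bin/sh
# Hypothetical replacement text containing "/", "&" and backslashes.
replacement='path/to/file & C:\temp\1'

# Escape the three characters that are special in the RHS of s.
safe_replacement=$(printf '%s\n' "$replacement" | sed 's/[\&/]/\\&/g')
printf '%s\n' "$safe_replacement"    # path\/to\/file \& C:\\temp\\1

# The replacement text is now inserted verbatim.
result=$(printf '%s\n' 'copy SRC now' | sed "s/SRC/${safe_replacement}/")
printf '%s\n' "$result"              # copy path/to/file & C:\temp\1 now
```

This sketch only covers single-line replacement strings, which is the case discussed above.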
Finally, keep in mind that all of the above is done to make sed treat the entire contents of the variable as literal. If the variable contains characters that you do want sed to treat as special, then you have to remove those characters from the list of characters to be escaped.
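As an illustration of removing a character from the list, the following sketch keeps the star as a quantifier while still escaping everything else. The sample pattern and input are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical pattern: "*" should keep its meaning, "." should be literal.
pattern='a*b.c'

# Same BRE list as before, but with "*" left out so it is not escaped.
safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[][\.^$/]/\\&/g')
printf '%s\n' "$safe_pattern"    # a*b\.c

# "a*" matches the run of a's; "\." matches only a literal dot.
result=$(printf '%s\n' 'xaaab.cx' | sed "s/${safe_pattern}/FOOBAR/")
printf '%s\n' "$result"          # xFOOBARx
```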
Update 25/09/2011: in fact, digging a bit deeper, the whole LHS escaping is more complex than described above. There are two related issues: bracket expressions, and BRE anchors ("^" and "$"). According to POSIX, anchors are special only when they occur in certain positions, and thus they should only be escaped when they are special, not anywhere else.
^
The <circumflex> shall be special when used as:
An anchor (see BRE Expression Anchoring)
The first character of a bracket expression (see RE Bracket Expression)
$
The <dollar-sign> shall be special when used as an anchor.
In general, escaping a character where it wouldn't be special produces undefined results (although in practice most implementations will just silently ignore the escape, this can probably not be relied upon). So the circumflex and the dollar are not special, except when used as anchors or, in the case of the circumflex, as the first character of a bracket expression. The section on anchors says that
A <circumflex> ('^') shall be an anchor when used as the first character of an entire BRE. The implementation may treat the <circumflex> as an anchor when used as the first character of a subexpression. The <circumflex> shall anchor the expression (or optionally subexpression) to the beginning of a string; only sequences starting at the first character of a string shall be matched by the BRE. For example, the BRE "^ab" matches "ab" in the string "abcdef", but fails to match in the string "cdefab". The BRE "\(^ab\)" may match the former string. A portable BRE shall escape a leading <circumflex> in a subexpression to match a literal circumflex.
A <dollar-sign> ('$') shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression) to the end of the string being matched; the <dollar-sign> can be said to match the end-of-string following the last character.
So we only have to escape the anchors where they are special:
$ safe_pattern=$(printf "%s\n" "$pattern" | sed 's/[][\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/')
This should be fine, for the following reasons.
The dot ("."), the star ("*") and the backslash can always be escaped. If they appear outside of a bracket expression, they have to be escaped anyway. If they appear inside a bracket expression, they would normally be literal, but since we're escaping the square brackets there is no bracket expression left in the result, so they have to be escaped there too or they would become special. Similarly, we don't have to worry about anchors in subexpressions: escaping the backslashes means there are no subexpressions in the final escaped pattern (every "\(", which is what would introduce a subexpression, is turned into "\\(").
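The position-aware escaping can be tried out with a sketch like the following. The sample pattern is made up so that it has a circumflex and a dollar both at the edges and in the middle. The expected values in the comments assume a sed that follows the POSIX BRE anchoring rules quoted above, as GNU sed does.

```shell
#!/bin/sh
# Hypothetical pattern with "^" and "$" at the edges and in the middle.
pattern='^a$b^c$'

# Escape the always-special characters everywhere, then only a trailing "$"
# and a leading "^", which are the positions where they act as anchors.
safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[][\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/')
printf '%s\n' "$safe_pattern"    # \^a$b^c\$

# The inner "$" and "^" are left alone and still match literally.
result=$(printf '%s\n' 'x ^a$b^c$' | sed "s/${safe_pattern}/FOOBAR/")
printf '%s\n' "$result"          # x FOOBAR
```

In "s/$$/\\&/" the first dollar is an ordinary character and the second is the anchor; in "s/^^/\\&/" the first circumflex is the anchor and the second is ordinary, which is the same positional rule the escaping relies on.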
EREs are less problematic, since the circumflex and the dollar are always considered anchors except when in bracket expressions; since we escape square brackets, then by the same logic described above for dot and star it's safe to escape them anywhere.
Any report about flaws in the above logic or other errors will be most welcome.