Using shell variables in sed

Posted by waldner on 9 December 2009, 4:48 pm

This is definitely a FAQ. So you have your PATTERN in a shell variable, and it can be an arbitrary string. How do you make sure that it is safe to use as the LHS, the RHS or an address regex in sed?

First, there is a distinction to make. Things are different depending on whether you want to use your variable in the LHS of an s command or in an address regex, or in the RHS of an s command. The first case has two sub-cases, depending on whether you're using BREs (Basic Regular Expressions, the default in sed) or EREs (Extended Regular Expressions, for those seds that support them, like GNU sed). Each case is analyzed separately below.

Use in the LHS or in a regex address

So, you have your variable and want to be able to safely do something like

$ sed "s/${pattern}/FOOBAR/" file.txt
# or
$ sed "/${pattern}/{do something;}" file.txt
# Note double quotes rather than single quotes, so the shell can expand the variables

but of course the risk is that the variable contains slashes, or special regular expression characters that you don't want sed to interpret. In other words, you want sed to take your variable literally, whatever it might contain. You can do that by escaping all the problematic characters in the variable, and this escaping can be done using sed as well. If you're using BREs (the default in sed), you can do something like this:

$ safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*^$/]/\\&/g')
# now you can safely do
$ sed "s/${safe_pattern}/FOOBAR/g" file.txt
$ sed -n "/${safe_pattern}/p" file.txt
# these and the following are just examples, of course
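
As a quick sanity check, the escaping above can be tried on a value that contains several special characters. This is only a sketch: the `escape_bre` helper name and the sample strings are made up for illustration, and it was written with GNU sed in mind.

```shell
# Hypothetical helper (not part of sed): escape a string so it can be
# used literally in a BRE, either in the LHS of s or in an address.
escape_bre() {
  printf '%s\n' "$1" | sed 's/[[\.*^$/]/\\&/g'
}

pattern='a.b*c/[d]'                      # sample value, for illustration only
safe_pattern=$(escape_bre "$pattern")
printf '%s\n' "$safe_pattern"            # shows the escaped form

# Only the line containing the literal string is changed.
printf '%s\n' 'a.b*c/[d]' 'aXbbc/d' | sed "s/${safe_pattern}/FOOBAR/"
```

The first input line becomes FOOBAR, while the second one is left alone because the dot and the star are no longer treated as regex operators.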

If you're using EREs (Extended Regular Expressions), which are supported by GNU sed and some other implementations, then you need to include more characters in the list of the characters to escape:

$ safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*^$(){}?+|/]/\\&/g')
# now you can safely do (GNU sed)
$ sed -r "s/${safe_pattern}/FOOBAR/g" file.txt
$ sed -n -r "/${safe_pattern}/p" file.txt
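
The same kind of check works for the ERE variant. Again a sketch only: `escape_ere` and the sample strings are hypothetical, and the example uses `-E`, which GNU sed accepts as a synonym for `-r` and which BSD seds also understand.

```shell
# Hypothetical helper: escape a string for literal use in an ERE.
escape_ere() {
  printf '%s\n' "$1" | sed 's/[[\.*^$(){}?+|/]/\\&/g'
}

pattern='(a|b)+?'                        # sample value, for illustration only
safe_pattern=$(escape_ere "$pattern")
printf '%s\n' "$safe_pattern"            # shows the escaped form

# Parentheses, pipe, plus and question mark are all matched literally.
printf '%s\n' 'x(a|b)+?y' 'xaby' | sed -E "s/${safe_pattern}/FOOBAR/"
```

Only the first input line is changed (to xFOOBARy); the second would have matched if the pattern had been used unescaped.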

Use in the RHS

In the RHS of an s command, fewer characters are special. The slash is still special of course (although that is special to sed, not to regexps). In addition, & is special because for sed it means "the entire substring that matched the LHS". Finally, backslashes are special, because they usually introduce backreferences or special escape sequences. So we need to escape all these:

$ safe_replacement=$(printf '%s\n' "$replacement" | sed 's/[\&/]/\\&/g')
# now you can safely do
$ sed "s/something/${safe_replacement}/g" file.txt
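
To see the RHS escaping at work, here is a small sketch with a replacement that contains all three special characters. The `escape_rhs` helper name and the sample strings are assumptions made for the example.

```shell
# Hypothetical helper: escape a string for literal use in the RHS of s.
escape_rhs() {
  printf '%s\n' "$1" | sed 's/[\&/]/\\&/g'
}

replacement='/usr/bin & \1'              # sample value, for illustration only
safe_replacement=$(escape_rhs "$replacement")
printf '%s\n' "$safe_replacement"        # shows the escaped form

# The slash, the ampersand and the backslash all come out literally.
printf '%s\n' 'path=something' | sed "s/something/${safe_replacement}/"
```

Without the escaping, the slashes alone would be enough to make sed reject the command.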

Finally, keep in mind that all the above is done to have sed treat all of the variable contents as literal. If the variable contains characters that you do want sed to treat as special, then you have to remove those characters from the list of characters to be escaped.
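
For instance, to keep the dot working as "any character" while everything else is taken literally, the dot can simply be dropped from the bracket expression. A sketch for the BRE case, with made-up variable names and sample strings:

```shell
pattern='a.c/d*'                         # sample value, for illustration only

# Same as the BRE escaping, but without the dot in the list,
# so the dot stays special for sed.
semi_safe=$(printf '%s\n' "$pattern" | sed 's/[[\*^$/]/\\&/g')
printf '%s\n' "$semi_safe"               # shows the partially escaped form

printf '%s\n' 'abc/d*' 'a.c/d*' 'abc/dd' | sed "s/${semi_safe}/FOOBAR/"
```

The first two lines are replaced because the dot matches any character; the third is not, since the star is still escaped and therefore literal.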

Update 25/09/2011: in fact, digging a bit deeper, the whole LHS escaping is more complex than described above. There are two related issues: bracket expressions, and BRE anchors ("^" and "$"). According to POSIX, anchors are special only when they occur in certain positions, and thus they should only be escaped when they are special, not anywhere else.

^
The <circumflex> shall be special when used as:

An anchor (see BRE Expression Anchoring )

The first character of a bracket expression (see RE Bracket Expression )

$
The <dollar-sign> shall be special when used as an anchor.

In general, escaping a character where it wouldn't be special produces undefined results (although in practice most implementations will just silently ignore the escape, this can probably not be relied upon). So the circumflex and the dollar are not special, except when used as anchors or, in the case of the circumflex, as the first character of a bracket expression. The section on anchors says that

A <circumflex> ( '^' ) shall be an anchor when used as the first character of an entire BRE. The implementation may treat the <circumflex> as an anchor when used as the first character of a subexpression. The <circumflex> shall anchor the expression (or optionally subexpression) to the beginning of a string; only sequences starting at the first character of a string shall be matched by the BRE. For example, the BRE "^ab" matches "ab" in the string "abcdef" , but fails to match in the string "cdefab" . The BRE "\(^ab\)" may match the former string. A portable BRE shall escape a leading <circumflex> in a subexpression to match a literal circumflex.

A <dollar-sign> ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression) to the end of the string being matched; the <dollar-sign> can be said to match the end-of-string following the last character.

So we only have to escape the anchors where they are special:

$ safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/')

This should be fine since it will turn "^abc[^def$]ghi$" into "\^abc\[^def$]ghi\$" (the closing bracket is not in the list, so it is left alone), and in the result the anchors are escaped only where they would be special.
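
The effect is easy to check on another string where the circumflex and the dollar appear both at the ends and in the middle. This is a sketch only: the `escape_bre_ctx` helper name and the sample strings are hypothetical, and the behaviour of unescaped anchors in the middle of a BRE was only considered for GNU sed.

```shell
# Hypothetical helper: context-sensitive escaping for a BRE.
# "^" is escaped only as the first character, "$" only as the last.
escape_bre_ctx() {
  printf '%s\n' "$1" | sed 's/[[\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/'
}

pattern='^1*2^3$4$'                      # sample value, for illustration only
safe_pattern=$(escape_bre_ctx "$pattern")
printf '%s\n' "$safe_pattern"            # shows the escaped form

# The whole string is matched literally, anchors included.
printf '%s\n' 'x ^1*2^3$4$ y' | sed "s/${safe_pattern}/FOOBAR/"
```

The circumflex and the dollar in the middle of the pattern are left untouched, and the substitution still replaces the literal string, giving "x FOOBAR y".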

The dot ("."), the star ("*") and the backslash can always be escaped. If they appear outside of a bracket expression, they have to be escaped anyway. If they appear inside a bracket expression, then since we're escaping the opening square bracket, the bracket expression no longer exists in the result, so they have to be escaped, otherwise they would become special. Similarly, we don't have to worry about anchors in subexpressions, since the escaping of backslashes means there are no subexpressions in the final escaped pattern (all the "\(", which is where subexpressions would be introduced, are turned into "\\(").

EREs are less problematic, since the circumflex and the dollar are always considered anchors except when in bracket expressions; since we escape square brackets, then by the same logic described above for dot and star it's safe to escape them anywhere.

Any report about flaws in the above logic or other errors will be most welcome.



21 Comments

Kusuma says:

February 20, 2015 at 16:58

How to escape single quote in RHS?

I tried like this
safe_replacement_oracle=$(printf '%s\n' "$ORACLE_PASS" | sed 's/[\&/'"]/\\&/g')

ORACLE_PASS is !_#%1

/cygdrive/D/KonyServer1/Bash_Scripts/User_Input.properties.bash: line 22: syntax error near unexpected token `('
/cygdrive/D/KonyServer1/Bash_Scripts/User_Input.properties.bash: line 22: `ORACLE_PASS='!_#%1''
- waldner says:
  
  February 20, 2015 at 17:41
  
  Your problem has nothing to do with sed, but rather with your shell. In any case, the single quote must NOT be escaped, as it's a perfectly normal character for sed.
  
  If you want to know how to insert a single quote in a single-quoted string, see for example this page: http://wiki.bash-hackers.org/syntax/quoting#strong_quoting
pierocampa says:

February 24, 2014 at 20:30

I g**damn love you. !!
- pierocampa says:
  
  February 25, 2014 at 09:11
  
  Hi man,
  a couple of questions:
  
  i. Are you missing the double-quotes `"' and the dots in the escape-set of RHS? --> sed 's/[\&/".]/\\&/g'
  ii. Is there some more char to be escaped in the RHS part in case I use extended regexp (-r) ?
  
  I cannot get it to work here:
  
  > repl="cat `pwd`/CSA-day.xml"
  > escaped_repl=$( eval "$repl" | sed 's/[\&/]/\\&/g' )
  > sed -r "s/something/${escaped_repl}/p" file
  sed: -e expression #1, char 68: unknown option to `s'
  
  Whereas replacing ${escaped_repl} with its actual value, things work fine (if I add the double-quotes in the set of chars to be escaped!)
  
  Any clue?
  - pierocampa says:
    
    February 25, 2014 at 09:18
    
    My replacement was split on several lines, and that was breaking the sed replacement, I believe.
    Appending a `tr -d '\n'` to remove newlines did the trick.
    - waldner says:
      
      February 25, 2014 at 09:39
      
      To have a literal newline in the RHS you have to escape it. Example follows without variables:
      
      $ echo foobar | sed 's/foo/X\
      Y/'
      X
      Ybar
      
      So if you use a variable, it must end up with the value (characters spaced for clarity, \n is a literal newline character)
      
      X \ \n Y
      
      Example:
      
      $ repl=$'X\\\nY'
      $ printf '%s' "$repl" | od -c
      0000000   X   \  \n   Y
      0000004
      $ echo foobar | sed "s/foo/$repl/"
      X
      Ybar
  - waldner says:
    
    February 25, 2014 at 09:32
    
    Irrespective of BRE or ERE, in the RHS only slashes, backslashes and ampersands are special (slashes only because it's sed's default separator).
    
    Double quotes and dots are not special:
```
$ echo 'foobar' | sed 's/foo/"../'
"..bar
```
    In your example, it would be interesting to see why you are using "eval" (which most of the time is unnecessary, when not downright evil) and what the resulting string looks like.
    - pierocampa says:
      
      February 25, 2014 at 09:55
      
      Thanks for your quick comments!
      I used `eval' because my replacement is yielded by the execution of a command (namely a `cat').
      
      I do not mind dropping the newlines in my specific application, but just in case: how would I escape a newline with sed?
      Tried with $ and \n but it does not work:
      
      > sed 's/[\&/$]/\\&/g'
      > sed 's/[\&/\n]/\\&/g'
      
      This is my $replacement by the way:
      
      {{{
      
      Coordinate system axis for the recording of days [d].
      http://www.opengis.net/def/axis/OGC/0/days
      day
      http://www.opengis.net/def/axisDirection/OGC/1.0/future
      
      }}}
      
      Again: it works out by escaping [\&/] (no need for quotes and dots, you were right), and dropping newlines \n.
      - waldner says:
        
        February 25, 2014 at 11:02
        
        So you have a variable whose value is, to continue with the last example
        
        X \n Y
        
        and it should become, to be used in the RHS,
        
        X \ \n Y
        
        (plus the other normal RHS escaping).
        
        Since sed doesn't see newlines directly, all you have to do is just to put a backslash at the end of each line (except the last), so you can do:
        
        $ repl1=$'X\nY'
        $ printf '%s' "$repl1" | od -c
        0000000   X  \n   Y
        0000003
        $ repl2=$(printf '%s' "$repl1" | sed 's/[\&/]/\\&/g; $!s/$/\\/')   # the trick is the second s/// command
        $ printf '%s' "$repl2" | od -c
        0000000   X   \  \n   Y
        0000004
        $ sed "s/blah/$repl2/" file
        ...
        
        Hope this helps.
William Castro says:

April 4, 2013 at 23:58

I'm trying to rename pathnames that contain spaces, to mark the time when tar is run, but I get the following message. I read the manual but I don't see the error. Any tip or guidance?

RUNNING: /usr/sfw/bin/gtar --transform=s/"VirtualBox VM"/"VirtualBox VM_04_04_2013_14:14"/ --show-transformed-names -clvMSpf /dev/rmt/0n /New_VDI/"VirtualBox VM"
/usr/sfw/bin/gtar: Invalid transform expression

Thanks William
- waldner says:
  
  April 5, 2013 at 09:29
  
  It works for me written the way you pasted it, running it on the command line. Perhaps there's something else that interpolates the command before gtar sees it.
  - William Castro says:
    
    April 5, 2013 at 21:19
    
    You're right. I did the same test and it works. The error is presented when it is executed from a bash shell script. Regarding that, I found a solution which was to give this command line as argument to the command sh -c
    
    Thanks,
Håkon A. Hjortland says:

March 8, 2013 at 11:32

I would generally use
printf '%s\n' ...
rather than
printf "%s\n" ...
to avoid the extra escaping-level which double-quoting gives.

A newline in the middle of the variable value will cause sed to exit with an error. E.g.:
pattern='x
y'
Newlines can be filtered out with:
tr -d '\n'
- waldner says:
  
  March 8, 2013 at 12:21
  
  Regarding the double vs. single quote issue you're correct, although for the string in question ( %s\n ) the result is the same (apart from the fact that the shell peeks into the double quoted version), since there's no special character inside. But I agree that in this case double quotes are gratuitous and unnecessary. I've updated the code to use single quotes.
  
  Regarding newlines, again you're correct. The article was written with "normal" string variables in mind. If the input variable contains newlines, they can either be removed as you show, or they can be properly escaped to be used in sed. Simple example:
```
$ pattern='a
> b'
$ safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/; $!s/$/\\/')
$ echo "$safe_pattern"
a\
b
$ echo 'z
> a
> b
> c' | sed "\$!N; s/$safe_pattern/XXX/; P; D"
z
XXX
c
```
  Thanks!
  - Håkon A. Hjortland says:
    
    March 8, 2013 at 15:17
    
    (The system ate my angle brackets. Trying again.)
    
    Come to think of it, in this case it's actually sufficient with:
    printf '%s' ...
    instead of
    printf '%s\n' ...
    
    I think POSIX specifies that newlines should be given like this:
```
sed 's/\n/\
/g'
```
    I.e. '\n' in the regexp and '\<newline>' in the replacement.
    
    From http://pubs.opengroup.org/onlinepubs/009695399/utilities/sed.html:
    
    "The escape sequence '\n' shall match a <newline> embedded in the pattern space. A literal <newline> shall not be used in the BRE of a context address or in the substitute function."
    
    "A line can be split by substituting a <newline> into it. The application shall escape the <newline> in the replacement by preceding it by a backslash."
    - waldner says:
      
      March 8, 2013 at 17:23
      
      Yes, using '%s\n' and just '%s' is probably equivalent in this case. However, I prefer the '%s\n' form because we're sending the output of printf to sed, and sed operates on well-specified text input, where "well-specified" means "all lines are terminated by the newline character" (this is, in essence, the POSIX definition of text file).
      
      Good catch regarding the literal newline in the LHS or address expression, I had always thought that the escaped literal newline would be allowed everywhere, but I was wrong (btw the latest version of the standard is at http://pubs.opengroup.org/onlinepubs/9699919799/). So if the pattern is being used in the LHS or in an address expression, newlines have to be replaced with the string '\n', eg
      
      $ safe_pattern=$(printf '%s\n' "$pattern" | sed ':a; $!{N; ba;}; s/[[\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/; s/\n/\\n/g')
      $ echo "$safe_pattern"
      a\nb
      - Håkon A. Hjortland says:
        
        March 10, 2013 at 01:31
        
        Ah, I didn't know about the POSIX definition of a text file. Thanks.
        
        Seems like that was an old version of the standard, yes. Good idea to always use the latest version.
        
        I think FreeBSD/MacOS requires labels in sed to be terminated with newlines, not semicolons.
        See http://stackoverflow.com/questions/12272065/sed-undefined-label-on-macos .
        I have made myself this snippet:
        
        sed_read_all_lines=':a;$!{N;ba;}'
        sed_read_all_lines="$(printf ':a\n$!{N\nba\n}')"   # For compatibility with FreeBSD / MacOS
        sed "$sed_read_all_lines;s/foo/bar/g"
        
        I don't think ']' should be escaped.
        In http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
        it's not mentioned in "9.3.3 BRE Special Characters", and then there's this:
        
        QUOTED_CHAR In a BRE, one of the character sequences: \^ \. \* \[ \$ \\
        
        Concerning the context-sensitive escaping of '^' and '$':
        You write:
        "In general, escaping a character where it wouldn't be special produces undefined
        results (although in practice most implementations will just silently ignore the
        escape, this can probably not be relied upon)."
        Does it really say this in the standard?
        From http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html :
        "9.3.2 BRE Ordinary Characters
        An ordinary character is a BRE that matches itself: any character in the
        supported character set, except for the BRE special characters listed in
        BRE Special Characters .
        The interpretation of an ordinary character preceded by a <backslash>
        ( '\\' ) is undefined, ..."
        "9.3.3 BRE Special Characters
        A BRE special character has special properties in certain contexts. Outside
        those contexts, or when preceded by a <backslash>, such a character is
        a BRE that matches the special character itself. The BRE special characters
        and the contexts in which they have their special meaning are as follows:"
        So it seems that e.g. '^' is always a "special character",
        but that it sometimes has "special properties".
        It never becomes an "ordinary character" and thus undefined
        with a <backslash> in front of it.
        But in these parts of the standard which I have looked at,
        this behaviour is not very explicitly stated.
        Have you found more explicit stating of the way you understand it?
        A problem arises here if context-sensitive escaping is really required:
        
        pattern='^FOO'
        safe_pattern=...
        sed "s/X=${safe_pattern}/X=^BAR/g"
        # Which equals:
        sed 's/X=\^FOO/X=^BAR/g'
        # But which should be (if context-sensitive escaping is required):
        sed 's/X=^FOO/X=^BAR/g'
        
        waldner says:
        
        March 10, 2013 at 21:30
        
        Again, you make some good points.
        
        Regarding which commands can be separated by a semicolon and which need a newline (or a separate -e code fragment), it may very well be that labels belong to the second group, and GNU sed (which is where I tried the examples) additionally accepts the semicolon as an enhancement. That part of the sed specification has always been a bit obscure (and admittedly, I haven't been trying to understand it very hard, anyway). The standard says:
        
        Command verbs other than {, a, b, c, i, r, t, w, :, and # can be followed by a <semicolon>, optional <blank> characters, and another command verb. However, when the s command verb is used with the w flag, following it with another command in this manner produces undefined results.
        
        so it would look like ":" can NOT be followed by semicolon, and you may very well be correct.
        
        Regarding "]", you're correct, and for some reason I even thought that I was already not including it among the characters that have to be escaped, but I was wrong. I've fixed it now.
        
        I have to admit that your reading of the "specialness" of ^ and $ makes sense although, like you, I haven't found a clearer statement about that. If they can always be escaped safely, then we can just (re)include them among the list of characters that are escaped, thus
        
        safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*^$/]/\\&/g')
        
        should do the right thing. If context-sensitive escaping is wanted, then they should be escaped only where they would be special, so using the code shown in the "update" part of the article, which basically escapes "^" only if it's the very first character, and "$" if it's the last. Currently I'm 60/40 in favour of your interpretation (which has the side effect of making things easier).
        
        Thanks.
paz says:

August 17, 2012 at 17:42

you are only using double quotes with your variable examples... how would you do it with a single quote?
there is several cases such as:

# who -u | sed "/root /!d"

where it requires to use a single quote:

# who -u | sed '/$USER_ID /!d'

what now?
- waldner says:
  
  August 17, 2012 at 17:53
  
  If you use single quotes the shell doesn't expand the variable, so when you have a variable you have to use double quotes.
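
  If the rest of the sed code really has to stay in single quotes, the two quoting styles can be mixed, closing the single quotes around the variable only. A minimal sketch, where the sample value and the fake "who" output are made up for illustration:

```shell
USER_ID='root'                           # sample value, for illustration only
who_output='root     pts/0
alice    pts/1'                          # stand-in for the output of who -u

# The sed code is single-quoted; only the variable is in double quotes.
printf '%s\n' "$who_output" | sed '/'"$USER_ID"' /!d'
```

  Only the line containing "root " survives. Keeping the "!" inside the single-quoted part also avoids history expansion in an interactive bash, which can otherwise act on a "!" inside double quotes. The variable itself still needs the escaping described in the article if it can contain arbitrary characters.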
satish says:

June 8, 2012 at 07:50

thank you, that helped.
