Using shell variables in sed

This is definitely a FAQ. You have a pattern in a shell variable, and it can be an arbitrary string. How do you make sure that it is safe to use as an LHS, RHS or address regex in sed?

First, there is a distinction to make. Things differ depending on whether you want to use the variable in the LHS of an s command (or in an address regex), or in the RHS of an s command. The first case has two sub-cases, depending on whether you're using BREs (Basic Regular Expressions, the default in sed) or EREs (Extended Regular Expressions, for those seds that support them, like GNU sed). Each case is analyzed separately below.

Use in the LHS or in a regex address

So, you have your variable and want to be able to safely do something like

$ sed "s/${pattern}/FOOBAR/" file.txt
# or
$ sed "/${pattern}/{do something;}" file.txt
# Note double quotes rather than single quotes, so the shell can expand the variables

but of course the risk is that the variable contains slashes, or special regular expression characters that you don't want sed to interpret. In other words, you want sed to take the variable literally, whatever it might contain. You can do that by escaping all the problematic characters in the variable, and this escaping can itself be done with sed. If you're using BREs (the default in sed), you can do something like this:

$ safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*^$/]/\\&/g')
# now you can safely do
$ sed "s/${safe_pattern}/FOOBAR/g" file.txt
$ sed -n "/${safe_pattern}/p" file.txt
# these and the following are just examples, of course
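
As a quick sanity check of the BRE escaping above, here is a minimal sketch that wraps it in a function and runs it on a sample string. The function name escape_bre and the sample data are made up for illustration. The first input line would match the unescaped regex "1.5*2", so it shows that the escaped pattern really is taken literally.

```shell
# escape_bre is a hypothetical helper name, not something sed or the shell provides
escape_bre() {
  printf '%s\n' "$1" | sed 's/[[\.*^$/]/\\&/g'
}

pattern='1.5*2/3'
safe_pattern=$(escape_bre "$pattern")
printf '%s\n' "$safe_pattern"     # the escaped form that sed will see

# Only the second line contains the literal string, so only that one is replaced
printf '%s\n' '1x5552/3' '1.5*2/3' | sed "s/${safe_pattern}/FOOBAR/"
```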

If you're using EREs (Extended Regular Expressions), which are supported by GNU sed and some other implementations, then you need to include more characters in the list of the characters to escape:

$ safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*^$(){}?+|/]/\\&/g')
# now you can safely do (GNU sed)
$ sed -r "s/${safe_pattern}/FOOBAR/g" file.txt
$ sed -n -r "/${safe_pattern}/p" file.txt
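
The same kind of check for the ERE variant, again only as a sketch: escape_ere is a made-up name, and the example assumes a sed that accepts -E for extended regular expressions (GNU sed also accepts -r, as used above). Unescaped, the sample pattern would be an alternation that matches "aab"; escaped, it only matches the literal text.

```shell
# escape_ere is a hypothetical helper name
escape_ere() {
  printf '%s\n' "$1" | sed 's/[[\.*^$(){}?+|/]/\\&/g'
}

pattern='(a+b)|c?'
safe_pattern=$(escape_ere "$pattern")
printf '%s\n' "$safe_pattern"     # the escaped form that sed will see

# Assumes sed supports -E; only the literal occurrence is replaced
printf '%s\n' 'aab' 'x(a+b)|c?y' | sed -E "s/${safe_pattern}/FOOBAR/"
```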

Use in the RHS

In the RHS of an s command, fewer characters are special. The slash is still special of course (although it is special to sed as the delimiter, not to regexps). In addition, & is special because for sed it means "the entire substring that matched the LHS". Finally, backslashes are special, because they introduce backreferences or special escape sequences. So we need to escape all of these:

$ safe_replacement=$(printf '%s\n' "$replacement" | sed 's/[\&/]/\\&/g')
# now you can safely do
$ sed "s/something/${safe_replacement}/g" file.txt
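
To see the RHS escaping at work, here is a small sketch with a replacement that contains all three special characters. The name escape_rhs and the sample strings are assumptions made for the example; the replacement is a single line (newlines need extra care, as discussed in the comments).

```shell
# escape_rhs is a hypothetical helper name
escape_rhs() {
  printf '%s\n' "$1" | sed 's/[\&/]/\\&/g'
}

replacement='A&B/C\1'
safe_replacement=$(escape_rhs "$replacement")
printf '%s\n' "$safe_replacement"     # the escaped form that sed will see

# The replacement text ends up in the output unchanged
printf '%s\n' 'say something' | sed "s/something/${safe_replacement}/"
```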

Finally, keep in mind that all of the above is done to have sed treat the whole variable contents as literal. If the variable contains characters that you do want sed to treat as special, then you have to remove those characters from the list of characters to be escaped.
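
For instance, suppose you want "^" and "$" in the variable to keep working as anchors while everything else is taken literally. A possible sketch (the function name and the sample data are made up) simply leaves those two characters out of the BRE list:

```shell
# escape_bre_keep_anchors is a hypothetical helper name;
# it is the BRE list from above without ^ and $
escape_bre_keep_anchors() {
  printf '%s\n' "$1" | sed 's/[[\.*/]/\\&/g'
}

pattern='^1.5'
safe_pattern=$(escape_bre_keep_anchors "$pattern")

# ^ still anchors at the start of the line, the dot is literal,
# so only the first line is printed
printf '%s\n' '1.5 apples' '21.5 pears' '1x5 plums' | sed -n "/${safe_pattern}/p"
```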

Update 25/09/2011: in fact, digging a bit deeper, the whole LHS escaping is more complex than described above. There are two related issues: bracket expressions, and BRE anchors ("^" and "$"). According to POSIX, anchors are special only when they occur in certain positions, and thus they should only be escaped when they are special, not anywhere else.

^
The <circumflex> shall be special when used as:

An anchor (see BRE Expression Anchoring )

The first character of a bracket expression (see RE Bracket Expression )

$
The <dollar-sign> shall be special when used as an anchor.

In general, escaping a character where it wouldn't be special produces undefined results (although in practice most implementations will just silently ignore the escape, this can probably not be relied upon). So the circumflex and the dollar are not special, except when used as anchors or, in the case of the circumflex, as the first character of a bracket expression. The section on anchors says that

A <circumflex> ( '^' ) shall be an anchor when used as the first character of an entire BRE. The implementation may treat the <circumflex> as an anchor when used as the first character of a subexpression. The <circumflex> shall anchor the expression (or optionally subexpression) to the beginning of a string; only sequences starting at the first character of a string shall be matched by the BRE. For example, the BRE "^ab" matches "ab" in the string "abcdef" , but fails to match in the string "cdefab" . The BRE "\(^ab\)" may match the former string. A portable BRE shall escape a leading <circumflex> in a subexpression to match a literal circumflex.

A <dollar-sign> ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression. The <dollar-sign> shall anchor the expression (or optionally subexpression) to the end of the string being matched; the <dollar-sign> can be said to match the end-of-string following the last character.

So we only have to escape the anchors where they are special:

$ safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/')

This should be fine since it will turn "^abc[^def$]ghi$" into "\^abc\[^def$]ghi\$" (the closing bracket is not special on its own, so it is left alone), and in the result the anchors are escaped only where they would be special.
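
The context-sensitive variant can be tried out in the same way. This is only a sketch: escape_bre_ctx is a made-up name, and it has the behaviour described above only to the extent that the sed in use treats "^" and "$" as literal when they are not in anchor position. Note that "]" is not in the list of escaped characters, so it comes through untouched.

```shell
# escape_bre_ctx is a hypothetical helper name
escape_bre_ctx() {
  printf '%s\n' "$1" | sed 's/[[\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/'
}

pattern='^abc[^def$]ghi$'
safe_pattern=$(escape_bre_ctx "$pattern")
printf '%s\n' "$safe_pattern"     # only the leading ^ and the trailing $ are escaped

# The pattern matches its own literal text in the middle of a line
printf '%s\n' 'x ^abc[^def$]ghi$ y' | sed "s/${safe_pattern}/FOOBAR/"
```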

The dot ("."), the star ("*") and the backslash can always be escaped: if they appear outside of a bracket expression, they have to be escaped anyway; if they appear inside a bracket expression, then, since we're escaping the opening square bracket, the bracket expression no longer exists in the result and they would become special there unless escaped. Similarly, we don't have to worry about anchors in subexpressions, since the escaping of backslashes means there are no subexpressions in the final escaped pattern (every "\(", which is where a subexpression would be introduced, is turned into "\\(").

EREs are less problematic, since the circumflex and the dollar are always considered anchors except when in bracket expressions; since we escape the opening square bracket, by the same logic described above for dot and star it's safe to escape them anywhere.

Any report about flaws in the above logic or other errors will be most welcome.

19 Comments

    • pierocampa says:

      Hi man,
      a couple of questions:

      i. Are you missing the double-quotes `"' and the dots in the escape-set of RHS? --> sed 's/[\&/".]/\\&/g'
      ii. Is there some more char to be escaped in the RHS part in case I use extended regexp (-r) ?

      I cannot get it to work here:

      > repl="cat `pwd`/CSA-day.xml"
      > escaped_repl=$( eval "$repl" | sed 's/[\&/]/\\&/g' )
      > sed -r "s/something/${escaped_repl}/p" file
      sed: -e expression #1, char 68: unknown option to `s'

      Whereas replacing ${escaped_repl} with its actual value, things work fine (if I add the double-quotes in the set of chars to be escaped!)

      Any clue?

      • pierocampa says:

        My replacement was split on several lines, and that was breaking the sed replacement, I believe.
        Appending tr -d '\n' to remove the newlines did the trick.

        • waldner says:

          To have a literal newline in the RHS you have to escape it. Example follows without variables:

          $ echo foobar | sed 's/foo/X\
          Y/'
          X
          Ybar

          So if you use a variable, it must end up with the value (characters spaced for clarity, \n is a literal newline character)

          X  \  \n  Y

          Example:

          $ repl=$'X\\\nY'
          $ printf '%s' "$repl" | od -c
          0000000   X   \  \n   Y
          0000004
          $ echo foobar | sed "s/foo/$repl/"
          X
          Ybar
      • waldner says:

        Irrespective of BRE or ERE, in the RHS only slashes, backslashes and ampersands are special (slashes only because the slash is sed's default separator).

        Double quotes and dots are not special:

        $ echo 'foobar' | sed 's/foo/"../'
        "..bar

        In your example, it would be interesting to see why you are using "eval" (which most of the time is not necessary when not downright evil) and what the resulting string looks like.

        • pierocampa says:

          Thanks for your quick comments!
          I used `eval' because my replacement is yielded by the execution of a command (namely a `cat').

          I do not mind dropping the newlines in my specific application, but just in case: how would I escape a newline with sed?
          I tried with $ and \n, but it does not work:

          > sed 's/[\&/$]/\\&/g'
          > sed 's/[\&/\n]/\\&/g'

          This is my $replacement by the way:

          {{{

          Coordinate system axis for the recording of days [d].
          http://www.opengis.net/def/axis/OGC/0/days
          day
          http://www.opengis.net/def/axisDirection/OGC/1.0/future

          }}}

          Again: it works out by escaping [\&/] (no need for quotes and dots, you were right), and dropping newlines \n.

          • waldner says:

            So you have a variable whose value is, to continue with the last example

            X  \n  Y

            and it should become, to be used in the RHS,

            X  \  \n  Y

            (plus the other normal RHS escaping).

            Since sed doesn't see newlines directly, all you have to do is just to put a backslash at the end of each line (except the last), so you can do:

            $ repl1=$'X\nY'
            $ printf '%s' "$repl1" | od -c
            0000000   X  \n   Y
            0000003
            $ repl2=$(printf '%s' "$repl1" | sed 's/[\&/]/\\&/g; $!s/$/\\/')    # the trick is the second s/// command
            $ printf '%s' "$repl2" | od -c
            0000000   X   \  \n   Y
            0000004
            $ sed "s/blah/$repl2/" file ...
            

            Hope this helps.

  1. I'm trying to rename pathnames that contain spaces, to mark the time when tar is run, but I get the following message. I read the manual but I don't see the error. Any tip or guidance?

    RUNNING: /usr/sfw/bin/gtar --transform=s/"VirtualBox VM"/"VirtualBox VM_04_04_2013_14:14"/ --show-transformed-names -clvMSpf /dev/rmt/0n /New_VDI/"VirtualBox VM"
    /usr/sfw/bin/gtar: Invalid transform expression

    Thanks William

  2. I would generally use
    printf '%s\n' ...
    rather than
    printf "%s\n" ...
    to avoid the extra escaping-level which double-quoting gives.

    A newline in the middle of the variable value will cause sed to exit with an error. E.g.:
    pattern='x
    y'
    Newlines can be filtered out with:
    tr -d '\n'

    • waldner says:

      Regarding the double vs. single quote issue you're correct, although for the string in question ( %s\n ) the result is the same (apart from the fact that the shell peeks into the double quoted version), since there's no special character inside. But I agree that in this case double quotes are gratuitous and unnecessary. I've updated the code to use single quotes.

      Regarding newlines, again you're correct. The article was written with "normal" string variables in mind. If the input variable contains newlines, they can either be removed as you show, or they can be properly escaped to be used in sed. Simple example:

      $ pattern='a
      > b'
      $ safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/; $!s/$/\\/')
      $ echo "$safe_pattern"
      a\
      b
      $ echo 'z
      > a
      > b
      > c' | sed "\$!N; s/$safe_pattern/XXX/; P; D"
      z
      XXX
      c
      

      Thanks!

      • (The system ate my angle brackets. Trying again.)

        Come to think of it, in this case it's actually sufficient with:
        printf '%s' ...
        instead of
        printf '%s\n' ...

        I think POSIX specifies that newlines should be given like this:

        sed 's/\n/\
        /g'
        

        I.e. '\n' in the regexp and '\<newline>' in the replacement.

        From http://pubs.opengroup.org/onlinepubs/009695399/utilities/sed.html:

        "The escape sequence '\n' shall match a <newline> embedded in the pattern space. A literal <newline> shall not be used in the BRE of a context address or in the substitute function."

        "A line can be split by substituting a <newline> into it. The application shall escape the <newline> in the replacement by preceding it by a backslash."

        • waldner says:

          Yes, using '%s\n' and just '%s' is probably equivalent in this case. However, I prefer the '%s\n' form because we're sending the output of printf to sed, and sed operates on well-specified text input, where "well-specified" means "all lines are terminated by the newline character" (this is, in essence, the POSIX definition of text file).

          Good catch regarding the literal newline in the LHS or address expression, I had always thought that the escaped literal newline would be allowed everywhere, but I was wrong (btw the latest version of the standard is at http://pubs.opengroup.org/onlinepubs/9699919799/). So if the pattern is being used in the LHS or in an address expression, newlines have to be replaced with the string '\n', eg

          $ safe_pattern=$(printf '%s\n' "$pattern" | sed ':a; $!{N; ba;}; s/[[\.*/]/\\&/g; s/$$/\\&/; s/^^/\\&/; s/\n/\\n/g')
          $ echo "$safe_pattern"
          a\nb
          
          • Ah, I didn't know about the POSIX definition of a text file. Thanks.

            Seems like that was an old version of the standard, yes. Good idea to always use the latest version.

            I think FreeBSD/MacOS requires labels in sed to be terminated with newlines, not semicolons.
            See http://stackoverflow.com/questions/12272065/sed-undefined-label-on-macos .
            I have made myself this snippet:

            sed_read_all_lines=':a;$!{N;ba;}'
            sed_read_all_lines="$(printf ':a\n$!{N\nba\n}')"   # For compatibility with FreeBSD / MacOS
            sed "$sed_read_all_lines;s/foo/bar/g"
            

            I don't think ']' should be escaped.
            In http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
            it's not mentioned in "9.3.3 BRE Special Characters", and then there's this:

            QUOTED_CHAR
                In a BRE, one of the character sequences:
                \^    \.   \*    \[    \$    \\
            

            Concerning the context-sensitive escaping of '^' and '$':
            You write:
            "In general, escaping a character where it wouldn't be special produces undefined
            results (although in practice most implementations will just silently ignore the
            escape, this can probably not be relied upon)."
            Does it really say this in the standard?
            From http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html :
            "9.3.2 BRE Ordinary Characters
            An ordinary character is a BRE that matches itself: any character in the
            supported character set, except for the BRE special characters listed in
            BRE Special Characters .
            The interpretation of an ordinary character preceded by a <backslash>
            ( '\\' ) is undefined, ..."
            "9.3.3 BRE Special Characters
            A BRE special character has special properties in certain contexts. Outside
            those contexts, or when preceded by a <backslash>, such a character is
            a BRE that matches the special character itself. The BRE special characters
            and the contexts in which they have their special meaning are as follows:"
            So it seems that e.g. '^' is always a "special character",
            but that it sometimes has "special properties".
            It never becomes an "ordinary character" and thus undefined
            with a <backslash> in front of it.
            But in these parts of the standard which I have looked at,
            this behaviour is not very explicitly stated.
            Have you found more explicit stating of the way you understand it?
            A problem arises here if context-sensitive escaping is really required:

            pattern='^FOO'
            safe_pattern=...
            sed "s/X=${safe_pattern}/X=^BAR/g"
            # Which equals:
            sed 's/X=\^FOO/X=^BAR/g'
            # But which should be (if context-sensitive escaping is required):
            sed 's/X=^FOO/X=^BAR/g'
            
            • waldner says:

              Again, you make some good points.

              Regarding which commands can be separated by a semicolon and which need a newline (or a separate -e code fragment), it may very well be that labels belong to the second group, and GNU sed (which is where I tried the examples) additionally accepts the semicolon as an enhancement. That part of the sed specification has always been a bit obscure (and admittedly, I haven't been trying to understand it very hard, anyway). The standard says:

              Command verbs other than {, a, b, c, i, r, t, w, :, and # can be followed by a <semicolon>, optional <blank> characters, and another command verb. However, when the s command verb is used with the w flag, following it with another command in this manner produces undefined results.

              so it would look like ":" can NOT be followed by semicolon, and you may very well be correct.

              Regarding "]", you're correct, and for some reason I even thought that I was already not including it among the characters that have to be escaped, but I was wrong. I've fixed it now.

              I have to admit that your reading of the "specialness" of ^ and $ makes sense although, like you, I haven't found a clearer statement about that. If they can always be escaped safely, then we can just (re)include them among the list of characters that are escaped, thus

              safe_pattern=$(printf '%s\n' "$pattern" | sed 's/[[\.*^$/]/\\&/g')

              should do the right thing. If context-sensitive escaping is wanted, then they should be escaped only where they would be special, so using the code shown in the "update" part of the article, which basically escapes "^" only if it's the very first character, and "$" if it's the last. Currently I'm 60/40 in favour of your interpretation (which has the side effect of making things easier).

              Thanks.

  3. paz says:

    You are only using double quotes in your variable examples... how would you do it with single quotes?
    There are several cases such as:

    # who -u | sed "/root /!d"

    where it requires to use a single quote:

    # who -u | sed '/$USER_ID /!d'

    what now?

  4. satish says:

    thank you, that helped.
