How to match newlines in sed

Posted by waldner on 23 December 2009, 9:40 pm

Lots of sed newcomers ask why the following snippets of code, or some variation thereof, don't work (they actually work as expected; it's just that the results are not what they think should be):

# All these do NOT produce the expected result!
sed 's/\n//g'             # remove all newline characters
sed 's/PATTERN\n//'       # if the line ends in PATTERN, join it with the next line
sed 's/FOO\nBAR/FOOBAR/'  # if a line ends in FOO and the next starts with BAR, join them

To understand why those "don't work", it's necessary to look at how sed reads its input.
Basically, sed reads only one line at a time and, unless you perform special actions, there is always a single input line in the pattern space at any time. That line does NOT have a trailing newline character, because sed removes it. When the line is printed at the end of the cycle, sed adds back a newline character, but while the line is in the pattern space, there's simply no \n in it. Now it's easy to see why none of the above programs will do what you think: the lhs (left hand side) will never match what's in the pattern space, so no replacement will be performed. A newline can end up in the pattern space, however, when you ask sed to perform certain commands that put one there.
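A quick way to check this is the l command, which prints the pattern space unambiguously ($ marks its end, and an embedded newline would show up as \n). The following is only a small sketch, assuming a POSIX shell and GNU sed; the two-line sample input is made up for illustration.

```shell
# Sample input (hypothetical): two lines, "a" and "b".
# 'l' shows each pattern space exactly as sed sees it; no \n appears in it.
single=$(printf 'a\nb\n' | sed -n 'l')
printf '%s\n' "$single"

# The lhs \n never matches, so the input passes through unchanged.
unchanged=$(printf 'a\nb\n' | sed 's/\n//g')
printf '%s\n' "$unchanged"
```

The first command should print a$ and b$ on separate lines, one per cycle, and the second should print the two input lines untouched.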

So the next question is: how to do the things that the above programs wrongly attempted to do?

Three not-so-well-known commands that are useful for these applications are N, P and D.

N reads in another line of input and appends it to the current pattern space, separated by a newline;
P prints the contents of the pattern space, up to the first newline (or to the end if there is no newline);
D deletes the contents of the pattern space, up to the first newline (or to the end if there is no newline), and starts a new cycle; if anything is left in the pattern space, the new cycle starts with that text and no new input line is read. Starting a new cycle means that any commands that come after the D in the sed program will not be executed if D itself is executed.
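To see how the three commands interact, one can again dump the pattern space with l. This is just an illustrative sketch, assuming a POSIX shell and GNU sed, with made-up sample input; $!N means "run N unless this is the last line".

```shell
# N appends the next line, l shows the pattern space, D drops the first line
# and restarts the cycle with what is left: a two-line sliding window.
windows=$(printf 'one\ntwo\nthree\n' | sed -n '$!N;l;D')
printf '%s\n' "$windows"

# P prints only the part of the pattern space before the first newline.
first=$(printf 'one\ntwo\n' | sed -n 'N;P')
printf '%s\n' "$first"
```

The first command should show the windows one\ntwo$, two\nthree$ and three$; the second should print only one, even though the pattern space also holds two.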

So let's put these commands to good use:

sed ':begin;$!N;s/\n//;tbegin'                   # deletes all newlines except the last; see also tr -d '\n'
sed ':begin;$!N;s/\n/ /;tbegin'                  # same as before, but replaces newlines with spaces; see also tr '\n' ' '
sed ':begin;$!N;s/\(PATTERN\)\n/\1/;tbegin;P;D'  # if the line ends in PATTERN, join it with the next line
sed ':begin;$!N;/PATTERN\n/s/\n//;tbegin;P;D'    # same as above
sed ':begin;$!N;s/FOO\nBAR/FOOBAR/;tbegin;P;D'   # if a line ends in FOO and the next starts with BAR, join them

The programs that join lines, above, keep joining lines as long as the conditions for joining with the next line are true. Note that the mentioned solutions based on tr are not exactly equivalent, in that they will remove or replace the very last newline of the input too, meaning that the output won't be terminated by \n.
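As a rough check of the behaviour described above, the programs can be run on a few lines of sample input. This is a sketch only, assuming a POSIX shell and GNU sed (which accepts ; after a label); the input strings and variable names are made up for illustration.

```shell
# Delete all newlines except the last one.
joined=$(printf 'a\nb\nc\n' | sed ':begin;$!N;s/\n//;tbegin')
printf '%s\n' "$joined"

# Join a line ending in FOO with a following line starting with BAR.
foobar=$(printf 'xFOO\nBARy\nz\n' | sed ':begin;$!N;s/FOO\nBAR/FOOBAR/;tbegin;P;D')
printf '%s\n' "$foobar"

# Joining continues for as long as the condition keeps holding.
chained=$(printf 'aFOO\nBARFOO\nBAR\n' | sed ':begin;$!N;s/FOO\nBAR/FOOBAR/;tbegin;P;D')
printf '%s\n' "$chained"

# Count the newlines left in the output: sed keeps the final one, tr does not.
sed_newlines=$(printf 'a\nb\nc\n' | sed ':begin;$!N;s/\n//;tbegin' | wc -l | tr -d ' ')
tr_newlines=$(printf 'a\nb\nc\n' | tr -d '\n' | wc -l | tr -d ' ')
printf 'sed: %s, tr: %s\n' "$sed_newlines" "$tr_newlines"
```

With this input the sed version leaves exactly one newline in its output, while the tr version leaves none.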

For more information, see the sed FAQ, especially this section, and the sed oneliners.

8 Comments

Robin says:

February 8, 2019 at 06:30

Why is it so slow:

cd "$(mktemp -d)"
tr -dc '[:print:]' file.txt
time sed ':begin;$!N;s/\n/ /;tbegin' < file.txt > processed.txt

# real 2m29.278s
# user 2m28.486s
# sys 0m0.022s

and this (https://stackoverflow.com/a/1252191) so fast:

time sed ':begin;N;$!bbegin;s/\n/ /g' < file.txt > processed.txt

# real 0m0.026s
# user 0m0.013s
# sys 0m0.013s
- waldner says:
  
  February 11, 2019 at 16:24
  
  The first version has to repeatedly scan (once per input line) an ever-increasing pattern space. The s/\n/ / takes longer and longer as input lines accumulate.
  The second version slurps the whole file first, and then runs a *single* s/\n/ /g command, which goes over the whole file just once.
Ingmar Boddington says:

May 23, 2013 at 14:20

But why are you using ':begin' and ';t'? Your explanation is incomplete. Shame...
- waldner says:
  
  May 23, 2013 at 14:33
  
  The article is not meant to be a sed tutorial, of which there are plenty already.
  - Ingmar Boddington says:
    
    May 23, 2013 at 17:25
    
    Hi, that is (kind of) true - it is after all an article called 'HOW TO match newlines in sed'. Simply explaining your examples more fully would have greatly increased the value of this post. Thanks anyway.
    - waldner says:
      
      May 23, 2013 at 17:34
      
      Well, labels and branches are (or should be) basic sed knowledge.
      ":begin" defines a label, and "t" is the conditional branch command, so "tbegin" (or "t begin") branches to label ":begin" if the last replacement operation was successful. Think of it as a kind of conditional "goto".
      The less known commands "N", "P" and "D" instead deserve more explanation in my opinion, since here they are integral to newline matching (while labels and branches are only used as part of the program and are not essential to understand how newlines are actually matched).
      - Ingmar Boddington says:
        
        May 23, 2013 at 23:43
        
        ...good explanation :) I understand what you are saying ofc, but I'd used sed for replacements many times before having to use labels whereas N and P had been used before. Thanks for your responses.
gregor says:

July 28, 2010 at 11:58

On Mac OS X (which uses the FreeBSD version of sed):

# replace each newline with a space
# (should work with GNU sed as well)
printf "a\nb\nc\nd\ne\nf" | sed -e :begin -e '$!N;s/\n/ /; tbegin'
printf "a\nb\nc\nd\ne\nf" | sed -e :begin -e '$!N;s/\n/ /' -e tbegin
