Removing newlines in text files

This comes up so often (though for reasons I can't imagine) that it deserves its own space.

Basically, what we want to do is to remove all the newlines (optionally replacing them with something else, say spaces) from a file, so from this input:

$ cat file.txt
line1
line2
line3
line4
line5

we get this output:

line1line2line3line4line5

If newlines are replaced by spaces, which seems to be a somewhat more common requirement, then we want this:

line1 line2 line3 line4 line5

Here I'm going to assume that the output should be a regular text stream, that is, correctly terminated with a newline character. If this requirement is removed, things are easier: basically tr alone can do the job (see below). But we want a real text file as output.

So let's look at a number of ways to accomplish our goal using common shell tools. If the newlines should be replaced with something other than a space, it's easy to adapt the examples accordingly.

tr

The easiest way would seem to be with the good old tr:

$ tr '\n' ' ' < file.txt
line1 line2 line3 line4 line5 $

(To remove newlines instead of replacing them with spaces, just use

tr -d '\n' < file.txt

in this and the following commands)

But we immediately see that this has two issues. The first is that it replaces ALL the newline characters, including the last one, so the output does not end with a newline, which, strictly speaking, is not correct for a text file. That can be "fixed" with this kludgy code:

$ tr '\n' ' ' < file.txt; echo
line1 line2 line3 line4 line5 
$

Apart from the ugliness, there is another issue: if you inspect the output with tools like od or hexdump (or carefully watch the output of the first example run above), you'll see that the last item has a trailing space, which is the result of the conversion of the ending newline. Maybe we don't want that extra space, so to remove it we have to add another kludge to the already kludgy code:

$ { tr '\n' ' ' < file.txt; echo; } | sed '$s/ $//'
line1 line2 line3 line4 line5
$

So tr is not good enough for our purposes.
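
To see the trailing space for yourself, a byte dump helps. What follows is only a minimal sketch: tr_plain and tr_kludge are hypothetical helper names that wrap the two pipelines above and read standard input instead of file.txt, and od -c prints every byte of the result.

```shell
# Hypothetical helpers wrapping the two tr pipelines shown above.
# Both read standard input instead of file.txt.
tr_plain()  { tr '\n' ' '; }
tr_kludge() { { tr '\n' ' '; echo; } | sed '$s/ $//'; }

# Dump the bytes of each result so that whitespace becomes visible.
printf 'line1\nline2\nline3\n' | tr_plain  | od -c
printf 'line1\nline2\nline3\n' | tr_kludge | od -c
```

The first dump should end with a space and no \n; the second should end with \n and no space before it.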

paste

The oft-unknown paste utility can be (ab)used perfectly for the task:

$ paste -s -d ' ' file.txt
line1 line2 line3 line4 line5
$ paste -s -d '' file.txt
line1line2line3line4line5
$
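
A portability note, offered as a sketch rather than a definitive recipe: POSIX spells the empty delimiter as '\0' in the delimiter list, and not every paste implementation accepts an empty -d '' argument. The wrapper names join_space and join_nothing are made up for this example; the "-" operand makes paste read standard input.

```shell
# Hypothetical wrappers; "-" tells paste to read standard input.
join_space()   { paste -s -d ' ' -; }
join_nothing() { paste -s -d '\0' -; }   # '\0' means "empty string" to paste

printf 'line1\nline2\nline3\n' | join_space     # line1 line2 line3
printf 'line1\nline2\nline3\n' | join_nothing   # line1line2line3
```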

sed

Note that the sed examples assume a modern sed, like GNU sed, that understands the syntax used.

To do the job with sed, we can use this code:

$ sed ':a;$!{N;s/\n/ /;ba;}' file.txt
line1 line2 line3 line4 line5
$ sed ':a;$!{N;s/\n//;ba;}' file.txt
line1line2line3line4line5
$

This works correctly because sed always adds a newline whenever it prints the pattern space. What the code does is to accumulate input lines in the pattern space, replacing (or removing) the newline inserted by the N command as each new line is read in. This is executed in a loop, which ends when the last line of input is reached. At that time, sed will print out the (non-newline-terminated) pattern space, and add a trailing newline, to give us the neat output we want.
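
The loop described above may be easier to follow when the one-liner is spread over several lines. This is the same script as the first sed command, wrapped in a hypothetical function join_sed that reads standard input; like the rest of this section, it assumes GNU sed.

```shell
# Hypothetical wrapper around the first sed command above, reading stdin.
join_sed() {
  sed '
    # Label "a": the top of the loop.
    :a
    # On every line except the last...
    $!{
      # ...append the next input line to the pattern space ("\n" + line),
      N
      # turn that embedded newline into a space,
      s/\n/ /
      # and jump back to the label.
      ba
    }
    # On the last line the block is skipped; the script ends and sed
    # prints the pattern space followed by a newline.
  '
}

printf 'line1\nline2\nline3\n' | join_sed   # line1 line2 line3
```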

Or if you prefer to explicitly slurp the file, you can do this:

$ sed ':a;$!{N;ba;};s/\n/ /g' file.txt
line1 line2 line3 line4 line5
$

This is not too bad, but like the previous version it holds the whole file in memory (in the pattern space), which may not be very efficient if the file is big.

awk

If we want to use awk, there are a couple of ways to do it. The most straightforward, which uses the string concatenation idiom, is as follows:

$ awk '{a=a s $0;s=" "}END{print a}' file.txt
line1 line2 line3 line4 line5
$ awk '{a=a $0}END{print a}' file.txt
line1line2line3line4line5
$

However, if the file is huge, the string variable a becomes just as huge, because it accumulates all the lines, so it's the same issue as with sed above: while this is not a major problem memory-wise nowadays, it will probably perform suboptimally (to say the least). To improve on that, we can use the same idea as above, but without storing lines in memory, instead printing them as we go using printf:

$ awk '{printf "%s%s",s,$0;s=" "}END{print""}' file.txt
line1 line2 line3 line4 line5
$ awk '{printf "%s",$0}END{print""}' file.txt
line1line2line3line4line5
$ awk -v ORS= '1;END{print RS}' file.txt  # similar to the previous one, without explicit printf
line1line2line3line4line5
$

Compare the two approaches applied to huge files, and you'll see a big difference in performance.
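
One way to set up such a comparison, sketched under a few assumptions: join_concat, join_stream and gen_lines are hypothetical helper names, the line count is arbitrary, and the input is generated on the fly instead of coming from a real file. The checksums only confirm that both approaches produce the same output.

```shell
# Hypothetical wrappers around the two awk approaches above; both read stdin.
join_concat() { awk '{a=a s $0;s=" "}END{print a}'; }
join_stream() { awk '{printf "%s%s",s,$0;s=" "}END{print""}'; }

# Generate test input: n numbered lines (n is an arbitrary choice).
gen_lines() { awk -v n="$1" 'BEGIN{for(i=1;i<=n;i++) print "line" i}'; }

# Both must produce identical output; cksum prints a checksum and byte count.
gen_lines 200000 | join_concat | cksum
gen_lines 200000 | join_stream | cksum
```

To compare speed, prefix each pipeline with time (a keyword in bash, ksh and zsh) and raise the line count until the difference is measurable on your machine.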

Here are two other awk ways:

$ awk 'NR>1{printf "%s ",p}{p=$0}END{print p}' file.txt
line1 line2 line3 line4 line5
$ awk '$1=$1' RS= FS='\n' file.txt  # slurps the whole file, assumes no empty lines in the input
line1 line2 line3 line4 line5
$
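
The "no empty lines" caveat on the last command comes from awk's paragraph mode: with RS set to the empty string, blank lines separate records, so each block of lines is joined on its own. A small sketch of that behavior, using a hypothetical wrapper join_para that reads standard input through the "-" operand:

```shell
# Hypothetical wrapper around the paragraph-mode command; "-" reads stdin.
join_para() { awk '$1=$1' RS= FS='\n' -; }

# Input with an empty line in the middle: two records, two output lines.
printf 'a\nb\n\nc\nd\n' | join_para
# a b
# c d
```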

Perl

Apologies to all the real Perl programmers!

Since Perl can be used in a sed- and awk-way, all the methods described for these tools can be implemented in Perl. However, in some cases, Perl can express the same things in more compact ways:

# slurp the file, replace all the newlines except the last
$ perl -p0777e 's/\n(?!$)/ /gs' file.txt
line1 line2 line3 line4 line5
$ perl -p0777e 's/\n(?!$)//gs' file.txt
line1line2line3line4line5
$

Here a negative lookahead is used to check that the newline character is not the last character in the file. Alternatively, we can avoid slurping and just replace all the newlines only if we are not at the end of file:

$ perl -pe 's/\n/ / if ! eof' file.txt
line1 line2 line3 line4 line5
$ perl -pe 'chomp if ! eof' file.txt
line1line2line3line4line5
$

This is nice because, as noted, it doesn't slurp the file; it relies on the eof function, for which awk has no equivalent. Another (more compact) way again exploits eof:

$ perl -ple '$\=eof()?"\n":" "' file.txt
line1 line2 line3 line4 line5
# same but more obfuscated
$ perl -ple '$\=(eof)?$/:$"' file.txt
line1 line2 line3 line4 line5
$

The special variable $\ is Perl's output record separator, i.e. much like ORS in awk (if you load the English module, you can even write it as $ORS or $OUTPUT_RECORD_SEPARATOR).

2 Comments

  1. miguel says:

    Thanks dude, very helpful.

  2. John Lee says:

    Hey, that's what the J key is for in vim (or emacs+viper).

    Still, I can see myself using paste, didn't know about that one.
