
Sanitizing files with no trailing newline

Text files should have all their lines terminated by a newline character (i.e., \n). POSIX states this, defining a text file as

A file that contains characters organized into zero or more lines.

A line, in turn, is defined as

A sequence of zero or more non-<newline> characters plus a terminating <newline> character.

As it happens, certain applications produce text files with the final newline character missing on the last line. This is an annoyance at best, but it can also break or adversely affect some text processing tools that operate on the file (for example, a shell while read loop will not process such a last line). A further annoyance is that you may have to process files that may or may not be well-formed, without knowing in advance, so you have to check each file before processing. (Incidentally, this poses a problem that we will deliberately ignore here: if a file ends with a newline, is it a well-formed file, or is it a broken file with an empty last line?)
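
To make the shell loop problem concrete, here is a minimal sketch. The temporary file and the variable names (tmp, count) are made up for illustration; the point is only that read returns a nonzero status when it reaches EOF before finding a newline, so the loop body never runs for the unterminated last line.

```shell
# Create a throwaway file whose last line lacks the terminating newline.
tmp=$(mktemp)
printf 'one\ntwo\nthree' > "$tmp"

# A plain read loop: read fails on the unterminated "three",
# so the loop body runs only for "one" and "two".
count=0
while IFS= read -r line; do
  count=$((count + 1))
done < "$tmp"
echo "lines seen by the loop: $count"

rm -f "$tmp"
```

A common workaround is to write the loop condition as read -r line || [ -n "$line" ], but then every loop in every script has to remember to do so.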

Of course, the right solution is to fix whatever produces the broken files in the first place. Unfortunately, that's not always possible. Here we'll see how to fix those files so they can be safely processed. It turns out that, due to the way awk processes input, the straightforward one-liner

awk 1 file > tempfile && mv tempfile file

produces correct output, regardless of whether the original file was well-formed or not. However, if the file is huge, we would like to avoid reading through the whole file only to fix the last line (and if the file is already correct, not even that). After all, there is a very simple, efficient and straightforward way to append a newline to a file that doesn't require scanning the whole file:

echo >> file
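
One caveat worth spelling out: echo >> file appends unconditionally, whereas awk 1 can be applied any number of times without changing an already correct file. The following is a small sketch of the difference, using a throwaway temporary file (all variable names are hypothetical); it is the reason we can't simply run echo >> file on every file without checking first.

```shell
tmp=$(mktemp)
printf 'a\nb' > "$tmp"     # 3 bytes, last line unterminated

# awk 1 adds the missing newline; running it again changes nothing.
# $(( ... )) strips any padding that some wc implementations print.
fixed_once=$(( $(awk 1 "$tmp" | wc -c) ))
fixed_twice=$(( $(awk 1 "$tmp" | awk 1 | wc -c) ))

# echo >> always appends, so the second call adds a spurious empty line.
echo >> "$tmp"
echo >> "$tmp"
appended_twice=$(( $(wc -c < "$tmp") ))

echo "awk once: $fixed_once, awk twice: $fixed_twice, echo twice: $appended_twice"
rm -f "$tmp"
```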

And, as we know, there are commands that can effectively read the file directly at the end; most implementations of tail do this. So now we have a plan:

  • read the last line,
  • check whether it ends with a newline character and, if it doesn't,
  • add one

We already have an efficient way of performing the first and last steps, so it should be an easy matter to check for the newline character, shouldn't it? Easier said than done. Let's get the last line with tail and put it in a variable:

lastline=$(tail -n 1 file)    # Wrong!

Unfortunately, command substitution removes trailing newlines. This is by design, but in this case it's a problem, as $lastline will then never end with a newline character, regardless of whether one was present in the original file or not. Now, using a trick I found on the Bash Hackers wiki, we can indeed get the line verbatim as it is in the file:

lastline=$(tail -n 1 file; echo x); lastline=${lastline%x}

This adds and then removes a character; the net result is that now $lastline contains exactly the output of tail -n 1 file untouched.
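
As a quick sanity check of the trick, the sketch below compares the lengths of the two results on a throwaway file. The file and the variable names (stripped, verbatim) are invented for the example.

```shell
tmp=$(mktemp)
printf 'last line\n' > "$tmp"

# Plain command substitution: the trailing newline is stripped.
stripped=$(tail -n 1 "$tmp")

# Sentinel trick: the x protects the newline, then the x is removed.
verbatim=$(tail -n 1 "$tmp"; echo x); verbatim=${verbatim%x}

echo "${#stripped}"    # prints 9: the text without the newline
echo "${#verbatim}"    # prints 10: the text plus its newline
rm -f "$tmp"
```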

This is all fine and dandy, but we still have to find a way to check whether the last character of $lastline is a newline or not. The idea now is to extract the last character of $lastline and compare it against a newline. How do we do that?

With bash it's quite easy:

# Many ways, even
if [ "${lastline: -1}" = $'\n' ]; then .... 
if [[ $lastline =~ $'\n'$ ]]; then .... 

What about shells that lack those fancy features? For the comparison, we can create a variable containing a literal newline character:

newline='
'

To extract the last character of $lastline, there isn't a standard straightforward method; the following kludge seems to work:

${lastline#"${lastline%?}"}

(We can't use command substitution here, because it would remove the trailing newline, which is exactly what we want to keep.)

The double quotes are important; they tell the shell to interpret whatever is yielded by ${lastline%?} literally. This isn't generally a problem, but it could be if that result contains wildcards or other special characters that may deceive the outer expansion. As usual, quoting does not hurt.
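
To see what can go wrong without the quotes, consider a contrived value containing a wildcard (the value and the variable names unquoted and quoted are made up for this sketch). Without quotes, the inner result a* is used as a pattern, and its shortest match is just the leading a.

```shell
lastline='a*b'

# Unquoted: the inner result "a*" acts as a pattern; # removes the
# shortest matching prefix, which is only "a".
unquoted=${lastline#${lastline%?}}

# Quoted: the inner result is taken literally, so "a*" is removed
# and only the real last character remains.
quoted=${lastline#"${lastline%?}"}

printf '%s\n' "$unquoted" "$quoted"    # prints *b, then b
```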

Now we're ready to do the job:

newline='
'
lastline=$(tail -n 1 file; echo x); lastline=${lastline%x}
[ "${lastline#"${lastline%?}"}" != "$newline" ] && echo >> file
# Now file is sane; do our normal processing here...

Edit 24/07/2014: thanks to geirha from freenode, who suggested the following clever and simpler solution:

tail -n1 file | read -r _ || echo >> file

Or, even better, read just the last character (this can also be an improvement for the original solution above):

tail -c1 file | read -r _ || echo >> file

These work because the read builtin exits nonzero if it detects EOF before it finds a \n (just like the case we're interested in).
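
For processing several files at once, the one-liner can be wrapped in a small function. This is only a sketch: the name ensure_trailing_newline is hypothetical, and the -s guard is an addition not found in the original one-liner, which would append a newline to an empty file as well (read sees EOF immediately there). Whether an empty file should be left alone is a judgment call.

```shell
# Hypothetical helper: append a newline to each file that lacks one.
ensure_trailing_newline() {
  for f in "$@"; do
    # Assumption: empty files are left untouched (-s is true only
    # for files that exist and have a size greater than zero).
    [ -s "$f" ] || continue
    # read fails if the last byte is not a newline; only then append.
    tail -c 1 "$f" | read -r _ || echo >> "$f"
  done
}

# Usage on two throwaway files, one broken and one well-formed.
broken=$(mktemp); good=$(mktemp)
printf 'a\nb' > "$broken"
printf 'a\nb\n' > "$good"
ensure_trailing_newline "$broken" "$good"
cmp -s "$broken" "$good" && echo "both files now have identical content"
rm -f "$broken" "$good"
```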


12 Comments

  1. gregor says:

    Here are two more alternatives:

    # note: ed will read entire file into memory
    [[ -s file ]] && ed -s file <<< $'H\n$s/\(.*\)/\\1/\nwq'

    sed -E -i "" '$s/(.*)/\1/' file

    • waldner says:

      Thanks.

      The sed method is not reliable: for example, it does not work with GNU sed (even after replacing -E with its GNU sed equivalent, -r). Besides that, sed needs to read the whole file, which is something that we specifically want to avoid. Along the same lines,

      awk 1 file

      which is mentioned in the article, produces correct output regardless of whether the input has a trailing newline or not. But again, that needs to read the whole file.

      And regarding the ed solution: it still reads the whole file so it's out of the article's scope, but it does work. It can be rewritten more portably as

      printf '%s\n' H '$s/\(.*\)/\1/' w q | ed file

      which, at least on GNU ed, prints a "Newline appended" message to let the user know what it did.

  2. gregor says:

    Check for the newline character this way:

    [[ "$(tail -c 1 file | tr -dc '\n' | wc -c)" -eq 1 ]] && ..

    see: http://stackoverflow.com/questions/1654021/how-can-i-delete-a-newline-if-it-is-the-last-character-in-a-file

  3. John Lee says:

    Uh, another quibble: there's a misnamed variable in my wrapup script: that's not a shell exec there of course, it's a Python exec statement -- entirely different things.

  4. John Lee says:

    Seems that "rw+" should more properly be "r+", BTW (or, to work on Windows as well as Unix, "r+b" -- but then you're into the fun of different line-ending conventions which I've ignored...).

    One other note: it's useful to remember that you can pass arguments to scripts wrapped with wrapup.py either by passing them at the time you run wrapup.py, or at the time you run the wrapped-up script (or both). I wrote a typo that broke the latter case though, here's the fixed version:

    python -u -c "exec 'import pipes\\nimport sys\\n\\nif __name__ == \"__main__\":\\n    args = sys.argv[1:]\\n    python_file = args[0]\\n    python_program = open(python_file).read()\\n    shell_program = \"exec %s\" % repr(python_program)\\n    args = [shell_program] + args[1:]\\n    print \"python -u -c \" + \" \".join(pipes.quote(arg) for arg in args)\\n'"

    And here's a wrapped-up single-line version of a multiline version of the original one liner (still ignoring Windows and Mac, as God intended):

    python -u -c "exec 'import os\\nimport sys\\n\\n\\ndef fix_newline(path):\\n    fh = open(path, \"rw+\")\\n    try:\\n        fh.seek(-1, os.SEEK_END)\\n        last_char = fh.read(1)\\n        if last_char != \"\\\\n\":\\n            fh.write(\"\\\\n\")\\n    finally:\\n        fh.close()\\n\\n\\ndef main(args):\\n    for path in args:\\n        fix_newline(path)\\n\\n\\nif __name__ == \"__main__\":\\n    main(sys.argv[1:])\\n'"

    I can't be bothered to try more guessing games to make the blog show that in readable form :-(

    • waldner says:

      I'm not entirely sure I understand the usefulness of mangling the code this way, except for fun or (very mild) obfuscation. Python is designed to enforce a relatively clean and readable source layout (unlike other languages), so why not enjoy its clearness?
      Sure, you can still turn it into gibberish if you really want and try hard enough, but hey, one Perl is enough I'd say.

      • John Lee says:

        The point is that you can cut-n-paste your maintainable multi-line code and run it as a one-liner on a remote server, without needing to bother with ssh. Avoids the need to type in passwords for servers that require that.

        • waldner says:

          Right, this makes more sense, I hadn't thought of that. However, let me add that you can still do (although admittedly it's a different use case)

          ssh user@host 'python /dev/stdin' < localscript.py

          without even needing to copy and paste anything. That even allows using arguments, eg

          ssh user@host 'python /dev/stdin arg1 arg2' < localscript.py

          Note I'm not saying one is "better" than the other, only that in most cases there is more than one way to do things. What one uses of course is determined by the circumstance, personal preference, constraints and possibly other factors.

          Thanks!

  5. John Lee says:

    Fail. The beauty of posting that particular piece of Python code on this Python-unfriendly blog is that I don't need to know the blog syntax for "please don't mangle my code", because the script knows how to turn itself into hard-to-mangle form -- here's the output it prints when run on itself:

    python -u -c "exec 'import pipes\\nimport sys\\n\\nif __name__ == \"__main__\":\\n    args = sys.argv[1:]\\n    python_file = args[0]\\n    python_program = open(python_file).read()\\n    shell_program = \"exec %s\" % repr(python_program)\\n    args = [shell_program] + args[1:]\\n    print \"python -u -c \" + \"\".join(pipes.quote(arg) for arg in args)\\n'"

  6. John Lee says:

    Knowing shell that well is bad for your soul ;-)

    Python is usually only good for one-liners when you have some support code to call -- which you might say is as it should be. I'm (mildly) surprised to see that Python wins over shell even for this simple job, though:

    python -c "import sys; f = open(sys.argv[1], 'rw+'); f.seek(-1, 2); f.write('\n') if f.read(1) != '\n' else None" file

    Of course, this could be made more readable in many ways if it were written as a proper .py file. An ex-colleague wrote a simple Python script named wrapup.py that takes a .py file as input and prints it rewritten as a one-liner -- below is my version of that script. This allows you to have your cake and eat it by having readable multi-line Python code, but also running it as a one-liner you can cut-and-paste. So now when you see impenetrable Python gibberish in shell history on certain servers, you'll know where it came from ;-) IIRC there's a bug in module pipes involving quoting of "!" that's only fixed in Python 2.7 -- so if you come across that you can copy the pipes.quote from Python SVN into wrapup.py

    What, there are still servers without Python installed, you say? They must be destroyed.

    Let's see if I can get your blog to indent this correctly:

    import pipes
    import sys
    
    if __name__ == "__main__":
    	args = sys.argv[1:]
    	python_file = args[0]
    	python_program = open(python_file).read()
    	shell_program = "exec %s" % repr(python_program)
    	args = [shell_program] + args[1:]
    	print "python -u -c " + "".join(pipes.quote(arg) for arg in args)

    • waldner says:

      Hey!

      Yes, that's quite similar to how you'd do it in Perl:

      perl -e 'open(F,"+<",$ARGV[0]);seek(F,-1,2);read(F,$c,1);print F "\n" if $c ne "\n";close(F);' file

      but I deliberately avoided these approaches in the article (btw, are there more servers without Perl or without Python?).

      And no, I'm not anti-Python at all! It's just that my knowledge of Python is not good enough to be confident, and I would risk writing rubbish (assuming what I write isn't rubbish already, that is :) ).

      And for the formatting, it's html so just use <pre> tags (which I've taken the liberty of adding to your code) and you should be fine. I've not been able to find a plugin that makes it easier, but it's true that I didn't search too hard either.

      Thanks.
