“In-place” editing of files

Posted by waldner on 29 January 2011, 3:17 pm

Now this is a real FAQ.

"How can I edit a file in-place using sed/awk/perl/whatever?"

or:

"I know that using >> I can append text to a file. How do I prepend text to a file?"

What these people usually mean is:

"How do I change/edit a file without having to create a temporary file?" (for some unknown reason)

Let's try to see what "in-place editing" really means, and why using temporary files (implicitly or explicitly) is the only way to do that reliably. Here we will limit our analysis to a pure shell environment using the commonly used filters, or stream-editing tools (sed, perl, awk and similar), which is a relatively common situation. Things start to change if we allow programming languages or special tools.

Note: almost every operating system, even when using "unbuffered" functions, maintains a low-level disk cache or buffer, so even when data is written to a file, it may not hit the disk immediately; similarly, when data is read, it may be coming from the buffer rather than disk. While this low-level OS caching is certainly something to be aware of, for the purposes of the following discussion it is entirely transparent and irrelevant to the points made, so it will be ignored here.

In-place?

Strictly speaking, "in-place" would really mean that: literally editing the very same file (the same inode). This can be done in principle, but:

The program or utility must be designed to do that, meaning that it should arrange for the event that the file size increases, shrinks, or stays the same. Also, it must arrange things so data that hasn't yet been read is never overwritten. None of the usual text processing tools or filters is designed for this;
It's dangerous: if something goes wrong in the middle of the edit (crash, disk full, etc.), the file is left in an inconsistent state.

None of the usual tools or editors do this; even when they seem to do so, they actually create a temporary file behind the scenes. Let's look at what sed and perl (two tools which are often said to be able to do "in-place" editing) do when the option -i is used.

sed

Sed has the -i switch for "in-place" editing. First of all, only some implementations of sed (GNU sed and BSD sed AFAIK) support -i. It's a nonstandard extension, and as such not universally available.
According to the documentation (at least GNU sed's), what sed does when -i is specified is create a temporary file, send output to that file, and at the end, that file is renamed to the original name. This can be verified with strace; even without strace, a simple "ls -i" of the file before and after sed operates will show two different inode numbers.
If you do use -i with sed, make sure you specify a backup extension to save a copy of the original file in case something goes wrong. Only after you're sure everything was changed correctly, you can delete the backup. The BSD sed (used on Mac OS X as well) does not accept -i without a backup extension, which is good, although it can be fooled by supplying an empty string (eg -i "").

Perl

Perl, similar to sed, has a -i switch to edit "in-place". And like sed, it creates a temporary file. However, the way Perl creates the temporary file is different. Perl opens and immediately unlink()s the original file, then opens a new file with the same name (new file descriptor and inode), and sends output to this second file; at the end, the old file is closed and thus deleted because it was unlinked, and what's left is a changed file with the same name as the original. This is more dangerous than sed, because if the process is interrupted halfway, the original file is lost (whereas in sed it would still be available, even if no backup extension was specified). Thus, it's even more important to supply a backup extension to Perl's -i, which results in the original file being rename()d rather than unlink()ed.

Another false in-place

By the way, here is a solution which is often described as "in-place" editing:

$ { rm file; command > file; } < file

("command" is a generic command that edits the file, typically a filter or a stream editor)

This works because, well, it's cheating. It really involves two files: the outer file is not really deleted by the rm command, as it's still open by virtue of the outer input redirection. The inner output redirection then really writes to a different disk file, although the operating system allows you to use the same file name because it's no longer "officially" in use at that point. When the whole thing completes, the original file (which was surviving anonymously for the duration of the processing, feeding command's standard input) is finally deleted from disk.
So, this kludge still needs the same additional disk space you'd need if you used a temporary file (ie, roughly the size of the original file). It basically replicates what Perl does with -i when no backup extension is supplied, including keeping the original file in the risky "open-but-deleted" state for the duration of the operation. So, if one must use this method, at least they should do

$ { mv file file.bak; command > file; } < file

But then, doing this is hardly different from using an explicit temporary file, so why not do that? And so...

Using an explicit temporary file

So, generally speaking, to accomplish almost any editing task on a file, temporary files should be used. Sure, if the file is big, creating a temporary file becomes more and more inefficient, and requires that an amount of available free space roughly the same size of the original file is available. Nonetheless, it's by far the only right and sane way to do the job. Modern machines should have no disk space problems.

The general method to edit a file, assuming command is the command that edits the file, is something along these lines:

$ command file > tempfile && mv tempfile file
# or, depending on how "command" reads its input
$ command < file > tempfile && mv tempfile file

To prepend data to a file, similarly do:

$ { command; cat file; } > tempfile && mv tempfile file

where command is the command that produces the output that should be prepended to the file.

The use of "&&" ensures that the original file is overwritten only if no errors occurred during the processing. That is to safeguard the original data in case something goes wrong. If preserving the original inode number (and thus permissions and other metadata) is a concern, there are various ways, here are two:

$ command file > tempfile && cat tempfile > file && rm tempfile
# or
$ cp file tempfile && command tempfile > file && rm tempfile

These commands are slightly less efficient than the previous methods, as they do two passes over the files (adding the cat in the first method and the cp in the second). In most cases, the general method works just fine and you don't need these latter methods. If you're concerned about the excruciating details of these operations, this page on pixelbeat lists many more methods to replace a file using temporary files, both preserving and not preserving the metadata, with a description of the pros and cons of each.

In any case, for our purposes the important thing to remember of these methods is that the old file stays around (whether under its original name or a different one) until the new one has been completely written, so errors can be detected and the old file rolled back. This makes them the preferred method for changing a file safely.

Sponges and other tricks

There are alternatives to the explicit temporary file, however they are somewhat inferior in the writer's opinion. On the upside, they have the advantage of (generally) preserving the inode and other file metadata.

One such tool is sponge, from the moreutils package. Its use is very simple:

command file | sponge file

As the man page says, what sponge does is "reads standard input and writes it out to the specified file. Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows for constructing pipelines that read from and write to the same file".
So, sponge accumulates output coming from command (in memory or, when it grows too much, guess where? in a temporary file), and does not open file again for writing until it has received EOF on input. When the incoming stream is over, it opens file for writing and writes the new data into it (if it had to use a temp file, it just rename()s that to file which is more efficient, although this results in changing the file's inode).

A barebone implementation of a sponge-like program in Perl would be like

#!/usr/bin/perl 
# sponge.pl
while(<STDIN>) {
  push @a, $_;
}
# EOF here
open(OUT, ">", $ARGV[0]) or die "Error opening $ARGV[0]: $!";
print OUT @a;
close(OUT);

This keeps everything in memory; it can be extended to use a temporary file (and, for that matter, it can likely be extended to also perform whatever job the filter that feeds its input does, but then we are leaving the domain of this article).
With this, one could do

command file | sponge.pl file

A similar functionality can be implemented using awk:

# sponge.awk
BEGIN {
  outfile = ARGV[1]
  ARGC--
}
{ a[NR] = $0 }
END {
  for(i=1;i<=NR;i++)
    print a[i] > outfile
}

These methods work, and they do edit the same file (inode), however they have the disadvantage that if the amount of data is huge, there is a moderately long period of time (while data is being written back to the file) during which part of the data is only in memory, so if the system crashes it will be lost.

The good old ed

If the editing to be done is not too complex, another alternative is the good old ed editor. A peculiarity of ed is that it reads its editing commands, rather than the data, from standard input. For example, to prepend "XXX" to each line in the file, It can be used as follows:

printf '%s\n' ',s/^/XXX/' w q | ed -s file

(the -s switch is to prevent ed from printing information on how many bytes it read/wrote; there's no harm in omitting it)

At least in most implementations, ed does create a temporary file, which it uses as support for the editing operations; when it is asked to save the changes, it writes them back to the original file. This way of working is mandated by the POSIX standard, that says that

The ed utility shall operate on a copy of the file it is editing; changes made to the copy shall have no effect on the file until a w (write) command is given. The copy of the text is called the buffer.

So, it should be clear that ed presents the same shortcomings of the sponge-like methods; in particular, when it's requested to perform a write (the "w" command), ed truncates the original file and writes the contents of the buffer into it. If the amount of data is huge, this means that there's a moderately long time window during which the file is in an inconsistent state, until ed has written back the whole data (and no other copy of the original data exists). Consider this if you're worried about unexpected things happening in the middle of the process.

"But I don't want to use a temp file!"

Ok. Having said all this, we still see that, for some mysterious reasons, people still try to do away with temporary files, and come up with "creative" solutions. Here are some of them. They are all broken and must not be used for any reason. "Kids, don't do this at home".

The pipe of death

People sometimes try this:

$ cat file | command > file     # doesn't work

or also

$ command file | cat > file     # doesn't work

Obviously none of these work, because the file is truncated by the shell as soon as the last part of the pipeline is run (for any practical purpose this means "immediately"). But, after thinking a bit about that, something "clicks" in the mind of whoever is writing the code, which generally leads to the following "clever" hack:

$ command file | { sleep 10; cat > file; }    # DO NOT DO THIS

And that indeed appears to work. Except it's utterly wrong, and may bite you when you least expect it, with very bad consequences (things that seem to work are almost always much worse and dangerous than things that patently don't, because they can give a false sense of security). So, what's wrong with it?

The idea behind the hack is "let's sleep 10 seconds, so the command can read the whole file and do its job before the file is truncated and the fresh stuff coming from the pipe can be written to it". Let's ignore the fact that 10 seconds may or may not be appropriate (and the same goes for whatever value you choose to use). There's something much more seriously, fundamentally wrong there. Let's see what happens if the file is even moderately big. The right hand side of the pipeline will not consume any data coming from the pipe for 10 seconds (or however many seconds). This means that whatever command outputs, goes into the pipe and just sits there, at least until sleep is finished. But of course, a pipe cannot hold an infinite amount of data; rather, its size is usually fairly limited (like some tens of kilobytes, although it's implementation-dependent). Now, what happens if the output of command fills the pipe before sleep has finished? It happens that at some point a write() performed by command will block. If command is like most programs, that means that command itself will block. In particular, it will not read anything else from file. So it's entirely possible, especially if the input file is moderately large, and the output is accordingly large, that command will block without having read the input file fully. And it will remain blocked until sleep ends.

When that finally happens, there are at least two possible outcomes, depending on how exactly command reads its input and writes its output, the system's stdio buffering, the process scheduler, the shell and possibly some other factor (more on stdio buffering later).
If you're lucky (yes), you will end up writing a pipe's worth of output data into file and nothing more (of course losing its original contents, and the subsequent output that would have come from command). This is if you're lucky. Another, much worse, possibility is that command is unblocked when some of its output has already been written to file by the output redirection. What happens in that case is that the pipeline will enter an endless self-feeding loop, whereby cat writes the output of command to the file, but immediately after that command reads that same data again as its input, over and over. This causes file to grow without bounds as much as it can, possibly filling all the available space in the filesystem.

An alternative way of writing the same bad code, which probably makes the problem more evident is

$ cat file | { sleep 10; command > file; }    # DO NOT DO THIS

Again, cat will block if file is big and the pipe is filled before the 10 seconds have passed.

It probably helps to state it more clearly: the above code has the potential to completely trash your system and render it unusable. Do NOT use it, for any reason. If you don't believe that and want to see it for yourself, try this on a filesystem that you can fill (a loopback filesystem is strongly suggested here):

# create a 100MB file
# dd if=/dev/zero of=loop.img bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 2.58083 s, 40.6 MB/s
# make a filesystem on it
# mke2fs loop.img
mke2fs 1.41.9 (22-Aug-2009)
loop.img is not a block special device.
Proceed anyway? (y,n) y
...
# mount it
# mount -o loop loop.img /mnt/temp
# cd /mnt/temp
# Just create a moderately big file
# seq 1 200000 > file
# Here we go
# sed 's/^/XXX/' file | { sleep 10; cat > file; }    # DO NOT DO THIS
cat: write error: No space left on device
# ls -l
total 97611
-rw-r--r-- 1 root root 99549184 2010-04-02 20:08 file
drwx------ 2 root root    12288 2010-04-02 18:48 lost+found
# tail -n 3 file
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX8483
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX8484
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX#
# Uh-oh...

As the whole thing is completely nondeterministic, you might not get the same result (I had to repeatedly run it a few times myself and on different systems before getting the error). Nonetheless, you'll still have problems; if you don't enter the loop, then you'll end up with lots of missing data. Again: do NOT do the above for any reason. Imagine what could happen if this dangerous code is run as root by some cron job every night on an important server (hint: nothing good).

Buffers and descriptors

Another "solution" that is seen from time to time is something like

$ awk 'BEGIN{print "This is a prepended line"}{print}' file 1<>file     # DO NOT DO THIS

This prevents the file from being truncated as it uses the <> notation which opens the file for reading and writing. So it would seem that this is the holy grail of in-place editing. But is it?

To understand why this "works" and why (you guessed it) it must not be used, let's approach the topic from a general point of view.

In general, during the editing or changing of the file, the overall amount of data that has to be written can be smaller, the same size, or larger than the data that it is supposed to replace. This poses problems.

Let's start with the case where the new data is the same size as the old, which is also the only one that can be made to work (although, again, it's not recommended). For example, with the code

awk '{gsub(/foo/, "bar"); print}'

the old data and the new data are all three characters; we also know that the old data is read before the new data is written out, so we may be able to do real "in-place" editing by doing something like

awk '{gsub(/foo/, "bar"); print}' file 1<>file     # DO NOT DO THIS

this works because the "1<>file" syntax opens the file in read/write mode, and thus it's not truncated to zero length. Obviously, if the file is 1GB and the system crashes at some point in between, the data will be inconsistent. But it should be clear by now that we are already deep in the "don't do this" zone.

Let's see what happens if the replacement data is smaller than the original data.

$ cat file
100
200
300
400
500
$ sed 's/00/A/' file 1<>file
$ cat file
1A
2A
3A
4A
5A

500

This is expected. Once the replacement data has been written back, what was in the original file past that point is left there, and sed (or other similar utilities) does not invoke truncate() or ftruncate(), because they are not designed to be used this way (and rightly so). So this can't work.

Now let's look at the most dangerous case: the changed data is longer than the original. It's the most dangerous because, unlike the previous one, sometimes it works and could lead people to mistakenly think that it can be safely done.
In theory, this shouldn't even work; after all, the first time a chunk of data that is bigger than the original data is written back, some data that has not been read yet will be overwritten, leading to data corruption at a minimum. However, there are some circumstances that may make it look as if it worked, although (did I say that already?) it shouldn't really be done as it's quite risky. The following analysis was performed on Linux and is thus specific to that system, but the concepts are general.

The first thing to observe is that, when using filters or streaming editors, reads happen before writes (obviously). So, say, the program might read 10 bytes, and write back 20 bytes, or so. This doesn't seem to help much, but the second thing to observe is that I/O operations are usually buffered; that is, most programs use buffered I/O calls, like fread() or fwrite(). These calls don't read and write directly to the file (as read() and write() would), but instead use some internal buffers (usually implemented by the C library) whose purpose is to "accumulate" data; when the application fread()s, say, 10 bytes, 4096 bytes are read instead and put in the read buffer (and 10 are returned to the application); when the application fwrite()s 20 bytes, these are written into an output buffer, and only when this buffer is full (again, perhaps 4096 bytes) it is written to the actual file. If the application's standard I/O descriptors are not connected to a terminal (and if the application does not call read()/write() directly, of course), I/O for the program will be buffered.
We can confirm that this is indeed the case when, for example looking at the output of strace, we see that reads and writes happen in big chunks which do not correspond to the expected usage pattern of the application. For example, on this system the buffer size seems to be 4096 bytes. How does this matter for our problem? It matters because this buffering, specifically output buffering, is what makes the "write back more than was read" case work in some cases (but which is, instead, is a recipe for disaster).

So how does output buffering help? Let's go through the awk example we started from:

$ cat file
This is line1
This is line2
This is line3
This is line4
This is line5
$ awk 'BEGIN{print "This is a prepended line"}{print}' file 1<>file     # DO NOT DO THIS
$ cat file
This is a prepended line
This is line1
This is line2
This is line3
This is line4
This is line5

This apparently miraculous outcome is possible because of I/O buffers. Let's have a look at the output of strace:

$ strace awk 'BEGIN{print "This is a prepended line"}{print}' file 1<>file
...
open("file", O_RDONLY)              = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=70, ...}) = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff555abe30) = -1 ENOTTY (Inappropriate ioctl for device)
fstat(3, {st_mode=S_IFREG|0644, st_size=70, ...}) = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
read(3, "This is line1\nThis is line2\nThis"..., 70) = 70
read(3, "", 70)                         = 0
close(3)                                = 0
write(1, "This is a prepended line\nThis is li"..., 95) = 95
exit_group(0)

There's something strange there: why did read() happen before write(), even if in the awk code there is a print statement right in the BEGIN block which should clearly be executed before any data is read? As we said, I/O is buffered, so even if the application writes, data isn't really written to the file until there's enough of it in the buffer, or the file is closed or flushed. So the awk code print "This a prepended line" ends up putting the string into some C library write buffer, not on the file. This is not apparent from the strace output, as it happens entirely in user space without system calls. Then the execution continues, and awk enters its main loop, which requires reading the file. Now, the buffered I/O tries to read a whole chunk of data (the bolded read() above), which in this case is the whole file, and this is stored in some input buffer. Then awk executes the main body of its code, which simply copies its input to its output. Both operations are buffered, so reading reads from the read buffer, and writing writes to the output buffer (which already contains the line printed in the BEGIN block, so further output is appended to that). Nothing of this appears in strace, as it's all in userspace. Finally, the file is closed (because the program terminates), and descriptor 1 is flushed and write() is finally invoked (in bold above). The result of all this is that the output buffer at the time of write()ing contains exactly the line we wanted to prepend, plus the original lines in the file, so that's what's written back to the file.

Let's use ltrace, which can show library calls as well as system calls, to confirm our guesses (the output has been cleaned up in some places for clarity):

$ ltrace -S -n3 awk 'BEGIN{print "This is a prepended line"}{print}' file 1<>file
...
   121	   fileno(0x7f06e270b780)                        = 1
...
   213	   fwrite("This is a prepended line", 1, 24, 0x7f06e270b780 <unfinished ...>
   214	      SYS_fstat(1, 0x7fff663f4050)               = 0
   215	      SYS_mmap(0, 4096, 3, 34, 0xffffffff)       = 0x7f06e2daa000
   216	   <... fwrite resumed> )                        = 24
   217	   __errno_location()                            = 0x7f06e2d9a6a8
   218	   fwrite("\n", 1, 1, 0x7f06e270b780)            = 1
...
   226	   open("file", 0, 0666 <unfinished ...>
   227	      SYS_open("file", 0, 0666)                  = 3
   228	   <... open resumed> )                          = 3
...
   246	   read(3,  <unfinished ...>
   247	      SYS_read(3, "This is line1\nThis is line2\nThis"..., 70) = 70
   248	   <... read resumed> "This is line1\nThis is line2\nThis"..., 70) = 70
...
   254	   fwrite("This is line1", 1, 13, 0x7f06e270b780) = 13
   255	   __errno_location()                            = 0x7f06e2d9a6a8
   256	   fwrite("\n", 1, 1, 0x7f06e270b780)            = 1
   257	   _setjmp(0x64d650, 0x1d64ceb, 0x1d635b0, 0, 0) = 0
   258	   __errno_location()                            = 0x7f06e2d9a6a8
   259	   fwrite("This is line2", 1, 13, 0x7f06e270b780) = 13
   260	   __errno_location()                            = 0x7f06e2d9a6a8
   261	   fwrite("\n", 1, 1, 0x7f06e270b780)            = 1
   262	   _setjmp(0x64d650, 0x1d64cf9, 0x1d635b0, 0, 0) = 0
   263	   __errno_location()                            = 0x7f06e2d9a6a8
   264	   fwrite("This is line3", 1, 13, 0x7f06e270b780) = 13
   265	   __errno_location()                            = 0x7f06e2d9a6a8
   266	   fwrite("\n", 1, 1, 0x7f06e270b780)            = 1
   267	   _setjmp(0x64d650, 0x1d64d07, 0x1d635b0, 0, 0) = 0
   268	   __errno_location()                            = 0x7f06e2d9a6a8
   269	   fwrite("This is line4", 1, 13, 0x7f06e270b780) = 13
   270	   __errno_location()                            = 0x7f06e2d9a6a8
   271	   fwrite("\n", 1, 1, 0x7f06e270b780)            = 1
   272	   _setjmp(0x64d650, 0x1d64d15, 0x1d635b0, 0, 0) = 0
   273	   __errno_location()                            = 0x7f06e2d9a6a8
   274	   fwrite("This is line5", 1, 13, 0x7f06e270b780) = 13
   275	   __errno_location()                            = 0x7f06e2d9a6a8
   276	   fwrite("\n", 1, 1, 0x7f06e270b780)            = 1
   277	   read(3,  <unfinished ...>
   278	      SYS_read(3, "", 70)                        = 0
   279	   <... read resumed> "", 70)                    = 0
   280	   __errno_location()                            = 0x7f06e2d9a6a8
...
   284	   close(3 <unfinished ...>
   285	      SYS_close(3)                               = 0
   286	   <... close resumed> )                         = 0
   287	   free(0x1d64cd0)                               = <void>
   288	   __errno_location()                            = 0x7f06e2d9a6a8
   289	   fflush(0x7f06e270b780 <unfinished ...>
   290	      SYS_write(1, "This is a prepended line\nThis is"..., 95) = 95
   291	   <... fflush resumed> )                        = 0
   292	   fflush(0x7f06e270b860)                        = 0
   293	   exit(0 <unfinished ...>
   294	      SYS_exit_group(0 <no return ...>
   295	+++ exited (status 0) +++

Awk uses buffered I/O (ie, fread()/fwrite()), and in line 121 the actual file descriptor corresponding to the object at address 0x7f06e270b860 (presumably a pointer to a FILE object for stdout) is obtained, which is 1 (ie, standard output).
Lines 213-218 are where the print statement in the BEGIN block is executed; note that no write system call is performed, so data is written to the C library buffer, not to the file. Lines 226-228 open the file for reading, as part of awk's normal processing before starting its loop, and lines 246-248 read the contents of the file (since input is buffered, the call to fread() triggers a read() system call that reads the whole file in the input buffer). Line 254 and following is where the main body of the awk program (ie, "{print}") is executed: again, all the data goes into the C library buffer, which already contained the line printed in the BEGIN block.
Line 284 closes the file descriptor used to read the file. Up to here, the file is still unchanged. Then at line 289, standard output is flushed, and only now data is written to the file (line 290).

So the output buffer effectively saves our bacon here. As a further test, let's run the command again but with output buffering disabled (using the neat stdbuf utility from GNU coreutils):

$ stdbuf -o0 awk 'BEGIN{print "This is a prepended line"}{print}' file 1<>file
# hangs, press ctrl-C
^C
$ ls -l file
-rw-r--r-- 1 waldner waldner 10298496 Jan 28 15:04 file
$ head -n 20 file
This is a prepended line
This is a prepended line
e2
This is line3
This is line4
This is line5
s is line4
This is line5
s is line4
This is line5
s is line4
This is line5
s is line4
This is line5
s is line4
This is line5
s is line4
This is line5
s is line4
This is line5

So this finally shows that (as expected) writing more than is read can't work, and it's only because of I/O buffering that it sometimes appears to work. And obviously, it's not known a priori whether I/O will be buffered (depends on the actual program code, and other things). Even if the POSIX standard requires that some functions use buffered I/O if they don't refer to "an interactive device", there's no guarantee that the application will use those functions (eg, fread() or fwrite()). The application may very well use read() and write() directly, which of course are not buffered. Even if the buffered functions are used, nothing prevents the application from calling fflush() whenever it wishes, or even from disabling buffering entirely. If that happens, again hell breaks loose.

But if the above is not enough, let's continue this wicked game, and let's assume that we can rely on output buffering. Even in this case, we soon run into trouble.

Obviously, write buffering only provides a temporary storage for an amount of data that is less than or equal to the buffer size itself (eg, 4096 bytes). When the buffer is full, it is written out to the file. This means that the (already poor) protection provided by output buffering vanishes as soon as the size difference between read data and written data becomes greater than the buffer size. At that point, the output buffer is written to disk, and overwrites data that has not been read yet, thus disaster ensues again (data loss at a minimum, and potential endless loop with the file growing, depending on how the program exactly transforms the data). It's easy to verify; sticking to awk again,

# Let's prepend more than 4096 bytes to our file
$ awk 'BEGIN{for(i=1;i<=1100;i++)print i}1' file 1<>file
# after a while...
awk: write error: No space left on device
# Let's recreate the file
$ printf 'This is line 1\nThis is line 2\nThis is line 3\nThis is line 4\nThis is line 5\n' > file
# Let's try writing 5000 bytes at once now
$ awk 'BEGIN{printf "%05000d\n", 1}1' file 1<>file
$ cat file
[snip]
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
$ wc -l file
2 file

As can be seen above, whether the outcome is endless loop or "just" data corruption depends on how the program transforms the data.

A hall of shame

Now, knowing how it works, just for completeness, here is a little hall of shame that combines "ideas" from the bad techniques just described. It's provided to clearly state that these commands (and similar ones) must never ever be used.

$ sed 's/foo/longer/g' file 1<>file   # DO NOT DO THIS

# prepend data to a file. Some smart cats detect this and complain
$ { command; cat file; } 1<>file   # DO NOT DO THIS

# let's throw pipes into the mix

# prepend a file, bash
$ cat <(cat file1 file2) 1<>file2   # DO NOT DO THIS

# prepend a file, POSIX sh
$ cat file1 file2 | cat 1<>file2   # DO NOT DO THIS

# prepend text, "fooling" cat
$ { command; cat file; } | cat 1<>file   # DO NOT DO THIS

Those using pipes are even more dangerous (if possible), as they introduce concurrency, which make the outcome even more unpredictable (process substitution in bash is implemented using a pipe, although it's not apparent from the above). Depending on how the processes are scheduled and where the data is buffered, the result can vary from success (unlikely), to self-feeding loop, to corrupted data. Again, try it for yourself a few times and you'll see. As an example, here's what happens with the last command above to prepend text to a file:

$ seq 100000 105000 > file
$ wc -l file
5001 file
$ { seq 1 2000; cat file; } | cat 1<>file
$ wc -l file
208229 file       # should be 7001
$ seq 100000 105000 > file
$ wc -l file
5001 file
$ { seq 1 2000; cat file; } | cat 1<>file
$ wc -l file
194630 file       # should be 7001
$ seq 100000 105000 > file
$ wc -l file
5001 file
# now let's add more data
$ { seq 1 20000; cat file; } | cat 1<>file
^C
$ ls -l file
-rw-r--r-- 1 waldner users 788046226 2010-05-09 15:26 file
# etc.

Conclusions

The bottom line of all this is that, to perform almost any editing/changing task on a file, you must use a temporary file, and for very good reasons. Also, it's much better if that file is explicit.

Update 28/12/2012:

It was brought to my attention that there is another way to write to the file without creating a temporary file. Before showing it, let me repeat that this is a bad idea, unless you REALLY know what you're doing (and even then, think many times about it before doing it).

So, at least with bash, the various expansions that the shell performs (variable expansion, command substitution, etc.) happen before redirections are set up; this makes sense, as one could do

mycommand > "$somefile"

so the variable $somefile needs to be expanded before the redirection can be set up. How can this be exploited for in-place editing (true in-place, in this case)? Simple, by dong this:

printf '%s\n' "$(sed 's/foo/bar/g' file)" > file   # another one for the hall of shame

Of course, the output of the command substitution is temporarily stored in memory, so if the file is big, one may get errors like:

$ printf '%s\n' "$(sed 's/foo/bar/g' bigfile)" > bigfile
-bash: xrealloc: ../bash/subst.c:658: cannot allocate 378889344 bytes (1308979200 bytes allocated)
Connection to piggy closed.

Which, must be admitted, isn't as bad as some of the methods previously described because in this case, at least, the file isn't touched, that is, it's still as it was before running the command, rather than some intermediate inconsistent state.

Another, perhaps less obvious, problem with that approach is that (again, at least with bash) literal strings (such as the second argument to printf in the example) cannot contain ASCII NULs, so if the output of command substitution contains them, they will be missing in the result.

Update 2 23/02/2013:

For those who want real in-place editing, the Tie::File module of Perl is a way to do true in-place editing (same file, same inode) which also takes care of doing all the dirty work of expanding/shrinking the file. Basically, it presents the file as an array, and the code just has to modify the array; the changes are then converted to actual file changes on disk. Of course, all the caveats apply (file is inconsistent while it's being operated on) and, on top of that, performance will degrade as the file size (or amount of changes) increase. As they say, you can’t have your cake and eat it too.

Nevertheless, considering what it has to do, the Tie::File module is a really awesome piece of software.

As an example of a very basic usage, here are some simple operations (but there's no limit to the possibilities).

#!/usr/bin/perl
 
use Tie::File;
use warnings;
use strict;
 
my $filename = $ARGV[0];
my @array;
 
tie @array, 'Tie::File', $filename or die "Cannot tie $filename";
 
$array[9] = 'newline10';      # change value of line 10
splice (@array, 0, 5);        # removes first 5 lines
for (@array) {
  s/foo/longerbar/g;         # sed-like replacement
}
 
# etcetera; anything that can be done with an array can be done
# (but see the CAVEATS section in the documentation)
 
untie @array;

Sample run:

$ cat -n file.txt 
     1	this line will be deleted 1
     2	this line will be deleted 2
     3	this line will be deleted 3
     4	this line will be deleted 4
     5	this line will be deleted 5
     6	this line will not be deleted foo
     7	foo foo abc def
     8	hello world
     9	something normal
    10	something weird
    11	something foobar
$ ls -i file.txt 
11672298 file.txt
$ ./tiefile_test.pl file.txt 
$ cat -n file.txt 
     1	this line will not be deleted longerbar
     2	longerbar longerbar abc def
     3	hello world
     4	something normal
     5	newline10
     6	something longerbarbar
$ ls -i file.txt 
11672298 file.txt

Yes, it really is that simple. Now don't complain that it's slow or a crash messed up things.

Update 3 17/03/2015:

If (and only if)

the replacement is exactly the same length of the part to be replaced
you know exactly the position in the file where the replacement should be written
you're felling brave

another possibility is the venerable dd program. The trick is that it's possible to tell dd to not delete the output file, using the conv=notrunc option. So if we know that the text we want to replace starts at byte 200, we can do:

$ printf "newtext" | dd of=myfile seek=199 bs=1 conv=notrunc
7+0 records in
7+0 records out
7 bytes (7 B) copied, 3.4565e-05 s, 86.8 kB/s

and have the original file overwritten just where it needs to be. The reason for the "same length" requirement should hopefully be obvious.
No need to say that it's quite easy to screw up, but depending on the exact use case (eg binary editing), this might be a viable solution.

Filed under awk, faq, sed, shell, tips Tagged awk, cat, ed, perl, pipeline, prepending, sed, shell, text processing

Comments are closed | Permalink

12 Comments

Cody says:

May 1, 2015 at 22:59

I found this by accident but I'm glad I did; this is a really good write up - it covers a lot of issues with in place editing, the subject of temporary files (which some might argue is semantics but that's hardly a concern to me) and risks, as well as bringing other tools/methods up.

But one thing struck me as amiss: the rename system call doesn't change the inode. If it does in some cases it surely does not in all. I've not seen this in practise. I make use of this quite a lot (frequently, and have for many years), actually (not in place editing or anything like that; I refer only to the inode remaining the same). You could write a simple C test to demonstrate this but you could also do as you say:

ls -i file
mv file file2
ls -i file2

You could also see this with GNU sed option (-i) (and indeed it does use a temporary file).

I'd be curious as to what type of environment you're in that you see different inodes from renaming.
- Cody says:
  
  May 1, 2015 at 23:09
  
  One instance just came to mind - different file systems. But assuming this isn't the case, I've never seen a different inode for mv (I'm not doubting it is possible, however; I'd gladly be given an example or examples).
- waldner says:
  
  May 2, 2015 at 00:13
  
  There's no denying that rename()/mv preserves the inode. The problem with sed -i is that rename() acts on a different file from the one being edited by sed, so the result ends up having a different inode number:
```
$ ls -i aaa
109755 aaa
$ sed -i 's/a/b/' aaa
$ ls -i aaa
109759 aaa
```
  If one wants to preserve the inode, one has to do shenanigans like
```
$ command file > tempfile && cat tempfile > file && rm tempfile
# or
$ cp file tempfile && command tempfile > file && rm tempfile
```
  - Cody says:
    
    May 2, 2015 at 01:37
    
    Actually - yes, I see this. I mentally did a test with sed and saw the inode stay the same. However, while it is true that I did use sed, the time I did check inode is with mv itself (I thought I did both!). Actually, I think I know what I did: I knew it used rename but I did not think of the fact it renames a new file to the old, rather than - as you put it - in-place edit. Which means there is a new inode; there is no other way around it. So seeing mv confirm what I already knew, it worked in my mind even though it wasn't actually 100% correct.
    
    Thanks for clarifying. Truthfully I never thought about the internals of sed in this way because I never had a need to (although I use the option but then I also have backups, and in many files I have not only backups but revision control - but indeed there is that risk). But of course a temporary file does make sense given what you demonstrate (you gave far more examples than I would have thought of, especially if I didn't really think about it for a long while; I'd argue I wouldn't ever think of that many examples). As for rename() and mv - that is another issue entirely, as I do make use of these frequently. Glad I saw this article because there is no such thing as learning too much, as far as I am concerned! Actually, while I'm at it: great site you have here and the \1 at the top for the url is a great addition!
    
    cheers.
Stan says:

March 12, 2015 at 16:17

There are some situations you don't have a choice, for example on a embedded system with limited disk and memory, if you want to do some changes on a (relatively big) file without reflashing the device, then you need real inplace editing, I don't care if the file is inconsistent during the process as if something goes wrong I can reflash anyway.
bruce says:

February 22, 2015 at 22:02

A very complete and careful analysis.Thanks.

Perhaps your advice to make the tempfile explicit is even more important when you are in-place editing alot of files in a common directory over nfs. The multiple processes running on different machines could generate the same file name with mkostemp(). Since nfs does not guarantee every machine will see the same files, the code in mkostemp() that avoid collisions may still generate the same sedXXXXXX may not help because it does not see the file exists already. I think we may seeing some infrequent failures when our filers get bogged down because of a scenario like this.

If we used the fact that the filename (represented by f below) are already uniquely named and made the tmp file explicit:
sed -e '...' f.tmp && mv f.tmp f
then it might be safer.
karl says:

November 20, 2013 at 16:29

Mac OS X / FreeBSD sed need not be fooled, just use the -e switch after the -i option to explicitly define the regex to be used (sed -i -e 's/x/X/' file).

Another cute but dangerous trick to edit a file "in-place" is to use open file handles (without preserving the inode though; see http://stackoverflow.com/a/2586117).

By the way, it is possible to flush disk cache programmatically using man 1 sync (commonly available on Unix systems)!
- waldner says:
  
  November 20, 2013 at 16:44
  
  The stackoverflow trick can be done (preserving the inode) using the "sed ... file 1<>file" method described in the article.
  
  And while it's technically true that data can be flushed programmatically, you would need to run another process or thread (while the editing is taking place) to periodically call sync()...that's not the way many scripts work or are designed.
  
  And anyway, the point is not whether the file is flushed to disk or not, but rather that it is inconsistent (contains edits as well as old content); flushing to disk just syncs the inconsistency to persistent storage.
  
  Flushing may help right after the editing has finished, to immediately write things to disk (though data may not actually hit the disk yet, especially if the drive is smart).
lhunath says:

July 16, 2013 at 14:47

ex(1) is also a POSIX editor which edits in-place which is often considered friendlier than ed, especially when combining with find(1) or in a pipeline.
niku says:

November 13, 2012 at 05:23

Wonderful reference. Thanks!
needlesscomplexity says:

March 14, 2011 at 02:46

this makes the situation look complex, when in truth it is very simple.
perl, bash, added "features" to sed (e.g. -i), cp (e.g. -l), etc. are not very popular but certainly not necessary. if you use such things, which are perceived to make things "easier", then you will likely also create unnecessary complexity. this becomes apparent if something fails to work.

unix is simple.

rule#1 to edit small files, use ed.
ed accepts commands from stdin (e.g. through a pipe or a here doc).

rule#2 to edit big files use sed.
sed accepts commands from a file.

simple.

these utilities are time-tested and will not fail. (assuming the GNU people or others have not fscked them up in the process of adding "features").

you can use ed to edit a sed script. (no need for vi). then use sed to apply the changes to your file or files, of whatever size.

put this combination (ed+sed) in a loop with less(1) to preview changes as you go and you have a rock solid "tool" for editing massive files. one that relies only on ubiquitous unix utils and works in sh/csh/tcsh/ksh/bash. less will not choke on massive files nor create temp files.

here's an example:
http://sprunge.us/LQNX

as for the question of whether to use mv, cp or cat once you have the suitable sed script, do not forget you can also use dd to replace/copy/catenate. unlike the other utilities, dd allows control over the buffer size which can be useful when working with massive files.
- waldner says:
  
  March 14, 2011 at 10:06
  
  Your suggestion may be fine for interactive edits (although some may say it's a bit overkill), but of course that's not the only scenario.

\1