It's generally understood that doing
$ command < file
# or
$ command file # for commands that support this
and
$ cat file | command
are essentially the same thing. (Yes, the second form is a UUOC, a useless use of cat, but that is not relevant to the point being made here.)
This is usually true for most practical purposes. However, there are some subtle differences, which come down to the fact that a file and a stream, although similar and usually presented to the application through the same interface, are in fact two different kinds of object.
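The most important difference is that a file can be accessed randomly (a program can seek to any position in it), whereas a stream can only be read sequentially. Compare: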
$ ls -l bigfile
-rw-r--r-- 1 waldner users 176221788 2009-10-17 14:21 bigfile
$ strace tail -n 1 bigfile
execve("/usr/bin/tail", ["tail", "-n", "1", "testfile"], [/* 62 vars */]) = 0
...
fstat(3, {st_mode=S_IFREG|0644, st_size=176221788, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_END) = 176221788
lseek(3, 176218112, SEEK_SET) = 176218112
read(3, "8 999978 999978 999978 999978 99"..., 3676) = 3676
fstat(1, {st_mode=S_IFCHR|0777, st_rdev=makedev(1, 3), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8372bb5000
read(3, "", 0) = 0
close(3) = 0
write(1, "LAST LINE\n", 10) = 10
...
Here, the input is a file (as shown by st_mode=S_IFREG) and tail can jump to the end straight away using lseek(). If we feed the input through a pipe, that is not possible:
$ cat bigfile | strace tail -n 1
execve("/usr/bin/tail", ["tail", "-n", "1"], [/* 62 vars */]) = 0
...
fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
read(0, "FIRST LINE\nSECOND LINE\nTHIRD LIN"..., 8192) = 8192
...[snip about 21500 reads like the above]
read(0, "8 999978 999978 999978 999978 99"..., 8192) = 3676
read(0, "", 8192) = 0
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 2), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbf4c83b000
close(0) = 0
write(1, "LAST LINE\n", 10) = 10
Here fstat() says that the input is S_IFIFO, so tail has no choice but to read the whole thing from start to end.
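The same distinction can be observed from the shell, without strace. The following is only a minimal sketch: it assumes a Linux-like system where /dev/stdin refers to the current standard input, and the function name stdin_kind and the sample file are made up for illustration.

```shell
# Report what kind of object standard input is connected to.
# Assumption: /dev/stdin resolves to the current stdin (true on Linux).
stdin_kind() {
    if [ -f /dev/stdin ]; then
        echo "regular file"
    elif [ -p /dev/stdin ]; then
        echo "pipe"
    else
        echo "other"
    fi
}

# Hypothetical sample file, only for the demonstration.
sample=$(mktemp)
printf 'FIRST LINE\nLAST LINE\n' > "$sample"

kind_redirect=$(stdin_kind < "$sample")    # stdin is the file itself
kind_pipe=$(cat "$sample" | stdin_kind)    # stdin is a pipe fed by cat

echo "command < file     : $kind_redirect"
echo "cat file | command : $kind_pipe"

rm -f "$sample"
```

A program that makes this kind of check can choose the seek-based fast path for a regular file and fall back to sequential reading for anything else.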
Also note the huge difference in speed, due to the above behavior:
$ time tail -n 1 bigfile
LAST LINE
real 0m0.046s
user 0m0.002s
sys 0m0.001s
$ time cat bigfile | tail -n 1
LAST LINE
real 0m1.123s
user 0m0.139s
sys 0m0.553s
So in general, apart from avoiding the UUOC, using files and not streams/pipes is more efficient for those commands that benefit from the special properties of a file.
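To repeat the comparison on a smaller scale, something like the following sketch could be used. The file contents and variable names are invented for the example, and no timings are shown because they depend entirely on the machine.

```shell
# Build a throwaway multi-line file (hypothetical contents).
demo=$(mktemp)
seq 1 200000 > "$demo"
echo "LAST LINE" >> "$demo"

# The result is the same either way; only the way tail gets at the
# data differs.
last_from_file=$(tail -n 1 "$demo")          # tail may seek to the end
last_from_pipe=$(cat "$demo" | tail -n 1)    # tail must read everything

echo "file: $last_from_file"
echo "pipe: $last_from_pipe"

# To compare timings, prefix each form with `time`, for example:
#   time tail -n 1 "$demo"
#   time cat "$demo" | tail -n 1

rm -f "$demo"
```

With a file this small the difference will likely be hard to notice; it grows with the amount of data that the pipe version has to read and discard.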
However, it turns out that sometimes (hopefully less and less often) streams have some advantages over files. Programs that lack large file support (LFS; again, hopefully very few these days) are known to fail or misbehave when they have to read files bigger than 2 gigabytes on a 32-bit system. This is because non-LFS system calls like open(), lseek(), or stat()/fstat() can fail with an overflow error (EOVERFLOW) on such big files, making it effectively impossible for those programs to access them normally.
Usually, to "persuade" these programs to operate on such files, two things are needed:
- They should not have to use open() directly on the file, as they don't use O_LARGEFILE so that would fail. This can be solved (for those commands that support reading either from a specified file OR from standard input) by using redirection, so for example
nonlfscommand < bigfile
or
cat bigfile | nonlfscommand
- Then, these programs might try to detect which kind of object their input is, and based on that, if they detect a regular file, attempt other operations like lseek(), which again would fail. Usually, the detection would be done by calling stat() or fstat(). It turns out that a 32-bit stat() on a big file fails because st_size overflows. So we need a way to make stat() succeed, and in such a way that the program does not attempt further manipulation.
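The second requirement rules out things like
nonlfscommand < bigfile
because standard input is then still the big file itself, so the program's fstat() on it overflows in the same way.
So,
cat bigfile | nonlfscommand
is the form that meets both requirements: the program sees a pipe, so fstat() succeeds (reporting S_IFIFO with st_size=0) and the program has no reason to attempt lseek().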
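The choice between the two forms could be wrapped in a small helper, sketched below. Everything named here (run_nonlfs, NONLFS_LIMIT) is hypothetical, and the sketch assumes that the shell, wc and cat themselves were built with large file support.

```shell
# Largest size a signed 32-bit off_t can represent (2 GiB - 1).
NONLFS_LIMIT=${NONLFS_LIMIT:-2147483647}

# run_nonlfs FILE COMMAND [ARGS...]
# Hypothetical helper: feed FILE to COMMAND on stdin, through a pipe
# when FILE is too big for a non-LFS program, directly otherwise.
run_nonlfs() {
    file=$1
    shift
    # wc -c prints the size in bytes; $(( )) strips any padding.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -gt "$NONLFS_LIMIT" ]; then
        cat "$file" | "$@"    # program sees a pipe, never the big file
    else
        "$@" < "$file"        # small enough: keep the seekable file
    fi
}

# Usage with a small throwaway file, with wc -l standing in for the
# legacy program.
small=$(mktemp)
printf 'a\nb\nc\n' > "$small"
lines=$(( $(run_nonlfs "$small" wc -l) ))
echo "lines: $lines"
rm -f "$small"
```

Keeping the redirect for small files preserves the efficiency of the seekable case, while the pipe is used only where the file would otherwise be unreadable.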
As noted, however, this should not be a problem on any modern system, since all of them should support large files. Perhaps the only problematic programs are legacy, binary-only ones that were built without large file support.