Run a command on all files in a hierarchy

Posted by waldner on 10 July 2010, 6:54 pm

This is one of the most frequently asked questions of all time.

"How do I run a command on all .txt files (or .cpp, or .html, or whatever) in a big directory hierarchy?" Most often, the command is an editing command like sed, so the requester also wants to edit the files "in-place" (but of course he says that only after you've already given him a generic solution and he has found that it doesn't do what he wanted).

The general way to run a command on a directory hierarchy is to use find:

$ find /basedir -type f -name '*.txt' -exec something {} \;

The above will run the command something on every .txt file in the hierarchy. The {} is replaced by the name of the file being processed; the semicolon tells find that the command to execute ends there (it must be escaped, as above, or quoted, otherwise the shell will interpret it). So for example, to find all the files containing lines matching "foobar[0-9]*" you can do:

$ find /basedir -type f -name '*.txt' -exec grep -l 'foobar[0-9]*' {} \;
# list of files follows...
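
To make this concrete, here is a minimal sketch that can be run in a scratch directory. The directory layout and file names (docs/match.txt and so on) are made up for the illustration, and mktemp -d is assumed to be available:

```shell
# Build a throwaway hierarchy (all names here are hypothetical).
basedir=$(mktemp -d)
mkdir -p "$basedir/docs"
printf 'see foobar42 here\n' > "$basedir/docs/match.txt"
printf 'nothing relevant\n' > "$basedir/docs/other.txt"
printf 'foobar7\n' > "$basedir/skipped.log"   # not a .txt file, so find never passes it to grep

# grep -l is run once per .txt file and prints the name of each file that matches.
matches=$(find "$basedir" -type f -name '*.txt' -exec grep -l 'foobar[0-9]*' {} \;)
echo "$matches"

rm -rf "$basedir"
```

Only docs/match.txt is listed: other.txt has no matching line, and skipped.log is filtered out by -name before grep ever sees it.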

Now, the above command runs grep once for each file found. If you have thousands of .txt files, that means that an equivalent number of processes is fork()ed and exec()ed. This is costly for the system, and also not very nice if you're sharing the system with other users. Since grep (like many other tools) can accept a list of files as arguments, it would be nice if it were possible to pass it many filenames at once rather than one at a time. In this case, one usually thinks of xargs, which after all is designed precisely for that. But xargs is a bit clumsy, and (at least the standard version) makes it very hard to handle certain cases like filenames with spaces or newlines (there is some discussion in POSIX about changing this, but that won't happen in the near future anyway). For some more bashing of xargs, and an explanation of the features we will now introduce here, see the excellent article "using find".

Now it turns out that find has its own mechanism to do internally what xargs does; to enable it, just replace the semicolon with which you end -exec with a plus (+):

$ find /basedir -type f -name '*.txt' -exec grep -l 'foobar[0-9]*' {} +

Now grep will be invoked with as many arguments as possible each time (within the system's limit for the maximum command line length); so instead of spawning hundreds or thousands of processes like before, we're now running only a few, possibly just one. This is much better.
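
The difference is easy to see by counting invocations. The following is a small sketch on a scratch hierarchy (the file names are made up, and mktemp -d is assumed to be available); echo stands in for the real command, so each line of output corresponds to one process started by find:

```shell
# Build a throwaway hierarchy with three .txt files (hypothetical names).
basedir=$(mktemp -d)
mkdir -p "$basedir/sub"
touch "$basedir/a.txt" "$basedir/b.txt" "$basedir/sub/c.txt"

# With \; echo runs once per file, so three lines starting with "run" are printed.
per_file=$(find "$basedir" -type f -name '*.txt' -exec echo run {} \; | grep -c '^run')

# With + all three names fit into a single command line, so echo runs once.
batched=$(find "$basedir" -type f -name '*.txt' -exec echo run {} + | grep -c '^run')

echo "per-file invocations: $per_file, batched invocations: $batched"
# prints: per-file invocations: 3, batched invocations: 1

rm -rf "$basedir"
```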

And you're not limited to just a single command: if you exec sh -c, you can run arbitrary code on the files. This is usually invoked like this:

$ find /basedir -type f -name '*.txt' -exec sh -c 'some code here' sh {} +

The "sh" after the code is a placeholder; this is what the spawned shell will see as its $0 (as per the manual). Since we're using "+", the {} is then turned into as many arguments as possible, so in the shell code you can use "$@" to refer to them, or loop over them using for, etc.
If the code in single quotes becomes long and complex, you can of course put it in an executable script and then do -exec script.sh {} + (giving the path to the script if it's not in your PATH).
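
As an illustration of the sh -c form, here is a sketch that loops over "$@" and prints a line count for each file. The file names and the reported format are invented for the example; one of the names contains a space to show that the quoting holds up:

```shell
# Scratch hierarchy with two files (hypothetical names).
basedir=$(mktemp -d)
printf 'one\ntwo\n' > "$basedir/first file.txt"
printf 'three\n' > "$basedir/second.txt"

# The inner shell gets "sh" as $0 and the file names as "$@".
report=$(find "$basedir" -type f -name '*.txt' -exec sh -c '
  for f in "$@"; do
    # Print the base name of each file followed by its line count.
    printf "%s: %s lines\n" "${f##*/}" "$(grep -c "" "$f")"
  done
' sh {} + | sort)

printf '%s\n' "$report"
# prints:
# first file.txt: 2 lines
# second.txt: 1 lines

rm -rf "$basedir"
```

Note that the code is wrapped in single quotes, so inside it only double quotes are used; a stray single quote (even in a comment) would end the string early.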

Now for the follow-up question: "but I'm running sed and I want to change the files in place". Ok, you know that sed has an option -i (non-standard, but supported by GNU and BSD sed) that (apparently) edits the files in-place. So you can actually run something like this:

$ find /basedir -type f -name '*.txt' -exec sed -i 's/foo/bar/g' {} +

Now read carefully: if something goes wrong, or you have an error in your sed code that does something that you don't want (yes, it happens!), the above command has the potential to make a mess of your files, in such a way that it may be impossible to recover. You have been warned. When playing with commands like the above it's extremely easy to find yourself with, say, 2500 files changed in an unintended, but non-reversible, way. This is the reason why you should really either make a backup of the whole hierarchy before attempting to do anything, or alternatively you should specify a backup extension to -i, as in -i.bak for example, so you have a copy of the original file with the .bak extension. Only after you have made sure that the massive edit was successful and the files were changed as expected, may you delete the backup files.
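
A cautious workflow along those lines might look like the sketch below. It assumes a sed that accepts -i with an attached suffix (GNU and BSD sed do; it is not in POSIX), and the file name notes.txt is made up for the example:

```shell
# Scratch hierarchy with a single file to edit (hypothetical name).
basedir=$(mktemp -d)
printf 'foo and foo\n' > "$basedir/notes.txt"

# -i.bak edits notes.txt in place and keeps the original as notes.txt.bak.
find "$basedir" -type f -name '*.txt' -exec sed -i.bak 's/foo/bar/g' {} +

# Check the result against the backup before doing anything irreversible.
edited=$(cat "$basedir/notes.txt")
backup=$(cat "$basedir/notes.txt.bak")
echo "edited: $edited"
echo "backup: $backup"

# Only once the edit looks right, delete the backups.
find "$basedir" -type f -name '*.txt.bak' -exec rm {} +
leftover=$(find "$basedir" -type f -name '*.bak')

rm -rf "$basedir"
```

On a real hierarchy, something like diff between each file and its .bak copy is a reasonable way to do the "make sure" step before the final rm.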

For an excellent page on find, see Greg's wiki.



One Comment

Ole Tange says:

July 11, 2010 at 07:23

Sometimes the command you want to run is compute-intensive. If you have access to several computers, GNU Parallel http://www.gnu.org/software/parallel/ makes it possible to have those computers help with the computation. E.g. convert all .wav to .mp3 using the local computer plus computer1 and computer2, running one job per CPU core:

find . -type f -name '*.wav' | parallel -j+0 --trc {.}.mp3 -S :,computer1,computer2 "lame {} -o {.}.mp3"

To learn more watch the intro video for GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ
