Find files containing specific words

Sometimes you need to print the names of all the files that contain certain words (I'll use "foo", "bar" and "baz" here, but the examples are easily extendable or adaptable). I'll assume the search should be done in a bunch of .txt files in the current directory; again, it's easy to adapt the examples to be used, for example, with find or in other contexts.

grep alone doesn't help much here, since it cannot correlate patterns appearing on different lines of the same file. But grep has a switch, -l, that prints the name of each file containing the pattern rather than the matching lines. So this

$ grep -l foo *.txt

gives us a list of the files that contain "foo". So we could (ab)use xargs and use that list as arguments for another grep:

$ grep -l foo *.txt | xargs grep -l bar

So now we add yet another xargs and the final result is a list of files that contain "foo", "bar" and "baz":

$ grep -l foo *.txt | xargs grep -l bar | xargs grep -l baz

This kind of works, but it has some problems. The first is that it will fail if the filenames being piped to xargs contain spaces or other characters that xargs interprets as input separators. GNU tools usually have switches that allow separating output elements with NULs (ASCII 0) rather than newlines, and corresponding switches that (where applicable) allow reading NUL-separated input. Fortunately, this is the case with grep and xargs. So we can make our code more robust (even against filenames containing newlines!) by using those switches:

$ grep -lZ foo *.txt | xargs -0 grep -lZ bar | xargs -0 grep -l baz

The last grep doesn't use -Z because we want a normal, newline-separated list as the final output.
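
As mentioned at the beginning, the same approach adapts easily to find. For example, to search the .txt files under the current directory (note that, unlike the glob, this also descends into subdirectories), something like the following should work, assuming a find that supports -print0:

$ find . -name '*.txt' -print0 | xargs -0 grep -lZ foo | xargs -0 grep -lZ bar | xargs -0 grep -l baz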

But this solution has another issue: each file that makes it to the last stage is read three times, and every file that gets to the second stage is read twice (though it must be said that all those reads happen almost in parallel, thanks to pipelining). We can make it somewhat better with the (again GNU grep specific) -m switch: -m 1 tells grep to stop reading a file as soon as the first match is found (the resulting pipeline is shown below). Still, your grep might not have -m, and this method seems a bit inefficient anyway, so let's see if single-pass solutions exist.
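
Just for reference, this is what the pipeline looks like with -m 1 added at each stage (again assuming a grep that supports both -Z and -m):

$ grep -lZ -m 1 foo *.txt | xargs -0 grep -lZ -m 1 bar | xargs -0 grep -l -m 1 baz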

With perl we can do something like

$ perl -0777 -ne 'print "$ARGV\n" if /foo/ && /bar/ && /baz/' *.txt

This works. However, it slurps each file entirely into memory. If the files are big, this might be a problem, although most modern systems should have no trouble keeping files of several hundred megabytes in memory. But still, why should we read 200MB of data if the words we want are all found in the first 5 lines (for example)?

Now, awk comes to the rescue (in fairness, the same thing we'll do in awk can also be done with perl; a sketch is given at the end). The idea is: start reading each file and, as soon as all the words have been found, print the file name and move on to the next file. Here is the code:

$ awk 'FNR==1 {foo=bar=baz=0}
/foo/ {foo=1}
/bar/ {bar=1}
/baz/ {baz=1}
foo && bar && baz {print FILENAME;nextfile}' *.txt

This way, only the files that do not contain all the words are read in their entirety; for the files that do contain them all, we stop reading as soon as the last word is found. Furthermore, we never keep a whole file in memory, only one line at a time. Please note that the nextfile command is a GNU awk extension; with awks that don't support it, you'll just have to read the rest of the file:

$ awk 'FNR==1 {foo=bar=baz=printed=0}
/foo/ {foo=1}
/bar/ {bar=1}
/baz/ {baz=1}
foo && bar && baz && !printed {print FILENAME;printed=1}' *.txt

That is slightly less efficient, but it's still a single-pass method that only needs to keep a single line at a time in memory.
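
For completeness, here is a rough sketch of what the same line-by-line, single-pass approach might look like in perl (treat it as an illustration rather than a polished one-liner):

$ perl -ne '$foo=$bar=$baz=0 if $ARGV ne $prev; $prev=$ARGV;
$foo=1 if /foo/; $bar=1 if /bar/; $baz=1 if /baz/;
if ($foo && $bar && $baz) {print "$ARGV\n"; close ARGV}' *.txt

Resetting the flags when $ARGV changes plays the role of awk's FNR==1 check, and closing the ARGV filehandle makes the <> loop move on to the next file, much like nextfile does.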