Again on IRC, somebody asked how to remove duplicates from a file, but keeping only the last occurrence of each item. The classic awk idiom
awk '!a[$0]++'
prints only the first instance of every line. So if the input is, for example
foo
bar
baz
foo
xxx
yyy
bar
the "normal" output (ie using the classic idiom) would be
foo
bar
baz
xxx
yyy
whereas in this particular formulation of the task we want instead
baz
foo
xxx
yyy
bar
Of course, one may check a specific field rather than $0 (which is probably more useful), but the general technique is the same.
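For reference, the field-based variant of the classic idiom is still a one-liner (here $3 and the file name are only placeholders):

awk '!seen[$3]++' file

which keeps the first line seen for each distinct value of the third field.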
Turns out that the problem is not as simple as it may seem. Let's start by seeing how we can find out where the last occurrence of a key is in the file:
{pos[$0] = NR}
After reading the whole file, the array pos contains, for each distinct input line, the record number of its last occurrence.
Now that we have the position of the last occurrence of each key, all that remains is to print the keys in ascending order of those positions.
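For instance, with the sample input above, pos ends up as

pos["foo"] = 4
pos["bar"] = 7
pos["baz"] = 3
pos["xxx"] = 5
pos["yyy"] = 6

(the value for each key is the line number of its last occurrence).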
At this point, one may think of doing some kind of sorting, but let's see whether it's possible to avoid that. For example, we can (using another common awk idiom) swap the keys and the values:
END {
    for(key in pos)
        reverse[pos[key]] = key
    ...

Now the array reverse is indexed by record number, with the corresponding key as the value, so we can complete the END block with

    ...
    for(nr=1;nr<=NR;nr++)
        if(nr in reverse)
            print reverse[nr]
}

to print the keys in ascending order of the indices (ie, record numbers).
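With the sample input, for example, reverse comes out as

reverse[3] = "baz"
reverse[4] = "foo"
reverse[5] = "xxx"
reverse[6] = "yyy"
reverse[7] = "bar"

and the loop visits indices 3 through 7 in order, printing exactly the output we want.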
So the resulting awk code is
{pos[$0] = NR}

END {
    for(key in pos)
        reverse[pos[key]] = key
    for(nr=1;nr<=NR;nr++)
        if(nr in reverse)
            print reverse[nr]
}
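To check it against the sample input (the file names here are made up for the test):

$ awk -f lastdup.awk sample.txt
baz
foo
xxx
yyy
bar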
Now the last detail: what if we wanted to check for duplicates on a specific field rather than the whole line? The code just needs to be changed slightly to remember the lines we need:
# for example, using $3 as the key
{pos[$3] = NR; lines[$3] = $0}

END {
    for(key in pos)
        reverse[pos[key]] = key
    for(nr=1;nr<=NR;nr++)
        if(nr in reverse)
            print lines[reverse[nr]]
}
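A quick sanity check with a made-up three-field input (both the data and the file names are hypothetical):

$ cat sample3.txt
a 1 foo
b 2 bar
c 3 foo
d 4 baz
$ awk -f lastdup3.awk sample3.txt
b 2 bar
c 3 foo
d 4 baz

Here "foo" last occurs on line 3, so the line kept for it is "c 3 foo", and the output follows the order of the last occurrences (lines 2, 3 and 4).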
and there we have it.
This is exactly what I was looking for! Very clear and informative. Thank you!
This is interesting, but, even though it might be less efficient, I guess the simpler solution would have been something like
tac | awk '!a[$0]++' | tac
knowing that the tac utility is part of coreutils: tac reverses the line order, the classic idiom then keeps the first occurrence of each line (which is the last one in the original order), and the second tac restores the original order.
Well, that's the lazy solution! :-)