
Remove duplicates, keeping only the last occurrence

Again on IRC, somebody asked how to remove duplicates from a file, keeping only the last occurrence of each item. The classic awk idiom

awk '!a[$0]++'

prints only the first instance of every line. So if the input is, for example

foo
bar
baz
foo
xxx
yyy
bar

the "normal" output (ie using the classic idiom) would be

foo
bar
baz
xxx
yyy

whereas in this particular formulation of the task we want instead

baz
foo
xxx
yyy
bar

Of course, one may check a specific field rather than $0 (which is probably more useful), but the general technique is the same.
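
As a quick sanity check, here is the classic idiom run against the sample input above (using printf to feed the lines):

```shell
# The classic idiom prints a line only the first time it is seen.
printf 'foo\nbar\nbaz\nfoo\nxxx\nyyy\nbar\n' | awk '!a[$0]++'
# baz and the later duplicates are dropped:
# foo
# bar
# baz
# xxx
# yyy
```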

Turns out that the problem is not as simple as it may seem. Let's start by seeing how we can find out where the last occurrence of a key is in the file:

{pos[$0] = NR}

After reading the whole file, pos["foo"] for example will contain the record number where "foo" was last seen, that is, its last occurrence. (If we were looking for a specific field rather than $0 and we wanted to print the whole line, we would have to save it - this will be shown below after the example with $0 is complete; it doesn't really change the logic).
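
To see what ends up in pos[], we can dump a few entries after reading the sample input from above:

```shell
# pos[] maps each line to the record number (NR) of its last occurrence.
printf 'foo\nbar\nbaz\nfoo\nxxx\nyyy\nbar\n' |
  awk '{pos[$0] = NR} END{print pos["foo"], pos["bar"], pos["baz"]}'
# prints: 4 7 3
```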

Now that we have the pos[] array populated, we have to print its keys in ascending order of their values, which aren't known a priori (and we can only traverse the array using its keys).

At this point, one may think of doing some kind of sorting, but let's see whether it's possible to avoid that. For example, we can (using another common awk idiom) swap the keys and the values:

END {
  for(key in pos) reverse[pos[key]] = key
  ...

Now the array reverse[] uses record numbers as keys, and keys as values. We still don't know what those record numbers are, but now that they are used as indices, we can easily check whether a specific record number is present, so all we need is

  ...
  for(nr=1;nr<=NR;nr++)
    if(nr in reverse) print reverse[nr]
}

to print them in ascending order of the indices (ie, record numbers).
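
Here is a minimal sketch of the swap idiom in isolation, with a couple of hard-coded entries (taken from what the sample input would produce) instead of a real pass over a file:

```shell
awk 'BEGIN {
  pos["foo"] = 4; pos["bar"] = 7              # as if read from the sample input
  for (key in pos) reverse[pos[key]] = key    # swap keys and values
  for (nr = 1; nr <= 7; nr++)
    if (nr in reverse) print nr, reverse[nr]
}'
# prints:
# 4 foo
# 7 bar
```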

So the resulting awk code is

{pos[$0] = NR}
END {
  for(key in pos) reverse[pos[key]] = key
  for(nr=1;nr<=NR;nr++)
    if(nr in reverse) print reverse[nr]
}
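
Running the complete program on the sample input reproduces the expected output:

```shell
printf 'foo\nbar\nbaz\nfoo\nxxx\nyyy\nbar\n' |
  awk '{pos[$0] = NR}
       END {
         for (key in pos) reverse[pos[key]] = key
         for (nr = 1; nr <= NR; nr++)
           if (nr in reverse) print reverse[nr]
       }'
# baz
# foo
# xxx
# yyy
# bar
```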

Now the last detail: what if we wanted to check for duplicates on a specific field rather than the whole line? The code just needs to be changed slightly to remember the lines we need:

# for example, using $3 as a key
{pos[$3] = NR; lines[$3] = $0}
END {
  for(key in pos) reverse[pos[key]] = key
  for(nr=1;nr<=NR;nr++)
    if(nr in reverse) print lines[reverse[nr]]
}
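
A quick run of the field-based version on some made-up three-field input (the data here is invented purely for illustration; the key is the third field):

```shell
printf 'a 1 foo\nb 2 bar\nc 3 foo\n' |
  awk '{pos[$3] = NR; lines[$3] = $0}
       END {
         for (key in pos) reverse[pos[key]] = key
         for (nr = 1; nr <= NR; nr++)
           if (nr in reverse) print lines[reverse[nr]]
       }'
# the first "foo" line is dropped; the last one survives:
# b 2 bar
# c 3 foo
```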

and there we have it.


2 Comments

  1. gagagruau says:

    This is interesting, but although it might be less efficient, I guess the simpler solution would have been something like

    tac | awk '!a[$0]++' | tac

    knowing that the tac utility is part of coreutils.
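
For reference, that pipeline does reproduce the expected output on the sample input (assuming GNU tac is installed): the first tac reverses the file so the last occurrence of each line comes first, the classic idiom keeps those, and the second tac restores the original order.

```shell
printf 'foo\nbar\nbaz\nfoo\nxxx\nyyy\nbar\n' | tac | awk '!a[$0]++' | tac
# baz
# foo
# xxx
# yyy
# bar
```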
