Joining files with awk

Posted by waldner on 11 December 2010, 4:43 pm

Here the task is: given a number of input files (for example comma-separated), produce an output where each line is composed by the string concatenation of the corresponding lines from each input file. "Corresponding" means that lines in the input files have a key, and output lines should be joined based on that key. An example will make things clear.

$ cat input1.csv
1,line1
2,line2
3,line3
4,line4
$ cat input2.csv
1,aaaa
2,bbbb
3,cccc
4,dddd
6,eeee
7,ffff
$ cat input3.csv
2,xxxx
4,yyyy
6,zzzz

Data here is bogus, but the important points are:

The key is field 1. Here it's a number, but it can be anything (although that will affect output ordering). If there's no explicit key, line numbers may be used as an implicit key
Not all keys are present in all files. When a key was missing from an input file, the output should have an empty field.

Based on the above, the desired output is one of the following variations (or some modification thereof):

line1,aaaa,
line2,bbbb,xxxx
line3,cccc,
line4,dddd,yyyy
,eeee,zzzz
,ffff,

or, with keys prepended,

1,line1,aaaa,
2,line2,bbbb,xxxx
3,line3,cccc,
4,line4,dddd,yyyy
6,,eeee,zzzz
7,,ffff,

Essentially the output should be a matrix of M columns x N rows, where M is the number of input files, and N is the number of keys found across all input files. A variation may be introduced when the key is numeric, as one may or may not want to "fill in" the gaps between keys that don't exist in the input by creating lines of empty fields. In the above example, a line for key 5 could be created:

5,,,

How to accomplish this or not will be discussed later.

Generic method

This technique reads all the input files into an associative array, tracking M and N, and prints the array at the end.

# join1.awk
BEGIN { FS="," }
FNR==1 {
  # track number of input files (columns)
  ncols++
}
{
  # store data ($2) in the array
  data[$1,ncols] = $2
  keys[$1]
}
END {
  for(key in keys) {
    outline=key
    for(col=1;col<=ncols;col++) {
      outline = outline "," data[key,col]
    }
    print outline
  }
}

The above program loops over the keys using the for(key in keys) construct, which returns keys in random order. Here is a test run over the sample input:

$ awk -f join1.awk *.csv
4,line4,dddd,yyyy
6,,eeee,zzzz
7,,ffff,
1,line1,aaaa,
2,line2,bbbb,xxxx
3,line3,cccc,

(in this and the following examples, output lines have their keys prepended; modifying the code to not print keys is trivial).

This is appropriate for the general case, but may not give the desired result if keys are numeric and ordering is desired. GNU awk users can set WHINY_USERS to enforce key ordering, but let's look at a more general solutions.

Numeric keys, filling the gaps

So we want to print keys in order, and create any key that may be missing. The following code accomplishes that:

# join2.awk
BEGIN { FS="," }
FNR==1 {
  # track number of input files (columns)
  ncols++
}
{
  # store data ($2) in the array
  data[$1,ncols] = $2
  if($1>maxkey) maxkey=$1
}
END {
  for(key=1;key<=maxkey;key++) {
    outline=key
    for(col=1;col<=ncols;col++) {
      outline = outline "," data[key,col]
    }
    print outline
  }
}

Note that now it is not necessary to store keys in an array; since keys are numeric, it's enough to track the maximum value seen (and perhaps the minimum, if not known in advance - here we assume 1, but again it's easy to modify the code to track the minimum).

$ awk -f join2.awk *.csv
1,line1,aaaa,
2,line2,bbbb,xxxx
3,line3,cccc,
4,line4,dddd,yyyy
5,,,
6,,eeee,zzzz
7,,ffff,

Numeric keys, no added keys

If we don't want extra keys to be created, we need to reintroduce tracking of keys, and avoid printing a given key if it didn't exist in the input:

# join3.awk
BEGIN { FS="," }
FNR==1 {
  # track number of input files (columns)
  ncols++
}
{
  # store data ($2) in the array
  data[$1,ncols] = $2
  keys[$1]
  if($1>maxkey) maxkey=$1
}
END {
  for(key=1;key<=maxkey;key++) {
    if(key in keys) {
      outline=key
      for(col=1;col<=ncols;col++) {
        outline = outline "," data[key,col]
      }
      print outline
    }
  }
}

$ awk -f join3.awk *.csv
1,line1,aaaa,
2,line2,bbbb,xxxx
3,line3,cccc,
4,line4,dddd,yyyy
6,,eeee,zzzz
7,,ffff,

Filed under awk, faq, shell Tagged awk, csv, join, text processing

Comments are closed | Permalink

\1