Sorting by paragraph

Posted by waldner on 7 August 2010, 1:28 pm

This comes up in discussions and forums from time to time. Basically, the input is composed of paragraphs (ie, separated by runs of empty lines), and each paragraph has a specific value somewhere in it. The goal is to sort the text "by paragraph", according to this key, and the resulting output should still consist of paragraphs.

Sample input

In 1990, with help from Robert Cailliau, Tim Berners-Lee published a proposal
to build a "Hypertext project" called "WorldWideWeb" (one word, also "W3")
as a "web" of "hypertext documents" to be viewed by "browsers" using a 
client–server architecture.

In 1817, Karl von Drais presented his Laufmaschine ("running machine") 
in Mannheim. The draisine, as it became known, is generally regarded as 
the first ancestor of the modern bicycle.

The saxophone was invented by the belgian instrument maker Adolphe Sax in
1841, while he was working in Paris. The instrument has since become very
popular and is used in many different styles of music nowadays.

Despite a previous example dating back to 1890 in the Italian magazine
"Il Secolo Illustrato della Domenica", it is generally believed that the first
example of what we call "crossword" today was created by Arthur Wynne,
a journalist from Liverpool, in 1913. He called it "word-cross" puzzle.

The method of logarithms was publicly propounded in 1614, in a book 
entitled "Mirifici Logarithmorum Canonis Descriptio", by John Napier, Baron
of Merchiston, in Scotland.

We want to sort the inventions and discoveries in the input by their date. To that end,we rely on each paragraph having somewhere the word "in" followed by spaces or a newline, followed by some digits. The example is perhaps a bit unreal, and more structured data is likely to occur in real life; the important point is that the key can be unambiguously matched. If that is true, the proposed solutions will still be applicable.

Perl solution

So, our regular expression to identify the date in the paragraph is /\b[Ii]n\s+(\d+)/. The year is captured, so it will be available as $1 for subsequent processing. Here is some Perl code that performs paragraph sort according to the date it finds:

# paragraph.pl
$/="";
while(<>) {
  chomp;
  /\b[Ii]n\s+(\d+)/s;
  $par{$1} = $_; 
}
print join "\n", map { "$par{$_}\n" }  sort { $a <=> $b } keys %par;

Setting the special variable $/ to the empty string has the effect of input being read in paragraphs (using the "-00" - that's two zeros - command line switch has the same effect). In each paragraph, the date is found and used as a key in the hash "par". Paragraphs are chomp()ed to remove all their trailing newlines (which Perl would otherwise preserve).
The final line sorts (numerically) the hash keys, adds a newline to each paragraph, and joins all the paragraphs separated by a newline character. The result is:

$ perl paragraph.pl sample.txt
The method of logarithms was publicly propounded in 1614, in a book
entitled "Mirifici Logarithmorum Canonis Descriptio", by John Napier, Baron
of Merchiston, in Scotland.

In 1817, Karl von Drais presented his Laufmaschine ("running machine")
in Mannheim. The draisine, as it became known, is generally regarded as
the first ancestor of the modern bicycle.

The saxophone was invented by the belgian instrument maker Adolphe Sax in
1841, while he was working in Paris. The instrument has since become very
popular and is used in many different styles of music nowadays.

Despite a previous example dating back to 1890 in the Italian magazine
"Il Secolo Illustrato della Domenica", it is generally believed that the first
example of what we call "crossword" today was created by Arthur Wynne,
a journalist from Liverpool, in 1913. He called it "word-cross" puzzle.

In 1990, with help from Robert Cailliau, Tim Berners-Lee published a proposal
to build a "Hypertext project" called "WorldWideWeb" (one word, also "W3")
as a "web" of "hypertext documents" to be viewed by "browsers" using a
client–server architecture.

Since this solution uses a hash, if two or more paragraphs have the same key, only the last will be printed, so this should be considered before using this solution.

It's also possible to use a Schwartzian transform and write the whole program like

$/=""; print join "\n", map { "$_->[0]\n" } sort { $a->[1] <=> $b->[1] } map { chomp; [ $_, /\b[Ii]n\s+(\d+)/s ] } <>;

which notably doesn't use any hash, and thus avoids the problem described previously. (Also, real Perl masters will surely do better.)

Awk solution

Awk can be instructed to read paragraphs as records by setting the special built-in variable RS to the empty string. Awk will recognize paragraphs separated by one or more empty lines but, unlike Perl, it will automatically remove the trailing newline characters from $0.

Since sorting is involved, a distinction needs to be made between GNU awk and standard awk.

GNU awk

With GNU awk, it is possible to use the years as keys and use asorti() to sort. Keys are always strings, but if we assume that we will always have a four-digit year, then alphabetic sort will also sort in numeric order (otherwise, it's trivial to zero-pad the keys using sprintf()). Here is GNU awk code to solve the task:

BEGIN{RS=""}
match($0, /\y[Ii]n[[:space:]]+[0-9]+/) {
  year = substr($0,RSTART,RLENGTH)
  sub(/^[Ii]n[[:space:]]+/,"",year)
  par[year] = $0
}
END{
  n=asorti(par,npar)
  for(i=1;i<=n;i++){
    print s par[npar[i]]
    s = "\n"
  }
}

This uses a second array (npar) to store the sorted keys, and prints the result using the concatenation idiom. The \y in the regular expression is GNU awk's equivalent to Perl's \b (word boundary).

However, we're still using years as keys, and so this solution is not good if there are repeated keys. To solve this problem, we can use a suffix for the year (eg, 1932_00, 1932_01 etc.) to avoid duplicate keys:

BEGIN{RS=""}
match($0, /\y[Ii]n[[:space:]]+[0-9]+/) {
  year = substr($0,RSTART,RLENGTH)
  sub(/^[Ii]n[[:space:]]+/,"",year)
  par[year sprintf("_%02d", count[year]++)] = $0
}
END{
  n=asorti(par,npar)
  for(i=1;i<=n;i++){
    print s par[npar[i]]
    s = "\n"
  }
}

The array count[], well, counts the occurrences of every key, and is used to create the suffix to append to the key. The rest of the program is exactly the same as before.

However, there is another approach to the problem, which is explained in the next paragraph.

Standard awk

With standard awk, there are no sort facilities (unless one implements their own, of course). So we have to use the external sort command for sorting. To avoid the problems caused by duplicated keys, we could use the same approach described for GNU awk, except the array holding the keys would have to be sorted using the external program sort. But here we show another approach: the key is prepended to its paragraph separated by "_", and then sort is instructed to sort numerically on the first "_"-separated field. Before doing this, since sort operates on lines, we have to somehow turn each paragraph into a line; this will be accomplished by replacing all the newlines with SUBSEP (octal \034). Of course the change will be undone (and the prefix removed) before printing the result.

Code:

BEGIN{RS=""}
match($0, /(^|[[:space:]])[Ii]n[[:space:]]+[0-9]+/) {
  year = substr($0, RSTART, RLENGTH)
  match(year, /[0-9]+$/)
  year = substr(year, RSTART, RLENGTH)
  gsub(/\n/, SUBSEP)
  par[++count] = year "_" $0
}
END{
  tempfile = "/tmp/f6SwPu2"
  command = "sort -t _ -k1,1n > " tempfile
  # sort the array...
  for(i=1;i<=count;i++) print par[i] | command
  close(command)
  i=0
  # read it back, but reset RS first...
  RS = "\n"
  while((getline par[++i] < tempfile) > 0)
    ;
  close(tempfile)
  # restore newlines and exclude keys from printed data
  for(i=1;i<=count;i++){
    gsub(SUBSEP, "\n", par[i])
    print s substr(par[i], index(par[i],"_")+1)
    s = "\n"
  }
}

Standard awk does not have \y, so we have to approximate it with the alternation "beginning of string or [[:space:]]".

Note that the ordering produced by the above code may differ from that obtained from the solution with the numeric suffix appended to the key. This is because in the former case sort is effectively seeing duplicate keys, and resolve ties by comparing the whole line, so this may result in reordering of equal-key paragraphs compared to the original input. If this is not desired, it is possible to disable sort's "last resort" comparison by supplying the "-s" (for stable) option to sort. However, it is a nonstandard option, and not all implementations support it (GNU sort does).

Filed under awk, shell, tips Tagged awk, paragraph, perl, text processing

Comments are closed | Permalink

\1