
CentOS network install behind a proxy

Warning: this is a gross hack, which may or may not work for you. It did work for me, but that doesn't really mean much. It's also very inefficient and resource-intensive. However, it's quick and dirty, and if you have no alternatives, it may be worth a try.

As most people sadly discover (the writer among them), the CentOS net install does not support installation behind an HTTP proxy.

But still, let's see if it's possible to work around that limitation.

Plan

This started out as a crazy idea but turned out to actually work (at least with my proxies), so hopefully it will be useful to other people.
I had been playing with socat lately, and I figured it could help for this task.

The idea is: trick the CentOS installer into thinking it's talking directly with the mirror, but instead send its requests to the local office proxy. Of course they can't just be forwarded unchanged; we need to intercept and mangle them into a format that is palatable to the proxy.

The basic technique is very simple: take requests like this, coming from the installer

GET /something HTTP/1.0      # or 1.1
Host: somehost
...rest of headers here...

turn them into

GET http://some.centos.mirror/something HTTP/1.0     # or 1.1
Host: some.centos.mirror
...rest of headers here...

and forward this to the proxy. With a little luck, requests will always have that form, and the proxy will like the modified version.
Also, obviously, forward the responses coming from the proxy back to the installer.

All that is needed is socat, bash, perl or awk, and the name of a CentOS mirror (and a pair of crossed fingers).

Implementation

We're using a host called fake.example.com where socat is installed. To make it work, we need to point the CentOS installer at this host, either by name or IP address.

The basic socat command to run is this:

$ socat TCP-L:4444 EXEC:some_mangling_code

Socat spawns the mangling code and connects itself to the code's standard input and output. So what the code needs to do is to edit the data it receives on standard input, send it to the local proxy, read the replies and print them on standard output, where socat will read them and forward them back to the installer.

The mangling code to achieve the transformation described above is quite straightforward. The text editing part can be easily implemented in sed, awk or perl, and another instance of socat can be used to connect to the proxy (netcat could be used as well). Here is an example bash script to implement it:

#!/bin/bash
# mangle.sh: invoke as "mangle.sh <mirror name> <proxy_address> <proxy_port>"

mirror=$1
proxy_addr=$2
proxy_port=$3

{ socat - TCP:"$proxy_addr":"$proxy_port" && exec 1>&- ; } < <(
  gawk -v mirror="$mirror" '/^Host: /{$0="Host: " mirror "\r"}
  /^GET /{$2="http://" mirror "/" $2}
  {print; fflush("")}' )

If you have netcat installed, you may use that rather than socat, as it's probably a somewhat lighter process to spawn:

#!/bin/bash
# mangle.sh: invoke as "mangle.sh <mirror name> <proxy_address> <proxy_port>"

mirror=$1
proxy_addr=$2
proxy_port=$3

{ nc "$proxy_addr" "$proxy_port" && exec 1>&- ; } < <(
  gawk -v mirror="$mirror" '/^Host: /{$0="Host: " mirror "\r"}
  /^GET /{$2="http://" mirror "/" $2}
  {print; fflush("")}' )

In any case, the code MUST be written that way, and not in some simpler or more obvious way, due to some subtleties in the way the shell handles pipelines and file descriptors. This is interesting enough that it deserves an article of its own; I won't go into the details here. But the above code (with process substitution and explicitly closing descriptor 1) did work for me, while other variations did not. Also, the buffer flushing code is vital to send data to the proxy promptly; otherwise awk would buffer its output, since its stdout is not connected to a terminal.

Lines should be terminated by CR+LF as dictated by the standard, which is why the new Host: header explicitly ends with "\r" (awk's print supplies the LF). What the awk code does is: if the line starts with "Host: ", replace the whole line with a brand new Host: header with the name of the chosen mirror; if the line starts with "GET ", prepend "http://<mirror name>/" to the path that is specified. All other lines are printed verbatim.
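To see the effect of the awk rules in isolation, they can be fed a hand-written request. This is only a sketch: the request below is made up (the path and the Host value are assumptions, not captured from a real installer), the function name is mine, fflush() is left out because the input is a finite pipe, and tr strips the carriage returns purely for display.

```shell
#!/bin/bash
# Sketch: run the article's awk rewriting rules on a made-up request.
# The path and the Host value below are hypothetical examples.

mangle_request() {
  # $1 = mirror name; same rules as in mangle.sh, minus the flushing
  awk -v mirror="$1" '
    /^Host: /{$0 = "Host: " mirror "\r"}
    /^GET /{$2 = "http://" mirror "/" $2}
    {print}'
}

printf 'GET /centos/5/os/i386/repodata/repomd.xml HTTP/1.0\r\nHost: fake.example.com:4444\r\n\r\n' |
  mangle_request centos.cict.fr | tr -d '\r'
```

Since the path in the request already starts with a slash, the rewritten URL ends up with two slashes after the mirror name; whether that is acceptable depends on the proxy and the mirror.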

I'm using the CentOS mirror "centos.cict.fr" here, but it's only an example and any mirror can be used, as long as the appropriate CentOS directory for the mirror is used in the installer text box. Here is the list of mirrors from the official page: North America, Europe, other regions. Pick one that is close to you, and note the CentOS directory to use for it.

So in the end here's the complete command to start the socat redirector/mangler:

socat TCP-L:4444,reuseaddr,fork EXEC:"./mangle.sh centos.cict.fr localproxy.example.com 8080"

Since multiple TCP requests are likely to come from the installer, the fork option to socat spawns a child instance to manage each request, while keeping the main instance listening for future requests. Note that the above mechanism will spawn a ridiculous number of processes, so keep this in mind.

Test it

I was surprised that this worked at all. Here are some screenshots:

As said, the "Web site name" should point to the redirector, but the "CentOS directory" should match the real CentOS directory on the real mirror that was specified as argument to mangle.sh.

The installer seems to like it:

In fact, from here the installation can be completed without problems, at least in my tests.

And finally:

We were able to fool the installer.

Socat 2

Socat 2 is still in beta, but according to the documentation, its new address chains feature should make the task described in this article easier.

There is an example at this page demonstrating unidirectional EXEC addresses:

socat - "exec1:gzip % exec1:gunzip | tcp:remotehost:port"

This can be used for our task, but it needs some modification:

  • Of course we want to edit the stream, not compress it. In the "left to right" direction, we'll apply some mangling code (similar to the one we used with socat 1, but without the socat/netcat part); in the "right to left" direction (replies from the proxy), we can use the special NOP address to pass everything unchanged, as we're not mangling the replies;
  • We want a listening "server" rather than stdin/stdout;
  • Multiple distinct TCP streams are likely to come from the installer, so we need to fork a child to service each of them.

So here's a socat 2 version of the CentOS installer proxifier:

socat TCP-L:4444,reuseaddr,fork "EXEC1:./mangle.awk centos.cict.fr % NOP | TCP:localproxy.example.com:8080"

And here's a sample mangle.awk:

#!/usr/bin/gawk -f
# mangle.awk: invoke as "mangle.awk <mirror name>"

BEGIN{mirror = ARGV[1]; ARGC--}
/^Host: /{$0 = "Host: " mirror "\r"}
/^GET /{$2 = "http://" mirror "/" $2}
{print; fflush("")}

or if you prefer Perl:

#!/usr/bin/perl -p
# mangle.pl: invoke as "mangle.pl <mirror name>"

BEGIN{use IO::Handle; autoflush STDOUT; $mirror = $ARGV[0]; shift}
s%^Host: .*%Host: ${mirror}\r%; s%^GET (.*)%GET http://${mirror}/$1%;

From a few tests, the socat 2 version seems to run fine most of the time, but there are some occasional hiccups where the installer reports an error and has to be told to retry the operation, after which it usually succeeds (not investigated).

Conclusions

Please note that a huge, HUGE number of processes (thousands) are spawned on the redirector, especially in the last stage of the install, where individual packages are downloaded. This is very resource-intensive for the machine where the redirector runs, so be considerate when choosing where to implement the redirector. Perhaps using netcat instead of another instance of the (heavier) socat in the shell script could be marginally better, but it doesn't change the number of processes that are spawned.

Also, it makes a number of assumptions about the requests and replies exchanged by client and server, and it will most likely break if something deviates from those assumptions (eg, HTTP redirects just to name one).

Nevertheless, despite being such a crude hack, it seems to work surprisingly well, at least in the environment where it needed to run. There was not even any need to mangle the replies from the proxy. As usual, YMMV.

Anyway, let's hope that future releases will have native proxy support!

Access partitions in non-disk block devices with kpartx

Ever wondered why for normal disk devices (eg /dev/sda), device files for the contained partitions are usually available (eg /dev/sda1 etc.), while for other non-disk devices (eg, disk images, LVM or software RAID volumes) there are no such device files? How can such partitions be accessed?

A typical scenario is an LVM logical volume that is used as virtual disk by a guest VM, and the guest OS creates partitions on it. On the host, you just see, say, /dev/vg0/guestdisk, yet it does contain partitions:

# sfdisk -l /dev/mapper/vg0-guestdisk 

Disk /dev/mapper/vg0-guestdisk: 4568 cylinders, 255 heads, 63 sectors/track
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/mapper/vg0-guestdisk1   *      0+   4376    4377-  35158221   83  Linux
/dev/mapper/vg0-guestdisk2       4377    4567     191    1534207+  82  Linux swap / Solaris
/dev/mapper/vg0-guestdisk3          0       -       0          0    0  Empty
/dev/mapper/vg0-guestdisk4          0       -       0          0    0  Empty

But those mysterious devices /dev/mapper/vg0-guestdisk1 etc. are nowhere to be found.

The same can happen for plain disk images:

# sfdisk -l guest.img
Disk guest.img: cannot get geometry

Disk guest.img: 1305 cylinders, 255 heads, 63 sectors/track
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
guest.img1   *      0+    497     498-   4000153+  83  Linux
guest.img2        498    1119     622    4996215   83  Linux
guest.img3       1120    1304     185    1486012+  82  Linux swap / Solaris
guest.img4          0       -       0          0    0  Empty

and also for some software (md) RAID devices.

Anyway, in all these cases, it sometimes happens that one needs to do "something" with the inner partitions (eg, mount them, or recreate or resize a file system, etc.). That obviously needs a device node to use, to avoid losing sanity. Here's where the neat utility kpartx saves the day.

Basically, kpartx scans a device or file, applies some magic to detect the partition table in it, and creates devices corresponding to those partitions. Since it uses the device mapper, the devices it creates go under /dev/mapper. This may be somewhat confusing, because that is also where other device mapper devices live (LVM volumes, SAN multipath devices), and kpartx may well be run against those.

Depending on the distribution, kpartx comes either as part of multipath-tools, or packaged separately.

Some examples

So let's take a partitioned LVM logical volume, the one shown in the previous example:

# kpartx -l /dev/mapper/vg0-guestdisk
vg0-guestdisk1 : 0 70316442 /dev/mapper/vg0-guestdisk 63
vg0-guestdisk2 : 0 3068415 /dev/mapper/vg0-guestdisk 70316505

With -l, kpartx only displays what it found and the devices it would create, but doesn't actually create them. To create them, use -a:

# kpartx -a /dev/mapper/vg0-guestdisk

Nothing seems to happen, but let's have a look under /dev/mapper:

# ls -l /dev/mapper/vg0-guestdisk*
brw-rw---- 1 root disk 251, 0 2010-09-24 18:57 /dev/mapper/vg0-guestdisk
brw-rw---- 1 root disk 251, 3 2010-09-24 18:54 /dev/mapper/vg0-guestdisk1
brw-rw---- 1 root disk 251, 4 2010-09-24 18:54 /dev/mapper/vg0-guestdisk2

And now we can access them just fine:

# mount /dev/mapper/vg0-guestdisk1 /mnt
# ls /mnt
bin  boot  cdrom  dev  etc  home  initrd  initrd.img  initrd.img.old  lib  lost+found  media  mnt  opt  proc  root  sbin  srv  sys  tmp  usr  var  vmlinuz  vmlinuz.old

But what happened? Let's have a look. After all, the new devices are just device maps (yes, on top of the main logical volume, which is itself a device map):

# dmsetup table /dev/mapper/vg0-guestdisk1
0 70316442 linear 251:0 63
# dmsetup table /dev/mapper/vg0-guestdisk2
0 3068415 linear 251:0 70316505

What the above fields mean is as follows (values for the first of the two maps):

  • 0: starting block of the map
  • 70316442: number of blocks in the map (in this case this is the total number of blocks in the "device")
  • linear: mapping mode. Linear just means that: blocks are mapped sequentially from the source to this map
  • 251:0: mapped device; here it's the main logical volume vg0-guestdisk, as could be seen from the previous ls output
  • 63: starting block on the mapped device; this means that block 0 in the vg0-guestdisk1 map corresponds to block 63 in the vg0-guestdisk logical volume, block 1 here corresponds to block 64 there, etc.

A block here is 512 bytes, which means that 70316442 blocks are 36002018304 bytes, or about 33GiB or 36GB, depending on whether you like binary or decimal units (in case anybody cares at all, that is).
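The same arithmetic can be applied to the whole kpartx -l listing shown earlier. The following is a small sketch, assuming the six-column output format seen above (name, colon, 0, length, device, start) and 512-byte blocks; the function name is made up for the example.

```shell
#!/bin/bash
# Sketch: convert the block counts printed by "kpartx -l" into bytes.
# Assumes the output format shown in the article and 512-byte blocks.

kpartx_blocks_to_bytes() {
  # $4 = length in blocks, $6 = starting block on the parent device
  awk '{ printf "%s: offset %.0f bytes, size %.0f bytes\n", $1, $6 * 512, $4 * 512 }'
}

# Listing copied from the example above, so no real device is needed
kpartx_blocks_to_bytes <<'EOF'
vg0-guestdisk1 : 0 70316442 /dev/mapper/vg0-guestdisk 63
vg0-guestdisk2 : 0 3068415 /dev/mapper/vg0-guestdisk 70316505
EOF
```

For the first map this gives an offset of 32256 bytes and a size of 36002018304 bytes, matching the figure above. Note also that 63 + 70316442 = 70316505, so the second partition starts exactly where the first one ends.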

As a small aside just for completeness, I said that the "partitioned" device (/dev/mapper/vg0-guestdisk) is itself a device map, so here it is:

# dmsetup table /dev/mapper/vg0-guestdisk
0 73400320 linear 104:3 384

Which shows that this logical volume is a linear map (LVM also allows for striped maps) built on top of the device with major 104 and minor 3, which on this system is nothing else than /dev/cciss/c0d0p3, a partition in an HP hardware RAID volume, which was previously turned into an LVM physical volume and added to the volume group vg0.
For an excellent introduction to the device mapper, which is what LVM, multipath devices and some disk encryption technologies are built upon, I suggest this Linux Gazette article which is quite enlightening.

Disk images

For disk images, kpartx can still be used, but since they are not real block devices, a block device needs to be associated to the file first. This sounds like a job for loopback devices, and indeed kpartx is smart enough to associate a loopback device automatically if it sees that what it's being asked to use is not a real block device:

# losetup -a    # no loop devices in use now
# kpartx -a guest.img
# losetup -a
/dev/loop0: [6801]:131312 (guest.img)
# ls -l /dev/mapper/loop0*
brw-rw---- 1 root disk 251, 5 2010-09-24 23:22 /dev/mapper/loop0p1
brw-rw---- 1 root disk 251, 7 2010-09-24 23:22 /dev/mapper/loop0p2

Needless to say, /dev/mapper/loop0p1 and /dev/mapper/loop0p2 are maps that reference /dev/loop0 (which in turn is associated to our image file).

Conclusion

When the devices created by kpartx are no longer needed, the maps can be removed (either manually using dmsetup remove, or with kpartx -d). The devices should also be removed before the partitioning is changed (with fdisk, etc.) because it seems that otherwise kpartx sometimes is not able to delete the old maps, giving errors like "ioctl: LOOP_CLR_FD: Device or resource busy" when trying to delete or update the old maps. So to be safe, it's better to run kpartx -d, change the partitions, then again kpartx -a. If old maps lie around and are accidentally used, disaster is likely as they will be referencing the start and end of partitions that no longer exist, resulting in mapping now-unrelated parts of the device.

kpartx makes working with embedded partitions much easier, a scenario especially common in virtualization.

kpartx can handle different types of partition tables besides the classical DOS format, including BSD, Solaris, Sun and GPT (not tried, but it would seem so by looking at the source).

Finally, kpartx can be used manually on the command line, but it can also be integrated in udev rules to run automatically when the main device is created, so the corresponding devices for the partitions are created too. For example, many distributions run kpartx in a udev rule when a multipath device (eg /dev/mapper/mpath1 etc.) is created, so its partitions will show up as well as the main device.

Smart ranges in awk

Yes, we all know that awk has builtin support for range expressions, like

# print lines from /BEGIN/ to /END/, inclusive
awk '/BEGIN/,/END/'

Sometimes however, we need a bit more flexibility. We might want to print lines between two patterns, but excluding the patterns themselves. Or only including one. A way to achieve the result is to use something like these:

# print lines from /BEGIN/ to /END/, not inclusive
awk '/BEGIN/,/END/{if (!/BEGIN/&&!/END/)print}'

# print lines from /BEGIN/ to /END/, not including /BEGIN/
awk '/BEGIN/,/END/{if (!/BEGIN/)print}'

However, these have a problem. With this input, for example:

1 BEGIN
2 foo
3 bar
4 BEGIN
5 baz
6 END

the BEGIN at line 4 will not be printed, although we probably want it to be, since it lies inside the range. Even if those were correct, they are quite clunky, and there must be a better way to select the lines that we want, and in fact there is. This is another typical awk idiom. We can use a flag to keep track of whether we are currently inside the interesting range or not, and print lines based on the value of the flag. Let’s see how it’s done:

# print lines from /BEGIN/ to /END/, not inclusive
awk '/END/{p=0};p;/BEGIN/{p=1}'

# print lines from /BEGIN/ to /END/, excluding /END/
awk '/END/{p=0} /BEGIN/{p=1} p'

# print lines from /BEGIN/ to /END/, excluding /BEGIN/
awk 'p; /END/{p=0} /BEGIN/{p=1}'

All these programs just set p to 1 when /BEGIN/ is seen, and set p to 0 when /END/ is seen. The crucial difference between them is where the bare "p" (the condition that triggers the printing of lines) is located. Depending on its position (at the beginning, in the middle, or at the end), different parts of the desired range are printed. To print the complete range (inclusive), you can just use the regular /BEGIN/,/END/ expression or use the flag technique, but reversing the order of the conditions and associated patterns:

# print lines from /BEGIN/ to /END/, inclusive
awk '/BEGIN/{p=1};p;/END/{p=0}'
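To compare the flag-based variants side by side, they can all be run on the six-line sample input from above. This is just a sketch; the function names are mine and only serve to label the output.

```shell
#!/bin/bash
# Sketch: the four flag-based variants, run on the article's sample input.

sample() {
  printf '%s\n' '1 BEGIN' '2 foo' '3 bar' '4 BEGIN' '5 baz' '6 END'
}

range_exclusive() { awk '/END/{p=0};p;/BEGIN/{p=1}'; }   # neither delimiter
range_no_end()    { awk '/END/{p=0} /BEGIN/{p=1} p'; }   # /BEGIN/ but not /END/
range_no_begin()  { awk 'p; /END/{p=0} /BEGIN/{p=1}'; }  # /END/ but not /BEGIN/
range_inclusive() { awk '/BEGIN/{p=1};p;/END/{p=0}'; }   # both delimiters

for f in range_exclusive range_no_end range_no_begin range_inclusive; do
  echo "== $f"
  sample | "$f"
done
```

In every variant the inner line "4 BEGIN" is printed, because the flag is already set when that line is read; only the first BEGIN and the final END are affected by where the bare "p" sits.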

It goes without saying that while we are only printing lines here, the important thing is that we have a way of selecting lines within a range, so you can of course do anything you want instead of printing. And of course /BEGIN/ and /END/ should be changed to match the lines you want to select as starting and ending points.

UPDATE 16/10/10: a file may have many /BEGIN/,/END/ ranges. What if one wants to print, say, only the fourth such range? The solutions using flags are trivially modified by adding a counter, and only printing when the counter is four, or whatever instance is desired:

# print lines in the n-th /BEGIN/,/END/ range, not inclusive
awk -v n=4 '/END/{p=0}; p && c == n; /BEGIN/ && !p {p=1; c++}'

The other cases can be adapted similarly.
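As an illustration, here is the counter technique run on a made-up input with three ranges, together with one possible inclusive adaptation. The inclusive version is my own guess at how "adapted similarly" could look (the counter is moved along with the /BEGIN/ rule); the function names and the input are hypothetical.

```shell
#!/bin/bash
# Sketch: select only the n-th range; the input is a made-up example.

nth_range_exclusive() {
  # same program as above, with n passed as the first argument
  awk -v n="$1" '/END/{p=0}; p && c == n; /BEGIN/ && !p {p=1; c++}'
}

# Inclusive adaptation (an assumption, not taken from the article)
nth_range_inclusive() {
  awk -v n="$1" '/BEGIN/ && !p {p=1; c++}; p && c == n; /END/{p=0}'
}

input() { printf '%s\n' BEGIN a END x BEGIN b c END BEGIN d END; }

echo "== second range, not inclusive"
input | nth_range_exclusive 2
echo "== second range, inclusive"
input | nth_range_inclusive 2
```

The first call prints only "b" and "c"; the second also prints the BEGIN and END lines that delimit them.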

“Zero or more”

This usually comes up in a form similar to "I'm running this code:

$ echo "foobar 123" | sed 's/[0-9]*/blah/'
blahfoobar 123

but I don't get the expected result!" (which they think should be "foobar blah"). The result they get is instead as shown above.

Why is that? Remember that the star quantifier "*" means "zero or more" times of whatever is quantified. So in particular, "zero times" is perfectly fine. As a matter of fact, there's a match for "zero times [0-9]" just at the beginning of our string, and that's what sed replaces. Yes, it's a zero-length match (but, can a match of "zero times" something be non-zero-length?). Yet, it's still a perfectly valid match.

But there's more. By convention, there's a match for "zero times anything" between any two characters of the string, and also another one at the end:

$ echo "foobar" | sed 's/[0-9]*/BLAH/g'
BLAHfBLAHoBLAHoBLAHbBLAHaBLAHrBLAH

So why, in the original example, does sed replace the first match even if there is a longer match later in the string (that is, "123")? Shouldn't quantifiers be greedy and always find the longest match?

Sure, quantifiers are greedy, but greediness matters only when multiple matches are possible at the same position. So if our pattern is /[0-9]*/ and we are matching against "2847293745 foo", surely "2847293745" is matched, rather than "284729374", "28472937" or the other possible shorter matches down to "2" and the empty string: of all the possible matches at the same position, the longest is chosen.

But this is not the case in our original example. In it, there's no contention: there is only a single possible match at the beginning of the string, and the regex engine is content with it.

The usual (although somewhat simplistic) description for this behavior is "leftmost longest match": stop at the first match that is found starting from the left, and if at that position multiple matches are possible, take the longest one (the aforementioned greediness). Our sed took the leftmost match, which is also the longest at that position, even if it has length zero.

Another pitfall which results from the same misconception is for example grepping for '[0-9]*' and wondering why all the input lines are printed. It's because they all match the pattern.
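The grep pitfall is easy to reproduce. A minimal sketch on three made-up lines (one without digits, one with digits, one empty), counting matching lines with grep -c:

```shell
#!/bin/bash
# Sketch: count matching lines for "zero or more" vs "one or more" digits.

sample_lines() { printf '%s\n' 'foobar' 'foobar 123' ''; }

sample_lines | grep -c '[0-9]*'        # every line matches, even the empty one
sample_lines | grep -c '[0-9][0-9]*'   # only the line that contains a digit
```

The first count is 3 and the second is 1.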

To sum it up, what people really mean when they write this code is usually "one or more", not "zero or more", so they really want this:

$ echo "foobar 123" | sed 's/[0-9][0-9]*/blah/'
foobar blah

There are other ways to express "one or more". With BREs (Basic Regular Expressions, those used by sed or grep) it is possible to say

[0-9]\{1,\}

or, with GNU tools, also

[0-9]\+

With ERE (Extended Regular Expressions) tools, like awk or grep -E, "one or more" can be expressed as

[0-9]{1,}

or simply

[0-9]+
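As a quick check, the different spellings can be tried on the input from the original example. This is a sketch; the -E option is assumed to be accepted by the local sed (GNU and BSD sed both have it), and the GNU-only \+ form is left out.

```shell
#!/bin/bash
# Sketch: four spellings of "one or more digits" on the article's input.

input='foobar 123'

echo "$input" | sed 's/[0-9][0-9]*/blah/'        # plain BRE
echo "$input" | sed 's/[0-9]\{1,\}/blah/'        # BRE interval
echo "$input" | sed -E 's/[0-9]+/blah/'          # ERE; assumes sed supports -E
echo "$input" | awk '{sub(/[0-9]+/, "blah")} 1'  # ERE in awk
```

All four commands print "foobar blah".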

Sorting by paragraph

This comes up in discussions and forums from time to time. Basically, the input is composed of paragraphs (ie, separated by runs of empty lines), and each paragraph has a specific value somewhere in it. The goal is to sort the text "by paragraph", according to this key, and the resulting output should still consist of paragraphs.

Sample input

In 1990, with help from Robert Cailliau, Tim Berners-Lee published a proposal
to build a "Hypertext project" called "WorldWideWeb" (one word, also "W3")
as a "web" of "hypertext documents" to be viewed by "browsers" using a
client–server architecture.

In 1817, Karl von Drais presented his Laufmaschine ("running machine")
in Mannheim. The draisine, as it became known, is generally regarded as
the first ancestor of the modern bicycle.

The saxophone was invented by the belgian instrument maker Adolphe Sax in
1841, while he was working in Paris. The instrument has since become very
popular and is used in many different styles of music nowadays.

Despite a previous example dating back to 1890 in the Italian magazine
"Il Secolo Illustrato della Domenica", it is generally believed that the first
example of what we call "crossword" today was created by Arthur Wynne,
a journalist from Liverpool, in 1913. He called it "word-cross" puzzle.

The method of logarithms was publicly propounded in 1614, in a book
entitled "Mirifici Logarithmorum Canonis Descriptio", by John Napier, Baron
of Merchiston, in Scotland.

We want to sort the inventions and discoveries in the input by their date. To that end, we rely on each paragraph having somewhere the word "in" followed by spaces or a newline, followed by some digits. The example is perhaps a bit unreal, and more structured data is likely to occur in real life; the important point is that the key can be unambiguously matched. If that is true, the proposed solutions will still be applicable.

Perl solution

So, our regular expression to identify the date in the paragraph is /\b[Ii]n\s+(\d+)/. The year is captured, so it will be available as $1 for subsequent processing. Here is some Perl code that performs paragraph sort according to the date it finds:

# paragraph.pl
$/="";
while(<>) {
  chomp;
  /\b[Ii]n\s+(\d+)/s;
  $par{$1} = $_;
}
print join "\n", map { "$par{$_}\n" }  sort { $a <=> $b } keys %par;

Setting the special variable $/ to the empty string has the effect of input being read in paragraphs (using the "-00" - that's two zeros - command line switch has the same effect). In each paragraph, the date is found and used as a key in the hash "par". Paragraphs are chomp()ed to remove all their trailing newlines (which Perl would otherwise preserve).
The final line sorts (numerically) the hash keys, adds a newline to each paragraph, and joins all the paragraphs separated by a newline character. The result is:

$ perl paragraph.pl sample.txt
The method of logarithms was publicly propounded in 1614, in a book
entitled "Mirifici Logarithmorum Canonis Descriptio", by John Napier, Baron
of Merchiston, in Scotland.

In 1817, Karl von Drais presented his Laufmaschine ("running machine")
in Mannheim. The draisine, as it became known, is generally regarded as
the first ancestor of the modern bicycle.

The saxophone was invented by the belgian instrument maker Adolphe Sax in
1841, while he was working in Paris. The instrument has since become very
popular and is used in many different styles of music nowadays.

Despite a previous example dating back to 1890 in the Italian magazine
"Il Secolo Illustrato della Domenica", it is generally believed that the first
example of what we call "crossword" today was created by Arthur Wynne,
a journalist from Liverpool, in 1913. He called it "word-cross" puzzle.

In 1990, with help from Robert Cailliau, Tim Berners-Lee published a proposal
to build a "Hypertext project" called "WorldWideWeb" (one word, also "W3")
as a "web" of "hypertext documents" to be viewed by "browsers" using a
client–server architecture.

Since this solution uses a hash, if two or more paragraphs have the same key, only the last will be printed, so this should be considered before using this solution.

It's also possible to use a Schwartzian transform and write the whole program like

$/=""; print join "\n", map { "$_->[0]\n" } sort { $a->[1] <=> $b->[1] } map { chomp; [ $_, /\b[Ii]n\s+(\d+)/s ] } <>;

which notably doesn't use any hash, and thus avoids the problem described previously. (Also, real Perl masters will surely do better.)

Awk solution

Awk can be instructed to read paragraphs as records by setting the special built-in variable RS to the empty string. Awk will recognize paragraphs separated by one or more empty lines but, unlike Perl, it will automatically remove the trailing newline characters from $0.
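A minimal sketch of paragraph mode, on a made-up input whose paragraphs are separated by one and by two empty lines, with an extra empty line at the end. FS is set to newline here only so that NF counts the lines of each paragraph; that is a choice made for this example, not something the solutions below need.

```shell
#!/bin/bash
# Sketch: awk paragraph mode on a made-up three-paragraph input.

paragraphs() {
  # separators: one empty line, then two, then a trailing empty line
  printf 'first\nparagraph\n\nsecond\n\n\nthird\n\n'
}

# Print each record's number, its number of lines and its first line.
# With RS="" newline always separates fields; FS="\n" makes it the only
# separator, so each field is one line of the paragraph.
paragraphs | awk 'BEGIN{RS=""; FS="\n"} {print NR ": " NF " line(s), starts with " $1}'
```

Three records are reported, and the runs of empty lines do not produce empty records.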

Since sorting is involved, a distinction needs to be made between GNU awk and standard awk.

GNU awk

With GNU awk, it is possible to use the years as keys and use asorti() to sort. Keys are always strings, but if we assume that we will always have a four-digit year, then alphabetic sort will also sort in numeric order (otherwise, it's trivial to zero-pad the keys using sprintf()). Here is GNU awk code to solve the task:

BEGIN{RS=""}
match($0, /\y[Ii]n[[:space:]]+[0-9]+/) {
  year = substr($0,RSTART,RLENGTH)
  sub(/^[Ii]n[[:space:]]+/,"",year)
  par[year] = $0
}
END{
  n=asorti(par,npar)
  for(i=1;i<=n;i++){
    print s par[npar[i]]
    s = "\n"
  }
}

This uses a second array (npar) to store the sorted keys, and prints the result using the concatenation idiom. The \y in the regular expression is GNU awk's equivalent to Perl's \b (word boundary).
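The remark about zero-padding can be illustrated without asorti(). The sketch below uses plain awk and the external sort command instead (so it does not depend on GNU awk), with made-up years, one of which has only three digits; LC_ALL=C is set to get a plain byte-wise string sort.

```shell
#!/bin/bash
# Sketch: why string-sorted keys need zero padding. Years are made up.

years() { printf '%s\n' 1990 876 1614; }

echo "== raw keys, string sort"
years | LC_ALL=C sort          # 876 ends up after 1990

echo "== keys padded with sprintf(), string sort"
years | awk '{ print sprintf("%04d", $1) }' | LC_ALL=C sort
```

With the padded keys, string order and numeric order coincide (0876, 1614, 1990).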

However, we're still using years as keys, and so this solution is not good if there are repeated keys. To solve this problem, we can use a suffix for the year (eg, 1932_00, 1932_01 etc.) to avoid duplicate keys:

BEGIN{RS=""}
match($0, /\y[Ii]n[[:space:]]+[0-9]+/) {
  year = substr($0,RSTART,RLENGTH)
  sub(/^[Ii]n[[:space:]]+/,"",year)
  par[year sprintf("_%02d", count[year]++)] = $0
}
END{
  n=asorti(par,npar)
  for(i=1;i<=n;i++){
    print s par[npar[i]]
    s = "\n"
  }
}

The array count[], well, counts the occurrences of every key, and is used to create the suffix to append to the key. The rest of the program is exactly the same as before.

However, there is another approach to the problem, which is explained in the next section.

Standard awk

With standard awk, there are no sort facilities (unless one implements their own, of course). So we have to use the external sort command for sorting. To avoid the problems caused by duplicated keys, we could use the same approach described for GNU awk, except the array holding the keys would have to be sorted using the external program sort. But here we show another approach: the key is prepended to its paragraph separated by "_", and then sort is instructed to sort numerically on the first "_"-separated field. Before doing this, since sort operates on lines, we have to somehow turn each paragraph into a line; this will be accomplished by replacing all the newlines with SUBSEP (octal \034). Of course the change will be undone (and the prefix removed) before printing the result.

Code:

BEGIN{RS=""}
match($0, /(^|[[:space:]])[Ii]n[[:space:]]+[0-9]+/) {
  year = substr($0, RSTART, RLENGTH)
  match(year, /[0-9]+$/)
  year = substr(year, RSTART, RLENGTH)
  gsub(/\n/, SUBSEP)
  par[++count] = year "_" $0
}
END{
  tempfile = "/tmp/f6SwPu2"
  command = "sort -t _ -k1,1n > " tempfile
  # sort the array...
  for(i=1;i<=count;i++) print par[i] | command
  close(command)
  i=0
  # read it back, but reset RS first...
  RS = "\n"
  while((getline par[++i] < tempfile) > 0)
    ;
  close(tempfile)
  # restore newlines and exclude keys from printed data
  for(i=1;i<=count;i++){
    gsub(SUBSEP, "\n", par[i])
    print s substr(par[i], index(par[i],"_")+1)
    s = "\n"
  }
}

Standard awk does not have \y, so we have to approximate it with the alternation "beginning of string or [[:space:]]".

Note that the ordering produced by the above code may differ from that obtained from the solution with the numeric suffix appended to the key. This is because with the above code sort effectively sees duplicate keys, and resolves ties by comparing the whole line, so this may result in reordering of equal-key paragraphs compared to the original input. If this is not desired, it is possible to disable sort's "last resort" comparison by supplying the "-s" (for stable) option to sort. However, it is a nonstandard option, and not all implementations support it (GNU sort does).
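The difference can be seen on a few made-up lines in the same "key_text" format, two of which share the key 1932. This is a sketch that assumes a sort implementation accepting -s.

```shell
#!/bin/bash
# Sketch: sort's last-resort comparison vs. -s, on made-up keyed lines.
# -s is not available in every sort implementation (GNU sort has it).

keyed() {
  printf '%s\n' '1932_second paragraph' '1614_logarithms' '1932_first paragraph'
}

echo "== default: ties broken by comparing whole lines"
keyed | LC_ALL=C sort -t _ -k1,1n

echo "== with -s: ties keep their input order"
keyed | LC_ALL=C sort -s -t _ -k1,1n
```

Without -s the two 1932 lines come out in alphabetical order ("first" before "second"); with -s they keep the order they had in the input.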