Detecting empty files in awk

We have an awk script that can process multiple files, and we want to perform some special task at the beginning of each file (for this example we just print the file name, but it can be anything, of course). The classic awk idiom for this is something like

function process_file(){
  print "Processing " FILENAME "..."
}

FNR == 1 { process_file() }
# rest of code

So we call our script with three files and get:

$ awk -f script.awk file1 file2 file3
Processing file1...
Processing file2...
Processing file3...

Alright. But what happens if some file is empty? Let's try it (we use /dev/null to simulate an empty file):

$ awk -f script.awk file1 /dev/null file3
Processing file1...
Processing file3...

Right, since an empty file has no lines, it can never match FNR == 1, so for the purposes of our per-file processing task it's effectively skipped. Depending on the exact needs, this may or may not be acceptable. Usually it is, but what if we want to be sure that we run our code for each file, regardless of whether it's empty or not?

GNU awk

If we have GNU awk and can assume it's available anywhere our script will run (or can force it as a prerequisite for users), then it's easy: just use the special BEGINFILE block instead of FNR == 1.

function process_file(){
  print "Processing " FILENAME "..."
}

BEGINFILE { process_file() }

(Btw, GNU awk also has a corresponding ENDFILE special block.)

And there we have it:

$ gawk -f script.awk file1 /dev/null file3
Processing file1...
Processing /dev/null...
Processing file3...

But alas, for the time being this is not standard, so it can only run with GNU awk.
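As an aside, here's a minimal sketch pairing the two special blocks (gawk only; inside ENDFILE, FNR still holds the number of records read from the file just finished):

function process_file(){
  print "Processing " FILENAME "..."
}

BEGINFILE { process_file() }
ENDFILE   { print "Done with " FILENAME " (" FNR " records)" }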

Standard awk

With standard awk, we have to stick to what is available, namely the FNR == 1 condition. If our process_file function is executed, then we know we're seeing a non-empty file. So our only option is, within this function, to check whether some previous files have been skipped and, if so, catch up with their processing. How do we do this check? Well, awk stores all the arguments to the program in the ARGV[] array, so we can keep our own pointer to the index of the expected "current" file being processed and check that it matches FILENAME (which is set by awk and always matches the current file); if they are not the same, it means some previous file was skipped, so we catch up.

First version of our processing function (we knowingly ignore the lint/style issue whereby passing a global variable to a function parameter of the same name shadows it, as it's totally harmless here and improves code readability):

function process_it(filename, is_empty) {
  print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
}

function process_file(argind) {
  argind++

  # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
  while (ARGV[argind] != FILENAME) {
    process_it(ARGV[argind], 1)
    argind++
  }
  # finally, process the current file
  process_it(ARGV[argind], 0)
  return argind
}

BEGIN {
  argind = 0
}

FNR == 1 {
  argind = process_file(argind)
}
# rest of code here

(The index variable is named argind. The name is not random; GNU awk has an equivalent built-in variable, called ARGIND.)

Let's test it:

$ awk -f script.awk file1 /dev/null file3
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk /dev/null /dev/null file3
Processing /dev/null (empty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk file1 /dev/null /dev/null
Processing file1 (nonempty)...
$
# Oops...

So there's a corner case where it doesn't work, namely when the last file(s) are all empty: since there's no later non-empty file, our function never gets another chance to be called and catch up. This can be fixed: we just call our function from the END block as well. When called from the END block, we simply process all the arguments that haven't been processed yet (that is, from argind to ARGC - 1), if any; these would all be empty files. Revised code:

function process_it(filename, is_empty) {
  print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
}

function process_file(argind, end) {
  argind++

  if (end) {
    for(; argind <= ARGC - 1; argind++)
      # we had empty files at the end of arguments
      process_it(ARGV[argind], 1)
    return argind
  } else {
    # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
    while (ARGV[argind] != FILENAME) {
      process_it(ARGV[argind], 1)
      argind++
    }
    # finally, process the current file
    process_it(ARGV[argind], 0)
    return argind
  }
}

BEGIN {
  argind = 0
}

FNR == 1 {
  argind = process_file(argind, 0)
}

# rest of code here...

END {
  argind = process_file(argind, 1)
  # here argind == ARGC
}

Let's test it again:

$ awk -f script.awk file1 /dev/null file3
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk /dev/null /dev/null file3
Processing /dev/null (empty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk file1 /dev/null /dev/null
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing /dev/null (empty)...
$ awk -f script.awk /dev/null /dev/null /dev/null
Processing /dev/null (empty)...
Processing /dev/null (empty)...
Processing /dev/null (empty)...

But wait, we aren't done yet!

$ awk -f script.awk file1 /dev/null a=10 file3
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing a=10 (empty)...
Processing file3 (nonempty)...

That is, awk allows mixing filenames and variable assignments in the argument list. This is really a feature, as it allows, for example, changing FS between files. Here's the relevant text from the standard:

An operand that begins with an <underscore> or alphabetic character from the portable character set [...], followed by a sequence of underscores, digits, and alphabetics from the portable character set, followed by the '=' character, shall specify a variable assignment rather than a pathname.
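For instance, with standard awk the field separator can be switched between files right on the command line (the file names here are hypothetical):

$ awk '{ print $1 }' spaces.txt FS=: colons.txt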

But this also means that we, in our processing, should detect assignments and not treat them as if they were filenames. Based on the above rules, we can write a function that checks whether or not its argument is an assignment, and use it to decide whether an argument should be processed.

Final code that includes this check:

function is_assignment(s) {
  return (s ~ /^[_a-zA-Z][_a-zA-Z0-9]*=/)
}

function process_it(filename, is_empty) {
  if (! is_assignment(filename))
    print "Processing " filename " (" (is_empty ? "empty" : "nonempty") ")..."
}

function process_file(argind, end) {
  argind++

  if (end) {
    for(; argind <= ARGC - 1; argind++)
      # we had empty files at the end of arguments
      process_it(ARGV[argind], 1)
    return argind
  } else {
    # if ARGV[argind] differs from FILENAME, we skipped some files. Catch up
    while (ARGV[argind] != FILENAME) {
      process_it(ARGV[argind], 1)
      argind++
    }
    # finally, process the current file
    process_it(ARGV[argind], 0)
    return argind
  }
}

BEGIN {
  argind = 0
}

FNR == 1 {
  argind = process_file(argind, 0)
}

# rest of code here...

END {
  argind = process_file(argind, 1)
  # here argind == ARGC
}

Final tests:

$ awk -f script.awk file1 /dev/null a=10 file3
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing file3 (nonempty)...
$ awk -f script.awk file1 /dev/null a=10 /dev/null
Processing file1 (nonempty)...
Processing /dev/null (empty)...
Processing /dev/null (empty)...
$ awk -f script.awk /dev/null a=10 /dev/null file1
Processing /dev/null (empty)...
Processing /dev/null (empty)...
Processing file1 (nonempty)...

# now we have an actual file called a=10
$ awk -f script.awk /dev/null ./a=10 /dev/null file1
Processing /dev/null (empty)...
Processing ./a=10 (nonempty)...
Processing /dev/null (empty)...
Processing file1 (nonempty)...

(Semi-)Automated ~/.ssh/config management

Following up from here, a concrete application of the technique sketched at the end of that article.
Considering that it's a quick and dirty hack, and that the configuration format was conjured up from scratch in 10 minutes, it has worked surprisingly well so far (for what it has to do).
It's also a highly ad-hoc hack, which means that it will be absolutely useless (at least, without making more or less heavy changes) in a lot of environments.

The idea: automated generation of the ~/.ssh/config file (which is just like /etc/ssh/ssh_config, but per-user).

As anyone who has used SSH more than a few times knows perfectly well (or should know, though that doesn't always seem to be the case), having to repeatedly type

ssh -p 1234 -A root@s01.paris.dc1.example.com

is not at all the same as typing (for example)

ssh s01pd1

That's one of the main reasons for using the ~/.ssh/config file, of course: creating easier aliases for otherwise complicated and/or long hostnames (and at the same time being able to supply extra options like username, port etc. without having to type them on the command line).

For the above, one could put this in the config:

Host s01pd1
User root
Port 1234
ForwardAgent yes
Hostname s01.paris.dc1.example.com

Since the SSH client checks this file even before attempting DNS resolution, we have accomplished our goal of reducing the number of keystrokes needed for this connection (and, consequently, reduced the likelihood of typos and the time needed to type it).

However, in certain environments machines come and go rapidly, and manually editing the file each time to keep it up-to-date is tedious and error-prone (and, furthermore, there are often groups of machines with the same configuration).

Starting from a plain list of hostnames, it's easy to programmatically generate a ~/.ssh/config file. However we don't simply want the hostnames replicated, we also want to have (short!) aliases for each host.

So that's the first desired feature: creating an alias for each host, following some fixed rule. How exactly the alias is generated from the FQDN can vary depending on what looks and feels easiest, most logical or most convenient for the user, so the mechanism should allow for the definition of "rules". But there's no need to invent something new; these rules can be based on the good old regular expression syntax, which is surely well suited for the task.

The second problem to solve is that, for a lot of reasons, there will surely have to be host entries in the ~/.ssh/config file that do not easily lend themselves to be automatically generated (because the SSH port is different, because the username is different, because for this one machine X forwarding is needed, because it needs ad-hoc crypto parameters, because there's no obvious transformation pattern to use to generate the short name, because the machine is not part of any group, because... a lot of other reasons). In other words, it must be possible to keep a number of manually maintained entries (hopefully few, but of course it depends) which should not be touched when the file is subject to automated (re)generation.
This problem is solved by creating a "safe" zone in the file, delimited by special comment markers. When regenerating the file, the contents of the safe zone are preserved and copied verbatim, so manual changes must be done inside this area.
Due to the way ssh looks for values in the file (the value from the first matching entry is used), the safe zone is located at the end of the file, so for example it's possible to set more specific per-host values in the automatically generated part, and finally set general defaults (eg Host * and so on) in the safe zone.

So our skeleton to be used as starting point for the (semi-)automatically managed ~/.ssh/config is something like this:

#### BEGIN SAFE ZONE ####
# put manually maintained entries here, they will be preserved.

#### END SAFE ZONE ####

When starting from scratch, the above section will be included anyway (albeit empty) in the generated file. If you want to use an existing ~/.ssh/config as a starting point, add the above special comment markers at its beginning and end, effectively turning the whole file into a safe zone. Later refinement is always possible, so better safe than sorry.
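For example, after some manual additions the safe zone might look like this (entries are hypothetical; note the catch-all Host * defaults placed last, per the lookup order described above):

#### BEGIN SAFE ZONE ####
# put manually maintained entries here, they will be preserved.

Host oldbox
Hostname 192.0.2.7
Port 2222
User admin

Host *
ServerAliveInterval 60
#### END SAFE ZONE ####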

Now, for the actual host definitions, we can use a very simple file format. Hosts can be divided into groups, where hosts belonging to the same group share (at least) the same DNS domain. This is totally arbitrary; after all, as said, we're talking about an ad-hoc thing. More shared options can be specified, as we'll see.

For each host group, a space-separated list of (unqualified) hostnames is given, followed by a colon (":") and then a domain to be appended to each hostname. This is the bare minimum; with this we get the obvious output, so for example starting from

# Paris DC1 hosts
server01 server02 mysql01 : paris.dc1.example.com

# Frankfurt DC2 hosts
mongo01 mongo02 : frankfurt.dc2.example.com

we get (username "root" is assumed by default, another arbitrary decision):

Host server01
Hostname server01.paris.dc1.example.com
User root

Host server02
Hostname server02.paris.dc1.example.com
User root

Host mysql01
Hostname mysql01.paris.dc1.example.com
User root

Host mongo01
Hostname mongo01.frankfurt.dc2.example.com
User root

Host mongo02
Hostname mongo02.frankfurt.dc2.example.com
User root

So at least we save ourselves the hassle of typing the username and the FQDN (ie, we can do "ssh server01" instead of "ssh root@server01.paris.dc1.example.com"). Not bad. But life isn't always that easy, and some day there might be another "server01" host in some other domain (host group), at which point "ssh server01" would cease to be useful.
So we use a third field to specify an (optional, but highly recommended) "transformation" expression (in the form of perl's s/// operator) which is applied to the unqualified hostname to derive the final alias for each host. This way, we can create (for example) "server01p1" and "server01f2" as aliases for the one in DC1 Paris and the one in DC2 Frankfurt respectively, and restore harmony in the world (if only it were so easy).

So we can do this:

# Paris DC1 hosts
server01 server02 mysql01 : paris.dc1.example.com : s/$/p1/

# Frankfurt DC2 hosts
server01 mongo01 : frankfurt.dc2.example.com : s/$/f2/

to get:

Host server01p1
Hostname server01.paris.dc1.example.com
User root

Host server02p1
Hostname server02.paris.dc1.example.com
User root

Host mysql01p1
Hostname mysql01.paris.dc1.example.com
User root

Host server01f2
Hostname server01.frankfurt.dc2.example.com
User root

Host mongo01f2
Hostname mongo01.frankfurt.dc2.example.com
User root

Now we have to type two more characters, but it's still a lot better than the full FQDN and allows us to distinguish between the two "server01" hosts.

If the hosts share some other common options, they can be added starting from the fourth field. For example, a group of office switches that only support old weak crypto algorithms and need username "admin" (not an infrequent case):

# Crappy office switches
sw1 sw2 sw3 sw4 : office.int : s/^/o/ : Ciphers 3des-cbc : MACs hmac-sha1 : KexAlgorithms diffie-hellman-group1-sha1 : User admin

now gives:

Host osw1
Ciphers 3des-cbc
MACs hmac-sha1
KexAlgorithms diffie-hellman-group1-sha1
User admin
Hostname sw1.office.int

Host osw2
Ciphers 3des-cbc
MACs hmac-sha1
KexAlgorithms diffie-hellman-group1-sha1
User admin
Hostname sw2.office.int

Host osw3
Ciphers 3des-cbc
MACs hmac-sha1
KexAlgorithms diffie-hellman-group1-sha1
User admin
Hostname sw3.office.int

Host osw4
Ciphers 3des-cbc
MACs hmac-sha1
KexAlgorithms diffie-hellman-group1-sha1
User admin
Hostname sw4.office.int

Within the extra options, simple interpolation of the special escape sequences %h and %D is supported, similar to what ssh does in its config files (though %D is not supported there): %h is replaced with the (unqualified) hostname, %D with the domain. This makes it possible to say:

# Paris DC1 hosts, behind firewall
server01 server02 server03 : paris.dc1.example.com : s/$/p1/ : ProxyCommand ssh admin@firewall.%D nc %h 22

and have the following automatically generated:

Host server01p1
ProxyCommand ssh admin@firewall.paris.dc1.example.com nc server01 22
User root

Host server02p1
ProxyCommand ssh admin@firewall.paris.dc1.example.com nc server02 22
User root

Host server03p1
ProxyCommand ssh admin@firewall.paris.dc1.example.com nc server03 22
User root

(Yes, there is a very limited amount of rudimentary extra-option parsing, for example to avoid producing a Hostname option - which would be harmless, anyway - if ProxyCommand is present.)

For more on the ProxyCommand directive, see for example here.

So the generic format of the template used to define hosts is:

#host(s) : domain [ : transformation [ : extra_opt_1 ] [ : extra_opt_2 ] ... [ : extra_opt_n ] ]
# first 2 are mandatory, although domain can be empty

Comments and empty lines are ignored. Spaces around the field-separating colons can be added for readability but are otherwise ignored.

If no domain should be appended (for example because it's automatically appended as part of the host's domain resolution mechanism) the domain field can be left empty. Similarly, if no transformation is desired, the transformation field can be left empty to mean "apply no transformation" (the bare unqualified hostname will directly become the alias).
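For instance, both of these are valid definitions (hypothetical hosts):

# no domain appended, no transformation: the alias is the bare hostname
vpn1 vpn2 :

# domain appended, no transformation
db01 db02 : milan.dc3.example.com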

We assume this template file with host definitions is saved in ~/.ssh_config_hosts. Adapt the code as needed.

As mentioned, the automatically generated host blocks are placed before the safe zone, which is always preserved.

Here's the code to regenerate ~/.ssh/config starting from the host definitions in the format explained above and an (optional) existing ~/.ssh/config.
WARNING: this code directly overwrites the existing ~/.ssh/config file, so it's highly advisable to make a backup copy before starting to experiment. Output to stdout can also be enabled (see comment in the code) to visually check the result without overwriting.

#!/usr/bin/perl

use warnings;
use strict;

my $tpl_file = "$ENV{HOME}/.ssh_config_hosts";
my $config_file = "$ENV{HOME}/.ssh/config";

my @staticpart = ();
my @generatedpart = ();

my $beg_safepat = '#### BEGIN SAFE ZONE ####';
my $end_safepat = '#### END SAFE ZONE ####';

# read safe section of the config file (to be preserved)
if (-f $config_file) {
  open(my $confr, "<", $config_file) or die "Cannot open $config_file for reading: $!";

  my $insafe = 0;

  while (<$confr>) {
    if (/^$beg_safepat$/) {
      $insafe = 1;
      next;
    }

    if (/^$end_safepat$/) {
      $insafe = 0;
      last;
    }

    next if not $insafe;
    push @staticpart, $_;
  }

  close($confr) or die "Cannot close $config_file: $!";
}

# read host template

open(my $tplr, "<", $tpl_file) or die "Cannot open template $tpl_file for reading: $!";

while (<$tplr>) {

  # skip empty lines and comments
  next if /^\s*(?:#.*)?$/;

  chomp;
  s/\s*#.*//;

  my ($hlist, $domain, $transf, @extra) = split(/\s*:\s*/);

  my @hosts = split(/\s+/, $hlist);

  for my $host (@hosts) {

    my $entry = "";

    my $alias = $host;
    if ($transf) {
      eval "\$alias =~ $transf;";
    }

    $entry = "Host $alias";

    my %opts = ();

    for (@extra) {

      # minimal %h/%D interpolation for things like proxycommand etc...
      (my $extra = $_) =~ s/%h/$host/g; $extra =~ s/%D/$domain/g;

      $entry .= "\n$extra";

      my ($op) = $extra =~ /^(\S+)/;
      $opts{lc($op)} = 1;
    }

    if (!exists($opts{proxycommand})) {
      $entry .= "\nHostname $host" . ($domain ? ".$domain" : "");
    }

    if (!exists($opts{user})) {
      $entry .= "\nUser root";
    }

    push @generatedpart, $entry;

  }
}

close($tplr) or die "Cannot close template $tpl_file: $!";

# write everything out to $config_file

open(my $confw, ">", $config_file) or die "Cannot open $config_file for writing: $!";
# use this to send to stdout instead
#my $confw = *STDOUT;

print $confw "#########################################################################\n";
print $confw "# the following entries are automatically generated, do not change them\n";
print $confw "# directly. Instead change the file $tpl_file\n";
print $confw "# and run $0 to regenerate them.\n";
print $confw "#########################################################################\n\n";

# generated part, each item is a host block
print $confw (join("\n\n", @generatedpart), "\n\n");

# static part (safe zone)
for ("$beg_safepat\n", @staticpart, "$end_safepat\n") {
  print $confw $_;
}

print $confw "\n";

close($confw) or die "Cannot close $config_file: $!";

exit;

Remote-to-remote data copy

...going through the local machine, which is what people normally want and try to do.

Of course it's not as efficient as a direct copy between the involved boxes, but many times it's the only option, for various reasons.

Here are some ways (some with standard tools, some home-made) to accomplish the task. We'll indicate the two remote machines between which data has to be transferred with remote1 and remote2. We assume no direct connectivity between them is possible, but we have access to both from the local machine (with passwordless SSH where appropriate).

remote1 to local, local to remote2

This is of course the obvious and naive way: just copy everything temporarily from remote1 to the local machine (with whatever method), then again from the local machine to remote2. If copying remote-to-remote is bad, doing it this way is even worse, as we actually need space on the local machine to store the data, albeit only temporarily. Sample code using rsync (options are only indicative):

$ rsync -avz remote1:/src/dir/ /local/dir/
sending incremental file list
...
$ rsync -avz /local/dir/ remote2:/dest/dir/
sending incremental file list
...

For small or even medium amounts of data this solution can be workable, but it's clearly not very satisfactory.

scp -3

Newer versions of scp have a command line switch (-3) which does just what we want: remote-to-remote copy going through the local machine. In this case at least, we don't need local disk space:

$ scp -3 -r remote1:/src/dir remote2:/dest/dir    # recursive to copy everything; adapt as needed

An annoying "feature" of scp -3 is that there's no indication of progress whatsoever (whereas the default for non-remote-to-remote copy is to show progress of each file as it's copied), and no option to enable it. Sure, with -v that information is printed, but so is a lot of other stuff.

SSH + tar

We can also of course use SSH and tar:

$ ssh remote1 'tar -C /src/dir/ -cvzf - .' | ssh remote2 'tar -C /dest/dir/ -xzvf -'

tar + netcat/socat

Can we modify our nettar tool to support remote-to-remote copies? The answer is yes, and here's the code for a generalized version that automatically detects whether local-to-remote, remote-to-local or remote-to-remote copy is desired. This version uses socat instead of netcat, which implies that socat must be installed on the involved remote machines, as well as on the local box. It also implies that traffic is allowed between the local box and the remote ones on the remote TCP port used (in this example 1234).

#!/bin/bash
 
# nettar_gen.sh
# copy directory trees between local/remote and local/remote, using tar + socat

# Usage: $0 src dst

# if either src or dst contain a colon, it's assumed to mean machine:path, otherwise assumed 
# local path

# examples
#
# $0 remote:/src/dir /local/dst
# $0 /local/src remote:/dst/dir
# $0 remote1:/src/dir remote2:/dst/dir

# NOTE: error checking is very rudimentary. Argument sanity checking is missing.

src=$1
dst=$2

port=1234
remotesrc=0
remotedst=0
user=root

if [[ "$src" =~ : ]]; then
  remotesrc=1
  srcmachine=${src%%:*}
  srcdir=${src#*:}
  if ! ssh "$user"@"$srcmachine" "cd '$srcdir' || exit 1; { tar -cf - . | socat - TCP-L:$port,reuseaddr ; } </dev/null >/dev/null 2>&1 &"; then
    echo "Error setting up source on $srcmachine" >&2
    exit 1
  fi
fi

if [[ "$dst" =~ : ]]; then
  remotedst=1
  dstmachine=${dst%%:*}
  dstdir=${dst#*:}
  if ! ssh "$user"@"$dstmachine" "cd '$dstdir' || exit 1; { socat TCP-L:$port,reuseaddr - | tar -xf - ; } </dev/null >/dev/null 2>&1 &"; then
    echo "Error setting up destination on $dstmachine" >&2
    exit 1
  fi
fi

# sometimes remote initialization takes a bit longer...
sleep 0.5

if [ $remotesrc -eq 0 ] && [ $remotedst -eq 0 ]; then
  # local src, local dst
  tar -cf - -C "$src" . | tar -xvf - -C "$dst"
elif [ $remotesrc -eq 0 ]; then
  # local src, remote dst
  tar -cvf - -C "$src" . | socat - TCP:"$dstmachine":$port
elif [ $remotedst -eq 0 ]; then
  # remote src, local dst
  socat TCP:"$srcmachine":$port - | tar -xvf - -C "$dst"
else
  # remote src, remote dst
  socat TCP:"$srcmachine":$port - | socat - TCP:"$dstmachine":$port
fi

So with this code we can say

$ nettar_gen.sh remote1:/src/dir remote2:/dst/dir

and transfer the files unencrypted, without the overhead of SSH (since tar runs remotely, though, we won't see the names of the files being transferred). Compression can be added to tar if desired (it doesn't always make things faster, so it may or may not be an improvement).
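A minimal sketch of that change inside nettar_gen.sh (assuming tar's -z is available on all the involved machines; both ends must of course agree):

# source side, compressed:
tar -czf - . | socat - TCP-L:$port,reuseaddr

# destination side, compressed:
socat TCP-L:$port,reuseaddr - | tar -xzf -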

Real rsync?

The approaches so far (except the first one, which however has other drawbacks) have the problem that they are not incremental, so if a transfer is interrupted, we have to restart it from the beginning (ok, we can cheat and move or delete the already-copied data on the origin, so it doesn't have to be copied again, but it should be obvious that this is neither an optimal nor a desirable workaround).
The tool of choice when we need to resume partial transfers is, of course, rsync but, as the man page kindly informs us,

Rsync copies files either to or from a remote host, or locally on the current host (it does not support copying files between two remote hosts).

However, we can leverage SSH's port forwarding capabilities and "bring", so to speak, a "tunnel" to remote1 that connects to remote2 via the local machine, for example:

$ ssh -R10000:remote2:10000 remote1

If we do the above, anything sent to localhost:10000 on remote1 will be sent to port 10000 on remote2. In particular, we can forward to port 22 on remote2 (or whatever port SSH is using there):

$ ssh -R10000:remote2:22 remote1

Now "ssh -p 10000 localhost" on remote1 gives us a password request from remote2's SSH daemon.

So, since rsync runs over SSH, with this tunnel in place we can run this on remote1 (all the examples use root as the user on remote2, adapt as needed):

remote1$ rsync -e 'ssh -l root -p 10000' -avz /src/dir/ localhost:/dest/dir/

and we'll effectively be transferring stuff to remote2. We can run the above directly from the local box (the -t option to SSH is to force a pseudo-tty allocation, otherwise we couldn't be asked for the password):

$ ssh -t -R10000:remote2:22 remote1 'rsync -e "ssh -l root -p 10000" -avz /src/dir/ localhost:/dest/dir/'
The authenticity of host '[localhost]:10000 ([127.0.0.1]:10000)' can't be established.
ED25519 key fingerprint is 9a:fd:f3:7f:55:1e:6b:44:b2:88:fd:a3:e9:c9:b9:ed.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[localhost]:10000' (ED25519) to the list of known hosts.
root@localhost's password:
sending incremental file list
...

So this way we get almost what we wanted, except we're still prompted for a password (which, as should be clear by now, is really the password for root@remote2). This is expected, since remote1 has probably no relation whatsoever with remote2 (we are also asked to accept remote2's SSH host key).

Although this solution is already quite satisfactory, can we do better? The answer is: yes.

An option is to set up passwordless SSH between remote1 and remote2, which means installing the appropriate SSH keys in remote1's ~/.ssh directory (and adapting the -e option to rsync to use them, if necessary). This may or may not be OK, depending on the exact circumstances, but in any case it still requires changes on remote1, which may not be desirable (it also requires further work if some day we want to transfer, say, between remote1 and remote3, and for any new remote we want to work with).

Can't we somehow exploit the fact that we do have passwordless SSH to both remotes from our local machine?
The answer is again: yes. If we use ssh-agent (or gpg-agent, which can store SSH keys as well), whose job is to read and store private SSH keys, we can then take advantage of the -A option to SSH (which can also be specified in ~/.ssh/config as ForwardAgent) to forward our agent connection to remote1; there, the keys stored by the agent will be accessible and thus usable to get passwordless login on remote2 (well, on [localhost]:10000 actually, which is how remote1 will see it). Simply put, forwarding the SSH agent means that all SSH authentication challenges will be forwarded to the local machine, so in particular it is possible to take advantage of locally available keys even for authentications happening remotely. Here is a very good description of the process. (And be sure to read and understand the implications of using -A as explained in the man page.)
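Setting this up on the local machine can be as simple as the following (the key path is just an example):

$ eval "$(ssh-agent)"            # start an agent, if one isn't already running
$ ssh-add ~/.ssh/id_ed25519      # load the key that remote2 accepts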
With a running agent with knowledge of the relevant key on the local machine and agent forwarding, we can finally have a seamless remote-to-remote rsync:

$ ssh -t -A -R10000:remote2:22 remote1 'rsync -e "ssh -l root -p 10000" -avz /src/dir/ localhost:/dest/dir/'

An annoyance with this approach is that, since remote1 stores the host key of [localhost]:10000 in its ~/.ssh/known_hosts file, if we do this:

$ ssh -t -A -R10000:remote2:22 remote1 'rsync -e "ssh -l root -p 10000" -avz /src/dir/ localhost:/dest/dir/'

and then this:

$ ssh -t -A -R10000:remote3:22 remote1 'rsync -e "ssh -l root -p 10000" -avz /src/dir/ localhost:/dest/dir/'

SSH will complain loudly and rightly that the key for localhost:10000 has changed.
A workaround, if this kind of operation is needed frequently, is to set up some sort of mapping between remote hosts and ports used on remote1 (and stick to it).

A slightly better method could be to clean up the relevant entry from remote1's ~/.ssh/known_hosts file just before starting the transfer (eg with sed or some other tool), and then use StrictHostKeyChecking=no to have the key automatically added without confirmation, for example:

# cleanup, then do the copy
$ ssh remote1 'sed -i "/^\[localhost\]:10000 /d" .ssh/known_hosts'
$ ssh -t -A -R10000:remote2:22 remote1 'rsync -e "ssh -l root -p 10000 -o StrictHostKeyChecking=no" -avz /src/dir/ localhost:/dest/dir/'
Warning: Permanently added '[localhost]:10000' (ED25519) to the list of known hosts.
sending incremental file list
...

Update 13/02/2015: it turns out that ssh-keygen, despite its name, has an option (-R) to remove a host key from the known_hosts file, so it can be used instead of sed in the above example:

# ssh-keygen -R '[localhost]:10000'
# Host [localhost]:10000 found: line 21 type ED25519
/root/.ssh/known_hosts updated.
Original contents retained as /root/.ssh/known_hosts.old

However, it leaves behind a file with the .old suffix, and outputs a message which can't be suppressed with -q, despite what the man page says, so one would need to resort to shell redirection if silent operation is wanted.
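For example, a quiet remote cleanup could look like this (also removing the leftover backup file):

$ ssh remote1 'ssh-keygen -R "[localhost]:10000" >/dev/null 2>&1; rm -f .ssh/known_hosts.old'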

The mythical “idempotent” file editing

The story goes more or less like this: "I want to edit a file by adding some lines, but leaving alone any other lines that it might already have. If one of the to-be-added lines is already present, do not re-add it (or replace the existing one). I should be able to repeat this process an arbitrary number of times; after the first run, any subsequent run must leave the file unchanged" (hence "idempotent").

For some reason, a typical target for this kind of thing seems to be the file /etc/hosts, and that's what we'll be using here for the examples. Adapt as needed. Other common targets include /etc/passwd or DNS zone files.

Note that there are almost always ways to avoid doing what we're going to do.
A typical scenario cited by proponents of this approach is automated or scripted install of a machine where a known state for /etc/hosts is desired. But in that case, one can just create the file from scratch with appropriate contents (we are provisioning, right?). Creating the file from scratch certainly leaves it with the desired contents, and is surely idempotent (can be repeated as many times as wanted).
Another scenario is managing/maintaining such file on an already installed machine. But if you really need to do that, there are tools (puppet has a /etc/hosts type, augeas can edit most common file types, etc.) that can do it natively and well (well, at least most likely better than a script).

So in the end it's almost always a half-baked attempt at doing something that either shouldn't be necessary in the first place, or should be done with the appropriate tools.

Nevertheless, there seem to be a lot of people trying to do this, so for the sake of it, let's see how the task could be approached.

To make it concrete, here's our existing (pre-edit) /etc/hosts:

#
# /etc/hosts: static lookup table for host names
#
127.0.0.1	my.example.com localhost.localdomain	my localhost
::1		localhost.localdomain	localhost

192.168.44.12   server1.example.com server1
192.168.44.1    firewall.example.com firewall

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

2001:db8:a:b::1  server6.example.com server6
# End of file

We want to merge the following lines (we assume they are stored in their own file, newlines.txt):

192.168.44.250      newserver.example.com  newserver
192.168.44.1        firewall.example.com firewall gateway
2001:db8:a:b::100   server7.example.com  server7

When one of the lines we're adding is already present in the target file, there are two possible policies: either leave the line alone (ie, the old line is the good one), or replace it (ie, the new line is the good one). In our example, we would encounter this issue with the 192.168.44.1 entry. Of course, it's not hard to imagine situations in which for just some of the new lines the "new line wins" policy should be used, while still using the "old line wins" policy for the remaining ones. We choose to ignore this problem here and use a global policy, but it's certainly not just a theoretical case.

Another issue has to do with the method used to detect whether a line is already present: do we compare the whole line, just a key field (somehow calculated, for example a column), a set of fields, or yet something else? If we use more than one field, what about spaces?
In the case of /etc/hosts it seems sensible to use the first column (ie, the actual IP address) as a key, but it could be argued that the second field (the FQDN) should be used instead, as we want to ensure that a given FQDN is resolvable, no matter to which IP address (this in turn has the problem that then we can't add an IPv4 and IPv6 line for the same FQDN). Here we're using the first field; again, adaptation will be necessary for different needs.

Another, more serious issue, has to do with the overall format of the resulting file. What do we do with comments and empty lines? In this case, we just print them verbatim.
And what about internal file "semantics" (for lack of a better term)? Let's say we like to have all IPv4 addresses nicely grouped together and all IPv6 addresses as well. New lines should respect the grouping (an IPv4 line should go into the IPv4 group etc.). Now things start to be, well, "interesting". Since where a line appears in the file doesn't really matter much to the resolver routines, here we choose to just append new lines at the end; but this is a very simple (and, for some "idempotent" editing fans probably unsatisfactory) policy.

The point is: it's easy to see how this seemingly easy task can quickly become arbitrarily (and ridiculously) complicated, and any "quick and dirty" solution necessarily has to deal with many assumptions and tradeoffs. (And all this just for the relatively simple file /etc/hosts. Imagine managing a DNS zone file, or a DHCP server configuration file, with MAC to IP mappings, just to name some other examples. And we're still in the domain of single-line-at-a-time changes.)

So here's some awk code that tries to do the merge. Whether the "existing/old line wins" policy or the "new line wins" policy is used is controlled with a flag (newwins) that can be set with -v, and by default is set to 0 (old line wins):

BEGIN {
  # awk way to check whether a variable is not defined
  if (newwins == "" && newwins == 0) {
    newwins = 0      # by default old line wins
  }
}

# load new lines, skip empty/comment lines
NR == FNR {
  if (!/^[[:blank:]]*(#|$)/) {
    # key on the first field (the IP address); more robust than scanning for a space
    newlines[$1] = $0
  }
  next
}

# print comments and empty lines verbatim
/^[[:blank:]]*(#|$)/ {
  print
  next
}

$1 in newlines {
  print (newwins == 1 ? newlines[$1] : $0)
  # either way, forget it
  delete newlines[$1]
  next
}

{ print }

# if anything is left in newlines, they must be truly new lines
END {
  for (ip in newlines)
    print newlines[ip] 
}

So we can run it as follows ("old line wins" policy, only two new lines appended at the end):

$ awk -f mergehosts.awk newlines.txt /etc/hosts
#
# /etc/hosts: static lookup table for host names
#
127.0.0.1	my.example.com localhost.localdomain	my localhost
::1		localhost.localdomain	localhost

192.168.44.12   server1.example.com server1
192.168.44.1    firewall.example.com firewall

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

2001:db8:a:b::1  server6.example.com server6
# End of file
2001:db8:a:b::100   server7.example.com  server7
192.168.44.250      newserver.example.com  newserver

Or with the "new line wins" policy (same two lines appended, and an existing one replaced with the new version):

$ awk -f mergehosts.awk -v newwins=1 newlines.txt /etc/hosts
#
# /etc/hosts: static lookup table for host names
#
127.0.0.1	my.example.com localhost.localdomain	my localhost
::1		localhost.localdomain	localhost

192.168.44.12   server1.example.com server1
192.168.44.1        firewall.example.com firewall gateway

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

2001:db8:a:b::1  server6.example.com server6
# End of file
2001:db8:a:b::100   server7.example.com  server7
192.168.44.250      newserver.example.com  newserver

(To actually change the original file, redirect the output to a temporary file and use it to overwrite the original one. Let's not start that discussion again).
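Concretely, something along these lines (the temporary path is arbitrary; cp keeps the target's permissions and ownership):

$ awk -f mergehosts.awk newlines.txt /etc/hosts > /tmp/hosts.new && cp /tmp/hosts.new /etc/hosts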

Not looking good? Well, it's kind of expected, since it's an ugly hack. It does work under the stated assumptions, but it's nonetheless a hack.

As said, it's highly dependent on the use case, but in general a better solution to this kind of problem is to either generate the whole file from scratch every time (including from templates if appropriate), or use dedicated tools to manage it.

It can also be mentioned that, if one really must do it with a script, it's often possible and easy enough to divide the target file into "zones" (for example, using special comment markers). In this way, within the same file, one zone could be deemed "safe" and reserved for hand-created content that should be preserved, and another zone for automated content (that is, erased and recreated from scratch each time). However this approach assumes that the whole of the automated content is always supplied each time. This approach (slightly less hackish) introduces its own set of considerations, and is interesting enough to deserve an article of its own.

Some common networking operations in Perl

A compilation of common operations that are often needed when writing networking code. Hopefully this saves some googling.

The examples will use Perl code, however from time to time the C data structures will be cited and the C terminology will be used. IPv4 and IPv6 will be covered.

Links to sample programs used in the examples: getaddrinfo.pl, getnameinfo.pl. A reasonably new version of Perl is required (in particular, 5.14 from Debian Wheezy is not new enough).

Socket addresses

In C, there's this notion of a "socket address", which is basically the combination of an IP address and a port (and other data, but address and port are the essential pieces of information). Here are the C data structures for IPv4 and IPv6:

/* IPv4 */
struct sockaddr_in {
    sa_family_t    sin_family; /* address family: AF_INET */
    in_port_t      sin_port;   /* port in network byte order */
    struct in_addr sin_addr;   /* internet address */
};

/* IPv6 */
struct sockaddr_in6 {
    sa_family_t     sin6_family;   /* AF_INET6 */
    in_port_t       sin6_port;     /* port number */
    uint32_t        sin6_flowinfo; /* IPv6 flow information */
    struct in6_addr sin6_addr;     /* IPv6 address */
    uint32_t        sin6_scope_id; /* Scope ID (new in 2.4) */
};

In C, lots of networking-related functions accept or return these structures (or, often, pointers to them). The connect() and the bind() functions are two notable examples.
In fact, the C function prototypes use the generic struct sockaddr type, which doesn't really exist in practice (although it has a definition); either a sockaddr_in or a sockaddr_in6 must be used, with its pointer cast to struct sockaddr *.

The actual IP addresses are themselves structs, which are defined as follows:

/* IPv4 */
struct in_addr {
    uint32_t       s_addr;         /* address in network byte order */
};

/* IPv6 */
struct in6_addr {
    unsigned char   s6_addr[16];   /* IPv6 address */
};

Then there's the more recent struct addrinfo, which includes a sockaddr member and, additionally, more data:

struct addrinfo {
    int              ai_flags;       // AI_PASSIVE, AI_CANONNAME, etc.
    int              ai_family;      // AF_INET, AF_INET6, AF_UNSPEC
    int              ai_socktype;    // SOCK_STREAM, SOCK_DGRAM
    int              ai_protocol;    // use 0 for "any"
    size_t           ai_addrlen;     // size of ai_addr in bytes
    struct sockaddr  *ai_addr;       // struct sockaddr_in or _in6
    char             *ai_canonname;  // full canonical hostname
    struct addrinfo  *ai_next;       // linked list, next node
};

This structure is used by a class of newer, address-family-independent functions. In particular, code is expected to deal with linked lists of struct addrinfo, as indicated by the fact that the ai_next member points to the same data structure type.

From sockaddr to (host, port, ...) data and vice versa

If we have a Perl variable that represents a sockaddr_in or a sockaddr_in6 (for example as returned by recv()), we can extract the actual member data with code similar to the following:

# IPv4
use Socket qw(unpack_sockaddr_in);
my ($port, $addr4) = unpack_sockaddr_in($sockaddr4);

# IPv6
use Socket qw(unpack_sockaddr_in6);
my ($port, $addr6, $scopeid, $flowinfo) = unpack_sockaddr_in6($sockaddr6);

Note that $addr4 and $addr6 are still binary data; to get their textual representation a further step is needed (see below).

Conversely, if we have the individual fields of a sockaddr, we can pack it into a sockaddr variable as follows:

# IPv4
use Socket qw(pack_sockaddr_in);
$sockaddr4 = pack_sockaddr_in($port, $addr4);

# IPv6
use Socket qw(pack_sockaddr_in6);
$sockaddr6 = pack_sockaddr_in6($port, $addr6, [$scope_id, [$flowinfo]]);

Again, $addr4 and $addr6 must be the binary versions of the addresses, not their string representation.

As a convenience, it is possible to use the sockaddr_in() and sockaddr_in6() functions as shortcuts for both packing and unpacking:

# IPv4
use Socket qw(sockaddr_in);
my ($port, $addr4) = sockaddr_in($sockaddr4);
my $sockaddr4 = sockaddr_in($port, $addr4);

# IPv6
use Socket qw(sockaddr_in6);
my ($port, $addr6, $scopeid, $flowinfo) = sockaddr_in6($sockaddr6);
$sockaddr6 = sockaddr_in6($port, $addr6, [$scope_id, [$flowinfo]]);

From binary address to string representation and vice versa

If we have a binary IP address, we can use inet_ntop() and inet_pton() to convert it to a string (printable) representation:

# IPv4
use Socket qw(AF_INET inet_ntop);
$straddr4 = inet_ntop(AF_INET, $addr4);

# IPv6
use Socket qw(AF_INET6 inet_ntop);
$straddr6 = inet_ntop(AF_INET6, $addr6);

And the reverse process, from string to binary:

# IPv4
use Socket qw(AF_INET inet_pton);
$addr4 = inet_pton(AF_INET, $straddr4);

# IPv6
use Socket qw(AF_INET6 inet_pton);
$addr6 = inet_pton(AF_INET6, $straddr6);

All these functions fail if the argument to be converted is not a valid address in the respective representation.
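Putting the pieces together, a minimal round trip from string data to a sockaddr and back (the address and port are just examples):

use strict;
use warnings;
use Socket qw(AF_INET inet_pton inet_ntop pack_sockaddr_in unpack_sockaddr_in);

# string -> binary -> sockaddr
my $sockaddr4 = pack_sockaddr_in(8080, inet_pton(AF_INET, "192.0.2.10"));

# sockaddr -> binary -> string
my ($port, $addr4) = unpack_sockaddr_in($sockaddr4);
printf "%s:%d\n", inet_ntop(AF_INET, $addr4), $port;   # prints 192.0.2.10:8080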

Get sockaddr data from a socket variable

Sometimes it is necessary to know which local or remote address and port a certain socket is associated with. Typically we have a socket variable (for example, obtained with accept()), which in Perl can be stored in a handle, and we want the corresponding sockaddr data. So here's how to get it:

# Get remote sockaddr info from socket handle
$remotesockaddr = getpeername(SOCK);

# then, as already shown...

# IPv4
($port, $addr4) = sockaddr_in($remotesockaddr);

# or IPv6
($port, $addr6, $scopeid, $flowinfo) = sockaddr_in6($remotesockaddr);

To get sockaddr information for the local end of the socket, getsockname() is used:

# Get local sockaddr info from socket
$localsockaddr = getsockname(SOCK);
...

Note that depending on the protocol (TCP or UDP) and/or the bound status of the socket, the results may or may not make a lot of sense, but this is something that the code writer should know.
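A small self-contained sketch of the whole process (assuming IO::Socket::INET is available and www.kernel.org is reachable on port 80):

use strict;
use warnings;
use Socket qw(AF_INET sockaddr_in inet_ntop);
use IO::Socket::INET;

# connect somewhere, then ask which local address/port the kernel picked
my $sock = IO::Socket::INET->new(PeerAddr => "www.kernel.org",
                                 PeerPort => 80,
                                 Proto    => "tcp") or die "cannot connect: $@";

my ($lport, $laddr) = sockaddr_in(getsockname($sock));
print "local end is ", inet_ntop(AF_INET, $laddr), ":$lport\n";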

From hostname to IP address and vice versa

There are two ways to perform this hyper-common operation: one is older and deprecated, the other is newer and recommended.

The old way

The older way, which is still extremely popular, is somewhat protocol-dependent. Here it is:

# List context, return all the information
($canonname, $aliases, $addrtype, $length, @addrs) = gethostbyname($name);

As an example, let's try it with www.kernel.org:

#!/usr/bin/perl
 
use warnings;
use strict;
 
use Socket qw ( :DEFAULT inet_ntop );
 
my ($canonname, $aliases, $addrtype, $length, @addrs) = gethostbyname('www.kernel.org');
 
print "canonname: $canonname\n";
print "aliases: $aliases\n";
print "addrtype: $addrtype\n";
print "length: $length\n";
print "addresses: " . join(",", map { inet_ntop(AF_INET, $_) } @addrs), "\n";

Running the above outputs:

canonname: pub.all.kernel.org
aliases: www.kernel.org
addrtype: 2
length: 4
addresses: 198.145.20.140,149.20.4.69,199.204.44.194

So it seems there's no way to get it to return IPv6 addresses.

gethostbyname() can also be run in scalar context, in which case it just returns a single IP(v4) address:

# Scalar context, only IP address is returned
$ perl -e 'use Socket qw (:DEFAULT inet_ntop); my $a = gethostbyname("www.kernel.org"); print inet_ntop(AF_INET, $a), "\n";'
149.20.4.69
$ perl -e 'use Socket qw (:DEFAULT inet_ntop); my $a = gethostbyname("www.kernel.org"); print inet_ntop(AF_INET, $a), "\n";'
198.145.20.140
$ perl -e 'use Socket qw (:DEFAULT inet_ntop); my $a = gethostbyname("www.kernel.org"); print inet_ntop(AF_INET, $a), "\n";'
199.204.44.194

Normal DNS round-robin.

The inverse process is done with gethostbyaddr(), which also supports IPv6, though it's deprecated nonetheless. Again, the results differ depending on whether we are in list or scalar context (remember that all addresses have to be binary):

# List context, return more data

# IPv4
use Socket qw(:DEFAULT);
my ($canonname, $aliases, $addrtype, $length, @addrs) = gethostbyaddr($addr4, AF_INET);

# IPv6
use Socket qw(:DEFAULT);
my ($canonname, $aliases, $addrtype, $length, @addrs) = gethostbyaddr($addr6, AF_INET6);

In this case, of course, the interesting data is in the $canonname variable.

In scalar context, only the name is returned:

# scalar context, just return one name
use Socket qw(:DEFAULT);
my $hostname = gethostbyaddr($addr4, AF_INET);

# IPv6
use Socket qw(:DEFAULT);
my $hostname = gethostbyaddr($addr6, AF_INET6);

Note that, again, in all cases the passed IP addresses are binary.

The new way

The new and recommended way is protocol-independent (meaning that a name-to-IP lookup can return both IPv4 and IPv6 addresses) and is based on the addrinfo structure mentioned at the beginning. The forward lookup is done with the getaddrinfo() function. The idea is that, when an application needs to populate a sockaddr structure, the system provides it with one already filled with data, which can be directly used for whatever the application needs to do (eg, connect() or bind()).
In fact, getaddrinfo() returns a list of addrinfo structs (in C it's a linked list), each with its own sockaddr data, so the application can try each one in turn, in the same order that they are provided. (Normally the first one will work, without needing to try the next; but there are cases where having more than one possibility to try is useful.)

The C version returns a pointer to a linked list of struct addrinfo; with Perl it's easier as the list is returned in an array. The sample Perl code for getaddrinfo() is:

use Socket qw(:DEFAULT getaddrinfo);
my ($err, @addrs) = getaddrinfo($name, $service, $hints);

If $err is not set (that is, the operation was successful), @addrs contains the list of results. Since Perl has no structs, each element is a reference to a hash whose keys are named after the struct addrinfo members.

However, there are a few things to note:

  • getaddrinfo() can do hostname-to-address as well as service-to-port-number lookups, hence the first two arguments $name and $service. Depending on the actual task, an application might need to do just one type of lookup or the other, or both. In this paragraph we will strictly do hostname resolution; in the following we will do service name resolution.
  • getaddrinfo() is not only IP-version agnostic (in that it can return IPv4 and IPv6 addresses); it is also, so to speak, protocol (TCP, UDP) and socket type (stream, datagram, raw) agnostic. However, suitable values can be passed in the $hints variable to restrict the scope of the returned entries. This way, an application can ask to be given results suitable only for a specific socket type, protocol or address family. But this also means that, if everything is left unspecified, the getaddrinfo() lookup may (and usually does) return up to three entries for each IP address to which the supplied name resolves: one for protocol 6, socket type 1 (TCP, stream socket), one for protocol 17, socket type 2 (UDP, datagram socket) and one for protocol 0, socket type 3 (raw socket).
  • As briefly mentioned, the last argument $hints is a reference to a hash whose keys provide additional information or instructions about the way the lookup should be performed (see example below).

Let's write a simple code snippet to check the above facts.

#!/usr/bin/perl
 
use warnings;
use strict;
 
use Socket qw(:DEFAULT AI_CANONNAME IPPROTO_TCP IPPROTO_UDP IPPROTO_RAW SOCK_STREAM SOCK_DGRAM SOCK_RAW getaddrinfo
              inet_ntop inet_pton);
 
# map protocol number to name
sub pprotocol {
  my ($proto) = @_;
  if ($proto == IPPROTO_TCP) {
    return 'IPPROTO_TCP';
  } elsif ($proto == IPPROTO_UDP) {
    return 'IPPROTO_UDP';
  } else {
    return 'n/a';
  }
}
 
# map socket type number to name
sub psocktype {
  my ($socktype) = @_;
  if ($socktype == SOCK_STREAM) {
    return 'SOCK_STREAM';
  } elsif ($socktype == SOCK_DGRAM) {
    return 'SOCK_DGRAM';
  } elsif ($socktype == SOCK_RAW) {
    return 'SOCK_RAW';
  } else {
    return 'unknown';
  }
}
 
die "Must specify name to resolve" if (not $ARGV[0] and not $ARGV[1]);
 
my $name = $ARGV[0] || undef;      # note: "or" would bind too loosely here and be a no-op
my $service = $ARGV[1] || undef;
 
# we want the canonical name on the first entry returned
my $hints = {};
if ($ARGV[0]) {
  $hints->{flags} = AI_CANONNAME;
}
 
my ($err, @addrs) = getaddrinfo ($name, $service, $hints);
 
die "getaddrinfo: error or no results" if $err;
 
# If we get here, each element of @addrs is a hash
# reference with the following keys (addrinfo struct members):
 
# 'family'      (AF_INET, AF_INET6)
# 'protocol'    (IPPROTO_TCP, IPPROTO_UDP)
# 'canonname'   (Only if requested with the AI_CANONNAME flag, and only on the first entry)
# 'addr'        This is a sockaddr (_in or _in6 depending on the address family above)
# 'socktype'    (SOCK_STREAM, SOCK_DGRAM, SOCK_RAW)
 
# dump results
for(@addrs) {
 
  my ($canonname, $protocol, $socktype) = (($_->{canonname} or ""), pprotocol($_->{protocol}), psocktype($_->{socktype}));
 
  if ($_->{family} == AF_INET) {
 
    # port is always 0 when resolving a hostname
    my ($port, $addr4) = sockaddr_in($_->{addr});
 
    print "IPv4:\n";
    print "  " . inet_ntop(AF_INET, $addr4) . ", port: $port, protocol: $_->{protocol} ($protocol), socktype: $_->{socktype} ($socktype), canonname: $canonname\n";
  } else {
 
    my ($port, $addr6, $scope_id, $flowinfo) = sockaddr_in6($_->{addr});
    print "IPv6:\n";
    print "  " . inet_ntop(AF_INET6, $addr6) . ", port: $port, protocol: $_->{protocol} ($protocol), socktype: $_->{socktype} ($socktype), (scope id: $scope_id, flowinfo: $flowinfo), canonname: $canonname\n";
  }
}

Let's test it:

$ getaddrinfo.pl www.kernel.org
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), (scope id: 0, flowinfo: 0), canonname: pub.all.kernel.org
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 0, protocol: 17 (IPPROTO_UDP), socktype: 2 (SOCK_DGRAM), (scope id: 0, flowinfo: 0), canonname: 
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 0, protocol: 0 (n/a), socktype: 3 (SOCK_RAW), (scope id: 0, flowinfo: 0), canonname: 
IPv4:
  198.145.20.140, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  198.145.20.140, port: 0, protocol: 17 (IPPROTO_UDP), socktype: 2 (SOCK_DGRAM), canonname: 
IPv4:
  198.145.20.140, port: 0, protocol: 0 (n/a), socktype: 3 (SOCK_RAW), canonname: 
IPv4:
  199.204.44.194, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  199.204.44.194, port: 0, protocol: 17 (IPPROTO_UDP), socktype: 2 (SOCK_DGRAM), canonname: 
IPv4:
  199.204.44.194, port: 0, protocol: 0 (n/a), socktype: 3 (SOCK_RAW), canonname: 
IPv4:
  149.20.4.69, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  149.20.4.69, port: 0, protocol: 17 (IPPROTO_UDP), socktype: 2 (SOCK_DGRAM), canonname: 
IPv4:
  149.20.4.69, port: 0, protocol: 0 (n/a), socktype: 3 (SOCK_RAW), canonname: 

As expected, three entries are returned for each resolved IP address. (BTW, the order of the entries matters: it is the order in which client applications should attempt to use them. In this case, IPv6 addresses are given preference, as it should be if the machine has good IPv6 connectivity; again, as it should be.)
In practice, as said, one may want to filter the results, for example by address family (IPv4, IPv6) and/or socket type (stream, datagram, raw) and/or protocol (TCP, UDP). For illustration purposes, let's filter by socket type. This is done using the socktype key of the $hints hash. For example, let's change it as follows to only return results suitable for the creation of sockets of type SOCK_STREAM:

my $hints = {};
$hints->{socktype} = SOCK_STREAM;   # add this line
if ($ARGV[0]) {
  $hints->{flags} = AI_CANONNAME;
}

Now let's run it again:

$ getaddrinfo.pl www.kernel.org
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), (scope id: 0, flowinfo: 0), canonname: pub.all.kernel.org
IPv4:
  198.145.20.140, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  149.20.4.69, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  199.204.44.194, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 

Now the result is more like what one would expect.

Note that there is much more to hint flags than shown above; the C man page for getaddrinfo() and the Perl reference linked at the end provide all the details.

So getaddrinfo() is the recommended way to do hostname to IP address resolution, although gethostbyname() won't probably go away soon.
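As a sketch of the intended usage pattern (try each returned entry, in the given order, until one connects; error handling kept minimal):

use strict;
use warnings;
use Socket qw(getaddrinfo SOCK_STREAM);

my ($err, @addrs) = getaddrinfo("www.kernel.org", "http", { socktype => SOCK_STREAM });
die "getaddrinfo: $err" if $err;

# try each returned entry until one connects
my $sock;
for my $ai (@addrs) {
    socket(my $s, $ai->{family}, $ai->{socktype}, $ai->{protocol}) or next;
    if (connect($s, $ai->{addr})) {
        $sock = $s;
        last;
    }
    close($s);
}
die "could not connect to any returned address" unless $sock;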

The reverse process (from address to name) is performed using getnameinfo(), which is the counterpart to getaddrinfo(). Its usage is quite different from the C version and is as follows:

use Socket qw(:DEFAULT getnameinfo);
my ($err, $hostname, $servicename) = getnameinfo($sockaddr, [$flags, [$xflags]]);

Note that it accepts a sockaddr, so we pass it an address (IPv4 or IPv6) and a port. This should suggest that, just like getaddrinfo(), getnameinfo() can also do port to service name inverse resolution, which it indeed does (see below). Here we are concerned with reverse address resolution; in the following paragraph we'll do service port inverse resolution.

Let's write some code to test getnameinfo():

#!/usr/bin/perl
 
use warnings;
use strict;
 
use Socket qw(:DEFAULT inet_ntop inet_pton getnameinfo);
 
die "Usage: $0 [address] [port]" if (not $ARGV[0] and not $ARGV[1]);
 
my $straddr = ($ARGV[0] or "0.0.0.0");
my $port = ($ARGV[1] or 0);
 
# pack address + port
 
my $sockaddr;
 
# note that we assume the address is correct,
# real code should verify that
 
# stupid way to detect address family
if ($straddr =~ /:/) {
  $sockaddr = sockaddr_in6($port, inet_pton(AF_INET6, $straddr));
} else {
  $sockaddr = sockaddr_in($port, inet_pton(AF_INET, $straddr));
}
 
# do the inverse resolution 
 
my $flags = 0;
my $xflags = 0;
 
my ($err, $hostname, $servicename) = getnameinfo($sockaddr, $flags, $xflags);
 
die "getnameinfo: error or no results" if $err;
 
# dump
print "hostname: $hostname, servicename: $servicename\n";

Let's try it:

$ getnameinfo.pl 198.145.20.140 
hostname: tiz-korg-pub.kernel.org, servicename: 0
$ getnameinfo.pl  2001:4f8:1:10:0:1991:8:25
hostname: pao-korg-pub.kernel.org, servicename: 0

The Perl Socket reference page linked at the bottom provides more details about the possible hint flags that can be passed to getaddrinfo() and getnameinfo(), and their possible return values in case of errors.

According to some sources, if a string representation of an address is passed to getaddrinfo() and the AI_CANONNAME flag is set, that should also work to do inverse resolution, in that the 'canonname' hash key of the returned value should be filled with the hostname. However, it does not seem to be working:

$ getaddrinfo.pl 198.145.20.140
IPv4:
  198.145.20.140, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 198.145.20.140  # not the name

From service name to port number and vice versa

Here, again, there are two ways: the old one, and the new one.

The old way

This is done using getservbyname() and getservbyport() for forward and inverse resolution respectively:

my ($name, $aliases, $port, $proto) =  getservbyname($name, $proto);
my ($name, $aliases, $port, $proto) =  getservbyport($port, $proto);

Examples for both:

$ perl -e 'use warnings; use strict; my ($name, $aliases, $port, $proto) = getservbyname($ARGV[0], $ARGV[1]); print "name is: $name, aliases is: $aliases, port is: $port, proto is: $proto\n";' smtp tcp
name is: smtp, aliases is: , port is: 25, proto is: tcp

$ perl -e 'use warnings; use strict; my ($name, $aliases, $port, $proto) = getservbyport($ARGV[0], $ARGV[1]); print "name is: $name, aliases is: $aliases, port is: $port, proto is: $proto\n";' 80 tcp
name is: http, aliases is: , port is: 80, proto is: tcp

The new way

The new way again uses getaddrinfo()/getnameinfo(), as explained above, since they can do hostname and service resolution in both directions (forward and reverse).
Whereas we ignored the port number in the sockaddr data when doing host-to-IP resolution above, in this case the port number is of course very important.

We can reuse the same code snippets from above, since we allowed for a (then unused) second argument to the program:

$ getaddrinfo.pl '' https
IPv6:
  ::1, port: 443, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), (scope id: 0, flowinfo: 0), canonname: 
IPv4:
  127.0.0.1, port: 443, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
$ getnameinfo.pl '' 443
hostname: 0.0.0.0, servicename: https
$ getnameinfo.pl '' 389
hostname: 0.0.0.0, servicename: ldap

As mentioned before, it's also possible to ask for simultaneous hostname and service name resolution in both directions, eg

$ getaddrinfo.pl www.kernel.org www
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 80, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), (scope id: 0, flowinfo: 0), canonname: pub.all.kernel.org
IPv4:
  198.145.20.140, port: 80, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  149.20.4.69, port: 80, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  199.204.44.194, port: 80, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 

$ getnameinfo.pl 2001:4f8:1:10:0:1991:8:25 443
hostname: pao-korg-pub.kernel.org, servicename: https

Doing so is useful in the common case where the program needs a specific, ready-to-use sockaddr for a given service, address family and/or protocol (ie, the majority of cases), as opposed to just performing name or service resolution.

Reference: Perl Socket documentation.