
Some common networking operations in Perl

A compilation of common operations that are often needed when writing networking code. Hopefully this saves some googling.

The examples will use Perl code; however, from time to time the C data structures will be cited and the C terminology will be used. Both IPv4 and IPv6 will be covered.

Links to sample programs used in the examples: getaddrinfo.pl, getnameinfo.pl. A reasonably new version of Perl is required (in particular, 5.14 from Debian Wheezy is not new enough).

Socket addresses

In C, there's this notion of a "socket address", which is basically the combination of an IP address and a port (and other data, but address and port are the essential pieces of information). Here are the C data structures for IPv4 and IPv6:

/* IPv4 */
struct sockaddr_in {
    sa_family_t    sin_family; /* address family: AF_INET */
    in_port_t      sin_port;   /* port in network byte order */
    struct in_addr sin_addr;   /* internet address */
};

/* IPv6 */
struct sockaddr_in6 {
    sa_family_t     sin6_family;   /* AF_INET6 */
    in_port_t       sin6_port;     /* port number */
    uint32_t        sin6_flowinfo; /* IPv6 flow information */
    struct in6_addr sin6_addr;     /* IPv6 address */
    uint32_t        sin6_scope_id; /* Scope ID (new in 2.4) */
};

In C, lots of networking-related functions accept or return these structures (or, often, pointers to them). The connect() and the bind() functions are two notable examples.
In fact, the C function prototypes use the generic struct sockaddr type, which doesn't really exist in practice (although it has a definition); what is actually passed is a sockaddr_in or a sockaddr_in6, cast to struct sockaddr.

The actual IP addresses are themselves structs, which are defined as follows:

/* IPv4 */
struct in_addr {
    uint32_t       s_addr;         /* address in network byte order */
};

/* IPv6 */
struct in6_addr {
    unsigned char   s6_addr[16];   /* IPv6 address */
};

Then there's the more recent struct addrinfo, which includes a sockaddr member and, additionally, more data:

struct addrinfo {
    int              ai_flags;       // AI_PASSIVE, AI_CANONNAME, etc.
    int              ai_family;      // AF_INET, AF_INET6, AF_UNSPEC
    int              ai_socktype;    // SOCK_STREAM, SOCK_DGRAM
    int              ai_protocol;    // use 0 for "any"
    size_t           ai_addrlen;     // size of ai_addr in bytes
    struct sockaddr  *ai_addr;       // struct sockaddr_in or _in6
    char             *ai_canonname;  // full canonical hostname
    struct addrinfo  *ai_next;       // linked list, next node
};

This structure is used by a class of newer, address-family-independent functions. In particular, code is expected to deal with linked lists of struct addrinfo, as indicated by the fact that the ai_next member points to the same data structure type.

From sockaddr to (host, port, ...) data and vice versa

If we have a Perl variable that represents a sockaddr_in or a sockaddr_in6 (for example as returned by recv()), we can extract the actual member data with code similar to the following:

# IPv4
use Socket qw(unpack_sockaddr_in);
my ($port, $addr4) = unpack_sockaddr_in($sockaddr4);

# IPv6
use Socket qw(unpack_sockaddr_in6);
my ($port, $addr6, $scopeid, $flowinfo) = unpack_sockaddr_in6($sockaddr6);

Note that $addr4 and $addr6 are still binary data; to get their textual representation a further step is needed (see below).

Conversely, if we have the individual fields of a sockaddr, we can pack it into a sockaddr variable as follows:

# IPv4
use Socket qw(pack_sockaddr_in);
$sockaddr4 = pack_sockaddr_in($port, $addr4);

# IPv6
use Socket qw(pack_sockaddr_in6);
$sockaddr6 = pack_sockaddr_in6($port, $addr6, [$scope_id, [$flowinfo]]);

Again, $addr4 and $addr6 must be the binary versions of the addresses, not their string representation.

As a convenience, it is possible to use the sockaddr_in() and sockaddr_in6() functions as shortcuts for both packing and unpacking:

# IPv4
use Socket qw(sockaddr_in);
my ($port, $addr4) = sockaddr_in($sockaddr4);
my $sockaddr4 = sockaddr_in($port, $addr4);

# IPv6
use Socket qw(sockaddr_in6);
my ($port, $addr6, $scopeid, $flowinfo) = sockaddr_in6($sockaddr6);
$sockaddr6 = sockaddr_in6($port, $addr6, [$scope_id, [$flowinfo]]);

From binary address to string representation and vice versa

If we have a binary IP address, we can use inet_ntop() and inet_pton() to convert it to a string (printable) representation:

# IPv4
use Socket qw(AF_INET inet_ntop);
$straddr4 = inet_ntop(AF_INET, $addr4);

# IPv6
use Socket qw(AF_INET6 inet_ntop);
$straddr6 = inet_ntop(AF_INET6, $addr6);

And the reverse process, from string to binary:

# IPv4
use Socket qw(AF_INET inet_pton);
$addr4 = inet_pton(AF_INET, $straddr4);

# IPv6
use Socket qw(AF_INET6 inet_pton);
$addr6 = inet_pton(AF_INET6, $straddr6);

All these functions fail if the argument to be converted is not a valid address in the respective representation.

Get sockaddr data from a socket variable

Sometimes it is necessary to know to which local or remote address or port a certain socket is associated. Typically we have a socket variable (for example, obtained with accept()), which in Perl can be stored in a handle, and we want the corresponding sockaddr data. So here's how to get it:

# Get remote sockaddr info from socket handle
$remotesockaddr = getpeername(SOCK);

# then, as already shown...

# IPv4
($port, $addr4) = sockaddr_in($remotesockaddr);

# or IPv6
($port, $addr6, $scopeid, $flowinfo) = sockaddr_in6($remotesockaddr);

To get sockaddr information for the local end of the socket, getsockname() is used:

# Get local sockaddr info from socket
$localsockaddr = getsockname(SOCK);
...

Note that depending on the protocol (TCP or UDP) and/or whether the socket is bound or connected, the results may or may not make a lot of sense, but that is something the programmer should already know.

From hostname to IP address and vice versa

There are two ways to perform this hyper-common operation: one is older and deprecated, the other is newer and recommended.

The old way

The older way, which is still extremely popular, is somewhat protocol-dependent. Here it is:

# List context, return all the information
($canonname, $aliases, $addrtype, $length, @addrs) = gethostbyname($name);

As an example, let's try it with www.kernel.org:

#!/usr/bin/perl
 
use warnings;
use strict;
 
use Socket qw ( :DEFAULT inet_ntop );
 
my ($canonname, $aliases, $addrtype, $length, @addrs) = gethostbyname('www.kernel.org');
 
print "canonname: $canonname\n";
print "aliases: $aliases\n";
print "addrtype: $addrtype\n";
print "length: $length\n";
print "addresses: " . join(",", map { inet_ntop(AF_INET, $_) } @addrs), "\n";

Running the above outputs:

canonname: pub.all.kernel.org
aliases: www.kernel.org
addrtype: 2
length: 4
addresses: 198.145.20.140,149.20.4.69,199.204.44.194

So it seems there's no way to get it to return IPv6 addresses.

gethostbyname() can also be run in scalar context, in which case it just returns a single IP(v4) address:

# Scalar context, only IP address is returned
$ perl -e 'use Socket qw (:DEFAULT inet_ntop); my $a = gethostbyname("www.kernel.org"); print inet_ntop(AF_INET, $a), "\n";'
149.20.4.69
$ perl -e 'use Socket qw (:DEFAULT inet_ntop); my $a = gethostbyname("www.kernel.org"); print inet_ntop(AF_INET, $a), "\n";'
198.145.20.140
$ perl -e 'use Socket qw (:DEFAULT inet_ntop); my $a = gethostbyname("www.kernel.org"); print inet_ntop(AF_INET, $a), "\n";'
199.204.44.194

Normal DNS round-robin.

The inverse process is done with gethostbyaddr(), which also supports IPv6, though it's deprecated nonetheless. Again, the results differ depending on whether we are in list or scalar context (remember that all addresses have to be binary):

# List context, return more data

# IPv4
use Socket qw(:DEFAULT);
my ($canonname, $aliases, $addrtype, $length, @addrs) = gethostbyaddr($addr4, AF_INET);

# IPv6
use Socket qw(:DEFAULT);
my ($canonname, $aliases, $addrtype, $length, @addrs) = gethostbyaddr($addr6, AF_INET6);

In this case, of course, the interesting data is in the $canonname variable.

In scalar context, only the name is returned:

# scalar context, just return one name
use Socket qw(:DEFAULT);
my $hostname = gethostbyaddr($addr4, AF_INET);

# IPv6
use Socket qw(:DEFAULT);
my $hostname = gethostbyaddr($addr6, AF_INET6);

Note that, again, in all cases the passed IP addresses are binary.

The new way

The new and recommended way is protocol-independent (meaning that a name-to-IP lookup can return both IPv4 and IPv6 addresses) and is based on the addrinfo structure mentioned at the beginning. The forward lookup is done with the getaddrinfo() function. The idea is that, when an application needs to populate a sockaddr structure, the system provides it with one already filled with data, which can be directly used for whatever the application needs to do (eg, connect() or bind()).
In fact, getaddrinfo() returns a list of addrinfo structs (in C it's a linked list), each with its own sockaddr data, so the application can try each one in turn, in the same order that they are provided. (Normally the first one will work, without needing to try the next; but there are cases where having more than one possibility to try is useful.)

The C version returns a pointer to a linked list of struct addrinfo; with Perl it's easier as the list is returned in an array. The sample Perl code for getaddrinfo() is:

use Socket qw(:DEFAULT getaddrinfo);
my ($err, @addrs) = getaddrinfo($name, $service, $hints);

If $err is not set (that is, the operation was successful), @addrs contains a list of results. Since in Perl there are no structs, each element is a reference to a hash whose elements are named after the struct addrinfo members.

However, there are a few things to note:

  • getaddrinfo() can do hostname-to-address as well as service-to-port-number lookups, hence the first two arguments $name and $service. Depending on the actual task, an application might need to do just one type of lookup or the other, or both. In this paragraph we will strictly do hostname resolution; in the following we will do service name resolution.
  • getaddrinfo() is not only IP-version agnostic (in that it can return IPv4 and IPv6 addresses); it is also, so to speak, protocol (TCP, UDP) and socket type (stream, datagram, raw) agnostic. However, suitable values can be passed in the $hints variable to restrict the scope of the returned entries. This way, an application can ask to be given results suitable only for a specific socket type, protocol or address family. But this also means that, if everything is left unspecified, the getaddrinfo() lookup may (and usually does) return up to three entries for each IP address to which the supplied name resolves: one for protocol 6, socket type 1 (TCP, stream socket), one for protocol 17, socket type 2 (UDP, datagram socket) and one for protocol 0, socket type 3 (raw socket).
  • As briefly mentioned, the last argument $hints is a reference to a hash whose keys provide additional information or instructions about the way the lookup should be performed (see example below).

Let's write a simple code snippet to check the above facts.

#!/usr/bin/perl
 
use warnings;
use strict;
 
use Socket qw(:DEFAULT AI_CANONNAME IPPROTO_TCP IPPROTO_UDP IPPROTO_RAW SOCK_STREAM SOCK_DGRAM SOCK_RAW getaddrinfo
              inet_ntop inet_pton);
 
# map protocol number to name
sub pprotocol {
  my ($proto) = @_;
  if ($proto == IPPROTO_TCP) {
    return 'IPPROTO_TCP';
  } elsif ($proto == IPPROTO_UDP) {
    return 'IPPROTO_UDP';
  } else {
    return 'n/a';
  }
}
 
# map socket type number to name
sub psocktype {
  my ($socktype) = @_;
  if ($socktype == SOCK_STREAM) {
    return 'SOCK_STREAM';
  } elsif ($socktype == SOCK_DGRAM) {
    return 'SOCK_DGRAM';
  } elsif ($socktype == SOCK_RAW) {
    return 'SOCK_RAW';
  } else {
    return 'unknown';
  }
}
 
die "Must specify a name and/or service to resolve" if (not $ARGV[0] and not $ARGV[1]);
 
my $name = $ARGV[0] || undef;
my $service = $ARGV[1] || undef;
 
# we want the canonical name on the first entry returned
my $hints = {};
if ($ARGV[0]) {
  $hints->{flags} = AI_CANONNAME;
}
 
my ($err, @addrs) = getaddrinfo ($name, $service, $hints);
 
die "getaddrinfo: error or no results" if $err;
 
# If we get here, each element of @addrs is a hash
# reference with the following keys (addrinfo struct members):
 
# 'family'      (AF_INET, AF_INET6)
# 'protocol'    (IPPROTO_TCP, IPPROTO_UDP)
# 'canonname'   (Only if requested with the AI_CANONNAME flag, and only on the first entry)
# 'addr'        This is a sockaddr (_in or _in6 depending on the address family above)
# 'socktype'    (SOCK_STREAM, SOCK_DGRAM, SOCK_RAW)
 
# dump results
for(@addrs) {
 
  my ($canonname, $protocol, $socktype) = (($_->{canonname} or ""), pprotocol($_->{protocol}), psocktype($_->{socktype}));
 
  if ($_->{family} == AF_INET) {
 
    # port is always 0 when resolving a hostname
    my ($port, $addr4) = sockaddr_in($_->{addr});
 
    print "IPv4:\n";
    print "  " . inet_ntop(AF_INET, $addr4) . ", port: $port, protocol: $_->{protocol} ($protocol), socktype: $_->{socktype} ($socktype), canonname: $canonname\n";
  } else {
 
    my ($port, $addr6, $scope_id, $flowinfo) = sockaddr_in6($_->{addr});
    print "IPv6:\n";
    print "  " . inet_ntop(AF_INET6, $addr6) . ", port: $port, protocol: $_->{protocol} ($protocol), socktype: $_->{socktype} ($socktype), (scope id: $scope_id, flowinfo: $flowinfo), canonname: $canonname\n";
  }
}

Let's test it:

$ getaddrinfo.pl www.kernel.org
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), (scope id: 0, flowinfo: 0), canonname: pub.all.kernel.org
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 0, protocol: 17 (IPPROTO_UDP), socktype: 2 (SOCK_DGRAM), (scope id: 0, flowinfo: 0), canonname: 
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 0, protocol: 0 (n/a), socktype: 3 (SOCK_RAW), (scope id: 0, flowinfo: 0), canonname: 
IPv4:
  198.145.20.140, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  198.145.20.140, port: 0, protocol: 17 (IPPROTO_UDP), socktype: 2 (SOCK_DGRAM), canonname: 
IPv4:
  198.145.20.140, port: 0, protocol: 0 (n/a), socktype: 3 (SOCK_RAW), canonname: 
IPv4:
  199.204.44.194, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  199.204.44.194, port: 0, protocol: 17 (IPPROTO_UDP), socktype: 2 (SOCK_DGRAM), canonname: 
IPv4:
  199.204.44.194, port: 0, protocol: 0 (n/a), socktype: 3 (SOCK_RAW), canonname: 
IPv4:
  149.20.4.69, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  149.20.4.69, port: 0, protocol: 17 (IPPROTO_UDP), socktype: 2 (SOCK_DGRAM), canonname: 
IPv4:
  149.20.4.69, port: 0, protocol: 0 (n/a), socktype: 3 (SOCK_RAW), canonname: 

As expected, three entries are returned for each resolved IP address. By the way, the order of the entries matters: it is the order in which client applications should attempt to use them. In this case, IPv6 addresses are given preference, as they should be if the machine has good IPv6 connectivity (which, again, it should have).
In practice, as said, one may want to filter the results, for example by address family (IPv4, IPv6) and/or socket type (stream, datagram, raw) and/or protocol (TCP, UDP). For illustration purposes, let's filter by socket type, using the socktype key of the $hints hash. For example, let's change the code as follows to only return results suitable for the creation of sockets of type SOCK_STREAM:

my $hints = {};
$hints->{socktype} = SOCK_STREAM;   # add this line
if ($ARGV[0]) {
  $hints->{flags} = AI_CANONNAME;
}

Now let's run it again:

$ getaddrinfo.pl www.kernel.org
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), (scope id: 0, flowinfo: 0), canonname: pub.all.kernel.org
IPv4:
  198.145.20.140, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  149.20.4.69, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  199.204.44.194, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 

Now the result is more like one would expect.

Note that there is much more to hint flags than shown above; the C man page for getaddrinfo() and the Perl reference linked at the end provide all the details.

So getaddrinfo() is the recommended way to do hostname-to-IP-address resolution, although gethostbyname() probably won't go away anytime soon.

The reverse process (from address to name) is performed using getnameinfo(), which is the counterpart to getaddrinfo(). Its usage is quite different from the C version and is as follows:

use Socket qw(:DEFAULT getnameinfo);
my ($err, $hostname, $servicename) = getnameinfo($sockaddr, [$flags, [$xflags]]);

Note that it accepts a sockaddr, so we pass it an address (IPv4 or IPv6) and a port. This should suggest that, just like getaddrinfo(), getnameinfo() can also do port to service name inverse resolution, which it indeed does (see below). Here we are concerned with reverse address resolution; in the following paragraph we'll do service port inverse resolution.

Let's write some code to test getnameinfo():

#!/usr/bin/perl
 
use warnings;
use strict;
 
use Socket qw(:DEFAULT inet_ntop inet_pton getnameinfo);
 
die "Usage: $0 [address] [port]" if (not $ARGV[0] and not $ARGV[1]);
 
my $straddr = ($ARGV[0] or "0.0.0.0");
my $port = ($ARGV[1] or 0);
 
# pack address + port
 
my $sockaddr;
 
# note that we assume the address is correct,
# real code should verify that
 
# stupid way to detect address family
if ($straddr =~ /:/) {
  $sockaddr = sockaddr_in6($port, inet_pton(AF_INET6, $straddr));
} else {
  $sockaddr = sockaddr_in($port, inet_pton(AF_INET, $straddr));
}
 
# do the inverse resolution 
 
my $flags = 0;
my $xflags = 0;
 
my ($err, $hostname, $servicename) = getnameinfo($sockaddr, $flags, $xflags);
 
die "getnameinfo: error or no results" if $err;
 
# dump
print "hostname: $hostname, servicename: $servicename\n";

Let's try it:

$ getnameinfo.pl 198.145.20.140 
hostname: tiz-korg-pub.kernel.org, servicename: 0
$ getnameinfo.pl  2001:4f8:1:10:0:1991:8:25
hostname: pao-korg-pub.kernel.org, servicename: 0

The Perl Socket reference page linked at the bottom provides more details about the possible hint flags that can be passed to getaddrinfo() and getnameinfo(), and their possible return values in case of errors.

According to some sources, if a string representation of an address is passed to getaddrinfo() and the AI_CANONNAME flag is set, that should also perform inverse resolution, in that the 'canonname' hash key of the returned value should be filled with the hostname. However, it does not seem to work:

$ getaddrinfo.pl 198.145.20.140
IPv4:
  198.145.20.140, port: 0, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 198.145.20.140  # not the name

From service name to port number and vice versa

Here, again, there are two ways: the old one, and the new one.

The old way

This is done using getservbyname() and getservbyport() for forward and inverse resolution respectively:

my ($name, $aliases, $port, $proto) = getservbyname($name, $proto);
my ($name, $aliases, $port, $proto) = getservbyport($port, $proto);

Examples for both:

$ perl -e 'use warnings; use strict; my ($name, $aliases, $port, $proto) = getservbyname($ARGV[0], $ARGV[1]); print "name is: $name, aliases is: $aliases, port is: $port, proto is: $proto\n";' smtp tcp
name is: smtp, aliases is: , port is: 25, proto is: tcp

$ perl -e 'use warnings; use strict; my ($name, $aliases, $port, $proto) = getservbyport($ARGV[0], $ARGV[1]); print "name is: $name, aliases is: $aliases, port is: $port, proto is: $proto\n";' 80 tcp
name is: http, aliases is: , port is: 80, proto is: tcp

The new way

The new way uses getaddrinfo()/getnameinfo() again, as explained above, since they can do hostname and service resolution in both directions (forward and reverse).
Whereas we ignored the port number in the sockaddr data when doing host-to-IP resolution above, in this case the port number is of course very important.

We can reuse the same code snippets from above, since we allowed for a (then unused) second argument to the program:

$ getaddrinfo.pl '' https
IPv6:
  ::1, port: 443, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), (scope id: 0, flowinfo: 0), canonname: 
IPv4:
  127.0.0.1, port: 443, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
$ getnameinfo.pl '' 443
hostname: 0.0.0.0, servicename: https
$ getnameinfo.pl '' 389
hostname: 0.0.0.0, servicename: ldap

As mentioned before, it's also possible to ask for simultaneous hostname and service name resolution, in either direction, e.g.:

$ getaddrinfo.pl www.kernel.org www
IPv6:
  2001:4f8:1:10:0:1991:8:25, port: 80, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), (scope id: 0, flowinfo: 0), canonname: pub.all.kernel.org
IPv4:
  198.145.20.140, port: 80, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  149.20.4.69, port: 80, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 
IPv4:
  199.204.44.194, port: 80, protocol: 6 (IPPROTO_TCP), socktype: 1 (SOCK_STREAM), canonname: 

$ getnameinfo.pl 2001:4f8:1:10:0:1991:8:25 443
hostname: pao-korg-pub.kernel.org, servicename: https

Doing so is useful in the common case where the program needs a specific, ready-to-use sockaddr for a given service, address family and/or protocol (i.e., the majority of cases), as opposed to just performing name or service resolution.

Reference: Perl Socket documentation.

“On the fly” IPsec VPN with iproute2

(This has been on the TODO list for months, let's finally get to it.)

Basically we're going to create an IPsec VPN with static manual keys, using only the ip command from iproute2.

As there seems to be some confusion, note that the VPN we're setting up here has nothing to do with a PSK setup, in which normally there is an IKE daemon that dynamically computes the keys and generates new ones at regular intervals. The PSK is used for IKE authentication purposes, and after that the actual IPsec keys are truly dynamic and change periodically.

Here instead we're not using IKE; we just manually generate some static IPsec keys (technically, we generate a bunch of security associations (SAs); each SA has a number of properties, among which are the SPI - a number that identifies the SA - and two keys, one used for authentication/integrity checking and the other for encryption).
As should be clear, this is not something to be used on a regular basis and/or for a long time; it's rather a quick and dirty hack for emergency situations. The problem with passing a lot of VPN traffic using the same keys should be apparent: the more data encrypted with the same key an attacker has, the higher the chances of a successful decryption (which would affect all traffic, including past traffic). Manual keying also has other problems, like the lack of peer authentication and the possibility of replay attacks.
On the plus side, it's very easy to set up and does not require installing any software; the only prerequisites are iproute2 (installed by default on just about any distro nowadays) and a kernel that supports IPsec (ditto).

So here's the sample topology:

[ipsectun diagram: network 10.0.0.0/24 behind GW1 (192.0.2.1), connected over the Internet to GW2 (198.51.100.20), behind which sits network 172.16.0.0/24]

So, quite obviously, we want traffic between 10.0.0.0/24 and 172.16.0.0/24 to be encrypted.

There are two important concepts in IPsec: the SPD (Security Policy Database) and the SAs (Security Associations). A SA contains the actual security parameters and keys to be used between two given peers; since a SA is unidirectional, there will always be two of them, one for each direction of the traffic. The SPD, on the other hand, defines to which traffic a SA should be applied, in terms of source/destination IP ranges and protocols. If a packet matches a SPD policy, the associated SA is applied to it, resulting in its encryption, signing or whatever the SA prescribes.
For them to be of any use, the SPD policies and the SAs must match (with reversed source/destination values) at both ends.

In our case, we're going to manually define both the SPD policies and the SAs using the xfrm subcommand of the ip utility. (In a traditional IKE setup, instead, the security policies are statically defined in configuration files and the SAs are dynamically established by IKE either at startup or upon seeing matching traffic, and periodically renewed.)

The idea here is to have some code that outputs the commands to be run on GW1 and GW2 respectively, so they can be copied and pasted.

Since authentication and encryption keys are an essential part of a SA, let's start by generating them. We want to use AES256 to encrypt, and SHA256 for integrity check, so we know the key length: 256 bit, or 32 bytes. Each SA contains two keys, and there will be two SAs, so we generate four keys.

# bash
declare -a keys
for i in {1..4}; do
  # keys 1 and 3 are for HMAC, keys 2 and 4 are for encryption
  keys[i]=$(xxd -p -l 32 -c 32 /dev/random)
done

As usual, if reading from /dev/random blocks, you need to add some entropy to the pool or use /dev/urandom (less secure, but this setup isn't meant to be very secure anyway).
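
As a quick sanity check on the generated material (an aside, not part of the setup): the -p output of xxd is plain hex, so a 32-byte key must come out as exactly 64 hex digits. Using /dev/urandom here so the check never blocks:

```shell
# a 32-byte key read with "xxd -p" is 64 hex characters
key=$(xxd -p -l 32 -c 32 /dev/urandom)
echo "${#key}"
```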
Each SA needs a unique ID, called the SPI (Security Parameter Index), which is 32 bits (4 bytes) long:

declare -a spi
for i in {1..2}; do
  spi[i]=$(xxd -p -l 4 /dev/random)
done

Finally, there has to be what iproute calls the reqid, which is what links a SA with a SPD policy. Again this is 4 bytes, so let's generate it (each SA has its own reqid):

declare -a reqid
for i in {1..2}; do
  reqid[i]=$(xxd -p -l 4 /dev/random)
done

The code to create the SAs is as follows (same code for GW1 and GW2):

ip xfrm state add src 192.0.2.1 dst 198.51.100.20 proto esp spi "0x${spi[1]}" reqid "0x${reqid[1]}" mode tunnel auth sha256 "0x${keys[1]}" enc aes "0x${keys[2]}"
ip xfrm state add src 198.51.100.20 dst 192.0.2.1 proto esp spi "0x${spi[2]}" reqid "0x${reqid[2]}" mode tunnel auth sha256 "0x${keys[3]}" enc aes "0x${keys[4]}"

Now to the SPD. Here we define which traffic should be encrypted. In our case, of course, it's all traffic between 10.0.0.0/24 and 172.16.0.0/24, in both directions.

# for GW1
ip xfrm policy add src 10.0.0.0/24 dst 172.16.0.0/24 dir out tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid "0x${reqid[1]}" mode tunnel
ip xfrm policy add src 172.16.0.0/24 dst 10.0.0.0/24 dir fwd tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid "0x${reqid[2]}" mode tunnel
ip xfrm policy add src 172.16.0.0/24 dst 10.0.0.0/24 dir in tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid "0x${reqid[2]}" mode tunnel

# for GW2
ip xfrm policy add src 172.16.0.0/24 dst 10.0.0.0/24 dir out tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid "0x${reqid[2]}" mode tunnel
ip xfrm policy add src 10.0.0.0/24 dst 172.16.0.0/24 dir fwd tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid "0x${reqid[1]}" mode tunnel
ip xfrm policy add src 10.0.0.0/24 dst 172.16.0.0/24 dir in tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid "0x${reqid[1]}" mode tunnel

The commands are symmetrical, with src/dst pairs swapped. I'm not 100% sure why a fourth policy in the "fwd" direction is not needed (more information welcome), but looking at what e.g. Openswan does, it seems that it creates only three policies as above and everything works, so let's stick with that.

The last thing to do is to add suitable routes for the traffic that must be encrypted:

# for GW1
ip route add 172.16.0.0/24 dev eth0 src 10.0.0.1
# for GW2
ip route add 10.0.0.0/24 dev eth0 src 172.16.0.1

Specifying the "src" parameter is important here if we want traffic originating on the gateways themselves to go through the tunnel.

Now any traffic between the two networks 10.0.0.0/24 and 172.16.0.0/24 will go through the VPN.
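
To verify that the tunnel actually works, something along these lines can be used (a sketch: 172.16.0.10 is a hypothetical host in the remote subnet, and eth0 is GW1's public interface as in the example topology; everything here requires root):

```shell
# on GW1: list the SAs and policies the kernel actually installed
ip xfrm state show
ip xfrm policy show

# generate some traffic towards the remote subnet
# (172.16.0.10 is a made-up host there)
ping -c 3 172.16.0.10

# meanwhile, on the public interface only ESP packets should be seen
tcpdump -ni eth0 esp
```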

So here is the complete script, with endpoint and subnet addresses parametrized (yes, argument checking - and not only that - could certainly be better):

#!/bin/bash

# doipsec.sh

dohelp(){
  echo "usage: $0 <GW1_public_IP|GW1_range|GW1_internal_IP[|GW1_public_iface[|GW1_GWIP]]> <GW2_public_IP|GW2_range|GW2_internal_IP[|GW2_public_iface[|GW2_GWIP]]>" >&2
  echo "Output commands to set up an ipsec tunnel between two machines" >&2
  echo "Example: $0 '192.0.2.1|10.0.0.0/24|10.0.0.254|eth0' '198.51.100.20|172.16.0.0/24|172.16.0.1|eth1'" >&2
}

if [ $# -ne 2 ] || [ "$1" = "-h" ]; then
  dohelp
  exit 1
fi

IFS="|" read -r GW1_IP GW1_NET GW1_IIP GW1_IF GW1_GWIP <<< "$1"
IFS="|" read -r GW2_IP GW2_NET GW2_IIP GW2_IF GW2_GWIP <<< "$2"

if [ "${GW1_IP}" = "" ] || [ "${GW1_NET}" = "" ] || [ "${GW1_IIP}" = "" ] || \
   [ "${GW2_IP}" = "" ] || [ "${GW2_NET}" = "" ] || [ "${GW2_IIP}" = "" ]; then
  dohelp
  exit 1
fi

# assume eth0 if not specified
[ "${GW1_IF}" = "" ] && GW1_IF=eth0
[ "${GW2_IF}" = "" ] && GW2_IF=eth0

# generate variable data

rand_device=/dev/random    # change to urandom if needed

declare -a keys
for i in {1..4}; do
  # keys 1 and 3 are for HMAC, keys 2 and 4 are for encryption
  keys[i]=$(xxd -p -l 32 -c 32 "${rand_device}")
done

declare -a spi
for i in {1..2}; do
  spi[i]=$(xxd -p -l 4 "${rand_device}")
done

declare -a reqid
for i in {1..2}; do
  reqid[i]=$(xxd -p -l 4 "${rand_device}")
done

# route statement to allow default routing through the tunnel

# sucking heuristic
if [ "${GW1_GWIP}" != "" ] && [ "${GW2_NET}" = "0.0.0.0/0" ]; then
  # add a /32 route to the peer before pointing the default to the tunnel
  GW1_GW2_ROUTE="ip route add ${GW2_IP}/32 dev ${GW1_IF} via ${GW1_GWIP} && ip route del ${GW2_NET} && ip route add ${GW2_NET} dev ${GW1_IF} src ${GW1_IIP}" 
else
  GW1_GW2_ROUTE="ip route add ${GW2_NET} dev ${GW1_IF} src ${GW1_IIP}" 
fi

if [ "${GW2_GWIP}" != "" ] && [ "${GW1_NET}" = "0.0.0.0/0" ]; then
  GW2_GW1_ROUTE="ip route add ${GW1_IP}/32 dev ${GW2_IF} via ${GW2_GWIP} && ip route del ${GW1_NET} && ip route add ${GW1_NET} dev ${GW2_IF} src ${GW2_IIP}" 
else
  GW2_GW1_ROUTE="ip route add ${GW1_NET} dev ${GW2_IF} src ${GW2_IIP}" 
fi

cat << EOF
**********************
Commands to run on GW1
**********************

ip xfrm state flush; ip xfrm policy flush

ip xfrm state add src ${GW1_IP} dst ${GW2_IP} proto esp spi 0x${spi[1]} reqid 0x${reqid[1]} mode tunnel auth sha256 0x${keys[1]} enc aes 0x${keys[2]}
ip xfrm state add src ${GW2_IP} dst ${GW1_IP} proto esp spi 0x${spi[2]} reqid 0x${reqid[2]} mode tunnel auth sha256 0x${keys[3]} enc aes 0x${keys[4]}

ip xfrm policy add src ${GW1_NET} dst ${GW2_NET} dir out tmpl src ${GW1_IP} dst ${GW2_IP} proto esp reqid 0x${reqid[1]} mode tunnel
ip xfrm policy add src ${GW2_NET} dst ${GW1_NET} dir fwd tmpl src ${GW2_IP} dst ${GW1_IP} proto esp reqid 0x${reqid[2]} mode tunnel
ip xfrm policy add src ${GW2_NET} dst ${GW1_NET} dir in tmpl src ${GW2_IP} dst ${GW1_IP} proto esp reqid 0x${reqid[2]} mode tunnel

${GW1_GW2_ROUTE}

**********************
Commands to run on GW2
**********************

ip xfrm state flush; ip xfrm policy flush

ip xfrm state add src ${GW1_IP} dst ${GW2_IP} proto esp spi 0x${spi[1]} reqid 0x${reqid[1]} mode tunnel auth sha256 0x${keys[1]} enc aes 0x${keys[2]}
ip xfrm state add src ${GW2_IP} dst ${GW1_IP} proto esp spi 0x${spi[2]} reqid 0x${reqid[2]} mode tunnel auth sha256 0x${keys[3]} enc aes 0x${keys[4]}

ip xfrm policy add src ${GW2_NET} dst ${GW1_NET} dir out tmpl src ${GW2_IP} dst ${GW1_IP} proto esp reqid 0x${reqid[2]} mode tunnel
ip xfrm policy add src ${GW1_NET} dst ${GW2_NET} dir fwd tmpl src ${GW1_IP} dst ${GW2_IP} proto esp reqid 0x${reqid[1]} mode tunnel
ip xfrm policy add src ${GW1_NET} dst ${GW2_NET} dir in tmpl src ${GW1_IP} dst ${GW2_IP} proto esp reqid 0x${reqid[1]} mode tunnel

${GW2_GW1_ROUTE}
EOF

So for our example we'd run it with something like:

$ doipsec.sh '192.0.2.1|10.0.0.0/24|10.0.0.254|eth0' '198.51.100.20|172.16.0.0/24|172.16.0.1|eth1'
**********************
Commands to run on GW1
**********************

ip xfrm state flush; ip xfrm policy flush

ip xfrm state add src 192.0.2.1 dst 198.51.100.20 proto esp spi 0xfd51141e reqid 0x62502e58 mode tunnel auth sha256 0x4046c2f9ff22725b850e2d981968249dc6c25fba189e701cf9a14e921f91cffb enc aes 0xccd80053ae1b55113a89bc476d0de1d9e8b7bc94655f3af1b0dad7bb9ada1065
ip xfrm state add src 198.51.100.20 dst 192.0.2.1 proto esp spi 0x34e0aac0 reqid 0x66a32a19 mode tunnel auth sha256 0x1caf04f262e889b9b53b6c95bfbb4ef0292616362e8878fe96123610ca000892 enc aes 0x9380e038247fcd893d4f8799389b90bfa4d0b09195495bb94fe3a9fa5c5b699d

ip xfrm policy add src 10.0.0.0/24 dst 172.16.0.0/24 dir out tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid 0x62502e58 mode tunnel
ip xfrm policy add src 172.16.0.0/24 dst 10.0.0.0/24 dir fwd tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid 0x66a32a19 mode tunnel
ip xfrm policy add src 172.16.0.0/24 dst 10.0.0.0/24 dir in tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid 0x66a32a19 mode tunnel

ip route add 172.16.0.0/24 dev eth0 src 10.0.0.254

**********************
Commands to run on GW2
**********************

ip xfrm state flush; ip xfrm policy flush

ip xfrm state add src 192.0.2.1 dst 198.51.100.20 proto esp spi 0xfd51141e reqid 0x62502e58 mode tunnel auth sha256 0x4046c2f9ff22725b850e2d981968249dc6c25fba189e701cf9a14e921f91cffb enc aes 0xccd80053ae1b55113a89bc476d0de1d9e8b7bc94655f3af1b0dad7bb9ada1065
ip xfrm state add src 198.51.100.20 dst 192.0.2.1 proto esp spi 0x34e0aac0 reqid 0x66a32a19 mode tunnel auth sha256 0x1caf04f262e889b9b53b6c95bfbb4ef0292616362e8878fe96123610ca000892 enc aes 0x9380e038247fcd893d4f8799389b90bfa4d0b09195495bb94fe3a9fa5c5b699d

ip xfrm policy add src 172.16.0.0/24 dst 10.0.0.0/24 dir out tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid 0x66a32a19 mode tunnel
ip xfrm policy add src 10.0.0.0/24 dst 172.16.0.0/24 dir fwd tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid 0x62502e58 mode tunnel
ip xfrm policy add src 10.0.0.0/24 dst 172.16.0.0/24 dir in tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid 0x62502e58 mode tunnel

ip route add 10.0.0.0/24 dev eth1 src 172.16.0.1

Note that it's not necessarily the two networks local to GW1 and GW2 that have to be connected by the tunnel. If GW2 had, say, an existing route to 192.168.0.0/24, it would be perfectly possible to say:

$ doipsec.sh '192.0.2.1|10.0.0.0/24|10.0.0.254|eth0' '198.51.100.20|192.168.0.0/24|172.16.0.1|eth1'
...

to encrypt traffic from/to 10.0.0.0/24 and 192.168.0.0/24. Of course, in this case either hosts in 192.168.0.0/24 must somehow have a route back to 10.0.0.0/24 going through GW2, or GW2 must NAT traffic coming from 10.0.0.0/24 destined to 192.168.0.0/24 (and hosts there must still have a route back to GW2's masquerading address), but that should be obvious.

In the same way, it's possible to just route everything to/from site A through the tunnel (although I would not recommend it):

$ doipsec.sh '192.0.2.1|10.0.0.0/24|10.0.0.254|eth0|192.0.2.254' '198.51.100.20|0.0.0.0/0|172.16.0.1|eth1'
**********************
Commands to run on GW1
**********************

ip xfrm state flush; ip xfrm policy flush

ip xfrm state add src 192.0.2.1 dst 198.51.100.20 proto esp spi 0x00127764 reqid 0xd7d184b1 mode tunnel auth sha256 0x8dcce7d80f7c8bb81e6a526b9d5d7ce2e7a474e3406c40953108b6d92b61cb77 enc aes 0xf9d41041fc014b94d602ed051800601464cdbc525847d5894ed03f55b8b5e78c
ip xfrm state add src 198.51.100.20 dst 192.0.2.1 proto esp spi 0xec8fe8cb reqid 0x18fcbfd1 mode tunnel auth sha256 0xc1dbbafc0deff6d4bfe0e2736d443d94ffe25ce8637e6f70e3260c87cf8f9724 enc aes 0x6170cd164092554bfd8402c528439c2c3d9823b74b493d9c18ca05a9c3b40a0d

ip xfrm policy add src 10.0.0.0/24 dst 0.0.0.0/0 dir out tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid 0xd7d184b1 mode tunnel
ip xfrm policy add src 0.0.0.0/0 dst 10.0.0.0/24 dir fwd tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid 0x18fcbfd1 mode tunnel
ip xfrm policy add src 0.0.0.0/0 dst 10.0.0.0/24 dir in tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid 0x18fcbfd1 mode tunnel

ip route add 198.51.100.20/32 dev eth0 via 192.0.2.254 && ip route del 0.0.0.0/0 && ip route add 0.0.0.0/0 dev eth0 src 10.0.0.254

**********************
Commands to run on GW2
**********************

ip xfrm state flush; ip xfrm policy flush

ip xfrm state add src 192.0.2.1 dst 198.51.100.20 proto esp spi 0x00127764 reqid 0xd7d184b1 mode tunnel auth sha256 0x8dcce7d80f7c8bb81e6a526b9d5d7ce2e7a474e3406c40953108b6d92b61cb77 enc aes 0xf9d41041fc014b94d602ed051800601464cdbc525847d5894ed03f55b8b5e78c
ip xfrm state add src 198.51.100.20 dst 192.0.2.1 proto esp spi 0xec8fe8cb reqid 0x18fcbfd1 mode tunnel auth sha256 0xc1dbbafc0deff6d4bfe0e2736d443d94ffe25ce8637e6f70e3260c87cf8f9724 enc aes 0x6170cd164092554bfd8402c528439c2c3d9823b74b493d9c18ca05a9c3b40a0d

ip xfrm policy add src 0.0.0.0/0 dst 10.0.0.0/24 dir out tmpl src 198.51.100.20 dst 192.0.2.1 proto esp reqid 0x18fcbfd1 mode tunnel
ip xfrm policy add src 10.0.0.0/24 dst 0.0.0.0/0 dir fwd tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid 0xd7d184b1 mode tunnel
ip xfrm policy add src 10.0.0.0/24 dst 0.0.0.0/0 dir in tmpl src 192.0.2.1 dst 198.51.100.20 proto esp reqid 0xd7d184b1 mode tunnel

ip route add 10.0.0.0/24 dev eth1 src 172.16.0.1

In this last case, GW2 must obviously perform NAT on at least some of the traffic coming from 10.0.0.0/24. IMPORTANT: since the fifth argument has been specified for GW1 and the remote network is 0.0.0.0/0, the resulting commands include a statement that temporarily deletes the default route on GW1, before recreating it to point into the tunnel. If you're running the commands remotely (eg via SSH) on the relevant machine, things can go wrong and screw up pretty easily. You must always inspect the generated routing code to make sure it's fine for your case, and take the necessary precautions to avoid losing access. This code isn't meant to be production-level anyway.

Another point worth noting is that the generated commands

ip xfrm state flush; ip xfrm policy flush

will remove any trace of IPsec configuration, including preexisting tunnels that may be configured and possibly running. But if that is the case, it means there is a "real" IPsec implementation on the machine, so that's what should be used for the new tunnel too, not the kludgy script described here.

So that's it for this klu^Wexperiment. In principle, one could envision some sort of scheduled task synchronized between the machines that updates the SAs, or generates new ones with new keys, at regular intervals (ip xfrm allows for that); in practice, anything more complex would be too much work for a task for which a well-known protocol already exists, namely IKE, which is what should be used for any serious IPsec deployment anyway.

“Range of fields” in awk

This is an all-time awk FAQ. It can be stated in various ways. A typical way is:

"How can I print the whole line except the first (or the first N, or the Nth) field?"

Or also:

"How can I print only from field N to field M?"

The underlying general question is:

"How can I print a range of fields with awk?"

There are actually quite a few ways to accomplish the task, each with its own applicability scenarios and its pros and cons. Let's start with methods that only use standard Awk features, then we'll get to GNU awk.

Use a loop

This is the most obvious way: just loop from N to M and print the corresponding fields.

sep = ""
for (i = 3; i<=NF; i++) {
  printf "%s%s", sep, $i
  sep = FS
}
print ""

This is easy, but has some issues: first, the original record spacing is lost. If the input record (line) was, say,

  abc  def   ghi    jkl     mno

the above code will print

ghi jkl mno

instead. This might or might not be a problem. For the same reason, if FS is a complex regular expression, whatever separated the fields in the original input is lost.
On the other hand, if FS is exactly a single character (other than space, which is the default and is special-cased), the above code works just fine.
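As a quick sanity check, here's the loop method run over the sample record above (a self-contained sketch; the input line is the one from the example):

```shell
echo 'abc  def   ghi    jkl     mno' |
awk '{
  sep = ""
  for (i = 3; i <= NF; i++) {
    printf "%s%s", sep, $i
    sep = FS          # FS is the default single space here
  }
  print ""
}'
# prints "ghi jkl mno": the runs of blanks from the input are collapsed
```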

Assign the empty string to the unwanted fields

So for example one might do:

$1 = $2 = ""; print substr($0, 3)

That presents the same problems as the first solution (formatting is lost), although for different reasons (here it's because awk rebuilds the line with OFS between fields), and introduces empty fields, which have to be skipped when printing the line (in the above example, the default OFS of space is assumed, so we must print starting from the third character; adapt accordingly if OFS is something else).
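For example, a minimal sketch with the default FS and OFS:

```shell
echo 'abc def ghi jkl' |
awk '{ $1 = $2 = ""; print substr($0, 3) }'
# prints "ghi jkl": assigning to $1/$2 rebuilds $0 with OFS between
# fields, leaving two leading spaces that substr() skips
```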

Delete the unwanted fields

Ok, so it's not possible to delete a field by assigning the empty string to it, but if we modify $0 directly we can indeed remove parts of it and thus fields. We can use sub() for the task:

# default FS
# removes first 2 fields
sub(/^[[:blank:]]*([^[:blank:]]+[[:blank:]]+){2}/,""); print
# removes last 3 fields
sub(/([[:blank:]]+[^[:blank:]]+){3}[[:blank:]]*$/,""); print

# one-char FS, for example ";"
# removes first 2 fields
sub(/^([^;]+;){2}/,""); print
# removes last 3 fields
sub(/(;[^;]+){3}$/,""); print

While this approach has the advantage that it preserves the original formatting (this is especially important if FS is the default, which in awk is slightly special-cased, as can be seen from the first example), it has the problem that it's not applicable at all if FS is a regular expression (that is, when it's not the default and is longer than one character).
It also requires that the Awk implementation in use understands the regex {} quantifier operator, something many awks don't do (although this can be worked around by "expanding" the expression, that is, for example, using "[^;]+;[^;]+;[^;]+;" instead of "([^;]+;){3}". However, the resulting expression might be quite long and awkward - pun intended).
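For instance, here's the removal of the first two ";"-separated fields using the expanded form of the expression (a sketch that avoids the {} quantifier, so it also runs on awks without interval support):

```shell
echo 'a;b;c;d;e' |
awk -F ';' '{ sub(/^[^;]*;[^;]*;/, ""); print }'
# prints "c;d;e": the first two fields and their separators are gone,
# and whatever formatting the rest of the line had is untouched
```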

Manually find start and end of fields

Let's now try to find a method that works regardless of FS or OFS. We observe that we can use index($0, $1) to find where $1 begins. We also know the length of $1, so we know where it ends within $0. Now we can use index() again, starting from the next character, to find where $2 begins, and so on for all fields of $0. This way we can discover the starting position within $0 of every field. Sample code:

pos = 0
for (i=1; i<= NF; i++) {
  start[i] = index(substr($0, pos + 1), $i) + pos
  pos = start[i] + length($i)
}

Now, start[1] contains the starting position of field 1 ($1), start[2] the starting position of $2, etc. (As customary in awk, the first character of a string is at position 1.) With this information, printing field 3 to NF without losing information is as simple as doing

first = 3
last = NF
print substr($0, start[first], start[last] - start[first] + length($last))
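Putting the two snippets together, here's a self-contained sketch that prints fields 2 to 3 of a sample record, preserving the original spacing:

```shell
echo 'aa   bb  cc dd' |
awk '{
  # record the starting position of every field
  pos = 0
  for (i = 1; i <= NF; i++) {
    start[i] = index(substr($0, pos + 1), $i) + pos
    pos = start[i] + length($i)
  }
  # print from the start of field 2 to the end of field 3
  first = 2
  last = 3
  print substr($0, start[first], start[last] - start[first] + length($last))
}'
# prints "bb  cc", with the original double space preserved
```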

Seems easy, right? Well, this approach has a problem: it assumes that the input has no empty fields, which however are perfectly fine in awk. If some of the fields in the desired range are empty, it may or may not work. So let's see if we can do better.

Manually find the separators

By design, FS can never match the empty string (more on this later), so perhaps we can look for matches of FS (using match()) and use those offsets to extract the needed fields. The idea is the same as in the previous approach, each match is attempted starting from where the previous one left off plus the length of the following field.
If we go this route, however, we must keep in mind that the default FS in awk is special-cased, in that leading and trailing blanks (spaces + tabs) in the record are not counted for the purpose of field splitting, and furthermore fields are separated by runs of blanks despite FS being just a single space. This only happens with the default FS; with any other value, each match terminates exactly one field. Fortunately, it is possible to check whether FS is the default by comparing it to the string " " (a space). If we detect the default FS, we remove leading and trailing blanks from the record, and, for the purpose of matching, change it to its effectively equivalent pattern, that is, "[[:blank:]]+".
If FS is not the default, there is still another special case we should check. The awk specification says that if FS is exactly one character (and is not a space), it must NOT be treated as a regular expression. Since we want to use match() and FS as a pattern, this is especially important, for example if FS is ".", or "+", or "*", which are special regular expression metacharacters but should be treated literally in this case.
All that being said, here's some code that finds and saves all matches of FS:

BEGIN {
  # sep_re is the "effective" FS, so to speak, to be
  # used to find where separators are
  sep_re = FS
  defaultfs = 0

  # ...but check for special cases
  if (FS == " ") {
    defaultfs = 1
    sep_re = "[[:blank:]]+"
  } else if (length(FS) == 1) {
    if (FS ~ /[][^$.*?+{}\\()|]/) {
      sep_re = "\\" FS
    }
  }
}

{
  # save $0 and work on the copy
  record = $0

  if (defaultfs) {
    gsub(/^[[:blank:]]+|[[:blank:]]+$/, "", record)
  }

  # find separators
  i = 0
  while(1) {
    if (match(record, sep_re)) {
      i++
      seps[i] = substr(record, RSTART, RLENGTH)
      record = substr(record, RSTART + RLENGTH)
    } else {
      break
    }
  }

  # ...continued below

With the above code seps[i] contains the string that matched FS between field i and i + 1. We of course also have the fields themselves in $1...$NF, so we can finally write the code that extracts a range of fields from the line:

  # ...continued from above

  result = ""

  first = 3
  last = NF
  for (i = first; i < last; i++) {
    result = result $i seps[i]
  }
  result = result $last
  print result
}

Are we still overlooking something? Unfortunately, yes.
We said earlier that FS can't match the empty string; however, technically we can obviously set it to a value that would ordinarily match the empty string, for example

FS="a*"

That matches zero or more a's, so in particular it will produce a zero-length match if it can't find an "a".
But, just as obviously, an FS that can match a zero-length string is useless as field "separator", so what happens in these cases is that awk just does not allow it to match:

$ echo 'XXXaYYYaaZZZ' | awk -F 'a*' '{for (i=1; i<=NF; i++) print i, $i}'
1 XXX
2 YYY
3 ZZZ

In other words, if awk finds a match of length zero it just ignores it and skips to the next character until it can find a match of length at least 1 for FS.

(Let's leave aside the fact that setting FS to "a*" makes no sense, as in that case what's really wanted is "a+" instead, and let's try to make the code handle the worst case.)

In our sample code, we're using match(), which can indeed produce zero-length matches, but we are not checking for those cases; the result is that running it with an FS that can produce zero-length matches will loop forever.

Thus we need to mimic awk's field splitting a little bit more, in that if we find a zero-length match, we just ignore it and try to match again starting from the next character.
So here's the full code to print a range of fields preserving format and separators, with the revised loop to find separators skipping zero-length matches:

BEGIN {
  # sep_re is the "effective" FS, so to speak, to be
  # used to find where separators are
  sep_re = FS
  defaultfs = 0

  # ...but check for special cases
  if (FS == " ") {
    defaultfs = 1
    sep_re = "[[:blank:]]+"
  } else if (length(FS) == 1) {
    if (FS ~ /[][^$.*?+{}\\()|]/) {
      sep_re = "\\" FS
    }
  }
}

{
  # save $0 and work on the copy
  record = $0

  if (defaultfs) {
    gsub(/^[[:blank:]]+|[[:blank:]]+$/, "", record)
  }

  # find separators
  i = 0
  while(1) {
    if (length(record) == 0) break;
    if (match(record, sep_re)) {
      if (RLENGTH > 0) {
        i++
        seps[i] = substr(record, RSTART, RLENGTH)
        record = substr(record, RSTART + RLENGTH)
      } else {
        # ignore zero-length match: go to next char
        record = substr(record, 2)
      }
    } else {
      break
    }
  }

  result = ""

  first = 3
  last = NF
  for (i = first; i < last; i++) {
    result = result $i seps[i]
  }
  result = result $last
  print result
}

A simple optimization of the above code would be to directly skip the next field upon finding a match for FS, eg

# attempt next match after the field that begins here;
# that field is $(i+1), since seps[i] separates fields i and i+1
record = substr(record, RSTART + RLENGTH + length($(i + 1)))

since, by definition, a field can never match FS, and can therefore be skipped entirely for the purpose of finding matches of FS.
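To see the separator-preserving approach in action, here's a condensed sketch with a multi-character regex FS (so none of the special cases apply, and "-+" can never match the empty string), printing fields 2 to NF with the original separators:

```shell
echo 'a--b---c--d-e' |
awk -F '-+' '{
  # collect the actual separator strings
  record = $0
  i = 0
  while (match(record, FS) && RLENGTH > 0) {
    seps[++i] = substr(record, RSTART, RLENGTH)
    record = substr(record, RSTART + RLENGTH)
  }
  # reassemble fields 2..NF with their separators
  result = ""
  for (i = 2; i < NF; i++)
    result = result $i seps[i]
  print result $NF
}'
# prints "b---c--d-e": the varying runs of dashes are kept as-is
```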

GNU awk

As often happens, life is easier for GNU awk users. In this case, it's thanks to the optional fourth argument of the split() function (a GNU awk extension present at least since 4.0), which is an array where the separators are saved. So all that is needed is something like:

# this does all the hard work, as split() is
# guaranteed to behave like field splitting
nf = split($0, fields, FS, seps)

result = ""
first = 3
last = nf
for (i = first; i < last; i++) {
  result = result fields[i] seps[i]
}
result = result fields[last]
print result

For more and a slightly different take on the subject, see also this page on the awk.freeshell.org wiki.

Three text processing tasks

Just three problems that came up in different circumstances in the last couple of months.

Ranges, again

Ranges strike again, this time the task is to print or select everything from the first occurrence of /START/ in the input to the last occurrence of /END/, including the extremes or not. So, given this sample input:

 1 xxxx
 2 xxxx
 3 END
 4 aaa
 5 START
 6 START
 7 zzz
 8 START
 9 hhh
10 END
11 ppp
12 END
13 mmm
14 START

we want to match from line 5 to 12 (or from line 6 to 11 in the noninclusive version).

The logic is something along the lines of: when /START/ is seen, start collecting lines. Each time an /END/ is seen (and /START/ was previously seen), print what we have so far, empty the buffer and start collecting lines again, in case we see another /END/ later.

Here's an awk solution for the inclusive case:

awk '!ok && /START/ { ok = 1 }
ok { p = p sep $0; sep = RS }
ok && /END/ { print p; p = sep = "" }' file.txt

and here's the noninclusive case, which is mostly the same code with the order of the blocks reversed:

awk 'ok && /END/ { if (content) print p; p = sep = "" }
ok { p = p sep $0; sep = RS; content = 1 }
!ok && /START/ { ok = 1 }' file.txt

The "content" variable is necessary for the obscure corner case in which the input contains something like

...
START

END
...

If we relied upon "p" not being empty to decide whether to print or not, this case would be indistinguishable from this other one:

...
START
END
...

We could also (perhaps a bit cryptically) avoid the extra variable and rely on "sep" being set instead. We keep the extra variable for the sake of clarity.
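A quick run of the inclusive version over a minimal input (a sketch; printf stands in for file.txt):

```shell
printf '%s\n' xxx END START zzz END mmm |
awk '!ok && /START/ { ok = 1 }
ok { p = p sep $0; sep = RS }
ok && /END/ { print p; p = sep = "" }'
# prints the three lines START, zzz, END: the leading xxx/END and the
# trailing mmm (accumulated but never flushed) are excluded
```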

Here are two sed solutions implementing the same logic (not really recommended, but the original request was to solve this with sed). The hold buffer is used to accumulate lines.
Inclusive:

# sed -n
# from first /START/ to last /END/, inclusive version

/START/ {
  H
  :loop
  $! {
    n
    H
    # if we see an /END/, sanitize and print
    /END/ {
      x
      s/^\n//
      p
      s/.*//
      x
    }
    bloop
  }
}

The noninclusive version uses the same logic, except we discard the first /START/ line that we see (done by the "n" in the loop), and, when we see an /END/, we print what we have so far (which crucially does not include the /END/ line itself, which however is included for the next round of accumulation).

# sed -n
# from first /START/ to last /END/, noninclusive version

/START/ {
  :loop
  $! {
    n
    /END/ {
      # recover lines accumulated so far
      x

      # if there is something, print
      /./ {
        # remove leading \n added by H
        s/^\n//
        p
      }

      # empty the buffer
      s/.*//

      # recover the /END/ line for next round
      x
    }
    H
    bloop
  }
}

Note that the above solutions assume that no line exists that matches both /START/ and /END/. Other solutions are of course possible.

Conditional line join

In this case we have some special lines (identified by a pattern). Every time a special line is seen, all the previous or following lines should be joined to it. An example to make it clear, using /SPECIAL/ as our pattern:

SPECIAL 1
line2
line3
SPECIAL 2
line5
line6
line7
SPECIAL 3
SPECIAL 4
line10
SPECIAL 5

So we want one of the two following outputs, depending on whether we join the special lines to the preceding or the following ones:

# join with following lines
SPECIAL 1 line2 line3
SPECIAL 2 line5 line6 line7
SPECIAL 3
SPECIAL 4 line10
SPECIAL 5
# join with preceding lines
SPECIAL 1
line2 line3 SPECIAL 2
line5 line6 line7 SPECIAL 3
SPECIAL 4
line10 SPECIAL 5

The sample input has been artificially crafted to work with both types of change; in practice, in real inputs either the first or the last line won't match /SPECIAL/, depending on the needed processing.

So here's some awk code that joins each special line with the following ones, until a new special line is found, thus producing the first of the two outputs shown above:

awk -v sep=" " '/SPECIAL/ && done == 1 {
  print ""
  s = ""
  done = 0
}
{
  printf "%s%s", s, $0
  s = sep
  done = 1
}
END {
  if (done) print ""
}' file.txt

And here's the idiomatic solution to produce the second output (join with preceding lines):

awk -v sep=" " '{ ORS = /SPECIAL/ ? RS : sep }1' file.txt

The variable "sep" should be set to the desired separator to be used when joining lines (here it's simply a space).
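For example, feeding a small input to the idiomatic one-liner (a sketch; printf stands in for file.txt):

```shell
printf '%s\n' 'SPECIAL 1' line2 line3 'SPECIAL 2' |
awk -v sep=" " '{ ORS = /SPECIAL/ ? RS : sep }1'
# prints two lines:
#   SPECIAL 1
#   line2 line3 SPECIAL 2
# each line is terminated by sep, except /SPECIAL/ lines which get RS
```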

Intra-block sort

(for want of a better name)

Let's imagine an input file like

alpha:9832
alpha:11
alpha:449
delta:23847
delta:113
gamma:1
gamma:10
gamma:100
gamma:101
beta:5768
beta:4

The file has sections, where the first field names the section (alpha, beta etc.). Now we want to sort each section according to its second field (numeric), but without changing the overall order of the sections. In other words, we want this output:

alpha:11
alpha:449
alpha:9832
delta:113
delta:23847
gamma:1
gamma:10
gamma:100
gamma:101
beta:4
beta:5768

As a variation, blocks can be separated by a blank line, as follows:

alpha:9832
alpha:11
alpha:449

delta:23847
delta:113

gamma:1
gamma:10
gamma:100
gamma:101

beta:5768
beta:4

So the corresponding output should be

alpha:11
alpha:449
alpha:9832

delta:113
delta:23847

gamma:1
gamma:10
gamma:100
gamma:101

beta:4
beta:5768
Shell

The blatantly obvious solution using the shell is to number each section by adding a new field at the beginning, then sort according to field 1 + field 3, and finally print the result, removing the extra field that we added:

awk -F ':' '$1 != prev {count++} {prev = $1; print count FS $0}' file.txt | sort -t ':' -k1,1n -k3,3n | awk -F ':' '{print substr($0,index($0,FS)+1)}'
alpha:11
alpha:449
alpha:9832
delta:113
delta:23847
gamma:1
gamma:10
gamma:100
gamma:101
beta:4
beta:5768

Instead of reusing awk, the job of the last part of the pipeline could have been done for example with cut or sed.

For the variation with separated blocks, an almost identical solution works. Paragraphs are numbered prepending a new field, the result sorted, and the prepended numbers removed before printing:

awk -v count=1 '/^$/{count++}{print count ":" $0}' file.txt | sort -t ':' -k1,1n -k3,3n | awk -F ':' '{print substr($0,index($0,FS)+1)}'
alpha:11
alpha:449
alpha:9832

delta:113
delta:23847

gamma:1
gamma:10
gamma:100
gamma:101

beta:4
beta:5768

A crucial property of this solution is that empty lines are always thought of as being part of the next paragraph (not the previous one), so when sorting they remain where they are. This also means that runs of empty lines in the input are preserved in the output.

Perl

The previous solutions treat the input as a single entity, regardless of how many blocks it has. After preprocessing, sort is applied to the whole data, and if the file is very big, many temporary resources (disk, memory) are needed to do the sorting.

Let's see if it's possible to be a bit more efficient and sort each block independently.

Here is an example with perl that works with both variations of the input (without and with separated blocks).

#!/usr/bin/perl

use warnings;
use strict;

sub printblock {
  print $_->[1] for (sort { $a->[0] <=> $b->[0] } @_);
}

my @block = ();
my ($prev, $cur, $val);

while(<>){

  my $empty = /^$/;

  if (!$empty) {
    ($cur, $val) = /^([^:]*):([^:]*)/;
    chomp($val);
  }

  if (@block && ($empty || $cur ne $prev)) {
    printblock(@block);
    @block = ();
  }

  if ($empty) {
    print;
  } else {
    push @block, [ $val, $_ ];
    $prev = $cur;
  }
}

printblock(@block) if (@block);

Of course all the sample code given here must be adapted to the actual input format.

File encryption on the command line

This list is just a reference which hopefully saves some googling.

Let's make it clear that we're talking about symmetric encryption here, that is, a password (or better, a passphrase) is supplied when the file is encrypted, and the same password can be used to decrypt it. No public/private key stuff or other preparation should be necessary. We want a quick and simple way of encrypting files (for example, before moving them to the cloud or to an offsite backup not under our control). As said, file encryption, not whole filesystems or devices.

Another important thing is that symmetric encryption is vulnerable to brute force attacks, so a strong password should always be used and the required level of security should always be evaluated. It may be that symmetric encryption is not the right choice for a specific situation.

It is worth noting that the password or passphrase supplied to the commands is not used directly for encryption/decryption, but rather is used to derive the actual encryption/decryption keys. However, this is done transparently by the tools (usually through some sort of hashing) and for all practical purposes these passwords or passphrases are the keys, and should be treated as such.

In particular, one thing that should be avoided is putting them directly on the command line. Although some tools allow that, the same tools generally also offer options to avoid it, and they should definitely be used.

Openssl

Probably the simplest and most commonly installed tool is openssl.

# Encrypt
$ openssl enc -aes-192-cbc -in plain.txt -out encrypted.enc
# Decrypt
$ openssl enc -d -aes-192-cbc -in encrypted.enc -out plain.txt

The above is the basic syntax. The cipher name can of course be different; the man page for openssl's enc subcommand lists the supported algorithms (the official docs also say: "The output of the enc command run with unsupported options (for example openssl enc -help) includes a list of ciphers, supported by your version of OpenSSL, including ones provided by configured engines." Still, it seems that adding a regular -help or -h option wouldn't be too hard). Other useful options:

  • -d to decrypt
  • -pass to specify a password source. In turn, the argument can have various formats: pass:password to specify the password directly in the command, env:var to read it from the environment variable $var, file:pathname to read it from the file at pathname, fd:number to read it from a given file descriptor, and stdin to read it from standard input (equivalent to fd:0, but NOT equivalent to reading it from the user's terminal, which is the default behavior if -pass is not specified)
  • -a to base64-encode the encrypted file (or assume it's base64-encoded if decrypting)

Openssl can also read the data to encrypt from standard input (if no file is specified with -in) and/or write to standard output (if -out is not given). Example with password from file:

# Encrypt
$ tar -czvf - file1 file2 ... | openssl enc -aes-192-cbc -pass file:/path/to/keyfile -out archive.tar.gz.enc
# Decrypt
$ openssl enc -d -aes-192-cbc -pass file:/path/to/keyfile -in archive.tar.gz.enc | tar -xzvf -
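Similarly, here's a round-trip sketch using the env: password source (assumes openssl is installed; the password and file names are made up for the example):

```shell
# round trip using a password taken from the environment,
# so it never appears on any command line
export PW=correcthorse        # made-up password, example only
printf 'some secret data\n' > plain.txt
openssl enc -aes-192-cbc -pass env:PW -in plain.txt -out encrypted.enc
openssl enc -d -aes-192-cbc -pass env:PW -in encrypted.enc -out roundtrip.txt
cmp plain.txt roundtrip.txt && echo same
```

If all goes well, cmp is silent and "same" is printed (newer openssl versions may also print a key-derivation warning on stderr).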

GPG

There are two main versions of GPG, the 1.x series and the 2.x series (respectively 1.4.x and 2.0.x at the time of writing).

gpg comes with a companion program, gpg-agent, which can be used to store and retrieve the passphrases used to unlock private keys, in much the same way that ssh-agent caches password-protected SSH private keys (actually, in addition to its own job, gpg-agent can optionally do the job of ssh-agent and replace it). Using gpg-agent is optional with gpg 1.x, but mandatory with gpg 2.x. In practice, when doing symmetric encryption, the agent is not used, so we won't talk about it here (although we will briefly mention it later when talking about aespipe, since that tool can use it).

GPG 1.4.x
# Encrypt file
$ gpg --symmetric --cipher-algo AES192 --output encrypted.enc plain.txt
# Decrypt file
$ gpg --decrypt --output plain.txt encrypted.enc

# Encrypt stdin to file
$ tar -czvf - file1 file2 ... | gpg --symmetric --cipher-algo AES192 --output archive.tar.gz.enc
# Decrypt file to stdout
$ gpg --decrypt archive.tar.gz.enc | tar -xzvf -

Useful options:

  • -a (when encrypting) create ascii-armored file (ie, a special text file)
  • --cipher-algo ALG (when encrypting) use ALG as cipher algorithm (run gpg --version to get a list of supported ciphers)
  • --batch avoid asking questions to the user (eg whether to overwrite a file). If the output file exists, the operation fails unless --yes is also specified
  • --yes assume an answer of "yes" to most questions (eg when overwriting an output file, which would otherwise ask for confirmation)
  • --no-use-agent to avoid the "gpg: gpg-agent is not available in this session" message that, depending on configuration, might be printed if gpg-agent is not running (it's only to avoid the message; as said, the agent is not used anyway with symmetric encryption)
  • --passphrase string use string as the passphrase
  • --passphrase-file file read passphrase from file
  • --passphrase-fd n read passphrase from file descriptor n (use 0 for stdin)
  • --quiet suppress some output messages
  • --no-mdc-warning (when decrypting) suppress the "gpg: WARNING: message was not integrity protected" message. Probably, a better thing to do is use --force-mdc when encrypting, so GPG won't complain when decrypting.

In any case, GPG will create and populate a ~/.gnupg/ directory if it's not present (I haven't found a way to avoid it - corrections welcome).

Similar to openssl, GPG reads from standard input if no filename is specified at the end of the command line. However, writing to standard output isn't obvious.

When encrypting, if no --output option is given, GPG will create a file with the same name as the input file, with an added .gpg extension (eg file.txt becomes file.txt.gpg), unless input comes from stdin, in which case output goes to stdout. If the input comes from a regular file and writing to standard output is desired, --output - can be used. --output can of course also be used if we want an output file name other than the default with .gpg appended.
On the other hand, when decrypting using --decrypt, output goes to stdout unless --output is used to override it. If --decrypt is not specified, GPG still decrypts, but the default operation is to decrypt to a file named like the one on the command line but with the .gpg suffix removed (eg file.txt.gpg becomes file.txt); if the specified file does not end in .gpg, then --output must be specified (--output - writes to stdout), otherwise GPG exits with an "unknown suffix" error.

GPG 2.0.x
# Encrypt file
$ gpg --symmetric --batch --yes --passphrase-file key.txt --cipher-algo AES256 --output encrypted.enc plain.txt
# Decrypt file
$ gpg --decrypt --batch --yes --passphrase-file key.txt --output plain.txt encrypted.enc

# Encrypt stdin to file
$ tar -czvf - file1 file2 ... | gpg --symmetric --batch --yes --passphrase-file key.txt --cipher-algo AES256 --output archive.tar.gz.enc
# Decrypt file to stdout
$ gpg --decrypt --batch --yes --passphrase-file key.txt archive.tar.gz.enc | tar -xzvf -

In this case, the --batch option is mandatory (and thus probably --yes too) if we don't want GPG to prompt for the passphrase and instead want it to use the one supplied on the command line with one of the --passphrase* options. The --no-use-agent option is ignored in gpg 2.0.x: using the agent is mandatory, so it should always be running (even though it's not actually used when doing symmetric encryption).

aespipe

As the name suggests, aespipe only does AES in its three variants (128, 192, 256). Aespipe tries hard to prevent the user from specifying the passphrase on the command line (and rightly so), so the passphrase(s) must normally be in a file (plaintext or encrypted with GPG). It is of course possible to come up with kludges to work around these restrictions, but they are there for a reason.

Aespipe can operate in single-key mode, where only one key/passphrase is necessary, and in multi-key mode, for which at least 64 keys/passphrases are needed. With 64 keys it operates in multi-key-v2 mode, with 65 keys it switches to multi-key-v3 mode, which is the safest and recommended mode, and the one that will be used for the examples.

So we need a file with 65 lines of random garbage; one way to generate it is as follows:

$ tr -dc '[:print:]' < /dev/random | fold -b | head -n 65 > keys.txt

If the above command blocks, it means that the entropy pool of the system isn't providing enough data. Either generate some entropy by doing some work or using an entropy-gathering daemon, or use /dev/urandom instead (at the price of losing some randomness).
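For instance, a non-blocking variant using /dev/urandom might look like this; the explicit -w 64 line width is an arbitrary choice for the sketch, not something aespipe requires:

```shell
# Generate 65 lines of 64 printable random characters each from
# /dev/urandom, which never blocks (at the cost of theoretically
# weaker randomness than /dev/random).
tr -dc '[:print:]' < /dev/urandom | fold -b -w 64 | head -n 65 > keys.txt
wc -l keys.txt   # → 65 keys.txt
```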

Aespipe can also use a pgp-encrypted key file; more on this later. For now let's use the cleartext one.

# Encrypt a file using aes256
$ aespipe -e AES256 -P keys.txt < plain.txt > encrypted.enc
# Decrypt
$ aespipe -d -P keys.txt < encrypted.enc > plain.txt

As the examples show, aespipe works as a pipe: reading from standard input and writing to standard output is its default (and only) mode of operation, so no separate stdin/stdout examples are needed.

Useful options:

  • -C count run count rounds of hashing when generating the encryption key from the passphrase. This stretching helps to slow down brute force attacks. Recommended if using single-key mode, not needed in multi-key mode(s)
  • -e ENCALG (when encrypting) use ENCALG as cipher algorithm (AES128, AES192, AES256)
  • -h HASHALG use HASHALG to generate the actual key from the passphrase (default depends on encryption algorithm, see the man page)

One very important thing to note is that aespipe has a minimum block granularity when encrypting and decrypting; in simple terms, this means that the result of the decryption is always a multiple of this minimum (16 bytes in single-key mode, 512 bytes in multi-key modes). NULs are added as padding if needed. Here is a clear demonstration of this fact:

$ echo hello > file.txt.orig
$ ls -l file.txt.orig
-rw-r--r-- 1 waldner users 6 Jul 11 16:52 file.txt.orig
$ aespipe -P keys.txt < file.txt.orig > file.txt.enc
$ aespipe -d -P keys.txt < file.txt.enc > file.txt.dec
$ ls -l file.txt.*
-rw-r--r-- 1 waldner users 512 Jul 11 16:58 file.txt.dec
-rw-r--r-- 1 waldner users 512 Jul 11 16:57 file.txt.enc
-rw-r--r-- 1 waldner users   6 Jul 11 16:52 file.txt.orig
$ od -c file.txt.dec 
0000000   h   e   l   l   o  \n  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0001000

Some file formats can tolerate trailing garbage (eg tar), others can't, so this is something to take into account when using aespipe. In the cases where the original size is known, it may be possible to postprocess the decrypted file to remove the padding, although this may not always be practical:

$ origsize=$(wc -c < file.txt.orig)
$ truncate -s "$origsize" file.txt.dec
# alternatively
$ dd if=file.txt.dec bs="$origsize" count=1 > file.txt.likeorig

In the cases where the exact byte size is needed and no postprocessing is possible or wanted, another tool should be used (eg gpg or openssl).
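The pad-then-truncate round trip can also be reproduced with plain coreutils, no aespipe required; truncate zero-fills when extending a file, just as aespipe pads with NULs (the filenames mirror the earlier example):

```shell
# Simulate aespipe's multi-key padding and the truncate-based recovery.
printf 'hello\n' > file.txt.orig
origsize=$(wc -c < file.txt.orig)        # 6 bytes
cp file.txt.orig file.txt.dec
truncate -s 512 file.txt.dec             # extend with NULs, like aespipe's padding
truncate -s "$origsize" file.txt.dec     # the postprocessing step shown above
cmp file.txt.orig file.txt.dec && echo files match
```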

Ok, so let's now see how to use an encrypted keyfile with aespipe. The keyfile should be encrypted with GPG, which in turn can do symmetric encryption (as seen earlier in this article) or public-key encryption (using a public/private key pair, which should already be generated and available - not covered here).
Let's encrypt our keys.txt file with both symmetric and public-key encryption (separately).

# using symmetric encryption
$ gpg --symmetric --output keys.enc.sym keys.txt
# enter passphrase, or use some --passphrase* option to specify one

# using public key encryption
$ gpg --encrypt --recipient 199705C4 --output keys.enc.pubk keys.txt
# no passphrase is required, as only the public key is used to encrypt
# here "199705C4" is the id of the (public) key

Now, we want to encrypt or decrypt some file using the keys contained in our password-protected keyfile(s). This is done by passing the -K option (instead of -P) to aespipe. Let's start with the symmetrically encrypted keyfile (keys.enc.sym):

# encrypt
$ aespipe -e aes256 -K keys.enc.sym < plain.txt > encrypted.enc
# aespipe prompts for the gpg passphrase to decrypt the keyfile

# decrypt
$ aespipe -d -e aes256 -K keys.enc.sym < encrypted.enc > plain.txt
# same thing, passphrase for keyfile is prompted

Now with the public-key encrypted keyfile:

# encrypt
$ aespipe -e aes256 -K keys.enc.pubk < plain.txt > encrypted.enc
# to decrypt keys.enc.pubk, the private key is needed, 
# aespipe prompts for the passphrase to unlock the private key

# decrypt
$ aespipe -d -e aes256 -K keys.enc.pubk < encrypted.enc > plain.txt
# same thing, passphrase to unlock the private key is prompted

So far, nothing special. However, for this last case (keyfile encrypted with public key cryptography), aespipe can actually use gpg-agent (if it's running) to obtain the passphrase needed to unlock the private key. This is done with the -A option, which tells aespipe the path to the socket where gpg-agent is listening. Assuming gpg-agent has already seen the passphrase to unlock the private key, it can transmit it to aespipe.

# The gpg-agent socket information is in the GPG_AGENT_INFO environment variable
# in the session where the agent is running, or one to which the variable has been exported. For example:
$ echo "$GPG_AGENT_INFO"
/tmp/gpg-gXM3Pm/S.gpg-agent:4897:1
# encrypt using a public-key encrypted keyfile, but tell aespipe to ask gpg-agent for the passphrase
$ aespipe -e aes256 -A "$GPG_AGENT_INFO" -K keys.enc.pubk < plain.txt > encrypted.enc
# similar for decryption

Other utilities

Let's have a look at some other utilities that are simpler but lack the flexibility provided by the previous ones.

mcrypt

This seems to be almost unusable, as doing practically anything beyond simple, optionless encryption produces a message like

Signal 11 caught. Exiting.

so it doesn't seem to be a good candidate for serious use. Some research shows many users in the same situation. More information is welcome.

aescrypt

Aescrypt is a little-known program, but it is open source and very simple to use. It is multiplatform and even has a GUI for graphical operation. Here, however, we'll use the command-line version.

# encrypt a file
$ aescrypt -e -p passphrase file.txt
# creates file.txt.aes

# decrypt a file
$ aescrypt -d -p passphrase file.txt.aes
# creates file.txt

# encrypt standard input
$ tar -czvf - file1 file2 ... | aescrypt -e -p passphrase - -o archive.tar.gz.aes

# decrypt to stdout
$ aescrypt -d -p passphrase -o - archive.tar.gz.aes | tar -xzvf -

If no -p option is specified, aescrypt interactively prompts for the passphrase.
If no -o option is specified, a file with the same name and the .aes suffix is created when encrypting, and one with the .aes suffix removed when decrypting.

Since putting passwords directly on the command line is bad, it is possible to put the passphrase in a file and tell aescrypt to read it from the file. However, the file is not a simple text file; it has to be in a format that aescrypt recognizes. To create it, the documentation suggests using the aescrypt_keygen utility as follows:

$ aescrypt_keygen -p somesupercomplexpassphrase keyfile.key

The aescrypt_keygen program is only available in the source code package and not in the binary one (at least in the Linux version). However, since this file, according to the documentation, is nothing more than the UTF-16 encoding of the passphrase string, it's easy to produce the same result without the dedicated utility:

# generate keyfile (printf rather than echo, to avoid embedding a trailing newline in the passphrase)
$ printf '%s' somesupercomplexpassphrase | iconv -f ascii -t utf-16 > keyfile.key

Once we have a keyfile, we can encrypt/decrypt using it:

$ aescrypt -e -k keyfile.key file.txt
# etc.
ccrypt

The ccrypt utility is another easy-to-use encryption program that implements the AES(256) algorithm. Be sure to read the man page and the FAQ.

Warning: when not reading from standard input, ccrypt overwrites the source file with the result of the encryption or decryption. This means that, if the encryption process is interrupted, a file could be left in a partially encrypted state. On the other hand, when encrypting standard input this (obviously) does not happen. Sample usage:

# encrypt a file; overwrites the unencrypted version, creates file.txt.cpt
$ ccrypt -e file.txt

# decrypt a file; overwrites the encrypted version, creates file.txt
$ ccrypt -d file.txt.cpt

In this mode, multiple file arguments can be specified, and they will all be encrypted/decrypted. It is possible to recursively encrypt files contained in subdirectories if the -r/--recursive option is specified.

If no files are specified, ccrypt operates like a pipe:

# Encrypt standard input (example)
$ tar -czvf - file1 file2 ... | ccrypt -e > archive.tar.gz.enc
# Decrypt to stdout (example)
$ ccrypt -d < archive.tar.gz.enc | tar -xzvf -

To use the command non-interactively, it is possible to specify the passphrase in different ways:

  • -K|--key passphrase: directly in the command (not recommended)
  • -E|--envvar var: the passphrase is the content of environment variable $var

A useful option is -x|--keychange, which allows changing the passphrase of an already encrypted file: the old and new passphrases are prompted for - or specified on the command line with -K/-H (--key/--key2) or -E/-F (--envvar/--envvar2) respectively - and the file is decrypted with the old passphrase and reencrypted with the new one.

7-zip

The compression/archiving utility 7-zip can apparently do AES256 encryption, deriving the encryption key from the passphrase specified by the user with the -p option:

# encrypt/archive, prompt for passphrase
$ 7z a -p archive.enc.7z file1 file2 ...

# encrypt/archive, passphrase on the command line
$ 7z a -ppassphrase archive.enc.7z file1 file2 ...

# encrypt/archive standard input (prompt for passphrase)
$ tar -cvf - file1 file2 ... | 7z a -si -p archive.enc.tar.7z

# decrypt/extract, prompt for passphrase
$ 7z x -p archive.enc.7z [ file1 file2 ... ]

# decrypt/extract, passphrase on the command line
$ 7z x -ppassphrase archive.enc.7z [ file1 file2 ... ]

# decrypt/extract to stdout (prompt for passphrase)
$ 7z x -so -p archive.enc.tar.7z | tar -xvf -

It looks like there's no way to run in batch (ie, non-interactive) mode without explicitly specifying the passphrase on the command line.