Poor man’s directory tree replication

So you have this /var/lib/mysql directory that you need to copy to three other machines. A quick and dirty solution is to use ssh and tee (it goes without saying that passwordless ssh is needed, here and for all the other examples):

$ tar -C /var/lib/mysql -cvzf - . |\
  tee >(ssh dstbox1 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox2 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox3 'tar -C /var/lib/mysql/ -xzvf -') > /dev/null

If the directory tree to be transferred is not local, it is again possible to use ssh to get to it:

$ ssh srcbox 'tar -C /var/lib/mysql -cvzf - .' |\
  tee >(ssh dstbox1 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox2 'tar -C /var/lib/mysql/ -xzvf -') \
      >(ssh dstbox3 'tar -C /var/lib/mysql/ -xzvf -') > /dev/null

This means that all the data flows from the source, through the machine where the pipeline runs, to the targets. On the other hand this solution has the advantage that there is no need to set up passwordless ssh between the origin and the target(s); the only machine that needs passwordless ssh to all the others is the machine where the command runs.

Now this is all basic stuff, but after doing this I wondered whether it would be possible to generalize the logic for a variable number of target machines, so for example a nettar-style operation could be possible, as in

$ nettar2.sh /var/lib/mysql dstbox1:/var/lib/mysql dstbox2:/var/tmp dstbox3:/var/lib/mysql ...

This would mean: take (local) /var/lib/mysql and replicate it to dstbox1 under /var/lib/mysql, to dstbox2 under /var/tmp, to dstbox3 under /var/lib/mysql, and so on for any extra argument supplied. Arguments could have the form targetname:[targetpath], with a missing targetpath indicating the same path as the source (ie, /var/lib/mysql in this example).
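As a rough sketch of how one targetname:[targetpath] argument could be split, here is a small helper. The function name parse_target and its error message are made up for illustration and are not part of the scripts below; it also rejects arguments without a colon, which the scripts in this article do not do.

```shell
# Hypothetical helper (not part of the scripts in this article):
# split "targetname:[targetpath]" and fall back to the source path.
parse_target() {
  local arg=$1 srcpath=$2 dstbox dstpath
  case $arg in
    *:*) ;;                                   # a colon is required
    *) echo "bad target: $arg" >&2; return 1 ;;
  esac
  dstbox=${arg%%:*}                           # everything before the first colon
  dstpath=${arg#*:}                           # everything after the first colon
  [ -n "$dstpath" ] || dstpath=$srcpath       # empty path: reuse the source path
  printf '%s\n%s\n' "$dstbox" "$dstpath"      # one value per line
}
```

For example, parse_target dstbox2: /var/lib/mysql prints dstbox2 and /var/lib/mysql on two lines, since the target path is empty.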

It turns out that such a generalization is not easy.

Note that in the following code, all error checking and other refinements are omitted for simplicity. In particular, care should be taken at least to:

  • validate the arguments passed to the script for number (at least two) and correct syntax
  • check that paths exist (or create them if not, etc)
  • properly escape arguments to commands that are executed using ssh (for example using printf %q)
  • validate data that is used to dynamically build commands to be run with eval

None of the above is done in the code that follows.
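To give an idea of what the escaping point could look like, here is a minimal sketch using bash's printf %q. The helper name remote_untar_cmd is hypothetical, and the sketch assumes the remote login shell understands bash-style quoting (for unusual characters %q may emit $'...' forms that a plain POSIX sh does not accept).

```shell
# Hypothetical helper: build the remote tar command with the path
# quoted by bash's printf %q instead of wrapping it in '...'.
# Assumption: the remote shell is bash-compatible.
remote_untar_cmd() {
  local dstpath=$1
  printf 'tar -C %q -xzf -' "$dstpath"
}

# possible use in the scripts below:
#   ssh "$dstbox" "$(remote_untar_cmd "$dstpath")"
```

With this, a path such as "/var/tmp/my dir" becomes /var/tmp/my\ dir in the remote command, whereas plain single quotes would break as soon as the path itself contains a single quote.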

Concurrent transfers

An obvious way to do it is to run three (or however many) concurrent transfers, eg

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# parallel transfers
 
srcpath=$1
shift
 
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  tar -C "$srcpath" -cvzf - . | ssh "$dstbox" "tar -C '$dstpath' -xvzf -" &
done
 
wait

This obviously simply reads $srcpath multiple times and transfers it to each target machine. We are not exploiting the data duplication done by tee. If the source directory is huge, this will not be efficient as multiple processes at once will try to read it; although the OS will probably cache most of it, it doesn't look like a satisfactory solution.

So what if we actually want to use tee (which in turn implies that we need process substitution or an equivalent facility)?

Using eval

The first thing that comes to mind is to use the questionable eval command:

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + eval
 
do_sshtar(){
  local dstbox=$1 dstpath=$2
  ssh "$dstbox" "tar -C '$dstpath' -xvzf -"
}
 
declare -a args
 
srcpath=$1
shift
 
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  args+=( ">(do_sshtar '$dstbox' '$dstpath')" )
done
 
tar -C "$srcpath" -cvzf - . | eval tee "${args[@]}" ">/dev/null"

This effectively builds the full list of process substitutions at runtime and executes them. However, when using eval we should be well aware of what we're doing. See the following pages for a good discussion of the implications of using eval: http://mywiki.wooledge.org/BashFAQ/048 and http://wiki.bash-hackers.org/commands/builtin/eval.

Note that with process substitution there is also the (in this case minor) issue that the created processes are run asynchronously in background, and we have no way to wait for their full termination (not even using wait), so the script might give us back the prompt slightly before all the background processes have fully completed their job.

Coprocesses

Bash and other shells have coprocesses (see also here), so it would seem that they could be useful for our purposes.
However, at least in bash, it seems that it's not possible to create a coprocess whose name is stored in a variable (which is how we would create a bunch of coprocesses programmatically), eg:

$ coproc foo { command; }      # works
$ cname=foo; coproc $cname { command; }  # does not work as expected (creates a coproc literally named $cname)

So to use coprocesses for our task, we would need again to resort to eval.

Named pipes

Let's see if there is some other possibility. Indeed there is, and it involves using named pipes (aka FIFOs):

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + FIFOs (ssh version)
 
declare -a fifos
 
srcpath=$1
shift
 
count=1
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  curfifo=/tmp/FIFO${count}
  mkfifo "$curfifo"
  fifos+=( "$curfifo" )
  ssh "$dstbox" "tar -C '$dstpath' -xvzf -" < "$curfifo" &
  ((count++))
done
 
tar -C "$srcpath" -cvzf - . | tee -- "${fifos[@]}" >/dev/null
 
wait
# cleanup the FIFOs
rm -- "${fifos[@]}"

Here we're creating N named pipes, whose names are saved in an array, and for each pipe an instance of ssh + tar to the target machine is launched in background, reading from that pipe. Finally, tee is run against all the existing named pipes to send them the data; all the FIFOs are removed at the end.
This is not too bad, but we should manually set up interprocess communication (ie, create/delete the FIFOs); the beauty of process substitution is that bash sets up those channels for us, and here we're not taking advantage of that.
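To try the FIFO plumbing without any remote machine, the following sketch replaces the ssh + tar consumers with local cat processes and keeps the FIFOs in a private directory created with mktemp -d rather than under fixed /tmp names. The function name fanout_demo and the copyN file names are made up for the demo.

```shell
# Local demo of the tee + FIFO fan-out. Assumption: 'cat > file'
# readers stand in for the ssh + tar consumers of the real script.
fanout_demo() {
  local n=$1 outdir=$2 tmpdir fifo i
  local -a fifos=()
  tmpdir=$(mktemp -d) || return 1           # private directory for the FIFOs
  for ((i = 1; i <= n; i++)); do
    fifo=$tmpdir/fifo$i
    mkfifo "$fifo" || return 1
    fifos+=( "$fifo" )
    cat < "$fifo" > "$outdir/copy$i" &      # background reader, one per FIFO
  done
  tee -- "${fifos[@]}" > /dev/null          # duplicate stdin into every FIFO
  wait                                      # let all readers finish
  rm -rf -- "$tmpdir"                       # remove the FIFOs and their directory
}
```

Running printf 'hello\n' | fanout_demo 3 /some/existing/dir should leave three identical files copy1, copy2 and copy3 in that directory.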

A point to note is that here we used ssh for the data transfer; it's always possible to change the code to use netcat, as explained in the nettar article. Here's an adaptation of the last example to use the nettar method (the other cases are similar):

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using tee + FIFOs (netcat version)
 
declare -a fifos
 
srcpath=$1
shift
 
count=1
for arg in "$@"; do
  dstbox=${arg%:*}
  dstpath=${arg#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
 
  if ssh "$dstbox" "cd '$dstpath' || exit 1; { nc -l -p 1234 | tar -xvzf - ; } </dev/null >/dev/null 2>&1 &"; then
    curfifo=/tmp/FIFO${count}
    mkfifo "$curfifo"
    fifos+=( "$curfifo" )
    nc "$dstbox" 1234 < "$curfifo" &
    ((count++))
  else
    echo "Warning, skipping $dstbox" >&2   # or whatever
  fi
done
 
tar -C "$srcpath" -cvzf - . | tee -- "${fifos[@]}" >/dev/null
 
wait
# cleanup the FIFOs
rm -- "${fifos[@]}"

There should be some other way. I'll update the list if I discover some other method. As always, suggestions welcome.

Recursion

Update 19/05/2014: Marlon Berlin suggested (thanks) that recursion could be used to build an implicit chain of >(...) process substitutions, and indeed that's true. So here it is:

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using recursion (ssh version)
 
do_sshtar(){
 
  local dstbox=${1%:*} dstpath=${1#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  shift
 
  if [ $# -eq 0 ]; then
    # end recursion
    ssh "$dstbox" "tar -C '$dstpath' -xzvf -"
  else
    # send data to "current" $dstbox and recurse
    tee >(ssh "$dstbox" "tar -C '$dstpath' -xzvf -") >(do_sshtar "$@") >/dev/null
  fi
}
 
srcpath=$1
shift
 
tar -C "$srcpath" -czvf - . | do_sshtar "$@"

When the do_sshtar function receives only one argument, it just transfers the data directly via ssh to terminate the recursion. Otherwise, it uses tee to transfer the data and continue the recursion. Simple and elegant. Here's the netcat version:

#!/bin/bash
 
# syntax: $0 /src/dir dstbox1:[/dst/dir] [ dstbox2:[/dst/dir] dstbox3:[/dst/dir] ... ]
# using recursion (netcat version)
 
do_nctar(){
 
  local dstbox=${1%:*} dstpath=${1#*:}
  [ -n "$dstpath" ] || dstpath=$srcpath
  shift
 
  # set up listening nc on $dstbox
  if ssh -n "$dstbox" "cd '$dstpath' || exit 1; { nc -l -p 1234 | tar -xvzf - ; } </dev/null >/dev/null 2>&1 &"; then
    if [ $# -eq 0 ]; then
      # end recursion
      nc "$dstbox" 1234
    else
      # send data to "current" $dstbox and recurse
      tee >(nc "$dstbox" 1234) >(do_nctar "$@") >/dev/null
    fi
  else
    echo "Warning, skipping $dstbox" >&2
    # one way or another, we must consume the input
    if [ $# -eq 0 ]; then
      cat > /dev/null
    else
      do_nctar "$@"
    fi
  fi
}
 
srcpath=$1
shift
 
tar -C "$srcpath" -czvf - . | do_nctar "$@"

The -n switch to ssh is important, otherwise it will try to read from stdin, consuming our tar data.

Many ways to encrypt passwords

Specifically, using crypt(3). One typical use case is, you have a plaintext password and need the damn full thing to put into the /etc/shadow file, which nowadays is usually something like:

$5$sOmEsAlT$pKHkGjoFXUgvUv.UYQuekdpjoZx7mqXlIlKJj6abik7   # sha-256

or

$6$sOmEsAlT$F3DN61SEKPHtTeIzgzyLe.rpctiym/qxz5xQz9YM.PyTdH7R13ZDXj6sDMeZg5wklbYJYSqDBXcH4UnAWQrRN0   # sha-512

The input to the crypt(3) library function is a cleartext password and a salt. Here we assume the salt is provided, but it's easy to generate a random one (at least one that's "good enough").
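For the record, a "good enough" random salt could be produced along these lines. This is only a sketch: the function name gen_salt is made up, and it assumes /dev/urandom is available. The sha-256/sha-512 schemes accept up to 16 salt characters from the set [A-Za-z0-9./].

```shell
# Hypothetical helper: print a random salt of the requested length
# (default 16) using only characters valid in a crypt(3) salt.
gen_salt() {
  local len=${1:-16}
  # 256 random bytes leave about 64 usable characters after filtering,
  # which is plenty for a salt of up to 16 characters
  head -c 256 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9./' | head -c "$len"
}
```

It can then be used as salt=$(gen_salt) in place of the fixed value used in the examples.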

In the case of sha-256 and sha-512 hashes (identified respectively by the $5$ and $6$ in the first field, which are also the only ones supported by Linux along with the old md5 which uses code $1$) the salt can be augmented by prepending the rounds=<N>$ directive, to change the default number of rounds used by the algorithm, which is 5000. So for example we could supply a salt like

rounds=500000$sOmEsAlT

and thus use 500000 rounds (this is called stretching and is used to make brute force attacks harder). If the rounds= argument is specified, the output of crypt() includes it as well, since its value must be known every time the hash is recalculated.
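To make the layout of the resulting string a bit more concrete, here is a small bash sketch that splits a hash into its fields. The function name split_crypt and its output format are made up; it only handles the $alg$[rounds=N$]salt$hash layout described here.

```shell
# Hypothetical helper: show algorithm, rounds and salt of a crypt(3)
# string of the form $alg$[rounds=N$]salt$hash.
split_crypt() {
  local hash=$1 empty alg f3 f4 rest
  # the string starts with '$', so the first field is empty
  IFS='$' read -r empty alg f3 f4 rest <<< "$hash"
  case $f3 in
    rounds=*) printf 'alg=%s rounds=%s salt=%s\n' "$alg" "${f3#rounds=}" "$f4" ;;
    *)        printf 'alg=%s rounds=default salt=%s\n' "$alg" "$f3" ;;
  esac
}
```

When the rounds= field is absent, the third field is the salt itself and the default number of rounds applies.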

It seems there's no utility to directly get the hash string (there used to be a crypt(1) command which however had troubles related to the export of cryptographic software, which made it so weak that many distros stopped shipping it). So we'll have to find some command that calls the crypt(3) function.

In the following examples, we assume the algorithm number, the salt and the password are stored in the shell variables $alg, $salt and $password respectively:

alg=6
salt='rounds=500000$sOmEsAlT'
password='password'

This way, the code doesn't hardcode anything and can be reused.

Perl

$ perl -e 'print crypt($ARGV[1], "\$" . $ARGV[0] . "\$" . $ARGV[2]), "\n";' "$alg" "$password" "$salt"
$6$rounds=500000$sOmEsAlT$Rf3.xi9RRiCW/FTh4gp67TSLyKotq1QkGkbn0O6cYDYEExwrFE30zeKGDIaZ3TZ.RDwiNya5nKlPDRTA0U4E8/

Python

# python 2/3
$ python -c 'import crypt; import sys; print (crypt.crypt(sys.argv[2],"$" + sys.argv[1] + "$" + sys.argv[3]))' "$alg" "$password" "$salt"
$6$rounds=500000$sOmEsAlT$Rf3.xi9RRiCW/FTh4gp67TSLyKotq1QkGkbn0O6cYDYEExwrFE30zeKGDIaZ3TZ.RDwiNya5nKlPDRTA0U4E8/

MySQL

Yes, MySQL has a built-in function that uses crypt(3):

$ mysql -B -N -e "select encrypt('$password', '\$$alg\$$salt');" 
$6$rounds=500000$sOmEsAlT$Rf3.xi9RRiCW/FTh4gp67TSLyKotq1QkGkbn0O6cYDYEExwrFE30zeKGDIaZ3TZ.RDwiNya5nKlPDRTA0U4E8/

Obviously, extra care should be taken with this one if $password or $salt contain quotes or other characters that are special to MySQL.

Php

$ php -r 'echo crypt($argv[2], "\$" . $argv[1] . "\$" . $argv[3]) . "\n";' "$alg" "$password" "$salt"
$6$rounds=500000$sOmEsAlT$Rf3.xi9RRiCW/FTh4gp67TSLyKotq1QkGkbn0O6cYDYEExwrFE30zeKGDIaZ3TZ.RDwiNya5nKlPDRTA0U4E8/

Ruby

$ ruby -e 'puts ARGV[1].crypt("$" + ARGV[0] + "$" + ARGV[2]);' "$alg" "$password" "$salt"
$6$rounds=500000$sOmEsAlT$Rf3.xi9RRiCW/FTh4gp67TSLyKotq1QkGkbn0O6cYDYEExwrFE30zeKGDIaZ3TZ.RDwiNya5nKlPDRTA0U4E8/

mkpasswd

This utility comes with the whois package (at least in Debian). Here it's better to introduce another separate variable to hold the number of rounds:

# password as before
rounds=500000
salt=sOmEsAlT

(and of course the other examples can be adapted to use the three variables instead of two). Then it can be used as follows:

$ mkpasswd -m sha-512 -R "$rounds" -S "$salt" "$password"
$6$rounds=500000$sOmEsAlT$Rf3.xi9RRiCW/FTh4gp67TSLyKotq1QkGkbn0O6cYDYEExwrFE30zeKGDIaZ3TZ.RDwiNya5nKlPDRTA0U4E8/

If using the standard number of rounds the -R option can be omitted, of course. Here the algorithm is specified by name, so the $alg variable is not used.

Some notes on macvlan/macvtap

There's not a lot of documentation about these interfaces. Here are some notes to summarize what I've been able to gather so far. Surely there's more to it (corrections and/or more information welcome).

macvlan

Macvlan interfaces can be seen as subinterfaces of a main ethernet interface. Each macvlan interface has its own MAC address (different from that of the main interface) and can be assigned IP addresses just like a normal interface.

So with this it's possible to have multiple IP addresses, each with its own MAC address, on the same physical interface. Applications can then bind specifically to the IP address assigned to a macvlan interface, for example. The physical interface to which the macvlan is attached is often referred to as "the lower device" or "the upper device"; here we'll use the term "lower device".

The main use of macvlan seems to be container virtualization (for example LXC guests can be configured to use a macvlan for their networking and the macvlan interface is moved to the container's namespace), but there are other scenarios, mostly very specific cases, like using virtual MAC addresses (see for example this keepalived feature).

A macvlan interface can work in one of four modes, defined at creation time.

  • VEPA (Virtual Ethernet Port Aggregator) is the default mode. If the lower device receives data from a macvlan in VEPA mode, this data is always sent "out" to the upstream switch or bridge, even if it's destined for another macvlan in the same lower device. Since macvlans are almost always assigned to virtual machines or containers, this makes it possible to see and manage inter-VM traffic on a real external switch (whereas with normal bridging it would not leave the hypervisor), with all the features provided by a "real" switch. However, at the same time this implies that, for VMs to be able to communicate, the external switch should send back inter-VM traffic to the hypervisor out of the same interface it was received from, something that is normally prevented from happening by STP. This feature (the so-called "hairpin mode" or "reflective relay") isn't widely supported yet, which means that if using VEPA mode with an ordinary switch, inter-VM traffic leaves the hypervisor but never comes back (unless it's sent back at the IP level by a router somewhere, but then there's nothing special about that, it has always worked that way).
    Since there are few switches supporting hairpin mode, VEPA mode isn't used all that much yet. However it's worth mentioning that Linux's own internal bridge implementation does support hairpin mode in recent versions; assuming eth0 is a port of br0, hairpin mode can be enabled by doing

    # echo 1 > /sys/class/net/br0/brif/eth0/hairpin_mode

    or using a recent version of brctl:

    # brctl hairpin br0 eth0 on

    or even better, using the bridge program that comes with recent versions of iproute2:

    # bridge link set dev eth0 hairpin on

    So a Linux box could very well be used in the role of "external switch" as mentioned above.

  • Bridge mode: this works almost like a traditional bridge, in that data received on a macvlan in bridge mode and destined for another macvlan of the same lower device is sent directly to the target (if the target macvlan is also in bridge mode), rather than being sent outside. This of course works well with non-hairpin switches, and inter-VM traffic has better performance than VEPA mode, since the external round-trip is avoided. In the words of a kernel developer,

    The macvlan is a trivial bridge that doesn't need to do learning as it
    knows every mac address it can receive, so it doesn't need to implement
    learning or stp. Which makes it simple stupid and and fast.

  • Private mode: this is essentially like VEPA mode, but with the added feature that no macvlans on the same lower device can communicate, regardless of where the packets come from (so even if inter-VM traffic is sent back by a hairpin switch or an IP router, the target macvlan is prevented from receiving it). I haven't tried, but I suppose that it is the operating mode of the target macvlan that determines whether it receives the traffic or not. This mode is useful, of course, if we really want macvlan isolation.
  • Passthru mode: this mode was added later, to work around some limitation of macvlans (more details here). I'm not 100% clear on what's the problem passthru mode tries to solve, as I was able to set promiscuous mode, create bridges, vlans and sub-macv{lan,tap} interfaces in KVM guests using a plain macvtap in VEPA mode for their networking (so no need for passthru). Since I'm surely missing something, more information (as usual) is welcome.

VEPA, bridged and private mode come from a standard called EVB (edge virtual bridging); a good article which provides more information can be found here.

Curiously (at least, in the case of the three original operating modes), the operating mode is per-macvlan interface rather than global (per-physical device); I guess that it's then more or less mandatory to configure all the macvlans of the same lower device to operate in the same mode, or at least match the macvlan modes so that only intended inter-VM traffic is possible; not sure what would happen, for instance, if a macvlan using VEPA mode tries to communicate with another one using bridge mode, or vice versa. This may well be worth investigating.

Irrespective of the mode used for the macvlan, there's no connectivity from whatever uses the macvlan (eg a container) to the lower device. This is by design, and is due to the way macvlan interfaces "hook into" their physical interface. If communication with the host is needed, the solution is kind of easy: just create another macvlan on the host on the same lower device, and use this to communicate with the guest.
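A sketch of what that host-side macvlan could look like follows. The names eth2 and macvlan-host and the address 192.0.2.1/24 are placeholders, and bridge mode is an assumption (it presumes the guests' macvlans are in bridge mode too). Since the real commands need root, the sketch only prints them unless DRY_RUN=0 is set; the run wrapper is made up for this purpose.

```shell
# Print commands by default; execute them only when DRY_RUN=0 (needs root).
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: create a host-side macvlan on the given lower
# device, assign it an address and bring it up.
host_macvlan() {
  local lower=$1 name=$2 addr=$3
  run ip link add link "$lower" "$name" type macvlan mode bridge
  run ip addr add "$addr" dev "$name"
  run ip link set "$name" up
}

host_macvlan eth2 macvlan-host 192.0.2.1/24
```

The address should of course belong to the same subnet the guests use, so that they can reach the host through this interface.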

The documentation of iproute2 about setting operating mode for macvlans isn't complete, since neither "ip link help" nor the man pages mention how to do that. Fumbling around a bit, it can be seen that the syntax is

# ip link add link eth2 macvlan2 type macvlan mode aaa    # hit enter here to force an error message
Error: argument of "mode" must be "private", "vepa", "bridge" or "passthru"

Even more undocumented (if possible) is the way to show the operating mode of a macvlan, which turns out to be

# ip -d link show macvlan2
27: macvlan2@eth2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT 
    link/ether 26:8a:3c:07:7d:f4 brd ff:ff:ff:ff:ff:ff
    macvlan  mode vepa 

Let's hope that all this appears in the documentation soon.

The MAC address of the macvlan is normally autogenerated; to explicitly specify one, the following syntax can be used (which also specifies custom name and operating mode at the same time):

# ip link add link eth2 FOOMACVLAN address 56:61:4f:7c:77:db type macvlan mode bridge

Final note, it's also possible to create a macvlan interface and bridge it (eg brctl addif br0 macvlan2); though it's a bit weird, it does work fine.

macvtap interfaces

A macvtap is a virtual interface based on macvlan (thus tied to another interface). It is similar to a regular tap interface in that a program can attach to it and read/write frames; however, the similarities end there. The most prominent user of macvtap interfaces seems to be libvirt/KVM, which allows guests to be connected to macvtap interfaces. Doing so allows for (almost) bridged-like behavior of guests but without the need to have a real bridge on the host, as a regular ethernet interface can be used as the macvtap's lower device.

Some notes about macvtap (more information is always welcome):

  • Since it's based on macvlan, macvtap shares the same operating modes it can be in (VEPA, bridge, private and passthru)
  • Similarly, a guest using a macvtap interface cannot communicate directly with its lower device in the host. In fact, if you run tcpdump on the macvtap interface on the host, no traffic will be seen. Again this is by design, but can be surprising. This link has some details and suggests workarounds for KVM in case this functionality is needed. A quick workaround is to create a macvlan (not macvtap) interface on the host, which will then be visible from the guests. (On a side note, this is also a way to use routed mode for the macvtap guests: put the host's macvlan and all guests on the same IP subnet, configure the guests to use the host macvlan's IP as their default gateway, and have the host do NAT between the macvlan and the physical interface. But then, in this case, it's probably easier to use a real bridge).
  • Creation of a macvtap interface is not done by opening /dev/net/tun; instead, it looks like the only way to create one is to directly send appropriate messages to the kernel via a netlink socket (at least, that's how iproute2 and libvirt do it; strace and/or the source will show the details, as there seems to be no documentation whatsoever). This makes it a bit more complicated than a normal tun/tap interface.
  • macvtap interfaces are persistent by default. Once the macvtap interface has been created via netlink, an actual character device file appears under /dev (this does not happen with normal tap interfaces). The device file is called /dev/tapNN, where NN is the interface index of the macvtap (can be seen for example with "ip link show"). It's this device file that has to be opened by programs wanting to use the interface (eg libvirtd/qemu to connect a guest).
  • One consequence of there being an actual device file for the macvtap interface is that traffic entering the interface can be seen and "stolen" from the intended recipient by simply reading from the device file; doing "cat /dev/tap22" (for example) while a guest VM is using it dumps the raw ethernet frames and prevents the VM from seeing them. On the other hand, neither seeing outgoing traffic nor injecting frames by writing to the device file from the outside seems to be possible.
  • If a VM is connected to the macvtap, the MAC address of the macvtap interface as seen on the host is the same that is seen by the guest; this is different from regular tap interfaces, where the guest is somehow "behind" the tap interface (the vnetX interfaces on the host have a MAC address which is not the same that the guest uses).
  • All traffic for guests connected to a macvtap does show up if running tcpdump on the lower device, even in bridge mode and for guest-to-guest traffic. However, as said, tcpdump (on the host) on the macvtap device itself shows no traffic.
  • If the lower device is a wireless card, macvtap doesn't work (the guest is isolated, nothing enters, nothing exits). Perhaps it's just that it only works with some wireless cards, and I happened to have one that doesn't work. Again, I could not find more information.
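Regarding the /dev/tapNN naming mentioned in the notes above, the device file name can be derived from the interface index that the kernel exposes in sysfs. A minimal sketch follows; the function name tap_device_for and the SYSFS_NET override (handy for trying it without a real macvtap) are made up.

```shell
# Hypothetical helper: print the /dev/tapNN path for a macvtap
# interface by reading its interface index from sysfs.
tap_device_for() {
  local ifname=$1 sysfs=${SYSFS_NET:-/sys/class/net} idx
  idx=$(cat "$sysfs/$ifname/ifindex") || return 1
  printf '/dev/tap%s\n' "$idx"
}
```

For the macvtap2 interface created in the example that follows, which has index 18, tap_device_for macvtap2 would print /dev/tap18.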

As said, creating a macvtap interface via code is a bit complicated, but luckily iproute2 can do it on the command line. To create a macvtap interface called macvtap2, with eth2 as its lower physical interface:

# ip link add link eth2 macvtap2 address 00:22:33:44:55:66 type macvtap mode bridge
# ip link set macvtap2 up
# ip link show macvtap2
18: macvtap2@eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT qlen 500
    link/ether 00:22:33:44:55:66 brd ff:ff:ff:ff:ff:ff
# ls -l /dev/tap18 
crw------- 1 root root 250, 1 May 26 10:51 /dev/tap18

To delete the interface, the usual command can be used:

# ip link del macvtap2

Two links which provide good information about macvtap:
http://seravo.fi/2012/virtualized-bridged-networking-with-macvtap
http://virt.kernelnewbies.org/MacVTap

Smart ranges in sed

Since there seem to be still quite a few people who want to do this with sed...let's see how to select ranges of lines in the same way as with awk (explained here).

We should also avoid the same issue described there, that is, if other /BEGIN/ lines are found while we are inside a range, those lines should be printed. So with this input:

1 BEGIN
2 foo
3 bar
4 BEGIN
5 baz
6 END

at least lines 2 to 5 should be printed (line 1, or 6, or both may also be printed, depending on whether and which range endpoint we are including/excluding).

We're going to assume a sed with ERE (-E) support (as should be the norm these days anyway).

From BEGIN to END, inclusive

This is obviously the easy one:

# print lines from /BEGIN/ to /END/, inclusive
$ sed '/BEGIN/,/END/!d'
$ sed -n '/BEGIN/,/END/p'

No mysteries here. Let's get to the interesting cases.

From BEGIN to END, excluding END

# print lines from /BEGIN/ to /END/, excluding /END/
$ sed '/BEGIN/!d; :loop; n; /END/d; $!bloop'

We start a loop when we see a /BEGIN/, and keep looping until we see an /END/, at which point we delete the line so it's not printed.

From BEGIN to END, excluding BEGIN

# print lines from /BEGIN/ to /END/, excluding /BEGIN/
$ sed -E '/BEGIN/!d; :loop; N; /END/{ s/^[^\n]*\n//; p; d;}; $!bloop'

Same loop, but the lines are accumulated in the pattern space, and the first of them is removed before printing the whole block (note that the "D" command cannot be used for that purpose here, as it starts a new cycle).

From BEGIN to END, not inclusive

This is of course just a small variation on the preceding one, in that we delete both the first and the last line:

# print lines from /BEGIN/ to /END/, excluding both lines
$ sed -E '/BEGIN/!d; :loop; N; /END/{ s/^[^\n]*\n//; s/\n?[^\n]*$//; /./p; d;}; $!bloop'

Since we're excluding both the start and the end line, what's left after removing them may be empty, so we check that there's at least one character left and we only print the pattern space if that is the case.
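For a quick check, the four variants can be wrapped in functions and run against the sample input from the beginning of this article. The function names are made up for the occasion, and GNU sed is assumed, as in the rest of the article.

```shell
# The sample input used throughout this article
sample() { printf '%s\n' '1 BEGIN' '2 foo' '3 bar' '4 BEGIN' '5 baz' '6 END'; }

# One function per variant; each reads stdin (GNU sed assumed)
range_inclusive()  { sed -n '/BEGIN/,/END/p'; }
range_excl_end()   { sed '/BEGIN/!d; :loop; n; /END/d; $!bloop'; }
range_excl_begin() { sed -E '/BEGIN/!d; :loop; N; /END/{ s/^[^\n]*\n//; p; d;}; $!bloop'; }
range_excl_both()  { sed -E '/BEGIN/!d; :loop; N; /END/{ s/^[^\n]*\n//; s/\n?[^\n]*$//; /./p; d;}; $!bloop'; }

sample | range_excl_both
```

The last command prints lines 2 to 5 of the sample, including the inner "4 BEGIN" line, which is the behavior we wanted.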

For anything more complex, just use awk!

Pulling out strings

This is a generic text-processing need that often occurs in different kinds of scripts. Simply put, you want to get a list of the strings in the file (or files) that match a certain pattern. Let's use this simple file as an example:

12345#foobar3#blah
xxxxxxx#foobar77#yyyyyy
foobar867#zzzzzzz
ooooooo#foobar12#ggggggg#foobar17#kkkkkkkk#foobar99
xxxxxxxxxxxxxxxxx
somefoobar12thatwedontwant

Our pattern is (using ERE syntax) "foobar[0-9]+", that is, "foobar" followed by one or more digits. We will refine it a bit later.

Using common shell tools, we have several possibilities.

GNU grep

Probably the simplest one, if GNU grep is available, is to use its -o option, to return only the part of the input that matches the pattern, so:

$ grep -Eo 'foobar[0-9]+' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

As said, this needs GNU grep due to the -o option.

GNU awk and BusyBox awk

These two awk implementations support, as a non-standard extension, the assignment of a regular expression to RS, and make whatever matched RS available in the special variable RT (mawk seems to support the former feature, but not the latter, which makes it unsuitable to be used in the way we describe here). So here's how to use these awks for the task:

$ gawk -v RS='foobar[0-9]+' 'RT{print RT}' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

Note that using RS/RT this way makes it possible to match patterns that contain newlines, something that's not easily achieved with other tools (except Perl, see below).

These methods are easy and quick; however, if none of the above implementations is available, we need to use something more standard.

Standard awk

With standard awk, a way to extract all occurrences is to use a loop over each line, repeatedly using match():

$ cat matches.awk
{
  line = $0
  while (match(line, /foobar[0-9]+/) > 0) {
    print substr(line, RSTART, RLENGTH)
    line = substr(line, RSTART + RLENGTH)
  }
}
$ awk -f matches.awk test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

Here the original line is saved (in case it's needed for further processing) and a copy is used to find matches. Since match() only finds the first match in the string, when a match is found it's removed so running match() again can find the following occurrence (if any). For this reason, the above code will loop forever if it's given a pattern that can match the empty string, like for example a*. When you do that, you really want a+ instead anyway, so use the latter. The code above is a common awk idiom to find all matches of a pattern.
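If the pattern comes from outside and might match the empty string, the loop can be protected by always advancing at least one character. Here is a sketch of that variation, wrapped in a shell function; the name extract_matches is made up. Note that the pattern is passed with -v, so awk processes backslash escapes in it.

```shell
# Hypothetical wrapper: print every non-empty match of the ERE in $1,
# one per line, reading the given files (or stdin if none).
extract_matches() {
  local re=$1
  shift
  awk -v re="$re" '{
    line = $0
    while (line != "" && match(line, re) > 0) {
      if (RLENGTH > 0) print substr(line, RSTART, RLENGTH)
      # skip at least one character, so an empty match cannot loop forever
      line = substr(line, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
    }
  }' "$@"
}
```

With this, extract_matches 'a*' on a line containing baab terminates and prints aa, while extract_matches 'foobar[0-9]+' test.txt gives the same output as the script above.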

Sed

With sed the task is a bit complicated. Basically, we need to somehow "mark" the parts of the data that match our pattern, so we can later delete everything that's not between markers, leaving thus only what we want.

A safe character to use as marker is the newline character (\n), since sed guarantees that, under normal conditions, no input line as seen in the pattern space will contain that character. For the first of the following solutions to work, a sed implementation that recognizes \n in the RHS and the special bracket expression [^\n] (any character except \n) is needed. And since our pattern is an ERE (though it could be rewritten as BRE), we need a sed that recognizes EREs. GNU sed has all these features, and we're going to assume it in the examples.

That said, let's see a couple of ways to solve the task with sed.

One somewhat laborious solution is as follows:

$ sed -E '
s/foobar[0-9]+/\n&/g
t ok
d
:ok
s/^[^\n]*\n//
s/(foobar[0-9]+)[^\n]*/\1/g' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

Here we prepend a \n to each match, then delete what's before the very first match in the line (zero or more non-\n followed by a \n at the beginning of the string). Finally we delete all the parts between matches, which leaves us with just the matches, nicely separated by \n characters.

Another approach to the problem is implemented with the following code (which also has the benefit of using standard syntax; changing the ERE into BRE (foobar[0-9][0-9]*) and converting all the "\n" in the RHS to literal escaped newlines would allow this solution to be used with a standard sed):

$ sed -E '
/\n/!s/foobar[0-9]+/\n&\n/g
/^foobar[0-9]+\n/P
D' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

Here the approach is to "isolate" each match with a \n before and one after (if the pattern space doesn't already have one). If the line begins with a match, it's printed with "P" (up to the following \n, which is what we want). Regardless, the part up to and including the first \n is deleted (with "D"). If something is left, go to the beginning to do the previous steps again, until the whole pattern space is entirely consumed. If there were no matches in the original line, "D" will just delete it entirely and start a new cycle. Rinse and repeat for every input line.

Perl

With perl we can do it pretty easily thanks to its powerful regular expression matching operators:

$ perl -ne 'print "$_\n" for (/foobar\d+/g);' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99
foobar12

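If the pattern we want has newlines in it, we can tell perl to read more than one line at a time: perl -n000e switches to paragraph mode (records are separated by blank lines), while perl -0777 -ne slurps the whole file as a single record, and we're set.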

Context comes to town

All the solutions seen so far strictly match a pattern, regardless of where it appears. In other words, they ignore the context of the matches. However, there may be cases where this is important. In our example input data, we might want to match foobar[0-9]+ only if it's delimited, where "delimited" here is defined as "preceded by either a hash (#) or beginning of line, and followed by either a hash or end of line". Clearly, with these new requirements we don't want the foobar12 in the last line.

We thus need to consider the context in the regular expressions, making them include a larger text, so that matches only happen where there's data that we want; however, since the matched text will now be larger than what we need, we need to subsequently "clean up" the match, extracting only what we want from it. Our regular expression now becomes (in ERE syntax):

(^|#)foobar[0-9]+(#|$)

Let's see how to modify the previous solutions to work with context.

GNU grep

Grep can't really edit text, so it would seem like it's out of the discussion here, but with a silly trick we can still use it:

$ grep -Eo '(^|#)foobar[0-9]+(#|$)' test.txt | grep -Eo 'foobar[0-9]+'
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

The first grep prints all matches with their context, and the second one, operating only on the good data, strictly "extracts" the matches that we need.
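To make the two stages visible, here they are run separately on a tiny input. The input lines and the sample function are made up for illustration (they stand in for test.txt, which may differ).

```shell
# Hypothetical two-line input, standing in for test.txt.
sample() { printf '%s\n' 'foobar3#junk#foobar77' 'somefoobar12thatwedontwant'; }

# Stage 1 alone: the matches still carry their delimiters.
sample | grep -Eo '(^|#)foobar[0-9]+(#|$)'
# foobar3#
# #foobar77

# Stage 1 piped into stage 2: the delimiters are stripped.
sample | grep -Eo '(^|#)foobar[0-9]+(#|$)' | grep -Eo 'foobar[0-9]+'
# foobar3
# foobar77
```

The foobar12 in the second line never makes it past the first grep, so the second one cannot pick it up by mistake.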

GNU awk and BusyBox awk

Setting RS to a non-default value obviously causes awk to stop working in line-oriented mode, so the beginning-of-line and end-of-line anchors in our regular expression need to be augmented to consider the newline character.

Now, with the extended RS, RT will contain the full match with context, so we use gsub() to clean it up:

$ gawk -v RS='(^|#|\n)foobar[0-9]+(#|\n|$)' 'RT{gsub(/^(#|\n)|(#|\n)$/, "", RT); print RT}' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

The critical part here is obviously the gsub(), which should be written carefully to remove the context stuff and only leave what we want.

Standard awk

Here we don't change RS so we're using the traditional line-oriented mode:

$ cat matches2.awk
{
  line = $0
  while (match(line, /(^|#)foobar[0-9]+(#|$)/)>0) {
    m = substr(line, RSTART, RLENGTH)
    gsub(/^#|#$/, "", m); print m
    line = substr(line, RSTART + RLENGTH)
  }
}
$ awk -f matches2.awk test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

Sed

Things start to get complicated with sed if we want context. However we can still do it.

Of the two sed solutions presented previously, the easiest to adapt is the second one, so here it is:

$ sed -E '
/\n/!s/(^|#)foobar[0-9]+(#|$)/\n&\n/g
/^#?foobar[0-9]+#?\n/ {
  s/^#?(foobar[0-9]+)#?/\1/
  P
}
D' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

Again, the critical bit is the part where the context (that we needed to match only the "correct" parts, but no longer want) is removed. This part will be highly dependent on the actual input data and problem requirements.

Perl

Perl is again an easy winner, as we can match with context and pull out only the interesting parts in a single go:

$ perl -ne 'print "$_\n" for (/(?:^|#)(foobar\d+)(?:#|$)/g);' test.txt
foobar3
foobar77
foobar867
foobar12
foobar17
foobar99

The regular expressions for what comes before and after are non-capturing, so the list returned by the overall match is already made of clean strings, which we thus just need to print.

Overlap problems

You might have noticed that at the same time we introduced context to the matches, we also introduced the potential for overlap. Consider the following sample input data:

12345#foobar3#foobar9999#blah
somefoobar12thatwedontwant

If we run, for example, the GNU awk solution above on this data (saved as test2.txt, the name used in the rest of this section), we get:

$ gawk -v RS='(^|#|\n)foobar[0-9]+(#|\n|$)' 'RT{gsub(/^(#|\n)|(#|\n)$/, "", RT); print RT}' test2.txt
foobar3

The foobar9999 is missed since the regular expression that matches foobar3 also "consumes" its surrounding context (the leading and trailing hash) and thus applying the regex with context again on what's left fails to match the second occurrence of the pattern.

However, this does not happen with all the solutions; only with some of them. The standard awk and the sed solutions still work since the previous match is deleted from the line, and the extended pattern we use to include context works if the match is at the beginning of a line without a delimiter, too. In the example, once #foobar3# has been matched and removed, what's left is "^foobar9999#blah$", and the expression we're using can still match it again, since the pattern is at the very beginning and ^ is a possible anchor.
Of course, this happens to work because of the specific combination of input data and regular expressions that we're using; generally speaking, this doesn't have to be the case. It will depend on the actual situation.

The answer modern RE engines give to the overlapping context problem is, naturally, lookaround, which turns characters that would otherwise be consumed into zero-length assertions and leaves them available for the next match attempt. This means that sed and awk are excluded, since their RE engines do not support lookaround.

What's left is GNU grep (with its -P option to match in PCRE mode, where available), and of course perl.

grep:

$ grep -Po '(?<=^|#)foobar[0-9]+(?=#|$)' test2.txt
foobar3
foobar9999

There's also a pcregrep utility that comes with the PCRE library, with a syntax similar to that of grep. In particular, it supports the -o option, so we can also do:

$ pcregrep -o '(?<=^|#)foobar[0-9]+(?=#|$)' test2.txt
foobar3
foobar9999

Let's try perl:

$ perl -ne 'print "$_\n" for (/(?<=^|#)(foobar\d+)(?=#|$)/g);' test2.txt
Variable length lookbehind not implemented in regex m/(?<=^|#)(foobar\d+)(?=#|$)/ at -e line 1.

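Oops... it seems PCRE is more advanced than perl itself in this particular feature, at least with the perl used here (newer releases, starting with 5.30, accept some bounded variable-length lookbehinds as an experimental feature). As man pcrepattern informs us,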

The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several top-level alternatives, they do not all have to have the same fixed length. Thus

(?<=bullock|donkey)

is permitted, but

(?<!dogs?|cats?)

causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl, which requires all branches to match the same length of string. An assertion such as

(?<=ab(c|de))

is not permitted, because its single top-level branch can match two different lengths, but it is acceptable to PCRE if rewritten to use two top-level branches:

(?<=abc|abde)

So what can we do with perl? We have two possibilities.

We note that, strictly speaking, and in this particular case, only what follows the match has to be preserved for the next attempt; the lookbehind is not strictly needed, and we can replace it with a regular match. Thus:

$ perl -ne 'print "$_\n" for (/(?:^|#)(foobar\d+)(?=#|$)/g);' test2.txt
foobar3
foobar9999

Another way to solve the problem is a bit ugly, but it works: we can just move the ^ anchor outside the lookbehind and make it part of a regular alternation; since it's a zero-length match anyway, nothing is harmed:

$ perl -ne 'print "$_\n" for (/(?:^|(?<=#))(foobar\d+)(?=#|$)/g);' test2.txt
foobar3
foobar9999

It is important to understand that there's no generic rule here, and the solution will necessarily have to depend on the problem at hand. Depending on the actual situation, transforming a variable-length lookbehind into something accepted by perl may not always be so easy (or even possible).