The mythical “idempotent” file editing

The story goes more or less like this: "I want to edit a file by adding some lines, but leaving alone any other lines that it might already have. If one of the to-be-added lines is already present, do not re-add it (or replace the existing one). I should be able to repeat this process an arbitrary number of times; after the first run, any subsequent run must leave the file unchanged" (hence "idempotent").

For some reason, a typical target for this kind of thing seems to be the file /etc/hosts, and that's what we'll be using here for the examples. Adapt as needed. Other common targets include /etc/passwd or DNS zone files.

Note that there are almost always ways to avoid doing what we're going to do.
A typical scenario cited by proponents of this approach is automated or scripted install of a machine where a known state for /etc/hosts is desired. But in that case, one can just create the file from scratch with appropriate contents (we are provisioning, right?). Creating the file from scratch certainly leaves it with the desired contents, and is surely idempotent (can be repeated as many times as wanted).
Another scenario is managing or maintaining such a file on an already installed machine. But if you really need to do that, there are tools (Puppet has a host type for /etc/hosts, Augeas can edit most common file types, etc.) that can do it natively and well (well, at least most likely better than a script).
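
For the provisioning scenario, "creating the file from scratch" can be as small as the following sketch. The function name write_hosts and the file contents are made up for illustration, and the target path is passed as a parameter so it can be tried on a scratch file rather than on the real /etc/hosts.

```shell
# Hypothetical provisioning step: (re)create a hosts file with fixed contents.
# Usage: write_hosts target-file
write_hosts() {
  cat > "$1" <<'EOF'
127.0.0.1       localhost.localdomain localhost
::1             localhost.localdomain localhost
192.168.44.12   server1.example.com server1
192.168.44.1    firewall.example.com firewall
EOF
}
```

Running it once or ten times leaves exactly the same file behind, which is all the idempotency one needs in this case.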

So in the end it's almost always a half-baked attempt at doing something that either shouldn't be necessary in the first place, or should be done with the appropriate tools.

Nevertheless, there seem to be a lot of people trying to do this, so for the sake of it, let's see how the task could be approached.

To make it concrete, here's our existing (pre-edit) /etc/hosts:

#
# /etc/hosts: static lookup table for host names
#
127.0.0.1	my.example.com localhost.localdomain	my localhost
::1		localhost.localdomain	localhost

192.168.44.12   server1.example.com server1
192.168.44.1    firewall.example.com firewall

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

2001:db8:a:b::1  server6.example.com server6
# End of file

We want to merge the following lines (we assume they are stored in their own file, newlines.txt):

192.168.44.250      newserver.example.com  newserver
192.168.44.1        firewall.example.com firewall gateway
2001:db8:a:b::100   server7.example.com  server7

When one of the lines we're adding is already present in the target file, there are two possible policies: either leave the line alone (ie, the old line is the good one), or replace it (ie, the new line is the good one). In our example, we would encounter this issue with the 192.168.44.1 entry. Of course, it's not hard to imagine situations in which for just some of the new lines the "new line wins" policy should be used, while still using the "old line wins" policy for the remaining ones. We choose to ignore this problem here and use a global policy, but it's certainly not just a theoretical case.

Another issue has to do with the method used to detect whether a line is already present: do we compare the whole line, just a key field (somehow calculated, for example a column), a set of fields, or yet something else? If we use more than one field, what about spaces?
In the case of /etc/hosts it seems sensible to use the first column (ie, the actual IP address) as a key, but it could be argued that the second field (the FQDN) should be used instead, as we want to ensure that a given FQDN is resolvable, no matter to which IP address (this in turn has the problem that then we can't add an IPv4 and IPv6 line for the same FQDN). Here we're using the first field; again, adaptation will be necessary for different needs.
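
To get a feel for what each choice of key looks like on a real file, a small helper along these lines can list the keys. It is only a sketch: hosts_keys is a made-up name, and the field number is a parameter (1 for the IP address, 2 for the FQDN). Relying on awk's default field splitting also sidesteps the "what about spaces?" question, since runs of spaces and tabs count as a single separator.

```shell
# Print the key of each data line; field 1 is the address, field 2 the FQDN.
# Usage: hosts_keys file [field]
hosts_keys() {
  awk -v field="${2:-1}" '
    /^[ \t]*(#|$)/ { next }     # skip comments and empty lines
    { print $field }            # default splitting ignores extra blanks
  ' "$1"
}
```

For instance, hosts_keys /etc/hosts prints one IP address per data line, and hosts_keys /etc/hosts 2 prints the corresponding FQDNs.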

Another, more serious issue, has to do with the overall format of the resulting file. What do we do with comments and empty lines? In this case, we just print them verbatim.
And what about internal file "semantics" (for lack of a better term)? Let's say we like to have all IPv4 addresses nicely grouped together and all IPv6 addresses as well. New lines should respect the grouping (an IPv4 line should go into the IPv4 group etc.). Now things start to be, well, "interesting". Since where a line appears in the file doesn't really matter much to the resolver routines, here we choose to just append new lines at the end; but this is a very simple (and, for some "idempotent" editing fans probably unsatisfactory) policy.

The point is: it's easy to see how this seemingly easy task can quickly become arbitrarily (and ridiculously) complicated, and any "quick and dirty" solution necessarily has to deal with many assumptions and tradeoffs. (And all this just for the relatively simple file /etc/hosts. Imagine managing a DNS zone file, or a DHCP server configuration file, with MAC to IP mappings, just to name some other examples. And we're still in the domain of single-line-at-a-time changes.)

So here's some awk code that tries to do the merge. Whether the "existing/old line wins" policy or the "new line wins" policy is used is controlled with a flag (newwins) that can be set with -v, and by default is set to 0 (old line wins):

BEGIN {
  # awk way to check whether a variable is not defined
  if (newwins == "" && newwins == 0) {
    newwins = 0      # by default old line wins
  }
}

# load new lines, skip empty/comment lines
NR == FNR {
  if (!/^[[:blank:]]*(#|$)/) {
    ip = $1      # same key as the "$1 in newlines" test below; also works with tabs
    newlines[ip] = $0
  }
  next
}

# print comments and empty lines verbatim
/^[[:blank:]]*(#|$)/ {
  print
  next
}

$1 in newlines {
  print (newwins == 1 ? newlines[$1] : $0)
  # either way, forget it
  delete newlines[$1]
  next
}

{ print }

# if anything is left in newlines, they must be truly new lines
END {
  for (ip in newlines)
    print newlines[ip] 
}

So we can run it as follows ("old line wins" policy, only two new lines appended at the end; their relative order depends on awk's array traversal order, which is unspecified):

$ awk -f mergehosts.awk newlines.txt /etc/hosts
#
# /etc/hosts: static lookup table for host names
#
127.0.0.1	my.example.com localhost.localdomain	my localhost
::1		localhost.localdomain	localhost

192.168.44.12   server1.example.com server1
192.168.44.1    firewall.example.com firewall

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

2001:db8:a:b::1  server6.example.com server6
# End of file
2001:db8:a:b::100   server7.example.com  server7
192.168.44.250      newserver.example.com  newserver

Or with the "new line wins" policy (same two lines appended, and an existing one replaced with the new version):

$ awk -f mergehosts.awk -v newwins=1 newlines.txt /etc/hosts
#
# /etc/hosts: static lookup table for host names
#
127.0.0.1	my.example.com localhost.localdomain	my localhost
::1		localhost.localdomain	localhost

192.168.44.12   server1.example.com server1
192.168.44.1        firewall.example.com firewall gateway

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

2001:db8:a:b::1  server6.example.com server6
# End of file
2001:db8:a:b::100   server7.example.com  server7
192.168.44.250      newserver.example.com  newserver

(To actually change the original file, redirect the output to a temporary file and use it to overwrite the original one. Let's not start that discussion again).
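
For completeness, here is one possible shape for that step, as a sketch rather than a recommendation. The helper name update_file is invented, and it assumes the command writes the complete new content to standard output when given the target file as its last argument. It overwrites the target only when the content actually changed, and uses cat instead of mv so that the owner and mode of the target are preserved (at the price of a non-atomic write).

```shell
# Hypothetical helper, not part of the original script.
# Usage: update_file target command [args...]
# Runs "command args... target", captures its stdout in a temporary file and
# overwrites the target only if the content actually changed.
update_file() {
  target=$1; shift
  tmp=$(mktemp "${TMPDIR:-/tmp}/update_file.XXXXXX") || return 1
  status=0
  if "$@" "$target" > "$tmp"; then
    # Overwrite only on change. cat (rather than mv) keeps the owner and
    # mode of the target, at the price of a non-atomic write.
    cmp -s "$tmp" "$target" || cat "$tmp" > "$target" || status=1
  else
    status=1          # the command failed: leave the target alone
  fi
  rm -f "$tmp"
  return "$status"
}
```

With the merge script above, this would be called as update_file /etc/hosts awk -f mergehosts.awk newlines.txt (add -v newwins=1 before newlines.txt for the other policy). A second call with the same arguments should find nothing to change and leave the file untouched.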

Not looking good? Well, it's kind of expected, since it's an ugly hack. It does work under the assumptions, but it's nonetheless a hack.

As said, it's highly dependent on the use case, but in general a better solution for this kind of problem is to either generate the whole file from scratch every time (including from templates if appropriate), or use dedicated tools to manage it.

It can also be mentioned that, if one must really do it using a script, it's often possible and easy enough to divide the target file into "zones" (for example, using special comment markers). In this way, within the same file, one zone could be deemed "safe" and reserved for hand-created content that should be preserved, and another zone used for automated content (that is, erased and recreated from scratch each time). However, this assumes that the whole of the automated content is supplied each time. This approach (slightly less hackish) introduces its own set of considerations, and is interesting enough to deserve an article of its own.
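
Just to give the idea, a zone-based rewrite could look roughly like the sketch below. Everything in it is an assumption made for illustration: the marker strings, the function name rewrite_zone, and the policy of appending the zone at the end when the markers are not found. It also assumes that the markers, when present, are properly paired.

```shell
# Hypothetical marker lines; any unique comment strings would do.
BEGIN_MARK='# BEGIN AUTOMATED ZONE'
END_MARK='# END AUTOMATED ZONE'

# Usage: rewrite_zone automated-content-file target-file  (result on stdout)
rewrite_zone() {
  awk -v b="$BEGIN_MARK" -v e="$END_MARK" '
    function zone(   i) {          # emit the markers plus the fresh content
      print b
      for (i = 1; i <= n; i++) print fresh[i]
      print e
      done = 1
    }
    # first file: the complete automated content
    FILENAME == ARGV[1] { fresh[++n] = $0; next }
    $0 == b  { inzone = 1; if (!done) zone(); next }
    $0 == e  { inzone = 0; next }
    !inzone  { print }             # hand-made content, kept verbatim
    END      { if (!done) zone() } # no zone found: append one
  ' "$1" "$2"
}
```

Everything between the markers is thrown away and regenerated on each run, while lines outside the zone are never touched, so repeating the run with the same automated content gives the same file.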

One Comment

  1. Laurent C says:

    You might be interested by Ansible's approach, which just works in practice.
    http://docs.ansible.com/ansible/lineinfile_module.html
    https://docs.ansible.com/ansible/blockinfile_module.html
    Your blog is really good work.