Replace every Nth occurrence

Problem statement: replace every Nth occurrence of a pattern. For bonus points, provide context to the match.

For this article, we're going to use this example (mostly nonsensical, but it illustrates the concept):

1foo1 bar foo 2foo3 abc def foo foo foo 1foo4 foo foo 3foo9 4foo7 zzz 7foo7 3foo3

And we want, among all matches of /[0-9]foo[0-9]/, to replace the "foo" with "FOO" on every third match (read it again if it's not clear). So the output we want is:

1foo1 bar foo 2foo3 abc def foo foo foo 1FOO4 foo foo 3foo9 4foo7 zzz 7FOO7 3foo3

That is, the third and sixth matches get their "foo" part replaced with FOO.

Perl

Perl is the easy winner here due to its ability to evaluate code directly in the replacement part:

$count = 0; s/(\d)foo(\d)/(++$count % 3 == 0)?"$1FOO$2":$&/ge;

So in the RHS, if the number of the match we're seeing is a multiple of 3 we replace foo with FOO, otherwise we replace the match with itself. Obviously, by changing the test on $count we can control exactly which matches we act upon. Depending on the exact need, we could also employ the replacement operator directly on the matched part:

$count = 0; s/(\d)foo(\d)/$a=$&; $a =~ s{foo}{FOO} if (++$count % 3 == 0); $a/ge;

If no context is needed, things can be simplified (eg no capturing and replaying of the digits before and after foo):

$count = 0; s/foo/(++$count % 3 == 0)?"FOO":$&/ge;

But of course, due to the lack of context, this does a different thing on the input:

1foo1 bar foo 2FOO3 abc def foo foo FOO 1foo4 foo FOO 3foo9 4foo7 zzz 7FOO7 3foo3

So (as usual when working with regular expressions) one should know their data exactly, and what kind of processing is needed, before choosing which solution to use.

Awk

While not as straightforward as Perl, awk can indeed be used successfully for this kind of task. It just takes a bit more code:

{
  count = 0
  newline = ""
 
  while(match($0, /[0-9]foo[0-9]/) > 0) {
    count++
    newline = newline substr($0, 1, RSTART - 1)
    matched = substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
 
    if (count % 3 == 0) {
 
      # simple sub(), but see text below
      sub(/foo/, "FOO", matched)
    }
 
    newline = newline matched
  }
  newline = newline $0
  print newline
}

Here we're doing a simple sub() on the matched part, but depending on the exact task we may need to extract/save and restore the context, possibly running match() again on the matched part, for example:

{
  count = 0
  newline = ""
 
  while(match($0, /[0-9]foo[0-9]/) > 0) {
    count++
    newline = newline substr($0, 1, RSTART - 1)
    matched = substr($0, RSTART, RLENGTH)
 
    $0 = substr($0, RSTART + RLENGTH)
 
    if (count % 3 == 0) {
      match(matched, /^[0-9]/)
      prematch = substr(matched, RSTART, RLENGTH)
      matched = substr(matched, RLENGTH + 1)
      match(matched, /[0-9]$/)
      postmatch = substr(matched, RSTART, RLENGTH)
      matched = substr(matched, 1, RSTART - 1)
      sub(/foo/, "FOO", matched)
 
      # restore context
      matched = prematch matched postmatch
    }
 
    newline = newline matched
  }
  newline = newline $0
  print newline
}

Here the context is just a single leading/trailing digit so it could have been done by just taking the first and last character of matched, but hopefully the above illustrates the general concept of finding, extracting and restoring the parts that make up the context, which can be quite lengthy with awk.

Again if we don't need the context, most of the hassle just goes away:

{
  count = 0
  newline = ""
 
  while(match($0, /foo/) > 0) {
    count++
    newline = newline substr($0, 1, RSTART - 1)
    matched = substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
 
    if (count % 3 == 0) {
      sub(/foo/, "FOO", matched)
      # or here even just
      # matched = "FOO"
    }
 
    newline = newline matched
  }
  newline = newline $0
  print newline
}

NOTE: be careful not to look for matches that can be zero-length with awk, because they can cause an endless loop.
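To make the warning concrete, here's one defensive pattern: if the regexp can match the empty string (the hypothetical /x*/ below does, on purpose), force the scan to advance by at least one character per iteration:

```shell
# scan safely with a regexp that can match the empty string
echo 'abc' | awk '{
  newline = ""
  while (match($0, /x*/) > 0) {        # /x*/ matches "" at position 1 here
    len = (RLENGTH > 0) ? RLENGTH : 1  # consume at least one character
    newline = newline substr($0, 1, RSTART - 1 + len)
    $0 = substr($0, RSTART + len)
    if ($0 == "") break                # nothing left to scan
  }
  print newline $0
}'
# prints: abc
```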

Sed

Sed is definitely NOT recommended for this task. As a divertissement, here's a sed solution using markers for the much simpler case of N == 2 (every other match), without bothering with context (needs GNU sed):

sed 's/foo/\x1&\x2/g; s/\([^\x1]*[\x1][^\x2]*[\x2][^\x1]*\)[\x1][^\x2]*[\x2]/\1FOO/g; s/[\x1\x2]//g'

Text replacement in context/out of context

Ok the title isn't the best one but essentially the problem here is: I want to replace FOO with BAR, but only if FOO is (or is not) part of a text in brackets (this is just an example, although it seems to be a common occurring case; the point is that it must or must not be in a certain context). So, in this example:

abcd FOO efgh [this FOO is in brackets] ijkl FOO [another FOO in brackets]

The output should be either

abcd FOO efgh [this BAR is in brackets] ijkl FOO [another BAR in brackets]

or

abcd BAR efgh [this FOO is in brackets] ijkl BAR [another FOO in brackets]

depending on whether we want the in-context or out-of context replacement.

This is an interesting problem. There are a few different ways to approach it.

In-context replacement

In-context replacement is probably easier, so let's start with it.

Awk

The idea with awk is to use match() repeatedly to find all the instances of the context, and perform the replacements only on them. In our example, the contexts are all the bracketed blocks, so:

{
  newline = ""
  while(match($0, /\[[^]]*\]/) > 0) {
    newline = newline substr($0, 1, RSTART - 1)
    context = substr($0, RSTART, RLENGTH)
    gsub(/FOO/, "BAR", context)
    newline = newline context
    $0 = substr($0, RSTART + RLENGTH)
  }
  newline = newline $0
  print newline
}

The variable newline contains the changed line, which is built up gradually. Parts of the original line that are not touched are added to newline as they are, while contexts are added after FOO has been replaced with BAR. At the end, newline is printed.

Let's build a simple test file (which will be used throughout the examples), and test it. Note that for simplicity we're NOT considering nested contexts, which rapidly become very hard to parse using regular expressions (in the example, that would be blocks containing bracketed subblocks). We're also deliberately ignoring the case where context-providing characters can appear (perhaps escaped somehow) in some other place and thus should be ignored for our purposes.

$ cat sample.txt
abcd FOO efgh [this FOO is in brackets] ijkl FOO nmop [another FOO in brackets] blah
abcd FOO efgh
[this FOO is in brackets]
ijkl FOO mnop [another FOO in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
FOO FOO [FOO FOO][FOO FOO]FOO FOO
[FOO]
[FOO]ijkl
$ awk -f incontext.awk sample.txt
abcd FOO efgh [this BAR is in brackets] ijkl FOO nmop [another BAR in brackets] blah
abcd FOO efgh
[this BAR is in brackets]
ijkl FOO mnop [another BAR in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
FOO FOO [BAR BAR][BAR BAR]FOO FOO
[BAR]
[BAR]ijkl

sed

With sed, we use a loop and keep replacing FOOs that appear in a context:

$ sed ':loop; s/\(\[[^]]*\)FOO\([^]]*\]\)/\1BAR\2/; t loop' sample.txt
abcd FOO efgh [this BAR is in brackets] ijkl FOO nmop [another BAR in brackets] blah
abcd FOO efgh
[this BAR is in brackets]
ijkl FOO mnop [another BAR in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
FOO FOO [BAR BAR][BAR BAR]FOO FOO
[BAR]
[BAR]ijkl

Note that this will not work if the replacement string contains the matched text (ie FOO here); that would lead to an endless loop.

Perl

Perl is the most powerful of the bunch, so we can do the replacement directly on each matched context with the help of the /e switch (for eval) to the replacement:

$ perl -pe 's/\[.*?\]/($a=$&)=~s%FOO%BAR%g;$a/eg' sample.txt
abcd FOO efgh [this BAR is in brackets] ijkl FOO nmop [another BAR in brackets] blah
abcd FOO efgh
[this BAR is in brackets]
ijkl FOO mnop [another BAR in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
FOO FOO [BAR BAR][BAR BAR]FOO FOO
[BAR]
[BAR]ijkl

Also note that Perl's regular expressions are able to match contexts that would be difficult or impossible to match with standard awk/sed REs (think non-greedy quantifiers or lookaround). The example uses a simple context (brackets) so it's possible to use all the tools.

Out of context

This is a bit harder to accomplish, and in some cases we must resort to dirty tricks.

awk

Looking closely at the awk in-context solution, we see that during the loop we encounter both the contexts and the out-of-context data, alternating. So all we need is to perform the replacements on the out-of-context data instead of the in-context data. The solution is thus almost the same as the one for in-context replacement:

{
  newline = ""
  while(match($0, /\[[^]]*\]/) > 0) {
    outofcontext = substr($0, 1, RSTART - 1)
    gsub(/FOO/, "BAR", outofcontext)
    newline = newline outofcontext
    context = substr($0, RSTART, RLENGTH)
    newline = newline context
    $0 = substr($0, RSTART + RLENGTH)
  }
  gsub(/FOO/, "BAR")
  newline = newline $0
  print newline
}

$ awk -f outofcontext.awk sample.txt
abcd BAR efgh [this FOO is in brackets] ijkl BAR nmop [another FOO in brackets] blah
abcd BAR efgh
[this FOO is in brackets]
ijkl BAR mnop [another FOO in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
BAR BAR [FOO FOO][FOO FOO]BAR BAR
[FOO]
[FOO]ijkl

sed

The idea here is that contexts are removed from the line and stored away, the replacement is done on what's left (which thus must be the out-of-context data), and finally the contexts are restored to their original positions. Of course, to "remember" where the removed contexts are, we should use some sort of placeholder character.
So we put the contexts in the hold space (separated by an ASCII 1 character), and we use an ASCII 1 in the original line to mark each spot where a context has to be reinserted after the replacements.

h                   # save line to hold space

# remove non-contexts (ie, leave only contexts separated by \x1)
s/^[^[]*\[/[/
s/\][^]]*$/]\x1/
s/\][^[]*\[/]\x1[/g

# swap hold/pattern space to get the original line in pattern space
x

# remove contexts (ie, leave only non-contexts separated by \x1)
s/\[[^]]*\]/\x1/g

# do the actual replacement
s/FOO/BAR/g

# append hold space to pattern space, this gives <patternspace>\n<holdspace> in pattern space
G

# reinsert contexts where they belong
:loop
s/\x1\(.*\)\n\([^\x1]*\)\x1/\2\1\n/
t loop

# remove leftover stuff
s/\n.*//

Not the most straightforward way, but in these cases sed is a bit limited. I probably wouldn't recommend using sed for this task.

With a sed that supports EREs like GNU sed (which is probably needed anyway to use \x1 as in the other solution above), there is also the option of using a loop, similar to the in-context solution:

sed -r ':loop; s/((^|\])[^[]*)FOO([^[]*($|\[))/\1BAR\3/; t loop' sample.txt

This has the same problem as the in-context solution (the replacement can't contain the pattern), and also leads us directly to the Perl solution.

Perl

With Perl, again, it's quite easy:

$ perl -pe 's/(?:^|\]).*?(?:$|\[)/($a=$&)=~s%FOO%BAR%g;$a/eg' sample.txt
abcd BAR efgh [this FOO is in brackets] ijkl BAR nmop [another FOO in brackets] blah
abcd BAR efgh
[this FOO is in brackets]
ijkl BAR mnop [another FOO in brackets] blah
efgh [normal text in brackets] ijkl mnop [another normal text] blah
BAR BAR [FOO FOO][FOO FOO]BAR BAR
[FOO]
[FOO]ijkl

Essentially the idea is the same as before, but this time we are matching all the out-of-context parts (that is, from either beginning of line or "]" to either end of line or "[").

Crontab calendar generator

(for want of a better name).

Link to github repo: https://github.com/waldner/croncal.

The problem: a crontab file containing hundreds of entries which execute jobs at many different times, and we want to know what runs when, or, perhaps more interestingly, what will run when. All displayed in a calendar-like format, so we can see that on day X, job Y will run at 9:30 and job Z will run at 00:00, 06:00, 12:00 and 18:00 (for example).
Obviously, "manually" extracting this information by looking at the crontab file itself is going to be an exercise in frustration if there are more than a handful of entries (and even then, depending on how they are defined, it would probably require some messing around).

We'd like to have some program that takes a time range (start and end date or start date and duration) and a crontab file to read, and automagically produces the calendar output for the given period. Example follows to better illustrate the concept.

# Sample crontab file for this example

# runs at 5:00 every day
0 5 * * * job1

# runs at 00:00 every day
@daily job2

# runs every 7 hours every day, between 7 and 17, that is at 07:00 and 14:00
0 7-17/7 * * * job3

# runs at 17:30 on day 10 of each month
30 17 10 * * job4

We want to see the job timeline for the time range between 2012-06-09 00:00 and 2012-06-12 00:00. So we run it thus:

$ croncal.pl -s '2012-06-09 00:00' -e '2012-06-12 00:00' -f /path/to/crontab
2012-06-09 00:00|job2
2012-06-09 05:00|job1
2012-06-09 07:00|job3
2012-06-09 14:00|job3
2012-06-10 00:00|job2
2012-06-10 05:00|job1
2012-06-10 07:00|job3
2012-06-10 14:00|job3
2012-06-10 17:30|job4
2012-06-11 00:00|job2
2012-06-11 05:00|job1
2012-06-11 07:00|job3
2012-06-11 14:00|job3

That's basically the idea of the program described in this article. Running it with -h will print a summary of the options it accepts. Output can be in icalendar format (so the timeline can be viewed with any calendar application), plain as above, or just a count of how many jobs would run at each time. When using the plain or icalendar formats it's possible, mostly as a debugging aid, to print the job scheduling spec as it was originally found in the crontab file. This is done with the -x switch; example follows:

$ croncal.pl -s '2012-06-09 00:00' -e '2012-06-12 00:00' -f /path/to/crontab -x
2012-06-09 00:00|@daily|job2
2012-06-09 05:00|0 5 * * *|job1
2012-06-09 07:00|0 7-17/7 * * *|job3
2012-06-09 14:00|0 7-17/7 * * *|job3
2012-06-10 00:00|@daily|job2
2012-06-10 05:00|0 5 * * *|job1
2012-06-10 07:00|0 7-17/7 * * *|job3
2012-06-10 14:00|0 7-17/7 * * *|job3
2012-06-10 17:30|30 17 10 * *|job4
2012-06-11 00:00|@daily|job2
2012-06-11 05:00|0 5 * * *|job1
2012-06-11 07:00|0 7-17/7 * * *|job3
2012-06-11 14:00|0 7-17/7 * * *|job3

This should help confirm that the job should indeed run at the time shown in the first column (or not: there may be bugs!). Since the program reads from standard input if an explicit filename is not specified, it's possible to output the timeline resulting from multiple crontab files, for example by doing

cat crontab1 crontab2 crontabN | croncal.pl [ options ]

Ok, semi-UUOC but I was too lazy to implement multiple -f options.

Final words of caution:

  • If you run the program over a long period of time and have cron jobs that run very often, like "* * * * *" or similar, that will produce a lot of output.
  • The program was written as a quick and dirty way to solve a specific need, "works for me", is not optimized and does not try to be particularly smart or flexible. It may not be exactly what you were looking for, or may not do what you want, or in the way you want. That's life. Hopefully, it may still be useful to someone.

Perl fun (at last)

After reading something about dynamic programming problems (and thanks to Perl's very liberal parser for the alignment)...

@v=map{-2+ord}"a\x0c>oq\"e0#zo#iBtd%agx#cfR"=~/./g;($W,$w,$n)=(606,607,23);@w=
map{ord}"a#Q, J!!A%(V&)#(M-!!P).P"=~/./g;for($w..14567){if($_%$w){$p=$_-$w;$m[
$_]=$w[$a=int($_/$w)]>($_%$w)?$m[$p]:($x=$m[$p])>=($y=$m[$p-$w[$a]]+$v[$a])?$x
:$y}}for(;$n;$n--){$m[$a=$n*$w+$W]!=$m[$a-$w]and print chr$v[$n]and$W-=$w[$n]}

The only fact it exploits is that undefined values implicitly become 0 when used in calculations. Indeed, if the necessary declarations and initializations are added (not shown), the code runs without warnings under use warnings; and use strict;.

Load balancing and HA for multiple applications with Apache, HAProxy and keepalived

Let's imagine a situation where, for whatever reason, we have a number of web applications available for users, and we want users to access them using, for example,

https://example.com/appname/

Each application is served by a number of backend servers, so we want some sort of load balancing. The backend servers are not aware of the clustering, and expect requests relative to /, not to /appname. Also, SSL connections and caching are needed. The following picture illustrates the situation:

Here is an example of how to implement such a setup using Apache, HAProxy and keepalived. The provided configuration samples refer to the load balancer machine(s). Here's the logical structure of a load balancer described here:

Apache

Apache is where user requests land. The main functions of Apache in this setup are providing SSL termination, redirection for non-SSL requests (we want users to access everything over SSL), and possibly caching. Conforming requests are sent to HAProxy (see below) for load balancing.

Here is an excerpt from the Apache configuration:

# virtual host for port 80
# mostly just redirections

<VirtualHost *:80>
        ServerAdmin admin@example.com
        ServerName lb1.example.com
        # add ServerAlias as needed

        RewriteEngine on

        # redirect everything to https, the (.*) has a leading /
        RewriteRule ^(.*)$ https://%{HTTP_HOST}$1 [R,L]

        ErrorLog ${APACHE_LOG_DIR}/error.log

        # Possible values include: debug, info, notice, warn, error, crit,
        # alert, emerg.
        LogLevel warn

        CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

# virtual host for SSL

<IfModule mod_ssl.c>

NameVirtualHost *:443
<VirtualHost *:443>
        ServerAdmin admin@example.com
        ServerName lb1.example.com
        # add ServerAlias as needed

        SSLEngine on
        SSLProxyEngine on
        RewriteEngine On

        # SSL cert files
        SSLCertificateFile    /etc/apache2/ssl/example.com.crt
        SSLCertificateKeyFile /etc/apache2/ssl/example.com.key
        SSLCertificateChainFile /etc/apache2/ssl/chain.example.com.crt

        # Redirect requests not ending in slash eg. /app1 to /app1/
        RewriteRule ^/([^/]+)$ https://%{HTTP_HOST}/$1/ [R,L]

        # Uncomment this (and enable mod_disk_cache) to enable caching
        # CacheEnable disk /

        # pass everything to the local haproxy
        RewriteRule ^/([^/]+)/(.*)$ http://127.0.0.1:8080/$1/$2 [P]

        # The above RewriteRule is equivalent to multiple ProxyPass rules, eg
        #  ProxyPass /app1/ http://127.0.0.1:8080/app1/
        # etc.

        # THIS NEEDS A LINE FOR EACH APPLICATION
        ProxyPassReverse /app1/ http://127.0.0.1:8080/app1/
        ProxyPassReverse /app2/ http://127.0.0.1:8080/app2/
        ProxyPassReverse /app3/ http://127.0.0.1:8080/app3/
        # add other apps here... 

        <Proxy http://127.0.0.1:8080/*>
           Allow from all
        </Proxy>

        ErrorLog ${APACHE_LOG_DIR}/error_ssl.log

        # Possible values include: debug, info, notice, warn, error, crit,
        # alert, emerg.
        LogLevel warn

        CustomLog ${APACHE_LOG_DIR}/access_ssl.log combined
</VirtualHost>
</IfModule>

So, nothing special here. Users trying to connect over plain HTTP, or not using the trailing slash, are automatically redirected to the right URL. Disk caching can be enabled using mod_cache and mod_disk_cache, and mod_proxy is used to send valid requests to HAProxy. The actual proxying is performed using a rewrite rule with the [P] flag, which is essentially equivalent to using ProxyPass, but has the advantage that the rule can be made generic and what would otherwise require N ProxyPass directives (where N is the number of backend applications) can be done with a single RewriteRule. (ProxyPassMatch could also have been used to achieve a similar result).

Unfortunately, there's no shortcut for the ProxyPassReverse and the other ProxyPassReverse* directives, which means that all the applications have to be explicitly listed there (one directive for each application).

In this scenario sessions are not synchronized between backend servers, so once a new connection has been dispatched to a backend server, it must stick to that server until the session ends (unless the server fails, of course, in which case it will be redispatched to another backend server, and users will have to log in again). This is accomplished by HAProxy through the use of cookies: a cookie is inserted in the replies sent back to the client, recording which backend server the connection is using. When new requests for the same session come from the client, HAProxy just needs to read the cookie to find out which backend server to use. This cookie is also removed from the request before it's sent to the backend, so the application never sees it.

The Apache server terminates SSL, performs basic checks on the requests, redirects them if necessary, and (mostly) passes the traffic to HAProxy, which is listening on port 8080 (see below).

The second load balancer runs the same configuration (probably with ServerName set to lb2.example.com).

HAProxy

HAProxy analyzes the URLs and paths in the requests it's given to learn which application is being requested, and dispatches them to the right backend. But since backends expect requests relative to /, HAProxy also needs to strip the /appname/ part from the requests before forwarding them to the backends, and re-add it to replies on the way back. The application name can appear in some HTTP headers, or in cookies. HAProxy needs to fix it in all these places.

Here is HAProxy's /etc/haproxy/haproxy.cfg:

global
        ##log 127.0.0.1 local0
        ##log 127.0.0.1 local1 notice
        #log loghost    local0 info
        maxconn 4096
        user haproxy
        group haproxy
        daemon
        node lb1
        spread-checks 5     # 5%
        # uncomment this to get debug output
        #debug
        #quiet

# This section is fixed and just sets some default values.
# These values can be overridden by more-specific redefinitions 
# later in the config
defaults
        log     global
        mode    http
        # option  httplog
        option  dontlognull
        retries 3
        option redispatch
        maxconn 2000
        contimeout      5000
        clitimeout      50000
        srvtimeout      50000

# Enable admin/stats interface
# go to http://lb1.example.com:8000/stats to access it
listen admin_stats *:8000
       mode http
       stats uri /stats
       stats refresh 10s
       stats realm HAProxy\ Global\ stats
       stats auth admin:admin             # CHANGE THIS TO A SECURE PASSWORD

# A single frontend section is needed. This listens on 127.0.0.1:8080, and 
# receives the requests from Apache.
frontend web
  bind 127.0.0.1:8080 
  mode http

  # This determines which application is being requested
  # These ACL will match if the path in the request contains the relevant application name
  # for example the first ACL (want_app1) will match if the request is for /app1/something/, etc.
  acl want_app1 path_dir app1
  acl want_app2 path_dir app2
  acl want_app3 path_dir app3
  # ... add lines for other applications here...

  # these ACLs match if at least one server
  # for the application is available.
  acl app1_avail nbsrv(app1) ge 1
  acl app2_avail nbsrv(app2) ge 1
  acl app3_avail nbsrv(app3) ge 1
  # ... add lines for other applications here...

  # Here is where HAProxy decides which backend to use. Conditions
  # are ANDed.
  # This says: use the backend called "app1" if the request 
  # contains /app1/ (want_app1) AND the backend is available (app1_avail), etc.
  use_backend app1 if want_app1 app1_avail
  use_backend app2 if want_app2 app2_avail
  use_backend app3 if want_app3 app3_avail
  # ... etc

  # If we get here, no backend is available for the requested
  # application and users will get an error

########## BACKENDS ###################
backend app1
  mode http
  option httpclose

  # The load balancing method to use
  balance roundrobin

  # insert a cookie to record the real server
  cookie SRVID insert indirect nocache
  option nolinger 

  # Here is where requests coming from Apache are rewritten to
  # remove the reference to the application name

  # The request is something like
  # ^GET /app1/something HTTP/1.0$
  # but it should be seen by the real server as /something/,
  # so remove the application name on requests
  reqirep ^([^\ ]*)\ /app1/([^\ ]*)\ (.*)$       \1\ /\2\ \3

  # If the response contains a Location: header, reinsert
  # the application name in its value
  rspirep ^(Location:)\ http://([^/]*)/(.*)$    \1\ http://\2/app1/\3
  
  # Insert application name in the cookie's path
  rspirep ^(Set-Cookie:.*\ path=)([^\ ]+)(.*)$       \1/app1\2\3

  # This is to perform health checking: just get /
  # Adjust as needed by the specific application
  # Requests have the User-Agent: HAProxy so they can be excluded from logs
  # on the backend
  option httpchk GET / HTTP/1.0\r\nUser-Agent:\ HAProxy

  # Here is the actual list of local servers for the application
  # adjust parameters as needed
  server app1_1 192.168.0.46:80 cookie app1_1 check inter 10s rise 2 fall 2
  server app1_2 192.168.0.47:80 cookie app1_2 check inter 10s rise 2 fall 2
  server app1_3 192.168.0.48:80 cookie app1_3 check inter 10s rise 2 fall 2
  # ...add other servers for the application here...

# the following backends follow the same pattern
backend app2
  mode http
  option httpclose

  balance roundrobin

  cookie SRVID insert indirect nocache
  option nolinger 

  reqirep ^([^\ ]*)\ /app2/([^\ ]*)\ (.*)$       \1\ /\2\ \3
  rspirep ^(Location:)\ http://([^/]*)/(.*)$    \1\ http://\2/app2/\3
  rspirep ^(Set-Cookie:.*\ path=)([^\ ]+)(.*)$       \1/app2\2\3

  option httpchk GET / HTTP/1.0\r\nUser-Agent:\ HAProxy

  server app2_1 192.168.4.14:80 cookie app2_1 check inter 10s rise 2 fall 2
  server app2_2 192.168.4.18:80 cookie app2_2 check inter 10s rise 2 fall 2
  server app2_3 192.168.4.19:80 cookie app2_3 check inter 10s rise 2 fall 2

backend app3
  mode http
  option httpclose

  balance roundrobin

  cookie SRVID insert indirect nocache
  option nolinger 

  reqirep ^([^\ ]*)\ /app3/([^\ ]*)\ (.*)$       \1\ /\2\ \3
  rspirep ^(Location:)\ http://([^/]*)/(.*)$    \1\ http://\2/app3/\3
  rspirep ^(Set-Cookie:.*\ path=)([^\ ]+)(.*)$       \1/app3\2\3

  option httpchk GET / HTTP/1.0\r\nUser-Agent:\ HAProxy

  server app3_1 172.17.5.1:80 cookie app3_1 check inter 10s rise 2 fall 2
  server app3_2 172.17.5.2:80 cookie app3_2 check inter 10s rise 2 fall 2
  server app3_3 172.17.5.3:80 cookie app3_3 check inter 10s rise 2 fall 2

# these are the error pages returned by HAProxy when an error occurs
# customize as needed
        errorfile       400     /etc/haproxy/errors/400.http
        errorfile       403     /etc/haproxy/errors/403.http
        errorfile       408     /etc/haproxy/errors/408.http
        errorfile       500     /etc/haproxy/errors/500.http
        errorfile       502     /etc/haproxy/errors/502.http
        errorfile       503     /etc/haproxy/errors/503.http
        errorfile       504     /etc/haproxy/errors/504.http

The second load balancer runs the same configuration (probably using "node lb2" in the config).

Load balancer redundancy

Of course, we don't want to have a single point of failure in the load balancer, so two load balancers are set up with identical configurations, and Keepalived is used to run VRRP between them. VRRP provides a "virtual" IP address which is assigned to the active load balancer, and is where traffic comes in (ie, the address to which the URL used by users resolves). If the active load balancer fails, Keepalived transfers the VIP to the hot standby balancer, which takes over seamlessly. This is possible because the two load balancers need no shared state: all the information needed to dispatch to the backends is contained in the requests coming from users (in the form of HAProxy's persistence cookie); both balancers also perform the same health checks, so at any time both know which backends are available to dispatch new requests.

Here is /etc/keepalived/keepalived.conf:

vrrp_script chk_apache {
  script "killall -0 apache2"  # cheaper than pidof
  interval 2                       # check every 2 seconds
  weight 2                        # add 2 points of prio if OK
}

vrrp_instance apache_vip {

  # Initial state, MASTER|BACKUP
  # As soon as the other machine(s) come up,
  # an election will be held and the machine
  # with the highest "priority" will become MASTER.
  # So the entry here doesn't matter a whole lot.
  state BACKUP

  # interface to run VRRP
  interface eth0

  # optional, monitor these as well.
  # go to FAULT state if any of these go down.
  track_interface {
    eth0
    eth1
  }

  track_script {
    chk_apache
  }

  # delay for gratuitous ARP after transition to MASTER
  garp_master_delay 1    # secs

  # arbitrary unique number 0..255
  # used to differentiate multiple instances of vrrpd
  # running on the same NIC
  virtual_router_id 51

  # for electing MASTER, highest priority wins.
  # THIS IS DIFFERENT ON THE LBs. SET 101 for the MASTER, 100 for the SLAVE.
  priority 101

  # VRRP Advert interval, secs (use default)
  advert_int 1

  # This is the floating IP address that will be added or removed to
  # the LB's interface when a transition occurs.
  virtual_ipaddress {
    1.1.1.1/24 dev eth0
  }

  # VRRP will normally preempt a lower priority
  # machine when a higher priority machine comes
  # online.  "nopreempt" allows the lower priority
  # machine to maintain the master role, even when
  # a higher priority machine comes back online.
  # NOTE: For this to work, the initial state of this
  # entry must be BACKUP.
  nopreempt

  #debug
}

The only difference between the versions of this file installed on the balancers is that one of the balancers (the one that will start as active) must have a higher priority than the other, so VRRP knows which one should get the VIP.

Final notes

Logging on the backends

On the backends, there are two things to be aware of when configuring logging:

  • Normally, health checks performed by the load balancers will be logged;
  • All the requests, including user requests, will appear to be coming from the load balancer's IP.

To solve the first problem, we can recognize health check requests by looking at the "user-agent" field, and if it's HAProxy, don't log the request.
For the second problem, we can see what the original IP was by looking at the X-Forwarded-For header that Apache kindly inserts when acting as a reverse proxy.

So putting all together, here's a possible log configuration for a backend using Apache:

BrowserMatch ^HAProxy$ healthcheck
# define a log format that uses the X-Forwarded-For header to log the source of the request
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" mycombined

# log only if it's not a health check, and using the mycombined format       
CustomLog ${APACHE_LOG_DIR}/access.log mycombined env=!healthcheck

A weird bug

If using a version of HAProxy older than 1.3.23 (which is still the case if using Ubuntu Lucid), there is a nasty bug in the cookie parser that causes HAProxy to not recognize the persistence cookie if it appears after cookies whose name or value contain special characters. When that happens, HAProxy issues a new persistence cookie even if there is a valid one in the request, possibly directing users to another backend server and thus breaking their sessions. This was fixed in HAProxy 1.3.23. Whether the bug triggers or not depends on what the application does with cookies, and some variation in behavior between different browsers has also been observed. So, you may or may not hit the bug.
If working with the buggy version and upgrading is not possible (for whatever reason), one way to work around it is to rewrite the Cookie: header received from clients in the Apache frontend, so that HAProxy's cookie always comes first (if it's present, of course).

To use this kludge, mod_headers needs to be enabled.

# Edit Cookie: header so the HAProxy's persistence cookie comes first!
RequestHeader edit Cookie: "^(.*; *)?(SRVID=[^ ;]+) *;? *(.*)$" "$2; $1 $3"

So essentially what this does can be summarized with the following table:

Browser sends                        HAProxy sees

Cookie: a=b; c=d; e=f                Cookie: a=b; c=d; e=f  # no change
Cookie: SRVID=app1_2; a=b; c=d       Cookie: SRVID=app1_2; a=b; c=d
Cookie: a=b; SRVID=app1_2; c=d       Cookie: SRVID=app1_2; a=b; c=d
Cookie: a=b; c=d; SRVID=app1_2       Cookie: SRVID=app1_2; a=b; c=d

The regular expression must also consider that there can be an arbitrary number of spaces between cookies.

After the Cookie: header editing is applied, HAProxy's cookie always comes first, and things sort of work. Obviously, this is a workaround (and a pretty bad one), not a fix. Also, it's likely that there are obscure, or even not-so-obscure, cases where it fails. Alternatives to this kludge, all of them preferable to the above method, include:

  • Modify the applications on the backend servers so that cookie names and values never include the patterns that trigger the bug
  • Create your own package for HAProxy 1.3.23
  • Switch to a distro which includes HAProxy 1.4.x, or at least a version greater than 1.3.22

It depends on the application

Keep in mind that not every application lends itself well to be easily put behind a reverse proxy. There are applications that generate absolute URLs in the HTML code, just to name an especially bad and common behavior. In those cases, additional work is needed beyond that shown here; it can involve fixing the application (the right thing to do) or adding more kludges to the load balancing (which can be a lot of silly work).