Here we'll see some hints on how to write more idiomatic (and usually shorter and more efficient) awk programs. Many awk programs you're likely to encounter, especially short ones, make heavy use of these notions.
The power of conditions
As a starting example, suppose you want to print all the records (normally lines) in a file that match some pattern (a kind of awk-grep, if you like). A reasonable first shot is usually something like
awk '{if ($0 ~ /pattern/) print $0}'
That works, but there are some things to note.
The first thing to note is that it is not structured according to awk’s definition of a program, which is
condition { actions }
Our program can clearly be rewritten using this form, since both the condition and the action are very clear:
awk '$0 ~ /pattern/ {print $0}'
Our next step in the perfect awk-ification of this program is to note that a /pattern/ used alone as a condition is equivalent to $0 ~ /pattern/, so we can write
awk '/pattern/ {print $0}'
Now, let's turn our attention to the action part (the stuff inside braces). Since print without arguments prints $0 by default, the program can be shortened to
awk '/pattern/ {print}'
But let's take another step. When a condition is true and there is no associated action, awk performs a default action, and that action (you guessed it) is {print} (which, as we already know, is equivalent to {print $0}). Thus the whole program collapses to
awk '/pattern/'
Now we have reduced the initial program to its simplest (and most idiomatic) form. In many cases, if all you want to do is print some records (lines) according to a condition, you can write awk programs composed only of a condition (however complex):
awk '(NR%2 && /pattern/) || (!(NR%2) && /anotherpattern/)'
That prints odd-numbered lines that match /pattern/ and even-numbered lines that match /anotherpattern/. Naturally, if you don't want to print $0 but instead do something else, then you'll have to add a specific action to do what you want.
From the above, it follows that
awk 1
awk '"a"'   # single quotes are important!
are two awk programs that just print their input unchanged, both "1" and the string "a" obviously being always-true conditions. This is not terribly useful by itself, but it can be used in combination with other code in a number of circumstances.
For example, sometimes you want to operate only on some records of the input (according to some condition), but also want to print all records, regardless of whether they were affected by your operation or not. A typical example is a program like this:
awk '{sub(/pattern/, "foobar")} 1'
This tries to replace whatever matches /pattern/ with "foobar" on each input line; whether or not a substitution happens, the line is then printed by the always-true condition 1. It is effectively equivalent to
sed 's/pattern/foobar/'
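For instance (the sample input here is made up):

```shell
printf 'one pattern here\nno match here\n' | awk '{sub(/pattern/, "foobar")} 1'
# one foobar here
# no match here
```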
Here are some examples of typical awk programs, using only conditions:
awk 'NR % 6'            # prints all lines except lines 6,12,18...
awk 'NR > 5'            # prints from line 6 onwards (like tail -n +6, or sed '1,5d')
awk '$2 == "foo"'       # prints lines where the second field is "foo"
awk 'NF >= 6'           # prints lines with 6 or more fields
awk '/foo/ && /bar/'    # prints lines that match /foo/ and /bar/, in any order
awk '/foo/ && !/bar/'   # prints lines that match /foo/ but not /bar/
awk '/foo/ || /bar/'    # prints lines that match /foo/ or /bar/ (like grep -e 'foo' -e 'bar')
awk '/foo/,/bar/'       # prints from line matching /foo/ to line matching /bar/, inclusive
awk 'NF'                # prints only nonempty lines (or: do not print empty lines, where NF==0)
awk 'NF--'              # removes last field and prints the line
awk '$0 = NR" "$0'      # prepends line numbers (assignments are valid in conditions)
awk '!a[$0]++'          # suppresses duplicated lines! (figure out how it works)
As an extreme example of the power of conditions, let's examine the following code:
awk 'ORS = NR % 5 ? FS : RS'
You might also find it written with no spaces at all, especially in golf-ish contexts:
awk 'ORS=NR%5?FS:RS'
Let's run it using some simple input:
$ seq 1 30 | awk 'ORS=NR%5?FS:RS'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
So what it does is columnate the input (in groups of five columns here, but just change the number 5 in the code to the number you want, or use a variable and pass the value using -v). Why does it work? Well, when awk prints an output record (line), the special variable ORS, as the name suggests, contains the separator to append to the record. By default, ORS is "\n", ie a newline character. But here we are explicitly assigning a value to ORS depending on the outcome of the ternary operator: if NR is not a multiple of 5, ORS is set to FS (a space by default); if it is, ORS is set to RS (a newline by default). So the code is effectively equivalent to
awk 'ORS=NR%5?" ":"\n"'
Overall, the whole code is an assignment which, as we've seen, is evaluated and returns the assigned value (here either FS or RS, both of which are nonempty strings and thus true). Since the condition is always true, the default action (print) is executed for every line, and ORS appends either a space or a newline, depending on whether we are at the fifth column or not.
An issue with this code is that if the number of lines in the input is not a multiple of the number of columns (5 in this case), the output ends with a space rather than a newline. You can add an END block to correct that.
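A sketch of such an END block (it supplies the missing final newline; the trailing space that was already printed cannot be removed at this point):

```shell
# print a newline in END only if the last record did not already get one;
# printf is used because ORS is still a space when END runs,
# so a plain `print ""` would emit another space instead
seq 1 7 | awk 'ORS = NR % 5 ? FS : RS; END { if (NR % 5) printf "\n" }'
# 1 2 3 4 5
# 6 7       <- note: still followed by a trailing space
```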
Note that the ternary operator can be changed to yield different results; for example, this code
awk 'ORS=/foo$/?FS:RS'
joins the following line to the current one if the current one ends in foo. You are encouraged to find your own variations.
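For example (made-up input):

```shell
printf 'start foo\nend\nlast\n' | awk 'ORS = /foo$/ ? FS : RS'
# start foo end
# last
```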
Self-assignments
Let's take another relatively common task: replacing the delimiters. Your input fields are delimited, for example, by semicolons (or any other arbitrarily complex separator), and you want to change that to, say, commas. Armed with the knowledge you gathered in the first part of this article, you do
awk -v FS=';' -v OFS=',' 1 # doesn't work!
but that doesn't work: it outputs the input unchanged. The reason for this behavior is that awk does not rebuild $0 (that is, replace FS with OFS) until some field is modified. That might seem strange at first, but it makes sense (and is even useful) in many circumstances. One reason is efficiency: if the input and output separators are the same, as often happens, replacing FS with OFS is pointless. And if you think about it, if awk really always replaced FS with OFS, a program as simple as
$ echo 'foo;bar' | awk -v FS=';' -v OFS=',' '/foo/'
would output
foo,bar # ????
which violates the principle of least surprise and most certainly is not what one expects here.
On the other hand, if replacing FS with OFS is needed, then obviously awk has to do that at some point. So the question is: when is it that awk thinks FS has to be replaced with OFS? As mentioned above, awk assumes that it's time to replace FS with OFS (and thus recompute $0) when a field is modified, which is a sensible assumption, and almost always produces the output that one would expect.
So back to the original problem, you can now see a solution: to force recomputation of $0, just let awk think you've changed a field, but without changing it:
awk -v FS=';' -v OFS=',' '{$1=$1}1'
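Now the conversion works; a quick check:

```shell
echo 'foo;bar' | awk -v FS=';' -v OFS=',' '{$1=$1}1'
# foo,bar
```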
The assignment $1=$1 makes awk believe that a field has been modified, so it recomputes $0 using OFS, and the trailing 1 prints the rebuilt record.
Warning: useless information follows.
If you're sure that $1 will never be an empty string, you can even golf the code a bit more and use the assignment as the condition, so
awk -v FS=';' -v OFS=',' '$1=$1'
If you want to shorten it even more (and make it more cryptic) to impress your friends or for whatever other reason, you can move the assignments to the end to save the -v options and remove some quotes:
awk '$1=$1' FS=\; OFS=,
That exploits an obscure feature of awk where any command-line argument that contains a "=" is treated as a variable assignment instead of a file to read (which, btw, makes it hard to persuade awk to operate on files whose names contain "=". I'm sure your life will never be the same now that you know this). Well, I did say that this was useless information, so let's get back to something more practical...
Build strings with separators
This is similar to the so-called fencepost problem. On many occasions you need to build a string using concatenation, starting from an empty string and adding values as you go. The values should be separated by some separator (let's say, a semicolon for these examples). One might do this, in some loop:
string = string ";" somedata
but then string has an unwanted leading semicolon. Putting the semicolon after the variable leaves a trailing one instead. So a typical way to do this in awk is:
string = string sep somedata; sep = ";"
This exploits the fact that awk variables start out containing the dual value empty string or zero, so the first time the code is executed, sep is empty (you can explicitly initialize it to the empty string in a BEGIN block, if you like, but it's redundant). Then it's set to a semicolon, and it will have that value from the second time the code is executed onwards. The result is that at the end string will have a neat list of values with the semicolons only where they should be.
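A minimal demonstration of the idiom, joining all input lines with semicolons:

```shell
# sep is empty on the first line, ";" from the second line onwards
printf 'one\ntwo\nthree\n' | awk '{ s = s sep $0; sep = ";" } END { print s }'
# one;two;three
```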
As a practical example of this idiom, let's see how to transpose a matrix using awk:
$ cat matrix.txt
a1;a2;a3;a4;a5
b1;b2;b3;b4;b5
c1;c2;c3;c4;c5
d1;d2;d3;d4;d5
$ awk -F\; '{for(i=1;i<=NF;i++)r[i]=r[i] sep $i;sep=FS}END{for(i=1;i<=NF;i++)print r[i]}' matrix.txt
a1;b1;c1;d1
a2;b2;c2;d2
a3;b3;c3;d3
a4;b4;c4;d4
a5;b5;c5;d5
The idea here is to build an array r with NF elements (the number of columns in the original input), each of which will hold a line of the output. For each input line, every element of r has another "column" added. The variable sep is initially empty, then (after the first input line has been processed) it's set to semicolon. Of course, for more complex processing, an array of separators could be used.
Two-file processing
Another construct that is often used in awk is as follows:
$ awk 'NR == FNR {
    # some actions
    next
}
# other condition
{
    # other actions
}' file1.txt file2.txt
This is used when processing two files. When processing more than one file, awk reads each file sequentially, one after another, in the order they are specified on the command line. The special variable NR stores the total number of input records read so far, regardless of how many files have been read: it starts at 1 and keeps increasing until the program terminates. Another variable, FNR, stores the number of records read from the current file: it starts at 1, increases until the end of the current file is reached, then is set again to 1 as soon as the first line of the next file is read, and so on. So, the condition NR == FNR is only true while awk is reading the first file. Thus, in the program above, "# some actions" runs only on the lines of the first file (the next statement skips the rest of the program for those lines), while "# other actions" runs only on the lines of the second file.
It probably all becomes much clearer with some examples. There are really many two-file problems that can be solved using this technique. Let's look at this one:
# prints lines that are both in file1.txt and file2.txt (intersection)
$ awk 'NR == FNR{a[$0];next} $0 in a' file1.txt file2.txt
Here we see another typical idiom: a[$0] alone creates the array element indexed by the whole line (if it doesn't exist yet) without assigning it any value, and the condition $0 in a tests whether such an element exists. A variation prints the difference rather than the intersection:
# prints lines that are only in file1.txt and not in file2.txt
$ awk 'NR == FNR{a[$0];next} !($0 in a)' file2.txt file1.txt
Note the order of the arguments: file2.txt is read first, so that its lines are stored in the array, and then the lines of file1.txt that are not in the array are printed.
Another example. Suppose we have a data file like this
$ cat data.txt
20081010 1123 xxx
20081011 1234 def
20081012 0933 xyz
20081013 0512 abc
20081013 0717 def
...thousands of lines...
where "xxx", "def", etc. are some kind of operation codes. We want to replace each operation code with its description. We have another file that maps operation codes to human readable descriptions, like this:
$ cat map.txt
abc withdrawal
def payment
xyz deposit
xxx balance
...other codes...
We can easily replace the opcodes in the data file with this simple awk program, which again uses the two-files idiom (and other idioms that were already introduced):
# use information from a map file to modify a data file
$ awk 'NR == FNR{a[$1]=$2;next} {$3=a[$3]}1' map.txt data.txt
First, the array a, indexed by opcode, is populated with the human readable descriptions, read from the map file. Then, it is used during the reading of the data file to do the replacements. Each line of the data file is then printed after the substitution has been made.
Another case where the two-files idiom is useful is when you have to read the same file twice, the first time to get some information that can be correctly defined only by reading the whole file, and the second time to process the file using that information. For example, you want to replace each number in a list of numbers with its difference from the largest number in the list:
# replace each number with its difference from the maximum
$ awk 'NR == FNR{if($0 > max) max = $0;next} {$0 = max - $0}1' file.txt file.txt
Note that we specify file.txt twice on the command line, so the same file is read twice: the first pass finds the maximum, and the second pass computes and prints the differences.
As with all other idioms, you are encouraged to find your own uses and variations.
Caveat: all the programs that use the two-files idiom will not work correctly if the first file is empty (in that case, awk will execute the actions associated with NR == FNR while reading the second file, since the condition holds there too). To protect against that, the condition NR == FNR can be replaced with FILENAME == ARGV[1] (which of course requires the two files to have different names).
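A sketch of the intersection program rewritten with the more robust condition (file names and contents are made up):

```shell
# set up two sample files
printf 'x\ny\n' > file1.txt
printf 'y\nz\n' > file2.txt

# FILENAME == ARGV[1] is true only while reading the first *named* file,
# even when that file is empty
awk 'FILENAME == ARGV[1] { a[$0]; next } $0 in a' file1.txt file2.txt
# y
```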
Hi,
Let's say I have file1 and file2. I want to match a sequence in file1 with a sequence in file2 and return the output with some other sequence from file 2. For example, in file1 I have the line:
Jessica Robert 123 USA association
In file 2 I have:
Mike association UK 567
And I want to return Mike from file2.
Thank you
Actually, I have used the command:
awk 'NR == FNR{a[$5];next} $2 in a' Route_guide_1.txt Carrier_associations_1.txt
but it displays the entire line of file2, and I need only the first sequence.
Thanks
Ok, I have managed to do this with:
awk 'NR == FNR {a[$5];next} $2 in a {print $1} ' file1 file2
If you have any advice please let me know. And sorry for the spam, but I am new to this and really wanted some help :)
Thank you
You did it right. Cheers
Hello Guys!
I need help with a requirement where I need to perform a join operation on 3 CSV files, where 1 column is common to the 3 files.
And I also want to print the resulting values to another CSV file. The order of the printed columns in the CSV also matters.
Best Regards
Akshat
It's not quite clear from your description; an example of input and expected output would be better.
I have two files that I want to merge based on the first column as the key in both files.
So if the first column matches, I want to merge the two records (one from each file) based on a condition on the value of the second column (if the second column is "abc" in the second file, take the value of the second column from the first file, provided it is not null in the first file).
If the first column does not exist in both files, I want the record in my output as it is, from whichever file it is in.
First file record => alphabeta,xyz,Thu Apr 05 09:30:50 AM 18,sss,ttt,uuu,0
Second file record => alphabeta,abc, Tue Apr 10 12:40:50 EDT 2018 , sss ,ttr , xxe, 95
Merged file record => alphabeta,xyz, Tue Apr 10 12:40:50 EDT 2018 , sss ,ttr , xxe, 95
You don't say what you want to do when second column in file2 is "abc" but the corresponding column in file1 is null (I suppose you mean "empty" rather); I'm assuming that in this case you keep the "abc" value from file2.
Also it's not clear whether order matters in the output (I'm assuming it doesn't).
I'm also assuming that each key (ie, column 1) cannot appear more than once in a file.
Try with
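The original reply's code hasn't survived here; under the assumptions stated above, a sketch might look like this (file names are hypothetical):

```shell
# recreate the two sample records from the question
cat > first.csv <<'EOF'
alphabeta,xyz,Thu Apr 05 09:30:50 AM 18,sss,ttt,uuu,0
EOF
cat > second.csv <<'EOF'
alphabeta,abc, Tue Apr 10 12:40:50 EDT 2018 , sss ,ttr , xxe, 95
EOF

awk -F, -v OFS=, '
  NR == FNR { line1[$1] = $0; col2[$1] = $2; next }   # remember file1 records
  {
    if ($1 in line1) {
      # key in both files: take column 2 from file1 when file2 has "abc"
      if ($2 == "abc" && col2[$1] != "") $2 = col2[$1]
      seen[$1] = 1
    }
    print                                   # file2 record, possibly merged
  }
  END { for (k in line1) if (!(k in seen)) print line1[k] }   # keys only in file1
' first.csv second.csv
# alphabeta,xyz, Tue Apr 10 12:40:50 EDT 2018 , sss ,ttr , xxe, 95
```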
Thanks for this useful stuff waldner, really helpful. If I want to do the following with awk, could you please help?
cat file1.txt
8888234234224234
hello world - wonderful
8888824545534334
hellow rold - nice
8888334323234234
hello world - amazing
hello world - excellent
cat file2.txt
some random text
some other random text
andmorerandom:{'and: somemore text','text: and more/du0038888334323234234','name:''hello world'}
some random text
some other random text
andmorerandom:{'and: somemore text','text: and more/du0038888234234224234','name:''hello world'}
Now I want to replace 'hello world' in file2 with the 'hello world - xxxx' value that appears on the line after the matching 8888xxxxxx in file1.
Note: if there are 2 'hello world' lines for a matching entry in file1, then I want to duplicate the entire matching line in file2 and replace the second 'hello world' entry there. (I am not sure if this is possible though.)
So my output should look like this..
cat file2.txt
some random text
some other random text
andmorerandom:{'and: somemore text','text: and more/du0038888334323234234','name:''hello world - amazing'}
some random text
some other random text
andmorerandom:{'and: somemore text','text: and more/du0038888234234224234','name:''hello world - wonderful'}
Thank you so much in advance for your precious time and help on this.
Your problem isn't well specified and in many places there could be many possibilities. I've made some assumptions and come up with the following code (mind you, needs GNU awk):
With that code in a file, you can do
and get the output you describe (or at least my understanding of it). If not, I hope you can do the necessary adjustments yourself.
In your columnation example, you said adding an END block would eliminate the trailing space for sequences that don't divide evenly. Do you have an example of that END block?
My solution requires two passes.
seq 1 29 | awk 'ORS=NR%5?FS:RS;END{printf ""}' | awk 'NR>1{print prev}{prev=$0}END{sub(/ $/,"",$0); print}'
In the END block you just put the printing of the newline, eg END { if (NR % 5) printf "\n" }.
That END block would terminate the output as " \n", whereas all other lines terminated without space. I didn't see any way of doing that in the END block while keeping the same code without piping to another awk for post-processing.
I see what you mean. Well in that case it's not possible to remove the space, since it has already been printed. You can change the code and lose a lot of idiomaticity to accumulate the line instead of printing it so you have the chance to change it before printing, eg something like
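A reconstruction of that accumulate-then-print approach (my sketch, not the original reply's code):

```shell
# build each output row in a variable using the separator idiom,
# so no trailing space is ever printed
seq 1 7 | awk '{ row = row sep $0; sep = " " }
  NR % 5 == 0 { print row; row = sep = "" }
  END { if (row != "") print row }'
# 1 2 3 4 5
# 6 7
```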
I want following format for my program:
awk
1. {Body (processing file_1)}
2. END{ computation from previous data}
3. {Body update fields in file_1 using computed result}
the problem is with part 3, as I am unable to read file_1 from the beginning after END.
The problem is difficult for me as I am new to awk.
Thanks in advance.
So you can read the same file twice, and do your computations just before you process the first line for the second time.
For example:
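A sketch of what that might look like (the file name and the computation are made up): the FNR == 1 block runs just before the first line is processed for the second time, when the first pass's data is complete.

```shell
printf '1\n2\n3\n4\n' > nums.txt   # sample data

awk 'NR == FNR { total += $1; next }       # first pass: accumulate
     FNR == 1  { print "total:", total }   # runs once, between the passes
     { print $1, total - $1 }              # second pass: use the result
' nums.txt nums.txt
# total: 10
# 1 9
# 2 8
# 3 7
# 4 6
```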
Thanks a lot waldner,
It worked like a charm..
I don't see how the 1 in this example matches the pattern{action} template:
awk '{sub(/pattern/, "foobar")} 1'
It's explained in there. "1" is an always-true pattern; the action is missing, which means that it is
{print}
(the default action that is executed if the pattern is true). So essentially the lone "1" is used to print all lines. The example you cite can thus be rewritten as either
awk '{sub(/pattern/, "foobar")} 1 {print}'
or simply
awk '{sub(/pattern/, "foobar")} {print}'
Follow-up question:
Are the pattern{action} pairs normally separated by a ";", like
pattern1{action1};pattern2{action2}
Mike
The semicolon is not mandatory, as long as awk is able to tell where a pattern begins.
Is there a way to do this but with five files, instead of two?
# prints lines that are both in file1.txt and file2.txt (intersection)
$ awk 'NR == FNR{a[$0];next} $0 in a' file1.txt file2.txt
I assume you want to print the lines that appear in all five files (or N files, for that matter). This should do it, assuming no file is empty (a bit reformatted for readability, but it can be written all on a single line):
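The original code block is missing here; a sketch that generalizes the intersection to N files (under the same no-empty-file assumption; note the output order is not preserved, hence the sort) could be:

```shell
# sample files: the common lines are "b" and "c"
printf 'a\nb\nc\n' > f1.txt
printf 'b\nc\nd\n' > f2.txt
printf 'c\nb\ne\n' > f3.txt

awk 'FNR == 1 { nfile++ }               # count files as we enter each one
     !seen[nfile, $0]++ { cnt[$0]++ }   # count each line at most once per file
     END { for (line in cnt) if (cnt[line] == nfile) print line }
' f1.txt f2.txt f3.txt | sort
# b
# c
```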
I have two files that differ in rows and columns, but they have a common column: the first column in file1 is the same as the last column in file2. I'm struggling to merge them; it doesn't work as expected. Any help would be appreciated.
I'm using this command, but it doesn't work as expected. I'm putting file1 in an array, and for the second file I'm trying to compare on the column -> awk 'NR==FNR{a[$1]=$2;next} $NF=a[$1]{print $0, a[$NF]}' file1 file2
File1
-----
/g01/ffb /systst/tst.ds
File2
-----
21 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffb
22 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffa
...
output - I'm looking for
==========================
21 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffb /systst/tst.ds
Thanks
Try the following:
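The reply's code is lost here, but given the sample data above, a sketch like this should work: store file1's mapping keyed on its first column, then match it against the last field of each file2 line:

```shell
# recreate the sample data from the question
printf '/g01/ffb /systst/tst.ds\n' > file1
printf '%s\n' '21 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffb' \
              '22 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffa' > file2

# test membership with `in` instead of assigning, and append the mapped value
awk 'NR == FNR { a[$1] = $2; next } $NF in a { print $0, a[$NF] }' file1 file2
# 21 0 rw- 1 gem gem 12 Jul 19 2016 /g01/ffb /systst/tst.ds
```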