CSV parsing with awk

Posted by waldner on 17 April 2010, 11:09 am

This is another thing people do all the time with awk. Although the CSV format should be more or less standardized, it seems there are still a number of subtle variations floating around. Let's look at some of them.

Simple CSV files (with fields separated by commas, and commas cannot appear anywhere else) are easily parsed by setting FS to ",", so we won't go into further detail here, as there's not much to add to it.

If there can be spaces around fields, and we don’t want them, like eg

    field1  ,   field2   , field3   , field4

Exploiting the fact that FS can be a regex, we could try using something like

FS="^ *| *, *| *$"
# or, on the command line
awk -F '^ *| *, *| *$' ...

However, this can be problematic because the actual data fields might end up corresponding either to awk fields 1 … NF or 2 … NF (or NF-1), depending on whether the line had leading/trailing spaces or not. So in this case, it’s probably better to just parse using "," for FS as usual, and remove leading and trailing spaces from each field:

# FS=","
for(i=1;i<=NF;i++){
  gsub(/^ *| *$/,"",$i)
  print "Field " i " is " $i
}

If commas can appear in fields provided that they are escaped, like this

field1,field2\,with\,commas,field3,field4\,

a common trick is to replace all the \, sequences beforehand with some character that is not (or should not be!) present in the file, parse the records as usual, and then replace back the sequences that were previously removed. While some implementations of awk allow using any character via the \xnn syntax, so you could use some ASCII control character, usually the special variable SUBSEP (whose default value is 0x1c hex, or 034 octal) is good enough for the job:

# FS="," or whatever appropriate
gsub(/\\,/,SUBSEP)
for (i=1;i<=NF;i++) {
  f=$i
  gsub(SUBSEP,"\\,",f)
  print "Field " i " is " f
}

Another common CSV format is

"field1","field2","field3,with,commas","field4"

Assuming double quotes cannot occur in fields, but here fields can contain unescaped commas. This is easily parsed using

FS="^\"|\",\"|\"$"
# or, if you like
FS="\",\"|\""
# or on the command line (perhaps clearer)
awk -F '^"|","|"$' ...
awk -F '","|"' ...

but keeping in mind that the actual fields will be in position 2, 3 … NF-1. We can extend that to allow for spaces around fields, like eg

   "field1"  , "field2",   "field3,with,commas" , "field4"

by using

FS="^ *\"|\" *, *\"|\" *$"
# or, on the command line
awk -F '^ *"|" *, *"|" *$' ...

usable fields will still be in positions 2 … NF-1, so there will be no ambiguities. You can NOT use the good old "," for FS here, as the commas within fields will be used as separators.

Another CSV format is as follows (the only quoted fields are those that contain commas):

  field1, "field2,with,commas"  ,  field3  ,  "field4,foo"

We have a mixture of quoted and unquoted fields here, which cannot parsed directly by any value of FS (that I know of, at least). However, we can still get the fields using match() in a loop (and cheating a bit):

c=0
$0=$0","                                   # yes, cheating
while($0) {
  match($0,/ *"[^"]*" *,|[^,]*,/) 
  f=substr($0,RSTART,RLENGTH)             # save what matched in f
  gsub(/^ *"?|"? *,$/,"",f)               # remove extra stuff
  print "Field " ++c " is " f
  $0=substr($0,RLENGTH+1)                 # "consume" what matched
}

As the complexity of the format increases (for example when quotes are allowed in fields, escaped in some way), awk solutions become more fragile. Although I should not say this here, for anything more complex than the last example, I suggest using other tools (eg, Perl just to name one). Btw, it looks like there is an awk CSV parsing library here (I have not tried it, but it looks quite good).

Filed under awk, shell, tips Tagged awk, csv, text processing

Comments are closed | Permalink

14 Comments

Marco Coletti says:

March 11, 2016 at 16:43

We should be using a recent version of gawk anyway, hence there is a rather simple solution using FPAT (field pattern) instead of FS (field separator) which is documented in GAWK manual:
https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

However the regexp exposed there is not completely RFC 4180 compliant because it does not account for [""] inside ["] like ["He shouted ""Hello"""].
Here is my solution:
```
-----------------------------------------------------------------------------
BEGIN {
  FPAT = "(\"([^\"]|\"\")*\")|([^,\"]*)"
}
{
  for (i = 1; i <= NF; i++) {
    $i = gensub(/\"\"/,"\"","g",gensub(/^\"|\"$/,"","g",$i))
  }
  # at this point the variables $1, $2, $3... contain the original undecorated unescaped data
}
-----------------------------------------------------------------------------
```
Jarno Suni says:

August 12, 2015 at 21:05

Here is a solution to parse CSV data (in format defined in RFC 4180 linked above) using gawk's patsplit function:

http://lists.gnu.org/archive/html/bug-gawk/2015-07/msg00002.html
- jee says:
  
  January 23, 2016 at 10:15
  
  This solution only works with gawk 4.0 and above.
Ralph Little says:

March 24, 2015 at 22:05

Here's a regex that can be used with the match() example above:

([^\\,"]|(\\.))*($|,)|(^"([^"\\]|(\\.))*"($|,))

It allows you to parse a mixture of "text" and text in the CSV and you can \ any character to treat it as data.
The data extracted needs to be postprocessed to remove the quoting if appropriate.
Just another idea.

Example:
"Test",Example containing \",Example containing \,,Another example containing \\
Dan says:

May 5, 2013 at 00:51

For a different approach, see https://github.com/dbro/csvquote - it's a script I wrote that sanitizes the quoted data so that awk can work with it easily (no FPAT required, handles double quote marks), and then restores the special characters after awk is done.
Andy Law says:

June 4, 2012 at 16:01

Here's a different way to do the mixed fields (i.e. the Excel output) format.

BEGIN {FS="\"";OFS=" "}; {for (i = 1; i <= NF; i += 2) { gsub(",", "\t", $i)}; print}

Rationale:
Use the quotes to split the line. In a well-formed line, we will get an odd number of fields. The even numbered fields will be the contents of a single field that are contained within the quotes, the odd-numbered fields will be all the other fields between them.

So, given the input line of...

field1, "field2,with,commas" , field3 , "field4,foo"

we would initially split into 5 fields which are...

field1,
field2,with,commas
, field3 ,
field4,foo

and a blank one at the end.

We process these, converting any commas in the odd-numbered fields into tabs and print them out, we get the desired effect. For bonus points we can strip away leading and trailing whitespace by specifying a slightly different regex for gsub.

BEGIN {FS="\"";OFS=" "}; {for (i = 1; i <= NF; i += 2) { gsub(" *, *", "\t", $i)}; print}

awk is your friend.
- waldner says:
  
  June 8, 2012 at 10:25
  
  You probably want to set OFS to be empty rather than a space (which is the default anyway), otherwise you'll get spurious spaces between fields in the output.
  
  Also consider that your code creates different number of awk fields depending on which CSV fields are quoted, eg this input
```
field1, "field2,with,commas" , field3 , "field4,foo"
```
  is treated differently from
```
field1, "field2,with,commas" , "field3,abc" , "field4,foo"
```
  although the replacement of " *, *" with tabs in the odd numbered field makes it look like the end result is similar.
  - Andy Law says:
    
    June 11, 2012 at 14:45
    
    You're right about the OFS - my bad on the typing.
    
    But I beg to differ with you over your second assertion. Provided that the double quotes are correctly balanced, my code works regardless of how many or how few CSV fields are quoted.
    
    Am I missing something?
    - waldner says:
      
      June 11, 2012 at 15:05
      
      I didn't say it wouldn't work, just that internally it is treated differently, and the end result happens to be the same due to the way the substitutions are made. However if you start to do something a bit more involved with fields, you'll have to introduce more logic in the code to handle all the cases.
      
      As a simple example, if you want to prepend "XXX" to all fields before printing them, you'll probably have to do some sort of loop over the fields that awk builds, but which awk fields are CSV fields and which are not depends on which CSV fields were quoted in the input. If the input is
      
      field1, "field2,with,commas" , field3 , "field4,foo"
      
      you'll have "real" CSV fields in $1, $2, $3, $4 (although they would need to be cleaned up). But if the input is
      
      field1, "field2,with,commas" , "field3,abc" , "field4,foo"
      
      then the CSV fields end up in $1, $2, $4 and $6. If the positions of the quoted fields change, then the mapping to awk fields changes again. Sure, you can then check every (even) awk field for embedded commas or use other tricks to find out, but imho that complicates the code.
      
      If your goal is just to transform an input line into another with tab-separated fields and double quotes removed, without looking at the actual fields, then the code is fine provided OFS is set to the empty string.
      - Andy Law says:
        
        June 11, 2012 at 15:21
        
        Again, not sure that I agree with you. If i want to prepend "XXX" to each field, then all I need to do is a little bit of extra work with field 1 to account for the fact that I'm not replacing a comma in front of it.
        
        BEGIN {FS="\"";OFS=""}; {for (i = 1; i <= NF; i += 2) { gsub(" *, *", "\tXXX", $i)}; $1="XXX"$1; print }
        
        works for both of your input examples.
        
        waldner says:
        
        June 11, 2012 at 15:48
        
        I chose a bad example, but hopefully you get the idea. Let's assume we have to do something with the third CSV field. The input may vary from
        
        field1, field2 , field3 , field4
        
        which gives you a single awk field (everything in $1), to
        
        "field1,a", "field2,b" , "field3,c" , "field4,d"
        
        which gives you 9 awk fields. And of course, any combination in between these two extremes. For each different combination of quoted/unquoted fields in the input, the mapping to awk fields changes.
        
        In many cases, the third CSV field will not come alone so you'd have to find it first (find in which awk field it ended up), isolate it if needed, then finally do what you have to do. I don't think this is very practical.
        
        Andy Law says:
        
        June 11, 2012 at 15:55
        
        OK. I understand what you're suggesting is the problem, but I would naturally deal with processing the third field in a subsequent process through a pipe rather than trying to do everything in a single script.
        
        One re-usable awk script to convert from Excel's b**tardised CSV format into sane tab-delimited format then another script (be it in awk or whatever) to do the downstream processing.
        
        Regardless, awk is still our friend. :o}
IF_Rock says:

September 23, 2011 at 19:59

Huh? How are you using apostrophes as string delimiters in awk/gawk?
- waldner says:
  
  September 23, 2011 at 20:29
  
  Well yes, rereading it I see that you are right. The fact is that they were meant to be used in the shell as in -F '^ *"|" *, *"|" *$' 'awk code here', but I ended up writing them in that hybrid and incorrect way.
  I've fixed it now (and included both ways), thanks.

\1