This is another thing people do all the time with awk. Although the CSV format should be more or less standardized, it seems there are still a number of subtle variations floating around. Let's look at some of them.
Simple CSV files (with fields separated by commas, and commas cannot appear anywhere else) are easily parsed by setting FS to ",", so we won't go into further detail here, as there's not much to add to it.
If there can be spaces around fields, and we don’t want them, like eg
field1 , field2 , field3 , field4
Exploiting the fact that FS can be a regex, we could try using something like
FS="^ *| *, *| *$" # or, on the command line awk -F '^ *| *, *| *$' ...
However, this can be problematic because the actual data fields might end up corresponding either to awk fields
# FS=","
for(i=1;i<=NF;i++){
gsub(/^ *| *$/,"",$i)
print "Field " i " is " $i
}
If commas can appear in fields provided that they are escaped, like this
field1,field2\,with\,commas,field3,field4\,
a common trick is to replace all the
# FS="," or whatever appropriate
gsub(/\\,/,SUBSEP)
for (i=1;i<=NF;i++) {
f=$i
gsub(SUBSEP,"\\,",f)
print "Field " i " is " f
}
Another common CSV format is
"field1","field2","field3,with,commas","field4"
Assuming double quotes cannot occur in fields, but here fields can contain unescaped commas. This is easily parsed using
FS="^\"|\",\"|\"$" # or, if you like FS="\",\"|\"" # or on the command line (perhaps clearer) awk -F '^"|","|"$' ... awk -F '","|"' ...
but keeping in mind that the actual fields will be in position
"field1" , "field2", "field3,with,commas" , "field4"
by using
FS="^ *\"|\" *, *\"|\" *$" # or, on the command line awk -F '^ *"|" *, *"|" *$' ...
usable fields will still be in positions
Another CSV format is as follows (the only quoted fields are those that contain commas):
field1, "field2,with,commas" , field3 , "field4,foo"
We have a mixture of quoted and unquoted fields here, which cannot parsed directly by any value of FS (that I know of, at least). However, we can still get the fields using match() in a loop (and cheating a bit):
c=0
$0=$0"," # yes, cheating
while($0) {
match($0,/ *"[^"]*" *,|[^,]*,/)
f=substr($0,RSTART,RLENGTH) # save what matched in f
gsub(/^ *"?|"? *,$/,"",f) # remove extra stuff
print "Field " ++c " is " f
$0=substr($0,RLENGTH+1) # "consume" what matched
}
As the complexity of the format increases (for example when quotes are allowed in fields, escaped in some way), awk solutions become more fragile. Although I should not say this here, for anything more complex than the last example, I suggest using other tools (eg, Perl just to name one). Btw, it looks like there is an awk CSV parsing library here (I have not tried it, but it looks quite good).
Huh? How are you using apostrophes as string delimiters in awk/gawk?
Well yes, rereading it I see that you are right. The fact is that they were meant to be used in the shell as in -F '^ *"|" *, *"|" *$' 'awk code here', but I ended up writing them in that hybrid and incorrect way.
I've fixed it now (and included both ways), thanks.