This is another thing people do all the time with awk. Although the CSV format should be more or less standardized, it seems there are still a number of subtle variations floating around. Let's look at some of them.
Simple CSV files (with fields separated by commas, where commas cannot appear anywhere else) are easily parsed by setting FS to ",". There's not much to add to that, so we won't go into further detail here.
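For example (a trivial sketch, assuming a file data.csv in that simple format):

# print the second field of every record
awk -F ',' '{ print "Field 2 is " $2 }' data.csv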
If there can be spaces around the fields, and we don't want them, as in eg
field1 , field2 , field3 , field4
Exploiting the fact that FS can be a regex, we could try using something like
FS="^ *| *, *| *$" # or, on the command line awk -F '^ *| *, *| *$' ...
However, this can be problematic, because the actual data fields might end up corresponding either to awk fields $1 ... $NF or to $2 ... $(NF+1), depending on whether the line has leading spaces or not. So in this case it's probably better to parse with a plain FS="," and strip the spaces from each field ourselves:
# FS="," for(i=1;i<=NF;i++){ gsub(/^ *| *$/,"",$i) print "Field " i " is " $i }
If commas can appear in fields provided that they are escaped, like this
field1,field2\,with\,commas,field3,field4\,
a common trick is to temporarily replace all the escaped commas with some character or sequence that cannot appear in the data (SUBSEP is a good candidate), let awk split on the remaining commas, and then restore the escaped commas in each resulting field:
# FS="," or whatever appropriate gsub(/\\,/,SUBSEP) for (i=1;i<=NF;i++) { f=$i gsub(SUBSEP,"\\,",f) print "Field " i " is " f }
Another common CSV format is
"field1","field2","field3,with,commas","field4"
Here fields can contain unescaped commas, but we assume that double quotes cannot occur inside fields. This is easily parsed using
FS="^\"|\",\"|\"$" # or, if you like FS="\",\"|\"" # or on the command line (perhaps clearer) awk -F '^"|","|"$' ... awk -F '","|"' ...
but keeping in mind that the actual fields will be in positions $2 ... $(NF-1), since the leading and trailing quotes produce empty fields at $1 and $NF. We can extend this to allow for spaces around the quoted fields, like eg
"field1" , "field2", "field3,with,commas" , "field4"
by using
FS="^ *\"|\" *, *\"|\" *$" # or, on the command line awk -F '^ *"|" *, *"|" *$' ...
usable fields will still be in positions $2 ... $(NF-1).
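For example, a loop over the usable fields could look like this (a small sketch):

# data fields are in $2 .. $(NF-1); $1 and $NF are empty
for (i=2; i<NF; i++)
  print "Field " i-1 " is " $i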
Another CSV format is as follows (the only quoted fields are those that contain commas):
field1, "field2,with,commas" , field3 , "field4,foo"
We have a mixture of quoted and unquoted fields here, which cannot be parsed directly by any value of FS (that I know of, at least). However, we can still get the fields using match() in a loop (and cheating a bit):
c=0
$0=$0","                          # yes, cheating
while($0) {
  match($0,/ *"[^"]*" *,|[^,]*,/)
  f=substr($0,RSTART,RLENGTH)     # save what matched in f
  gsub(/^ *"?|"? *,$/,"",f)       # remove extra stuff
  print "Field " ++c " is " f
  $0=substr($0,RLENGTH+1)         # "consume" what matched
}
As the complexity of the format increases (for example when quotes are allowed in fields, escaped in some way), awk solutions become more fragile. Although I should not say this here, for anything more complex than the last example, I suggest using other tools (eg, Perl just to name one). Btw, it looks like there is an awk CSV parsing library here (I have not tried it, but it looks quite good).
We should be using a recent version of gawk anyway, and then there is a rather simple solution using FPAT (field pattern) instead of FS (field separator), which is documented in the GAWK manual:
https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
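For reference, the approach looks roughly like this (a sketch modeled on the manual's example; the FPAT regexp is the one given there):

BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
{
  print "NF =", NF
  for (i = 1; i <= NF; i++)
    printf("$%d = <%s>\n", i, $i)
}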
However, the regexp shown there is not completely RFC 4180 compliant, because it does not account for doubled quotes ("") inside a quoted field, as in "He shouted ""Hello""".
Here is my solution to parse CSV data (in the format defined in RFC 4180) using gawk's patsplit() function:
http://lists.gnu.org/archive/html/bug-gawk/2015-07/msg00002.html
This solution only works with gawk 4.0 and above.
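The details are in the linked message; just to give the general shape, here is an illustrative sketch (mine, not the code from the post) of a patsplit()-based parser whose pattern also accepts "" inside quoted fields:

BEGIN { pat = "([^,]*)|(\"([^\"]|\"\")*\")" }
{
  n = patsplit($0, fld, pat)
  for (i = 1; i <= n; i++) {
    f = fld[i]
    if (f ~ /^"/)                     # quoted field: strip the quotes...
      f = substr(f, 2, length(f) - 2)
    gsub(/""/, "\"", f)               # ...and undouble the embedded quotes
    print "Field " i " is " f
  }
}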
Here's a regex that can be used with the match() example above:
([^\\,"]|(\\.))*($|,)|(^"([^"\\]|(\\.))*"($|,))
It allows you to parse a mixture of "text" and text in the CSV, and you can backslash-escape any character to treat it as data.
The data extracted needs to be postprocessed to remove the quoting if appropriate.
Just another idea.
Example:
"Test",Example containing \",Example containing \,,Another example containing \\
For a different approach, see https://github.com/dbro/csvquote - it's a script I wrote that sanitizes the quoted data so that awk can work with it easily (no FPAT required, handles double quote marks), and then restores the special characters after awk is done.
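Usage is a pipeline along these lines (a sketch; see the project README for the exact options):

csvquote data.csv | awk -F, '{ print $2 }' | csvquote -u   # -u restores the special characters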
Here's a different way to handle the mixed-fields format (i.e. the Excel output format).
BEGIN {FS="\"";OFS=" "}; {for (i = 1; i <= NF; i += 2) { gsub(",", "\t", $i)}; print}
Rationale:
Use the quotes to split the line. In a well-formed line, we will get an odd number of awk fields. The even-numbered fields will each be the contents of a single quoted CSV field, while the odd-numbered fields will hold everything between the quoted ones.
So, given the input line of...
field1, "field2,with,commas" , field3 , "field4,foo"
we would initially split into 5 fields which are...
field1,
field2,with,commas
, field3 ,
field4,foo
and a blank one at the end.
We process these, converting any commas in the odd-numbered fields into tabs, and print them out to get the desired effect. For bonus points, we can strip away leading and trailing whitespace by specifying a slightly different regex for gsub.
BEGIN {FS="\"";OFS=" "}; {for (i = 1; i <= NF; i += 2) { gsub(" *, *", "\t", $i)}; print}
awk is your friend.
You probably want to set OFS to be empty rather than a space (which is the default anyway), otherwise you'll get spurious spaces between fields in the output.
Also consider that your code creates a different number of awk fields depending on which CSV fields are quoted, eg this input

field1, "field2,with,commas" , field3 , "field4,foo"

is treated differently from

"field1" , "field2,with,commas" , "field3" , "field4,foo"

although the replacement of " *, *" with tabs in the odd-numbered fields makes it look like the end result is similar.
You're right about the OFS - my bad on the typing.
But I beg to differ with you over your second assertion. Provided that the double quotes are correctly balanced, my code works regardless of how many or how few CSV fields are quoted.
Am I missing something?
I didn't say it wouldn't work, just that internally it is treated differently, and the end result happens to be the same due to the way the substitutions are made. However if you start to do something a bit more involved with fields, you'll have to introduce more logic in the code to handle all the cases.
As a simple example, if you want to prepend "XXX" to all fields before printing them, you'll probably have to do some sort of loop over the fields that awk builds, but which awk fields are CSV fields and which are not depends on which CSV fields were quoted in the input. If the input is

field1, "field2,with,commas" , field3 , "field4,foo"

you'll have "real" CSV fields in $1, $2, $3, $4 (although they would need to be cleaned up). But if the input is

field1, "field2,with,commas" , "field3" , "field4,foo"

then the CSV fields end up in $1, $2, $4 and $6. If the positions of the quoted fields change, then the mapping to awk fields changes again. Sure, you can then check every (even) awk field for embedded commas or use other tricks to find out, but imho that complicates the code.
If your goal is just to transform an input line into another with tab-separated fields and double quotes removed, without looking at the actual fields, then the code is fine provided OFS is set to the empty string.
Again, not sure that I agree with you. If I want to prepend "XXX" to each field, then all I need to do is a little bit of extra work on field 1, to account for the fact that I'm not replacing a comma in front of it.
BEGIN {FS="\"";OFS=""}; {for (i = 1; i <= NF; i += 2) { gsub(" *, *", "\tXXX", $i)}; $1="XXX"$1; print }
works for both of your input examples.
I chose a bad example, but hopefully you get the idea. Let's assume we have to do something with the third CSV field. The input may vary from

field1,field2,field3,field4

which gives you a single awk field (everything in $1), to

"field1","field2","field3","field4"

which gives you 9 awk fields. And of course, any combination in between these two extremes. For each different combination of quoted/unquoted fields in the input, the mapping to awk fields changes.
In many cases, the third CSV field will not come alone so you'd have to find it first (find in which awk field it ended up), isolate it if needed, then finally do what you have to do. I don't think this is very practical.
OK. I understand what you're suggesting is the problem, but I would naturally deal with processing the third field in a subsequent process through a pipe rather than trying to do everything in a single script.
One re-usable awk script to convert from Excel's b**tardised CSV format into sane tab-delimited format then another script (be it in awk or whatever) to do the downstream processing.
Regardless, awk is still our friend. :o}
Huh? How are you using apostrophes as string delimiters in awk/gawk?
Well yes, rereading it I see that you are right. The fact is that they were meant to be used in the shell as in -F '^ *"|" *, *"|" *$' 'awk code here', but I ended up writing them in that hybrid and incorrect way.
I've fixed it now (and included both ways), thanks.