Splitting a csv with awk: how to consider returns?

May 15, 2012

I have a file:

field1|field2|field3|f41;f42|f5
field1|field2|field3|f41|f5|
field1|field2|field3|f41;f42;f43|f5

I want to parse it and obtain:

field1|field2|field3|f41|f5
field1|field2|field3|f42|f5
...

In short, I want to split each line on the semicolons in field 4. My awk script is the following:

awk < myfile.txt -F\| '{  n=split($4,a,";"); print $1
for(i=0; ++i <= n;) print $1"|"$2"|"$3"|"a[i]"|"$5"|";  }'

It works, but on lines that do not end with "|" the first character of the following line disappears! For example, for the given file I get:

field1|field2|field3|f41|f5
ield1|field2|field3|f42|f5

I think this is due to the fact that there is no "|" at the end of the line. Is there a way to tell awk to consider the carriage return?

Don't write loops using wacky syntax like for(i=0; ++i <= n;); it just obfuscates your code (e.g. you need to think about whether i is 0 or 1 the first time through the loop, since it's not stated). Write them as they were intended to be written, for (init; condition; increment): for(i=1; i<=n; i++).
Don't redirect input to awk, e.g. awk < file 'script'; let awk open the file itself, awk 'script' file, so that you have access to the file name (FILENAME) in your scripts.
Don't add spurious semicolons throughout your script; it's not C.
Don't print a hard-coded field separator multiple times, e.g. print $1"|"$2"|"$3"|"a[i]"|"$5; use OFS as it was designed instead: OFS="|"; ...; print $1,$2,$3,a[i],$5.
Don't use strings in a regexp context unless you have an excellent reason to obfuscate, complicate, and reduce the efficiency of your code; e.g. instead of split($4,a,";") you should use split($4,a,/;/).
Use white space and indentation; it's surprisingly cheap.

So step 1 is to rewrite your script:

awk < myfile.txt -F\| '{  n=split($4,a,";"); print $1
for(i=0; ++i <= n;) print $1"|"$2"|"$3"|"a[i]"|"$5"|";  }'

as:

awk '
BEGIN { FS=OFS="|" }
{
    n=split($4,a,/;/)
    print $1
    for(i=1; i<=n; i++)
        print $1, $2, $3, a[i], $5, ""
}
' myfile.txt

From that, after fixing the for loop syntax, we can see that you are printing the first field twice, the first time on a line of its own, so we can change it to:

$ awk '
BEGIN { FS=OFS="|" }
{
    n=split($4,a,/;/)
    for(i=1; i<=n; i++)
        print $1, $2, $3, a[i], $5, ""
}
' myfile.txt
field1|field2|field3|f41|f5|
field1|field2|field3|f42|f5|
field1|field2|field3|f41|f5|
field1|field2|field3|f41|f5|
field1|field2|field3|f42|f5|
field1|field2|field3|f43|f5|

So, is that what you wanted? Unfortunately you used the same values in the same field positions on all input lines, so we can't tell which output lines/fields are coming from which input lines/fields, and you didn't post the full expected output, so I can't tell whether the above is your expected output or not. It's also not clear whether you want to print an empty field at the end of every output line, or whether you want to hard-code the number of output fields.
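
If the goal is to keep whatever fields each input line already has, rather than hard-coding five fields plus a trailing empty one, one possible sketch is to overwrite field 4 and print the rebuilt record. This is only an illustration under assumptions: the sample file is recreated with printf from the three lines in the question, and the function name split_field4 is made up here.

```shell
# Recreate the sample input from the question (assumed to be the whole file).
printf '%s\n' \
    'field1|field2|field3|f41;f42|f5' \
    'field1|field2|field3|f41|f5|' \
    'field1|field2|field3|f41;f42;f43|f5' > myfile.txt

# Replace field 4 with each of its semicolon-separated parts in turn.
# Assigning to $4 makes awk rebuild $0 using OFS, so every other field,
# including a trailing empty one, is carried over as it appeared in the input.
split_field4() {
    awk '
    BEGIN { FS=OFS="|" }
    {
        n = split($4, a, /;/)
        for (i=1; i<=n; i++) {
            $4 = a[i]
            print
        }
    }
    ' "$1"
}

split_field4 myfile.txt
```

With this variant only the second input line keeps its trailing "|", because only that line had one. Note that a line whose field 4 is empty produces no output at all, since split() returns 0 for it; whether that is acceptable depends on the expected output, which the question does not state.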

Oh, and if characters are disappearing in your output, it's because you have control-Ms or other spurious control characters in your input file. Use cat -v to see them, and dos2unix or similar to remove them if they are control-Ms.
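
If converting the file is not an option, a trailing carriage return can also be stripped inside awk before the fields are used. This is a minimal sketch, assuming the spurious characters are DOS line endings (\r\n); the file name dosfile.txt and the function name strip_cr_and_split are made up for the demonstration.

```shell
# Build a one-line test file with a DOS (CRLF) line ending.
printf 'field1|field2|field3|f41;f42|f5\r\n' > dosfile.txt

# cat -v makes the carriage return visible as ^M at the end of the line.
cat -v dosfile.txt

# Remove a trailing carriage return from each record, then split field 4.
strip_cr_and_split() {
    awk '
    BEGIN { FS=OFS="|" }
    {
        sub(/\r$/, "")              # drop the control-M, if present
        n = split($4, a, /;/)
        for (i=1; i<=n; i++)
            print $1, $2, $3, a[i], $5
    }
    ' "$1"
}

strip_cr_and_split dosfile.txt
```

Because sub() modifies $0, awk re-splits the record afterwards, so $5 no longer carries the carriage return when it is printed.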
