Splitting a csv with awk: how to consider returns?
I have a file:

    field1|field2|field3|f41;f42|f5
    field1|field2|field3|f41|f5|
    field1|field2|field3|f41;f42;f43|f5

I want to parse it and obtain:

    field1|field2|field3|f41|f5
    field1|field2|field3|f42|f5
    ...

In short, I want to do a sub-parsing on the semicolon in field 4. My awk script is the following:

    awk < myfile.txt -F\| '{
        n=split($4,a,";");
        print $1
        for(i=0; ++i <= n;)
            print $1"|"$2"|"$3"|"a[i]"|"$5"|";
    }'

It works, but for lines not ending with "|" the first character of the following line disappears! For example, given the file above I get:

    field1|field2|field3|f41|f5
    ield1|field2|field3|f42|f5

I think this is due to the fact that there is no "|" at the end of the line. Is there a way to tell awk to consider the carriage return?
- Don't write loops using wacky syntax. for(i=0; ++i <= n;) obfuscates your code (e.g. you need to think about whether i is 0 or 1 the first time through the loop, since it's not stated). Write loops the way they were intended to be written, for (init; condition; increment): for(i=1; i<=n; i++).
- Don't redirect input to awk, e.g. awk < file 'script'. Let awk open the file, awk 'script' file, so that you have access to FILENAME in your scripts.
- Don't add spurious semicolons throughout your script. It's not C.
- Don't print a hard-coded field separator multiple times, e.g. print $1"|"$2"|"$3"|"a[i]"|"$5. Use OFS as it was designed instead: OFS="|"; ...; print $1,$2,$3,a[i],$5.
- Don't use strings in a regexp context unless you have an excellent reason to obfuscate, complicate, and reduce the efficiency of your code, e.g. instead of split($4,a,";") you should use split($4,a,/;/).
- Use white space/indentation. It's surprisingly cheap.
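A small sketch of the FILENAME point from the list above. It assumes a throwaway file named sample.txt, created just for this demo and not part of the question:

```shell
# Hypothetical sample file, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'a|b\nc|d\n' > "$tmpdir/sample.txt"

# awk opens the file itself, so FILENAME holds the name it was given.
named=$(awk -F'|' '{ print FILENAME ": " $1 }' "$tmpdir/sample.txt")
printf '%s\n' "$named"

rm -r "$tmpdir"
```

When the same file is fed in with a shell redirection instead, awk only sees standard input and has no file name to report.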
So step 1 is to rewrite your script:

    awk < myfile.txt -F\| '{
        n=split($4,a,";");
        print $1
        for(i=0; ++i <= n;)
            print $1"|"$2"|"$3"|"a[i]"|"$5"|";
    }'

as:

    awk '
    BEGIN { FS=OFS="|" }
    {
        n=split($4,a,/;/)
        print $1
        for(i=1; i<=n; i++)
            print $1, $2, $3, a[i], $5, ""
    }
    ' myfile.txt

From that, after fixing the for loop syntax, we can see that you are printing the first field twice, the first time on a line of its own, so we can change it to:

    $ awk '
    BEGIN { FS=OFS="|" }
    {
        n=split($4,a,/;/)
        for(i=1; i<=n; i++)
            print $1, $2, $3, a[i], $5, ""
    }
    ' myfile.txt
    field1|field2|field3|f41|f5|
    field1|field2|field3|f42|f5|
    field1|field2|field3|f41|f5|
    field1|field2|field3|f41|f5|
    field1|field2|field3|f42|f5|
    field1|field2|field3|f43|f5|

So, is that what you wanted? Unfortunately you used the same values in the same field positions on all input lines, so we can't tell which output lines/fields are coming from which input lines/fields, and you didn't post the full expected output, so we can't tell whether the above is the expected output or not. It's also not clear whether you want to print an empty field at the end of every output line, or whether you want to hard-code the number of output fields.
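If the goal turns out to be keeping every other field exactly as it was on the input line (including an empty trailing field only where the input had one), one possibility is to assign each piece back to $4 and print the whole record instead of listing the fields by hand. This is only a sketch under that assumption about the desired output. The data is the sample input from the question, and split_field4 is a hypothetical helper name.

```shell
# split_field4 is a hypothetical helper name, used only in this sketch.
split_field4() {
    awk '
    BEGIN { FS = OFS = "|" }
    {
        n = split($4, a, /;/)
        for (i = 1; i <= n; i++) {
            $4 = a[i]   # assigning to a field rebuilds $0 using OFS
            print       # prints all fields of the record, however many there are
        }
    }
    ' "$@"
}

# Sample input from the question, written to a temporary file for the demo.
tmpfile=$(mktemp)
printf '%s\n' \
    'field1|field2|field3|f41;f42|f5' \
    'field1|field2|field3|f41|f5|' \
    'field1|field2|field3|f41;f42;f43|f5' > "$tmpfile"

result=$(split_field4 "$tmpfile")
printf '%s\n' "$result"

rm -f "$tmpfile"
```

With this variant only the second input line keeps its trailing "|", because only that line had a sixth (empty) field. A record whose field 4 is empty produces no output at all, since split() then returns 0.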
Oh, and if characters are disappearing in your output, it's because you have control-Ms or other spurious control characters in your input file. Use cat -v to see them, and dos2unix or similar to remove them if they are control-Ms.
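To illustrate the control-M point, here is a sketch that builds a small file with DOS line endings, shows what cat -v reveals, and strips the trailing carriage return inside awk in case dos2unix is not installed. The CRLF endings are an assumption made for the demo, since the question does not show the raw bytes of the real file.

```shell
tmpfile=$(mktemp)
# Two records with CRLF line endings, mimicking a file saved on Windows.
printf 'field1|field2|field3|f41;f42|f5\r\nfield1|field2|field3|f41|f5\r\n' > "$tmpfile"

# cat -v shows each carriage return as ^M at the end of the line.
cat -v "$tmpfile"

# Remove a trailing carriage return from every record before splitting field 4.
cleaned=$(awk '
BEGIN { FS = OFS = "|" }
{
    sub(/\r$/, "")              # modifying $0 makes awk re-split the fields
    n = split($4, a, /;/)
    for (i = 1; i <= n; i++)
        print $1, $2, $3, a[i], $5
}
' "$tmpfile")
printf '%s\n' "$cleaned"

rm -f "$tmpfile"
```

Without the sub() call the carriage return stays attached to the last field. When that field is printed to a terminal, the cursor jumps back to the start of the line, and whatever is printed next overwrites the first characters, which is consistent with characters appearing to vanish.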