Splitting a csv with awk: how to consider returns?
I have this file:

    field1|field2|field3|f41;f42|f5
    field1|field2|field3|f41|f5|
    field1|field2|field3|f41;f42;f43|f5

I want to parse it and obtain:

    field1|field2|field3|f41|f5
    field1|field2|field3|f42|f5
    ...

In short, I want to split each line on the semicolons in field 4. My awk script is the following:

    awk < myfile.txt -F\| '{
    n=split($4,a,";");
    print $1
    for(i=0; ++i <= n;)
    print $1"|"$2"|"$3"|"a[i]"|"$5"|";
    }'

It works, but for lines not ending with "|" the first character of the following line disappears! For example, for the given file I get:

    field1|field2|field3|f41|f5
    ield1|field2|field3|f42|f5

I think this is due to the fact that there is no "|" at the end of the line. Is there a way to tell awk to consider the carriage return?

- Don't write loops using wacky syntax like for(i=0; ++i <= n;), as it obfuscates your code (e.g. you need to think about whether i is 0 or 1 the first time through the loop, since it's not stated). Write them as they're intended to be written, for (init; condition; increment): for(i=1; i<=n; i++).
- Don't redirect input to awk, e.g. awk < file 'script'. Let awk open the file, awk 'script' file, so you have access to FILENAME in your scripts.
- Don't add spurious semicolons throughout your script; it's not C.
- Don't print a hard-coded field separator multiple times, e.g. print $1"|"$2"|"$3"|"a[i]"|"$5. Use OFS as it was designed to be used instead: OFS="|"; ...; print $1,$2,$3,a[i],$5.
- Don't use strings in a regexp context unless you have an excellent reason to obfuscate, complicate, and reduce the efficiency of your code, e.g. instead of split($4,a,";") you should use split($4,a,/;/).
- Use white space and indentation; it's surprisingly cheap.

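As a minimal sketch of two of the points above (letting awk open the file, and letting OFS supply the separators), the following uses a hypothetical one-line sample file created with mktemp. The variable names sample, with_name, and redirected are illustrative only and are not part of the original script.

```shell
# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
printf 'a|b;c|d\n' > "$sample"

# awk opens the file itself, so FILENAME holds its name,
# and the commas in print are replaced by OFS on output.
with_name=$(awk 'BEGIN { FS = OFS = "|" } { print FILENAME, $1, $3 }' "$sample")

# With redirection awk only sees standard input, so FILENAME
# does not contain the file's name.
redirected=$(awk 'BEGIN { FS = OFS = "|" } { print FILENAME, $1, $3 }' < "$sample")

printf '%s\n%s\n' "$with_name" "$redirected"
rm -f "$sample"
```

The first line of output starts with the temporary file's path; the second does not, because awk was never told the name of the file it was reading.
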
So step 1 is to rewrite your script:

    awk < myfile.txt -F\| '{
    n=split($4,a,";");
    print $1
    for(i=0; ++i <= n;)
    print $1"|"$2"|"$3"|"a[i]"|"$5"|";
    }'

as:

    awk '
    BEGIN { FS=OFS="|" }
    {
        n=split($4,a,/;/)
        print $1
        for(i=1; i<=n; i++)
            print $1, $2, $3, a[i], $5, ""
    }
    ' myfile.txt

From that, after fixing the for loop syntax, we can see that you are printing the first field twice, the first time on a line of its own, so we can change it to:

    $ awk '
    BEGIN { FS=OFS="|" }
    {
        n=split($4,a,/;/)
        for(i=1; i<=n; i++)
            print $1, $2, $3, a[i], $5, ""
    }
    ' myfile.txt
    field1|field2|field3|f41|f5|
    field1|field2|field3|f42|f5|
    field1|field2|field3|f41|f5|
    field1|field2|field3|f41|f5|
    field1|field2|field3|f42|f5|
    field1|field2|field3|f43|f5|

So, is that what you wanted? Unfortunately, you used the same values in the same field positions on all input lines, so we can't tell which output lines/fields come from which input lines/fields, and you didn't post your full expected output, so we can't tell whether the above is the expected output. It's also not clear whether you want to print an empty field at the end of every output line, or whether you want to hard-code the number of output fields.

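If the number of output fields should not be hard-coded, one possible variation (an assumption about the intent, not something the question confirms) is to overwrite field 4 with each of its parts and print the whole record. The input file below is a hypothetical copy of the three sample records, and the names input, parts, and expanded are illustrative.

```shell
# Hypothetical input mirroring the three sample records from the question.
input=$(mktemp)
printf '%s\n' \
    'field1|field2|field3|f41;f42|f5' \
    'field1|field2|field3|f41|f5|' \
    'field1|field2|field3|f41;f42;f43|f5' > "$input"

# Replace field 4 with each of its parts in turn and print the whole record;
# assigning to $4 makes awk rebuild $0 using OFS, keeping all other fields.
expanded=$(awk '
BEGIN { FS = OFS = "|" }
{
    n = split($4, parts, /;/)
    for (i = 1; i <= n; i++) {
        $4 = parts[i]
        print
    }
}
' "$input")

printf '%s\n' "$expanded"
rm -f "$input"
```

With this version a trailing "|" appears in the output only where the input line had one, and a record whose field 4 is empty produces no output at all, so it is a sketch to adapt rather than a drop-in replacement.
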
Oh, and if characters are disappearing in your output, it's because you have control-Ms or other spurious control characters in your input file. Use cat -v to see them, and dos2unix or similar to remove them if they are control-Ms.
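
A small sketch of that check, assuming the stray characters are carriage returns at the end of each line: it builds a hypothetical CRLF file, shows what cat -v reports, and strips the carriage returns inside awk as one alternative to dos2unix. The names crlf, visible, and cleaned are illustrative.

```shell
# Hypothetical file with DOS (CRLF) line endings, built just for this demo.
crlf=$(mktemp)
printf 'field1|field2|field3|f41;f42|f5\r\nfield1|field2|field3|f41|f5|\r\n' > "$crlf"

# cat -v renders each carriage return as ^M so it becomes visible.
visible=$(cat -v "$crlf")

# Strip one trailing carriage return per record, then print the record.
cleaned=$(awk '{ sub(/\r$/, "") } 1' "$crlf")

printf '%s\n' "$visible" "$cleaned"
rm -f "$crlf"
```

The same sub() call can be placed at the top of the main block of the splitting script so that the last field no longer carries a carriage return into the output.
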