linux - Replace text using awk and sed at every 2-3-4 lines for pdb file
I have a PDB text file with 200,000 rows. The rows look like this:
    compnd
    source
    hetatm 1 ct 100 1 -23.207 17.632 14.543
    hetatm 2 ct 99 1 -22.069 18.353 15.280
    hetatm 3 oh 101 1 -21.074 18.762 14.358
    hetatm 4 f 103 1 -23.816 18.483 13.675
    hetatm 5 f 103 1 -24.119 17.162 15.433
    hetatm 6 f 103 1 -22.680 16.591 13.841
    hetatm 7 hc 104 1 -21.623 17.681 16.014
    hetatm 8 hc 104 1 -22.451 19.218 15.823
    hetatm 9 ho 102 1 -21.040 18.108 13.673
    hetatm 10 ct 100 2 -4.340 -29.478 45.144
    hetatm 11 ct 99 2 -3.051 -29.846 44.395
    hetatm 12 oh 101 2 -1.968 -29.072 44.880
    hetatm 13 f 103 2 -4.217 -29.778 46.464
    hetatm 14 f 103 2 -5.396 -30.156 44.621
    hetatm 15 f 103 2 -4.551 -28.140 45.015
    hetatm 16 hc 104 2 -3.178 -29.656 43.329
    hetatm 17 hc 104 2 -2.829 -30.908 44.511
    hetatm 18 ho 102 2 -2.315 -28.222 45.119
    hetatm 19 ct 100 3 -49.455 -17.542 -31.718
    hetatm 20 ct 99 3 -49.981 -18.984 -31.736
    hetatm 21 oh 101 3 -48.905 -19.897 -31.607
    hetatm 22 f 103 3 -48.867 -17.273 -30.521
    hetatm 23 f 103 3 -50.474 -16.668 -31.929
    hetatm 24 f 103 3 -48.527 -17.405 -32.704
    ...
I have to change the first ct to c1 and the second ct to c2, and likewise f to f1, f2, f3 and hc to h1, h2.
Is it possible to change them with awk or sed in a small script? Each set of c1, c2, f1, f2, f3 is part of the same molecule (trifluoroethanol, TFE), and there are many TFE molecules defined.
So I want:
    compnd
    source
    hetatm 1 c1 100 1 -23.207 17.632 14.543
    hetatm 2 c2 99 1 -22.069 18.353 15.280
    hetatm 3 oh 101 1 -21.074 18.762 14.358
    hetatm 4 f1 103 1 -23.816 18.483 13.675
    hetatm 5 f2 103 1 -24.119 17.162 15.433
    hetatm 6 f3 103 1 -22.680 16.591 13.841
    hetatm 7 h1 104 1 -21.623 17.681 16.014
    hetatm 8 h2 104 1 -22.451 19.218 15.823
    hetatm 9 ho 102 1 -21.040 18.108 13.673
    hetatm 10 c1 100 2 -4.340 -29.478 45.144
    hetatm 11 c2 99 2 -3.051 -29.846 44.395
    hetatm 12 oh 101 2 -1.968 -29.072 44.880
    hetatm 13 f1 103 2 -4.217 -29.778 46.464
    hetatm 14 f2 103 2 -5.396 -30.156 44.621
    hetatm 15 f3 103 2 -4.551 -28.140 45.015
    hetatm 16 h1 104 2 -3.178 -29.656 43.329
    hetatm 17 h2 104 2 -2.829 -30.908 44.511
    hetatm 18 ho 102 2 -2.315 -28.222 45.119
    hetatm 19 c1 100 3 -49.455 -17.542 -31.718
    hetatm 20 c2 99 3 -49.981 -18.984 -31.736
    hetatm 21 oh 101 3 -48.905 -19.897 -31.607
    hetatm 22 f1 103 3 -48.867 -17.273 -30.521
    hetatm 23 f2 103 3 -50.474 -16.668 -31.929
    hetatm 24 f3 103 3 -48.527 -17.405 -32.704
    ...
Thanks.
You can use awk more easily than sed, though I have little doubt that it could be done in sed if you wanted to.
You need to:
- Print lines where the number of fields is 1 (or 2; anything less than 3) unchanged.
- Keep track of the last value in column 3 when there are at least 3 columns.
- If the current column 3 is one of ct, f or hc:
  - If the last value in column 3 was different, replace column 3 with its first letter plus 1, and record 1 as the count.
  - Otherwise, increment the count and output the first letter plus the counter.
- Otherwise output the line unchanged.
Translated into awk, as a script in a file named awk.script, that would be:
    NF < 3 { print; next }
    $3 != "ct" && $3 != "f" && $3 != "hc" { print; next }
    {
        if (old_col3 != $3) { counter = 0 }
        old_col3 = $3
        $3 = substr($3, 1, 1) ++counter
        print
    }
And, when run on the data file (named, unoriginally, data), I get:
    $ awk -f awk.script data
    compnd
    source
    hetatm 1 c1 100 1 -23.207 17.632 14.543
    hetatm 2 c2 99 1 -22.069 18.353 15.280
    hetatm 3 oh 101 1 -21.074 18.762 14.358
    hetatm 4 f1 103 1 -23.816 18.483 13.675
    hetatm 5 f2 103 1 -24.119 17.162 15.433
    hetatm 6 f3 103 1 -22.680 16.591 13.841
    hetatm 7 h1 104 1 -21.623 17.681 16.014
    hetatm 8 h2 104 1 -22.451 19.218 15.823
    hetatm 9 ho 102 1 -21.040 18.108 13.673
    hetatm 10 c1 100 2 -4.340 -29.478 45.144
    hetatm 11 c2 99 2 -3.051 -29.846 44.395
    hetatm 12 oh 101 2 -1.968 -29.072 44.880
    hetatm 13 f1 103 2 -4.217 -29.778 46.464
    hetatm 14 f2 103 2 -5.396 -30.156 44.621
    hetatm 15 f3 103 2 -4.551 -28.140 45.015
    hetatm 16 h1 104 2 -3.178 -29.656 43.329
    hetatm 17 h2 104 2 -2.829 -30.908 44.511
    hetatm 18 ho 102 2 -2.315 -28.222 45.119
    hetatm 19 c1 100 3 -49.455 -17.542 -31.718
    hetatm 20 c2 99 3 -49.981 -18.984 -31.736
    hetatm 21 oh 101 3 -48.905 -19.897 -31.607
    hetatm 22 f1 103 3 -48.867 -17.273 -30.521
    hetatm 23 f2 103 3 -50.474 -16.668 -31.929
    hetatm 24 f3 103 3 -48.527 -17.405 -32.704
    $
This doesn't preserve the spacing in the modified lines, but otherwise does what you need. If you need to preserve the spacing, you have to write a printf() statement to format the fields correctly (in place of print in the last block of code):
printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);
This preserves the spacing, but makes the code less robust in general. It exploits the property that strings shorter than n in %ns are right-justified. It yields:
    compnd
    source
    hetatm 1 c1 100 1 -23.207 17.632 14.543
    hetatm 2 c2 99 1 -22.069 18.353 15.280
    hetatm 3 oh 101 1 -21.074 18.762 14.358
    hetatm 4 f1 103 1 -23.816 18.483 13.675
    hetatm 5 f2 103 1 -24.119 17.162 15.433
    hetatm 6 f3 103 1 -22.680 16.591 13.841
    hetatm 7 h1 104 1 -21.623 17.681 16.014
    hetatm 8 h2 104 1 -22.451 19.218 15.823
    hetatm 9 ho 102 1 -21.040 18.108 13.673
    hetatm 10 c1 100 2 -4.340 -29.478 45.144
    hetatm 11 c2 99 2 -3.051 -29.846 44.395
    hetatm 12 oh 101 2 -1.968 -29.072 44.880
    hetatm 13 f1 103 2 -4.217 -29.778 46.464
    hetatm 14 f2 103 2 -5.396 -30.156 44.621
    hetatm 15 f3 103 2 -4.551 -28.140 45.015
    hetatm 16 h1 104 2 -3.178 -29.656 43.329
    hetatm 17 h2 104 2 -2.829 -30.908 44.511
    hetatm 18 ho 102 2 -2.315 -28.222 45.119
    hetatm 19 c1 100 3 -49.455 -17.542 -31.718
    hetatm 20 c2 99 3 -49.981 -18.984 -31.736
    hetatm 21 oh 101 3 -48.905 -19.897 -31.607
    hetatm 22 f1 103 3 -48.867 -17.273 -30.521
    hetatm 23 f2 103 3 -50.474 -16.668 -31.929
    hetatm 24 f3 103 3 -48.527 -17.405 -32.704
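The right-justification that the printf() format relies on is easy to check in isolation. This is only a small sketch: pad_demo is a made-up helper name, not part of the script above, and the square brackets are there just to make the padding visible.

```shell
# pad_demo is a hypothetical helper, used only for this illustration.
# %4s pads a string shorter than 4 characters with spaces on the left.
pad_demo() {
    awk 'BEGIN { printf("[%4s] [%4s] [%4s]\n", "1", "24", "9999") }'
}
pad_demo    # prints: [   1] [  24] [9999]
```

A string that is already 4 characters wide is printed as is, which is why the format stops lining up once a field outgrows its width.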
It appears that when you get to 10,000 records, the hetatm column and the number column following it are merged into a single column:
    hetatm 21 oh 101 3 -48.905 -19.897 -31.607
    …
    hetatm 9999 ho 102 1111 -24.504 -16.257 -35.613
    hetatm10000 ct 100 1112 9.045 23.978 29.038
    hetatm10001 ct 99 1112 10.488 24.501 29.083
    hetatm10002 oh 101 1112 11.370 23.545 28.522
    hetatm10003 f 103 1112 8.650 23.804 27.749
    hetatm10004 f 103 1112 8.209 24.855 29.654
    hetatm10005 f 103 1112 8.996 22.779 29.679
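One way to see the effect of the merge from awk's point of view is to print the field count for a line on either side of the boundary. This is a quick sketch; count_fields is a hypothetical helper name, and the two input lines are copied from the sample above.

```shell
# count_fields is a hypothetical helper; it prints awk's NF for each input line.
count_fields() {
    awk '{ print NF }'
}

# Record 9999 still has a separate serial-number field; record 10000 does not.
printf '%s\n' \
    'hetatm 9999 ho 102 1111 -24.504 -16.257 -35.613' \
    'hetatm10000 ct 100 1112 9.045 23.978 29.038' | count_fields
# prints 8, then 7
```

The atom name therefore sits in field 3 when there are 8 fields and in field 2 when there are 7.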
It isn't clear what happens when the numbers reach 100,000 and above. However, it is possible to deal with this (for the most part) by counting the columns and working appropriately.
    NF < 7 { print; next }
    NF == 8 && $3 != "ct" && $3 != "f" && $3 != "hc" { print; next }
    NF == 7 && $2 != "ct" && $2 != "f" && $2 != "hc" { print; next }
    NF == 8 {
        if (old_mark != $3) { counter = 0 }
        old_mark = $3
        $3 = substr($3, 1, 1) ++counter
        printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);
    }
    NF == 7 {
        if (old_mark != $2) { counter = 0 }
        old_mark = $2
        $2 = substr($2, 1, 1) ++counter
        printf("%s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7);
    }
Note the use of the 'column number neutral' name old_mark. If row 9,999 contains ct and row 10,000 contains ct, the mapping needs to be continuous (c1, c2), etc. You could also use:
    NF < 7 { print; next }
    NF == 8 && $3 != "ct" && $3 != "f" && $3 != "hc" { print; next }
    NF == 7 && $2 != "ct" && $2 != "f" && $2 != "hc" { print; next }
    {
        colnum = NF - 5
        if (old_mark != $colnum) { counter = 0 }
        old_mark = $colnum
        $colnum = substr($colnum, 1, 1) ++counter
        if (NF == 7)
            printf("%s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7);
        else
            printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);
    }
There might be a way to use a single printf() call, but I doubt if it is worth the effort.
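For what it's worth, one possible way to get down to a single printf() is to pre-format the leading field(s) into one string and address the remaining fields relative to the atom-name column. This is only a sketch under assumptions: rename_atoms, head and name are names made up here, and lines with any field count other than 7 or 8 are passed through unchanged, which is slightly stricter than the scripts above.

```shell
# rename_atoms is a hypothetical wrapper name; the awk logic follows the
# column-counting script, with the leading field(s) pre-formatted so that
# one printf() serves both the 7-field and the 8-field layout.
rename_atoms() {
    awk '
    NF != 7 && NF != 8 { print; next }   # assumption: only these two layouts are rewritten
    { colnum = NF - 5 }                  # 2 when the serial is merged into field 1, else 3
    $colnum != "ct" && $colnum != "f" && $colnum != "hc" { print; next }
    {
        if (old_mark != $colnum) { counter = 0 }
        old_mark = $colnum
        name = substr($colnum, 1, 1) (++counter)
        head = (NF == 7) ? $1 : sprintf("%s %4s", $1, $2)
        printf("%s %3s %4s %5s %11s %7s %7s\n", head, name,
               $(colnum + 1), $(colnum + 2), $(colnum + 3), $(colnum + 4), $(colnum + 5))
    }' "$@"
}
```

It would be run as `rename_atoms data`, or with the data on standard input. Whether this reads better than the two explicit printf() calls is a matter of taste.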