linux - Replace text using awk and sed at every 2-3-4 lines for pdb file


I have a PDB text file with 200,000 rows. The rows look like this:

    compnd source
    hetatm    1  ct  100     1     -23.207  17.632  14.543
    hetatm    2  ct   99     1     -22.069  18.353  15.280
    hetatm    3  oh  101     1     -21.074  18.762  14.358
    hetatm    4  f   103     1     -23.816  18.483  13.675
    hetatm    5  f   103     1     -24.119  17.162  15.433
    hetatm    6  f   103     1     -22.680  16.591  13.841
    hetatm    7  hc  104     1     -21.623  17.681  16.014
    hetatm    8  hc  104     1     -22.451  19.218  15.823
    hetatm    9  ho  102     1     -21.040  18.108  13.673
    hetatm   10  ct  100     2      -4.340 -29.478  45.144
    hetatm   11  ct   99     2      -3.051 -29.846  44.395
    hetatm   12  oh  101     2      -1.968 -29.072  44.880
    hetatm   13  f   103     2      -4.217 -29.778  46.464
    hetatm   14  f   103     2      -5.396 -30.156  44.621
    hetatm   15  f   103     2      -4.551 -28.140  45.015
    hetatm   16  hc  104     2      -3.178 -29.656  43.329
    hetatm   17  hc  104     2      -2.829 -30.908  44.511
    hetatm   18  ho  102     2      -2.315 -28.222  45.119
    hetatm   19  ct  100     3     -49.455 -17.542 -31.718
    hetatm   20  ct   99     3     -49.981 -18.984 -31.736
    hetatm   21  oh  101     3     -48.905 -19.897 -31.607
    hetatm   22  f   103     3     -48.867 -17.273 -30.521
    hetatm   23  f   103     3     -50.474 -16.668 -31.929
    hetatm   24  f   103     3     -48.527 -17.405 -32.704
    ...

I have to change the first ct to c1 and the second ct to c2, and in the same way f to f1, f2, f3 and hc to h1, h2.

Is it possible to change them with awk or sed in a small script? Each c1-c2 and f1, f2, f3 group is part of the same molecule (trifluoroethanol, TFE), and there are many TFE molecules defined.

So I want:

    compnd source
    hetatm    1  c1  100     1     -23.207  17.632  14.543
    hetatm    2  c2   99     1     -22.069  18.353  15.280
    hetatm    3  oh  101     1     -21.074  18.762  14.358
    hetatm    4  f1  103     1     -23.816  18.483  13.675
    hetatm    5  f2  103     1     -24.119  17.162  15.433
    hetatm    6  f3  103     1     -22.680  16.591  13.841
    hetatm    7  h1  104     1     -21.623  17.681  16.014
    hetatm    8  h2  104     1     -22.451  19.218  15.823
    hetatm    9  ho  102     1     -21.040  18.108  13.673
    hetatm   10  c1  100     2      -4.340 -29.478  45.144
    hetatm   11  c2   99     2      -3.051 -29.846  44.395
    hetatm   12  oh  101     2      -1.968 -29.072  44.880
    hetatm   13  f1  103     2      -4.217 -29.778  46.464
    hetatm   14  f2  103     2      -5.396 -30.156  44.621
    hetatm   15  f3  103     2      -4.551 -28.140  45.015
    hetatm   16  h1  104     2      -3.178 -29.656  43.329
    hetatm   17  h2  104     2      -2.829 -30.908  44.511
    hetatm   18  ho  102     2      -2.315 -28.222  45.119
    hetatm   19  c1  100     3     -49.455 -17.542 -31.718
    hetatm   20  c2   99     3     -49.981 -18.984 -31.736
    hetatm   21  oh  101     3     -48.905 -19.897 -31.607
    hetatm   22  f1  103     3     -48.867 -17.273 -30.521
    hetatm   23  f2  103     3     -50.474 -16.668 -31.929
    hetatm   24  f3  103     3     -48.527 -17.405 -32.704
    ...

Thanks.

You can use awk more easily than sed, though I have little doubt that it could be done in sed if you wanted to.

You need to:

  • Print lines where the number of fields is 1 (or 2; that is, fewer than 3).
  • Keep track of the last value in column 3 when there are at least 3 columns.
  • If the current column 3 is one of ct, f or hc:
    • If the last value in column 3 is different, replace the input column 3 with its first letter plus 1; record that 1 was output.
    • Otherwise, increment the count and output the first letter plus the counter.
  • Otherwise, output the line unchanged.

Translated into an awk script in a file, awk.script, that would be:

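    # Header lines (fewer than 3 fields) pass through unchanged.
    NF < 3 { print; next }
    # Atoms that are not ct, f or hc pass through unchanged.
    $3 != "ct" && $3 != "f" && $3 != "hc" { print; next }
    {
      # Restart the numbering whenever the atom type changes.
      if (old_col3 != $3) { counter = 0 }
      old_col3 = $3
      # First letter of the type followed by the running count: c1, c2, f1, ...
      $3 = substr($3, 1, 1) ++counter
      print
    }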

When run on your data file (named, unoriginally, data), I get:

    $ awk -f awk.script data
    compnd source
    hetatm 1 c1 100 1 -23.207 17.632 14.543
    hetatm 2 c2 99 1 -22.069 18.353 15.280
    hetatm    3  oh  101     1     -21.074  18.762  14.358
    hetatm 4 f1 103 1 -23.816 18.483 13.675
    hetatm 5 f2 103 1 -24.119 17.162 15.433
    hetatm 6 f3 103 1 -22.680 16.591 13.841
    hetatm 7 h1 104 1 -21.623 17.681 16.014
    hetatm 8 h2 104 1 -22.451 19.218 15.823
    hetatm    9  ho  102     1     -21.040  18.108  13.673
    hetatm 10 c1 100 2 -4.340 -29.478 45.144
    hetatm 11 c2 99 2 -3.051 -29.846 44.395
    hetatm   12  oh  101     2      -1.968 -29.072  44.880
    hetatm 13 f1 103 2 -4.217 -29.778 46.464
    hetatm 14 f2 103 2 -5.396 -30.156 44.621
    hetatm 15 f3 103 2 -4.551 -28.140 45.015
    hetatm 16 h1 104 2 -3.178 -29.656 43.329
    hetatm 17 h2 104 2 -2.829 -30.908 44.511
    hetatm   18  ho  102     2      -2.315 -28.222  45.119
    hetatm 19 c1 100 3 -49.455 -17.542 -31.718
    hetatm 20 c2 99 3 -49.981 -18.984 -31.736
    hetatm   21  oh  101     3     -48.905 -19.897 -31.607
    hetatm 22 f1 103 3 -48.867 -17.273 -30.521
    hetatm 23 f2 103 3 -50.474 -16.668 -31.929
    hetatm 24 f3 103 3 -48.527 -17.405 -32.704
    $

This doesn't preserve the spacing in the modified lines, but is otherwise what you need. If you need to preserve the spacing, you have to write a printf() statement to format the fields correctly (in place of the print in the last block of code):

    printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);

This preserves the spacing, but makes the code less robust in general. It exploits the property that strings shorter than n in a %ns conversion are right-justified. It yields:

    compnd source
    hetatm    1  c1  100     1     -23.207  17.632  14.543
    hetatm    2  c2   99     1     -22.069  18.353  15.280
    hetatm    3  oh  101     1     -21.074  18.762  14.358
    hetatm    4  f1  103     1     -23.816  18.483  13.675
    hetatm    5  f2  103     1     -24.119  17.162  15.433
    hetatm    6  f3  103     1     -22.680  16.591  13.841
    hetatm    7  h1  104     1     -21.623  17.681  16.014
    hetatm    8  h2  104     1     -22.451  19.218  15.823
    hetatm    9  ho  102     1     -21.040  18.108  13.673
    hetatm   10  c1  100     2      -4.340 -29.478  45.144
    hetatm   11  c2   99     2      -3.051 -29.846  44.395
    hetatm   12  oh  101     2      -1.968 -29.072  44.880
    hetatm   13  f1  103     2      -4.217 -29.778  46.464
    hetatm   14  f2  103     2      -5.396 -30.156  44.621
    hetatm   15  f3  103     2      -4.551 -28.140  45.015
    hetatm   16  h1  104     2      -3.178 -29.656  43.329
    hetatm   17  h2  104     2      -2.829 -30.908  44.511
    hetatm   18  ho  102     2      -2.315 -28.222  45.119
    hetatm   19  c1  100     3     -49.455 -17.542 -31.718
    hetatm   20  c2   99     3     -49.981 -18.984 -31.736
    hetatm   21  oh  101     3     -48.905 -19.897 -31.607
    hetatm   22  f1  103     3     -48.867 -17.273 -30.521
    hetatm   23  f2  103     3     -50.474 -16.668 -31.929
    hetatm   24  f3  103     3     -48.527 -17.405 -32.704
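The right-justification that the printf() format relies on can be checked in isolation. The following is only a small sketch; the function name show_padding is made up for illustration and is not part of the original script. It also shows one reason the approach is fragile: a value wider than the field width is printed in full, so the columns after it shift.

```shell
# Hypothetical helper: prints three values with a %4s conversion.
show_padding() {
  awk 'BEGIN {
    printf("[%4s]\n", "1")      # shorter than 4: padded on the left
    printf("[%4s]\n", "9999")   # exactly 4: printed as is
    printf("[%4s]\n", "10000")  # longer than 4: printed in full, not truncated
  }'
}

show_padding
# prints:
# [   1]
# [9999]
# [10000]
```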

It appears that once you reach 10,000 records, the hetatm column and the number column following it are merged into a single column:

    hetatm   21  oh  101     3     -48.905 -19.897 -31.607
    …
    hetatm 9999  ho  102  1111     -24.504 -16.257 -35.613
    hetatm10000  ct  100  1112       9.045  23.978  29.038
    hetatm10001  ct   99  1112      10.488  24.501  29.083
    hetatm10002  oh  101  1112      11.370  23.545  28.522
    hetatm10003  f   103  1112       8.650  23.804  27.749
    hetatm10004  f   103  1112       8.209  24.855  29.654
    hetatm10005  f   103  1112       8.996  22.779  29.679
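A quick way to confirm the merge is to let awk report the field count for a row on each side of the boundary. This is just a sketch; count_fields is an illustrative name, and the two rows are copied from the sample above.

```shell
# Hypothetical helper: prints the number of whitespace-separated fields per line.
count_fields() {
  awk '{ print NF }' "$@"
}

printf '%s\n' \
  'hetatm 9999  ho  102  1111     -24.504 -16.257 -35.613' \
  'hetatm10000  ct  100  1112       9.045  23.978  29.038' | count_fields
# prints:
# 8
# 7
```

With the default field splitting, the atom type is therefore in $3 for 8-field rows and in $2 for 7-field rows.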

It isn't clear what happens when the numbers reach 100,000 and above. However, it is possible to deal with this (for the most part) by counting the columns and working appropriately.

    # Anything that is not an atom record passes through unchanged.
    NF < 7 { print; next }
    # 8 fields: atom type is in $3; 7 fields (merged hetatm and number): it is in $2.
    NF == 8 && $3 != "ct" && $3 != "f" && $3 != "hc" { print; next }
    NF == 7 && $2 != "ct" && $2 != "f" && $2 != "hc" { print; next }
    NF == 8 {
              if (old_mark != $3) { counter = 0 }
              old_mark = $3
              $3 = substr($3, 1, 1) ++counter
              printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);
            }
    NF == 7 {
              if (old_mark != $2) { counter = 0 }
              old_mark = $2
              $2 = substr($2, 1, 1) ++counter
              printf("%s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7);
            }

Note the use of the 'column number neutral' name old_mark. If row 9,999 contains ct and row 10,000 also contains ct, the mapping needs to be continuous (c1, c2), etc. Or you can use:

    NF < 7 { print; next }
    NF == 8 && $3 != "ct" && $3 != "f" && $3 != "hc" { print; next }
    NF == 7 && $2 != "ct" && $2 != "f" && $2 != "hc" { print; next }
    {
        # Atom type column: 3 when there are 8 fields, 2 when there are 7.
        colnum = NF - 5
        if (old_mark != $colnum) { counter = 0 }
        old_mark = $colnum
        $colnum = substr($colnum, 1, 1) ++counter
        if (NF == 7)
            printf("%s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7);
        else
            printf("%s %4s %3s %4s %5s %11s %7s %7s\n", $1, $2, $3, $4, $5, $6, $7, $8);
    }

There might be a way to use one printf() call, but I doubt if it is worth the effort.
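A different route, which is not the method used above and is offered only as a sketch, is to skip printf() altogether and edit the line in place with sub(), so the original spacing is never rebuilt. It assumes that the atom type is always the sixth field from the end, that it is surrounded by spaces, and that no counter exceeds 9 (so the new label is exactly two characters wide). The function name retag is made up for illustration.

```shell
# Hypothetical helper: relabels ct, f and hc in place, leaving the spacing untouched.
retag() {
  awk '
    {
      # The atom type is the sixth field from the end (assumption based on the sample).
      tag = (NF >= 7) ? $(NF - 5) : ""
      if (tag != "ct" && tag != "f" && tag != "hc") { print; next }
      if (old_mark != tag) { counter = 0 }
      old_mark = tag
      label = substr(tag, 1, 1) (++counter)
      # Pad the tag to two characters so the replacement has the same width.
      sub(" " sprintf("%-2s", tag) " ", " " label " ")
      print
    }
  ' "$@"
}

printf '%s\n' \
  'hetatm    4  f   103     1     -23.816  18.483  13.675' \
  'hetatm10000  ct  100  1112       9.045  23.978  29.038' | retag
# prints:
# hetatm    4  f1  103     1     -23.816  18.483  13.675
# hetatm10000  c1  100  1112       9.045  23.978  29.038
```

Because nothing is reformatted, the same code handles both the 8-field and the 7-field rows; whether it copes with whatever the file does at 100,000 records depends on how those rows are actually laid out.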

