sorting - Bug in bash sort with different columns?
I am working with a file that contains three values per line: an ID (they happen to be protein IDs, in case you are curious), a value, and another value. It is tab-delimited and looks like this:
a2m	0.979569315988908	1
aacs	0.925340159491081	1
aagab	0.982296215686199	1
aak1	0.736903840140103	1
aamp	0.00589711816127862	0.138868449447202
aars2	1	1
aars	3.13300124295614e-05	0.00212792325492566
aarsd1	0.527417792161261	1
aasdh	0.869909252023668	1
aasdhppt	0.763918221284724	1
aatf	0.691907759125663	1
abat	0.989693691462661	1
abca1	0.601194017450064	1
abca5	1	1
abca6	1	1
I am interested in sorting these IDs in alphabetical order and extracting the various values. However, I noticed that sort orders the IDs differently depending on which column I am extracting. When I execute:
cut --fields\=1,2 input.txt|sort --key=1
the resulting file is:
a2m	0.979569315988908
aacs	0.925340159491081
aagab	0.982296215686199
aak1	0.736903840140103
aamp	0.00589711816127862
aars2	1
aars	3.13300124295614e-05
aarsd1	0.527417792161261
aasdh	0.869909252023668
aasdhppt	0.763918221284724
aatf	0.691907759125663
abat	0.989693691462661
abca1	0.601194017450064
abca5	1
abca6	1
But when I execute:
cut --fields\=1,3 input.txt|sort --key=1
I get:
a2m	1
aacs	1
aagab	1
aak1	1
aamp	0.138868449447202
aars	0.00212792325492566
aars2	1
aarsd1	1
aasdh	1
aasdhppt	1
aatf	1
abat	1
abca1	1
abca5	1
abca6	1
Notice that the positions of aars and aars2 have switched, which shouldn't happen since I am sorting based on the first column. I've never seen this behavior from sort, and I've been using bash for a while now. Is this a bug, or am I doing something wrong?
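The --key=1 option tells sort to use the "fields" from the first through the end of the line to sort the input, so the value column takes part in the comparison too. As @rici observed first, the default is a locale-sensitive sort, and in many locales whitespace is ignored for collation purposes. That's what seems to be happening here.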
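To see that the key really runs to the end of the line, here is a minimal sketch with two made-up rows that share the same ID (the ID x is hypothetical, not from the data above). It assumes a sort that supports -s (GNU and BSD sort do) and forces LC_ALL=C so the outcome does not depend on the local collation rules. The short options -t and -k are used for portability; -k is the same as --key.

```shell
tab=$(printf '\t')   # a literal tab, used as the field separator

# -k1: the key spans field 1 through the end of the line,
# so the second column decides the order of rows with the same ID.
whole=$(printf 'x\t2\nx\t1\n' | LC_ALL=C sort -t "$tab" -k1)

# -k1,1: the key is field 1 only; -s turns off the last-resort
# whole-line comparison, so rows with equal keys keep their input order.
first_only=$(printf 'x\t2\nx\t1\n' | LC_ALL=C sort -t "$tab" -s -k1,1)

printf '%s\n' "$whole"        # row "x 1" comes first
printf '%s\n' "$first_only"   # row "x 2" comes first (input order kept)
```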
If you want to sort only on the protein IDs, do this:
cut --fields=1,2 input.txt | sort --key=1,1
cut --fields=1,3 input.txt | sort --key=1,1
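As a quick check of this fix, the sketch below builds a three-row subset of the data from the question in a variable (the names sample, ids_a and ids_b are only illustrative) and compares the ID order produced by both extractions. It additionally passes -t with a tab, which is an assumption on my part: it makes the field boundary explicit instead of relying on sort's default blank-based splitting.

```shell
tab=$(printf '\t')   # a literal tab, used as the field separator

# Three rows taken from the question's input, including the aars/aars2 pair.
sample=$(printf '%s\n' \
  "aamp${tab}0.00589711816127862${tab}0.138868449447202" \
  "aars2${tab}1${tab}1" \
  "aars${tab}3.13300124295614e-05${tab}0.00212792325492566")

# Sort each two-column extract on the first field only, then keep just the IDs.
ids_a=$(printf '%s\n' "$sample" | cut -f1,2 | sort -t "$tab" -k1,1 | cut -f1)
ids_b=$(printf '%s\n' "$sample" | cut -f1,3 | sort -t "$tab" -k1,1 | cut -f1)

if [ "$ids_a" = "$ids_b" ]; then
  echo "same ID order"
else
  echo "ID order differs"
fi
```

Because the key stops at the end of the first field, the value column can no longer influence where aars and aars2 land.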
@rici explains how to approach the problem by specifying a collation order that accounts for whitespace.
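One common way to get such a collation order is to run sort in the C locale, where lines are compared byte by byte and a tab counts like any other character. This is an assumption about what that approach looks like in practice, not a quote of @rici's answer. The sketch reuses three rows from the question; sample and sorted_c are illustrative names.

```shell
tab=$(printf '\t')   # a literal tab separating the columns

# Three rows taken from the question's input, including the aars/aars2 pair.
sample=$(printf '%s\n' \
  "aamp${tab}0.00589711816127862${tab}0.138868449447202" \
  "aars2${tab}1${tab}1" \
  "aars${tab}3.13300124295614e-05${tab}0.00212792325492566")

# LC_ALL=C applies only to this sort invocation. With byte-wise comparison
# the tab after "aars" (0x09) sorts before the "2" in "aars2" (0x32),
# even though the key still runs to the end of the line.
sorted_c=$(printf '%s\n' "$sample" | cut -f1,3 | LC_ALL=C sort -k1)

printf '%s\n' "$sorted_c"
```

Keep in mind that byte order also places all uppercase letters before lowercase ones, so this suits data with consistent casing.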