python - Parse SAS .out into pandas
I receive a lot of data like this:

                                 even_
                   long_var_     longer_var
    obs var1       name          name         ...
    =================================================
    1   xxx        23            lolz         ...
    2   yyy        34            foo          ...
    3   zzz        96            bar          ...

in the form of .out files from SAS.
If these were simple tab-delimited files, there'd be no problem, but SAS does some sort of magical pretty-printing: it puts line breaks in variable names to keep the columns lined up, and it repeats the headers every 60 or 70 lines or so. Because the variables have different lengths (as in the example), this results in some variable names spanning two lines and some three, and I can imagine it breaking them into four.
Let's assume for the moment that I can't convince the programmers who deliver this data to dump a nice clean CSV or something.
There are three challenges:
1. Split variable names
2. Line breaking
3. Removal of repeated headers
I think I can handle 3 with a big stupid multi-line regex, but I don't have a clue how I'd handle 1 and 2 (not with pandas.read_fwf, at least not without enough pre-processing that I'd have solved 1 and 2 by that point anyway).
Is there a library somewhere in the PyData universe for this? If not, any suggestions?
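For concreteness, the regex idea for challenge 3 might look roughly like the sketch below. It is not a tested solution for real SAS output: it assumes that every repeated header is preceded by a blank line, consists of at most five non-blank lines, and ends in a ruler line of `=` characters. The names `REPEATED_HEADER` and `strip_repeated_headers` are made up for illustration.

```python
import re

# Assumption: a repeated header block is
#   a blank line, then 1-5 header-name lines, then a ruler line of '='.
# The first header in the file has no blank line before it, so it is kept.
REPEATED_HEADER = re.compile(
    r"\n[ \t]*\n"                    # end of the previous row, then a blank line
    r"(?:[^\n=][^\n]*\n){1,5}?"      # header-name lines (never start with '=')
    r"=+[ \t]*(?=\n|$)"              # the ruler line, without its newline
)


def strip_repeated_headers(text: str) -> str:
    """Delete every header block that follows a blank line."""
    return REPEATED_HEADER.sub("", text)
```

This only deals with the repeated headers. A blank line inside the data that is not followed by a header would confuse it, so it is worth checking against a real file first.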
Approach: use fixed-width variables to extract the nasty variable-name columns, then some Lisp code to reconstitute them. When marching over the data, recognize repetitions of the header by the optional presence of a blank line, using @(maybe). If it is present, @(skip) the blank line and the stuff that follows, right up to the next =====... line.
    @(collect)
    @{col1-hdr 3}@{col2-var 12}@{col3-var 14}@{col4-var}
    @(last)
    @/=====.*/
    @(end)
    @(set (col1-hdr col2-var col3-var col4-var)
          @[mapcar (opip (mapcar trim-str) cat-str)
                   (list col1-hdr col2-var col3-var col4-var)])
    @(collect)
    @col1 @col2 @col3 @col4
    @  (maybe)

    @    (skip)
    @/=====.*/
    @  (end)
    @(end)
    @(output)
    @{col1-hdr 20} @{col2-var 20} @{col3-var 20} @{col4-var}
    @  (repeat)
    @{col1 20} @{col2 20} @{col3 20} @col4
    @  (end)
    @(end)

Run:
    $ cat data
                                 even_
                   long_var_     longer_var
    obs var1       name          name
    ========================================
    1   xxx        23            lolz
    2   yyy        34            foo
    3   zzz        96            bar

                                 even_
                   long_var_     longer_var
    obs var1       name          name
    ========================================
    4   aaa        12            quux
    5   bbb        45            xyzzy
    6   ccc        78            bork
    $ txr data.txr data
    obs                  var1                 long_var_name        even_longer_varname
    1                    xxx                  23                   lolz
    2                    yyy                  34                   foo
    3                    zzz                  96                   bar
    4                    aaa                  12                   quux
    5                    bbb                  45                   xyzzy
    6                    ccc                  78                   bork

This program cheats! It depends on the "longest varname" being the rightmost one. What if the data contains shorter lines, like this, where $ indicates the end of the line?
    longest $
    var     shorter$
    name    name    shortest$

One thing we can do is pad the input stream with spaces out to a given column: ensure that all lines are padded to 256 characters. The TXR pattern language processes a lazy list of strings that is implicitly produced from the data source. We can redirect it to march instead over an explicitly created lazy list that we massage ourselves. This can be done, for instance, by adding this line at the top:
    @(next :list @(mapcar* (op format nil "~<256a") (get-lines)))

We also modify the column-collecting match so that it does not include the trailing spaces in the col4 variable:
    @col1 @col2 @col3 @col4@/ +/

Lastly, since we used (get-lines), which reads standard input when not given an argument, we run it like this:

    $ txr data.txr < data
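Since the question asks about pandas, the same approach can be sketched in Python. This is an illustrative port, not a translation of the TXR program: `parse_sas_out` and `to_dataframe` are hypothetical names, the default `widths=(3, 12, 14)` simply copies the column widths used in the TXR header pattern above, and it assumes that values contain no spaces and that a blank line appears only in front of a repeated header. Python slices return an empty string past the end of a short line, so the padding step is not needed here.

```python
from itertools import accumulate


def parse_sas_out(lines, widths=(3, 12, 14)):
    """Return (names, rows) from SAS-style listing lines.

    widths are the fixed widths of every column except the last,
    which takes whatever remains of each header line.
    """
    cuts = [0, *accumulate(widths), None]   # e.g. 0, 3, 15, 29, end of line
    spans = list(zip(cuts, cuts[1:]))

    names = None
    header = []        # header lines seen since the last ruler
    in_header = True   # the listing starts with a header block
    rows = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("====="):
            if names is None:
                # Glue the pieces of each split name together, top to bottom.
                names = ["".join(h[a:b].strip() for h in header)
                         for a, b in spans]
            header, in_header = [], False
        elif not line.strip():
            in_header = True   # assumption: blank line means a header follows
        elif in_header:
            header.append(line)
        else:
            rows.append(line.split())   # assumes no spaces inside values
    return names, rows


def to_dataframe(lines):
    """Build a DataFrame of strings; pandas is imported only when needed."""
    import pandas as pd
    names, rows = parse_sas_out(lines)
    return pd.DataFrame(rows, columns=names)
```

With the sample data above, `parse_sas_out(open("data"))` would be expected to give the four reconstituted names and six rows of strings. Converting columns to numeric types is left to the caller.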