Python - Parse SAS .out into pandas

February 15, 2014

I receive a lot of data like this:

                                even_
                  long_var_     longer_var
  obs var1        name          name            ...
  =================================================
    1 xxx         23            lolz            ...
    2 yyy         34            foo             ...
    3 zzz         96            bar             ...

in the form of .out files from SAS.

If these were simple tab-delimited files, there'd be no problem, but SAS does a sort of magical pretty-printing, with line breaks in the variable names to keep the columns lined up, and it repeats the headers every 60 or 70 lines or so. Because the variables have different lengths (as in the example), this results in some files with two lines of variable names and some with three, and I can imagine it breaking into four.

Let's say for the moment that I can't convince the programmers who deliver this data to find a way to dump it as a nice clean CSV or something.

The challenges are threefold:

1. split variable names
2. line breaking
3. removal of repeated headers

I think I can handle 3 with a big stupid multi-line regex, but I don't have a clue how I'd handle 1 and 2 (not pandas.read_fwf, at least not without enough pre-processing that I'd have solved 1 and 2 at that point anyway).

Is there a library somewhere in the PyData universe for this? If not, any suggestions?

TXR:

Approach: use fixed-width variables to extract the nasty variable-name columns, then Lisp code to reconstitute them. When marching over the data, recognize repetitions of the header by the optional presence of a blank line, using @(maybe). If the blank line is present, @(skip) it and the stuff that follows, right up to the next =====.....

@(collect)
@{col1-hdr 3}@{col2-var 12}@{col3-var 14}@{col4-var}
@(last)
@/=====.*/
@(end)
@(set (col1-hdr col2-var col3-var col4-var)
      @[mapcar (opip (mapcar trim-str) cat-str)
               (list col1-hdr col2-var col3-var col4-var)])
@(collect)
 @col1 @col2 @col3 @col4
@  (maybe)

@  (skip)
@/=====.*/
@  (end)
@(end)
@(output)
@{col1-hdr 20} @{col2-var 20} @{col3-var 20} @{col4-var}
@  (repeat)
@{col1     20} @{col2     20} @{col3     20} @col4
@  (end)
@(end)

Run:

$ cat data
                              even_
                long_var_     longer_var
obs var1        name          name
========================================
  1 xxx         23            lolz
  2 yyy         34            foo
  3 zzz         96            bar

                              even_
                long_var_     longer_var
obs var1        name          name
========================================
  4 aaa         12            quux
  5 bbb         45            xyzzy
  6 ccc         78            bork

$ txr data.txr data
obs                  var1                 long_var_name        even_longer_varname
1                    xxx                  23                   lolz
2                    yyy                  34                   foo
3                    zzz                  96                   bar
4                    aaa                  12                   quux
5                    bbb                  45                   xyzzy
6                    ccc                  78                   bork
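For readers who would rather stay in Python, the same approach might be sketched roughly as follows. This is only an illustration, not a tested library: it assumes the column widths 3, 12 and 14 from the TXR pattern above, that a line starting with ===== ends each header block, and that a blank line precedes each repeated header. The names split_fixed and parse_listing are made up for this example.

```python
# Assumed column widths, mirroring the TXR pattern: 3, 12, 14, then the rest.
WIDTHS = (3, 12, 14)


def split_fixed(line, widths=WIDTHS):
    """Cut a line into fixed-width fields; the last field takes the remainder."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    fields.append(line[start:].strip())
    return fields


def parse_listing(lines, widths=WIDTHS):
    """Return (column_names, rows) from a paginated fixed-width listing."""
    names = None
    header = []        # header lines of the page currently being read
    in_header = True   # every page starts with a header block
    rows = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.lstrip().startswith("====="):
            # The rule ends a header block; build the names only once.
            if names is None:
                pieces = [split_fixed(h, widths) for h in header]
                names = ["".join(col) for col in zip(*pieces)]
            header, in_header = [], False
        elif not line.strip():
            # Assumption: a blank line announces a repeated header.
            in_header = True
        elif in_header:
            header.append(line)
        else:
            # Whitespace-separated data; the last column keeps the rest.
            rows.append(line.strip().split(None, len(widths)))
    return names, rows


SAMPLE = """\
                              even_
                long_var_     longer_var
obs var1        name          name
========================================
  1 xxx         23            lolz
  2 yyy         34            foo
  3 zzz         96            bar

                              even_
                long_var_     longer_var
obs var1        name          name
========================================
  4 aaa         12            quux
  5 bbb         45            xyzzy
  6 ccc         78            bork
"""

names, rows = parse_listing(SAMPLE.splitlines())
print(names)
for row in rows:
    print(row)
```

If pandas is available, the result can then be handed to pandas.DataFrame(rows, columns=names). Like the TXR program, this sketch relies on the column widths being known in advance.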

This program cheats! It depends on the "longest varname" being the rightmost one. What if the data contains shorter lines, like this, where $ indicates the end of a line?

longest     $
var         shorter$
name        name        shortest$

One thing we can do is pad the input stream with spaces up to a given column: ensure that all lines are padded to 256 characters. The TXR pattern language processes a lazy list of strings that is implicitly produced from the data source. We can redirect it to march instead over an explicitly created lazy list that we massage first. This can be done, for instance, by adding this line at the top:

@(next :list @(mapcar* (op format nil "~<256a") (get-lines)))

We also modify the column-collecting match so that it does not include the trailing spaces in the col4 variable:

 @col1 @col2 @col3 @col4@/ +/

Lastly, since we used (get-lines), which reads from standard input (when not given an argument), we run it like this:

$ txr data.txr < data
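The padding idea carries over to other tools. As a rough Python sketch (pad_lines and the 12-character column boundaries are assumptions made for this example, taken from the short header shown above), each line is right-padded to a common width and the trailing spaces are trimmed again after slicing:

```python
def pad_lines(lines, width=256):
    """Right-pad every line with spaces so that all are at least `width` long."""
    return [line.rstrip("\n").ljust(width) for line in lines]


# Header block from the example above (without the $ end-of-line markers).
header = [
    "longest",
    "var         shorter",
    "name        name        shortest",
]
padded = pad_lines(header, width=40)

# Assumed column boundaries for this example: 0-12, 12-24, 24-end.
bounds = [(0, 12), (12, 24), (24, None)]
pieces = [[line[a:b].rstrip() for a, b in bounds] for line in padded]

# Glue the stacked pieces of each column back into one name.
columns = ["".join(col) for col in zip(*pieces)]
print(columns)
```

Python slicing already tolerates short lines by returning empty strings, so the padding matters mostly for tools that insist on every field being present, as the fixed-width TXR variables do.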
