python - Pandas read csv out of memory -


I am trying to manipulate a large CSV file using pandas. When I wrote this:

df = pd.read_csv(strfilename,sep='\t',delimiter='\t') 

it raises "pandas.parser.CParserError: Error tokenizing data. C error: out of memory". wc -l indicates there are 13822117 lines. I need to aggregate the data in this CSV file into a data frame. Is there a way to handle this other than splitting the CSV into several files and writing code to merge the results? Any suggestions on how to do that?

the input this:

columns=[ka, kb_1, kb_2, timeofevent, timeinterval]
0: '3m' '2345' '2345' '2014-10-5'  3000
1: '3m' '2958' '2152' '2015-3-22'  5000
2: 'ge' '2183' '2183' '2012-12-31' 515
3: '3m' '2958' '2958' '2015-3-10'  395
4: 'ge' '2183' '2285' '2015-4-19'  1925
5: 'ge' '2598' '2598' '2015-3-17'  1915

and desired output this:

columns=[ka, kb, errornum, errorrate, total number of records]
'3m', '2345', 0, 0%,  1
'3m', '2958', 1, 50%, 2
'ge', '2183', 1, 50%, 2
'ge', '2598', 0, 0%,  1

If the data set were small, the code below (provided in another answer) could be used:

df2 = df.groupby(['ka','kb_1'])['iserror'].agg({'errornum': 'sum',
                                                'recordnum': 'count'})
df2['errorrate'] = df2['errornum'] / df2['recordnum']

ka kb_1  recordnum  errornum  errorrate
3m 2345          1         0        0.0
   2958          2         1        0.5
ge 2183          2         1        0.5
   2598          1         0        0.0

(Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as an abnormal record.)
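The 'iserror' column used in the groupby above is never shown being built; under the definition just given, it could be derived with a vectorized comparison. A minimal sketch on a small hypothetical sample:

```python
import pandas as pd

# Hypothetical sample in the same shape as the input above
df = pd.DataFrame({
    'ka': ['3m', '3m', 'ge'],
    'kb_1': ['2345', '2958', '2183'],
    'kb_2': ['2345', '2152', '2183'],
})

# A record is abnormal when kb_1 != kb_2
df['iserror'] = (df['kb_1'] != df['kb_2']).astype(int)
print(df['iserror'].tolist())  # → [0, 1, 0]
```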

Based on the snippet in "out of memory error when reading csv file in chunk", here is an approach that reads the file line by line.

I assume kb_2 is the error indicator:

groups = {}
with open("data/petajoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.split('\t')
        # assuming structure: ka,kb_1,kb_2,timeofevent,timeinterval
        k = arr[0] + ',' + arr[1]
        if k not in groups:
            groups[k] = {'record_count': 0, 'error_sum': 0}
        groups[k]['record_count'] += 1
        groups[k]['error_sum'] += float(arr[2])
for k, v in groups.items():
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This code snippet stores the groups in a dictionary and calculates the error rate after reading the entire file.

It will only encounter an out-of-memory exception if there are too many combinations of groups.
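If you prefer to stay in pandas, another option is to read the file in chunks with read_csv's chunksize parameter, aggregate each chunk with groupby, and then sum the partial counts, computing the error rate only at the end. A sketch, using an in-memory sample to stand in for the real file and assuming the same tab-separated columns and the kb_1 != kb_2 error definition (on the real file you would pass the filename and a much larger chunksize, e.g. 1000000):

```python
import io
import pandas as pd

# Small in-memory sample standing in for the real file
raw = (
    "3m\t2345\t2345\t2014-10-5\t3000\n"
    "3m\t2958\t2152\t2015-3-22\t5000\n"
    "ge\t2183\t2183\t2012-12-31\t515\n"
    "3m\t2958\t2958\t2015-3-10\t395\n"
    "ge\t2183\t2285\t2015-4-19\t1925\n"
    "ge\t2598\t2598\t2015-3-17\t1915\n"
)

names = ['ka', 'kb_1', 'kb_2', 'timeofevent', 'timeinterval']
partials = []
# chunksize=2 is only for demonstration on this tiny sample
for chunk in pd.read_csv(io.StringIO(raw), sep='\t', names=names, chunksize=2):
    chunk['iserror'] = (chunk['kb_1'] != chunk['kb_2']).astype(int)
    partials.append(chunk.groupby(['ka', 'kb_1'])['iserror'].agg(['sum', 'count']))

# Sums and counts are additive across chunks, so the partial
# aggregates can be combined exactly; the rate is derived once at the end
df2 = pd.concat(partials).groupby(level=['ka', 'kb_1']).sum()
df2.columns = ['errornum', 'recordnum']
df2['errorrate'] = df2['errornum'] / df2['recordnum']
print(df2)
```

This keeps only one chunk plus the running partial aggregates in memory, so it works even when the whole file does not fit, while producing the same result as the small-data groupby shown earlier.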

