python - pickling a pandas dataframe multiplies the file size by 5
I am reading an 800 MB csv file with pandas.read_csv, and then use the plain Python pickle.dump(dataframe) to save it. The result is a 4 GB pkl file, so the csv size has been multiplied by 5.
I expected pickle to compress the data rather than expand it, since running gzip on the csv file compresses it down to 200 MB, dividing the size by 4.
I want to accelerate the loading time of my program and thought pickling would help. Since disk access is the main bottleneck, my understanding is that I would rather keep the files compressed and use the compression option of pandas.read_csv to speed up loading time.
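For concreteness, a minimal sketch of the two approaches being compared (file names are hypothetical, and the protocol comment reflects the default behaviour of Python 2's pickle module):

    import pickle
    import pandas as pd

    # Read the 800 MB csv (file names here are hypothetical).
    df = pd.read_csv('data.csv')

    # Save with pickle.dump as described above. On Python 2 the default
    # protocol is the text-based protocol 0, which produces much larger
    # files than an explicit binary protocol.
    with open('data.pkl', 'wb') as f:
        pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

    # The alternative being considered: keep the csv gzipped on disk and
    # let pandas decompress it while reading.
    df = pd.read_csv('data.csv.gz', compression='gzip')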
Is this correct?
Is it normal that pickling a pandas dataframe expands the data size?
How can I speed up loading time?
Thanks.
Edit
Considering the answers below, a new question emerges: is there a data-size limit for loading with pandas?
It is in your best interest to stash your csv file in a database of some sort and perform operations on that, rather than loading the csv file into RAM, as kathirmani suggested. You will see the speedup in loading time that you expect simply because you are not filling 800 MB worth of RAM every time you load your script.
File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the csv file and then loading it will take more time; you've now added the extra step of having to decompress the file, which doesn't solve your problem.
Consider, as a precursory step, shipping your data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.
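A minimal sketch of that import step, assuming a plain csv file with a header row (the file name, table name, and all-TEXT column handling are illustrative, not taken from the linked answer):

    import csv
    import sqlite3

    conn = sqlite3.connect('your/database/path')
    with open('data.csv') as f:
        reader = csv.reader(f)
        header = next(reader)
        # Create a table whose columns mirror the csv header (all TEXT for simplicity).
        columns = ', '.join('"%s" TEXT' % name for name in header)
        conn.execute('CREATE TABLE IF NOT EXISTS foo (%s)' % columns)
        # Stream the remaining rows straight from the csv reader into the table.
        placeholders = ', '.join('?' for _ in header)
        conn.executemany('INSERT INTO foo VALUES (%s)' % placeholders, reader)
    conn.commit()
    conn.close()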
You then have the pleasure of being able to query just a subset of your data and load it into a pandas.DataFrame for further use, as follows:
    from pandas.io import sql
    import sqlite3

    conn = sqlite3.connect('your/database/path')
    query = "SELECT * FROM foo WHERE bar = 'foobar';"
    results_df = sql.read_frame(query, con=conn)
    ...
Conversely, you can use pandas.DataFrame.to_sql() to save these results for later use.
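For example, a short sketch of that round trip (the table name 'results' is hypothetical):

    # Save the query results back into the database for later runs.
    results_df.to_sql('results', conn, if_exists='replace', index=False)

    # On a later run, pull them straight back without touching the csv.
    results_df = sql.read_frame("SELECT * FROM results;", con=conn)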