python - pickling a pandas dataframe multiplies the file size by 5
I am reading an 800 MB csv file with pandas.read_csv, and then use the plain Python pickle.dump(dataframe) to save it. The result is a 4 GB pkl file, so the csv size has been multiplied by 5.
I expected pickle to compress the data rather than expand it, since running gzip on the csv file compresses it down to 200 MB, dividing the size by 4.
I want to accelerate the loading time of my program and thought pickling would help. Since disk access is the main bottleneck, my understanding is that I would rather keep the files compressed and use the compression option of pandas.read_csv to speed up loading time.
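For concreteness, a minimal sketch of the two approaches being compared (file names are hypothetical, and the protocol comment reflects the default behaviour of Python 2's pickle module):

    import pickle
    import pandas as pd

    # Read the 800 MB csv (file names here are hypothetical).
    df = pd.read_csv('data.csv')

    # Save with pickle.dump as described above. On Python 2 the default
    # protocol is the text-based protocol 0, which produces much larger
    # files than an explicit binary protocol.
    with open('data.pkl', 'wb') as f:
        pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

    # The alternative being considered: keep the csv gzipped on disk and
    # let pandas decompress it while reading.
    df = pd.read_csv('data.csv.gz', compression='gzip')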
Is this correct?
Is it normal that pickling a pandas dataframe expands the data size?
How can I speed up loading time?
Thanks.
Edit
Considering the answers below, a new question emerges: is there a data-size limit for loading with pandas?
It is in your best interest to stash your csv file in a database of some sort and perform operations on that, rather than loading the csv file into RAM, as kathirmani suggested. You will see the speedup in loading time that you expect simply because you are not filling 800 MB worth of RAM every time you load your script.
File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the csv file and then loading it will take more time; you've now added the extra step of having to decompress the file, which doesn't solve your problem.
Consider, as a precursory step, shipping your data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.
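A minimal sketch of that import step, assuming a plain csv file with a header row (the file name, table name, and all-TEXT column handling are illustrative, not taken from the linked answer):

    import csv
    import sqlite3

    conn = sqlite3.connect('your/database/path')
    with open('data.csv') as f:
        reader = csv.reader(f)
        header = next(reader)
        # Create a table whose columns mirror the csv header (all TEXT for simplicity).
        columns = ', '.join('"%s" TEXT' % name for name in header)
        conn.execute('CREATE TABLE IF NOT EXISTS foo (%s)' % columns)
        # Stream the remaining rows straight from the csv reader into the table.
        placeholders = ', '.join('?' for _ in header)
        conn.executemany('INSERT INTO foo VALUES (%s)' % placeholders, reader)
    conn.commit()
    conn.close()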
You then have the pleasure of being able to query just a subset of your data and load it into a pandas.DataFrame for further use, as follows:
    from pandas.io import sql
    import sqlite3

    conn = sqlite3.connect('your/database/path')
    query = "SELECT * FROM foo WHERE bar = 'foobar';"
    results_df = sql.read_frame(query, con=conn)
    ...
Conversely, you can use pandas.DataFrame.to_sql() to save these results for later use.
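For example, a short sketch of that round trip (the table name 'results' is hypothetical):

    # Save the query results back into the database for later runs.
    results_df.to_sql('results', conn, if_exists='replace', index=False)

    # On a later run, pull them straight back without touching the csv.
    results_df = sql.read_frame("SELECT * FROM results;", con=conn)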