python - Break pandas DataFrame column into multiple pieces and combine with other DataFrame

May 15, 2012

I have a table of phrases, and I have a table of the individual words that make up these phrases. I want to break the phrases into individual words, gather and reduce information about these individual words, and add it as a new column in the phrase data. Is there a smart way to do this using pandas DataFrames?

    import pandas as pd

    df_multigram = pd.DataFrame([
        ["happy birthday", 23],
        ["used below", 10],
        ["frame for", 2]
    ], columns=["multigram", "frequency"])

    df_onegram = pd.DataFrame([
        ["happy", 35],
        ["birthday", 25],
        ["used", 14],
        ["below", 11],
        ["frame", 2],
        ["for", 13]
    ], columns=["onegram", "frequency"])

    ###### What to do here????? #######

    sum_freq_onegrams = list(df_multigram["sum_freq_onegrams"])
    self.assertEqual(sum_freq_onegrams, [60, 25, 15])

Just to clarify, I want sum_freq_onegrams to equal [60, 25, 15], where 60 is the frequency of "happy" plus the frequency of "birthday".

You could use

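    # Series mapping each word to its frequency
    freq = df_onegram.set_index(['onegram'])['frequency']

    # For each phrase: split into words, look up each word, sum the results
    sum_freq_onegrams = df_multigram['multigram'].str.split().apply(
        lambda x: pd.Series(x).map(freq).sum())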

which yields

    In [43]: sum_freq_onegrams
    Out[45]:
    0    60
    1    25
    2    15
    Name: multigram, dtype: int64
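To see what the lambda does for a single row, here is a minimal sketch that unrolls it for the phrase "happy birthday". The intermediate names (words, word_series, word_freqs, total) are mine, added only for illustration; freq is built the same way as in the snippet above.

```python
import pandas as pd

df_onegram = pd.DataFrame(
    [["happy", 35], ["birthday", 25], ["used", 14],
     ["below", 11], ["frame", 2], ["for", 13]],
    columns=["onegram", "frequency"],
)

# Series indexed by word, holding each word's frequency
freq = df_onegram.set_index("onegram")["frequency"]

words = "happy birthday".split()     # ['happy', 'birthday']
word_series = pd.Series(words)       # a tiny Series, one entry per word
word_freqs = word_series.map(freq)   # look up each word in freq: 35 and 25
total = word_freqs.sum()             # 35 + 25

print(total)
```

The apply call repeats these steps once per phrase, so a new small Series is created for every row.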

But note that the apply call above invokes the (lambda) function once for every row and builds a new (tiny) Series each time, which may be rather slow. Using a different data structure, such as plain Python lists and dicts, may be faster. For example, if you define a list phrases and a dict freq_dict,

    phrases = df_multigram['multigram'].tolist()
    freq_dict = freq.to_dict()

then the list comprehension (below) is about 280x faster than the pandas-based method:

    In [65]: [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
    Out[65]: [60, 25, 15]

    In [38]: %timeit [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
    100000 loops, best of 3: 3.6 µs per loop

    In [41]: %timeit df_multigram['multigram'].str.split().apply(lambda x: pd.Series(x).map(freq).sum())
    1000 loops, best of 3: 1.01 ms per loop

Thus, a pandas DataFrame might not be the right data structure to hold the phrases for this problem.
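If you still want the result as a new column in the phrase table, as the question asks, one possible sketch is to compute the sums with plain Python and attach the list afterwards. The helper name add_onegram_sums is hypothetical, and I am assuming that words missing from df_onegram should count as 0, as in the freq_dict.get(item, 0) call above.

```python
import pandas as pd


def add_onegram_sums(df_multigram, df_onegram):
    """Return a copy of df_multigram with a sum_freq_onegrams column.

    Hypothetical helper: words that do not appear in df_onegram are
    assumed to contribute 0 to the sum.
    """
    # Plain dict for fast word -> frequency lookups
    freq_dict = dict(zip(df_onegram["onegram"], df_onegram["frequency"]))

    # One sum per phrase, in the same order as the rows of df_multigram
    sums = [
        sum(freq_dict.get(word, 0) for word in phrase.split())
        for phrase in df_multigram["multigram"]
    ]

    # assign() returns a new DataFrame and leaves the input untouched
    return df_multigram.assign(sum_freq_onegrams=sums)


df_multigram = pd.DataFrame(
    [["happy birthday", 23], ["used below", 10], ["frame for", 2]],
    columns=["multigram", "frequency"],
)
df_onegram = pd.DataFrame(
    [["happy", 35], ["birthday", 25], ["used", 14],
     ["below", 11], ["frame", 2], ["for", 13]],
    columns=["onegram", "frequency"],
)

result = add_onegram_sums(df_multigram, df_onegram)
print(result["sum_freq_onegrams"].tolist())
```

This keeps the per-word lookups in a dict and only touches pandas once, when the finished list is attached as a column.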
