python - Break pandas DataFrame column into multiple pieces and combine with other DataFrame
I have a table of phrases, and I have a table of the individual words that make up these phrases. I want to break the phrases into individual words, gather and reduce information about these individual words, and add it as a new column in the phrase data. Is there a smart way to do this using pandas DataFrames?
    import pandas as pd

    df_multigram = pd.DataFrame([
        ["happy birthday", 23],
        ["used below", 10],
        ["frame for", 2]
    ], columns=["multigram", "frequency"])

    df_onegram = pd.DataFrame([
        ["happy", 35],
        ["birthday", 25],
        ["used", 14],
        ["below", 11],
        ["frame", 2],
        ["for", 13]
    ], columns=["onegram", "frequency"])

    ###### what to do here????? #######

    sum_freq_onegrams = list(df_multigram["sum_freq_onegrams"])
    self.assertEqual(sum_freq_onegrams, [60, 25, 15])
Just to clarify, I want sum_freq_onegrams to equal [60, 25, 15], where 60 is the frequency of "happy" plus the frequency of "birthday".
You could use
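    # Series of frequencies indexed by word
    freq = df_onegram.set_index(['onegram'])['frequency']
    # Split each phrase into words, look up each word's frequency, and sum
    sum_freq_onegrams = df_multigram['multigram'].str.split().apply(
        lambda x: pd.Series(x).map(freq).sum())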
which yields
    In [43]: sum_freq_onegrams
    Out[45]:
    0    60
    1    25
    2    15
    Name: multigram, dtype: int64
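To make the one-liner above easier to follow, here is a minimal sketch that runs the same chain one step at a time on the question's data. The intermediate names `words`, `first_row`, and `first_total` are introduced here only for illustration; they do not appear in the original answer.

```python
import pandas as pd

df_multigram = pd.DataFrame(
    [["happy birthday", 23], ["used below", 10], ["frame for", 2]],
    columns=["multigram", "frequency"],
)
df_onegram = pd.DataFrame(
    [["happy", 35], ["birthday", 25], ["used", 14],
     ["below", 11], ["frame", 2], ["for", 13]],
    columns=["onegram", "frequency"],
)

# Step 1: a Series of frequencies indexed by word, usable as a lookup table.
freq = df_onegram.set_index("onegram")["frequency"]

# Step 2: split each phrase on whitespace into a list of words.
words = df_multigram["multigram"].str.split()

# Step 3 (illustrative names): for a single row, map each word to its
# frequency via the index of `freq`, then add the values up.
first_row = pd.Series(words.iloc[0]).map(freq)
first_total = int(first_row.sum())

# Step 4: apply the same lookup-and-sum to every row.
sum_freq_onegrams = words.apply(lambda ws: pd.Series(ws).map(freq).sum())
```

Step 3 is the part that gets repeated for every row, which is why the cost of building a small Series per row matters as the table grows.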
But note that calling a (lambda) function once for every row and building a new (tiny) Series each time may be rather slow. Using a different data structure, such as plain Python lists and dicts, may be faster. For example, if you define a list phrases and a dict freq_dict,
    phrases = df_multigram['multigram'].tolist()
    freq_dict = freq.to_dict()
then the list comprehension (below) is about 280x faster than the pandas-based method (3.6 µs versus 1.01 ms on this tiny example):
    In [65]: [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
    Out[65]: [60, 25, 15]

    In [38]: %timeit [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
    100000 loops, best of 3: 3.6 µs per loop

    In [41]: %timeit df_multigram['multigram'].str.split().apply(lambda x: pd.Series(x).map(freq).sum())
    1000 loops, best of 3: 1.01 ms per loop
Thus, a pandas DataFrame might not be the right data structure to hold the phrases for this problem.
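If the result still needs to end up as a column of df_multigram, as the question asks, the plain-Python computation can be assigned straight back to the DataFrame. The following is a sketch under that assumption; the helper name `sum_word_frequencies` is hypothetical and not part of the original answer.

```python
import pandas as pd

df_multigram = pd.DataFrame(
    [["happy birthday", 23], ["used below", 10], ["frame for", 2]],
    columns=["multigram", "frequency"],
)
df_onegram = pd.DataFrame(
    [["happy", 35], ["birthday", 25], ["used", 14],
     ["below", 11], ["frame", 2], ["for", 13]],
    columns=["onegram", "frequency"],
)

# Plain-Python lookup table: word -> frequency.
freq_dict = df_onegram.set_index("onegram")["frequency"].to_dict()


def sum_word_frequencies(phrase, table):
    """Hypothetical helper: total frequency of the words in `phrase`.

    Words missing from `table` contribute 0, mirroring `.get(item, 0)`.
    """
    return sum(table.get(word, 0) for word in phrase.split())


# Compute with plain Python, then store the list as a new column.
df_multigram["sum_freq_onegrams"] = [
    sum_word_frequencies(phrase, freq_dict)
    for phrase in df_multigram["multigram"]
]
```

With the column in place, the check from the question, `list(df_multigram["sum_freq_onegrams"]) == [60, 25, 15]`, should hold.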