python - Pandas groupby: apply vs aggregate with missing categories
I'm running into an issue where pandas' groupby.apply and groupby.aggregate give different-shaped results when categorical data has missing values (categories that no rows use). aggregate retains all "known" categories, while apply keeps only the categories present in the data.
Here's a simplified example:
    import pandas as pd
    import numpy as np

    # `missing` has a 'b' category, but no data uses it.
    missing = pd.Categorical(list('aaa'), categories=['a', 'b'])
    dense = pd.Categorical(list('abc'))
    values = np.arange(len(dense))
    df = pd.DataFrame({'missing': missing, 'dense': dense, 'values': values})

    grouped = df.groupby(['missing', 'dense'])
    print(grouped.mean())
    print(grouped.agg(np.mean))
    print(grouped.apply(lambda chunk: np.mean(chunk)))

which prints
                   values
    missing dense
    a       a           0
            b           1
            c           2
    b       a         NaN
            b         NaN
            c         NaN
                   values
    missing dense
    a       a           0
            b           1
            c           2
    b       a         NaN
            b         NaN
            c         NaN
                   values
    missing dense
    a       a           0
            b           1
            c           2

Note that the last data frame is missing the NaN rows for missing = b. I understand why apply might do this (it chooses not to pass a group full of NaNs to the reduction function). The above snippet is a toy example: I need to use apply to get the result I want.
Question: what's the best way to use apply to create output with a shape matching the one returned by aggregate?
This needs an enhancement in pandas, and a pull request for it would be welcome; see here.
In general, you should use .mean(), as it is the idiomatic way (and faster).
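If you really do need apply, one possible workaround is to run apply over the observed groups only and then reindex the result against the full product of the categories. This is only a sketch under some assumptions, not an official pandas recipe: the names keys, applied, full_index and result are illustrative, observed=True assumes pandas 0.23 or later, and chunk.mean() stands in for whatever function you actually need apply for.

```python
import numpy as np
import pandas as pd

# Same toy data as in the question: category 'b' is declared but never used.
missing = pd.Categorical(list('aaa'), categories=['a', 'b'])
dense = pd.Categorical(list('abc'))
df = pd.DataFrame({'missing': missing, 'dense': dense,
                   'values': np.arange(len(dense))})

keys = ['missing', 'dense']  # illustrative name for the grouping columns

# Run apply only on the groups that actually occur in the data.
applied = df.groupby(keys, observed=True)['values'].apply(
    lambda chunk: chunk.mean()  # placeholder for the real function
)

# Re-key the result with plain (non-categorical) labels so that the
# reindex below matches purely on label values.
applied.index = pd.MultiIndex.from_tuples(applied.index.tolist(), names=keys)

# Every combination of the declared categories, used or not.
full_index = pd.MultiIndex.from_product(
    [list(df[k].cat.categories) for k in keys], names=keys
)

# Combinations that apply never saw come back as NaN rows.
result = applied.reindex(full_index).to_frame('values')
print(result)
```

The reindex step is what restores the rows for missing = b, so the output has one row per category combination, as aggregate gives. Note that the index levels of result are plain labels rather than categoricals; convert them back if downstream code depends on that.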