python - Fast, efficient way to remove rows from large Pandas DataFrame -
i'm looking remove rows large pandas dataframe contains analytics data based on actions/events users have done on website. streams of user actions begin start
event, , finish end
event. want find users have done particular event (such signed up
- index 13 in example dataframe) , remove events after event until (and including) end
event. in example, viewed blog post
, page view
, visited site
, ad campaign hit
, viewed blog post
, visited site
, page view
, , end
have removed dataframe.
in [26]: data out[26]: event user 0 start user1 1 visited blog user1 2 page view user1 3 visited blog user1 4 viewed blog post user1 5 ad campaign hit user1 6 page view user1 7 visited site user1 8 visited blog user1 9 viewed blog post user1 10 visited site user1 11 page view user1 12 signed user1 13 viewed blog post user1 14 page view user1 15 visited site user1 16 ad campaign hit user1 17 viewed blog post user1 18 visited site user1 19 page view user1 20 end user1
i've tried in number of ways - using np.where()
identify correct rows or
removal_starts_at = data[(data.user == 'user1') & (data.event == 'signed up')] removal_ends_at = data[(data.user == 'user1') & (data.event == 'end')] data[data.user == 'user1'].drop(data.index[removal_start_at+1:removal_ends_at+1], inplace=true)
however, slow! it's takes ~20 seconds per user. have 1000s of users, not efficient. i'd in quicker fashion if possible.
another issue discovered writing question: if don't include [data.user == 'user1']
subset dataframe, goes crazy , takes memory on computer. if include it, doesn't subsetting , gives me warning settingwithcopy
.
i'm relatively new pandas, it's quite there's simpler way , i'm doing entirely incorrectly. ideas i've thought using multiindex
find combination of user & event directly or perhaps subsetting in more efficient way?
if i'm understanding correctly, idea have lots of users in 1 dataframe. i've expanded have 2 users. if that's right, ought pretty fast:
df['keep'] = np.where( df['event'] == 'start', 1, np.nan ) df['keep'] = np.where( df['event'].shift() == 'signed up', 0, df['keep'] ) df['keep'] = df['keep'].ffill() event user keep 0 start user1 1 1 visited blog user1 1 2 page view user1 1 3 signed user1 1 4 viewed blog post user1 0 5 page view user1 0 6 end user1 0 7 start user2 1 8 visited blog user2 1 9 signed user2 1 10 viewed blog post user2 0 11 end user2 0 df[df['keep']==1] event user keep 0 start user1 1 1 visited blog user1 1 2 page view user1 1 3 signed user1 1 7 start user2 1 8 visited blog user2 1 9 signed user2 1
Comments
Post a Comment