python - Fast, efficient way to remove rows from large Pandas DataFrame -


i'm looking remove rows large pandas dataframe contains analytics data based on actions/events users have done on website. streams of user actions begin start event, , finish end event. want find users have done particular event (such signed up - index 13 in example dataframe) , remove events after event until (and including) end event. in example, viewed blog post, page view, visited site, ad campaign hit, viewed blog post, visited site, page view, , end have removed dataframe.

in [26]: data out[26]:      event            user 0   start            user1 1   visited blog     user1 2   page view        user1 3   visited blog     user1 4   viewed blog post user1 5   ad campaign hit  user1 6   page view        user1 7   visited site     user1 8   visited blog     user1 9   viewed blog post user1 10  visited site     user1 11  page view        user1 12  signed        user1 13  viewed blog post user1 14  page view        user1 15  visited site     user1 16  ad campaign hit  user1 17  viewed blog post user1 18  visited site     user1 19  page view        user1 20  end              user1 

i've tried in number of ways - using np.where() identify correct rows or

removal_starts_at = data[(data.user == 'user1') & (data.event == 'signed up')] removal_ends_at = data[(data.user == 'user1') & (data.event == 'end')] data[data.user == 'user1'].drop(data.index[removal_start_at+1:removal_ends_at+1], inplace=true) 

however, slow! it's takes ~20 seconds per user. have 1000s of users, not efficient. i'd in quicker fashion if possible.

another issue discovered writing question: if don't include [data.user == 'user1'] subset dataframe, goes crazy , takes memory on computer. if include it, doesn't subsetting , gives me warning settingwithcopy.

i'm relatively new pandas, it's quite there's simpler way , i'm doing entirely incorrectly. ideas i've thought using multiindex find combination of user & event directly or perhaps subsetting in more efficient way?

if i'm understanding correctly, idea have lots of users in 1 dataframe. i've expanded have 2 users. if that's right, ought pretty fast:

df['keep'] = np.where( df['event'] == 'start', 1, np.nan ) df['keep'] = np.where( df['event'].shift() == 'signed up', 0, df['keep'] ) df['keep'] = df['keep'].ffill()                 event   user  keep 0              start  user1     1 1       visited blog  user1     1 2          page view  user1     1 3          signed  user1     1 4   viewed blog post  user1     0 5          page view  user1     0 6                end  user1     0 7              start  user2     1 8       visited blog  user2     1 9          signed  user2     1 10  viewed blog post  user2     0 11               end  user2     0  df[df['keep']==1]            event   user  keep 0         start  user1     1 1  visited blog  user1     1 2     page view  user1     1 3     signed  user1     1 7         start  user2     1 8  visited blog  user2     1 9     signed  user2     1 

Comments

Popular posts from this blog

c++ - Difference between pre and post decrement in recursive function argument -

php - Nothing but 'run(); ' when browsing to my local project, how do I fix this? -

php - How can I echo out this array? -