r - removing and aggregating duplicates -
i've posted sample of data i'm working here.
"parcel.." main indexing variable , there amount of duplicates. duplicates not consistent in of other columns. goal aggregate data set there 1 observation of each parcel. i've used following code attempt summing numerical vectors:
aggregate(ap.sample$x.11~ap.sample$parcel..,fun=sum)
the problem removes except parcel , other vector reference.
my goal use same rule numerical vectors (sum) (x.11,x.13,x.15, num_units) of observations of parcelid, different rule (average) other numerical vectors (acres,ttl_sq_ft,mtr.size), , still different rule (just pick 1 name) character variables (pretend there's column "customer.name" different values same unique parcel id, i.e. "steven condominiums" , "stephen apartments"), , delete observations other variables.
i've tried use numcolwise
function doesn't need. instinct specify columns want sum , columns want take average so:
dt<-as.data.table(ap.sample) sum_cols<-ap.05[,c(10,12,14)] mean_cols<-ap.05[,c(17:19)]
and use lapply
function go through each observation , need.
df05<-dt[,lapply(.sd,sum), by=dt$parcel..,.sdcols=sum_cols] df05<-dt[,lapply(.sd,mean),by=dt$parcel..,.sdcols=mean_cols]
but spits out errors on first go. know there's simpler work around trying muscle through it.
you do:
library(dplyr) df %>% # create hypothetical "customer.name" column mutate(customer.name = sample(letters[1:10], size = n(), replace = true)) %>% # group data "parcel.." group_by(parcel..) %>% # apply sum() selected columns mutate_each(funs(sum(.)), one_of("x.11", "x.13", "x.15", "num_units")) %>% # likewise mean() mutate_each(funs(mean(.)), one_of("acres", "ttl_sq_ft", "mtr.size")) %>% # select desired columns select(x.11, x.13, x.15, num_units, acres, ttl_sq_ft, mtr.size, customer.name) %>% # de-duplicate while keeping arbitrary value (the first 1 in row order) distinct(parcel..)
Comments
Post a Comment