python - Finding and Counting String Values in Pandas DataFrame
I have a pandas DataFrame in which I want to count string values. The strings I want to count are "synonymous_coding" and "non_synonymous_coding". I've found that these strings are located in columns 23, 24, 25, 29, and 31.
Column 23 looks like this:

    15392    oanc=c
    15393    114
    15394    eff=non_synonymous_coding(moderate|missense|gc...
    15395    0/0:30:90.29:0
    15396    psc=0.441
    15397    psc=0.030
    15398    bsc=884
    ...

Column 24 looks like this:

    3092    exon(modifier||||870|rsph10b|protein_coding|co...
    3093    non_synonymous_coding(moderate|missense|acg/at...
    3094    intergenic(modifier||||||||||1)
    3095    intergenic(modifier||||||||||1)
    3096    downstream(modifier||489|||pms2||coding|nr_003...
    3097    downstream(modifier||408|||pms2||coding|nr_003...
    3098    dp=12
    ...

Column 25 looks like this:

    13062    c
    13063    c
    13064    eff=synonymous(modifier|||||dkfzp434l192||coding...
    13065    eff=synonymous(modifier|||||dkfzp434l192||coding...
    13066    canc=g
    13067    c
    13068    g

Column 29 looks like this:

    15688    0:0
    15689    0:0
    15690    nan
    15691    eff=synonymous_coding(low|silent|tcc/tcg|s782|...
    15692    0:0
    15693    nan
    15694    0:1

And column 31 looks like this:

    3081    45
    3082    1432:0
    3083    0:0
    3084    synonymous_coding(low|silent|acg/aca|t473|482|...
    3085    9
    3086    0:0
    3087    0:0

I wanted to know how I can go through these 5 columns and count the number of times the strings "synonymous_coding" or "non_synonymous_coding" appear without double counting, because there might be rows where these strings appear in 2 or more different columns.
Thank you.
Rodrigo
Here's what I worked through; I include the code used to create the DataFrame. You can see the algorithm by focusing on the main() method.
    import pandas as pd


    def create_df():
        grid = {
            'a': ["exon(modifier||||870|rsph10b|protein_coding|co)",
                  "non_synonymous_coding(moderate|missense|acg/at)",
                  "intergenic(modifier||||||||||1)",
                  "downstream(modifier||489|||pms2||coding|nr_003)",
                  "downstream(modifier||408|||pms2||coding|nr_003)"],
            'b': ["foo",
                  "eff=non_synonymous_coding(moderate|missense|gc",
                  "non_synonymous_coding(moderate|missense|acg/at)",
                  "psc=0.441",
                  "bsc=884"],
            'c': ["bar",
                  "bar",
                  "eff=synonymous(modifier|||||dkfzp434l192||coding",
                  "eff=synonymous(modifier|||||dkfzp434l192||coding",
                  "eff=synonymous_coding(low|silent|tcc/tcg|s782|"],
            'd': ["eff=synonymous_coding(low|silent|tcc/tcg|s782|",
                  "0:0",
                  "0:0",
                  "eff=synonymous_coding(low|silent|tcc/tcg|s782|",
                  "eff=synonymous_coding(low|silent|tcc/tcg|s782|"],
        }
        return pd.DataFrame(grid)


    def get_masks(df):
        non_syn = pd.DataFrame(index=df.index, columns=df.columns)
        synonymous = pd.DataFrame(index=df.index, columns=df.columns)
        for i in df:
            is_non_syn = df[i].str.contains("non_synonymous_coding", na=False)
            non_syn[i] = is_non_syn
            # "synonymous_coding" is a substring of "non_synonymous_coding",
            # so cells already flagged as non-synonymous are excluded here
            synonymous[i] = df[i].str.contains("synonymous_coding", na=False) & ~is_non_syn
        return non_syn, synonymous


    def count_unique_truths(mask):
        # count each row at most once, however many of its columns match
        return int(mask.any(axis=1).sum())


    def main():
        df = create_df()
        non_syn, synonymous = get_masks(df)
        non_syn_count = count_unique_truths(non_syn)
        synonymous_count = count_unique_truths(synonymous)
        print(df)
        print("synonymous count = {:d}\nnon_synonymous count = {:d}".format(
            synonymous_count, non_syn_count))


    if __name__ == '__main__':
        main()
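For the specific columns named in the question, counting each row at most once can be sketched compactly as follows. This is only a minimal sketch and not part of the original answer: the helper name `count_rows_with`, the tiny `sample` frame, and its three column labels are hypothetical stand-ins for the real data. It assumes the columns may hold numbers or NaN alongside strings, as the samples in the question suggest, and it uses a negative lookbehind so that "synonymous_coding" is not matched inside "non_synonymous_coding".

```python
import numpy as np
import pandas as pd


def count_rows_with(df, columns, pattern):
    """Count rows where `pattern` matches in at least one of `columns`."""
    # Cast to str so numeric cells and NaN do not break the .str accessor.
    sub = df[columns].astype(str)
    hits = sub.apply(lambda col: col.str.contains(pattern, regex=True))
    # any(axis=1) collapses several matching columns into one hit per row.
    return int(hits.any(axis=1).sum())


# Hypothetical stand-in for a few of the columns from the question.
sample = pd.DataFrame({
    23: ["oanc=c", 114, "eff=non_synonymous_coding(moderate|missense|gc"],
    24: ["non_synonymous_coding(moderate|missense|acg/at", "dp=12", "x"],
    29: ["0:0", np.nan, "eff=synonymous_coding(low|silent|tcc/tcg|s782|"],
})
cols = [23, 24, 29]

NON_SYN = r"non_synonymous_coding"
SYN = r"(?<!non_)synonymous_coding"  # lookbehind excludes the "non_" form

non_syn_rows = count_rows_with(sample, cols, NON_SYN)
syn_rows = count_rows_with(sample, cols, SYN)
print(non_syn_rows, syn_rows)
```

In this sample the last row holds the non-synonymous form in one column and the synonymous form in another, so it is counted once in each category. Whether such a row should count for both categories is a choice the question leaves open.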