optimization - Python: How to optimize calculations?
I'm doing text mining on a corpus of words, and I have a text file output of 3000 lines like these:
dns 11 11 [2, 355, 706, 1063, 3139, 3219, 3471, 3472, 3473, 4384, 4444]
xhtml 8 11 [1651, 2208, 2815, 3487, 3517, 4480, 4481, 4504]
javascript 18 18 [49, 50, 175, 176, 355, 706, 1063, 1502, 1651, 2208, 2280, 2815, 3297, 4068, 4236, 4480, 4481, 4504]
Each line has the word, the number of lines in which it appears, its total number of appearances, and the numbers of those lines.
I'm trying to calculate the chi-squared value, and that text file is the input for my code below:
    measure = nltk.collocations.BigramAssocMeasures()
    dicto = {}
    for i in lines:
        tokens = nltk.wordpunct_tokenize(i)
        m = tokens[0]  # m is the word
        list_i = tokens[4:]
        list_i.pop()
        for x in list_i:
            if x == ',':
                ind = list_i.index(x)
                list_i.pop(ind)
        dicto[m] = list_i  # for each word I create a dictionary entry with the n° of its lines

    # for each word I calculate the chi-squared with every other word
    # and my problem is starting right here, I think:
    # the "for" loop and the z = .....
    for word1 in dicto:
        x = dicto[word1]
        vector = []
        for word2 in dicto:
            y = dicto[word2]
            z = [val for val in x if val in y]
            # contingency matrix
            m11 = cpt - (len(x) + len(y) - len(z))
            m12 = len(x) - len(z)
            m21 = len(y) - len(z)
            m22 = len(z)
            n_ii = m11
            n_ix = m11 + m21
            n_xi = m11 + m12
            n_xx = m11 + m12 + m21 + m22
            chi_squared = measure.chi_sq(n_ii, (n_ix, n_xi), n_xx)
            # I compare with the minimum value to check the independence between words
            if chi_squared > 3.841:
                vector.append([word1, word2, round(chi_squared, 3)])
        # the correlations are calculated
        # I sort the vector in descending order
        final = sorted(vector, key=lambda vector: vector[2], reverse=True)
        print word1
        # I take the 4 best scores
        for i in final[:4]:
            print i,
My problem is that the calculation is taking a very long time (I'm talking about hours!). Is there anything I can change to improve my code? Other Python structures? Any ideas?
There are a few opportunities for speedup, but my first concern is vector. Where is it initialized? In the code posted, it gets n^2 entries and is sorted n times! That seems unintentional. Should it be cleared? Should final be outside the loop?
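One such opportunity, as a rough sketch: if each word's line numbers are stored as a set instead of a list, the overlap z becomes a set intersection rather than a scan of y for every value in x. The helper name contingency_counts and the total of 5000 lines are assumptions made for illustration (total stands in for cpt, which the question does not show); the line numbers come from the sample lines in the question.

```python
def contingency_counts(x, y, total):
    """Return (m11, m12, m21, m22) as in the question, for two sets of line numbers.

    `total` stands in for `cpt` from the question (assumed to be the total
    number of lines in the corpus).
    """
    both = len(x & y)  # lines containing both words, via set intersection
    m22 = both
    m12 = len(x) - both
    m21 = len(y) - both
    m11 = total - (len(x) + len(y) - both)
    return m11, m12, m21, m22


# Line numbers taken from the sample lines in the question, stored as sets.
dicto = {
    "xhtml": {1651, 2208, 2815, 3487, 3517, 4480, 4481, 4504},
    "javascript": {49, 50, 175, 176, 355, 706, 1063, 1502, 1651, 2208, 2280,
                   2815, 3297, 4068, 4236, 4480, 4481, 4504},
}

# total=5000 is a made-up corpus size, used only to make the example concrete.
counts = contingency_counts(dicto["xhtml"], dicto["javascript"], total=5000)
```

Building the sets once, when dicto is filled, keeps the conversion out of the inner loop.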
    final = sorted(vector, key=lambda vector: vector[2], reverse=True)
is functional, but has ugly scoping, because the lambda parameter shadows the list being sorted. Better is:
    final = sorted(vector, key=lambda entry: entry[2], reverse=True)
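Since only the four best scores are printed, a full sort may not be needed at all. A small sketch of one alternative, using heapq.nlargest with operator.itemgetter as the key; the entries in vector below are made-up values for illustration, not results from the question's data.

```python
import heapq
from operator import itemgetter

# Made-up [word1, word2, score] entries, for illustration only.
vector = [
    ["dns", "xhtml", 4.2],
    ["dns", "javascript", 57.914],
    ["dns", "css", 12.5],
    ["dns", "html", 9.1],
    ["dns", "php", 30.0],
]

# Keep only the four highest-scoring entries, highest first.
best = heapq.nlargest(4, vector, key=itemgetter(2))
```

This gives the same entries as sorted(vector, key=itemgetter(2), reverse=True)[:4]. Whether it is noticeably faster depends on how long vector gets, so it is worth measuring before adopting it.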
In general, to solve timing issues, consider using a profiler.
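As a minimal sketch of what that could look like with the standard library's cProfile and pstats modules (written for Python 3): slow_overlap and run are hypothetical stand-ins for the question's inner loop, and the ranges are arbitrary test data.

```python
import cProfile
import io
import pstats


def slow_overlap(x, y):
    # Same list-based overlap as in the question: each `in y` scans the list.
    return [val for val in x if val in y]


def run():
    a = list(range(0, 2000, 2))  # arbitrary test data
    b = list(range(0, 2000, 3))
    return len(slow_overlap(a, b))


# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
result = run()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

The report lists each function with its call count and time, which shows where the hours actually go before any code is rewritten.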