optimization - Python: How to optimize calculations?

I'm building a text-mining corpus of words, and I have a text file as output with 3000 lines like these:

dns 11 11 [2, 355, 706, 1063, 3139, 3219, 3471, 3472, 3473, 4384, 4444]

xhtml 8 11 [1651, 2208, 2815, 3487, 3517, 4480, 4481, 4504]

javascript 18 18 [49, 50, 175, 176, 355, 706, 1063, 1502, 1651, 2208, 2280, 2815, 3297, 4068, 4236, 4480, 4481, 4504]

Each line holds the word, the number of lines it appears in, its total number of appearances, and the numbers of those lines.

I'm trying to calculate the chi-squared value, with that text file as the input to my code below:
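For reference, a minimal sketch of reading one line in that format. It assumes the word itself contains no whitespace and no "[" character; parse_line is a hypothetical helper, not part of the code in this question.

```python
import ast

def parse_line(line):
    """Split one corpus line into (word, n_lines, n_occurrences, line_numbers)."""
    # Everything before the first "[" is "word n_lines n_occurrences".
    head, _, tail = line.partition("[")
    word, n_lines, n_occurrences = head.split()
    # The rest is a Python-style list literal of integers.
    line_numbers = ast.literal_eval("[" + tail.strip())
    return word, int(n_lines), int(n_occurrences), line_numbers

sample = "xhtml 8 11 [1651, 2208, 2815, 3487, 3517, 4480, 4481, 4504]"
word, n_lines, n_occurrences, line_numbers = parse_line(sample)
print(word, n_lines, n_occurrences, len(line_numbers))
```

Parsing the numbers into integers up front avoids comparing string tokens later on.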

    import nltk

    measure = nltk.collocations.BigramAssocMeasures()

    dicto = {}
    for i in lines:
        tokens = nltk.wordpunct_tokenize(i)
        m = tokens[0]       # m is the word
        list_i = tokens[4:]
        list_i.pop()        # drop the closing "]"
        for x in list_i:
            if x == ',':
                ind = list_i.index(x)
                list_i.pop(ind)
        dicto[m] = list_i   # for each word I create a dictionary entry with the n° of its lines

    # for each word I calculate the chi-squared with every other word
    # and my problem is starting right here, I think:
    # the "for" loop and the z = .....

    for word1 in dicto:
        x = dicto[word1]
        vector = []

        for word2 in dicto:
            y = dicto[word2]
            z = [val for val in x if val in y]

            # contingency matrix
            # (cpt is not defined in this excerpt; presumably the total number of lines)
            m11 = cpt - (len(x) + len(y) - len(z))
            m12 = len(x) - len(z)
            m21 = len(y) - len(z)
            m22 = len(z)

            n_ii = m11
            n_ix = m11 + m21
            n_xi = m11 + m12
            n_xx = m11 + m12 + m21 + m22

            chi_squared = measure.chi_sq(n_ii, (n_ix, n_xi), n_xx)

            # I compare with the minimum value to check independence between the words
            if chi_squared > 3.841:
                vector.append([word1, word2, round(chi_squared, 3)])

        # the correlations are calculated
        # I sort my vector in descending order
        final = sorted(vector, key=lambda vector: vector[2], reverse=True)

        print(word1)
        # I take the 4 best scores
        for i in final[:4]:
            print(i, end=' ')

My problem is that this calculation takes a lot of time (I'm talking hours!). Is there anything I can change to improve my code? Other Python structures? Any ideas?

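There are a few opportunities for a speedup, but my first concern is vector. Where is it initialized? If it is not reset for each word1, it accumulates up to n^2 entries and gets sorted n times, which seems unintentional. Should it be cleared on each iteration? Should final be computed outside the loop?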
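One such opportunity, sketched under assumptions: the overlap z is built by scanning the list y once for every element of x. If each word's line numbers are stored as a set, the overlap becomes a single set intersection, and because the chi-squared score of a pair does not depend on the order of the two words, each unordered pair only needs to be visited once. The names line_sets and overlap_counts and the toy data are hypothetical.

```python
from itertools import combinations

def overlap_counts(line_sets):
    """Return {(word1, word2): number of shared lines} for each unordered pair."""
    counts = {}
    # combinations() yields every pair once, so (a, b) is not recomputed as (b, a).
    for w1, w2 in combinations(sorted(line_sets), 2):
        # Set intersection replaces the nested list scan.
        counts[(w1, w2)] = len(line_sets[w1] & line_sets[w2])
    return counts

# Toy data: word -> set of line numbers (integers, not string tokens).
line_sets = {
    "dns": {2, 355, 706, 1063},
    "javascript": {49, 355, 706, 1063, 1502},
    "xhtml": {1651, 2208},
}
counts = overlap_counts(line_sets)
print(counts[("dns", "javascript")])
```

The shared-line count plays the role of len(z) in the contingency matrix; the remaining cells follow from len() of each set and the total number of lines.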

    final=sorted(vector, key=lambda vector: vector[2],reverse = True)

is functional, but has ugly scoping: the lambda parameter shadows the outer vector. Better is:

    final = sorted(vector, key=lambda entry: entry[2], reverse=True)
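Since only the 4 best scores are printed, a full sort is not strictly needed. A possible alternative, shown here as a sketch with made-up scores, is heapq.nlargest from the standard library together with operator.itemgetter as the key:

```python
import heapq
from operator import itemgetter

# Hypothetical [word1, word2, score] entries, for illustration only.
vector = [
    ["dns", "xhtml", 4.2],
    ["dns", "javascript", 57.913],
    ["dns", "css", 12.5],
    ["dns", "http", 30.0],
    ["dns", "php", 8.1],
]

# Keep only the 4 entries with the highest score (index 2), in descending order.
best = heapq.nlargest(4, vector, key=itemgetter(2))
for entry in best:
    print(entry)
```

For a handful of entries per word the difference from sorted(...)[:4] is small; whether it matters here is something to measure rather than assume.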

In general, to solve timing issues, consider using a profiler.
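As a rough illustration, the standard library's cProfile and pstats modules can show where the time goes. The function slow_overlap below only mimics the list-scan pattern from the question, and profile_call is a hypothetical helper; neither is taken from the original code.

```python
import cProfile
import io
import pstats

def slow_overlap(x, y):
    # Same pattern as in the question: y is scanned once for every element of x.
    return [val for val in x if val in y]

def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    stream = io.StringIO()
    # Show the 5 entries with the largest cumulative time.
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()

result, report = profile_call(
    slow_overlap, list(range(0, 2000, 2)), list(range(0, 2000, 3))
)
print(report)
```

For a whole script, running python -m cProfile -s cumulative on the script file gives the same kind of report without changing the code.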

