Python: How to optimize comparison between two large sets?
Greetings! I'm new here, and I've got a little problem trying to optimize part of my code.
I'm reading 2 files:

corpus.txt -----> contains the text (about 1,000,000 words)
stop_words.txt -----> contains the stop list (about 4,000 words)
I must compare each word of the corpus against every word in the stop list, because I want the text without stop words. So I have 1,000,000 * 4,000 comparisons with the code below:
import nltk

fich = open("corpus.txt", "r")
text = fich.readlines()
fich1 = open("stop_words.txt", "r")
stop = fich1.read()
tokens_stop = nltk.wordpunct_tokenize(stop)
tokens_stop = sorted(set(tokens_stop))
for line in text:
    tokens_rm = nltk.wordpunct_tokenize(line)
    z = [val for val in tokens_rm if val not in tokens_stop]
    for elem in z:
        print(elem)
My question: can this be done differently? Is there a data structure that would optimize it?
You can create a set of stop words, then for every word in the text check whether it is in the set.
Actually, it looks like you are already using a set, though I don't know why you are sorting it. Calling sorted() on the set turns it back into a list, so each `val not in tokens_stop` test scans the list instead of doing a constant-time set lookup.
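A minimal sketch of the set-based filtering (the nltk tokenization is replaced with a plain `split()` here so the example is self-contained; the sample words are hypothetical):

```python
# Build the stop-word set once; membership tests on a set are O(1) on
# average, so filtering ~1,000,000 tokens costs ~1,000,000 lookups
# instead of 1,000,000 * 4,000 list comparisons.
stop_words = {"the", "a", "and", "of"}  # in practice: set(tokens_stop)

def remove_stop_words(tokens, stop_set):
    """Return the tokens that are not in stop_set."""
    return [tok for tok in tokens if tok not in stop_set]

tokens = "the cat and the dog of paris".split()
print(remove_stop_words(tokens, stop_words))  # ['cat', 'dog', 'paris']
```

The key point is simply to drop the `sorted()` call: keep `tokens_stop` as a set and the existing list comprehension becomes fast without any other change.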