C# - Excluding words from dictionary
I am reading through documents and splitting them into words so that each word ends up in a dictionary. How can I exclude some words (like "the/a/an")?
This is my function:
private void Splitter(string[] file)
{
    try
    {
        tempDict = file
            .SelectMany(i => File.ReadAllLines(i)
                .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
                .AsParallel()
                .Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }
    catch (Exception ex)
    {
        Ex(ex);
    }
}
Also, in this scenario, where is the right place to add a .ToLower() call to make all the words from the file lowercase? I was thinking of something like this before the (temp = file ...) statement:
file.ToList().ConvertAll(d => d.ToLower());
Do you want to filter out stop words?
HashSet<string> stopWords = new HashSet<string> { "a", "an", "the" };

...

tempDict = file
    .SelectMany(i => File.ReadAllLines(i)
        .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
        .AsParallel()
        .Select(word => word.ToLower())            // <- lower case
        .Where(word => !stopWords.Contains(word))  // <- no stop words
        .Distinct())
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());
However, this code is only a partial solution: proper names like Berlin will be converted to lower case (berlin), acronyms like KISS (keep it simple, stupid) will become just kiss, and some numbers will be incorrect.
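One possible way to soften the proper-name and acronym problem is to leave the words as they are and make only the comparisons case-insensitive. The sketch below is an assumption, not part of the code above: the `WordCounter` class and `CountWords` helper are hypothetical names, it works on lines already in memory instead of file paths, it drops `AsParallel()` and `Distinct()`, and so it counts every occurrence of a word. The stored key is the first spelling encountered.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Stop words are matched case-insensitively, so "The" and "the" are both
    // dropped without lowercasing the words that are kept.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "the" };

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Hypothetical helper: counts every occurrence of each non-stop word.
    // Words that differ only by case are grouped together, and the first
    // spelling seen becomes the dictionary key.
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            .Where(word => !StopWords.Contains(word))
            .GroupBy(word => word, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.First(), g => g.Count(), StringComparer.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        var lines = new[] { "The wall in Berlin.", "Is KISS a rule? Berlin!" };

        foreach (var pair in CountWords(lines))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

With this approach Berlin and KISS keep their original casing in the result, while "The" and "a" are still filtered out. It does not help with the number problem, and a word that first appears capitalized at the start of a sentence will be stored in that form.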