lucene - Apache Nutch not indexing all documents to Apache Solr


I am using Apache Nutch 2.3 (the latest version). I have crawled 49,000 documents with Nutch. According to a MIME-type analysis of the documents, the crawled data contains 45,000 text/html documents. But when I looked at the indexed documents in Solr (4.10.3), only 14,000 documents had been indexed. Why is there such a huge difference (45,000 - 14,000 = 31,000)? Even if I assume that Nutch indexes only text/html documents, at least 45,000 documents should have been indexed.

What is the problem, and how can I solve it?

In my case, the problem was due to missing Solr indexer information in nutch-site.xml. When I updated the config, the problem was resolved. Please check the crawler log at the indexing step. In my case, it reported that no Solr indexer plugin was found.

The following property was added to nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>plugin details here</description>
</property>
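As a quick sanity check before re-running the indexing step, something like the following can confirm that the Solr indexer is actually covered by plugin.includes. This is only a minimal sketch: the helper names (plugin_includes, plugin_enabled) and the inline SAMPLE config are made up for illustration, and it assumes Nutch treats the value as a regular expression matched against the whole plugin id, with Python's re module standing in for Java's regex engine.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical inline copy of the relevant part of nutch-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
</configuration>"""


def plugin_includes(xml_text):
    """Return the plugin.includes value from a Hadoop-style config, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "plugin.includes":
            return (prop.findtext("value") or "").strip()
    return None


def plugin_enabled(includes, plugin_id):
    """Assumption: a plugin is active when the whole id matches the regex."""
    return includes is not None and re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    value = plugin_includes(SAMPLE)
    for plugin in ("indexer-solr", "index-basic", "parse-tika"):
        print(plugin, "enabled" if plugin_enabled(value, plugin) else "MISSING")
```

To check a real setup, read your own nutch-site.xml and pass its contents in place of SAMPLE. If indexer-solr is reported as missing, that would be consistent with the "no Solr indexer plugin found" message in the crawler log.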
