lucene - Apache Nutch not indexing all documents to Apache Solr
I am using Apache Nutch 2.3 (the latest version). I have crawled 49,000 documents with Nutch. According to a MIME-type analysis of the crawled data, about 45,000 of them are text/html documents. When I looked at the indexed documents in Solr (4.10.3), only 14,000 documents had been indexed. Why is there such a huge difference (45,000 - 14,000 = 31,000)? Even if I assume Nutch indexes only text/html documents, at least 45,000 documents should have been indexed.
What is the problem, and how can I solve it?
In my case, the problem was due to missing Solr indexer information in nutch-site.xml. Once I updated the config, the problem was resolved. Please check the crawler log at the indexing step. In my case, it reported that no Solr indexer plugin was found.
I added the following property to nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>plugin details here </description>
</property>
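To confirm that the property is actually present in your config before re-running the indexing step, a quick check like the one below may help. It is only a sketch: the helper names `get_plugin_includes` and `has_solr_indexer` are made up for illustration, and the plain substring test is a rough heuristic, since `plugin.includes` is a regular expression and could enable the plugin through a differently written pattern.

```python
import xml.etree.ElementTree as ET


def get_plugin_includes(xml_text):
    """Return the plugin.includes value from nutch-site.xml text, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "plugin.includes":
            return (prop.findtext("value") or "").strip()
    return None


def has_solr_indexer(xml_text):
    """Heuristic: True if plugin.includes literally mentions indexer-solr."""
    value = get_plugin_includes(xml_text)
    return value is not None and "indexer-solr" in value


# Sample config using the property value from the answer above.
SAMPLE_CONFIG = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
    <description>plugin details here </description>
  </property>
</configuration>"""

if __name__ == "__main__":
    print("indexer-solr enabled:", has_solr_indexer(SAMPLE_CONFIG))
```

To run it against a real installation, read the contents of your own nutch-site.xml (typically under the Nutch conf directory, though the location depends on your setup) and pass the text to `has_solr_indexer`.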