python - Scrapy error: HTTP status code is not handled or not allowed
When I try to run my spider, I get the following log:
    2015-05-15 12:44:43+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
    2015-05-15 12:44:43+0100 [scrapy] INFO: Optional features available: ssl, http11
    2015-05-15 12:44:43+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'reviews'}
    2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-05-15 12:44:43+0100 [scrapy] INFO: Enabled item pipelines:
    2015-05-15 12:44:43+0100 [theverge] INFO: Spider opened
    2015-05-15 12:44:43+0100 [theverge] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-05-15 12:44:43+0100 [scrapy] ERROR: Error caught on signal handler: <bound method ?.start_listening of <scrapy.telnet.TelnetConsole instance at 0x105127b48>>
        Traceback (most recent call last):
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
            result = g.send(result)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/core/engine.py", line 77, in start
            yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
            return signal.send_catch_log_deferred(*a, **kw)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
            *arguments, **named)
        --- <exception caught here> ---
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 140, in maybeDeferred
            result = f(*args, **kw)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
            return receiver(*arguments, **named)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/telnet.py", line 47, in start_listening
            self.port = listen_tcp(self.portrange, self.host, self)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/reactor.py", line 14, in listen_tcp
            return reactor.listenTCP(x, factory, interface=host)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 495, in listenTCP
            p.startListening()
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/tcp.py", line 984, in startListening
            raise CannotListenError(self.interface, self.port, le)
        twisted.internet.error.CannotListenError: Couldn't listen on 127.0.0.1:6073: [Errno 48] Address in use.
This is the first error; it recently started appearing in my spiders, although the other spiders still work anyway. After the "[Errno 48] Address in use." error, the log continues:
    2015-05-15 12:44:43+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6198
    2015-05-15 12:44:44+0100 [theverge] DEBUG: Crawled (403) <GET http://www.theverge.com/reviews> (referer: None)
    2015-05-15 12:44:44+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed
    2015-05-15 12:44:44+0100 [theverge] INFO: Closing spider (finished)
    2015-05-15 12:44:44+0100 [theverge] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 191,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 265,
         'downloader/response_count': 1,
         'downloader/response_status_count/403': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 5, 15, 11, 44, 44, 136026),
         'log_count/DEBUG': 3,
         'log_count/ERROR': 1,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 5, 15, 11, 44, 43, 829689)}
    2015-05-15 12:44:44+0100 [theverge] INFO: Spider closed (finished)
    2015-05-15 12:44:44+0100 [scrapy] ERROR: Error caught on signal handler: <bound method ?.stop_listening of <scrapy.telnet.TelnetConsole instance at 0x105127b48>>
        Traceback (most recent call last):
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
            result = g.send(result)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/core/engine.py", line 300, in _finish_stopping_engine
            yield self.signals.send_catch_log_deferred(signal=signals.engine_stopped)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
            return signal.send_catch_log_deferred(*a, **kw)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
            *arguments, **named)
        --- <exception caught here> ---
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 140, in maybeDeferred
            result = f(*args, **kw)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
            return receiver(*arguments, **named)
          File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/telnet.py", line 53, in stop_listening
            self.port.stopListening()
        exceptions.AttributeError: TelnetConsole instance has no attribute 'port'
The error "exceptions.AttributeError: TelnetConsole instance has no attribute 'port'" is new to me... I don't understand what is happening, since my other spiders for other websites work well.

Can anyone tell me how to fix this?
Edit:

After a reboot the errors disappeared, but I still cannot crawl with this spider... Here are the logs now:
    2015-05-15 15:46:55+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: reviews)
    2015-05-15 15:46:55+0100 [scrapy] INFO: Optional features available: ssl, http11
    2015-05-15 15:46:55+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'reviews.spiders', 'SPIDER_MODULES': ['reviews.spiders'], 'DOWNLOAD_DELAY': 2, 'BOT_NAME': 'reviews'}
    2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-05-15 15:46:55+0100 [scrapy] INFO: Enabled item pipelines:
    2015-05-15 15:46:55+0100 [theverge] INFO: Spider opened
    2015-05-15 15:46:55+0100 [theverge] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-05-15 15:46:55+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-05-15 15:46:55+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-05-15 15:46:56+0100 [theverge] DEBUG: Crawled (403) <GET http://www.theverge.com/reviews> (referer: None)
    2015-05-15 15:46:56+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed
    2015-05-15 15:46:56+0100 [theverge] INFO: Closing spider (finished)
    2015-05-15 15:46:56+0100 [theverge] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 191,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 265,
         'downloader/response_count': 1,
         'downloader/response_status_count/403': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 5, 15, 14, 46, 56, 8769),
         'log_count/DEBUG': 4,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 5, 15, 14, 46, 55, 673723)}
    2015-05-15 15:46:56+0100 [theverge] INFO: Spider closed (finished)
The line "2015-05-15 15:46:56+0100 [theverge] DEBUG: Ignoring response <403 http://www.theverge.com/reviews>: HTTP status code is not handled or not allowed" is strange, since I am using DOWNLOAD_DELAY = 2, and last week I crawled this website with no problems... What can be happening?
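(As a debugging aid: by default Scrapy's HttpErrorMiddleware drops any non-2xx response before it reaches the spider, which is exactly the "not handled or not allowed" message above. A minimal sketch of letting the 403 through so its body can be inspected, using standard Scrapy settings in the project's `settings.py`; the user-agent string is an illustrative assumption:)

```python
# settings.py -- debugging sketch, not a fix.

# Let 403 responses reach the spider callback (instead of being
# dropped by HttpErrorMiddleware) so their body can be inspected;
# a ban page or "robot check" notice is often visible there.
HTTPERROR_ALLOWED_CODES = [403]

# Many sites return 403 to the default "Scrapy/<version>" user agent;
# overriding it with a browser-like string (placeholder below) is a
# common first thing to try.
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36')
```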
"Address in use" suggests that something else is already listening on that port. Are you running spiders in parallel? The second error is a consequence of the first: because the telnet console could not bind its port, it never set the `port` attribute, so it cannot find it to close it on shutdown.
I suggest you reboot to make sure no ports are still in use, and then run a single spider to see if it works. If it happens again, you can investigate which application is holding the port with netstat or a similar tool.
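If the telnet console or web service itself is what keeps colliding, both can be moved or disabled through standard Scrapy settings in `settings.py` (a sketch; the port numbers below are arbitrary):

```python
# settings.py -- avoid the telnet/webservice port collisions.

# Disable the telnet console entirely...
TELNETCONSOLE_ENABLED = False
# ...or keep it but bind inside a different [min, max] port range:
# TELNETCONSOLE_PORT = [6123, 6173]

# The web service extension binds a port too and can be
# switched off the same way if it is not being used.
WEBSERVICE_ENABLED = False
```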
Update: an HTTP 403 Forbidden error most likely means you have been banned by the site for making too many requests. To get around this, use a proxy server. Check out Scrapy's HttpProxyMiddleware.
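HttpProxyMiddleware picks up a proxy from the `http_proxy` environment variable, and it honours an existing `request.meta['proxy']`, so one way to set a proxy per request is a tiny custom downloader middleware. A sketch under those assumptions; the class name, module path, and proxy URL below are illustrative placeholders:

```python
# middlewares.py (hypothetical module inside the project)

class ProxyMiddleware(object):
    """Tag every outgoing request with a proxy URL.

    Scrapy's built-in HttpProxyMiddleware leaves request.meta['proxy']
    alone when it is already set, so this middleware just has to run
    before it in the downloader middleware chain.
    """

    PROXY_URL = 'http://user:pass@proxy.example.com:8080'  # placeholder

    def process_request(self, request, spider):
        # Only set the proxy if the request does not carry one already.
        request.meta.setdefault('proxy', self.PROXY_URL)
```

It would then be enabled in `settings.py` with something like `DOWNLOADER_MIDDLEWARES = {'reviews.middlewares.ProxyMiddleware': 100}` (the module path assumes the project layout sketched above).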