scrapy 503 Service Unavailable on starturl - python

I modified this spider, but it gives these errors:
Gave up retrying <GET https://lib.maplelegends.com/robots.txt> (failed 3 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/robots.txt> (referer: None)
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 1 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 2 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 3 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/?p=etc&id=4004003> (referer: None)
2019-01-06 23:43:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://lib.maplelegends.com/?p=etc&id=4004003>: HTTP status code is not handled or not allowed
Crawler code:
#!/usr/bin/env python3

import scrapy
import time

start_url = 'https://lib.maplelegends.com/?p=etc&id=4004003'


class MySpider(scrapy.Spider):
    name = 'MySpider'

    start_urls = [start_url]

    def parse(self, response):
        # print('url:', response.url)
        products = response.xpath('.//div[@class="table-responsive"]/table/tbody')
        for product in products:
            item = {
                #'name': product.xpath('./tr/td/b[1]/a/text()').extract(),
                'link': product.xpath('./tr/td/b[1]/a/@href').extract(),
            }
            # url = response.urljoin(item['link'])
            # yield scrapy.Request(url=url, callback=self.parse_product, meta={'item': item})
            yield response.follow(item['link'], callback=self.parse_product, meta={'item': item})

        time.sleep(5)

        # execute with low
        yield scrapy.Request(start_url, dont_filter=True, priority=-1)

    def parse_product(self, response):
        # print('url:', response.url)
        # name = response.xpath('(//strong)[1]/text()').re(r'(\w+)')
        hp = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "image", " " ))] | //img').re(r':(\d+)')
        scrolls = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "image", " " ))] | //strong+//a//img/@title').re(r'\bScroll\b')
        for price, hp, scrolls in zip(name, hp, scrolls):
            yield {'name': name.strip(), 'hp': hp.strip(), 'scroll': scrolls.strip()}
--- it runs without a project and saves its results in output.csv ---
from scrapy.crawler import CrawlerRunner


def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)  # return Deferred


def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred
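For reference, a minimal stand-alone run that writes output.csv could look like the following sketch; CrawlerProcess and the FEED_URI/FEED_FORMAT settings are assumptions here, not taken from the original script:
from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess(settings={
        'FEED_URI': 'output.csv',   # write scraped items to output.csv
        'FEED_FORMAT': 'csv',
    })
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes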

Robots.txt
Your crawler is trying to fetch the robots.txt file, but the website doesn't have one present.
To avoid this you can set the ROBOTSTXT_OBEY setting to False in your settings.py file.
By default it's False, but new Scrapy projects generated with the scrapy startproject command have ROBOTSTXT_OBEY = True in the generated template.
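If you are running the spider stand-alone, as in the script above, the same setting can be applied per spider via custom_settings; a minimal sketch:
class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [start_url]

    # Don't fetch or obey robots.txt for this spider
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }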
503 responses
Further, the website seems to respond with 503 to every first request. The website is using some sort of bot protection:
the first request gets a 503, then some JavaScript is executed to make an AJAX request that generates a __shovlshield cookie.
It looks like https://shovl.io/ DDoS protection is being used.
To solve this you need to reverse engineer how the JavaScript generates the cookie, or employ JavaScript rendering techniques/services such as Selenium or Splash.
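For example, with the scrapy-splash plugin this could look roughly like the sketch below. It assumes a Splash instance is running and that the plugin's middlewares and SPLASH_URL are configured as described in its README, and it reuses the start_url defined in the question; whether rendering alone is enough to obtain the __shovlshield cookie on this site is untested.
import scrapy
from scrapy_splash import SplashRequest

class MySplashSpider(scrapy.Spider):
    name = 'my_splash_spider'

    def start_requests(self):
        # Render the page in Splash so the site's JavaScript can run
        # (and, ideally, set the protection cookie) before parsing.
        yield SplashRequest(start_url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        self.logger.info('rendered %s (%d bytes)', response.url, len(response.body))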

Related

How can I implement custom proxy on Scrapy?

I'm trying to implement a custom scraper API, but as a beginner I think I'm doing it wrong, even though I followed their documentation to set everything up. Here is a documentation link.
from scrapy import Spider
from scrapy.http import Request
from .config import API
from scraper_api import ScraperAPIClient

client = ScraperAPIClient(API)


class GlassSpider(Spider):
    name = 'glass'
    allowed_domains = ['glassdoor.co.uk']
    start_urls = [client.scrapyGet(url='https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1')]

    def parse(self, response):
        jobs = response.xpath('//*[contains(@class, "react-job-listing")]')
        for job in jobs:
            job_url = job.xpath('.//*[contains(@class, "jobInfoItem jobTitle")]/@href').extract_first()
            absulate_job_url = response.urljoin(job_url)
            yield Request(client.scrapyGet(url=absulate_job_url),
                          callback=self.parse_jobpage,
                          meta={
                              "Job URL": absulate_job_url
                          })

    def parse_jobpage(self, response):
        absulate_job_url = response.meta.get('Job URL')
        job_description = "".join(line for line in response.xpath('//*[contains(@class, "desc")]//text()').extract())
        yield {
            "Job URL": absulate_job_url,
            "Job Description": job_description
        }
That's the output I'm receiving. What's wrong with my code? Please fix it for me, so I can follow along and get the point. Thank you.
2020-10-01 23:01:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.glassdoor.co.uk%2FJob%2Frussian-jobs-SRCH_KE0%2C7.htm%3FfromAge%3D1&api_key=bec9dd9f2be095dfc6158a7e609&scraper_sdk=python> (referer: None)
2020-10-01 23:01:45 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
I'm not familiar with this particular lib, but from your execution logs the issue is that your request is being filtered, since it's considered offsite:
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
Since ScraperAPI makes your request go through their domain, and that's outside of what you defined in your allowed_domains, it's filtered as an offsite request. To avoid this issue you can remove this line entirely:
allowed_domains = ['glassdoor.co.uk']
or include 'api.scraperapi.com' in it.
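For example, a sketch of the second option (everything else in the spider stays the same):
class GlassSpider(Spider):
    name = 'glass'
    # Allow both the target site and the proxy's domain, so requests routed
    # through ScraperAPI are not dropped by the offsite middleware.
    allowed_domains = ['glassdoor.co.uk', 'api.scraperapi.com']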

Issue with scrapy spider

I am trying to get volume-weighted average prices for stocks from the moneycontrol.com website. The parse function runs without any issues, but the parse_links function is not getting called. Am I missing something here?
# -*- coding: utf-8 -*-
import scrapy


class MoneycontrolSpider(scrapy.Spider):
    name = "moneycontrol"
    allowed_domains = ["https://www.moneycontrol.com"]
    start_urls = ["https://www.moneycontrol.com/india/stockpricequote"]

    def parse(self, response):
        for link in response.css('td.last > a::attr(href)').extract():
            if(link):
                yield scrapy.Request(link, callback=self.parse_links, method='GET')

    def parse_links(self, response):
        VWAP = response.xpath('//*[@id="n_vwap_val"]/text()').extract_first()
        print(VWAP)
        with open('quotes.txt', 'a+') as f:
            f.write('VWAP: {}'.format(VWAP) + '\n')
If you read the log output, the error becomes obvious.
2018-09-08 19:52:38 [py.warnings] WARNING: c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py:59: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.moneycontrol.com in allowed_domains.
warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_domains." % domain, URLWarning)
2018-09-08 19:52:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-08 19:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.moneycontrol.com/india/stockpricequote> (referer: None)
2018-09-08 19:52:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.moneycontrol.com': <GET http://www.moneycontrol.com/india/stockpricequote/chemicals/aartiindustries/AI45>
So just fix your allowed_domains, and you should be fine:
allowed_domains = ["moneycontrol.com"]
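Put together, only the allowed_domains line changes; the rest of the spider from the question stays as-is:
class MoneycontrolSpider(scrapy.Spider):
    name = "moneycontrol"
    # Domains only -- no scheme, no URL path
    allowed_domains = ["moneycontrol.com"]
    start_urls = ["https://www.moneycontrol.com/india/stockpricequote"]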

Python scrapy SitemapSpider callbacks not being called

I read the documentation on the SitemapSpider class over here: https://scrapy.readthedocs.io/en/latest/topics/spiders.html#sitemapspider
Here's my code:
class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
    # if I comment this out, then the parse function should be called by default for every link, but it doesn't
    sitemap_rules = [('/Product', 'parse_product_url'), ('product', 'parse_product_url')]
    sitemap_follow = ['/newegg_sitemap_product', '/Product']

    def parse(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse " + response.url)
        self.this_function_does_not_exist()
        yield Request(response.url, callback=self.some_callback)

    def some_callback(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from some_callback " + response.url)
        self.this_function_does_not_exist()

    def parse_product_url(self, response):
        with open("/home/dan/debug/newegg_crawler.log ", "a") as log:
            log.write("logging from parse_product_url" + response.url)
        self.this_function_does_not_exist()
This can be run successfully with scrapy installed.
Run pip install scrapy to get scrapy and execute with scrapy crawl newegg from the working directory.
My question is, why aren't any of these callbacks being called? The documentation claims that the callback defined in sitemap_rules should be called. If I comment it out, then parse() should be called by default, but it still doesn't get called. Are the docs just 100% wrong? I'm checking the log file that I set up, and nothing is being written. I've even set the permissions on the file to 777. Also, I'm calling a non-existent function, which should cause an error, to prove that the functions are not being called, but no error occurs. What am I doing wrong?
When I run your spider, this is what I get on the console:
$ scrapy runspider op.py
2016-11-09 21:34:51 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 21:34:51 [scrapy] INFO: Spider opened
2016-11-09 21:34:51 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 21:34:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 21:34:51 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 21:34:53 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 21:34:53 [scrapy] ERROR: Spider error processing <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
Traceback (most recent call last):
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spiders/sitemap.py", line 44, in _parse_sitemap
s = Sitemap(body)
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/sitemap.py", line 17, in __init__
rt = self._root.tag
AttributeError: 'NoneType' object has no attribute 'tag'
You've probably noticed the AttributeError exception.
So Scrapy is saying it has trouble parsing the sitemap response body.
If Scrapy cannot understand the sitemap content, it cannot parse it as XML, hence cannot follow any <loc> URL, and will therefore not call any callback, since it found nothing.
So you've actually found a bug in scrapy (thanks for reporting): https://github.com/scrapy/scrapy/issues/2389
As for the bug itself:
the different sub-sitemaps, e.g. http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz, are sent "on the wire" as .gz files that are gzipped twice, so the HTTP response needs to be gunzipped twice before it can be parsed as XML.
Scrapy does not handle this case, hence the exception printed out.
Here's a basic sitemap spider that tries to double-gunzip responses:
from scrapy.utils.gz import gunzip
import scrapy


class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']

    def parse(self, response):
        self.logger.info('parsing %r' % response.url)

    def _get_sitemap_body(self, response):
        body = super(CurrentHarvestSpider, self)._get_sitemap_body(response)
        self.logger.debug("body[:32]: %r" % body[:32])
        try:
            body_unzipped_again = gunzip(body)
            self.logger.debug("body_unzipped_again[:32]: %r" % body_unzipped_again[:100])
            return body_unzipped_again
        except:
            pass
        return body
And these are the logs showing that newegg's .xml.gz sitemaps indeed need gunzipping twice:
$ scrapy runspider spider.py
2016-11-09 13:10:56 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 13:10:56 [scrapy] INFO: Spider opened
2016-11-09 13:10:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 13:10:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding='
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xda\xef\x1eX\x00\x0bnewegg_sitemap_store01'
2016-11-09 13:10:57 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
2016-11-09 13:10:57 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.newegg.com/Hubs/SubCategory/ID-26> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-11-09 13:10:59 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:59 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xe3\xfa\x1eX\x00\x0bnewegg_sitemap_product'
2016-11-09 13:10:59 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
(...)
2016-11-09 13:11:02 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512> (referer: http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz)
(...)
2016-11-09 13:11:02 [newegg] INFO: parsing 'http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512'
(...)
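For what it's worth, once the double-gunzip workaround is in place, the sitemap_rules callback from the original question should fire as documented. Below is a sketch combining the workaround above with one of the question's rules; it is an illustration, not tested beyond what the logs above show:
from scrapy.utils.gz import gunzip
import scrapy

class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
    sitemap_rules = [('/Product', 'parse_product_url')]

    def parse_product_url(self, response):
        # Called for every product URL found in the (now readable) sitemaps
        self.logger.info('parsing product %r', response.url)

    def _get_sitemap_body(self, response):
        body = super(CurrentHarvestSpider, self)._get_sitemap_body(response)
        try:
            # Sub-sitemaps arrive gzipped twice; unzip once more when possible
            return gunzip(body)
        except Exception:
            return body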

Scrapy after login parsing the url list

I am not very familiar with Python, so please have patience with me.
I have a scrapy crawler that works like it should, but now I need to make a new one, and this time it should crawl a logged-in session.
My spider uses as start_urls a list of URLs obtained from a sitemap; it should make a request to the login form, then, if logged in, start parsing my list...
This is my code so far:
class StockPricesSpider(Spider):
    name = "logged-in"
    allowed_domains = ["example.com"]
    d = strftime("%Y-%m-%d", gmtime())
    start_urls = ['https://www.example.com/customer/account/login/']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'myuser', 'password': 'mypass'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "Invalid login or password." in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            logging.log(logging.INFO, 'Logged in and start parsing')
            return Request("http://www.example.com/", callback=self.parse_products)

    def parse_products(self, response):
        f = open("data/sitemaps/urls04102015.txt")
        start_urls = [url.strip() for url in f.readlines()]
        f.close()
        d = strftime("%Y-%m-%d", gmtime())
        if os.path.exists("data/results/stock_"+d+".csv"):
            os.remove("data/results/stock_"+d+".csv")
        sel = Selector(response)
        separator = ";"
        items = []
        item = MyPrices()
        sku = sel.xpath('.//strong[@itemprop="productID"]/text()').extract()
        logging.log(logging.INFO, sku)
        if len(sku) > 0:
            item['sku'] = "med_" + sel.xpath('.//strong[@itemprop="productID"]/text()').extract()[0].strip()
        ...
        items.append(item)
        return items
So this is not working, because I am not calling the parser correctly.
Basically, I do not get errors, but the URLs do not get parsed either.
The login works, I do succeed in logging in, but after that (after login) how do I do what Scrapy normally does (parsing the list of URLs)?
EDIT
I found a new approach to my problem, but it also does not work properly. Please help me debug this (or the first approach)
class StockPricesSpiderX(InitSpider):
    name = "logged-in"
    allowed_domains = ["example.com"]
    login_page = 'https://www.example.com/ro/customer/account/login/'
    d = strftime("%Y-%m-%d", gmtime())
    f = open("data/sitemaps/urls04102015.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
    if os.path.exists("data/results/stock_"+d+".csv"):
        os.remove("data/results/stock_"+d+".csv")

    def init_request(self):
        """ Called before crawler starts """
        logging.log(logging.INFO, 'before crawler starts...')
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """ Generate login request """
        logging.log(logging.INFO, 'do login...')
        return FormRequest.from_response(response,
                   formdata={'name': 'myuser', 'password': 'mypass'},
                   callback=self.check_login_response)

    def check_login_response(self, response):
        """ Check the response returned by login request to see if we are logged in """
        if "Invalid login or password." in response.body:
            logging.log(logging.INFO, '... BAD LOGIN ...')
        else:
            logging.log(logging.INFO, 'GOOD LOGIN... initialize')
            self.initialized()

    def parse_item(self, response):
        sel = Selector(response)
        separator = ";"
        items = []
        item = StockPrices()
        sku = sel.xpath('.//strong[@itemprop="productID"]/text()').extract()
        logging.log(logging.INFO, sku)
        ...
        items.append(item)
        return items
The log of the execution shows this:
2015-12-03 14:54:16 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-12-03 14:54:16 [scrapy] INFO: Optional features available: ssl, http11
2015-12-03 14:54:16 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'products.spiders', 'FEED_URI': 'calinxautomat.csv', 'LOG_LEVEL': 'INFO', 'DUPEFILTER_CLASS': 'scrapy.dupefilter.RFPDupeFilter', 'SPIDER_MODULES': ['products.spiders'], 'DEFAULT_ITEM_CLASS': 'products.items.Subcategories', 'FEED_FORMAT': 'csv'}
2015-12-03 14:54:21 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-03 14:54:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-03 14:54:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-03 14:54:23 [scrapy] INFO: Enabled item pipelines: myWriteToCsv
2015-12-03 14:54:23 [root] INFO: before crawler starts...
2015-12-03 14:54:23 [scrapy] INFO: Spider opened
2015-12-03 14:54:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-03 14:54:25 [root] INFO: do login...
2015-12-03 14:54:26 [scrapy] INFO: Closing spider (finished)
2015-12-03 14:54:26 [scrapy] INFO: Dumping Scrapy stats:
...
So this one does not seem to get past the login phase... It's like the callback from the FormRequest never fires...
What am I doing wrong?
In parse_products() the assignment to start_urls creates a variable local to that routine, not the class attribute you set up at the top of your spider. In any event, I don't think assigning to start_urls will do what you want; Scrapy won't notice it and won't parse those URLs. What you need to do is queue the new URLs to be parsed:
for url in f.readlines():
    yield Request(url.strip(), callback=self.parse_products)
Update: regarding your updated approach: Scrapy has a URL filter, so it doesn't revisit pages. See this; tl;dr: set dont_filter=True in the FormRequest.
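Applied to the first approach, the queuing can live in after_login, so each URL is fetched with the logged-in session and handled by parse_products. A minimal sketch, reusing the imports, logging calls, and file path from the question:
def after_login(self, response):
    # check login succeeded before going on
    if "Invalid login or password." in response.body:
        logging.log(logging.INFO, '... BAD LOGIN ...')
        return
    logging.log(logging.INFO, 'Logged in, queuing product URLs')
    with open("data/sitemaps/urls04102015.txt") as f:
        for url in f:
            # each response is parsed by parse_products with the session cookies
            yield Request(url.strip(), callback=self.parse_products)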

python method is not called

I have the following class methods in a scrapy spider. parse_category yields a Request object that has a callback to parse_product. Sometimes a category page redirects to a product page, so here I detect whether a category page is actually a product page. If it is, I just call the parse_product method. But for some reason it does not call the method.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    anchors = hxs.select('//div[@id="panelMfr"]/div/ul/li[position() != last()]/a')
    for anchor in anchors[2:3]:
        url = anchor.select('@href').extract().pop()
        cat = anchor.select('text()').extract().pop().strip()
        yield Request(urljoin(get_base_url(response), url), callback=self.parse_category, meta={"category": cat})

def parse_category(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = get_base_url(response)
    # check if its a redirected product page
    if (hxs.select(self.product_name_xpath)):
        self.log("Category-To-Product Redirection")
        self.parse_product(response)  # <<---- This line is not called.
        self.log("Product Parsed")
        return
    products_xpath = '//div[@class="productName"]/a/@href'
    products = hxs.select(products_xpath).extract()
    for url in products:
        yield Request(urljoin(base_url, url), callback=self.parse_product, meta={"category": response.meta['category']})
    next_page = hxs.select('//table[@class="nav-back"]/tr/td/span/a[contains(text(), "Next")]/text()').extract()
    if next_page:
        url = next_page[0]
        yield Request(urljoin(base_url, url), callback=self.parse_category, meta={"category": response.meta['category']})

def parse_product(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = get_base_url(response)
    self.log("Inside parse_product")
In the log I see that "Category-To-Product Redirection" and "Product Parsed" are printed, but "Inside parse_product" is missing. What did I do wrong here?
2013-07-12 21:31:34+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/category.aspx> (referer: None)
2013-07-12 21:31:34+0100 [example.com] DEBUG: Redirecting (302) to <GET http://www.example.com/productinfo.aspx?catref=AM6901> from <GET http://www.example.com/products/Inks-Toners/Apple>
2013-07-12 21:31:35+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/productinfo.aspx?catref=AM6901> (referer: http://www.example.com/category.aspx)
2013-07-12 21:31:35+0100 [example.com] DEBUG: Category-To-Product Redirection
2013-07-12 21:31:35+0100 [example.com] DEBUG: Product Parsed
2013-07-12 21:31:35+0100 [example.com] INFO: Closing spider (finished)
2013-07-12 21:31:35+0100 [-] ERROR: ERROR:root:SPIDER CLOSED: No. of products: 0
