2019-03-17 17:21:06 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.google.com/www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html> (referer: http://www.google.com/search?q=Rheinhessen+Germany+coordinates+longitude+latitude+distancesto)
2019-03-17 17:21:06 [scrapy.core.scraper] DEBUG: Scraped from <404 http://www.google.com/www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html>
So instead of following 'www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html', the spider prepends 'http://www.google.com/' and obviously ends up with a broken link. This is beyond me and I can't understand why; the response does not contain that prefix. I even tried slicing the URL after 22 characters (the length of the undesired prefix), but that erased part of the real link.
class Googlelocs(Spider):
    name = 'googlelocs'
    start_urls = []

    for i in appellation_list:
        baseurl = i.replace(',', '').replace(' ', '+')
        cleaned_href = f'http://www.google.com/search?q={baseurl}+coordinates+longitude+latitude+distancesto'
        start_urls.append(cleaned_href)

    def parse(self, response):
        cleaned_href = response.xpath('//*[@id="ires"]/ol/div[1]/h3/a').get().split('https://')[1].split('&')[0]
        yield response.follow(cleaned_href, self.parse_distancesto)

    def parse_distancesto(self, response):
        items = GooglelocItem()
        items['appellation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[2]/p/strong)').get()
        items['latitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[1]/td)').get()
        items['longitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[2]/td)').get()
        items['elevation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[10]/td)').get()
        yield items
Here is the spider.
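For context before the fix: the stray prefix is almost certainly ordinary relative-URL resolution. After .split('https://')[1] the extracted string has no scheme, so response.follow() resolves it against the URL of the Google results page, exactly as urljoin() would. A minimal sketch with the standard library, using the URL from the log above:

from urllib.parse import urljoin

# response.follow() resolves its target against response.url, just like urljoin().
base = 'http://www.google.com/search?q=Rheinhessen+Germany+coordinates'
target = 'www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html'

# Without a scheme, the target is treated as a relative path on google.com.
print(urljoin(base, target))
# -> 'http://www.google.com/www.distancesto.com/coordinates/de/...'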
I found the answer.
href = response.xpath('//*[@id="ires"]/ol/div[1]/h3/a/@href').get()
This was the correct path to obtain the href from Google. I also had to accept the link masked by Google, without trying to modify it, in order to follow it.
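Putting the two points together, a minimal sketch of the corrected parse callback (the rest of the Googlelocs class and parse_distancesto stay as in the question):

def parse(self, response):
    # Take the @href attribute directly and hand Google's (possibly redirect-masked)
    # link to response.follow() unmodified.
    href = response.xpath('//*[@id="ires"]/ol/div[1]/h3/a/@href').get()
    if href:
        yield response.follow(href, callback=self.parse_distancesto)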
Related
Here is the code for downloading information from a car website.
class AutospiderSpider(scrapy.Spider):
    name = 'autospider'

    def start_requests(self):
        keyword_list = ['subaru']
        for keyword in keyword_list:
            auto_search_url = f'https://auto.ru/krasnodarskiy_kray/cars/{keyword}/all/?page=1'
            yield scrapy.Request(url=auto_search_url, callback=self.discover_car_urls, meta={'keyword': keyword, 'page': 1})

    def discover_car_urls(self, response):
        page = response.meta['page']
        keyword = response.meta['keyword']

        # Discover car URLs
        search_cars = response.css("div.ListingItem")
        for car in search_cars:
            car_url = car.css("a.Link.OfferThumb::attr(href)").get()
            yield scrapy.Request(url=car_url, callback=self.parse_car_data, meta={'keyword': keyword, 'page': page})

        # Get all pages
        if page == 1:
            available_pages = response.css('a.ListingPagination__page::text').getall()
            for page_num in available_pages:
                auto_search_url = f'https://auto.ru/krasnodarskiy_kray/cars/{keyword}/all/?page={page_num}'
                yield scrapy.Request(url=auto_search_url, callback=self.discover_car_urls, meta={'keyword': keyword, 'page': page_num})

    def parse_car_data(self, response):
        price = response.css("span.OfferPriceCaption__price::text").get()
        if price is not None:
            price = price.replace('\u00a0', '').replace('\u20bd', '').strip()
        yield {
            "brand": response.css("h1.CardHead__title::text").get(),
            "price": price,
        }
2022-12-26 16:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=cf3a2102-2bf1-4ed1-90a5-b9b676b98009&url=https%3A%2F%2Fauto.ru%2Fkrasnodarskiy_kray%2Fcars%2Fsubaru%2Fall%2F%3Fpage%3D1> (referer: None)
2022-12-26 16:49:45 [scrapy.core.engine] INFO: Closing spider (finished)
The spider closes automatically after crawling the first link, without any visible error. I tried changing the selectors, but as far as I can tell they are spelled correctly.
I have seen that some similar questions are caused by following a relative rather than an absolute path, but the code here builds absolute URLs, so that is not the problem.
Similar code works perfectly on several other sites. Please tell me what the problem might be, if it is not in the selectors.
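One way to narrow this down (a hedged debugging sketch, not a fix; the selectors and the ScrapeOps proxy setup are taken from the question as-is) is to log how many listing blocks each response actually contains. If search_cars is empty, the spider schedules nothing further and finishes silently:

def discover_car_urls(self, response):
    page = response.meta['page']
    keyword = response.meta['keyword']

    search_cars = response.css("div.ListingItem")
    # If this logs 0, the proxied response did not contain the expected markup
    # (for example a bot-protection or consent page), which would explain the
    # silent "Closing spider (finished)".
    self.logger.info("keyword=%s page=%s matched %d ListingItem blocks",
                     keyword, page, len(search_cars))
    # ... rest of the method unchanged ...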
I have an initial link from which I get the number of pages. How do I get the URL of that start link so I can build the paginated requests?
Pagination does not work:
DEBUG: Crawled (404) <GET https://www.healthgrades.com/api3/&pageNum=2> (referer: https://www.healthgrades.com/api3/usearch?where=CA&sessionId=%7BsessionId%7D&requestId=%7BrequestId%7D&sort.provider=bestmatch&source=init&what=%7Bspecialty%7D&category=provider&cid&debug=false&debugParams=false&isPsr=false&isFsr=false&isFirstRequest=true&userLocalTime=23%3A55)
spider:
def start_requests(self):
    yield scrapy.Request('https://www.healthgrades.com/api3/usearch?where=CA&sessionId={sessionId}&requestId={requestId}' +
                         '&sort.provider=bestmatch&source=init&what={specialty}&category=provider&cid&debug=false&d' +
                         'ebugParams=false&isPsr=false&isFsr=false&isFirstRequest=true&userLocalTime=23%3A55',
                         callback=self.pagination)

def pagination(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    totalPages = jsonresponse['search']['searchResults']['totalPages']
    for page in range(1, totalPages):
        page = '&pageNum=%s' % page
        yield scrapy.Request(urljoin(response.request.url, page), callback=self.profile_link)
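The 404 in the log is exactly what urljoin() produces here: '&pageNum=2' has no scheme, so it is resolved as a relative path against '/api3/usearch?...', which collapses to '/api3/&pageNum=2'. A minimal sketch of the behaviour, and of one possible workaround (simply appending the parameter to the original request URL; whether the API accepts an extra pageNum parameter this way is an assumption):

from urllib.parse import urljoin

base = 'https://www.healthgrades.com/api3/usearch?where=CA&sort.provider=bestmatch'

# urljoin() treats '&pageNum=2' as a relative path, dropping both 'usearch'
# and the whole query string:
print(urljoin(base, '&pageNum=2'))
# -> 'https://www.healthgrades.com/api3/&pageNum=2'

# Possible workaround: keep the original URL and append the parameter to it.
for page in range(1, 4):
    print(f'{base}&pageNum={page}')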
I am attempting to get the mini-bio from the top of the following page:
https://en.m.wikipedia.org/wiki/C%C3%A9sar_Milstein
With scrapy shell I'm able to perform the following:
C:\Users\broke\Documents\DataViz Projects>scrapy shell https://en.m.wikipedia.org/wiki/C%C3%A9sar_Milstein
...
[s] request <GET https://en.m.wikipedia.org/wiki/C%C3%A9sar_Milstein>
[s] response <200 https://en.m.wikipedia.org/wiki/C%C3%A9sar_Milstein>
...
In [1]: response.xpath('//div[@id="mf-section-0"]/p[text() or normalize-space(.)=""]').extract()
Out[1]:
['<p class="mw-empty-elt">\n\n</p>',
 '<p><b>César Milstein</b>, <a href="/wiki/Order_of_the_Companions_of_Honour" title="Order of the Companions of Honour">CH</a>, FRS<sup id="cite_ref-frs_2-1" class="reference">[2]</sup> (8 October 1927 – 24 March 2002) was an <a href="/wiki/Argentinian" class="mw-redirect" title="Argentinian">Argentinian</a> biochemist in the field of <a href="/wiki/Antibody" title="Antibody">antibody</a> research.<sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup>
 ...
 </a></sup><sup id="cite_ref-12" class="reference">[12]</sup><sup id="cite_ref-13" class="reference">[13]</sup><sup id="cite_ref-14" class="reference">[14]</sup><sup id="cite_ref-15" class="reference"><a href="#cite_note-15">[15]</a></sup></p>']
However, the following parser code in my spider is returning an empty list when passed that URL:
def get_mini_bio(self, response):
    """Get the winner's bio text and photo"""
    item = response.meta['item']
    item['image_urls'] = []
    img_src = response.xpath('//table[contains(@class,"infobox")]//img/@src')
    if img_src:
        item['image_urls'] = ['https:{}'.format(img_src[0].extract())]
    mini_bio = ''
    # paras = '\n\n'.join(response.xpath('//div[@id="mw-content-text"]//p[text() or normalize-space(.)=""]').extract())
    mini_bio = response.xpath('//div[@id="mf-section-0"]/p[text() or normalize-space(.)=""]').extract()
    self.logger.warning("mini_bio received {} as a result.".format(mini_bio))
    yield
Output:
2019-05-28 01:52:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/C%C3%A9sar_Milstein> (referer: None) ['cached']
2019-05-28 01:52:59 [nwinners_minibio] WARNING: mini_bio received [] as a result.
Note the commented-out line in the parser: that xpath returns a set of paragraphs that includes the desired one (the paragraph inside the 'mf-section-0' div), so the paragraph does appear to be rendered. However, it also returns all the other paragraphs in the text section, without enough information to differentiate them on other, similar pages.
Can anyone tell me why I'm getting different results between the shell and the spider, and how I can get the same results in the spider as I am getting in the shell?
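One thing worth checking, as a hedged debugging sketch rather than a confirmed diagnosis: the shell session above fetched en.m.wikipedia.org, while the spider's log line shows en.wikipedia.org. Logging the URL the callback actually receives would confirm whether the spider is getting the desktop page, which may not contain an mf-section-0 wrapper at all:

def get_mini_bio(self, response):
    # Log which page this callback actually received; the desktop and mobile
    # Wikipedia pages wrap the lead section differently, so an en.wikipedia.org
    # response may simply have no div with id="mf-section-0".
    self.logger.warning("get_mini_bio received %s", response.url)
    ...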
I'm using an item loader with Scrapy across multiple pages. The item loader returns empty dictionaries for some pages, even though when I use the same rules to parse only those pages it returns the values. Does anyone know why?
spider code:
class AllDataSpider(scrapy.Spider):
    name = 'all_data'  # spider name
    allowed_domains = ['amazon.com']
    # write the start url
    start_urls = ["https://www.amazon.com/s?bbn=2619533011&rh=n%3A2619533011%2Cp_n_availability%3A2661601011&ie=UTF8&qid=1541604856&ref=lp_2619533011_nr_p_n_availability_1"]
    custom_settings = {'FEED_URI': 'pets_.csv'}  # write csv file name

    def parse(self, response):
        '''
        function parses item information from category page
        '''
        self.category = response.xpath('//span[contains(@class, "nav-a-content")]//text()').extract_first()
        urls = response.xpath('//*[@data-asin]//@data-asin').extract()
        for url in urls:
            base = f"https://www.amazon.com/dp/{url}"
            yield scrapy.Request(base, callback=self.parse_item)

        next_page = response.xpath('//*[text()="Next"]//@href').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), dont_filter=True)

    def parse_item(self, response):
        loader = AmazonDataLoader(selector=response)
        loader.add_xpath("Availability", '//div[contains(@id, "availability")]//span//text()')
        loader.add_xpath("NAME", '//h1[@id="title"]//text()')
        loader.add_xpath("ASIN", '//*[@data-asin]//@data-asin')
        loader.add_xpath("REVIEWS", '//span[contains(@id, "Review")]//text()')

        rank_check = response.xpath('//*[@id="SalesRank"]//text()')
        if len(rank_check) > 0:
            loader.add_xpath("RANKING", '//*[@id="SalesRank"]//text()')
        else:
            loader.add_xpath("RANKING", '//span//span[contains(text(), "#")][1]//text()')

        loader.add_value("CATEGORY", self.category)
        return loader.load_item()
For some pages it returns all values, for some pages it returns just the category, and for others, which follow the same rules when I parse only them, it returns nothing. It also closes the spider before finishing, and without errors.
DEBUG: Scraped from <200 https://www.amazon.com/dp/B0009X29WK>
{'ASIN': 'B0009X29WK',
'Availability': 'In Stock.',
'NAME': " Dr. Elsey's Cat Ultra Premium Clumping Cat Litter, 40 pound bag ( "
'Pack May Vary ) ',
'RANKING': '#1',
'REVIEWS': '13,612'}
2019-01-21 21:13:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/dp/B01N9KSITZ> (referer: https://www.amazon.com/s?i=pets&bbn=2619533011&rh=n%3A2619533011%2Cp_n_availability%3A2661601011&lo=grid&page=2&ie=UTF8&qid=1548097190&ref=sr_pg_1)
2019-01-21 21:13:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/dp/B01N9KSITZ>
{}
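The loader class itself isn't shown in the question, so the following is only a hypothetical sketch of what an AmazonDataLoader with common default processors might look like. With TakeFirst() as the output processor, any field whose XPath matches nothing is simply dropped from the loaded item, which would produce exactly the kind of partially empty or empty dictionaries shown above. (In newer Scrapy versions the processors live in itemloaders.processors.)

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst


class AmazonDataItem(scrapy.Item):
    # Hypothetical item definition matching the field names used in parse_item().
    Availability = scrapy.Field()
    NAME = scrapy.Field()
    ASIN = scrapy.Field()
    REVIEWS = scrapy.Field()
    RANKING = scrapy.Field()
    CATEGORY = scrapy.Field()


class AmazonDataLoader(ItemLoader):
    """Hypothetical loader definition, for illustration only."""
    default_item_class = AmazonDataItem
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()  # a field whose XPath matches nothing is omitted entirely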
I have made a scraper to crawl all categories related to "au-quotidien" on the e-commerce website Cdiscount.
The bot is supposed to start on the top menu, then access a second layer deep, then a third, and scrape items. Here is my code, as a test:
class CdiscountSpider(scrapy.Spider):
    name = "cdis_bot"  # how we have to call the bot
    start_urls = ["https://www.cdiscount.com/au-quotidien/v-127-0.html"]

    def parse(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            regex_top_category = r"\b(?=\w)" + re.escape("au-quotidien") + r"\b(?!\w)"
            if re.search(regex_top_category, link):
                yield response.follow(link, callback=self.parse_on_categories)  # going one layer deep from the landing page

    def parse_on_categories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_on_subcategories)  # going two layers deep from the landing page

    def parse_on_subcategories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_data)  # going three layers deep from the landing page

    def parse_data(self, response):
        links_list = response.css("div.prdtBILDetails a::attr(href)").extract()
        regex_ean = re.compile(r'(\d+)\.html')
        eans_list = [regex_ean.search(link).group(1) for link in links_list if regex_ean.search(link)]
        desc_list = response.css("div.prdtBILTit::text").extract()
        price_euros = response.css("span.price::text").extract()
        price_cents = response.css("span.price sup::text").extract()
        for euro, cent, ean, desc in zip(price_euros, price_cents, eans_list, desc_list):
            if len(ean) > 6:
                yield {'ean': ean, 'price': euro + cent, 'desc': desc, 'company': "cdiscount", 'url': response.url}
My problem is that only links are retrieved.
For instance :
2018-12-18 14:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/legumes-secs/l-127015303.html> (referer: https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/l-1270153.html)
2018-12-18 14:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/semoules/l-127015302.html> (referer: https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/l-1270153.html)
But I get only very few scraped items, always from the same category, like this:
{'ean': '2009818241269', 'price': '96€00', 'desc': 'Heidsieck & Co Monopole 75cl x6', 'company': 'cdiscount', 'url': 'https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-brut/l-1293402.html'}
2018-12-18 14:40:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-brut/l-1293402.html>
Yet it seems to me that the other categories share the same item selectors.
If you could help me figure out where I am wrong, I would be grateful :) Thank you.
It looks like the responses your parse_data() method is receiving are all vastly different.
For example, these are the first three urls it parses on a sample run:
https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-millesime/l-1293404.html
https://www.cdiscount.com/vin-champagne/coffrets-cadeaux/v-12960-12960.html
https://www.cdiscount.com/au-quotidien/alimentaire/bio/boisson-bio/jus-de-tomates-bio/l-12701271315.html
It's obvious (even from a quick glance) that the structure of each of these pages is different.
In most cases, your eans_list and desc_list are empty, so the zip() call produces no results.
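To make that failure mode concrete, here is a minimal sketch (plain Python, no Scrapy needed) of why a single empty selector result silences the whole loop, followed by the kind of per-page length logging that would reveal which selector is failing (the variable names are the ones from parse_data()):

# zip() stops at the shortest input, so one empty list yields nothing at all.
price_euros = ['96', '45']
eans_list = []          # e.g. div.prdtBILDetails matched nothing on this page layout
print(list(zip(price_euros, eans_list)))   # -> []

# Inside parse_data(), logging the lengths per response would show which
# selector fails on which page type:
# self.logger.info("%s: %d prices, %d eans, %d descs",
#                  response.url, len(price_euros), len(eans_list), len(desc_list))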