Scrapy links crawled but not scraped - python

I wrote a scraper to crawl all categories under "au-quotidien" on the e-commerce website Cdiscount.
The bot is supposed to start at the top menu, go down to a second layer, then a third, and scrape the items. Here is my code, as a test:
import re
import scrapy
class CdiscountSpider(scrapy.Spider):
    name = "cdis_bot"  # how we have to call the bot
    start_urls = ["https://www.cdiscount.com/au-quotidien/v-127-0.html"]
    def parse(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            regex_top_category = r"\b(?=\w)" + re.escape("au-quotidien") + r"\b(?!\w)"
            if re.search(regex_top_category, link):
                yield response.follow(link, callback=self.parse_on_categories)  # going one layer deep from landing page
    def parse_on_categories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_on_subcategories)  # going two layers deep from landing page
    def parse_on_subcategories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_data)  # going three layers deep from landing page
    def parse_data(self, response):
        links_list = response.css("div.prdtBILDetails a::attr(href)").extract()
        regex_ean = re.compile(r'(\d+)\.html')
        eans_list = [regex_ean.search(link).group(1) for link in links_list if regex_ean.search(link)]
        desc_list = response.css("div.prdtBILTit::text").extract()
        price_euros = response.css("span.price::text").extract()
        price_cents = response.css("span.price sup::text").extract()
        for euro, cent, ean, desc in zip(price_euros, price_cents, eans_list, desc_list):
            if len(ean) > 6:
                yield {'ean': ean, 'price': euro + cent, 'desc': desc, 'company': "cdiscount", 'url': response.url}
My problem is that only the links are retrieved.
For instance:
2018-12-18 14:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/legumes-secs/l-127015303.html> (referer: https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/l-1270153.html)
2018-12-18 14:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/semoules/l-127015302.html> (referer: https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/l-1270153.html)
But I get only very few scraped items, always from the same category, like this:
{'ean': '2009818241269', 'price': '96€00', 'desc': 'Heidsieck & Co Monopole 75cl x6', 'company': 'cdiscount', 'url': 'https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-brut/l-1293402.html'}
2018-12-18 14:40:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-brut/l-1293402.html>
Yet it seems to me that the other categories share the same item selectors.
If you could help me figure out where I went wrong, I would be grateful :) Thank you.

It looks like the responses your parse_data() method is receiving are all vastly different.
For example, these are the first three urls it parses on a sample run:
https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-millesime/l-1293404.html
https://www.cdiscount.com/vin-champagne/coffrets-cadeaux/v-12960-12960.html
https://www.cdiscount.com/au-quotidien/alimentaire/bio/boisson-bio/jus-de-tomates-bio/l-12701271315.html
It's obvious (even from a quick glance) that the structure of each of these pages is different.
In most cases, your eans_list and desc_list are empty, so the zip() call produces no results.
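To make the zip() point concrete, here is a minimal sketch in plain Python. The lists are made-up stand-ins for what the selectors return on a page whose layout the EAN selector does not match, and field_lengths() is a hypothetical helper, not part of Scrapy. zip() stops at the shortest input, so a single empty list silently discards every row.

```python
# Hypothetical values standing in for the selector results on one page.
price_euros = ["96€", "12€"]
price_cents = ["00", "50"]
eans_list = []  # the EAN selector matched nothing on this layout
desc_list = ["Heidsieck & Co Monopole 75cl x6", "Another product"]

# zip() truncates to the shortest iterable, so one empty list yields no rows.
rows = list(zip(price_euros, price_cents, eans_list, desc_list))


def field_lengths(**fields):
    """Return how many values each selector produced (hypothetical debug helper)."""
    return {name: len(values) for name, values in fields.items()}


lengths = field_lengths(price=price_euros, cents=price_cents,
                        ean=eans_list, desc=desc_list)
print(rows)
print(lengths)
```

Logging the lengths like this inside parse_data() would show, page by page, which selector comes back empty.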

Related

scrapy closes spider with no visible error

Here is the code I use to download information from a car website.
class AutospiderSpider(scrapy.Spider):
    name = 'autospider'
    def start_requests(self):
        keyword_list = ['subaru']
        for keyword in keyword_list:
            auto_search_url = f'https://auto.ru/krasnodarskiy_kray/cars/{keyword}/all/?page=1'
            yield scrapy.Request(url=auto_search_url, callback=self.discover_car_urls, meta={'keyword': keyword, 'page': 1})
    def discover_car_urls(self, response):
        page = response.meta['page']
        keyword = response.meta['keyword']
        # Discover car URLs
        search_cars = response.css("div.ListingItem")
        for car in search_cars:
            car_url = car.css("a.Link.OfferThumb::attr(href)").get()
            yield scrapy.Request(url=car_url, callback=self.parse_car_data, meta={'keyword': keyword, 'page': page})
        # Get all pages
        if page == 1:
            available_pages = response.css('a.ListingPagination__page::text').getall()
            for page_num in available_pages:
                auto_search_url = f'https://auto.ru/krasnodarskiy_kray/cars/{keyword}/all/?page={page_num}'
                yield scrapy.Request(url=auto_search_url, callback=self.discover_car_urls, meta={'keyword': keyword, 'page': page_num})
    def parse_car_data(self, response):
        price = response.css("span.OfferPriceCaption__price").get("").strip()
        if price != None:
            price = price.replace('\u00a0', '').replace('\u20bd', '').strip()
        yield {
            "brand": response.css("h1.CardHead__title::text").get(),
            "price": price,
        }
2022-12-26 16:49:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proxy.scrapeops.io/v1/?api_key=cf3a2102-2bf1-4ed1-90a5-b9b676b98009&url=https%3A%2F%2Fauto.ru%2Fkrasnodarskiy_kray%2Fcars%2Fsubaru%2Fall%2F%3Fpage%3D1> (referer: None)
2022-12-26 16:49:45 [scrapy.core.engine] INFO: Closing spider (finished)
The spider closes on its own after fetching the first URL, with no visible error. I tried changing the selectors, but as far as I can see they are spelled correctly.
I have seen that some similar questions come down to getting a relative path instead of an absolute one. But the code here gets an absolute path, so that's not the problem.
Similar code works perfectly on several other sites. Please tell me what the problem could be if it is not the selectors.

Scrapy adds undesired prefix link when following link

2019-03-17 17:21:06 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.google.com/www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html> (referer: http://www.google.com/search?q=Rheinhessen+Germany+coordinates+longitude+latitude+distancesto)
2019-03-17 17:21:06 [scrapy.core.scraper] DEBUG: Scraped from <404 http://www.google.com/www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html>
So instead of following 'www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html', it puts 'http://www.google.com/' in front of it, which obviously results in a broken link. This is beyond me and I can't understand why. The response does not contain that prefix. I even tried to keep only what comes after the first 22 characters (the length of the undesired prefix), and that erased part of the real link.
class Googlelocs(Spider):
    name = 'googlelocs'
    start_urls = []
    for i in appellation_list:
        baseurl = i.replace(',', '').replace(' ', '+')
        cleaned_href = f'http://www.google.com/search?q={baseurl}+coordinates+longitude+latitude+distancesto'
        start_urls.append(cleaned_href)
    def parse(self, response):
        cleaned_href = response.xpath('//*[@id="ires"]/ol/div[1]/h3/a').get().split('https://')[1].split('&')[0]
        yield response.follow(cleaned_href, self.parse_distancesto)
    def parse_distancesto(self, response):
        items = GooglelocItem()
        items['appellation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[2]/p/strong)').get()
        items['latitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[1]/td)').get()
        items['longitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[2]/td)').get()
        items['elevation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[10]/td)').get()
        yield items
Here is the spider.
I found the answer.
href = response.xpath('//*[@id="ires"]/ol/div[1]/h3/a/@href').get()
This was the correct path to obtain the href from Google. I also had to accept the link as masked by Google, without trying to modify it, in order to follow it.
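A likely explanation for the unwanted prefix, sketched under an assumption: response.follow() resolves its argument against the URL of the current response, so a string with no scheme is treated as a relative path. The standard library's urljoin shows the same behaviour. I am assuming Scrapy's resolution boils down to this, and the search URL below is a shortened stand-in.

```python
from urllib.parse import urljoin

# Shortened stand-in for the Google results page the spider was on.
page_url = "http://www.google.com/search?q=Rheinhessen+Germany+coordinates"

# What the original parse() produced: the "https://" had been split off.
no_scheme = ("www.distancesto.com/coordinates/de/"
             "jugenheim-in-rheinhessen-latitude-longitude/history/401814.html")
with_scheme = "https://" + no_scheme

# Without a scheme, the string is resolved as a path relative to page_url.
relative_result = urljoin(page_url, no_scheme)
# With a scheme, the URL is already absolute and is returned unchanged.
absolute_result = urljoin(page_url, with_scheme)

print(relative_result)
print(absolute_result)
```

The first result has the same shape as the 404 URL in the log above, which is why keeping the href intact avoids the problem.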

item loader skip values scrapy

I'm using an item loader with Scrapy across multiple pages. The item loader returns empty dictionaries for some pages, although when I use the same rules to parse only those pages, it returns the values. Does anyone know why?
Spider code:
class AllDataSpider(scrapy.Spider):
    name = 'all_data'  # spider name
    allowed_domains = ['amazon.com']
    # write the start url
    start_urls = ["https://www.amazon.com/s?bbn=2619533011&rh=n%3A2619533011%2Cp_n_availability%3A2661601011&ie=UTF8&qid=1541604856&ref=lp_2619533011_nr_p_n_availability_1"]
    custom_settings = {'FEED_URI': 'pets_.csv'}  # write csv file name
    def parse(self, response):
        '''
        function parses item information from category page
        '''
        self.category = response.xpath('//span[contains(@class, "nav-a-content")]//text()').extract_first()
        urls = response.xpath('//*[@data-asin]//@data-asin').extract()
        for url in urls:
            base = f"https://www.amazon.com/dp/{url}"
            yield scrapy.Request(base, callback=self.parse_item)
        next_page = response.xpath('//*[text()="Next"]//@href').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), dont_filter=True)
    def parse_item(self, response):
        loader = AmazonDataLoader(selector=response)
        loader.add_xpath("Availability", '//div[contains(@id, "availability")]//span//text()')
        loader.add_xpath("NAME", '//h1[@id="title"]//text()')
        loader.add_xpath("ASIN", '//*[@data-asin]//@data-asin')
        loader.add_xpath("REVIEWS", '//span[contains(@id, "Review")]//text()')
        rank_check = response.xpath('//*[@id="SalesRank"]//text()')
        if len(rank_check) > 0:
            loader.add_xpath("RANKING", '//*[@id="SalesRank"]//text()')
        else:
            loader.add_xpath("RANKING", '//span//span[contains(text(), "#")][1]//text()')
        loader.add_value("CATEGORY", self.category)
        return loader.load_item()
For some pages it returns all values, for some pages it returns just the category, and for others (which follow the same rules when I parse them on their own) it returns nothing. It also closes the spider before finishing, without errors.
DEBUG: Scraped from <200 https://www.amazon.com/dp/B0009X29WK>
{'ASIN': 'B0009X29WK',
'Availability': 'In Stock.',
'NAME': " Dr. Elsey's Cat Ultra Premium Clumping Cat Litter, 40 pound bag ( "
'Pack May Vary ) ',
'RANKING': '#1',
'REVIEWS': '13,612'}
2019-01-21 21:13:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/dp/B01N9KSITZ> (referer: https://www.amazon.com/s?i=pets&bbn=2619533011&rh=n%3A2619533011%2Cp_n_availability%3A2661601011&lo=grid&page=2&ie=UTF8&qid=1548097190&ref=sr_pg_1)
2019-01-21 21:13:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/dp/B01N9KSITZ>
{}

Scrapy stops scraping but continues to crawl

I’m trying to scrape different information from several pages of a website.
Up to the sixteenth page, everything works: the pages are crawled and scraped, and the information is stored in my database. After the sixteenth page, however, it stops scraping but continues to crawl.
I checked the website and there are more than 470 pages with information. The HTML tags are the same, so I don't understand why it stopped scraping.
Python:
def url_lister():
    url_list = []
    page_count = 1
    while page_count < 480:
        url = 'https://www.active.com/running?page=%s' % page_count
        url_list.append(url)
        page_count += 1
    return url_list
class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    start_urls = url_lister()
    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
        pass
The shell:
2018-01-23 17:22:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.active.com/running?page=15>
{
'nom_evenement': 'Enniscrone 10k run & 5k run/walk',
}
2018-01-23 17:22:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=16> (referer: None)
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-01-23 17:22:34 [scrapy.extensions.logstats] INFO: Crawled 17 pages (at 17 pages/min), scraped 155 items (at 155 items/min)
2018-01-23 17:22:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=17> (referer: None)
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-01-23 17:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=18> (referer: None)
--------------------------------------------------
SCRAPING DES ELEMENTS EVENTS
--------------------------------------------------
2018-01-23 17:22:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=19> (referer: None)
This is probably because only 17 pages contain the content you are looking for, while you instruct Scrapy to visit all 480 pages of the form https://www.active.com/running?page=NNN. A better approach is to check, on each page you visit, whether there is a next page, and only in that case yield a Request for it.
So, I would refactor your code to something like (not tested):
class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    base_url = 'https://www.active.com/running'
    start_urls = [base_url]
    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
        # check for next page link
        if response.xpath('//a[contains(@class, "next-page")]'):
            next_page = response.meta.get('page_number', 1) + 1
            # base_url is a class attribute, so it must be reached through self
            next_page_url = '{}?page={}'.format(self.base_url, next_page)
            yield scrapy.Request(next_page_url, callback=self.parse, meta={'page_number': next_page})

Next pages and scrapy crawler doesn't work

I'm trying to follow the pages on this website, where the next page numbers are generated in a pretty strange way. Instead of normal indexing, the next pages look like this:
new/v2.php?cat=69&pnum=2&pnum=3
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4&pnum=5
As a result, my scraper gets into a loop and never stops, scraping items from pages like this:
DEBUG: Scraped from <200 http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=1&pnum=1&pnum=2&pnum=3>
and so on.
While the scraped items are correct and match the target(s), the crawler never stops and keeps going over the pages again and again.
My crawler looks like this:
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from mymobile.items import MymobileItem
class MmobySpider(CrawlSpider):
    name = "mmoby2"
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]
    rules = (Rule(SgmlLinkExtractor(allow=("new/v2.php\?cat=69&pnum=\d*", ))
        , callback="parse_items", follow=True),)
    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])
            items.append(item)
        return(items)
Any suggestions on how I can tame it?
As I understand it, all page numbers appear in your start URL, http://mymobile.ge/new/v2.php?cat=69&pnum=1, so you could use follow=False: the rule will be executed only once, but it will extract all the links in that first pass.
I tried with:
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
class MmobySpider(CrawlSpider):
    name = "mmoby2"
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]
    rules = (
        Rule(SgmlLinkExtractor(
            allow=("new/v2\.php\?cat=69&pnum=\d*",),
        )
        , callback="parse_items", follow=False),)
    def parse_items(self, response):
        sel = Selector(response)
        print response.url
I ran it like this:
scrapy crawl mmoby2
The request count was six, with the following output:
...
2014-05-18 12:20:35+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: None)
2014-05-18 12:20:36+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1
2014-05-18 12:20:37+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5
2014-05-18 12:20:39+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3
2014-05-18 12:20:39+0200 [mmoby2] INFO: Closing spider (finished)
2014-05-18 12:20:39+0200 [mmoby2] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1962,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
...
If extracting links with SgmlLinkExtractor fails, you can always use a simple Scrapy spider: extract the link to the next page with selectors/XPaths, yield a Request for it with parse as the callback, and stop when there is no next-page link.
Something like this should work for you.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from urlparse import urljoin
from mymobile.items import MymobileItem
class MmobySpider(Spider):
    name = "mmoby2"
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]
    def parse(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])
            yield item
        # extract next page link
        next_page_xpath = "//td[span]/following-sibling::td[1]/a[contains(@href, 'num')]/@href"
        next_page = sel.xpath(next_page_xpath).extract()
        # if there is next page yield Request for it
        if next_page:
            next_page = urljoin(response.url, next_page[0])
            yield Request(next_page, callback=self.parse)
The XPath for the next page is not a simple one, because the page's markup is completely unsemantic, but it should work OK.
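A different angle, not taken from either answer above: the loop seems to come from every page appending another pnum to the URL, so the same page shows up under ever-growing URLs that the duplicate filter sees as new. Below is a rough sketch of collapsing such a URL to one canonical form with the standard library (Python 3 imports). canonical_page_url is a hypothetical helper name, and the sketch assumes the site only honours the last pnum value, which I have not verified.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_page_url(url):
    """Keep only the last value of each query parameter (hypothetical helper).

    Assumption: the server uses the last pnum when it is repeated.
    """
    parts = urlsplit(url)
    params = {}
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        params[key] = value  # a later value overwrites an earlier one
    return urlunsplit(parts._replace(query=urlencode(params)))


looping_url = "http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=1&pnum=2&pnum=3"
print(canonical_page_url(looping_url))
```

If that assumption holds, requesting the canonical URL instead of the raw one would let Scrapy's duplicate filter recognise pages it has already visited.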
