I’m trying to scrape different pieces of information from several pages of a website.
Up to the sixteenth page, everything works: the pages are crawled and scraped, and the information is stored in my database. After the sixteenth page, however, the spider stops scraping but keeps crawling.
I checked the website and there are more than 470 pages with information. The HTML tags are the same, so I don't understand why it stopped scraping.
Python:
def url_lister():
    url_list = []
    page_count = 1
    while page_count < 480:
        url = 'https://www.active.com/running?page=%s' % page_count
        url_list.append(url)
        page_count += 1
    return url_list

class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    start_urls = url_lister()

    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
The shell:
> 2018-01-23 17:22:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.active.com/running?page=15>
> {'nom_evenement': 'Enniscrone 10k run & 5k run/walk'}
> 2018-01-23 17:22:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=16> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:34 [scrapy.extensions.logstats] INFO: Crawled 17 pages (at 17 pages/min), scraped 155 items (at 155 items/min)
> 2018-01-23 17:22:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=17> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=18> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=19> (referer: None)
This is probably caused by the fact that there are only 17 pages with the content you are looking for, while you instruct Scrapy to visit all 480 pages of the form https://www.active.com/running?page=NNN. A better approach is to check on each page you visit whether there is a next page, and only in that case yield a Request for it.
So, I would refactor your code to something like (not tested):
class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    base_url = 'https://www.active.com/running'
    start_urls = [base_url]

    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()

        # check for a next-page link and only follow it if it exists
        if response.xpath('//a[contains(@class, "next-page")]'):
            next_page = response.meta.get('page_number', 1) + 1
            next_page_url = '{}?page={}'.format(self.base_url, next_page)
            yield scrapy.Request(next_page_url, callback=self.parse, meta={'page_number': next_page})
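If the next-page anchor carries an href, you could also follow it directly instead of keeping a page counter. This is an untested sketch that assumes Scrapy 1.4+ (for response.follow) and the same "next-page" class as above:

        # inside parse(), after the item loop
        next_href = response.xpath('//a[contains(@class, "next-page")]/@href').extract_first()
        if next_href:
            # response.follow resolves relative URLs against the current page
            yield response.follow(next_href, callback=self.parse)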
Related
I am trying to perform horizontal crawling -- starting from the first page and moving until I reach the last page. The code is as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from scrapy_project.items import metacriticItem
import datetime

# xpaths
main_xpath = '//div[@class = "title_bump"]//td[@class = "clamp-summary-wrap"]'
mt_url_xpath = './a/@href'

class MovieUrlSpider(CrawlSpider):
    name = 'movie_url'
    allowed_domains = ['web']
    start_urls = (
        'https://www.metacritic.com/browse/movies/score/metascore/all',
    )

    # rules for horizontal crawling
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@rel="next"]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # list of items we want
        main = response.xpath(main_xpath)
        for i in main:
            # create the loader using the response
            l = ItemLoader(item=metacriticItem(), selector=i)
            # key
            l.add_xpath('mt_url', mt_url_xpath)
            # housekeeping fields
            l.add_value('url', response.url)
            l.add_value('spider', self.name)
            l.add_value('date', datetime.datetime.now().strftime('%d/%m/%Y'))
            yield l.load_item()
This script did not parse any items and it returned the following messages:
2021-05-12 02:22:24 [scrapy.core.engine] INFO: Spider opened
2021-05-12 02:22:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-12 02:22:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-05-12 02:22:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/robots.txt> (referer: None)
2021-05-12 02:22:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/movies/score/metascore/all> (referer: None)
2021-05-12 02:22:24 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.metacritic.com': <GET https://www.metacritic.com/browse/movies/score/metascore/all?page=1>
2021-05-12 02:22:24 [scrapy.core.engine] INFO: Closing spider (finished)
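The "Filtered offsite request" line is the clue here: allowed_domains = ['web'] does not match www.metacritic.com, so Scrapy's offsite middleware drops the paginated request produced by the rule. A minimal, untested sketch of the corrected attributes, with everything else left unchanged:

class MovieUrlSpider(CrawlSpider):
    name = 'movie_url'
    # allowed_domains expects bare domain names; 'web' causes every metacritic.com request to be filtered out
    allowed_domains = ['metacritic.com']
    start_urls = (
        'https://www.metacritic.com/browse/movies/score/metascore/all',
    )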
I am trying to scrape a site with div elements iteratively: for each div element I want to scrape some data from it, then follow its child links and scrape more data from them.
Here is the code of quote.py
import scrapy
from ..items import QuotesItem

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def parse(self, response):
        all_div_quotes = response.css('.quote')
        for quote in all_div_quotes:
            item = QuotesItem()
            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author+ a::attr(href)').extract_first()

            item['title'] = title
            item['author'] = author
            item['tags'] = tags

            request = scrapy.Request(author_details_url,
                                     callback=self.author_born,
                                     meta={'item': item, 'next_url': author_details_url})
            yield request

    def author_born(self, response):
        item = response.meta['item']
        next_url = response.meta['next_url']
        author_born = response.css('.author-born-date::text').extract()
        item['author_born'] = author_born
        yield scrapy.Request(next_url, callback=self.author_birthplace,
                             meta={'item': item})

    def author_birthplace(self, response):
        item = response.meta['item']
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_birthplace'] = author_birthplace
        yield item
Here is the code of items.py
import scrapy

class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    author_born = scrapy.Field()
    author_birthplace = scrapy.Field()
I ran the command scrapy crawl quote -o data.json, but there was no error message and data.json was empty. I was expecting to get all the data, each piece in its corresponding field.
Can you please help me?
Take a closer look at your logs and you'll be able to find messages like this:
DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/author/Albert-Einstein>
Scrapy automatically manages duplicates and tries not to visit one URL twice (for obvious reasons).
In your case you can add dont_filter = True to your requests and you will see something like this:
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Steve-Martin/> (referer: http://quotes.toscrape.com/author/Steve-Martin/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Albert-Einstein/> (referer: http://quotes.toscrape.com/author/Albert-Einstein/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com/author/Marilyn-Monroe/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/J-K-Rowling/> (referer: http://quotes.toscrape.com/author/J-K-Rowling/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> (referer: http://quotes.toscrape.com/author/Eleanor-Roosevelt/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Andre-Gide/> (referer: http://quotes.toscrape.com/author/Andre-Gide/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> (referer: http://quotes.toscrape.com/author/Thomas-A-Edison/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com/author/Jane-Austen/)
Which is indeed a bit strange, because each page yields a request to itself.
Overall you could end up with something like this:
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def parse(self, response):
        all_div_quotes = response.css('.quote')
        for quote in all_div_quotes:
            item = dict()
            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author+ a::attr(href)').extract_first()

            item['title'] = title
            item['author'] = author
            item['tags'] = tags
            print(item)

            # dont_filter=True in case we get two quotes by the same author.
            # This is not optimal though. A better decision would be to save author data to self.storage
            # and only visit new author info pages when needed, otherwise take the info from the saved dict.
            request = scrapy.Request(author_details_url,
                                     callback=self.author_info,
                                     meta={'item': item},
                                     dont_filter=True)
            yield request

    def author_info(self, response):
        item = response.meta['item']
        author_born = response.css('.author-born-date::text').extract()
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_born'] = author_born
        item['author_birthplace'] = author_birthplace
        yield item
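The caching idea mentioned in the comments could look roughly like this untested sketch, inside the same QuoteSpider. The author_cache name is illustrative, not part of the original code, and dont_filter=True is kept so no quote is lost while the cache is still warming up:

    # sketch: cache author details so each author's page is ideally fetched only once
    author_cache = {}

    def parse(self, response):
        for quote in response.css('.quote'):
            item = {
                'title': quote.css('.text::text').extract(),
                'author': quote.css('.author::text').extract(),
                'tags': quote.css('.tag::text').extract(),
            }
            author_url = self.baseurl + quote.css('.author+ a::attr(href)').extract_first()
            if author_url in self.author_cache:
                # author already scraped: reuse the cached details
                item.update(self.author_cache[author_url])
                yield item
            else:
                yield scrapy.Request(author_url, callback=self.author_info, dont_filter=True,
                                     meta={'item': item, 'author_url': author_url})

    def author_info(self, response):
        item = response.meta['item']
        details = {
            'author_born': response.css('.author-born-date::text').extract(),
            'author_birthplace': response.css('.author-born-location::text').extract(),
        }
        self.author_cache[response.meta['author_url']] = details
        item.update(details)
        yield item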
I have made a scraper to crawl all the categories related to "au-quotidien" on the e-commerce website Cdiscount.
The bot is supposed to start on the top menu, then access a second layer, then a third, and scrape the items. Here is my code, as a test:
import re
import scrapy

class CdiscountSpider(scrapy.Spider):
    name = "cdis_bot"  # how we have to call the bot
    start_urls = ["https://www.cdiscount.com/au-quotidien/v-127-0.html"]

    def parse(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            regex_top_category = r"\b(?=\w)" + re.escape("au-quotidien") + r"\b(?!\w)"
            if re.search(regex_top_category, link):
                yield response.follow(link, callback=self.parse_on_categories)  # going one layer deep from the landing page

    def parse_on_categories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_on_subcategories)  # going two layers deep from the landing page

    def parse_on_subcategories(self, response):
        for link in response.css('div.mvNavSub ul li a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_data)  # going three layers deep from the landing page

    def parse_data(self, response):
        links_list = response.css("div.prdtBILDetails a::attr(href)").extract()
        regex_ean = re.compile(r'(\d+)\.html')
        eans_list = [regex_ean.search(link).group(1) for link in links_list if regex_ean.search(link)]
        desc_list = response.css("div.prdtBILTit::text").extract()
        price_euros = response.css("span.price::text").extract()
        price_cents = response.css("span.price sup::text").extract()
        for euro, cent, ean, desc in zip(price_euros, price_cents, eans_list, desc_list):
            if len(ean) > 6:
                yield {'ean': ean, 'price': euro + cent, 'desc': desc, 'company': "cdiscount", 'url': response.url}
My problem is that only links are retrieved.
For instance:
2018-12-18 14:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/legumes-secs/l-127015303.html> (referer: https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/l-1270153.html)
2018-12-18 14:40:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/semoules/l-127015302.html> (referer: https://www.cdiscount.com/au-quotidien/alimentaire/pates-riz-/l-1270153.html)
But I get only a few scraped items, always in the same category, like this:
{'ean': '2009818241269', 'price': '96€00', 'desc': 'Heidsieck & Co Monopole 75cl x6', 'company': 'cdiscount', 'url': 'https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-brut/l-1293402.html'}
2018-12-18 14:40:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-brut/l-1293402.html>
Yet it seems to me that the other categories share the same item selectors.
If you could help me figure out where I am wrong, I would be grateful :) thank you
It looks like the responses your parse_data() method is receiving are all vastly different.
For example, these are the first three urls it parses on a sample run:
https://www.cdiscount.com/vin-champagne/vin-champagne/champagne-millesime/l-1293404.html
https://www.cdiscount.com/vin-champagne/coffrets-cadeaux/v-12960-12960.html
https://www.cdiscount.com/au-quotidien/alimentaire/bio/boisson-bio/jus-de-tomates-bio/l-12701271315.html
It's obvious (even from a quick glance) that the structure of each of these pages is different.
In most cases, your eans_list and desc_list are empty, so the zip() call produces no results.
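One quick way to confirm this without checking hundreds of pages by hand is to log whenever a listing page yields nothing. The following is a rough sketch of such a guard at the top of parse_data(), not part of the original spider:

    def parse_data(self, response):
        links_list = response.css("div.prdtBILDetails a::attr(href)").extract()
        desc_list = response.css("div.prdtBILTit::text").extract()
        if not links_list or not desc_list:
            # the selectors matched nothing, so this page probably uses a different layout
            self.logger.warning("No product data found on %s", response.url)
            return
        # ... continue with the EAN/price extraction as before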
import scrapy

class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['https://www.ptt.cc']
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html/']
    handle_httpstatus_list = [400, 302]

    def parse(self, response):
        urls = response.css('div.r-ent > div.title > a::attr(href)').extract()
        for thread_url in urls:
            url = response.urljoin(thread_url)
            yield scrapy.Request(url=url, callback=self.parse_details)

        next_page_url = response.css('a.wide:nth-child(2)::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
            'title': response.xpath('//head/title/text()').extract(),
            'stance': response.xpath('//*[@id="main-content"]/div[@class="push"]/span[1]/text()').extract(),
            'userid': response.xpath('//*[@id="main-content"]/div[@class="push"]/span[2]/text()').extract(),
            'comment': response.xpath('//*[@id="main-content"]/div[@class="push"]/span[3]/text()').extract(),
            'time_of_post': response.xpath('//*[@id="main-content"]/div[@class="push"]/span[4]/text()').extract(),
        }
I've been using the above spider to try to crawl a website, but when I run the spider, I get these messages:
> 2017-10-05 23:14:27 [scrapy.core.engine] INFO: Spider opened
> 2017-10-05 23:14:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2017-10-05 23:14:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
> 2017-10-05 23:14:28 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 https://www.ptt.cc/bbs/HatePolitics/index.html/> Set-Cookie: __cfduid=d3ca57dcab04acfaf256438a57c547e4a1507216462; expires=Fri, 05-Oct-18 15:14:22 GMT; path=/; domain=.ptt.cc; HttpOnly
> 2017-10-05 23:14:28 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.ptt.cc/bbs/HatePolitics/index.html/> (referer: None)
> 2017-10-05 23:14:28 [scrapy.core.engine] INFO: Closing spider (finished)
My thinking is that my spider can't access the sub-forums in the index. I've tested that the selectors point to the correct locations and that response.urljoin creates the correct absolute URL, but the spider still can't reach the sub-forums on a page. It would be great if someone could tell me why the spider is unable to access the links!
There are two issues with your scraper. First, in start_urls you added a trailing slash to index.html/, which is wrong. Second, allowed_domains takes domain names, not URLs.
Change the start of your spider to the code below and it will work:
class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['www.ptt.cc']
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html']
Logs from the run
2017-10-06 13:16:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html> (referer: https://www.ptt.cc/bbs/HatePolitics/index.html)
2017-10-06 13:16:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html>
{'title': ['[黑特] 先刪文,洪慈庸和高潞那個到底撤案了沒? - 看板 HatePolitics - 批踢踢實業坊'], 'stance': ['推 ', '→ ', '噓 ', '→ ', '→ '], 'userid': ['ABA0525', 'gerund', 'AGODFATHER', 'laman45', 'victoryman'], 'comment': [': 垃圾不分藍綠黃', ': 垃圾靠弟傭 中華民國內最沒資格當立委的爛貨', ': 說什麼東西你個板啊', ': 有確定再說', ': 看起來應該是撤了'], 'time_of_post': ['10/06 13:43\n', '10/06 13:50\n', '10/06 13:57\n', '10/06 13:59\n', ' 10/06 15:27\n']}
2017-10-06 13:16:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ptt.cc/bbs/HatePolitics/M.1507275599.A.657.html> (referer: https://www.ptt.cc/bbs/HatePolitics/index.html)
For each of several Disqus users, whose profile URLs are known in advance, I want to scrape their names and the usernames of their followers. I'm using Scrapy and Splash to do so. However, when I parse the responses, it seems that it is always scraping the page of the first user. I tried setting wait to 10 and dont_filter to True, but it isn't working. What should I do now?
Here is my spider:
import scrapy
from disqus.items import DisqusItem

class DisqusSpider(scrapy.Spider):
    name = "disqusSpider"
    start_urls = ["https://disqus.com/by/disqus_sAggacVY39/", "https://disqus.com/by/VladimirUlayanov/", "https://disqus.com/by/Beasleyhillman/", "https://disqus.com/by/Slick312/"]
    splash_def = {"endpoint": "render.html", "args": {"wait": 10}}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse_basic, dont_filter=True, meta={
                "splash": self.splash_def,
                "base_profile_url": url
            })

    def parse_basic(self, response):
        name = response.css("h1.cover-profile-name.text-largest.truncate-line::text").extract_first()
        disqusItem = DisqusItem(name=name)
        request = scrapy.Request(url=response.meta["base_profile_url"] + "followers/", callback=self.parse_followers, dont_filter=True, meta={
            "item": disqusItem,
            "base_profile_url": response.meta["base_profile_url"],
            "splash": self.splash_def
        })
        print "parse_basic", response.url, request.url
        yield request

    def parse_followers(self, response):
        print "parse_followers", response.meta["base_profile_url"], response.meta["item"]
        followers = response.css("div.user-info a::attr(href)").extract()
DisqusItem is defined as follows:
class DisqusItem(scrapy.Item):
    name = scrapy.Field()
    followers = scrapy.Field()
Here are the results:
2017-08-07 23:09:12 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/disqus_sAggacVY39/ {'name': u'Trailer Trash'}
2017-08-07 23:09:14 [scrapy.extensions.logstats] INFO: Crawled 5 pages (at 5 pages/min), scraped 0 items (at 0 items/min)
2017-08-07 23:09:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/VladimirUlayanov/ {'name': u'Trailer Trash'}
2017-08-07 23:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/Beasleyhillman/ {'name': u'Trailer Trash'}
2017-08-07 23:09:40 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/Slick312/ {'name': u'Trailer Trash'}
Here is the file settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for disqus project
#
BOT_NAME = 'disqus'

SPIDER_MODULES = ['disqus.spiders']
NEWSPIDER_MODULE = 'disqus.spiders'

ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
DUPEFILTER_DEBUG = True
DOWNLOAD_DELAY = 10
I was able to get it to work using SplashRequest instead of scrapy.Request.
For example:
import scrapy
from disqus.items import DisqusItem
from scrapy_splash import SplashRequest

class DisqusSpider(scrapy.Spider):
    name = "disqusSpider"
    start_urls = ["https://disqus.com/by/disqus_sAggacVY39/", "https://disqus.com/by/VladimirUlayanov/", "https://disqus.com/by/Beasleyhillman/", "https://disqus.com/by/Slick312/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_basic, dont_filter=True, endpoint='render.json',
                                args={
                                    'wait': 2,
                                    'html': 1
                                })
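The posted example stops at start_requests. A rough, untested continuation that also fetches the followers page through Splash, reusing the selectors from the original spider and carrying the profile URL in meta (an addition not shown in the answer above), might look like this:

    def start_requests(self):
        for url in self.start_urls:
            # same as above, but keep track of which profile this request belongs to
            yield SplashRequest(url, self.parse_basic, dont_filter=True, endpoint='render.json',
                                args={'wait': 2, 'html': 1},
                                meta={'base_profile_url': url})

    def parse_basic(self, response):
        name = response.css("h1.cover-profile-name.text-largest.truncate-line::text").extract_first()
        item = DisqusItem(name=name)
        base_profile_url = response.meta['base_profile_url']
        # request the followers page of the same profile through Splash as well
        yield SplashRequest(base_profile_url + "followers/", self.parse_followers,
                            dont_filter=True, endpoint='render.json',
                            args={'wait': 2, 'html': 1},
                            meta={'item': item})

    def parse_followers(self, response):
        item = response.meta['item']
        item['followers'] = response.css("div.user-info a::attr(href)").extract()
        yield item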