I use Scrapy to crawl 1000 URLs and store the scraped items in MongoDB. I'd like to know how many items have been found for each URL. From the Scrapy stats I can see 'item_scraped_count': 3500.
However, I need this count for each start_url separately. There is also a referer field for each item that I might use to count each URL's items manually:
2016-05-24 15:15:10 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=6w-_ucPV674> (referer: https://www.youtube.com/results?q=billys&sp=EgQIAhAB)
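Counting manually from the referer could look roughly like this (a sketch only; the spider name, the link selector, and the stats key below are placeholders, not part of my real project):

import scrapy


class ManualCountSpider(scrapy.Spider):
    # hypothetical spider, just to illustrate counting items per referring start URL
    name = 'manual_count'
    start_urls = ['https://www.youtube.com/results?q=billys&sp=EgQIAhAB']

    def parse(self, response):
        # follow result links; the selector is a placeholder
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        # the Referer header is filled in by Scrapy's referer middleware
        referer = (response.request.headers.get('Referer') or b'').decode('utf-8')
        self.crawler.stats.inc_value('items_per_referer/{}'.format(referer))
        yield {'url': response.url}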
But I wonder if there is built-in support for this in Scrapy.
challenge accepted!
There isn't anything in Scrapy that directly supports this, but you could keep it separate from your spider code with a spider middleware:
middlewares.py
from scrapy.http.request import Request


class StartRequestsCountMiddleware(object):

    start_urls = {}

    def process_start_requests(self, start_requests, spider):
        for i, request in enumerate(start_requests):
            self.start_urls[i] = request.url
            request.meta.update(start_request_index=i)
            yield request

    def process_spider_output(self, response, result, spider):
        for output in result:
            if isinstance(output, Request):
                output.meta.update(
                    start_request_index=response.meta['start_request_index'],
                )
            else:
                spider.crawler.stats.inc_value(
                    'start_requests/item_scraped_count/{}'.format(
                        self.start_urls[response.meta['start_request_index']],
                    ),
                )
            yield output
Remember to activate it in settings.py:
SPIDER_MIDDLEWARES = {
    ...
    'myproject.middlewares.StartRequestsCountMiddleware': 200,
}
Now you should be able to see something like this in your spider stats:
'start_requests/item_scraped_count/START_URL1': ITEMCOUNT1,
'start_requests/item_scraped_count/START_URL2': ITEMCOUNT2,
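If you also want to act on those per-URL counts when the crawl finishes (for example, to store them next to your items in MongoDB), one option is to read them back from the stats collector in the spider's closed() hook. A minimal sketch, assuming the same stats prefix as the middleware above; the spider itself is just a placeholder:

import scrapy


class MySpider(scrapy.Spider):
    # hypothetical spider, only to show where the hook goes
    name = 'myspider'
    start_urls = ['https://example.com/a', 'https://example.com/b']

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        # collect the counters written by StartRequestsCountMiddleware
        prefix = 'start_requests/item_scraped_count/'
        stats = self.crawler.stats.get_stats()
        per_url = {key[len(prefix):]: value
                   for key, value in stats.items()
                   if key.startswith(prefix)}
        self.logger.info('Items per start URL: %s', per_url)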
I'm new to Scrapy and have been trying to develop a spider that scrapes Tripadvisor's things-to-do page. Tripadvisor paginates results with an offset, so I made the spider find the last page number, multiply it by the number of results per page, and loop over that range with a step of 30. However, it returns only a fraction of the results it's supposed to, and get_details prints out only 7 of the 28 pages scraped. I believe what is happening is URL redirection on random pages.
Scrapy logs this 301 redirection on the other pages, and it appears to be redirecting to the first page. I tried disabling redirection but that did not work.
2021-03-28 18:46:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-Nashville_Davidson_County_Tennessee.html> from <GET https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-oa90-Nashville_Davidson_County_Tennessee.html>
Here's my code for the spider:
import scrapy
import re


class TripadvisorSpider(scrapy.Spider):
    name = "tripadvisor"
    start_urls = [
        'https://www.tripadvisor.com/Attractions-g55229-Activities-a_allAttractions.true-oa{}-Nashville_Davidson_County_Tennessee.html'
    ]

    def parse(self, response):
        num_pages = int(response.css(
            '._37Nr884k .DrjyGw-P.IT-ONkaj::text')[-1].get())
        for offset in range(0, num_pages * 30, 30):
            formatted_url = self.start_urls[0].format(offset)
            yield scrapy.Request(formatted_url, callback=self.get_details)

    def get_details(self, response):
        print('url is ' + response.url)
        for listing in response.css('div._19L437XW._1qhi5DVB.CO7bjfl5'):
            yield {
                'title': listing.css('._392swiRT ._1gpq3zsA._1zP41Z7X::text')[1].get(),
                'category': listing.css('._392swiRT ._1fV2VpKV .DrjyGw-P._26S7gyB4._3SccQt-T::text').get(),
                'rating': float(re.findall(r"[-+]?\d*\.\d+|\d+", listing.css('svg.zWXXYhVR::attr(title)').get())[0]),
                'rating_count': float(listing.css('._392swiRT .DrjyGw-P._26S7gyB4._14_buatE._1dimhEoy::text').get().replace(',', '')),
                'url': listing.css('._3W_31Rvp._1nUIPWja._17LAEUXp._2b3s5IMB a::attr(href)').get(),
                'main_image': listing.css('._1BR0J4XD').attrib['src']
            }
Is there a way to get scrapy working for each page? What is causing this problem exactly?
Found a solution. Discovered I needed to handle the redirection manually and disable Scrapy's default middleware.
Here is the custom middleware I added to middlewares.py:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.selector import Selector
from scrapy.utils.response import get_meta_refresh


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        url = response.url
        if response.status in [301, 307]:
            reason = 'redirect %d' % response.status
            return self._retry(request, reason, spider) or response

        # handle meta refresh redirects
        interval, redirect_url = get_meta_refresh(response)
        if redirect_url:
            reason = 'meta'
            return self._retry(request, reason, spider) or response

        # test for captcha page
        hxs = Selector(response)
        captcha = hxs.xpath(
            ".//input[contains(@id, 'captchacharacters')]").extract()
        if captcha:
            reason = 'captcha'
            return self._retry(request, reason, spider) or response

        return response
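For completeness, enabling it and turning off the stock redirect/retry handling would look roughly like this in settings.py (the myproject module path is an assumption; adjust it to your project):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable built-in redirect handling so 301/307 responses reach the retry logic above
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    # swap the stock RetryMiddleware for the custom one, keeping its priority slot
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.CustomRetryMiddleware': 550,
}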
It is an updated version of the top answer to this question: Scrapy retry or redirect middleware
I'm trying to implement ScraperAPI, but as a beginner I think I'm doing something wrong, even though I followed their documentation to set everything up. Here is the documentation link.
from scrapy import Spider
from scrapy.http import Request
from .config import API
from scraper_api import ScraperAPIClient

client = ScraperAPIClient(API)


class GlassSpider(Spider):
    name = 'glass'
    allowed_domains = ['glassdoor.co.uk']
    start_urls = [client.scrapyGet(url='https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1')]

    def parse(self, response):
        jobs = response.xpath('//*[contains(@class, "react-job-listing")]')
        for job in jobs:
            job_url = job.xpath('.//*[contains(@class, "jobInfoItem jobTitle")]/@href').extract_first()
            absulate_job_url = response.urljoin(job_url)
            yield Request(client.scrapyGet(url=absulate_job_url),
                          callback=self.parse_jobpage,
                          meta={
                              "Job URL": absulate_job_url
                          })

    def parse_jobpage(self, response):
        absulate_job_url = response.meta.get('Job URL')
        job_description = "".join(line for line in response.xpath('//*[contains(@class, "desc")]//text()').extract())
        yield {
            "Job URL": absulate_job_url,
            "Job Description": job_description
        }
That's the output I'm receiving. What's wrong with my code? Please fix it for me so I can follow along and get the point. Thank you.
2020-10-01 23:01:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.glassdoor.co.uk%2FJob%2Frussian-jobs-SRCH_KE0%2C7.htm%3FfromAge%3D1&api_key=bec9dd9f2be095dfc6158a7e609&scraper_sdk=python> (referer: None)
2020-10-01 23:01:45 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
I'm not familiar with this particular lib, but from your execution logs the issue is that your request is being filtered, since it's considered offsite.
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
Since ScraperAPI routes your request through their domain, which is outside of what you defined in allowed_domains, it's filtered as an offsite request. To avoid this issue you can remove this line entirely:
allowed_domains = ['glassdoor.co.uk']
or try including 'api.scraperapi.com' in it.
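For example, keeping the original domain so that direct glassdoor.co.uk links still pass the offsite filter:

allowed_domains = ['glassdoor.co.uk', 'api.scraperapi.com']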
I am trying to get volume-weighted average prices for stocks from the moneycontrol.com website. The parse function is running without any issues, but the parse_links function is not getting called. Am I missing something here?
# -*- coding: utf-8 -*-
import scrapy


class MoneycontrolSpider(scrapy.Spider):
    name = "moneycontrol"
    allowed_domains = ["https://www.moneycontrol.com"]
    start_urls = ["https://www.moneycontrol.com/india/stockpricequote"]

    def parse(self, response):
        for link in response.css('td.last > a::attr(href)').extract():
            if link:
                yield scrapy.Request(link, callback=self.parse_links, method='GET')

    def parse_links(self, response):
        VWAP = response.xpath('//*[@id="n_vwap_val"]/text()').extract_first()
        print(VWAP)
        with open('quotes.txt', 'a+') as f:
            f.write('VWAP: {}'.format(VWAP) + '\n')
If you read the log output, the error becomes obvious.
2018-09-08 19:52:38 [py.warnings] WARNING: c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py:59: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.moneycontrol.com in allowed_domains.
warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_domains." % domain, URLWarning)
2018-09-08 19:52:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-08 19:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.moneycontrol.com/india/stockpricequote> (referer: None)
2018-09-08 19:52:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.moneycontrol.com': <GET http://www.moneycontrol.com/india/stockpricequote/chemicals/aartiindustries/AI45>
So just fix your allowed_domains, and you should be fine:
allowed_domains = ["moneycontrol.com"]
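In context, the top of the spider would then look like this (allowed_domains takes bare domain names, while start_urls still takes full URLs):

import scrapy


class MoneycontrolSpider(scrapy.Spider):
    name = "moneycontrol"
    # bare domain only; full URLs belong in start_urls
    allowed_domains = ["moneycontrol.com"]
    start_urls = ["https://www.moneycontrol.com/india/stockpricequote"]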
I just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell:
$ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5'
...
In [1]: response.css('div.booking::text').extract()
Out[1]:
['Booked 59 times today',
'Booked 20 times today',
'Booked 17 times today',
'Booked 29 times today',
'Booked 29 times today',
...
]
However, this simple spider returns an empty list:
import scrapy
from scrapy_splash import SplashRequest


class TableSpider(scrapy.Spider):
    name = 'opentable'
    start_urls = ['https://www.opentable.com/new-york-restaurant-listings']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1.5},
                                )

    def parse(self, response):
        yield {'bookings': response.css('div.booking::text').extract()}
when invoked with:
$ scrapy crawl opentable
...
DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': []}
I've already unsuccessfully tried
docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
and increased wait times.
I think your problem is in the middlewares; first of all, you need to add some settings:
# settings.py

# uncomment DOWNLOADER_MIDDLEWARES and add these settings to it
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# URL of the Splash server
SPLASH_URL = 'http://localhost:8050'

# and some Splash-aware components
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
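Depending on your scrapy-splash version, its README also lists a spider middleware for deduplicating Splash arguments; it shouldn't hurt to add it as well:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}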
And now run Docker:
sudo docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
If I do all these steps, I get back:
scrapy crawl opentable
...
2018-06-23 11:23:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opentable.com/new-york-restaurant-listings via http://localhost:8050/render.html> (referer: None)
2018-06-23 11:23:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': [
'Booked 44 times today',
'Booked 24 times today',
'and many others Booked values'
]}
This is not working because this content of the page is rendered with JS.
You can adopt several solutions:
1) Use Selenium.
2) Use the API the page itself calls: if you request https://www.opentable.com/injector/stats/v1/restaurants/<restaurant_id>/reservations you get the number of current reservations for that specific restaurant (restaurant_id); see the sketch after this list.
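A minimal sketch of that second approach, assuming the endpoint returns JSON and that you already know which restaurant IDs you care about (the spider name, the ID list, and the shape of the response are all assumptions):

import json

import scrapy


class BookingStatsSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'opentable_stats'
    restaurant_ids = ['1234', '5678']  # placeholder IDs

    def start_requests(self):
        for rid in self.restaurant_ids:
            url = ('https://www.opentable.com/injector/stats/v1/'
                   'restaurants/{}/reservations'.format(rid))
            yield scrapy.Request(url, callback=self.parse_stats,
                                 meta={'restaurant_id': rid})

    def parse_stats(self, response):
        # inspect the real response first; its structure is assumed here
        data = json.loads(response.text)
        yield {'restaurant_id': response.meta['restaurant_id'], 'stats': data}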
I have the following class methods in a Scrapy spider. parse_category yields a Request object with a callback to parse_product. Sometimes a category page redirects to a product page, so here I detect whether the category page is actually a product page. If it is, I just call the parse_product method. But for some reason it does not call the method.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    anchors = hxs.select('//div[@id="panelMfr"]/div/ul/li[position() != last()]/a')
    for anchor in anchors[2:3]:
        url = anchor.select('@href').extract().pop()
        cat = anchor.select('text()').extract().pop().strip()
        yield Request(urljoin(get_base_url(response), url), callback=self.parse_category, meta={"category": cat})

def parse_category(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = get_base_url(response)

    # check if it's a redirected product page
    if hxs.select(self.product_name_xpath):
        self.log("Category-To-Product Redirection")
        self.parse_product(response)  # <<---- This line is not called.
        self.log("Product Parsed")
        return

    products_xpath = '//div[@class="productName"]/a/@href'
    products = hxs.select(products_xpath).extract()
    for url in products:
        yield Request(urljoin(base_url, url), callback=self.parse_product, meta={"category": response.meta['category']})

    next_page = hxs.select('//table[@class="nav-back"]/tr/td/span/a[contains(text(), "Next")]/text()').extract()
    if next_page:
        url = next_page[0]
        yield Request(urljoin(base_url, url), callback=self.parse_category, meta={"category": response.meta['category']})

def parse_product(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = get_base_url(response)
    self.log("Inside parse_product")
In the log I see that Category-To-Product Redirection and Product Parsed are printed, but Inside parse_product is missing. What did I do wrong here?
2013-07-12 21:31:34+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/category.aspx> (referer: None)
2013-07-12 21:31:34+0100 [example.com] DEBUG: Redirecting (302) to <GET http://www.example.com/productinfo.aspx?catref=AM6901> from <GET http://www.example.com/products/Inks-Toners/Apple>
2013-07-12 21:31:35+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/productinfo.aspx?catref=AM6901> (referer: http://www.example.com/category.aspx)
2013-07-12 21:31:35+0100 [example.com] DEBUG: Category-To-Product Redirection
2013-07-12 21:31:35+0100 [example.com] DEBUG: Product Parsed
2013-07-12 21:31:35+0100 [example.com] INFO: Closing spider (finished)
2013-07-12 21:31:35+0100 [-] ERROR: ERROR:root:SPIDER CLOSED: No. of products: 0