Running Scrapy multiple times on the same URL - python

I'd like to crawl a certain URL which returns a random response each time it's called. The code below returns what I want, but I'd like to run it for a long time so that I can use the data for an NLP application. The code only runs once with scrapy crawl the, though I expect it to keep running because of the last if statement.
Is Unix's start command what I'm looking for? I tried it, but it felt a bit slow. If I had to use the start command, would opening many terminal tabs and running the same command with the start prefix be good practice, or would it just throttle the speed?
class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['https://websiteiwannacrawl.com']
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        info = {}
        info['text'] = response.css('.pd-text').extract()
        yield info
        next_page = 'https://websiteiwannacrawl.com'
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)

dont_filter
indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
You should add this in your Request:
    yield scrapy.Request(next_page, dont_filter=True)
It's not related to your question, but regarding callback=self.parse, please read the Parse Method documentation.
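To see why the original spider stops after one page, it helps to picture what the scheduler's duplicate filter does. The sketch below is only a toy model (Scrapy's real dupefilter fingerprints the method, URL, and body, and this `schedule` function is made up for illustration), but it shows why a repeated URL is dropped unless dont_filter is set:

```python
# Toy model of Scrapy's duplicate filter (NOT the real implementation,
# which fingerprints the request's method, URL and body).
seen = set()

def schedule(url, dont_filter=False):
    """Return True if the request would be enqueued, False if dropped."""
    if not dont_filter and url in seen:
        return False  # duplicate: silently dropped by the scheduler
    seen.add(url)
    return True

# The first request for the URL goes through; the second is filtered out,
# which is why the spider stops after one page...
print(schedule('https://websiteiwannacrawl.com'))                    # True
print(schedule('https://websiteiwannacrawl.com'))                    # False
# ...unless dont_filter=True is passed, as the answer suggests.
print(schedule('https://websiteiwannacrawl.com', dont_filter=True))  # True
```

With dont_filter=True on the recursive Request, every identical request is re-enqueued, so the spider keeps fetching the random response indefinitely.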

Related

How does Scrapy proceed with the urls given in the urls variable under start_requests?

Just wondering why, when I have url = ['site1', 'site2'] and I run Scrapy from a script using .crawl() twice in a row, like
def run_spiders():
    process.crawl(Spider)
    process.crawl(Spider)
the output is:
site1info
site1info
site2info
site2info
as opposed to
site1info
site2info
site1info
site2info
Because as soon as you call process.start(), requests are handled asynchronously. The order is not guaranteed.
In fact, even if you only call process.crawl() once, you may sometimes get:
site2info
site1info
To run spiders sequentially from Python, see this other answer.
start_requests uses yield: each yielded request is queued by the scheduler. To understand it fully, read this StackOverflow answer.
Here is a code example of how it works with start_urls in the start_requests method.
start_urls = [
    "url1.com",
    "url2.com",
]

def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.parse)
For custom request ordering, the priority argument can be used:

def start_requests(self):
    yield scrapy.Request(self.start_urls[0], callback=self.parse)
    yield scrapy.Request(self.start_urls[1], callback=self.parse, priority=1)

The request with the higher priority value will be pulled from the queue first. By default, priority is 0.
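To make that ordering concrete, here is a small stand-in for the scheduler's priority queue using Python's heapq (the names enqueue/dequeue are mine, not Scrapy's API). Higher-priority requests come out first, which a min-heap reproduces by negating the priority:

```python
import heapq

# Minimal stand-in for a priority queue that pops higher-priority
# requests first. heapq is a min-heap, so the priority is negated;
# the insertion counter keeps equal-priority requests in FIFO order.
queue = []
counter = 0

def enqueue(url, priority=0):
    global counter
    heapq.heappush(queue, (-priority, counter, url))
    counter += 1

def dequeue():
    return heapq.heappop(queue)[2]

enqueue("url1.com")              # default priority 0
enqueue("url2.com", priority=1)  # higher priority
print(dequeue())  # url2.com -- the higher-priority request comes out first
print(dequeue())  # url1.com
```

This is only a model of the ordering semantics; Scrapy's actual scheduler also spills requests to disk queues and interleaves them with concurrency limits.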

What changes need to be done to get HTTP Status code of domain using Scrapy?

I have this code available from my previous experiment.
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for title in response.css('h2'):
            yield {'Agent-name': title.css('a ::text').extract_first()}
        next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I do not understand how to modify this code so that it takes a list of URLs from a text file (maybe 200+ domains) as input, checks the HTTP status of each domain, and stores the result in a file. I am trying this to check whether the domains are live or not.
What I am expecting to have output is:
example.com,200
example1.com,300
example2.com,503
I want to give a file as input to the Scrapy script, and it should give me the above output. I have tried looking at these questions: How to detect HTTP response status code and set a proxy accordingly in scrapy? and Scrapy and response status code: how to check against it?
But had no luck. Hence, I am thinking of modifying my code to get it done. How can I do that? Please help me.
For each response object, you can get the URL and status code through the response's properties. So for each link you send a request to, you can read the status code from response.status.
Does it work as you want like that?
def parse(self, response):
    # file chosen for output (appending mode):
    file.write(u"%s : %s\n" % (response.url, response.status))
    # if response.status in [400, ...]: do something
    for title in response.css('h2'):
        yield {'Agent-name': title.css('a ::text').extract_first()}
    next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
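As a side note, the desired example.com,200 lines can be produced with a small helper that is independent of Scrapy (format_status_line is a hypothetical name, not part of any API; Python 3 shown). You could call it from the parse callback in place of the raw file.write above:

```python
from urllib.parse import urlparse

def format_status_line(url, status):
    """Turn a response URL and status code into a 'domain,status' CSV line."""
    # netloc is the host part of the URL; fall back to the raw string
    # in case the input has no scheme.
    domain = urlparse(url).netloc or url
    return "%s,%d" % (domain, status)

print(format_status_line("http://example.com/", 200))    # example.com,200
print(format_status_line("http://example2.com/x", 503))  # example2.com,503
```

Writing these lines through a pipeline or a FEEDS export would be cleaner than opening a file inside the spider, but the formatting logic is the same.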

Scrapy Start_request parse

I am writing a Scrapy script to search a website and scrape the results. I need to search for items on the website and parse each URL from the search results. I started with Scrapy's start_requests, where I pass the search query and redirect to another function, parse, which retrieves the URLs from the search results. Finally, I call another function, parse_item, to parse the results. I'm able to extract all the search result URLs, but I'm not able to parse the results (parse_item is not working). Here is the code:
# -*- coding: utf-8 -*-
from scrapy.http.request import Request
from scrapy.spider import BaseSpider

class xyzspider(BaseSpider):
    name = 'dspider'
    allowed_domains = ["www.example.com"]
    mylist = ['Search item 1', 'Search item 2']
    url = 'https://example.com/search?q='

    def start_requests(self):
        for i in self.mylist:
            i = i.replace(' ', '+')
            starturl = self.url + i
            yield Request(starturl, self.parse)

    def parse(self, response):
        itemurl = response.xpath(".//section[contains(@class, 'search-results')]/a/@href").extract()
        for j in itemurl:
            print j
            yield Request(j, self.parse_item)

    def parse_item(self, response):
        print "hello"
        '''rating = response.xpath(".//ul[@class='ratings']/li[1]/span[1]/text()").extract()
        print rating'''
Could anyone please help me. Thank you.
I was getting a Filtered offsite request error. I changed the allowed domain from allowed_domains = ['www.xyz.com'] to ['xyz.com'] and it worked perfectly.
Your code looks good. So you might need to use the Request attribute dont_filter set to True:
    yield Request(j, self.parse_item, dont_filter=True)
From the docs:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
Anyway, I recommend you have a look at Item Pipelines.
Those are used to process scraped items emitted with:
    yield my_object
Item pipelines are used to post-process everything yielded by the spider.
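A pipeline is just a class with a process_item method that the framework calls once per yielded item, so you can define and test one without importing Scrapy at all. A minimal sketch (the class name and the 'rating' field are made up for illustration):

```python
# Minimal sketch of an item pipeline: the framework calls process_item()
# once for every object the spider yields. The 'rating' field here is
# invented for illustration; real pipelines also receive the spider
# instance and may raise DropItem to discard an item.
class CleanRatingPipeline(object):
    def process_item(self, item, spider):
        # normalize a scraped string like ' 4.5 ' into a float
        if 'rating' in item:
            item['rating'] = float(item['rating'].strip())
        return item  # pass the item on to the next pipeline

pipeline = CleanRatingPipeline()
print(pipeline.process_item({'rating': ' 4.5 '}, spider=None))  # {'rating': 4.5}
```

To activate a real pipeline, it would also need to be listed under ITEM_PIPELINES in the project settings.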

Multiple pages per item in Scrapy

Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            # '//path/to/property/text()'
            # or
            # {'url': '//a[@id="Location"]/@href',
            #  'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
        return loader
I'm running these spiders against multiple sites; most of them have the data I need on one page, and those work just fine. However, some sites keep certain properties on a sub-page (e.g., the "address" data lives behind the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
The crawled page links to another page containing the data (1+ further request necessary)
The crawled page contains the data (No further request necessary)
In your current code, you call yield bl.load_item() for both cases, but in the parse callback. Note that the request you yield is executed at some later point in time; thus the item is incomplete when it is loaded, which is why you're missing the place_property key for the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are needed.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, thus simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
    # ...
    if isinstance(parse_xpath, dict):  # place_property is at a URL
        url = sel.xpath(parse_xpath['url_elem']).extract()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
    yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
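The request-chaining pattern the answer relies on can be simulated without Scrapy: each callback fills part of an item and hands the rest to the next callback through a meta dict. All names below are illustrative, and responses are stood in by plain dicts:

```python
# Framework-free sketch of chaining callbacks through meta: the first
# callback fills part of the item and defers the remaining field to a
# second callback, which completes the item and only then emits it.
# All names and the dict-based "responses" are illustrative.
def parse(response):
    item = {'origin': response['url']}
    # 'address' lives on a sub-page, so defer it to the next callback
    return ('request', response['directions_url'],
            {'item': item, 'field': 'address'})

def get_url_property(response, meta):
    item = meta['item']
    item[meta['field']] = response['body']  # complete the item here...
    return ('item', item)                   # ...and only now emit it

kind, url, meta = parse({'url': 'http://a.example',
                         'directions_url': 'http://a.example/dir'})
kind, item = get_url_property({'body': '1 Main St'}, meta)
print(item)  # {'origin': 'http://a.example', 'address': '1 Main St'}
```

The key point mirrors the answer above: the item is emitted from the last callback in the chain, never from parse, so it is complete by the time it reaches the pipeline.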

Crawling multiple starting urls with different depth

I'm trying to get Scrapy 0.12 to change its "maximum depth" setting for different URLs in the start_urls variable of the spider.
If I understand the documentation correctly, there's no way, because the DEPTH_LIMIT setting is global for the entire framework and there's no notion of which request originated from which initial one.
Is there a way to circumvent this? Is it possible to have multiple instances of the same spider, each initialized with one starting url and a different depth limit?
Sorry, it looks like I didn't understand your question correctly at first. Correcting my answer:
Responses have a depth key in meta. You can check it and take appropriate action.
class MySpider(BaseSpider):
    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        if response.meta['start_url'] == '???' and response.meta['depth'] > 10:
            pass  # do something here for exceeding the limit for this start url
        else:
            # find links and yield requests for them, passing along the start url
            yield Request(other_url, meta={'start_url': response.meta['start_url']})
http://doc.scrapy.org/en/0.12/topics/spiders.html#scrapy.spider.BaseSpider.make_requests_from_url
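The check inside parse can be factored into a tiny helper that maps each start URL to its own limit (depth_limits, should_follow, and the example URLs are all hypothetical names, not Scrapy features):

```python
# Hypothetical per-start-url depth limits, standing in for the single
# global DEPTH_LIMIT setting. All names and URLs here are made up.
depth_limits = {
    'http://shallow.example': 2,
    'http://deep.example': 10,
}

def should_follow(start_url, depth, default_limit=5):
    """Return True if a response at this depth may still yield requests."""
    return depth <= depth_limits.get(start_url, default_limit)

print(should_follow('http://shallow.example', 3))  # False: over its limit of 2
print(should_follow('http://deep.example', 3))     # True: limit is 10
```

In the spider, parse would call should_follow(response.meta['start_url'], response.meta['depth']) and only yield further requests when it returns True.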
