How to fetch the Response object of a Request synchronously on Scrapy? - python

I believe using the "callback" method is asynchronous; please correct me if I'm wrong. I'm still new to Python, so please bear with me.
Anyway, I'm trying to make a method to check if a file exists and here is my code:
def file_exists(self, url):
    res = False
    response = Request(url, method='HEAD', dont_filter=True)
    if response.status == 200:
        res = True
    return res
I thought calling Request() would return a Response object, but it just returns a Request object; to capture the Response, I have to define a separate method as the callback.
Is there a way to get the Response object within the same code block where you call Request()?

If anyone is still interested in a possible solution: I managed it by making a blocking request with the requests library "inside" a Scrapy callback, like this:
import requests

request_object = requests.get(the_url_you_like_to_get)
response_object = scrapy.Selector(text=request_object.text)
item['attribute'] = response_object.xpath('//path/you/like/to/get/text()').extract_first()
and then proceed.
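Alternatively, if the goal is just the original file_exists check, here is a minimal sketch using a blocking HEAD request via the requests library (the timeout value and redirect handling are illustrative assumptions):
import requests

def file_exists(self, url):
    # Blocking existence check; bypasses Scrapy's asynchronous downloader entirely.
    try:
        head = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False
    return head.status_code == 200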

Request objects don't generate anything by themselves.
Scrapy uses an asynchronous downloader engine which takes these Request objects and generates Response objects.
If any method in your spider returns (or yields) a Request object, it is automatically scheduled in the downloader, and the resulting Response is passed to the specified callback (i.e. Request(url, callback=self.my_callback)).
Check out more in Scrapy's architecture overview.
Now, depending on when and where you are doing it, you can schedule requests by telling the downloader engine directly:
self.crawler.engine.schedule(Request(url, callback=self.my_callback), spider)
If you run this from a spider, spider here can most likely be self, and self.crawler is an attribute inherited from scrapy.Spider.
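For illustration, a sketch of how that might look from inside a spider callback (the URL and callback name are placeholders, and the engine.schedule signature is the one from the line above; this API has changed across Scrapy versions):
def parse(self, response):
    # push an extra request straight to the downloader engine
    extra = Request(response.urljoin('/robots.txt'), method='HEAD',
                    callback=self.check_head)
    self.crawler.engine.schedule(extra, self)

def check_head(self, response):
    self.logger.info('HEAD %s returned %s', response.url, response.status)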
Alternatively, you can always block the asynchronous stack by using a synchronous HTTP library such as requests:
import requests

def parse(self, response):
    image_url = response.xpath('//img/@href').extract_first()
    if image_url:
        # blocking call -- this stalls the reactor until it returns
        image_head = requests.head(image_url)
        if 'image' in image_head.headers['Content-Type']:
            item['image'] = image_url  # item is built elsewhere in the callback
It will slow your spider down but it's significantly easier to implement and manage.

Scrapy uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.
Unless you are manually using a Downloader, it seems like the way you're using the framework is incorrect. I'd read a bit more about how you can create proper spiders here.
As for checking whether a file exists, your spider can store relevant information in a database or other data structure when parsing the scraped data in its parse*() method, and you can later query it from your own code.
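For example, a minimal sketch of that idea (the HEAD-request approach and the seen_files dict are illustrative assumptions):
def parse(self, response):
    # queue a HEAD request for every linked file found on the page;
    # handle_httpstatus_all lets non-200 responses reach the callback too
    for href in response.xpath('//a/@href').extract():
        yield Request(response.urljoin(href), method='HEAD',
                      callback=self.parse_head, dont_filter=True,
                      meta={'handle_httpstatus_all': True})

def parse_head(self, response):
    # self.seen_files is assumed to be a dict initialized elsewhere (e.g. in __init__)
    self.seen_files[response.url] = (response.status == 200)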

Related

How to extend scrapy feed export to send crawl results in POST requests - open and store not called?

I have written a spider and it works fine. I now wish to send the results of that scrape to a rest API via POST request.
I'm under the impression that extending the feed export functionality to make POST requests, instead of writing to a file or sending to S3, etc., is the way to go.
I am sure that my settings.py configuration is correct, but for some reason the open and close functions are never called, even when I literally copy, paste and rename pre-existing feed storage classes from feedexport.py.
Please see code below:
settings.py
EXTENSIONS = {'project.extensions.ExportToApi.ExportScrape':400}
FEED_STORAGES = {'http': 'project.extensions.ExportToApi.ExportScrape'}
FEED_API_URL = 'http://localhost:8000'
(there's a simple server at localhost:8000 printing out all the GET and POST requests it receives)
/project/extensions/ExportToApi.py
from scrapy.extensions.feedexport import BlockingFeedStorage


class ExportScrape(BlockingFeedStorage):
    def __init__(self, crawler, uri):
        self.crawler = crawler
        self.url = uri

    @classmethod
    def from_crawler(cls, crawler):
        uri_from_settings = crawler.settings['FEED_API_URL']
        return cls(crawler, uri_from_settings)

    def _store_in_thread(self, file):
        file.seek(0)
        import requests
        r = requests.post(self.url, data=file)
        file.close()
When I add print statements, they run in __init__ and from_crawler, but not in _store_in_thread, nor in open or store when those are implemented.
A POST request to a REST API is the same kind of HTTP request as any other request Scrapy sends.
One option is to return, inside the parse method, a Request object to the REST API endpoint with an empty callback, instead of an item object or dict (and without any custom feed storages):
def parse(self, response):
    ...
    # yield item
    ...
    yield Request(
        method='POST',
        url=server_url,
        body=converted_item_data,
        callback=self.not_parse)

def not_parse(self, response):
    pass
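For completeness, converted_item_data above is assumed to already be a serialized payload; a hedged sketch of building it from a scraped item inside parse, with JSON and the Content-Type header as assumptions about the target API:
import json
from scrapy import Request

def parse(self, response):
    item_data = {'title': 'example title', 'url': response.url}  # illustrative item fields
    yield Request(
        method='POST',
        url=server_url,  # same server_url as in the snippet above
        headers={'Content-Type': 'application/json'},
        body=json.dumps(item_data),
        callback=self.not_parse)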

How to crawl a dynamic web page whose API URL returns null?

I have a task to crawl all Pulitzer Winner, and I found this page has all I want: https://www.pulitzer.org/prize-winners-by-year/2018.
But I got the following problems,
Problem 1: How to crawl a dynamic page? I use Python's urllib2.urlopen to get the page's content, but this dynamic page doesn't return the real content that way.
Problem 2: I then found an API URL in devtools: https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json. But when I send a GET request with urllib2.urlopen, I always get null. Why does that happen, and how can I handle it?
If this is too basic, please point me to some keywords so I can research it on Google.
Thanks in advance!
One way to handle this is to create a session using the requests module. That way the necessary session details are carried over to the next API call. You also have to pass one more header, Referer, which tells the API which year you are looking at.
import requests
s = requests.session()
url = "https://www.pulitzer.org/prize-winners-by-year/2017"
resp1 = s.get(url)
headers = {'Referer': 'https://www.pulitzer.org/prize-winners-by-year/2017'}
api = "https://www.pulitzer.org/cache/api/1/winners/year/166/raw.json"
data = s.get(api, headers=headers)
Now you can extract the data from the response object data.
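Assuming the endpoint returns JSON (as the .json URL suggests), a short usage example:
print(data.status_code)   # expect 200 rather than an empty/null body
winners = data.json()     # parse the JSON body into Python objects
print(winners)            # the exact keys depend on the Pulitzer API response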

Scrapy middleware to replace single request with multiple requests

I want a middleware that will take a single Request and transform it into a generator of two different requests. As far as I can tell, the downloader middleware process_request() method can only return a single Request, not a generator of them. Is there a nice way to split an arbitrary request into multiple requests?
It seems that spider middleware process_start_requests actually happens after the start_requests Requests are sent through the downloader. For example, if I set start_urls = ['https://localhost/'] and
def process_start_requests(self, start_requests, spider):
    yield Request('https://stackoverflow.com')
it will fail with ConnectionRefusedError, having tried and failed the localhost request.
I don't know what the logic would be behind transforming a request (before it is sent) into multiple requests, but you can still generate several requests (or even items) from a middleware, like this:
def process_request(self, request, spider):
    for a in range(10):
        spider.crawler.engine.crawl(
            Request(url='myurl', callback=callback_method),
            spider)
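Fleshed out slightly, a hedged sketch of such a downloader middleware (the class name, the variant URLs, the split_child meta flag and the two-argument engine.crawl signature are assumptions; that signature matches older Scrapy versions):
from scrapy import Request

class SplitRequestMiddleware(object):
    def process_request(self, request, spider):
        if request.meta.get('split_child'):
            return None  # don't split requests we injected ourselves
        # inject extra sibling requests alongside the one being processed
        for i in range(2):
            spider.crawler.engine.crawl(
                Request(url='%s?variant=%d' % (request.url, i),
                        callback=request.callback,
                        meta={'split_child': True}),
                spider)
        return None  # returning None lets the original request continue through the chain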

Scrapy CrawlSpider retry scrape

For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:
def parse_page(self, response):
    url = response.url
    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return
    ...
    # Normal parsing logic
However, it seems like when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse (which the CrawlSpider uses to apply the crawl rules) and dont_filter=True, I could avoid the duplicate filter. However, with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.
Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of doing dynamic js rendering using something like splash if possible, and this only happens intermittently.
I would think about having a custom retry middleware instead, similar to the built-in one.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response
        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})
        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq
And don't forget to activate it.
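For example, activation would look roughly like this in settings.py (the module path and priority value are illustrative assumptions):
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryMiddleware': 550,
}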

Scrapy - how to manage cookies/sessions

I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and the subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests... otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the other.
I suppose that would mean disabling cookies... and then grabbing the session cookie from the search response, and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):
    def parse(self, response):
        '''Parse category page, extract subcategories links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            # Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            # This is needed because the site uses sessions to remember the search parameters.
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse)
Scrapy has a downloader middleware, CookiesMiddleware, implemented to support cookies. You just need to enable it. It mimics how the cookie jar in a browser works.
When a request goes through CookiesMiddleware, it reads the cookies for that domain and sets them on the Cookie header.
When a response returns, CookiesMiddleware reads the cookies sent by the server in the Set-Cookie response header and saves/merges them into the cookiejar kept by the middleware.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware instances, so different spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta['dont_merge_cookies'] = True tells the middleware that this particular request should not read cookies from the cookiejar, and that Set-Cookie headers from its response should not be merged back into the cookiejar. It's a per-request switch.
CookiesMiddleware supports multiple cookiejars. You control which cookiejar to use at the request level with Request.meta['cookiejar'] = custom_cookiejar_name.
Please refer to the docs and the related source code of CookiesMiddleware.
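Putting that together for the search scenario in the question, a minimal sketch (search_url, search_queries and the form field and link selectors are illustrative assumptions):
def start_requests(self):
    for i, query in enumerate(self.search_queries):
        yield scrapy.FormRequest(
            self.search_url,
            formdata={'q': query},
            meta={'cookiejar': i},  # each search gets its own cookiejar
            callback=self.parse_results)

def parse_results(self, response):
    next_page = response.xpath('//a[@class="next"]/@href').extract_first()
    if next_page:
        yield scrapy.Request(
            response.urljoin(next_page),
            meta={'cookiejar': response.meta['cookiejar']},  # reattach explicitly
            callback=self.parse_results)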
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
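A minimal sketch of how the spider might receive that argument (the class and attribute names are illustrative):
class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, search_query=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.search_query = search_query  # received from -a search_query=something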
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies and rotate profiles on demand
scrapy-dynamic-sessions is almost the same, but allows you to randomly pick a proxy and User-Agent and handles retrying requests on errors
