Scrapy - how to manage cookies/sessions - python

I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and its subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I would have to disable multiple concurrent requests... otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
I'm confused, any clarification would be gratefully received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the next.
I suppose that would mean disabling cookies... and then grabbing the session cookie from the search response and passing it along to each subsequent request.
Is this what you should do in this situation?
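For what it's worth, a minimal sketch of that manual approach, assuming cookies are disabled and the site sets a simple session cookie (the URLs and callback names here are placeholders):

import scrapy

class ManualSessionSpider(scrapy.Spider):
    name = "manual_session"
    custom_settings = {"COOKIES_ENABLED": False}  # turn off Scrapy's own cookie handling
    start_urls = ["http://www.example.com/search"]  # placeholder

    def parse(self, response):
        # Grab the raw session cookie(s) the site just set.
        set_cookie_headers = response.headers.getlist("Set-Cookie")
        session_cookie = b"; ".join(h.split(b";")[0] for h in set_cookie_headers)
        yield scrapy.Request(
            "http://www.example.com/search?page=2",  # placeholder
            headers={"Cookie": session_cookie},       # re-attach the cookie manually
            callback=self.parse_results,
        )

    def parse_results(self, response):
        pass  # parse the search results here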

Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)

from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):

    def parse(self, response):
        '''Parse category page, extract subcategories links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            # Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            # This is needed because the site uses sessions to remember the search parameters.
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)

def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )

Scrapy has a downloader middleware, CookiesMiddleware, implemented to support cookies; you just need to enable it. It mimics how a browser's cookie jar works.
When a request goes through CookiesMiddleware, it reads the cookies for that domain and sets them on the Cookie header.
When a response returns, CookiesMiddleware reads the cookies sent by the server in the Set-Cookie response header and saves/merges them into the cookiejar kept by the middleware.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read cookies from the cookiejar, and that Set-Cookie headers from its response should not be merged back into the cookiejar. It's a request-level switch.
CookiesMiddleware supports multiple cookiejars. You control which cookiejar to use at the request level: Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
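For illustration, a brief sketch combining both meta keys (the attribute and callback names are made up):

def start_requests(self):
    # Each search gets its own named cookiejar, kept by CookiesMiddleware.
    for i, url in enumerate(self.search_urls):  # self.search_urls is hypothetical
        yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse_search)

def parse_search(self, response):
    # Re-attach the same cookiejar so the session continues across pages.
    yield scrapy.Request(response.urljoin('?page=2'),
                         meta={'cookiejar': response.meta['cookiejar']},
                         callback=self.parse_search)

# A one-off request that neither reads from nor writes to any cookiejar:
# scrapy.Request(url, meta={'dont_merge_cookies': True})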

I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
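A rough sketch of such a spider taking the query as an argument (the spider name, URL and form field are illustrative):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, search_query=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.search_query = search_query

    def start_requests(self):
        # One spider instance == one search == one session/cookiejar.
        yield scrapy.FormRequest('http://www.example.com/search',
                                 formdata={'q': self.search_query},
                                 callback=self.parse_results)

    def parse_results(self, response):
        pass  # extract items / follow pagination for this single search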

There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies and rotate profiles on demand
scrapy-dynamic-sessions is almost the same, but lets you pick the proxy and User-Agent randomly and handles request retries on errors

Related

How to prevent 301 redirect for web crawler

I'm fairly new to web scraping, and am just testing it out on a few web pages. I've successfully scraped several Amazon searches, however in this case I get a 301 redirect, causing a different page to be scraped.
I've tried adding a line (handle_httpstatus_list = [301]) to prevent the redirect. This causes no data to be scraped at all.
On reading the documentation for Scrapy, I thought perhaps editing the middlewares could solve this problem? However, I was still unsure about how to go about doing this.
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    handle_httpstatus_list = [301]
    start_urls = ['https://www.amazon.com/s?i=stripbooks&rh=n%3A2%2Cp_30%3AIndependently+published%2Cp_n_feature_browse-bin%3A2656022011&s=daterank&Adv-Srch-Books-Submit.x=50&Adv-Srch-Books-Submit.y=10&field-datemod=8&field-dateop=During&field-dateyear=2019&unfiltered=1&ref=sr_adv_b']

    def parse(self, response):
        SET_SELECTOR = '.s-result-item'
        for car in response.css(SET_SELECTOR):
            NAME = '.a-size-medium ::text'
            TITLE = './/h2/a/span/text()'
            LINK = './/h2/a/@href'
            yield {
                'name': car.css(NAME).extract(),
                'title': car.xpath(TITLE).extract(),
                'link': car.xpath(LINK).get()
            }

        NEXT_PAGE_SELECTOR = '.a-last a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
I'm sorry about the broad answer I'm giving here, but since you haven't provided much information or the stack trace of your crawler, I will try to cover what I think is a very likely scenario for why you're having this problem, and give you pointers in those directions.
What's most likely happening is that the website is looking for some condition to be met (a wrong page, cookies, a user-agent, a referrer, request headers). In case you are having a problem with session/cookie management, please refer to this post here about that topic.
Also, given that you've already identified a redirect, please take a look at handling redirects, and also check the usage of middlewares to handle such behaviors in your scraper.
If by any chance you're having issues with your request headers or the user-agent setting, here you can find better information about the user-agent and settings in general, or check the response object structure to create one that fits your scenario.
Obviously, never forget to check the official documentation for broader information on any package; it is very useful.
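If headers do turn out to be the culprit, a settings sketch along these lines is a common starting point (the values are examples, not a guaranteed fix):

# settings.py (illustrative values)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # a browser-like user agent
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
}
# To inspect rather than follow a redirect on a specific request, you can also pass
# meta={'dont_redirect': True, 'handle_httpstatus_list': [301]} on that request.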

How to fetch the Response object of a Request synchronously on Scrapy?

I believe using a "callback" method is asynchronous; please correct me if I'm wrong. I'm still new to Python, so please bear with me.
Anyway, I'm trying to make a method to check if a file exists and here is my code:
def file_exists(self, url):
    res = False
    response = Request(url, method='HEAD', dont_filter=True)
    if response.status == 200:
        res = True
    return res
I thought the Request() call would return a Response object, but it still returns a Request object; to capture the Response, I have to create a separate method as the callback.
Is there a way to get the Response object within the code block where you call Request()?
If anyone is still interested in a possible solution – I managed it by doing a request with "requests" sort of "inside" a scrapy function like this:
import requests
import scrapy

request_object = requests.get(the_url_you_like_to_get)
response_object = scrapy.Selector(text=request_object.text)
item['attribute'] = response_object.xpath('//path/you/like/to/get/text()').extract_first()
and then proceed.
Request objects don't fetch anything by themselves.
Scrapy uses an asynchronous downloader engine which takes these Request objects and generates Response objects.
If any method in your spider returns a Request object, it is automatically scheduled in the downloader and a Response object is passed to the specified callback (i.e. Request(url, callback=self.my_callback)).
Check out more at Scrapy's architecture overview.
Now, depending on when and where you are doing it, you can schedule requests by telling the engine to schedule some requests:
self.crawler.engine.schedule(Request(url, callback=self.my_callback), spider)
If you run this from a spider, spider here can most likely be self, and self.crawler is inherited from scrapy.Spider.
Alternatively, you can always block the asynchronous stack by using something like the requests library:
def parse(self, response):
    item = {}
    image_url = response.xpath('//img/@src').extract_first()
    if image_url:
        image_head = requests.head(image_url)
        if 'image' in image_head.headers['Content-Type']:
            item['image'] = image_url
It will slow your spider down but it's significantly easier to implement and manage.
Scrapy uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.
Unless you are manually using a Downloader, it seems like the way you're using the framework is incorrect. I'd read a bit more about how you can create proper spiders here.
As for checking whether a file exists, your spider can store the relevant information in a database or other data structure when parsing the scraped data in its parse*() methods, and you can later query it in your own code.
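For the original file-exists question, the non-blocking way is to let the HEAD request flow through the engine and record the result in a callback; a sketch, where file_exists_map is a made-up spider attribute:

def check_file(self, url):
    return Request(url, method='HEAD', dont_filter=True,
                   callback=self.on_head, errback=self.on_head_failed)

def on_head(self, response):
    # response.status is only available here, after the downloader has run the request
    self.file_exists_map[response.url] = (response.status == 200)

def on_head_failed(self, failure):
    self.file_exists_map[failure.request.url] = False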

Persist Session Cookie in Scrapy

I am scraping a site that has an accept-terms form that I need to click through. When I click the button I am redirected to the resource that needs to be scraped. I have the basic mechanics working: the initial click-through works, I get a session, and all goes well until the session times out. Then, for some reason, Scrapy does get redirected, but the response URL doesn't get updated, so I get duplicate items since I am using the URL to check for duplication.
For example the URL I am requesting is:
https://some-internal-web-page/Records/Details/119ce2b7-35b4-4c63-8bd2-2bfbf77299a8
But when the session expires I get:
https://some-internal-web-page/?returnUrl=%2FRecords%2FDetails%2F119ce2b7-35b4-4c63-8bd2-2bfbf77299a8
Here is my code:
# function to get through the accept dialog
def parse(self, response):
    yield FormRequest.from_response(response, formdata={"value": "Accept"},
                                    callback=self.after_accept)

# function to parse markup
def after_accept(self, response):
    global latest_inspection_date
    urls = ['http://some-internal-web-page/Records?SearchText=&SortMode=MostRecentlyHired&page=%s&PageSize=25' % page for page in xrange(1, 500)]
    for u in urls:
        yield Request(u, callback=self.parse_list)
So my question is: how do I persist and/or refresh the session cookie so that I don't get the redirect URL instead of the URL I need?
Cookies are enabled by default and carried across every request/callback; make sure you have them enabled with COOKIES_ENABLED = True in settings.py.
You can also enable debug logging for them with COOKIES_DEBUG = True (False by default) and check whether the cookies are being passed correctly; your problem may be about something else.
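For completeness, the two settings mentioned above in settings.py:

# settings.py
COOKIES_ENABLED = True   # default; CookiesMiddleware keeps the session cookie between requests
COOKIES_DEBUG = True     # log every Cookie / Set-Cookie header to help spot where the session drops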

Wait until the webpage loads in Scrapy

I am using a Scrapy script to load a URL using "yield".
MyUrl = "www.example.com"
request = Request(MyUrl, callback=self.mydetail)
yield request

def mydetail(self, response):
    item['Description'] = response.xpath(".//table[@class='list']//text()").extract()
    return item
The URL seems to take minimum 5 seconds to load. So I want Scrapy to wait for some time to load the entire text in item['Description'].
I tried "DOWNLOAD_DELAY" in settings.py, but it was no use.
Take a brief look with Firebug or another tool that captures the responses of the Ajax requests made by the JavaScript code. You can then make a chain of requests to catch those Ajax calls which fire after the page loads. There are several related questions: parse ajax content,
retrieve final page,
parse dynamic content.
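Once the Ajax endpoint is identified in the network panel, you can usually request it directly instead of waiting for the page to render; a hedged sketch (the endpoint URL and JSON key are hypothetical):

import json
from scrapy import Request

def mydetail(self, response):
    # The table is filled in by an XHR call; fetch that call directly.
    ajax_url = "http://www.example.com/api/list?id=123"  # found via the browser's network tab
    yield Request(ajax_url, callback=self.parse_ajax)

def parse_ajax(self, response):
    data = json.loads(response.text)
    item = {}
    item['Description'] = data.get('description')  # key name is an assumption
    yield item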

Scrapy: how to debug scrapy lost requests

I have a Scrapy spider, but sometimes it doesn't return requests.
I found that out by adding log messages before yielding the request and after getting the response.
The spider iterates over pages and parses the links for item scraping on each page.
Here is a part of the code:
class SampleSpider(BaseSpider):
    ....

    def parse_page(self, response):
        ...
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
        yield request

    def parse_item_general(self, response):
        self.log('parse_item_general_recv {url}'.format(url=response.url), level=log.INFO)
        sel = Selector(response)
        ...
I've compared the number of each log message, and there are more "parse_item_general_send" messages than "parse_item_general_recv" ones.
There are no 400 or 500 errors in the final statistics; all response status codes are 200. It looks like the requests just disappear.
I've also added these parameters to minimize possible errors:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8
Because of the asynchronous nature of Twisted, I don't know how to debug this.
I've found a similar question: Python Scrapy not always downloading data from website, but it doesn't have any responses.
On the same note as Rho, you can add the setting
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
to your settings.py, which will disable the duplicate URL filtering. This is a tricky issue, since there isn't an obvious debug string in the Scrapy logs that tells you when a duplicate request has been filtered out.
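Alternatively, if only a few requests are affected, you can bypass the duplicate filter per request instead of globally; a small sketch reusing the names from the question's code:

request = Request(target_link, callback=self.parse_item_general,
                  dont_filter=True)  # skip the dupefilter for this request only
request.meta['date_updated'] = date_updated
yield request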
