Scrapy handling session cookie or 302 in ASP.NET site - Python

I am trying to crawl a web application written in ASP.NET.
I am trying to execute a search and crawl the search results page. Let's say that the search page is http://search.site.com/search/search.aspx
My crawler is pretty straightforward:
from scrapy import log
from scrapy.spider import Spider
from scrapy.http import FormRequest

class SitesearchSpider(Spider):
    name = 'sitecrawl'
    allowed_domains = ['search.site.org']
    start_urls = [
        "http://search.site.org/Search/Search.aspx"
    ]

    def parse(self, response):
        self.log("Calling Parse Method", level=log.INFO)
        response = response.replace(body=response.body.replace("disabled", ""))
        return [FormRequest(
            url="http://search.site.org/Search/Search.aspx",
            formdata={'ctl00$phContent$ucUnifiedSearch$txtIndvl': '2441386'},
            callback=self.after_search)]

    def after_search(self, response):
        self.log("In after search", level=log.INFO)
        if "To begin your search" in response.body:
            self.log("unable to get result")
        else:
            self.log(response.body)
But no matter what, only the same page (search.aspx) is returned to the after_search callback, instead of the expected searchresults.aspx with the results.
This is what seems to happen in the browser:
Search term is entered in one of the fields
Search button is clicked
On form submit to the same page (search.aspx), I see that it returns a 302 redirect to the search results page
The search results page then displays
I can see that the ASP.NET session cookie is being used here, because once a search has been made I can take a search results page URL, something like
http://search.site.com/search/searchresults.aspx?key=searchkey&anothersearchparam=12, open it in any other tab, and the results load directly.
If I open a new session and paste the URL, I am redirected to the search page.
I went through the docs and I am not sure whether I have to deal with the 302 redirect or with the ASP.NET session cookie. Any help would be appreciated.

You don't have to deal with the 302; Scrapy handles it itself.
You can debug cookies by setting COOKIES_DEBUG = True in settings.
Did you check which other params are sent in the POST or GET request when you search from the browser? You have to pass them all in the form data.
I suggest you use FormRequest.from_response, like:
return [FormRequest.from_response(
    response,
    formdata={'ctl00$phContent$ucUnifiedSearch$txtIndvl': '2441386'},
    callback=self.after_search)]
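FormRequest.from_response matters for ASP.NET WebForms pages because the form carries hidden state fields (__VIEWSTATE, __EVENTVALIDATION and usually an __EVENTTARGET for the control that triggered the postback), and from_response copies them from the page automatically. A rough sketch of a complete parse method, where the btnSearch field name is a guess and has to be taken from the real form:

def parse(self, response):
    # the original body tweak from the question, kept as-is
    response = response.replace(body=response.body.replace("disabled", ""))
    # from_response picks up the hidden __VIEWSTATE / __EVENTVALIDATION
    # fields from the page, so they do not have to be filled in by hand
    return FormRequest.from_response(
        response,
        formdata={
            'ctl00$phContent$ucUnifiedSearch$txtIndvl': '2441386',
            # hypothetical name of the search button; check the POST body
            # in the browser's network tab for the real field name
            'ctl00$phContent$ucUnifiedSearch$btnSearch': 'Search',
        },
        callback=self.after_search)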

Related

Scrapy to bypass an alert message with form authentication

Is it possible for Scrapy to crawl an alert message?
For example, when the link http://domainhere/admin is loaded in an actual browser, an alert with a form appears asking to fill in the username and password.
Or is there a way to inspect the form in that alert to know which parameters have to be filled in?
PS: I do have credentials for this website; I just want to automate processes through web crawling.
Thanks.
What I did to achieve this was the following:
Observed what data is needed after authentication to proceed to the page.
Using Chrome's developer tools, in the Network tab, I checked the Request Headers. Upon observation, an Authorization header is needed.
To verify step #2, I used Postman. Using Postman's Authorization feature with the Basic Auth type, filling in the username and password generates the same value for the Authorization header. After sending a POST request, it loaded the desired page and bypassed the authentication.
Having the same value for Authorization under the Request Headers, store the value in the spider class.
Use scrapy.Request with the headers parameter.
Code:
import scrapy

class TestScraper(scrapy.Spider):
    handle_httpstatus_list = [401]
    name = "Test"
    allowed_domains = ["xxx.xx.xx"]
    start_urls = ["http://testdomain/test"]
    auth = "Basic [Key Here]"

    def parse(self, response):
        return scrapy.Request(
            "http://testdomain/test",
            headers={'Authorization': self.auth},
            callback=self.after_login
        )

    def after_login(self, response):
        self.log(response.body)
Now you can crawl the page after the authentication process.
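As a side note, the "Basic [Key Here]" value is just the Base64 encoding of username:password, so it can be built in the spider instead of copied from Postman; a minimal sketch with placeholder credentials:

import base64

# placeholder credentials; never commit real ones
username = "myuser"
password = "mypassword"
token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
auth = "Basic " + token.decode("ascii")
# auth can then be passed as headers={'Authorization': auth}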

Persist Session Cookie in Scrapy

I am scraping a site that has an accept-terms form that I need to click through. When I click the button, I am redirected to the resource that needs to be scraped. I have the basic mechanics working: the initial click-through works, I get a session, and all goes well until the session times out. Then, for some reason, Scrapy does get redirected, but the response URL doesn't get updated, so I get duplicate items, since I am using the URL to check for duplication.
For example the URL I am requesting is:
https://some-internal-web-page/Records/Details/119ce2b7-35b4-4c63-8bd2-2bfbf77299a8
But when the session expires I get:
https://some-internal-web-page/?returnUrl=%2FRecords%2FDetails%2F119ce2b7-35b4-4c63-8bd2-2bfbf77299a8
Here is my code:
# function to get through the accept dialog
def parse(self, response):
    yield FormRequest.from_response(response, formdata={"value": "Accept"},
                                    callback=self.after_accept)

# function to parse markup
def after_accept(self, response):
    global latest_inspection_date
    urls = ['http://some-internal-web-page/Records?SearchText=&SortMode=MostRecentlyHired&page=%s&PageSize=25'
            % page for page in xrange(1, 500)]
    for u in urls:
        yield Request(u, callback=self.parse_list)
So my question is: how do I persist and/or refresh the session cookie so that I get the URL I need instead of the redirect URL?
Cookies are enabled by default and passed along through every callback; make sure they are enabled with COOKIES_ENABLED = True in settings.py.
You can also enable debugging logs for them with COOKIES_DEBUG = True (False by default) and check whether the cookies are being passed correctly; your problem may be about something else.
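For reference, the two settings mentioned above would look like this in settings.py:

# settings.py
COOKIES_ENABLED = True   # default: cookies are kept and sent automatically
COOKIES_DEBUG = True     # log every Cookie / Set-Cookie header exchanged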

Wait until the webpage loads in Scrapy

I am using a Scrapy script to load a URL using yield.
MyUrl = "http://www.example.com"
request = Request(MyUrl, callback=self.mydetail)
yield request

def mydetail(self, response):
    item['Description'] = response.xpath(".//table[@class='list']//text()").extract()
    return item
The URL seems to take minimum 5 seconds to load. So I want Scrapy to wait for some time to load the entire text in item['Description'].
I tried "DOWNLOAD_DELAY" in settings.py but no use.
Take a brief look with Firebug or another tool that captures the responses to the Ajax requests made by the JavaScript code. You can make a chain of requests to catch those Ajax requests which appear after the page has loaded. There are several related questions: parse ajax content, retrieve final page, parse dynamic content.
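If the description really is filled in by a later Ajax call, one option is to request that endpoint directly once you have found it in the network tab; a rough sketch, where the /ajax/details URL and the id parameter are purely hypothetical:

from scrapy.http import Request

def mydetail(self, response):
    # hypothetical Ajax endpoint spotted in the browser's network tab
    ajax_url = "http://www.example.com/ajax/details?id=123"
    yield Request(ajax_url, callback=self.parse_ajax)

def parse_ajax(self, response):
    item = {}
    item['Description'] = response.xpath(".//table[@class='list']//text()").extract()
    yield item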

Using Scrapy with authenticated (logged in) user session

In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy:
class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # continue scraping with authenticated session...
I've got that working, and it's fine. But my question is: what do you have to do to continue scraping with an authenticated session, as the comment on the last line says?
In the code above, the FormRequest that is being used to authenticate has the after_login function set as its callback. This means that the after_login function will be called and passed the page that the login attempt got as a response.
It is then checking that you are successfully logged in by searching the page for a specific string, in this case "authentication failed". If it finds it, the spider ends.
Now, once the spider has got this far, it knows that it has successfully authenticated, and you can start spawning new requests and/or scrape data. So, in this case:
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# ...

def after_login(self, response):
    # check login succeed before going on
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    # We've successfully authenticated, let's have some fun!
    else:
        return Request(url="http://www.example.com/tastypage/",
                       callback=self.parse_tastypage)

def parse_tastypage(self, response):
    hxs = HtmlXPathSelector(response)
    yum = hxs.select('//img')
    # etc.
If you look here, there's an example of a spider that authenticates before scraping.
In this case, it handles things in the parse function (the default callback of any request).
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select("//form[@id='UsernameLoginForm_LoginForm']"):
        return self.login(response)
    else:
        return self.get_section_links(response)
So, whenever a request is made, the response is checked for the presence of the login form. If it is there, then we know that we need to log in, so we call the relevant function; if it's not present, we call the function that is responsible for scraping the data from the response.
I hope this is clear, feel free to ask if you have any other questions!
Edit:
Okay, so you want to do more than just spawn a single request and scrape it. You want to follow links.
To do that, all you need to do is scrape the relevant links from the page, and spawn requests using those URLs. For example:
def parse_page(self, response):
    """ Scrape useful stuff from page, and spawn new requests
    """
    hxs = HtmlXPathSelector(response)
    images = hxs.select('//img')
    # .. do something with them
    links = hxs.select('//a/@href').extract()
    # Yield a new request for each link we found
    for link in links:
        yield Request(url=link, callback=self.parse_page)
As you can see, it spawns a new request for every URL on the page, and each one of those requests will call this same function with their response, so we have some recursive scraping going on.
What I've written above is just an example. If you want to "crawl" pages, you should look into CrawlSpider rather than doing things manually.
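As a rough sketch of what the CrawlSpider variant could look like (import paths are for current Scrapy versions, and the domain and URLs are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'
    allowed_domains = ['example.com']           # placeholder
    start_urls = ['http://www.example.com/']    # placeholder

    rules = (
        # follow every link found and hand each response to parse_page;
        # CrawlSpider reserves parse() for itself, so use another callback name
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for src in response.xpath('//img/@src').extract():
            yield {'image_src': src}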

Scrapy - how to manage cookies/sessions

I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and the subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests, otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually and passing it from one request to the next.
I suppose that would mean disabling cookies, then grabbing the session cookie from the search response and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):

    def parse(self, response):
        '''Parse category page, extract subcategories links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})
            '''Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            This is needed because the site uses sessions to remember the search parameters.'''

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )
Scrapy has a downloader middleware, CookiesMiddleware, implemented to support cookies; you just need to enable it. It mimics how the cookie jar in a browser works.
When a request goes through CookiesMiddleware, it reads the cookies for this domain and sets them on the Cookie header.
When a response returns, CookiesMiddleware reads the cookies sent from the server in the Set-Cookie response header and saves/merges them into the cookiejar on the middleware.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so separate spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read the Cookie header from the cookiejar and should not merge the Set-Cookie from the response into the cookiejar. It is a request-level switch.
CookiesMiddleware supports multiple cookiejars. You have to control which cookiejar to use at the request level: Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (received in the constructor), in order to reuse the cookie management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
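A sketch of what receiving that argument inside the spider could look like (the search URL pattern is a placeholder):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, search_query=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # the value passed with -a search_query=something ends up here
        self.search_query = search_query
        # placeholder URL pattern for the site's search page
        self.start_urls = ['http://www.example.com/search?q=%s' % search_query]

    def parse(self, response):
        # each run started with scrapy crawl has its own cookiejar, so the
        # sessions of different search queries never mix
        pass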
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies and rotate profiles on demand
scrapy-dynamic-sessions is almost the same, but allows you to randomly pick the proxy and User-Agent and handles request retries due to any errors
