I am scraping a site that has an accept-terms form that I need to click through. When I click the button, I am redirected to the resource that needs to be scraped. The basic mechanics work: the initial click-through succeeds, I get a session, and all goes well until the session times out. Then, for some reason, Scrapy does get redirected but the response URL doesn't get updated, so I get duplicate items, since I use the URL to check for duplication.
For example the URL I am requesting is:
https://some-internal-web-page/Records/Details/119ce2b7-35b4-4c63-8bd2-2bfbf77299a8
But when the session expires I get:
https://some-internal-web-page/?returnUrl=%2FRecords%2FDetails%2F119ce2b7-35b4-4c63-8bd2-2bfbf77299a8
Here is my code:
# function to get through accept dialog
def parse(self, response):
    yield FormRequest.from_response(response, formdata={"value":"Accept"}, callback=self.after_accept)

# function to parse markup
def after_accept(self, response):
    global latest_inspection_date
    urls = ['http://some-internal-web-page/Records?SearchText=&SortMode=MostRecentlyHired&page=%s&PageSize=25' % page for page in xrange(1,500)]
    for u in urls:
        yield Request(u, callback=self.parse_list)
So my question is: how do I persist and/or refresh the session cookie so that I don't get the redirect URL instead of the URL I need?
Cookies are enabled by default and passed along through every callback. Make sure you have not disabled them; COOKIES_ENABLED = True in settings.py turns them on explicitly.
You can also enable debug logging for cookies with COOKIES_DEBUG = True (False by default) and check whether the cookies are being passed correctly. The problem may turn out to be something else.
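If an expired session still slips through occasionally, one way to keep the duplicate check stable is to normalise the URL before comparing it. The following is a minimal sketch in Python 3, assuming the site always carries the original path in a returnUrl query parameter as in the example in the question; canonical_url is a hypothetical helper name and not part of Scrapy.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def canonical_url(url):
    """Return the originally requested URL if `url` is an expired-session redirect.

    Assumption: the redirect target carries the original path in a
    `returnUrl` query parameter, as in the example in the question.
    """
    parts = urlsplit(url)
    return_urls = parse_qs(parts.query).get("returnUrl")
    if not return_urls:
        return url  # not a redirect to the accept-terms page
    # parse_qs has already percent-decoded the value, e.g. "/Records/Details/<id>"
    return urljoin(f"{parts.scheme}://{parts.netloc}/", return_urls[0])
```

A response URL that contains returnUrl is also a hint that the session has expired, so the spider could resubmit the accept form at that point and then retry the request.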
Is it possible for Scrapy to crawl an alert message?
For example, when http://domainhere/admin is loaded in an actual browser, an alert dialog with a form appears, asking for a username and password.
Or is there a way to inspect the form in the alert to find out which parameters need to be filled in?
PS: I do have credentials for this website, I just want to automate processes through web crawling.
Thanks.
What I did to achieve this was the following:
1. Observe what data the page needs after authentication in order to proceed.
2. Using the Network tab of Chrome's developer tools, check the Request Headers. This shows that an Authorization header is needed.
3. To verify step #2, I used Postman. Choosing the Basic Auth type under Authorization in Postman and filling in the username and password generates the same value for the Authorization header. After sending a POST request, it loaded the desired page and bypassed the authentication.
4. Since the value matches the Authorization value under Request Headers, store it in the scraper class.
5. Use scrapy.Request with the headers parameter.
Code:
import scrapy

class TestScraper(scrapy.Spider):
    handle_httpstatus_list = [401]
    name = "Test"
    allowed_domains = ["xxx.xx.xx"]
    start_urls = ["http://testdomain/test"]
    auth = "Basic [Key Here]"

    def parse(self, response):
        return scrapy.Request(
            "http://testdomain/test",
            headers={'Authorization': self.auth},
            callback=self.after_login
        )

    def after_login(self, response):
        self.log(response.body)
Now you can crawl the page after the authentication step.
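The Authorization value that Postman generates for Basic Auth is just "Basic " followed by the Base64 encoding of username:password, so it can also be built in code instead of being copied by hand. A small sketch in Python 3; basic_auth_header is a hypothetical helper name and the credentials are placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


# Placeholder credentials; the result can be assigned to the spider's `auth` attribute.
auth = basic_auth_header("user", "pass")
```

Keep in mind that Base64 is an encoding, not encryption, so the header should only be sent over HTTPS.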
I'm using a Scrapy spider that authenticates with a login form upon launching. It then scrapes with this authenticated session.
During development I usually run the spider many times to test it out. Authenticating at the beginning of each run spams the login form of the website. The website will often force a password reset in response and I suspect it will ban the account if this continues.
Because the cookies last a number of hours, there's no good reason to log in this often during development. To get around the password reset problem, what would be the best way to re-use an authenticated session/cookies between runs while developing? Ideally the spider would only attempt to authenticate if the persisted session has expired.
Edit:
My structure is like:
def start_requests(self):
    yield scrapy.Request(self.base, callback=self.log_in)

def log_in(self, response):
    #response.headers includes 'Set-Cookie': 'JSESSIONID=xx'; Path=/cas/; Secure; HttpOnly'
    yield scrapy.FormRequest.from_response(response,
                                           formdata={'username': 'xxx',
                                                     'password':''},
                                           callback=self.logged_in)

def logged_in(self, response):
    #request.headers and subsequent requests all have headers fields 'Cookie': 'JSESSIONID=xxx';
    #response.headers has no mention of cookies
    #request.cookies is empty
    pass
When I run the same page request in Chrome, under the 'Cookies' tab there are ~20 fields listed.
The documentation seems thin here. I've tried setting a field 'Cookie': 'JSESSIONID=xxx' on the headers dict of all outgoing requests, based on the values returned by a successful login, but this bounces back to the login screen.
Turns out that for an ad-hoc development solution, this is easier to do than I thought. Get the cookie string with cookieString = request.headers['Cookie'], save, then on subsequent outgoing requests load it up and do:
request.headers.appendlist('Cookie', cookieString)
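The "save" and "load it up" steps can be as simple as a text file. Below is a minimal sketch in Python 3, assuming the cookie string has already been decoded to text and that the session lifetime is roughly known; the function names, the file path and the maximum age are all hypothetical choices.

```python
import os
import time


def save_cookie_string(path, cookie_string):
    """Write the Cookie header value to `path` for reuse in a later run."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(cookie_string)


def load_cookie_string(path, max_age_seconds):
    """Return the saved cookie string, or None if it is missing or too old."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return None  # nothing has been saved yet
    if age > max_age_seconds:
        return None  # assume the session has expired and a fresh login is needed
    with open(path, encoding="utf-8") as fh:
        return fh.read().strip() or None
```

When load_cookie_string returns None, the spider would go through the login form as before and save the new cookie string afterwards; otherwise it can skip the login and attach the saved string to its requests.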
I am trying to crawl a web application written in ASP.NET.
I am trying to execute a search and crawl the search results page. Let's say that the search page is http://search.site.com/search/search.aspx
My crawler is pretty straightforward:
class SitesearchSpider(Spider):
    name = 'sitecrawl'
    allowed_domains = ['search.site.org']
    start_urls = [
        "http://search.site.org/Search/Search.aspx"
    ]

    def parse(self, response):
        self.log("Calling Parse Method", level=log.INFO)
        response = response.replace(body=response.body.replace("disabled",""))
        return [FormRequest(
            url="http://search.site.org/Search/Search.aspx",
            formdata={'ctl00$phContent$ucUnifiedSearch$txtIndvl': '2441386'},
            callback=self.after_search)]

    def after_search(self, response):
        self.log("In after search", level=log.INFO)
        if "To begin your search" in response.body:
            self.log("unable to get result")
        else:
            self.log(response.body)
But no matter what I do, the after_search callback only ever receives the same page (search.aspx) instead of the expected searchresults.aspx with the results.
This is what seems to happen in the browser:
1. A search term is entered in one of the fields.
2. The search button is clicked.
3. On form submit to the same page (search.aspx), I see that it returns a 302 redirect to the search results page.
4. The search results page then displays.
I can see that the ASP.NET session cookie is being used here, because once a search has been made, I can take the search results page URL, something like
http://search.site.com/search/searchresults.aspx?key=searchkey&anothersearchparam=12
and open it in any tab, and the results load directly.
If I open a new session and paste the URL, I am redirected to the search page.
I went through the docs and I am not sure whether I have to deal with the 302 or with the ASP.NET session cookie. Any help would be appreciated.
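You don't have to deal with the 302; Scrapy follows redirects by itself.
To debug cookies, set COOKIES_DEBUG = True in your settings.
Did you check which other parameters are sent in the POST or GET request when you search from the browser? You have to pass all of them in the form data.
I suggest using from_response, which pre-fills the form fields found in the response (including hidden ones such as __VIEWSTATE on ASP.NET pages), like: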
return [FormRequest.from_response(
    response,
    formdata={'ctl00$phContent$ucUnifiedSearch$txtIndvl': '2441386'},
    callback=self.after_search)]
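To see which extra parameters the browser sends along with the search term, it can help to list the hidden inputs of the search page. The sketch below uses only the Python 3 standard library and is meant as an inspection aid, not as a replacement for from_response; HiddenInputCollector and hidden_fields are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in `html`."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Made-up sample page with one hidden field and one visible field.
sample = '<form><input type="hidden" name="__VIEWSTATE" value="abc"/><input type="text" name="q"></form>'
print(hidden_fields(sample))
```

Any field that shows up here but is missing from the formdata of a hand-built FormRequest is a likely reason for the server returning the search page again.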
Is it possible to sign into a website (and allow that website to be visited with any browser without the user having to sign in) via a background process without user interaction and allow the user to browse the site without logging in from any browser?
I'd guess that I would need to register the created session with each web browser on the user's system, but is there any other (possibly simpler) way of doing this?
Think of it like automatically signing into Gmail in the background and being able to browse it without ever seeing a login page.
Yes, it is possible. I suggest two ways to solve your problem. Both of them use HTTP requests, so it is worth reading up on how HTTP requests work.
1) The easiest way, and the recommended one if you only need to log in: Requests (HTTP for Humans).
2) Python Scrapy, although Scrapy is meant for crawling and screen scraping.
Check this login spider example:
class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # continue scraping with authenticated session...
more info here
There is a library in Python called urllib2 which will let you do what you need. Look in the Python docs or here:
http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers
or this:
http://www.doughellmann.com/PyMOTW/urllib2/
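In Python 3, urllib2 was split into urllib.request and urllib.error. A minimal sketch of the opener-and-handler approach with a cookie jar follows; the login URL and field names in the comment are placeholders, and build_session_opener and encode_form are hypothetical helper names. It only gives the background process its own session and does not address sharing that session with a browser.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener():
    """Return an opener that remembers cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """URL-encode form fields as the bytes body of a POST request."""
    return urllib.parse.urlencode(fields).encode("ascii")


opener, jar = build_session_opener()
# Placeholder usage (not executed here): posting the login form stores the
# session cookie in `jar`, and later opener.open() calls send it back.
# opener.open("http://www.example.com/users/login.php",
#             data=encode_form({"username": "john", "password": "secret"}))
```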
I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and its subsequent child requests, needs to have its own session, so it needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests. Otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually and passing it from one request to the other.
I suppose that would mean disabling cookies, then grabbing the session cookie from the search response and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):
    def parse(self, response):
        '''Parse category page, extract subcategories links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            # Use dont_merge_cookies to force site generate new PHPSESSID cookie.
            # This is needed because the site uses sessions to remember the search parameters.
            yield Request(subcategorySearchLink, callback = self.extractItemLinks,
                          meta = {'dont_merge_cookies': True})

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback = self.extractItemLinks,
                              meta = {'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request) # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url= "http://new-page-to-parse.com/page/4/",
        cookies= {
            'h0':'blah',
            'taeyeon':'pretty'
        },
        callback= self.parse
    )
Scrapy implements cookie support in a downloader middleware, CookiesMiddleware. It is enabled by default (COOKIES_ENABLED), so you only need to make sure it has not been turned off. It mimics how the cookie jar in a browser works.
When a request goes through CookiesMiddleware, the middleware reads the cookies stored for the request's domain and sets them in the Cookie header.
When a response returns, CookiesMiddleware reads the cookies the server sent in the Set-Cookie response header and saves or merges them into the cookiejar held by the middleware.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so spiders have separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read cookies from the cookiejar and that Set-Cookie headers from its response should not be merged into the cookiejar. It is a request-level switch.
CookiesMiddleware supports multiple cookiejars. You control which cookiejar to use at the request level: Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
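The behavior described above can be pictured with a toy model. This is not Scrapy's implementation: it ignores domains, paths and expiry, and ToyCookieStore with its method names is entirely hypothetical. It only illustrates how a jar key separates sessions and how a request-level dont_merge switch bypasses the jar.

```python
from collections import defaultdict
from http.cookies import SimpleCookie


class ToyCookieStore:
    """Toy model: one name -> value dict per cookiejar key."""

    def __init__(self):
        self.jars = defaultdict(dict)

    def process_response(self, jar_key, set_cookie_headers, dont_merge=False):
        """Merge Set-Cookie header values into the jar chosen by `jar_key`."""
        if dont_merge:
            return  # request-level switch: leave the jar untouched
        for header in set_cookie_headers:
            for name, morsel in SimpleCookie(header).items():
                self.jars[jar_key][name] = morsel.value

    def cookie_header(self, jar_key, dont_merge=False):
        """Build the Cookie header value for a request that uses `jar_key`."""
        if dont_merge:
            return ""  # the request does not read from the jar either
        return "; ".join(f"{k}={v}" for k, v in self.jars[jar_key].items())


store = ToyCookieStore()
store.process_response(0, ["PHPSESSID=aaa; Path=/"])  # first search session
store.process_response(1, ["PHPSESSID=bbb; Path=/"])  # second search session
```

Each search keeps its own PHPSESSID as long as follow-up requests carry the same jar key, which is why the key has to be passed along explicitly on every subsequent request.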
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions lets you attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies, and rotate profiles on demand.
scrapy-dynamic-sessions is almost the same, but lets you pick the proxy and User-Agent at random and handles retrying requests that failed with any error.