In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy:
class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        # continue scraping with authenticated session...
I've got that working, and it's fine. But my question is: What do you have to do to continue scraping with authenticated session, as they say in the last line's comment?
In the code above, the FormRequest that is being used to authenticate has the after_login function set as its callback. This means that the after_login function will be called and passed the page that the login attempt got as a response.
It then checks whether the login succeeded by searching the page for a specific string, in this case "authentication failed". If the string is found, the login failed and the callback returns without issuing any further requests, so the spider ends.
Now, once the spider has got this far, it knows that it has successfully authenticated, and you can start spawning new requests and/or scrape data. So, in this case:
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# ...

def after_login(self, response):
    # check login succeed before going on
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    # We've successfully authenticated, let's have some fun!
    else:
        return Request(url="http://www.example.com/tastypage/",
                       callback=self.parse_tastypage)

def parse_tastypage(self, response):
    hxs = HtmlXPathSelector(response)
    yum = hxs.select('//img')
    # etc.
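One caveat with the login check above: on Python 3, Scrapy's response.body is a bytes object, so testing a str against it with `in` raises a TypeError. Below is a minimal sketch of a bytes-safe check. The helper name login_failed and the sample bodies are made up for illustration; in a real spider the bytes would come from response.body.

```python
def login_failed(body: bytes, marker: str = "authentication failed",
                 encoding: str = "utf-8") -> bool:
    """Return True if the failure marker occurs in the raw response body."""
    # Encode the marker so both operands of `in` are bytes.
    return marker.encode(encoding) in body


# Stand-ins for response.body after a failed and a successful login.
failed_body = b"<html><p>authentication failed</p></html>"
ok_body = b"<html><p>Welcome back, john</p></html>"

print(login_failed(failed_body))
print(login_failed(ok_body))
```

For HTML responses, checking the decoded response.text instead of response.body should work just as well.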
If you look here, there's an example of a spider that authenticates before scraping.
In this case, it handles things in the parse function (the default callback of any request).
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select("//form[@id='UsernameLoginForm_LoginForm']"):
        return self.login(response)
    else:
        return self.get_section_links(response)
So, whenever a request is made, the response is checked for the presence of the login form. If the form is there, we know that we need to log in, so we call the relevant function. If it is not present, we call the function that is responsible for scraping the data from the response.
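To make the "is the login form present?" test concrete outside of Scrapy, here is a rough standard-library sketch. FormFinder and needs_login are hypothetical names, and the sample HTML is invented; inside a spider the XPath check above does the same job.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Records whether a <form> with the given id appears in the markup."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "form" and dict(attrs).get("id") == self.form_id:
            self.found = True


def needs_login(html, form_id="UsernameLoginForm_LoginForm"):
    """Return True if the page still shows the login form."""
    finder = FormFinder(form_id)
    finder.feed(html)
    return finder.found


login_page = '<html><form id="UsernameLoginForm_LoginForm"></form></html>'
data_page = '<html><form id="search"></form><a href="/section/1">One</a></html>'

print(needs_login(login_page))
print(needs_login(data_page))
```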
I hope this is clear, feel free to ask if you have any other questions!
Edit:
Okay, so you want to do more than just spawn a single request and scrape it. You want to follow links.
To do that, all you need to do is scrape the relevant links from the page, and spawn requests using those URLs. For example:
def parse_page(self, response):
    """ Scrape useful stuff from page, and spawn new requests
    """
    hxs = HtmlXPathSelector(response)
    images = hxs.select('//img')
    # .. do something with them
    # extract() turns the matched href attributes into plain strings
    links = hxs.select('//a/@href').extract()

    # Yield a new request for each link we found
    for link in links:
        yield Request(url=link, callback=self.parse_page)
As you can see, it spawns a new request for every URL on the page, and each one of those requests will call this same function with their response, so we have some recursive scraping going on.
What I've written above is just an example. If you want to "crawl" pages, you should look into CrawlSpider rather than doing things manually.
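When following links by hand, the href values found on a page are often relative, repeated, or not HTTP links at all, and a Request needs an absolute URL. The sketch below shows one way to clean them up before yielding requests. The function absolute_links and its allowed_host parameter are illustrative assumptions, not part of Scrapy.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(page_url, hrefs, allowed_host=None):
    """Resolve hrefs against page_url, keeping unique http(s) links only."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(page_url, href)
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if allowed_host and parts.netloc != allowed_host:
            continue  # stay on the site we logged in to
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


links = absolute_links(
    "http://www.example.com/tastypage/",
    ["more.html", "/about", "mailto:john@example.com",
     "http://other.example.org/", "more.html"],
    allowed_host="www.example.com",
)
print(links)
```

CrawlSpider rules cover this kind of link handling, which is part of why they are the better choice for real crawls.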
Related
First let me say that I'm a novice at Scrapy!
I have a website that requires a login prior to being able to scrape any data with Scrapy. The data that I will be scraping is generated by JavaScript once logged in.
I have successfully been able to log in using Scrapy. My question is: now that I have logged in and have the necessary cookies to continue making requests to the website, how do I transfer those cookies to Splash when invoking a SplashRequest on the report page that I want to scrape with Scrapy? The documentation that I've read is difficult for me to understand and seems way too generic. I've looked for examples but have come up blank.
Is my thought process wrong that I should log in with Scrapy then pass the cookies to Splash or should I just be doing this all through Splash? If so, how do I pass username and password variables in Splash?
Here is my Scrapy code
import scrapy
from scrapy.http import FormRequest
from scrapy_splash import SplashRequest

class mySpider(scrapy.Spider):
    login_url = 'https://example.com/'
    name = 'reports'
    start_urls = [
        login_url
    ]

    def parse(self, response):
        return FormRequest.from_response(response,formdata={
            'username': 'XXXXXX',
            'password': 'YYYYYY'
        },callback = self.start_requests)

    def start_requests(self):
        url = 'https://example.com/reports'
        yield SplashRequest(url=url, callback=self.start_scraping)

    def start_scraping(self, response):
        labels = response.css('label::text').extract()
        yield {'labeltext': labels}
This is simplified for the moment just to return random labels so that I know I'm logged in and Scrapy is seeing the report. What is happening is that it logs in, but of course once I invoke Splash to render the JavaScript report, Splash is redirected to the login page rather than going to the example.com/reports page. Any help or pointers in the right direction would be MUCH appreciated.
TIA
OK, as usual, after spending hours searching and several more experimenting, I found the answer and am now behind the login, using Scrapy to scrape data from a JS-created table. Also as usual, I was overcomplicating things.
Below is my code that is based on the above and simplistically logs in using Splash and then starts scraping.
This utilizes the tool SplashFormRequest rather than Scrapy's FormRequest to login using Splash.
import scrapy
from scrapy_splash import SplashFormRequest
from ..items import UnanetTestItem

class MySpider(scrapy.Spider):
    login_url = 'https://example.com'
    name = 'Example'
    start_urls = [
        login_url
    ]

    def parse(self, response):
        return SplashFormRequest.from_response(
            response,
            formdata={
                'username': 'username',
                'password': 'password'
            },
            callback = self.start_scraping)

    def start_scraping(self, response):
        # whatever you want to do from here.
        pass
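The answer does not show the project settings, and SplashFormRequest only works once scrapy-splash is enabled. The fragment below follows the settings listed in the scrapy-splash README as far as I recall them, so treat it as a starting point and check it against the README for your version. The SPLASH_URL value is an assumption (a Splash instance running locally on port 8050).

```python
# settings.py (fragment)

# Assumption: Splash is running locally, e.g. in Docker, on port 8050.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    # Cookie handling for requests that go through Splash.
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```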
I'm trying to use a Scrapy Spider to solve a problem (a programming question from HackThisSite):
(1) I have to log in to a website, giving a username and a password (already done)
(2) After that, I have to access an image at a given URL (the image is only accessible to logged-in users)
(3) Then, without saving the image to the hard disk, I have to read its contents into some kind of buffer
(4) The result of processing it will be used to fill in a form and send the data to the website's server (I already know how to do this step)
So, I can summarize the question as: is it possible (using a spider) to read an image that is accessible only to logged-in users and process it in the spider code?
I researched different methods. Using item pipelines is not a good approach, because I don't want to download the file.
The code that I already have is:
from scrapy import Spider, Request
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser

class ProgrammingQuestion2(Spider):
    name = 'p2'
    start_urls = ['https://www.hackthissite.org/']

    def parse(self, response):
        formdata_hts = {'username': <MY_USER_NAME>,
                        'password': <MY_PASSWORD>,
                        'btn_submit': 'Login'}
        return FormRequest.from_response(response,
            formdata=formdata_hts, callback=self.redirect_to_page)

    def redirect_to_page(self, response):
        yield Request(url='https://www.hackthissite.org/missions/prog/2/',
                      callback=self.solve_question_2)

    def solve_question_2(self, response):
        open_in_browser(response)
        img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
        # What can I do here?
I expect to solve this problem using Scrapy functions, otherwise it would be necessary to log in the website (sending the form data) again.
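You can make a Scrapy request for the image itself and handle the response in a separate callback. The request goes through the same authenticated session, and the image bytes are available as response.body without anything being written to disk: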
def parse_page(self, response):
    img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
    yield Request(img_url, callback=self.parse_image)

def parse_image(self, response):
    image_bytes = response.body
    form_data = form_from_image(image_bytes)
    # make form request
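Since response.body is just a bytes object, the image can be inspected entirely in memory. As a small illustration (not the solution to the mission itself), the sketch below reads the width and height from a PNG header with the standard library. png_dimensions is a hypothetical helper, and the sample bytes are a hand-built header standing in for response.body.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(image_bytes):
    """Return (width, height) read from the IHDR chunk of a PNG in memory."""
    if image_bytes[:8] != PNG_SIGNATURE or image_bytes[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # Width and height are two big-endian 32-bit integers after 'IHDR'.
    width, height = struct.unpack(">II", image_bytes[16:24])
    return width, height


# Minimal PNG header built in memory: signature, chunk length, 'IHDR',
# then width=100, height=30, and five one-byte fields.
sample = (PNG_SIGNATURE
          + struct.pack(">I", 13)
          + b"IHDR"
          + struct.pack(">IIBBBBB", 100, 30, 8, 2, 0, 0, 0))

print(png_dimensions(sample))
```

For actual pixel access an imaging library would be the usual choice, fed from an in-memory buffer such as io.BytesIO(response.body).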
I am scraping a site that has an accept terms form that I need to click through. When I click the button I am redirected to the resource that needs to be scraped. I have the basic mechanics working, that is the initial click through works and I get a session and all goes well until the session times out. Then for some reason Scrapy does get redirected but the response URL doesn't get updated so I get duplicate items since I am using the URL to check for duplication.
For example the URL I am requesting is:
https://some-internal-web-page/Records/Details/119ce2b7-35b4-4c63-8bd2-2bfbf77299a8
But when the session expires I get:
https://some-internal-web-page/?returnUrl=%2FRecords%2FDetails%2F119ce2b7-35b4-4c63-8bd2-2bfbf77299a8
Here is my code:
# function to get through accept dialog
def parse(self, response):
    yield FormRequest.from_response(response, formdata={"value":"Accept"}, callback=self.after_accept)

# function to parse markup
def after_accept(self, response):
    global latest_inspection_date
    urls = ['http://some-internal-web-page/Records?SearchText=&SortMode=MostRecentlyHired&page=%s&PageSize=25' % page for page in xrange(1,500)]
    for u in urls:
        yield Request( u, callback=self.parse_list )
So my question is: how do I persist and/or refresh the session cookie so that I don't get the redirect URL instead of the URL I need?
Cookies are enabled by default and are carried across requests and callbacks automatically. Make sure you have not disabled them: COOKIES_ENABLED should be True (the default) in settings.py.
You can also enable cookie debug logging with COOKIES_DEBUG=True (False by default) and check whether the cookies are being sent and received correctly. If they are, your problem is probably caused by something else.
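If the session really does expire mid-crawl, one option is to recognise the landing page from its URL, resubmit the accept form, and then request the original page again. The sketch below covers only the detection step. It assumes the site always signals expiry with a returnUrl query parameter, as in the example URLs in the question; original_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def original_url(response_url, param="returnUrl"):
    """Return the URL that was originally requested, or None if not redirected."""
    query = parse_qs(urlparse(response_url).query)
    values = query.get(param)
    if not values:
        return None  # normal page, the session is still valid
    # parse_qs has already percent-decoded the path; make it absolute.
    return urljoin(response_url, values[0])


expired = ("https://some-internal-web-page/?returnUrl="
           "%2FRecords%2FDetails%2F119ce2b7-35b4-4c63-8bd2-2bfbf77299a8")
normal = ("https://some-internal-web-page/Records/Details/"
          "119ce2b7-35b4-4c63-8bd2-2bfbf77299a8")

print(original_url(expired))
print(original_url(normal))
```

Using the recovered URL rather than response.url as the duplicate key would also avoid the duplicate items described in the question.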
I'm trying to scrape a site with Scrapy. The spider manages to log in (I can download the profile page, etc.). However, when the spider tries to access a page with data, it gets a response with status 302 and is redirected to the main page of the site.
When I do this manually in a browser, everything works: the browser gets a 200 response and the page with data.
What I've done so far:
- analysed the request headers with Firebug and reproduced them exactly in the spider
- analysed the cookies. Some of them were missing in the spider because they are generated by JS, so I added them manually. The cookies are now identical.
- by deleting cookies one by one, found the one cookie that is responsible for the session login. When I delete it via Firebug, I get the same server behaviour that the spider faces: 302 and a redirect to the main page. The funny thing is that this cookie is received and sent by the spider perfectly, according to the debug info
- the IP is the same (running from the same machine, spider under Vagrant)
NEED HELP/ADVICE: What am I missing? The requests are very similar, but the server response is different. What could be the reason?
Thanks in advance
Code is
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy_test.items import Website
from scrapy.http import Request
import scrapy
import json
import time


class DmozSpider(Spider):
    name = "mims_clean"
    custom_settings = {"DOWNLOADER_MIDDLEWARES": {'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400}}
    allowed_domains = []
    start_urls = [
        "https://sso.mims.com"
    ]

    def parse(self, response):
        self.logger.info("started")
        SessionId = response.xpath('//input[@id="SessionId"]/@value').extract()[0]
        self.logger.info("SessionId is " + SessionId)
        print("initial")
        return scrapy.FormRequest.from_response(
            response,
            formdata={'SignInEmailAddress': '26rqtdywe6tdg@gmail.com',
                      'SignInPassword': 'pass777',
                      'SessionId' : SessionId},
            callback=self.after_login, dont_filter=True
        )

    # here everything is ok, and after login we're redirected to profile page
    # same as in browser
    def after_login(self, response):
        # check login succeed before going on
        self.logger.debug("entered after_login")
        if "error" in response.body:
            self.logger.error("Login failed")
            print('Error found')
            return
        else:
            return Request(url="https://www.mims.com/Hongkong/Browse/Alphabet/A?cat=company",
                           callback=self.after_login2, dont_filter=True)

    # we try to reach the data page
    def after_login2(self, response):
        # but got redirected to an interim tracking page
        # same as in browser
        return scrapy.FormRequest.from_response(
            response,
            callback=self.after_login3, dont_filter=True, dont_click=True
        )

    # submit a form (in browser is done via JS)
    def after_login3(self, response):
        # got to another form with prepopulated fields (same as in browser)
        request = scrapy.FormRequest.from_response(
            response,
            callback=self.after_login4, dont_filter=True, dont_click = True
        )
        return request

    # submit the form and got to the main page with all the cookies
    def after_login4(self, response):
        # from the main page try to go to data page
        request = Request(url="https://mims.com/Hongkong/Browse/Alphabet/A?cat=drug",
                          callback=self.after_login5, dont_filter=True,
                          headers={'Host':'mims.com',
                                   'Connection':'keep-alive'})
        return request

    def after_login5(self, response):
        # in the real browser here I got 200 response and catalog to scrape
        # with scrapy I got 302 and get redirected to the 1st form
        # in fact its a loop to after_login2
        # All the rest is a hopeless try to submit forms for the 2nd time :))
        return scrapy.FormRequest.from_response(
            response,
            callback=self.after_login6, dont_filter=True
        )

    def after_login6(self, response):
        return scrapy.FormRequest.from_response(
            response,
            callback=self.parse_final, dont_filter=True)

    def parse_final(self, response):
        self.logger.debug("entered parse_tastypage")
        f = open('parse_final.html', 'w')
        f.write('<base href="https://sso.mims.com">')
        f.write(response.body)
        f.close()
        return
For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:
def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return

    ...
    # Normal parsing logic
However, it seems like when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse which the CrawlSpider uses to apply the crawl rules and dont_filter=True, I could avoid the duplicate filter. However with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.
Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of doing dynamic js rendering using something like splash if possible, and this only happens intermittently.
I would think about having a custom Retry Middleware instead - similar to a built-in one.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response

        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})

        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq
And don't forget to activate it.
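Activation means listing the class in the DOWNLOADER_MIDDLEWARES setting. A minimal sketch follows. The module path myproject.middlewares is a placeholder for wherever the class lives in your project, and the order value 543 is an arbitrary choice that may need adjusting relative to other middlewares.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path: point this at your own RetryMiddleware class.
    "myproject.middlewares.RetryMiddleware": 543,
}
```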