I'm trying to use a Scrapy Spider to solve a problem (a programming question from HackThisSite):
(1) I have to log in to a website, giving a username and a password (already done)
(2) After that, I have to access an image with a given URL (the image is only accessible to logged in users)
(3) Then, without saving the image in the hard disk, I have to read its information in a kind of buffer
(4) And the result of the function will fill a form and send the data to the website server (I already know how to do this step)
So I can reduce the question to: would it be possible (using a spider) to read an image accessible only to logged-in users and process it in the spider code?
I tried to research different methods; using item pipelines is not a good approach, since I don't want to save the file to disk.
The code that I already have is:
from scrapy import Spider
from scrapy.http import FormRequest, Request
from scrapy.utils.response import open_in_browser

class ProgrammingQuestion2(Spider):
    name = 'p2'
    start_urls = ['https://www.hackthissite.org/']

    def parse(self, response):
        formdata_hts = {'username': <MY_USER_NAME>,
                        'password': <MY_PASSWORD>,
                        'btn_submit': 'Login'}
        return FormRequest.from_response(response,
            formdata=formdata_hts, callback=self.redirect_to_page)

    def redirect_to_page(self, response):
        yield Request(url='https://www.hackthissite.org/missions/prog/2/',
                      callback=self.solve_question_2)

    def solve_question_2(self, response):
        open_in_browser(response)
        img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
        # What can I do here?
I expect to solve this problem using Scrapy functions, otherwise it would be necessary to log in the website (sending the form data) again.
You can make a Scrapy request to fetch the image and handle the response in a separate callback:
def parse_page(self, response):
    img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
    yield Request(img_url, callback=self.parse_image)

def parse_image(self, response):
    image_bytes = response.body
    form_data = form_from_image(image_bytes)
    # make form request
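If the processing step needs an actual image object rather than raw bytes, the body can be wrapped in an in-memory buffer. A minimal sketch of parse_image, assuming Pillow is installed; the decoding step itself is up to you:

import io

from PIL import Image  # assumption: Pillow is available

def parse_image(self, response):
    # response.body holds the raw PNG data; keep it in memory only
    buffer = io.BytesIO(response.body)
    image = Image.open(buffer)
    # ... read image.size / image.getpixel(...) to decode the puzzle,
    # then build the form data and yield a FormRequest with the answer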
First let me say that I'm a novice at Scrapy!
I have a website that requires a login prior to being able to scrape any data with Scrapy. The data that I will be scraping is generated by JavaScript once logged in.
I have successfully been able to log in using Scrapy. My question is: now that I have logged in and have the necessary cookies to continue making requests of the website, how do I transfer those cookies to Splash when invoking a SplashRequest on the report page that I want to scrape with Scrapy? The documentation I've read is difficult for me to understand and seems far too generic. I've looked for examples but have come up blank.
Is my thought process wrong that I should log in with Scrapy and then pass the cookies to Splash, or should I just be doing this all through Splash? If so, how do I pass the username and password variables in Splash?
Here is my Scrapy code:
import scrapy
from scrapy.http import FormRequest
from scrapy_splash import SplashRequest

class mySpider(scrapy.Spider):
    login_url = 'https://example.com/'
    name = 'reports'
    start_urls = [
        login_url
    ]

    def parse(self, response):
        return FormRequest.from_response(response, formdata={
            'username': 'XXXXXX',
            'password': 'YYYYYY'
        }, callback=self.start_requests)

    def start_requests(self):
        url = 'https://example.com/reports'
        yield SplashRequest(url=url, callback=self.start_scraping)

    def start_scraping(self, response):
        labels = response.css('label::text').extract()
        yield {'labeltext': labels}
This is simplified for the moment just to return some labels so that I know I'm logged in and Scrapy is seeing the report. What is happening is that it logs in, but of course once I invoke Splash to render the JavaScript report, Splash is redirected to the login page rather than going to the example.com/reports page. Any help or pointers in the right direction would be MUCH appreciated.
TIA
OK, as usual, after spending hours searching and several more experimenting, I found the answer and am now scraping data with Scrapy from behind the login, from a JS-created table. Also as usual, I was overcomplicating things.
Below is my code, based on the above, which simply logs in using Splash and then starts scraping.
It uses SplashFormRequest rather than Scrapy's FormRequest to log in through Splash.
import scrapy
from scrapy_splash import SplashFormRequest
from ..items import UnanetTestItem

class MySpider(scrapy.Spider):
    login_url = 'https://example.com'
    name = 'Example'
    start_urls = [
        login_url
    ]

    def parse(self, response):
        return SplashFormRequest.from_response(
            response,
            formdata={
                'username': 'username',
                'password': 'password'
            },
            callback=self.start_scraping)

    def start_scraping(self, response):
        # whatever you want to do from here
        pass
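For completeness: scrapy-splash only works if its middlewares are enabled in settings.py, and it is SplashCookiesMiddleware that keeps the session cookies in sync between Scrapy and Splash. A typical configuration following the scrapy-splash README (the Splash URL assumes a local instance):

# settings.py
SPLASH_URL = 'http://localhost:8050'  # assumption: Splash running locally

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'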
I would like to avoid re-crawling data that I have already crawled before with the Scrapy framework.
To solve this, I am thinking of storing the date-time of each crawl in a DB, and skipping the crawl if the Last-Modified response header has not been updated since that date-time.
My questions are the following two:
What do you think about this approach? Is there a better idea?
Could you show me code that checks the Last-Modified response header with the Scrapy framework?
Thank you for reading my question.
Not all websites return the Last-Modified header. If you are certain yours does, you can try issuing a HEAD request first to check the headers and match them against your DB info, and then issue a GET request to crawl the data:
def parse(self, response):
    urls = []  # some urls
    for url in urls:
        yield Request(url, method='HEAD', callback=self.check)

def check(self, response):
    date = response.headers['Last-Modified']
    # compare date against your DB
    if db_date > date:  # or whatever your case is
        yield Request(response.url, callback=self.success)

def success(self, response):
    # build and yield your item here
    yield item
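Note that Last-Modified comes back as an HTTP date string (bytes in Scrapy's headers), so it has to be parsed before comparing it to the datetime stored in your DB. A rough sketch of check, where get_last_crawl_date is a hypothetical lookup into your own DB (assumed to return a timezone-aware datetime or None):

from email.utils import parsedate_to_datetime

def check(self, response):
    header = response.headers.get('Last-Modified')
    if header is None:
        # No header at all: crawl unconditionally
        yield Request(response.url, callback=self.success, dont_filter=True)
        return
    remote_date = parsedate_to_datetime(header.decode('utf-8'))
    stored_date = self.get_last_crawl_date(response.url)  # hypothetical DB lookup
    if stored_date is None or remote_date > stored_date:
        # Page changed since the last crawl; fetch it with a normal GET.
        # dont_filter=True is just defensive, in case the dupefilter
        # treats the earlier HEAD request as a duplicate of this GET.
        yield Request(response.url, callback=self.success, dont_filter=True)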
For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:
def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return

    ...
    # Normal parsing logic
However, it seems that when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse (which the CrawlSpider uses to apply the crawl rules) and dont_filter=True, I could avoid the duplicate filter. However, with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.
Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of doing dynamic js rendering using something like splash if possible, and this only happens intermittently.
I would think about having a custom retry middleware instead, similar to the built-in RetryMiddleware.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)

class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response
        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})
        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq
And don't forget to activate it.
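Activating it means registering the class in DOWNLOADER_MIDDLEWARES in settings.py; the module path here is an assumption about your project layout, and 550 is simply the order used by the built-in RetryMiddleware:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryMiddleware': 550,  # hypothetical path to the class above
}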
In the Scrapy docs, there is the following example illustrating how to use an authenticated session:
class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        # continue scraping with authenticated session...
I've got that working, and it's fine. But my question is: what do you have to do to continue scraping with an authenticated session, as the comment on the last line says?
In the code above, the FormRequest that is being used to authenticate has the after_login function set as its callback. This means that the after_login function will be called and passed the page that the login attempt got as a response.
It then checks that you have successfully logged in by searching the page for a specific string, in this case "authentication failed". If it finds the string, the spider stops.
Now, once the spider has got this far, it knows that it has successfully authenticated, and you can start spawning new requests and/or scrape data. So, in this case:
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# ...

def after_login(self, response):
    # check login succeed before going on
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    # We've successfully authenticated, let's have some fun!
    else:
        return Request(url="http://www.example.com/tastypage/",
                       callback=self.parse_tastypage)

def parse_tastypage(self, response):
    hxs = HtmlXPathSelector(response)
    yum = hxs.select('//img')

    # etc.
If you look here, there's an example of a spider that authenticates before scraping.
In this case, it handles things in the parse function (the default callback of any request).
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select("//form[@id='UsernameLoginForm_LoginForm']"):
        return self.login(response)
    else:
        return self.get_section_links(response)
So, whenever a request is made, the response is checked for the presence of the login form. If it is there, then we know that we need to log in, so we call the relevant function; if it's not present, we call the function that is responsible for scraping the data from the response.
I hope this is clear, feel free to ask if you have any other questions!
Edit:
Okay, so you want to do more than just spawn a single request and scrape it. You want to follow links.
To do that, all you need to do is scrape the relevant links from the page, and spawn requests using those URLs. For example:
def parse_page(self, response):
    """ Scrape useful stuff from page, and spawn new requests
    """
    hxs = HtmlXPathSelector(response)
    images = hxs.select('//img')
    # .. do something with them
    links = hxs.select('//a/@href')

    # Yield a new request for each link we found
    for link in links:
        yield Request(url=link, callback=self.parse_page)
As you can see, it spawns a new request for every URL on the page, and each one of those requests will call this same function with its response, so we have some recursive scraping going on.
What I've written above is just an example. If you want to "crawl" pages, you should look into CrawlSpider rather than doing things manually.
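For reference, a minimal CrawlSpider sketch of the same idea, using current Scrapy imports (the domain, start URL, and allow pattern are placeholders, and the login step is omitted):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TastySpider(CrawlSpider):
    name = 'tastypages'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/tastypage/']

    # Follow every link matching the pattern and hand each page to parse_page
    rules = (
        Rule(LinkExtractor(allow=r'/tastypage/'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        yield {'images': response.xpath('//img/@src').getall()}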
I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
This is basically a simplified version of what I'm trying to do:
The way the website works:
When you visit the website you get a session cookie.
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
My script:
My spider has a start url of searchpage_url
The searchpage is requested by parse() and the search form response gets passed to search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
Each of those FormRequests, and its subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
I assume I have to disable multiple concurrent requests... otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
I'm confused, any clarification would be greatly received!
EDIT:
Another option I've just thought of is managing the session cookie completely manually and passing it from one request to the next.
I suppose that would mean disabling cookies... and then grabbing the session cookie from the search response and passing it along to each subsequent request.
Is this what you should do in this situation?
Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):
    def parse(self, response):
        '''Parse category page, extract subcategories links.'''

        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})
            '''Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            This is needed because the site uses sessions to remember the search parameters.'''

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)

            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )
Scrapy has a downloader middleware, CookiesMiddleware, implemented to support cookies. You just need to enable it. It mimics how the cookiejar in a browser works.
When a request goes through CookiesMiddleware, it reads the cookies for that domain and sets them on the Cookie header.
When a response returns, CookiesMiddleware reads the cookies sent by the server in the Set-Cookie response header and saves/merges them into the cookiejar held by the middleware.
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?
Every spider has its own downloader middleware, so separate spiders keep separate cookiejars.
Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:
Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read the Cookie header from the cookiejar, and that the Set-Cookie header from the response should not be merged into the cookiejar. It is a request-level switch.
CookiesMiddleware supports multiple cookiejars. You control which cookiejar to use at the request level: Request.meta["cookiejar"] = custom_cookiejar_name.
Please see the docs and the related source code of CookiesMiddleware.
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
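A minimal sketch of that approach, with the search query received in the constructor (the spider name and the search URL are placeholders):

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, search_query=None, *args, **kwargs):
        super(SearchSpider, self).__init__(*args, **kwargs)
        # value passed on the command line with -a search_query=...
        self.search_query = search_query

    def start_requests(self):
        # placeholder search URL built from the spider argument
        url = 'http://www.example.com/search?q=%s' % self.search_query
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # each spider instance keeps its own cookiejar, so each search
        # query runs in its own session
        pass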
There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:
scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies, and rotate profiles on demand
scrapy-dynamic-sessions is almost the same but allows you to randomly pick a proxy and User-Agent and handles request retries due to any errors