Reproduced everything and the spider still gets a 302 response, while manually it's 200 - python

I'm trying to scrape a site with Scrapy. The spider manages to log in (I can download the profile page etc.). However, when the spider tries to access a page with data, it gets a response with status 302 and is redirected to the main page of the site.
When I do the same manually with a browser everything is fine: the browser gets a 200 response and the page with data.
What I've done so far:
- analysed the request headers with Firebug and reproduced them exactly within the spider
- analysed the cookies. Some of them were missing in the spider because they are generated by JS, so I added them manually (a sketch of how is shown right after this list). So the cookies are also identical.
- by deleting cookies one by one, found the 1 cookie that is responsible for the session login. When I delete it via Firebug I get the same server behaviour the spider faces: a 302 and a redirect to the main page. The funny thing is that this cookie is received and sent by the spider perfectly, according to the debug info
- the IP is the same (running from the same machine, the spider under Vagrant)
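The manual cookie injection looks roughly like this (a sketch only; the cookie names and values here are placeholders captured from Firebug, not the site's real ones):

    # Sketch: cookie names/values are placeholders, not the real ones.
    return Request(
        url="https://www.mims.com/Hongkong/Browse/Alphabet/A?cat=company",
        cookies={
            'js_generated_cookie': 'value-copied-from-firebug',   # placeholder
            'another_js_cookie': 'another-value',                 # placeholder
        },
        dont_filter=True,
        callback=self.after_login2,
    )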
NEED HELP/ADVICE: What am I missing? The requests are very similar, yet the server responses are different. What could be the reason?
Thanks in advance.
The code is:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy_test.items import Website
from scrapy.http import Request
import scrapy
import json
import time


class DmozSpider(Spider):
    name = "mims_clean"
    custom_settings = {"DOWNLOADER_MIDDLEWARES": {'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400}}
    allowed_domains = []
    start_urls = [
        "https://sso.mims.com"
    ]

    def parse(self, response):
        self.logger.info("started")
        SessionId = response.xpath('//input[@id="SessionId"]/@value').extract()[0]
        self.logger.info("SessionId is " + SessionId)
        print("initial")
        return scrapy.FormRequest.from_response(
            response,
            formdata={'SignInEmailAddress': '26rqtdywe6tdg@gmail.com',
                      'SignInPassword': 'pass777',
                      'SessionId': SessionId},
            callback=self.after_login, dont_filter=True
        )

    # here everything is ok, and after login we're redirected to the profile page,
    # same as in the browser
    def after_login(self, response):
        # check login succeeded before going on
        self.logger.debug("entered after_login")
        if "error" in response.body:
            self.logger.error("Login failed")
            print('Error found')
            return
        else:
            return Request(url="https://www.mims.com/Hongkong/Browse/Alphabet/A?cat=company",
                           callback=self.after_login2, dont_filter=True)

    # we try to reach the data page
    def after_login2(self, response):
        # but get redirected to an interim tracking page,
        # same as in the browser
        return scrapy.FormRequest.from_response(
            response,
            callback=self.after_login3, dont_filter=True, dont_click=True
        )

    # submit a form (in the browser this is done via JS)
    def after_login3(self, response):
        # get to another form with prepopulated fields (same as in the browser)
        request = scrapy.FormRequest.from_response(
            response,
            callback=self.after_login4, dont_filter=True, dont_click=True
        )
        return request

    # submit the form and get to the main page with all the cookies
    def after_login4(self, response):
        # from the main page try to go to the data page
        request = Request(url="https://mims.com/Hongkong/Browse/Alphabet/A?cat=drug",
                          callback=self.after_login5, dont_filter=True,
                          headers={'Host': 'mims.com', 'Connection': 'keep-alive'})
        return request

    def after_login5(self, response):
        # in a real browser I get a 200 response here and the catalog to scrape;
        # with Scrapy I get a 302 and am redirected back to the 1st form --
        # in fact it's a loop back to after_login2.
        # All the rest is a hopeless try to submit the forms for the 2nd time :))
        return scrapy.FormRequest.from_response(
            response,
            callback=self.after_login6, dont_filter=True
        )

    def after_login6(self, response):
        return scrapy.FormRequest.from_response(
            response,
            callback=self.parse_final, dont_filter=True)

    def parse_final(self, response):
        self.logger.debug("entered parse_final")
        f = open('parse_final.html', 'w')
        f.write('<base href="https://sso.mims.com">')
        f.write(response.body)
        f.close()
        return
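For debugging, the 302 itself can also be inspected instead of followed, using the standard dont_redirect and handle_httpstatus_list Request.meta keys. A sketch (the method names are hypothetical additions to the spider above; Request is already imported there):

    def debug_data_page(self, response):
        # Ask Scrapy not to follow the 302 so the redirect response itself
        # reaches the callback and its headers can be compared with the browser's.
        return Request(
            url="https://mims.com/Hongkong/Browse/Alphabet/A?cat=drug",
            callback=self.inspect_redirect,
            dont_filter=True,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
        )

    def inspect_redirect(self, response):
        self.logger.info("status: %s", response.status)
        self.logger.info("Location: %s", response.headers.get('Location'))
        self.logger.info("Set-Cookie: %s", response.headers.getlist('Set-Cookie'))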

Related

Passing session cookies established in Scrapy to Splash to utilize in scraping js page

First let me say that I'm a novice at Scrapy!
I have a website that requires a login prior to being able to scrape any data with Scrapy. The data that I will be scraping is generated by JavaScript once logged in.
I have successfully been able to log in using Scrapy. My question is: now that I have logged in and have the necessary cookies to continue making requests of the website, how do I transfer those cookies to Splash when invoking a SplashRequest on the report page that I want to scrape with Scrapy? The documentation that I've read is difficult for me to understand and seems way too generic. I've looked for examples but have come up blank.
Is my thought process wrong, that I should log in with Scrapy and then pass the cookies to Splash, or should I just be doing this all through Splash? If so, how do I pass the username and password variables in Splash?
Here is my Scrapy code:
import scrapy
from scrapy.http import FormRequest
from scrapy_splash import SplashRequest


class mySpider(scrapy.Spider):
    login_url = 'https://example.com/'
    name = 'reports'
    start_urls = [
        login_url
    ]

    def parse(self, response):
        return FormRequest.from_response(response, formdata={
            'username': 'XXXXXX',
            'password': 'YYYYYY'
        }, callback=self.start_requests)

    def start_requests(self):
        url = 'https://example.com/reports'
        yield SplashRequest(url=url, callback=self.start_scraping)

    def start_scraping(self, response):
        labels = response.css('label::text').extract()
        yield {'labeltext': labels}
This is simplified for the moment just to return random labels so that I know I'm logged in and Scrapy is seeing the report. What is happening is that it logs in, but of course once I invoke Splash to render the JavaScript report, Splash is redirected to the login page rather than going to the example.com/reports website. Any help or pointers in the right direction would be MUCH appreciated.
TIA
OK, as usual, after spending hours searching and several more experimenting, I found the answer and am now behind the login, scraping data with Scrapy from a JS-created table. Also as usual, I was overcomplicating things.
Below is my code, which is based on the above and simplistically logs in using Splash and then starts scraping.
This utilizes SplashFormRequest rather than Scrapy's FormRequest to log in through Splash.
import scrapy
from scrapy_splash import SplashFormRequest
from ..items import UnanetTestItem


class MySpider(scrapy.Spider):
    login_url = 'https://example.com'
    name = 'Example'
    start_urls = [
        login_url
    ]

    def parse(self, response):
        return SplashFormRequest.from_response(
            response,
            formdata={
                'username': 'username',
                'password': 'password'
            },
            callback=self.start_scraping)

    def start_scraping(self, response):
        # whatever you want to do from here.
        pass
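For completeness, scrapy-splash also needs its middlewares enabled before SplashFormRequest/SplashRequest will route through Splash. A minimal settings sketch, roughly what the scrapy-splash README recommends, assuming a local Splash instance on its default port:

    # settings.py (assumes Splash is running locally on the default port)
    SPLASH_URL = 'http://localhost:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,   # keeps Scrapy's cookiejar in sync with Splash
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'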

How to avoid re-crawling data that has already been crawled in the Scrapy framework

I would like to avoid crawling pages whose data has already been crawled before in the Scrapy framework.
To solve this, my idea is that when a crawl is done, I store the date-time in the DB and skip the next crawl if the Last-Modified HTTP response header has not been updated since that date-time.
My questions are the following two:
What do you think about this approach? Is there a better idea?
Could you show me code that reads the Last-Modified HTTP response header with the Scrapy framework, if such code exists?
Thank you for reading my question.
Not all websites return a Last-Modified header. If you are certain yours does, you can try issuing a HEAD request first to check the headers and match them against your DB info, and then issue a GET request to crawl the data:
def parse(self, response):
    urls = []  # some urls
    for url in urls:
        # HEAD request first, only to read the headers
        yield Request(url, method='HEAD', callback=self.check)

def check(self, response):
    date = response.headers['Last-Modified']
    # check the date against your db
    if db_date > date:  # or whatever is your case
        yield Request(response.url, callback=self.success)

def success(self, response):
    yield item  # build and yield your item here
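As an alternative worth considering, Scrapy's built-in HTTP cache middleware can do this revalidation for you. A settings sketch, assuming the RFC2616 policy fits your case:

    # settings.py
    HTTPCACHE_ENABLED = True
    # The RFC2616 policy revalidates cached pages (e.g. via Last-Modified / ETag)
    # instead of refetching everything unconditionally.
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
    HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'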

Scrapy CrawlSpider retry scrape

For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:
def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return

    ...
    # Normal parsing logic
However, it seems like when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse which the CrawlSpider uses to apply the crawl rules and dont_filter=True, I could avoid the duplicate filter. However with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.
Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of doing dynamic js rendering using something like splash if possible, and this only happens intermittently.
I would think about having a custom Retry Middleware instead - similar to a built-in one.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response
        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})
        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq
And don't forget to activate it.
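A sketch of the activation, assuming the middleware lives in a hypothetical myproject.middlewares module (adjust the path and priority to your project):

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RetryMiddleware': 550,  # hypothetical module path
    }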

Scrapy handling session cookie or 302 in asp.net site

I am trying to crawl a web application written in ASP.NET.
I am trying to execute a search and crawl the search results page. Let's say that the search page is http://search.site.com/search/search.aspx
Now my crawler is pretty straightforward:
class SitesearchSpider(Spider):
    name = 'sitecrawl'
    allowed_domains = ['search.site.org']
    start_urls = [
        "http://search.site.org/Search/Search.aspx"
    ]

    def parse(self, response):
        self.log("Calling Parse Method", level=log.INFO)
        response = response.replace(body=response.body.replace("disabled", ""))
        return [FormRequest(
            url="http://search.site.org/Search/Search.aspx",
            formdata={'ctl00$phContent$ucUnifiedSearch$txtIndvl': '2441386'},
            callback=self.after_search)]

    def after_search(self, response):
        self.log("In after search", level=log.INFO)
        if "To begin your search" in response.body:
            self.log("unable to get result")
        else:
            self.log(response.body)
But no matter what, the same page (search.aspx) is returned to the after_search callback instead of the expected searchresults.aspx with the results.
This is what seems to happen in the browser:
- A search term is entered in one of the fields
- The search button is clicked
- On form submit to the same page (search.aspx), I see that it returns a 302 redirect to the search results page
- The search results page then displays
I can see that the ASP.NET session cookie is being used here, because once a search is made I can take the search results page URL, something like
http://search.site.com/search/searchresults.aspx?key=searchkey&anothersearchparam=12, open it in any tab, and the results load directly.
If I open a new session and paste the URL, I am redirected to the search page.
Now I went through the docs and I am not sure whether I have to deal with the 302 or with the ASP.NET session cookie. Any help would be appreciated.
You don't have to deal with the 302; Scrapy handles it by itself.
You can debug the cookie handling by setting COOKIES_DEBUG = True in the settings.
Did you check what other params are sent in the POST or GET when you search from the browser? You have to pass them all in the form data.
I suggest you use from_response, like:
    return [FormRequest.from_response(
        response,
        formdata={'ctl00$phContent$ucUnifiedSearch$txtIndvl': '2441386'},
        callback=self.after_search)]
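For reference, a minimal settings sketch for surfacing the cookie traffic in the log (COOKIES_ENABLED and COOKIES_DEBUG are built-in Scrapy settings):

    # settings.py
    COOKIES_ENABLED = True   # default, shown for clarity
    COOKIES_DEBUG = True     # log every Cookie / Set-Cookie header the middleware handles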

Using Scrapy with authenticated (logged in) user session

In the Scrapy docs, there is the following example to illustrate how to use an authenticated session in Scrapy:
class LoginSpider(BaseSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # continue scraping with authenticated session...
I've got that working, and it's fine. But my question is: what do you have to do to continue scraping with an authenticated session, as they say in the last line's comment?
In the code above, the FormRequest that is being used to authenticate has the after_login function set as its callback. This means that the after_login function will be called and passed the page that the login attempt got as a response.
It is then checking that you are successfully logged in by searching the page for a specific string, in this case "authentication failed". If it finds it, the spider ends.
Now, once the spider has got this far, it knows that it has successfully authenticated, and you can start spawning new requests and/or scrape data. So, in this case:
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# ...

def after_login(self, response):
    # check login succeed before going on
    if "authentication failed" in response.body:
        self.log("Login failed", level=log.ERROR)
        return
    # We've successfully authenticated, let's have some fun!
    else:
        return Request(url="http://www.example.com/tastypage/",
                       callback=self.parse_tastypage)

def parse_tastypage(self, response):
    hxs = HtmlXPathSelector(response)
    yum = hxs.select('//img')
    # etc.
If you look here, there's an example of a spider that authenticates before scraping.
In this case, it handles things in the parse function (the default callback of any request).
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select("//form[@id='UsernameLoginForm_LoginForm']"):
        return self.login(response)
    else:
        return self.get_section_links(response)
So, whenever a request is made, the response is checked for the presence of the login form. If it is there, then we know that we need to log in, so we call the relevant function; if it's not present, we call the function that is responsible for scraping the data from the response.
I hope this is clear, feel free to ask if you have any other questions!
Edit:
Okay, so you want to do more than just spawn a single request and scrape it. You want to follow links.
To do that, all you need to do is scrape the relevant links from the page, and spawn requests using those URLs. For example:
def parse_page(self, response):
    """ Scrape useful stuff from page, and spawn new requests
    """
    hxs = HtmlXPathSelector(response)
    images = hxs.select('//img')
    # .. do something with them
    links = hxs.select('//a/@href')
    # Yield a new request for each link we found
    for link in links:
        yield Request(url=link, callback=self.parse_page)
As you can see, it spawns a new request for every URL on the page, and each one of those requests will call this same function with their response, so we have some recursive scraping going on.
What I've written above is just an example. If you want to "crawl" pages, you should look into CrawlSpider rather than doing things manually.
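A minimal CrawlSpider sketch along those lines (the spider name, domain and URLs are hypothetical placeholders, and the login step is left out):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class TastySpider(CrawlSpider):
        # Hypothetical example: names and URLs are placeholders.
        name = 'tasty'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/tastypage/']

        rules = (
            # Follow every link found on a page and hand each response to parse_page.
            Rule(LinkExtractor(), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # Same idea as the manual version above, but link following is
            # driven by the crawl rules instead of hand-written Requests.
            yield {'images': response.xpath('//img/@src').extract()}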
