Scrapy download error and remove_request error - python

Author note: You might think that this post is lacking context or information; that is only because I don't know where to start. I'll gladly edit in additional information at your request.
Running Scrapy, I see the following error among all the links I am scraping:
ERROR: Error downloading <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 75, in _deactivate
self.active.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
2016-01-19 15:57:20 [scrapy] INFO: Error while removing request from slot
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 140, in <lambda>
d.addBoth(lambda _: slot.remove_request(request))
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 38, in remove_request
self.inprogress.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
When I run Scrapy on that single URL alone using:
scrapy shell http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html
no errors occur. I am scraping thousands of similar links with no problem, but I see this issue on ~10 links. I am using Scrapy's default 180-second download timeout.
I don't see anything wrong with these links in my web browser either.
The parsing is initiated by the request:
request = Request(url_nrd, meta={'item': item}, callback=self.parse_player, dont_filter=True)
Which is handled in the functions:
def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        # doing stuff there
        request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
        yield request

def parse_more(self, response):
    # parsing more stuff here
    return item
Also: I didn't change the default settings for download retries in Scrapy (but I don't see any retries in my log files either).
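For completeness, these are the retry-related settings I am relying on, all left untouched; the values below are what I understand the defaults to be, so treat them as illustrative rather than authoritative:

# settings.py -- nothing changed here, shown only for reference
RETRY_ENABLED = True        # retry middleware is on by default
RETRY_TIMES = 2             # extra attempts after the first failure
DOWNLOAD_TIMEOUT = 180      # the 180 s download timeout mentioned above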
Additional notes:
After my scraping completed, and since dont_filter=True, I can see that links which failed to download with this error at some point did not fail in earlier or later requests to the same URL.
Possible answer:
I see that I am getting a KeyError and that removing the request from the slot failed (remove_request). Is it possible that this is because I am setting dont_filter=True and making several requests to the same URL, and the key for a request seems to be that URL? Could the request have been removed already by a previous, concurrent request to the same URL?
In that case, how can I get a unique key per request rather than one indexed on the URL?
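Purely for illustration, this is the kind of per-request key I mean; the req_id name is just a hypothetical tag I would add in meta for my own tracking, not something Scrapy itself uses:

import uuid
from scrapy import Request

# hypothetical: attach a unique id to each request so log lines can be told
# apart even when several requests target the same URL
request = Request(url_nrd,
                  meta={'item': item, 'req_id': uuid.uuid4().hex},
                  callback=self.parse_player,
                  dont_filter=True)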
EDIT
I think my code in parse_player was the problem. I don't know for sure because I have edited my code since, but I recall seeing a bad indent on yield request.
def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
            # Bad indent of yield request here!
            yield request
Let me know if you think that might have caused the issue.
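For comparison, this is how I believe the block should have been indented; with the yield inside the if, each Request object is yielded exactly once (yielding the same Request object twice is my guess for what could lead to the double remove_request and the KeyError, but I cannot confirm it):

def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
                yield request  # yielded once, right where it is created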

And what if you simply ignore the errors?
def parse_player(self, response):
    if response.status == 200:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
            # Bad indent of yield request here!
            yield request

Related

Unable to get rid of some error raised by process_exception

I'm trying not to show/get the error thrown by Scrapy from within process_exception in my RetryMiddleware. The script encounters the error when the max retry limit is crossed. I use proxies within the middleware. The weird thing is that the exception the script throws is already in the EXCEPTIONS_TO_RETRY list. It is completely okay that the script may sometimes exceed the max number of retries without success; I just do not wish to see the error even when it occurs, meaning I want to suppress or bypass it.
The error is like:
Traceback (most recent call last):
File "middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
This is what process_exception within my RetryMiddleware looks like:
from twisted.internet import defer
from twisted.internet.error import (TimeoutError, DNSLookupError, ConnectionRefusedError,
                                    ConnectionDone, ConnectError, ConnectionLost,
                                    TCPTimedOutError)
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError


class RetryMiddleware(object):
    cus_retry = 3
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, TunnelError, ResponseFailed)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('cus_retry', 0) + 1
        if retries <= self.cus_retry:
            r = request.copy()
            r.meta['cus_retry'] = retries
            r.meta['proxy'] = f'https://{ip}:{port}'  # ip and port of the chosen proxy
            r.dont_filter = True
            return r
        else:
            print("done retrying")
How can I get rid of the errors in EXCEPTIONS_TO_RETRY?
PS: The script encounters the error when the max retry limit is reached, no matter which site I choose.
Maybe the problem is not on your side; there might be something wrong with the third-party site. Perhaps there is a connection error on their server, or maybe it is secured so that no one can access it.
The error even says that the connected party did not respond properly, so it may be shut down or not working correctly. First check whether the third-party site responds when requested, and try contacting them if you can.
In other words, the error is not on your end; it's on the other party's end, as the message says.
This question is similar to Scrapy - Set TCP Connect Timeout
When the max retry limit is reached, an errback method like parse_error() should handle any error left over within your spider:
def start_requests(self):
    for start_url in self.start_urls:
        yield scrapy.Request(start_url, errback=self.parse_error, callback=self.parse, dont_filter=True)

def parse_error(self, failure):
    # print(repr(failure))
    pass
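If you also want to react differently depending on the exception type, the Failure object passed to the errback can be inspected; a rough sketch of such a spider method, using the same exception classes imported for the middleware above:

from twisted.internet.error import DNSLookupError, TCPTimedOutError

def parse_error(self, failure):
    # failure wraps the original exception; failure.request is the Request that failed
    if failure.check(TCPTimedOutError, DNSLookupError):
        self.logger.info("giving up on %s after repeated timeouts", failure.request.url)
    else:
        self.logger.error(repr(failure))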
However, I thought of suggesting a completely different approach here. If you go the following route, you don't need any custom middleware at all. Everything including retrying logic is already there within the spider.
class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "some url",
    ]

    proxies = []  # list of proxies here
    max_retries = 5
    retry_urls = {}

    def parse_error(self, failure):
        proxy = f'https://{ip}:{port}'  # pick a proxy here
        retry_url = failure.request.url
        if retry_url not in self.retry_urls:
            self.retry_urls[retry_url] = 1
        else:
            self.retry_urls[retry_url] += 1

        if self.retry_urls[retry_url] <= self.max_retries:
            yield scrapy.Request(retry_url, callback=self.parse,
                                 meta={"proxy": proxy, "download_timeout": 10},
                                 errback=self.parse_error, dont_filter=True)
        else:
            print("gave up retrying")

    def start_requests(self):
        for start_url in self.start_urls:
            proxy = f'https://{ip}:{port}'  # pick a proxy here
            yield scrapy.Request(start_url, callback=self.parse,
                                 meta={"proxy": proxy, "download_timeout": 10},
                                 errback=self.parse_error, dont_filter=True)

    def parse(self, response):
        for item in response.css('some css selector').getall():
            print(item)
Don't forget to add the following to your spider to get the aforesaid result from the above suggestion:
custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    }
}
I'm using scrapy 2.3.0 by the way.
Try fixing the code in the scraper itself. Sometimes having a bad parse function can lead to an error of the kind you're describing. Once I fixed the code, it went away for me.

Capturing HTTP Errors using scrapy

I'm trying to scrape a website for broken links; so far I have this code, which successfully logs in and crawls the site, but it only records HTTP status 200 codes:
class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value

        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r
The docs and these answers lead me to believe that handle_httpstatus_all = True should cause scrapy to pass errored requests to my parse method, but so far I've not been able to capture any.
I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.
What do I need to change to capture the HTTP error codes scrapy is encountering?
handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be defined at the Request level, by including it in the meta argument.
I would still recommend using an errback for these cases, but if everything is controlled, it shouldn't create new problems.
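A minimal sketch of what that looks like, using the link-extraction loop from the question (only the meta key is new):

for link in self.link_extractor.extract_links(response):
    # per-request flag: hand the response to the callback whatever its status
    r = Request(link.url, self.parse, meta={'handle_httpstatus_all': True})
    r.meta.update(link_text=link.text)
    yield r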
So, I don't know if this is the proper scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).
I disabled the HttpErrorMiddleware by adding this snippet to my scrapy project's settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}

SharePlum error : "Can't get User Info List"

I'm trying to use SharePlum, which is a Python module for SharePoint, but when I try to connect to my SharePoint site, SharePlum raises this error:
Traceback (most recent call last):
  File "C:/Users/me/Desktop/Sharpoint/sharpoint.py", line 13, in <module>
    site = Site(sharepoint_url, auth=auth)
  File "C:\Users\me\AppData\Local\Programs\Python\Python36\lib\site-packages\shareplum\shareplum.py", line 46, in __init__
    self.users = self.GetUsers()
  File "C:\Users\me\AppData\Local\Programs\Python\Python36\lib\site-packages\shareplum\shareplum.py", line 207, in GetUsers
    raise Exception("Can't get User Info List")
Exception: Can't get User Info List
Here is the very short code that I have written:
auth = HttpNtlmAuth(username, password)
site = Site(sharepoint_url, auth=auth)
This error seems to indicate a bad username/password, but I'm pretty sure that the ones I have are correct...
OK, it seems that I found the solution to my problem; it's about the SharePoint URL that I gave.
If we take this example: https://www.mysharepoint.com/Your/SharePoint/DocumentLibrary
You have to remove the last part: /DocumentLibrary.
Why remove this part precisely?
In fact, when you go deep enough into your SharePoint, your URL will look something like this: https://www.mysharepoint.com/Your/SharePoint/DocumentLibrary/Forms/AllItems.aspx?RootFolder=%2FYour%2FSharePoint%2DocumentLibrary%2FmyPersonnalFolder&FolderCTID=0x0120008BBC54784D92004D1E23F557873CC707&View=%7BE149526D%2DFD1B%2D4BFA%2DAA46%2D90DE0770F287%7D
You can see that the right-hand part of the path is now in RootFolder=%2FYour%2FSharePoint%2DocumentLibrary%2Fmy%20personnal%20folder and not in the "normal" URL anymore (if it were, it would look like https://www.mysharepoint.com/Your/SharePoint/DocumentLibrary/myPersonnalFolder/).
What you have to remove is the end of that "normal" URL, so in this case /DocumentLibrary.
So the correct SharePoint URL to give to SharePlum is https://www.mysharepoint.com/Your/SharePoint/
I'm pretty new to SharePoint, so I'm not really sure that this is the right answer to this problem for other people; maybe someone who knows SharePoint better than me can confirm?
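For anyone who wants the end result in one place, here is a minimal sketch of the connection with the trimmed URL; the credentials are placeholders, and I'm assuming HttpNtlmAuth comes from requests_ntlm, as in the question:

from requests_ntlm import HttpNtlmAuth
from shareplum import Site

username = 'DOMAIN\\me'   # placeholder credentials
password = 'secret'
sharepoint_url = 'https://www.mysharepoint.com/Your/SharePoint/'  # trimmed after /SharePoint/

auth = HttpNtlmAuth(username, password)
site = Site(sharepoint_url, auth=auth)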
I know this is not an actual solution to your problem, and I would have added just a comment, but it was too long, so I will post it as an answer.
I can't replicate your issue, but by looking into the source code of shareplum.py you can see why the program throws the error. On line 196 of shareplum.py there is an if clause (if response.status_code == 200:) which checks whether the request to access your SharePoint URL was successful (in which case it has status code 200); if the request failed (it has some other status code), the exception (Can't get User Info List) is raised. If you want to find out more about your problem, go to your shareplum.py file ("C:\Users\me\AppData\Local\Programs\Python\Python36\lib\site-packages\shareplum\shareplum.py") and add this line:
print('{} {} Error: {} for url: {}'.format(response.status_code, 'Client'*(400 <= response.status_code < 500) + 'Server'*(500 <= response.status_code < 600), response.reason, response.url))
before line 207 (raise Exception("Can't get User Info List")). Then your shareplum.py should look like this:
# Parse Response
if response.status_code == 200:
    envelope = etree.fromstring(response.text.encode('utf-8'))
    listitems = envelope[0][0][0][0][0]
    data = []
    for row in listitems:
        # Strip the 'ows_' from the beginning with key[4:]
        data.append({key[4:]: value for (key, value) in row.items() if key[4:]})
    return {'py': {i['ImnName']: i['ID']+';#'+i['ImnName'] for i in data},
            'sp': {i['ID']+';#'+i['ImnName']: i['ImnName'] for i in data}}
else:
    print('{} {} Error: {} for url: {}'.format(response.status_code,
          'Client'*(400 <= response.status_code < 500) + 'Server'*(500 <= response.status_code < 600),
          response.reason, response.url))
    raise Exception("Can't get User Info List")
Now just run your program again and it should print out why it isn't working.
I know it is best not to change files in Python modules, but if you know what you are changing then there is no problem; when you are finished, just delete the added line.
Also, once you find out the status code, you can search for it online: just type it into Google or look it up in the List_of_HTTP_status_codes.
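If you would rather not edit the installed module at all, a quick probe with plain requests should reveal the same status code that SharePlum is choking on; this is just a sketch, using the same sharepoint_url, username, and password as in the question, and assuming HttpNtlmAuth comes from requests_ntlm:

import requests
from requests_ntlm import HttpNtlmAuth

# probe the site URL directly and print why it fails, without touching shareplum.py
resp = requests.get(sharepoint_url, auth=HttpNtlmAuth(username, password))
print(resp.status_code, resp.reason)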

Requests CookieJar empty even though the page has cookies

I'm on Python 3.5.1, using requests, the relevant part of the code is as follows:
req = requests.post(self.URL, data={"username": username, "password": password})
self.cookies = {"MOODLEID1_": req.cookies["MOODLEID1_"], "MoodleSession": req.cookies["MoodleSession"]}
self.URL has the correct page, and the POST is working as intended; I did some prints to check that, and it passed.
My output:
Traceback (most recent call last):
File "D:/.../main.py", line 14, in <module>
m.login('first.last', 'pa$$w0rd!')
File "D:\...\moodle2.py", line 14, in login
self.cookies = {"MOODLEID1_": req.cookies["MOODLEID1_"], "MoodleSession": req.cookies["MoodleSession"]}
File "D:\...\venv\lib\site-packages\requests\cookies.py", line 287, in __getitem__
return self._find_no_duplicates(name)
File "D:\...\venv\lib\site-packages\requests\cookies.py", line 345, in _find_no_duplicates
raise KeyError('name=%r, domain=%r, path=%r' % (name, domain, path))
KeyError: "name='MOODLEID1_', domain=None, path=None"
I'm trying to debug at runtime to check what req.cookies contains, but what I get is surprising, at least to me. If I put a breakpoint on self.cookies = {...} and run [(c.name, c.value, c.domain) for c in req.cookies], I get an empty list, as if there aren't any cookies in there.
The site does send cookies; checking with a Chrome extension, I found two, "MOODLEID1_" and "MoodleSession", so why am I not getting them?
The response doesn't appear to contain any cookies. Look for one or more Set-Cookie headers in req.headers.
Cookies stored in a browser are there because a response included a Set-Cookie header for each of those cookies. You'll have to find what response the server sets those cookies with; apparently it is not this response.
If you need to retain those cookies (once set) across requests, do use a requests.Session() object; this'll retain any cookies returned by responses and send them out again as appropriate with new requests.
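A short sketch of that approach, using the same URL, username, and password as in the question; the session jar collects cookies set by any response along the redirect chain, which is usually where login cookies appear:

import requests

with requests.Session() as s:
    resp = s.post(URL, data={"username": username, "password": password})
    # cookies set anywhere in the chain end up in the session jar
    print(s.cookies.get_dict())
    # raw Set-Cookie headers of the final and any intermediate redirect responses
    print(resp.headers.get("Set-Cookie"))
    for r in resp.history:
        print(r.url, r.headers.get("Set-Cookie"))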

XML parser syntax error

So I'm working with a block of code which communicates with the Flickr API.
I'm getting a 'syntax error' raised as xml.parsers.expat.ExpatError (below), and I can't figure out how there could be a syntax error in a Python module.
I saw another similar question on SO regarding the Wikipedia API, which seemed to return HTML instead of XML. The Flickr API returns XML, and I'm also getting the same error when there shouldn't be any response from Flickr (such as with flickr.galleries.addPhoto).
CODE:
def _dopost(method, auth=False, **params):
    #uncomment to check you aren't killing the flickr server
    #print "***** do post %s" % method

    params = _prepare_params(params)
    url = '%s%s/%s' % (HOST, API, _get_auth_url_suffix(method, auth, params))
    payload = 'api_key=%s&method=%s&%s' % \
              (API_KEY, method, urlencode(params))

    #another useful debug print statement
    #print url
    #print payload

    return _get_data(minidom.parse(urlopen(url, payload)))
TRACEBACK:
Traceback (most recent call last):
File "TESTING.py", line 30, in <module>
flickr.galleries_create('test_title', 'test_descriptionn goes here.')
File "/home/vlad/Documents/Computers/Programming/LEARNING/curatr/flickr.py", line 1006, in galleries_create
primary_photo_id=primary_photo_id)
File "/home/vlad/Documents/Computers/Programming/LEARNING/curatr/flickr.py", line 1066, in _dopost
return _get_data(minidom.parse(urlopen(url, payload)))
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 62
(Code from http://code.google.com/p/flickrpy/ under New BSD licence)
UPDATE:
print urlopen(url, payload) == <addinfourl at 43340936 whose fp = <socket._fileobject object at 0x29400d0>>
Doing a urlopen(url, payload).read() returns HTML which is hard to read in a terminal :P but I managed to make out a 'You are not signed in.'
The strange part is that Flickr shouldn't return anything here, or if permissions are a problem, it should return a 99: User not logged in / Insufficient permissions error as it does with the GET function (which I'd expect would be in valid XML).
I'm signed in to Flickr (in the browser) and the program is properly authenticated with delete permissions (dangerous, but I wanted to avoid permission problems.)
SyntaxError normally means an error in Python syntax, but I think here that expatbuilder is overloading it to mean an XML syntax error. Put a try/except block around it, and print out the contents of what you are parsing to work out what's wrong with the first line of it.
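A sketch of that suggestion against the original _dopost (Python 2, to match the code above; urlopen, url, and payload are the same names used there, and I'm assuming urlopen comes from urllib):

from xml.dom import minidom
from xml.parsers.expat import ExpatError
from urllib import urlopen  # assumption: the same urlopen used in _dopost

raw = urlopen(url, payload).read()
try:
    doc = minidom.parseString(raw)
except ExpatError:
    # usually an HTML or plain-text error page rather than XML
    print raw[:500]
    raise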
My guess would be that flickr is rejecting your request for some reason and giving back a plain-text error message, which has an invalid xml character at column 62, but it could be any number of things. You probably want to check the http status code before parsing it.
Also, it's a bit strange this method is called _dopost but you seem to actually be sending an http GET. Perhaps that's why it's failing.
This seems to fix my problem:
url = '%s%s/?api_key=%s&method=%s&%s' % \
      (HOST, API, API_KEY, method, _get_auth_url_suffix(method, auth, params))
payload = '%s' % (urlencode(params))
It seems that the API key and method had to be in the URL, not in the payload. (Or maybe only one of them needed to be there, but anyway, it works :-)
