When I was crawling data from a webpage, I got a max retries error. Although I've searched for it online, the Errno code seems to be different from the ones in the answers I found.
requests.exceptions.ConnectionError: HTTPConnectionPool(host={},port={}): Max retries exceeded with url: {} (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at {}>: Failed to establish a new connection: [Errno 60] Operation timed out'
I'm crawling websites with different remote host addresses, and my script only failed for this one address; in the other cases it worked as usual. I tried adding time.sleep(), but it didn't help with this error, so I don't think the error is caused by sending too many requests to the server.
I'd appreciate any help. Thank you!
The url it is failing on:
http://222.175.25.10:8403/ajax/npublic/NData.ashx?jsoncallback=jQuery1111054523240929524232_1457362751668&Method=GetMonitorDataList&entCode=37150001595&subType=&subID=&year=2016&itemCode=&dtStart=2015-01-01&dtEnd=2015-12-31&monitoring=1&bReal=false&page=1&rows=500&_=1457362751769
(The pages I'm crawling are generated by JavaScript, so I reconstructed the URL myself.)
Update: it is working now! The reason seems to be just that the website timed out.
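For transient timeouts like this, one common mitigation is to let requests retry with a backoff before giving up; a minimal sketch, assuming a placeholder URL and arbitrary retry settings:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = 'http://example.com/data'  # placeholder; substitute the reconstructed URL above

# Retry failed connections a few times with exponential backoff instead of failing immediately
session = requests.Session()
retries = Retry(total=3, connect=3, read=3, backoff_factor=2)
session.mount('http://', HTTPAdapter(max_retries=retries))

response = session.get(url, timeout=30)  # explicit timeout so a slow server doesn't hang the crawler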
I have this host: http://retsau.torontomls.net:7890/ and I want to access http://retsau.torontomls.net:7890/rets-treb3pv/server/login. How can I accomplish this using Python Requests? All my attempts so far have failed.
I also followed the solution here (Python Requests - Use navigate site by servers IP) and came up with this:
response = requests.get('http://206.152.41.279/rets-treb3pv/server/login', headers={'Host': 'retsau.torontomls.net'})
but that resulted in this error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='206.152.41.279', port=80): Max retries exceeded with url: /rets-treb3pv/server/login (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10a4f6d84>: Failed to establish a new connection: [Errno 60] Operation timed out',))
Funny thing is everything seems to work perfectly fine on Postman, I am able to access all sorts of URLs on that server, from logging in to searching for something.
You left out the port number (7890) from the URL to your get call:
response = requests.get('http://206.152.41.279:7890/rets-treb3pv/server/login', headers={'Host': 'retsau.torontomls.net'})
# note the ":7890" port number added to the URL
Also, unless you actually have a specific reason for accessing the site by IP address, it would make more sense to put the FQDN in the URL rather than the Host header:
response = requests.get('http://retsau.torontomls.net:7890/rets-treb3pv/server/login')
I have a uWSGI server running on a Linux VM, and multiple requests are made to it.
At some points there are errors like ReadTimeout and HTTPConnectionPool errors, and it recovers automatically.
ConnectionError: HTTPConnectionPool(host='10.1.1.1', port=8000): Max retries exceeded with url: /app_servers (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f16e8a89190>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
Is it due to too many requests, or to some network lookup issue?
I tried using the netstat and sar commands to identify the root cause, but the CPU and I/O stats are fine.
The number of connections in the ESTABLISHED and CLOSE_WAIT states is also low, and I'm not sure how to check these figures for a past point in time.
How can I check the number of HTTP connections made at that point in time, or find out why the HTTPConnectionPool (Max retries exceeded) error occurs?
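Since it's hard to inspect connection counts retroactively, one option is to sample them periodically and keep a history; a minimal sketch, assuming the ss utility is available on the VM:

import subprocess
from collections import Counter

# Count TCP connections by state; run this periodically (e.g. from cron) to build a history
output = subprocess.run(['ss', '-tan'], capture_output=True, text=True).stdout
states = Counter(line.split()[0] for line in output.splitlines()[1:] if line.strip())
print(states)  # e.g. Counter({'ESTAB': 12, 'TIME-WAIT': 3, 'LISTEN': 2})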
I have the following code:
res = requests.get(url)
When I run it from multiple threads, I get the following error:
ConnectionError: HTTPConnectionPool(host='bjtest.com', port=80): Max retries exceeded with url: /rest/data?method=check&test=123 (Caused by : [Errno 104] Connection reset by peer)
I have tried the following methods, but the error still occurs:
s = requests.session()
s.keep_alive = False
OR
res = requests.get(url, headers={'Connection': 'close'})
So, what should I do?
BTW, the URL is OK, but it can only be visited internally, so the URL itself is not the problem. Thanks!
Do you run your script on a Mac? I met a similar problem; you can execute ulimit -n to check how many files you can have open at a time.
You can use the following to raise the limit:
import resource
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft_limit, resource.RLIM_INFINITY))  # new_soft_limit is the limit you want
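For reference, the current soft and hard limits can also be read from Python (roughly what ulimit -n reports):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)  # current soft and hard limits on open file descriptors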
Hope this helps.
I also wrote a blog post related to this problem.
I had a similar case; hopefully this can save you some time:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8001): Max retries exceeded with url: /enroll/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10f96ecc0>: Failed to establish a new connection: [Errno 61] Connection refused'))
The problem was actually silly... the local server at port 8001 was down! Restarting the server solved it.
The error message (which is admittedly a little confusing) actually means that requests failed to connect to your requested URL at all.
In this case that's because your url is http://bjtest.com/rest/data?method=check&test=123, which isn't a real website.
It has nothing to do with the format you made the request in. Fix your url and it should (presumably) work for you.
I'm trying to concurrently download a bunch of URLs with the requests module and Python's built-in multiprocessing library. When using the two together, I'm experiencing some errors which definitely do not look right. I send out 100 requests with 100 threads, and usually 50 of them succeed while the other 50 receive this message:
HTTPConnectionPool(host='www.reuters.com', port=80): Max retries exceeded with url:
/video/2013/10/07/breakingviews-batistas-costly-bluster?videoId=274054858&feedType=VideoRSS&feedName=Business&videoChannel=5&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+reuters%2FUSVideoBusiness+%28Video+%2F+US+%2F+Business%29 (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
Neither the max retries part nor the 'nodename nor servname provided' part looks right.
Here is my requests setup:
import requests

req_kwargs = {
    'headers': {'User-Agent': 'np/0.0.1'},
    'timeout': 7,
    'allow_redirects': True,
}
# I left out the multiprocessing code, but that part isn't important
resp = requests.get(some_url, **req_kwargs)
Does anyone know how to prevent or at least move further in debugging this?
Thank you.
I think it may be caused by a visit frequency higher than the site allows.
Try the following:
Use a lower visit frequency to crawl the site, and when you receive the same error again, visit the site in your web browser to see whether your spider has been banned by it.
Use a proxy pool to crawl the site, so the site does not deem your visit frequency too high and block your spider.
Enrich your HTTP request headers so they look like those emitted by a web browser (see the sketch below).
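A minimal sketch of the throttling and browser-like-headers ideas, assuming placeholder URLs, an arbitrary delay, and an example User-Agent string:

import time
import requests

# Browser-like headers (example values) so the request doesn't look like a bare script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}

urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder list of URLs to crawl

for some_url in urls:
    resp = requests.get(some_url, headers=headers, timeout=7)
    time.sleep(2)  # throttle: wait between requests to keep the visit frequency low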
[Errno 8] nodename nor servname provided, or not known
This simply implies that it can't resolve www.reuters.com.
Either add an entry for the host to your hosts file, or make sure the domain resolves via DNS.
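A quick way to check whether name resolution is the problem from within Python (the hostname is just the one from the error above):

import socket

try:
    print(socket.gethostbyname('www.reuters.com'))  # prints the resolved IP if DNS works
except socket.gaierror as exc:
    print('DNS resolution failed:', exc)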
Does anyone have any tips on how to troubleshoot connection errors on App Engine?
I've been running a cron job on App Engine that makes a couple of PayPal API calls, for almost a month now, and just recently started seeing deadline exceeded errors, which seem to be due to a ConnectionError. api.paypal.com is our live/production server, and none of our other monitors are failing, so this makes me think it's something specific to App Engine.
Any tips on how to troubleshoot this? We don't really see anything on our server side, so perhaps something upstream at the network level is blocking the connections.
File "/base/data/home/apps/s~pp-stashboard/10.371471577236231745/requests/adapters.py", line 356, in send
raise ConnectionError(e)
HTTPSConnectionPool(host='api.paypal.com', port=443): Max retries exceeded with url: /v1/oauth2/token (Caused by : Deadline exceeded while waiting for HTTP response from URL: https://api.paypal.com/v1/oauth2/token)
Thanks for your help!
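On the old App Engine Python runtime, the "Deadline exceeded while waiting for HTTP response" message usually comes from urlfetch's default 5-second deadline, so one thing worth trying (an assumption, not a confirmed fix for this case) is raising that deadline and passing an explicit timeout to requests:

from google.appengine.api import urlfetch
import requests

# Raise urlfetch's default deadline (5 seconds); on the old Python runtime,
# outbound HTTP made through the sandbox inherits this deadline
urlfetch.set_default_fetch_deadline(60)

# Also give requests an explicit timeout so slow responses fail predictably
response = requests.post('https://api.paypal.com/v1/oauth2/token', timeout=60)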