How to increase PSAW Max Retries - python

I keep receiving the following error when trying to collect a large amount of data from pushshift.io using PSAW.
Exception: Unable to connect to pushshift.io. Max retries exceeded.
How can I increase the "max retries" so that this won't happen?
my_reddit_submissions = api.search_submissions(
    before=int(end_epoch.timestamp()),
    post_hint='image',
    filter=['id', 'full_link', 'title', 'url', 'subreddit', 'author_fullname'],
    limit=frequency,
)
for submission_x in my_reddit_submissions:
    data_new = data_new.append(submission_x.d_, ignore_index=True)
BTW, my code works fine up to a point...

You should take a look at this question (it might help): Max retries exceeded with URL in requests
This exception is raised when the server actively refuses to communicate with you. It can happen if you send too many requests to the server in a short period of time.
To overcome it, wait a few seconds before retrying.
Here is an example:
import time
import requests

with requests.Session() as session:
    while True:  # keep retrying until the request goes through
        try:
            response = session.get("https://www.example.com")
            break
        except requests.exceptions.ConnectionError:
            # the server refused or dropped the connection; wait a bit
            # before retrying so we don't overload it further
            time.sleep(2)

print(response.status_code)
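To address the original question more directly, one option is to wrap the whole PSAW query in an explicit retry loop with a backoff, instead of relying on PSAW's internal retry limit. The sketch below is an assumption-laden illustration, not part of the answer above: the attempt budget, the placeholder end_epoch and frequency values, and catching a bare Exception (PSAW raises a plain Exception for "Max retries exceeded") are all choices to adapt. Depending on your PSAW version, the PushshiftAPI constructor may also accept retry-related keyword arguments such as max_retries and backoff, which is worth checking.
import datetime as dt
import time

from psaw import PushshiftAPI

api = PushshiftAPI()

end_epoch = dt.datetime(2021, 1, 1)  # placeholder end date
frequency = 500                      # placeholder result limit
MAX_ATTEMPTS = 5                     # assumed retry budget

submissions = []
for attempt in range(MAX_ATTEMPTS):
    try:
        # materialise the generator inside the try block, because PSAW only
        # talks to pushshift.io while the results are being iterated over
        submissions = list(api.search_submissions(
            before=int(end_epoch.timestamp()),
            post_hint='image',
            filter=['id', 'full_link', 'title', 'url', 'subreddit', 'author_fullname'],
            limit=frequency,
        ))
        break
    except Exception:
        # pushshift.io rejected or dropped the connection; wait longer on
        # each failed attempt before trying again
        time.sleep(2 ** attempt)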

Related

How to solve the python requests error: "Max retries exceeded with url"

I have the following code:
res = requests.get(url)
When I run it from multiple threads, I get the following error:
ConnectionError: HTTPConnectionPool(host='bjtest.com', port=80): Max retries exceeded with url: /rest/data?method=check&test=123 (Caused by : [Errno 104] Connection reset by peer)
I have tried the following, but the error persists:
s = requests.session()
s.keep_alive = False
OR
res = requests.get(url, headers={'Connection': 'close'})
So, what should I do?
BTW, the URL is fine, but it can only be reached internally, so the URL itself is not the problem. Thanks!
Do you run your script on a Mac? I ran into a similar problem. You can run ulimit -n to check how many files you can have open at a time.
You can use the call below to raise that limit:
resource.setrlimit(resource.RLIMIT_NOFILE, (the number you choose, resource.RLIM_INFINITY))
Hoping this helps.
My blog post covers a problem related to yours.
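For reference, here is a minimal sketch of that call with the import it needs; the new soft limit of 4096 is an assumed value, and since raising the hard limit usually requires elevated privileges, the existing hard limit is kept as-is:
import resource

# Unix/macOS only: raise the soft limit on open file descriptors for this
# process while leaving the hard limit untouched
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# 4096 is an assumed value; it must not exceed the hard limit reported above
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))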
I had a similar case; hopefully this saves you some time:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8001): Max retries exceeded with url: /enroll/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10f96ecc0>: Failed to establish a new connection: [Errno 61] Connection refused'))
The problem was actually silly... the localhost was down at port 8001! Restarting the server solved it.
The error message (which is admittedly a little confusing) actually means that requests failed to connect to your requested URL at all.
In this case that's because your url is http://bjtest.com/rest/data?method=check&test=123, which isn't a real website.
It has nothing to do with the format you made the request in. Fix your url and it should (presumably) work for you.

requests exception: max retries exceeded, Errno 60 operation timed out

While crawling data from a webpage, I got a max retries error. Although I've searched online, the Errno code seems to differ from what others report.
requests.exceptions.ConnectionError: HTTPConnectionPool(host={},port={}): Max retries exceeded with url: {} (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at {}>: Failed to establish a new connection: [Errno 60] Operation timed out'
I'm crawling websites with different remote host addresses, and only for this one address does my script fail; for the others it works as usual. I tried adding time.sleep(), but it didn't help with this error, so I don't think the problem is that I'm sending too many requests to the server.
I'd appreciate any help. Thank you!
The url it is failing on:
http://222.175.25.10:8403/ajax/npublic/NData.ashx?jsoncallback=jQuery1111054523240929524232_1457362751668&Method=GetMonitorDataList&entCode=37150001595&subType=&subID=&year=2016&itemCode=&dtStart=2015-01-01&dtEnd=2015-12-31&monitoring=1&bReal=false&page=1&rows=500&_=1457362751769
(The pages I'm crawling are generated by JavaScript, so I reconstructed the URL myself.)
Update: it is working now! The reason seems to be simply that the website was timing out.
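For anyone hitting the same [Errno 60], here is a small sketch of how one might fail fast on a host that intermittently times out, rather than waiting for the operating system's default; the (5, 30) connect/read timeouts are assumed values:
import requests

# base of the URL quoted above, with the long query string omitted
url = 'http://222.175.25.10:8403/ajax/npublic/NData.ashx'

try:
    # timeout=(connect, read): give up quickly if the host is unreachable
    resp = requests.get(url, timeout=(5, 30))
    print(resp.status_code)
except (requests.exceptions.Timeout,
        requests.exceptions.ConnectionError) as exc:
    print('host unreachable or timed out:', exc)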

Python Session 10054 Connection Aborted Error

I wrote a web scraper using the requests module. I open a session and send subsequent requests with it. The scraper has two phases:
1) Scrape page by page and collect IDs into an array.
2) Fetch details about each ID in the array with requests to an AJAX endpoint on the same host.
The scraper works fine on my Linux machine. However, when I run the bot on Windows 10, phase 1 completes just fine, but after a couple of requests in phase 2 Python throws this exception:
File "c:\python27\lib\site-packages\requests\adapters.py", line 453, in send
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(10054, 'Varolan bir ba\xf0lant\xfd uzaktaki bir ana bilgisayar taraf\xfdndan zorla kapat\xfdld'))
(The Turkish message translates to "An existing connection was forcibly closed by the remote host.")
What is different between the two operating systems that causes this? How can I overcome the problem?
Modifying my request code as below, using the retrying module, had no positive effect. Now the script doesn't throw exceptions but simply hangs, doing nothing.
from retrying import retry

@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000, stop_max_attempt_number=7)
def doReq(self, url):
    time.sleep(0.5)
    response = self.session.get(url, headers=self.headers)
    return response
I still don't know why this problem occurs only on Windows. However, the retrying decorator seems to have fixed the socket error. The reason the script hung was that the server stopped responding to a request, and by default the requests module waits forever for a response. After adding a timeout value, requests throws a timeout exception, which the retry decorator catches before trying again. I know this is a workaround rather than a solution, but it's the best I have right now.
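Here is a minimal sketch of that combined workaround, assuming the retrying package; the Scraper class, the header value, and the 10-second timeout are illustrative, not taken from the original code:
import time

import requests
from retrying import retry

class Scraper(object):
    def __init__(self):
        self.session = requests.Session()
        self.headers = {'User-Agent': 'my-scraper/0.1'}  # illustrative header

    # exponential backoff between attempts, up to 7 attempts in total
    @retry(wait_exponential_multiplier=1000, wait_exponential_max=10000,
           stop_max_attempt_number=7)
    def doReq(self, url):
        time.sleep(0.5)
        # the timeout makes requests raise instead of waiting forever, so the
        # retry decorator gets an exception it can catch and retry on
        return self.session.get(url, headers=self.headers, timeout=10)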

Deadline exceeded while waiting for HTTP response from URL: HTTPException

I'm developing an app that listens for Twitter hashtags using tweepy. I have uploaded the app to Google App Engine, and it gives me the error below.
Last line of Traceback:
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/gae_override/httplib.py", line 524, in getresponse
raise HTTPException(str(e))
HTTPException: Deadline exceeded while waiting for HTTP response from URL: https://stream.twitter.com/1.1/statuses/filter.json?delimited=length
How could I solve this?
You can raise the deadline for URL Fetch; I believe it defaults to 5 seconds, and that endpoint call might take longer. Perhaps 30 seconds:
urlfetch.fetch(url=url, method=urlfetch.GET, deadline=30)
You can go up to 60 per the docs: https://cloud.google.com/appengine/docs/python/urlfetch/#Python_Fetching_URLs_in_Python
I am running a simple app on GAE which interacts with a Jenkins CI server, using the jenkinsapi library, which depends on requests. I ship both jenkinsapi and requests with my app; requests is not supported on GAE, though it exists in the Google Cloud SDK, which is where I took it from.
jenkinsapi sends a huge number of requests to the server, and I was very often getting:
File "/base/data/home/apps/s~jenkins-watcher/v0-1.382631715892564425/libs/requests-2.3.0-py2.7.egg/requests/adapters.py", line 375, in send
raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='XXXXXXX', port=8080):
Max retries exceeded with url: XXXXXX
(Caused by <class 'gae_override.httplib.HTTPException'>:
Deadline exceeded while waiting for HTTP response from URL: XXXXXXXX
It turned out that the number of retries was 0 and the timeout was a very low default. Increasing both numbers, for which I had to patch the library, helped, and I am not seeing this problem anymore.
Actually, it still happens, but now I get:
Retrying (3 attempts remain) after connection broken by 'HTTPException('Deadline exceeded while waiting for HTTP response from URL ...
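The answer above patched jenkinsapi directly. Outside App Engine's urlfetch layer, a common alternative (shown here only as a rough sketch, with an assumed Jenkins URL and retry policy, not the author's actual patch) is to mount an HTTPAdapter with a urllib3 Retry object on a requests Session and pass an explicit timeout per request. Note that very old bundled urllib3 versions, such as the one shipped with requests 2.3.0, may not include the Retry class.
import requests
from requests.adapters import HTTPAdapter
try:
    from urllib3.util.retry import Retry
except ImportError:  # older requests bundle urllib3 under requests.packages
    from requests.packages.urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[500, 502, 503, 504])  # assumed policy
adapter = HTTPAdapter(max_retries=retries)
session.mount('http://', adapter)
session.mount('https://', adapter)

# the timeout still has to be passed per request
response = session.get('http://jenkins.example.com:8080/api/json', timeout=30)
print(response.status_code)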

Python requests multithreading "Max Retries exceeded with url" Caused by <class 'socket.gaierror'>

I'm trying to download a bunch of URLs concurrently with the requests module and Python's built-in multiprocessing library. When using the two together, I'm seeing some errors which definitely do not look right. I send out 100 requests with 100 threads, and usually about 50 of them succeed while the other 50 receive this message:
HTTPConnectionPool(host='www.reuters.com', port=80): Max retries exceeded with url:
/video/2013/10/07/breakingviews-batistas-costly-bluster?videoId=274054858&feedType=VideoRSS&feedName=Business&videoChannel=5&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+reuters%2FUSVideoBusiness+%28Video+%2F+US+%2F+Business%29 (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
Neither the "max retries" nor the "nodename nor servname provided" parts look right.
Here is my requests setup:
import requests

req_kwargs = {
    'headers': {'User-Agent': 'np/0.0.1'},
    'timeout': 7,
    'allow_redirects': True,
}
# I left out the multiprocessing code but that part isn't important
resp = requests.get(some_url, **req_kwargs)
Does anyone know how to prevent or at least move further in debugging this?
Thank you.
I think it may be caused by a visit frequency higher than the site allows.
Try the following:
Crawl the site at a lower frequency, and if you receive the same error again, visit the site in your web browser to see whether your spider has been banned by the site.
Use a proxy pool to crawl the site, so the site doesn't deem your visit frequency too high and block your spider.
Enrich your HTTP request headers so the requests look like they were emitted by a web browser, as sketched below.
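Here is a rough sketch of the first and third suggestions combined: a deliberately low request rate plus browser-like headers. The header values, the one-second delay, and the single example URL are assumptions to adapt.
import time

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'),
    'Accept-Language': 'en-US,en;q=0.9',
})

urls = ['http://www.reuters.com/']  # illustrative list of targets

for url in urls:
    resp = session.get(url, timeout=7)
    print(resp.status_code, url)
    time.sleep(1.0)  # throttle so the visit frequency stays low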
[Errno 8] nodename nor servname provided, or not known
This simply means that your machine can't resolve www.reuters.com.
Either add an entry for the host to your hosts file or fix DNS resolution for the domain.
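A quick way to confirm that diagnosis, using only the standard library: if the lookup below also raises socket.gaierror, the failure is in DNS resolution on your machine rather than in requests itself.
import socket

try:
    print(socket.gethostbyname('www.reuters.com'))
except socket.gaierror as exc:
    print('DNS lookup failed:', exc)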
