urllib2 POST timeout in Python

I'm posting a file to a web server using urllib2 in my script and it keeps timing out. My code:
def postdata(nodemac, filename, timestamp):
    try:
        wakeup()
        socket.setdefaulttimeout(TIMEOUT)
        opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler)
        host = HOST
        func = "post_data"
        url = "http://{0}{1}?f={2}&nodemac={3}&time={4}".format(host, URI, func, nodemac, timestamp)
        if os.path.isfile(filename):
            data = {"data": open(filename, "rb")}
            response = opener.open(url, data, timeout=TIMEOUT)
            retval = response.read()
            if "SUCCESS" in retval:
                return 0
            else:
                print "RETVAL " + retval
                return 99
        else:
            print filename + " is not a file"
            return 99
    except Exception as e:
        print e
        return 99
I set TIMEOUT to 20, 60 and 120, but the same thing keeps happening. What's going on here, I wonder? Something is foul! The timeout of 20 used to work just fine, and then all of a sudden, today, it started timing out on me... does anyone have any clues? I couldn't find anything on the web that got me any further, so I thought I'd try here!
Thanks,
Traceback:
File "gateway.py", line 686, in CloudRun
read_img()
File "gateway.py", line 668, in read_img
retval = database.postimg(mac,fh,timestmp)
File "/root/database.py", line 100, in postimg
response = opener.open(url, data, timeout=TIMEOUT)
File "/usr/lib/python2.7/urllib2.py", line 394, in open
File "/usr/lib/python2.7/urllib2.py", line 412, in _open
File "/usr/lib/python2.7/urllib2.py", line 372, in _call_chain
File "/usr/lib/python2.7/urllib2.py", line 1199, in http_open
File "/usr/lib/python2.7/urllib2.py", line 1174, in do_open
URLError: <urlopen error timed out>

The timeout set to 20 used to work just fine and then all of a sudden, today it started to time out on me... does anyone have any clues?
Well, the first clue is that the behavior changed even though your code didn't. So, it's got to be something with the environment.
The most likely possibility is that the timeouts are a perfectly accurate sign that the server is overloaded, broken, etc. But, since you gave us no information at all about the server you're trying to talk to, it's hard to do much more than guess.
Here are some ideas for tracking down the problem.
First, take that exact same URL, paste it into your browser's address bar, and see what happens. Does it time out? Or take longer than 20 seconds to respond?
Try adding &data=data to the end of the URL.
Try using a simple tool to send the request as a POST. For example, from the command line, try curl -d data http://whatever.
Try logging exactly what headers and POST data urllib2 sends (a debug-logging sketch follows this list), and make curl send exactly the same thing. (If this fails while the previous step worked, you may want to edit your question, or ask a new one, asking "Why did this work and that fail?" with the details.)
Try running the same tests from a different computer with a different internet connection.
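For the logging step, here is a minimal sketch (Python 2). The URL and file name are placeholders, and MultipartPostHandler is the third-party handler you are already using; an HTTPHandler built with debuglevel=1 makes the underlying httplib connection print the request line, every header, and the response status to stdout, so you can replay the exact same request with curl.
import urllib2
import MultipartPostHandler

# debuglevel=1 dumps the outgoing request line, headers and the response
# status to stdout, so the request can be reproduced exactly with curl.
opener = urllib2.build_opener(
    urllib2.HTTPHandler(debuglevel=1),
    MultipartPostHandler.MultipartPostHandler,
)
url = "http://example.com/cgi-bin/test?f=post_data&nodemac=0&time=0"  # placeholder URL
data = {"data": open("somefile.bin", "rb")}                           # placeholder file
response = opener.open(url, data, timeout=20)
print response.read()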

Related

Why does this code not download the file, while the downloader can download it successfully?

The problem begins with this link
https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip
The file downloaded with the downloader is complete.
But when I try to use Python to download the file:
from urllib import request
import sys
request.urlretrieve('https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip', '123.zip')
Traceback (most recent call last):
File "C:/Users/ssshooter/PycharmProjects/first/111.py", line 3, in <module>
request.urlretrieve('https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip', '123.zip')
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
It doesn't work.
The differences are:
You're using different SSL information: your browser has a built-in set of certificate authorities. Python uses a set that comes with the OS. If they differ, and the site you're accessing uses one known to your browser but not known to Python, Python will throw an exception.
You're accessing using different User-Agents. Your browser tells the server it's Chrome or IE or whatever. Python tells the server it's Python. For whatever reason, the server may decide it doesn't like that and return Forbidden.
The server may be working harder than you think: while it appears the request is for a simple file, you're really requesting a resource. It may be (though unlikely in this case) that the resource you're requesting results in multiple interactions between the server and your browser -- cookies, JavaScript, etc. -- which are executed successfully in your browser before the server delivers the file. Your Python request does none of that.
Your browser may have existing state which your Python code does not. You say you can access the file using your browser, but that may work only because you've accessed other resources on the site, or logged in, or whatever. Your browser communicates that information (perhaps a session id via cookie?) and the server recognizes it. Your Python code starts with no previous state, so the server forbids the request.
Which is it in this case? You'll need to investigate. Can you get wget or curl to work? Debug your browser's access: what headers are being sent, what are you receiving in reply?
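If the User-Agent (and possibly a missing Referer) turns out to be the difference, a minimal sketch like the following may be enough. It matches the question's Python 3 code, and the header values are plausible guesses to try, not something the site documents.
import shutil
import urllib.request

url = 'https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip'
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0',          # look like a browser instead of "Python-urllib"
    'Referer': 'https://www.pixiv.net/',  # some hosts refuse requests without a plausible referer
})
with urllib.request.urlopen(req) as resp, open('123.zip', 'wb') as out:
    shutil.copyfileobj(resp, out)         # stream the body to disk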

"http.client.CannotSendRequest: Request-sent" error

Weird problem here. I have a Python 3 script that runs 24/7 and uses Selenium and Firefox to go to a web page and, every 5 minutes, download a file from a download link. (I can't just download it with urllib or the like, because even though the link address stays constant, the data in the file changes constantly, is different every time the page is reloaded, and also depends on the criteria specified.) The script runs fine almost all the time, but I can't get rid of one error that pops up every once in a while and terminates the script. Here's the error:
Traceback (most recent call last):
File "/Users/Shared/ROTH_1/Folio/get_F_notes.py", line 248, in <module>
driver.get(search_url)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 187, in get
self.execute(Command.GET, {'url': url})
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/webdriver.py", line 173, in execute
response = self.command_executor.execute(driver_command, params)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(command_info[0], url, body=data)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/selenium/webdriver/remote/remote_connection.py", line 379, in _request
self._conn.request(method, parsed_url.path, body, headers)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1090, in request
self._send_request(method, url, body, headers)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 1118, in _send_request
self.putrequest(method, url, **skips)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py", line 966, in putrequest
raise CannotSendRequest(self.__state)
http.client.CannotSendRequest: Request-sent
And here is the part of my script where the problem comes in. Specifically, the script hits the except ConnectionRefusedError: block and, as intended, prints out "WARNING 1 : ConnectionRefusedError: search page did not load; now trying again". However, I get the above error, I think, when the loop begins again and tries driver.get(search_url) again. The script chokes at that point and gives me the above error.
I have researched this quite a bit, and it seems possible that the script is trying to reuse the same connection from the first attempt. The fix seems to be to create a new connection, but that is all I have been able to gather, and I have no idea how to create a new connection with Selenium. Do you? Or is there some other issue here?
search_url = 'https://www.example.com/download_page'
loop_get_search_page = 1
while loop_get_search_page < 7:
    if loop_get_search_page == 6:
        print('WARNING: tried to load search page 5 times; exiting script to try again later')
        ##### log out
        try:
            driver.find_element_by_link_text('Sign Out')
        except NoSuchElementException:
            print('WARNING: NoSuchElementException: Unable to find the link text for the "Sign Out" button')
        driver.quit()
        raise SystemExit
    try:
        driver.get(search_url)
    except TimeoutException:
        print('WARNING ', loop_get_search_page, ': TimeoutException: search page did not load; now trying again', sep='')
        loop_get_search_page += 1
        continue
    except ConnectionRefusedError:
        print('WARNING ', loop_get_search_page, ': ConnectionRefusedError: search page did not load; now trying again')
        loop_get_search_page += 1
        continue
    else:
        break
Just ran into this problem myself. In my case, I had another thread running on the side that was also making requests via WebDriver. Turns out WebDriver is not threadsafe.
Check out the discussion at Can Selenium use multi threading in one browser? and the links there for more context.
When I removed the other thread, everything started working as expected.
Is it possible that you're running every 5m in a new thread?
The only way I know of to "create a new connection" is to launch a new instance of the WebDriver. That can get slow if you're doing a lot of requests, but since you're only doing things every 5m, it shouldn't really affect your throughput. As long as you always clean up your WebDriver instance when your dl is done, this might be a good option for you.
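A minimal sketch of that approach, assuming Firefox and the 5-minute cycle from the question (the URL and the download step are placeholders):
import time
from selenium import webdriver

search_url = 'https://www.example.com/download_page'

while True:
    driver = webdriver.Firefox()       # fresh browser, fresh connection every cycle
    try:
        driver.get(search_url)
        # ... find the download link and fetch the file here ...
    finally:
        driver.quit()                  # always release the browser, even on errors
    time.sleep(5 * 60)                 # wait five minutes before the next run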

Repeated POST request is causing error "socket.error: (99, 'Cannot assign requested address')"

I have a web-service deployed in my box. I want to check the result of this service with various input. Here is the code I am using:
import sys
import httplib
import urllib

apUrl = "someUrl:somePort"
fileName = sys.argv[1]
conn = httplib.HTTPConnection(apUrl)
titlesFile = open(fileName, 'r')
try:
    for title in titlesFile:
        title = title.strip()
        params = urllib.urlencode({'search': 'abcd', 'text': title})
        conn.request("POST", "/somePath/", params)
        response = conn.getresponse()
        data = response.read().strip()
        print data + "\t" + title
        conn.close()
finally:
    titlesFile.close()
This code gives an error after the same number of lines has been printed (28233). Error message:
Traceback (most recent call last):
File "testService.py", line 19, in ?
conn.request("POST", "/somePath/", params)
File "/usr/lib/python2.4/httplib.py", line 810, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.4/httplib.py", line 833, in _send_request
self.endheaders()
File "/usr/lib/python2.4/httplib.py", line 804, in endheaders
self._send_output()
File "/usr/lib/python2.4/httplib.py", line 685, in _send_output
self.send(msg)
File "/usr/lib/python2.4/httplib.py", line 652, in send
self.connect()
File "/usr/lib/python2.4/httplib.py", line 636, in connect
raise socket.error, msg
socket.error: (99, 'Cannot assign requested address')
I am using Python 2.4.3. I am calling conn.close() as well, so why am I getting this error?
This is not a Python problem.
In Linux kernel 2.4, the ephemeral port range is 32768 through 61000, so the number of available ports is 61000 - 32768 + 1 = 28233. From what I understood, because the web service in question is quite fast (< 5 ms, actually), all the ports get used up. The program then has to wait a minute or two for the ports to be released.
What I did was to count the conn.close() calls; when the count reached 28000, I waited 90 seconds and reset the counter.
BIGYaN identified the problem correctly and you can verify that by calling "netstat -tn" right after the exception occurs. You will see very many connections with state "TIME_WAIT".
The alternative to waiting for port numbers to become available again is to simply use one connection for all requests. You are not required to call conn.close() after each call of conn.request(). You can simply leave the connection open until you are done with your requests.
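A minimal sketch of the single-connection version, keeping the placeholder host and path from the question. Note this only helps when the server allows keep-alive, and the full response body must be read before the next request.
import sys
import httplib
import urllib

conn = httplib.HTTPConnection("someUrl:somePort")   # placeholder from the question
titlesFile = open(sys.argv[1], 'r')
try:
    for title in titlesFile:
        title = title.strip()
        params = urllib.urlencode({'search': 'abcd', 'text': title})
        conn.request("POST", "/somePath/", params)
        response = conn.getresponse()
        # reading the whole body frees the connection for the next request
        print response.read().strip() + "\t" + title
finally:
    titlesFile.close()
    conn.close()                                     # one close at the very end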
I too faced a similar issue while executing multiple POST requests using Python's requests library in Spark. To make it worse, I used multiprocessing on each executor to post to a server, so thousands of connections were created in seconds, and each took a few seconds to leave the TIME_WAIT state and release the port for the next set of connections.
Of all the solutions available on the internet that speak of disabling keep-alive, using requests.Session(), et al., I found the one that works is passing 'Connection': 'close' as a header parameter. You may need to put the header content on a separate line outside the post command, though.
headers = {
    'Connection': 'close'
}
with requests.Session() as session:
    response = session.post('https://xx.xxx.xxx.x/xxxxxx/x', headers=headers, files=files, verify=False)
    results = response.json()
    print results
That is my answer to a similar issue, using the solution above.

Error 429 when invoking Reddit api from Google App Engine

I have been running a cron job on Google App Engine for over a month now without any issues. The job does a variety of things, one being that it uses urllib2 to make a call to retrieve a json response from Reddit as well as a few other sites. About two weeks ago I started seeing errors when invoking Reddit, but no errors when invoking the other sites. The error I am receiving is HTTP error 429.
I have tried executing the same code outside of Google App Engine and do not have any issues. I tried using urlFetch, but receive the same error.
You can see the error when using the app engine's interactive shell with the following code.
import urllib2
data = urllib2.urlopen('http://www.reddit.com/r/Music/.json', timeout=60)
Edit: Not sure why it always fails for me and not someone else. This is the error that I receive:
>>> import urllib2
>>> data = urllib2.urlopen('http://www.reddit.com/r/Music/.json', timeout=60)
Traceback (most recent call last):
File "/base/data/home/apps/s~shell-27/1.356011914885973647/shell.py", line 267, in get
exec compiled in statement_module.__dict__
File "<string>", line 1, in <module>
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 429: Unknown
similar code running outside of app engine with no problem:
print urllib2.urlopen('http://www.reddit.com/r/Music/.json').read()
At first I thought it had to do with a timeout, since it was originally working, but as this is not a timeout error but a strange HTTPError code, I'm not sure.
Any ideas?
Reddit rate-limits the API pretty severely for the default Python user agent. You need to set a unique User-Agent with your Reddit username in it, like this:
User-Agent: super happy flair bot by /u/spladug
More info about the reddit api here https://github.com/reddit/reddit/wiki/API.
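A minimal sketch of that with urllib2, as used in the question; the bot description and username below are placeholders, so substitute your own:
import urllib2

req = urllib2.Request(
    'http://www.reddit.com/r/Music/.json',
    headers={'User-Agent': 'my GAE cron fetcher by /u/your_username'},  # placeholder UA
)
data = urllib2.urlopen(req, timeout=60).read()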
It's possible that Reddit is counting calls based on IP - which means that other applications on GAE which share your IP might already be exhausting the quota.
This might get better if you use Reddit API keys (I don't know if they issue them) or if they agree to rate limit API calls based on the app header.

Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

I have a strange bug when trying to urlopen a certain page from Wikipedia. This is the page:
http://en.wikipedia.org/wiki/OpenCola_(drink)
This is the shell session:
>>> f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
Traceback (most recent call last):
File "C:\Program Files\Wing IDE 4.0\src\debug\tserver\_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "c:\Python26\Lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\Lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\Lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\Lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\Lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\Lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
This happened to me on two different systems in different continents. Does anyone have an idea why this happens?
Wikipedia's stance is:
Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database.
That is why Python is blocked. You're supposed to download data dumps.
Anyways, you can read pages like this in Python 2:
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
con = urllib2.urlopen( req )
print con.read()
Or in Python 3:
import urllib.request
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib.request.urlopen(req)
print(con.read())
To debug this, you'll need to trap that exception.
try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    print e.fp.read()
When I print the resulting message, it includes the following:
"English
Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes."
Oftentimes, websites filter access by checking whether they are being accessed by a recognized user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing as a browser. The following link takes you to an article that shows you how:
http://wolfprojects.altervista.org/changeua.php
Some websites block access from scripts, to avoid 'unnecessary' usage of their servers, by reading the headers urllib sends. I don't know why Wikipedia does (or would do) this, but have you tried spoofing your headers?
As Jochen Ritzel mentioned, Wikipedia blocks bots.
However, bots will not get blocked if they use the MediaWiki API (api.php).
To get the Wikipedia page titled "love":
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
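A minimal sketch (Python 2) of fetching that URL and pulling the wikitext out of the JSON response; the User-Agent string here is just a polite placeholder:
import json
import urllib2

url = ('http://en.wikipedia.org/w/api.php?format=json&action=query'
       '&titles=love&prop=revisions&rvprop=content')
req = urllib2.Request(url, headers={'User-Agent': 'ExampleFetcher/0.1 (contact: you@example.com)'})
data = json.load(urllib2.urlopen(req))
page = data['query']['pages'].values()[0]   # the API keys pages by page id
print page['revisions'][0]['*']             # raw wikitext of the article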
I made a workaround for this using PHP, which is not blocked by the site I needed.
It can be accessed like this:
path = 'http://phillippowers.com/redirects/get.php?file=http://website_you_need_to_load.com'
req = urllib2.Request(path)
response = urllib2.urlopen(req)
vdata = response.read()
This will return the HTML to you.
