I'm trying to fetch a url by making a POST request using Python's urllib2 module. I'm constructing the request in the following way.
handler = urllib2.HTTPHandler()
opener = urllib2.build_opener(handler)
url = 'xyz...'
request = urllib2.Request(url,data='{}')
request.add_header('Content-Type','application/json')
request.get_method = lambda: 'POST'
try:
    connection = opener.open(request)
except urllib2.HTTPError as e:
    connection = e
except urllib2.URLError as e:
    # e.reason is not always a string, so format it instead of concatenating
    print 'TIMEOUT: %s' % e.reason
I want to set a timeout on the open request. Per the docs (https://docs.python.org/3.1/library/urllib.request.html), the build_opener() call should return an OpenerDirector instance, which should accept a timeout parameter, but I can't seem to get it to work. Also, the reason I'm constructing a Request object is that I need to send an empty JSON body (data='{}'), and I can't get that going with urlopen either. Any help appreciated.
You can pass timeout as a keyword argument to the open() method of the opener.
Normal operation, using a lambda to make sure the request is sent as a POST (with no body) rather than a GET:
>>> import urllib2
>>> handler = urllib2.HTTPHandler()
>>> opener = urllib2.build_opener(handler)
>>> request = urllib2.Request('http://httpbin.org/post')
>>> request.get_method = lambda: 'POST'
>>> opener.open(request)
<addinfourl at 4363264800 whose fp = <socket._fileobject object at 0x101b654d0>>
Then simply add a timeout:
>>> opener.open(request, timeout=0.01)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1197, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error timed out>
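For readers on Python 3, where urllib2 became urllib.request, the same approach might look roughly like the sketch below. It is an illustration, not the original poster's code: the helper names build_post_request and open_with_timeout are hypothetical, and socket.timeout is caught separately on the assumption that a timeout raised while reading the response is not always wrapped in URLError.

```python
import socket
import urllib.error
import urllib.request


def build_post_request(url, body=b'{}'):
    """Build a POST request carrying a JSON body (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=body,  # in Python 3 the body must be bytes, not str
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def open_with_timeout(request, timeout=5.0):
    """Open the request; return (response_or_None, error_text_or_None)."""
    opener = urllib.request.build_opener(urllib.request.HTTPHandler())
    try:
        return opener.open(request, timeout=timeout), None
    except urllib.error.HTTPError as e:
        # HTTPError doubles as a readable response object (4xx/5xx status).
        return e, None
    except (urllib.error.URLError, socket.timeout) as e:
        # Connection failures and timeouts end up here.
        return None, 'request failed: %s' % getattr(e, 'reason', e)


# Building the request does not touch the network.
request = build_post_request('http://httpbin.org/post')
```

Calling open_with_timeout(request, timeout=0.01) should then report a failure instead of raising, provided the server cannot answer within 10 ms.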
I have searched through a lot of similar questions on SO, but did not find an exact match for my case.
I am trying to download a video using Python 2.7.
Here is my code for downloading the video:
import urllib2
from bs4 import BeautifulSoup as bs
with open('video.txt','r') as f:
    last_downloaded_video = f.read()
webpage = urllib2.urlopen('http://*.net/watch/**-'+last_downloaded_video)
soup = bs(webpage)
a = []
for link in soup.find_all('a'):
    if link.has_attr('data-video-id'):
        a.append(link)
#try just with first data-video-id
id = a[0]['data-video-id']
webpage2 = urllib2.urlopen('http://*/video/play/'+id)
soup = bs(webpage2)
string = str(soup.find_all('script')[2])
print string
url = string.split(': ')[1].split(',')[0]
url = url.replace('"','')
print url
print type(url)
video = urllib2.urlopen(url).read()
filename = "video.mp4"
with open(filename,'wb') as f:
    f.write(video)
This code gives an unknown url type error. The traceback is
Traceback (most recent call last):
File "naruto.py", line 26, in <module>
video = urllib2.urlopen(url).read()
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 427, in _open
'unknown_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1247, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: 'http>
However, when I store the same URL in a variable and attempt to download it from the terminal, no error is shown.
I am confused as to what the problem is.
I came across a similar question on the Python mailing list.
It's hard to tell without seeing the HTML from the page that you are scraping, however, a stray ' (single quote) character at the beginning of the URL might be the cause - this causes the same exception:
>>> import urllib2
>>> urllib2.urlopen("'http://blah.com")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "urllib2.py", line 404, in open
response = self._open(req, data)
File "urllib2.py", line 427, in _open
'unknown_open', req)
File "urllib2.py", line 382, in _call_chain
result = func(*args)
File "urllib2.py", line 1249, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: 'http>
So, try cleaning up your URL and remove any stray quotes.
Update after OP feedback:
The output of the print statement indicates that the URL string has a single quote character at the beginning and at the end. There should not be quotes of any type surrounding the URL when it is passed to urlopen(). You can remove leading and trailing quotes (both single and double) from the URL string with this:
url = url.strip('\'"')
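The cleanup above can be combined with a quick scheme check, so that a malformed URL fails early with a clearer message. This is only a sketch in Python 3 syntax; clean_url is a hypothetical helper name, and restricting the scheme to http and https is an assumption about what the scraper needs.

```python
from urllib.parse import urlparse


def clean_url(raw):
    """Strip surrounding whitespace and quotes, then check the scheme."""
    url = raw.strip().strip('\'"')
    scheme = urlparse(url).scheme
    if scheme not in ('http', 'https'):
        raise ValueError('unexpected URL scheme %r in %r' % (scheme, url))
    return url


print(clean_url("'http://blah.com'"))
```

A URL that still fails the check after stripping probably needs a closer look at the scraped script text rather than more string surgery.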
EDIT: I've majorly edited the content of this post since the original to specify my problem:
I am writing a program to download webcomics, and I'm getting this weird error when downloading a page of the comic. The code I am running essentially boils down to the following line followed by the error. I do not know what is causing this error, and it is confusing me greatly.
>>> urllib.request.urlopen("http://abominable.cc/post/47699281401")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 470, in open
response = meth(req, response)
File "/usr/lib/python3.4/urllib/request.py", line 580, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.4/urllib/request.py", line 502, in error
result = self._call_chain(*args)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 685, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python3.4/urllib/request.py", line 464, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 482, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1211, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.4/urllib/request.py", line 1183, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1172, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 1014, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 37-38: ordinal not in range(128)
The entirety of my program can be found here: https://github.com/nstephenh/pycomic
I was having the same problem. The root cause is that the remote server isn't playing by the rules. HTTP headers are supposed to be US-ASCII only, but apparently the leading HTTP web servers (apache2, nginx) don't care and send UTF-8 encoded strings directly.
However, in http.client the parse_headers function decodes the headers as ISO-8859-1, and the default HTTPRedirectHandler in urllib doesn't bother to quote the Location or URI header, resulting in the aforementioned error.
I was able to work around both issues by overriding the default HTTPRedirectHandler and adding three lines that undo the Latin-1 decoding and quote the path:
import urllib.request
from urllib.error import HTTPError
from urllib.parse import (
    urlparse, quote, urljoin, urlunparse)
class UniRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Implementation note: To avoid the server sending us into an
    # infinite loop, the request object needs to track what URLs we
    # have already seen. Do this by adding a handler-specific
    # attribute to the Request object.
    def http_error_302(self, req, fp, code, msg, headers):
        # Some servers (incorrectly) return multiple Location headers
        # (so probably same goes for URI). Use first header.
        if "location" in headers:
            newurl = headers["location"]
        elif "uri" in headers:
            newurl = headers["uri"]
        else:
            return
        # fix a possible malformed URL
        urlparts = urlparse(newurl)
        # For security reasons we don't allow redirection to anything other
        # than http, https or ftp.
        if urlparts.scheme not in ('http', 'https', 'ftp', ''):
            raise HTTPError(
                newurl, code,
                "%s - Redirection to url '%s' is not allowed" % (msg, newurl),
                headers, fp)
        if not urlparts.path:
            urlparts = list(urlparts)
            urlparts[2] = "/"
        else:
            urlparts = list(urlparts)
            # Headers should only contain US-ASCII chars, but some servers do send unicode data
            # that should be quoted back before being reused.
            # Need to re-encode the string as iso-8859-1 before using quote() to cancel the effect of parse_headers() in http/client.py
            urlparts[2] = quote(urlparts[2].encode('iso-8859-1'))
        newurl = urlunparse(urlparts)
        newurl = urljoin(req.full_url, newurl)
        # XXX Probably want to forget about the state of the current
        # request, although that might interact poorly with other
        # handlers that also use handler-specific request attributes
        new = self.redirect_request(req, fp, code, msg, headers, newurl)
        if new is None:
            return
        # loop detection
        # .redirect_dict has a key url if url was previously visited.
        if hasattr(req, 'redirect_dict'):
            visited = new.redirect_dict = req.redirect_dict
            if (visited.get(newurl, 0) >= self.max_repeats or
                    len(visited) >= self.max_redirections):
                raise HTTPError(req.full_url, code,
                                self.inf_msg + msg, headers, fp)
        else:
            visited = new.redirect_dict = req.redirect_dict = {}
        visited[newurl] = visited.get(newurl, 0) + 1
        # Don't close the fp until we are sure that we won't use it
        # with HTTPError.
        fp.read()
        fp.close()
        return self.parent.open(new, timeout=req.timeout)
    http_error_301 = http_error_303 = http_error_307 = http_error_302
[...]
# Change default Redirect Handler in urllib, should be done once at the beginning of the program
opener = urllib.request.build_opener(UniRedirectHandler())
urllib.request.install_opener(opener)
This is Python 3 code, but it should be easy to adapt for Python 2 if need be.
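The key step in the handler, re-encoding the header value as ISO-8859-1 and then quoting it, can be tried in isolation. The sketch below simulates a server that sends a UTF-8 path which the client then decodes as ISO-8859-1; requote_path and the sample path are made up for illustration.

```python
from urllib.parse import quote


def requote_path(path):
    """Undo the ISO-8859-1 decoding of a header value and percent-encode it."""
    raw_bytes = path.encode('iso-8859-1')  # recover the bytes the server sent
    return quote(raw_bytes)                # percent-encode them; '/' is kept


# Simulate a server sending a UTF-8 path that the client decodes as ISO-8859-1.
sent_by_server = '/post/caf\u00e9'.encode('utf-8')
seen_by_client = sent_by_server.decode('iso-8859-1')
fixed = requote_path(seen_by_client)
print(fixed)
```

The result is pure ASCII, so the later request.encode('ascii') call in http.client no longer fails.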
I'm trying to send POST data from a Python program to a PHP file that is protected by basic HTTP authentication. I run this code:
import urllib.parse
from urllib.request import urlopen
path="https://username:password@url_to_my_file.php"
path=path.encode('utf8')
data=urllib.parse.urlencode({"Hello":"There"})
data=data.encode('utf8')
req=urlopen(path,mydata)
req.add_header("Content-type","application/x-www-form-urlencoded")
page=urllib.urlopen(req).read()
I got this error:
req.data=data
AttributeError: 'bytes' object has no attribute 'data'
How can I fix this bug ?
UPDATE:
Following the solution below, I changed my code this way:
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener, Request
import urllib
url="https://www.my_website.com/file.php"
path="http://my_username:my_password@https://www.my_website.com/file.php"
mydata=urllib.parse.urlencode({"Hello":"Test"})
pwmgr = HTTPPasswordMgrWithDefaultRealm()
pwmgr.add_password(None, url, 'my_username', 'my_password')
authhandler = HTTPBasicAuthHandler(pwmgr)
opener = build_opener(authhandler)
req = Request(path, mydata)
req.add_header("Content-type","application/x-www-form-urlencoded")
page = opener.open(req).read()
I got these errors:
Traceback (most recent call last):
File "/usr/local/python3.1.3/lib/python3.1/http/client.py", line 673, in _set_hostport
port = int(host[i+1:])
ValueError: invalid literal for int() with base 10: ''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "a1.py", line 17, in <module>
page = opener.open(req).read()
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 350, in open
response = self._open(req, data)
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 368, in _open
'_open', req)
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 328, in _call_chain
result = func(*args)
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 1112, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 1065, in do_open
h = http_class(host, timeout=req.timeout) # will parse host:port
File "/usr/local/python3.1.3/lib/python3.1/http/client.py", line 655, in __init__
self._set_hostport(host, port)
File "/usr/local/python3.1.3/lib/python3.1/http/client.py", line 675, in _set_hostport
raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
http.client.InvalidURL: nonnumeric port: ''
You opened the URL twice. First with:
req=urlopen(path,mydata)
Then again with:
page=urllib.urlopen(req).read()
If you wanted to create a separate Request object, do so:
from urllib.request import urlopen, Request
req = Request(path, mydata)
req.add_header("Content-type","application/x-www-form-urlencoded")
page = urlopen(req).read()
Note that you should not encode the URL; it should be a str value.
urllib.request will also not parse authentication information from the URL; you'll need to provide that separately by using a password manager:
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
url = "https://url_to_my_file.php"
pwmgr = HTTPPasswordMgrWithDefaultRealm()
pwmgr.add_password(None, url, 'username', 'password')
authhandler = HTTPBasicAuthHandler(pwmgr)
opener = build_opener(authhandler)
req = Request(url, mydata)
req.add_header("Content-type","application/x-www-form-urlencoded")
page = opener.open(req).read()
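Putting the pieces together for Python 3, a minimal sketch could look like this. The URL and credentials are placeholders, send is a hypothetical helper, and the form data is encoded to bytes because urllib.request rejects a str body.

```python
import urllib.parse
from urllib.request import (
    HTTPBasicAuthHandler, HTTPPasswordMgrWithDefaultRealm, Request, build_opener)

# Placeholder values; substitute the real endpoint and credentials.
url = "https://www.example.com/file.php"
username, password = "username", "password"

# The URL stays a str; only the form data is encoded to bytes.
form_data = urllib.parse.urlencode({"Hello": "There"}).encode("utf-8")

pwmgr = HTTPPasswordMgrWithDefaultRealm()
pwmgr.add_password(None, url, username, password)
opener = build_opener(HTTPBasicAuthHandler(pwmgr))

req = Request(url, form_data)  # supplying data makes this a POST
req.add_header("Content-type", "application/x-www-form-urlencoded")


def send(opener, req, timeout=10):
    """Perform the request and return the response body as bytes."""
    with opener.open(req, timeout=timeout) as response:
        return response.read()
```

Calling send(opener, req) performs the actual request. Note that HTTPBasicAuthHandler only sends the credentials after the server has answered with a 401 challenge.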
What I am trying to do is read a line (an IP address) from a file, open a website through that address, and then repeat for all the addresses in the file. Instead, I get an error. I am new to Python, so maybe it's a simple mistake. Thanks in advance!
CODE:
>>> f = open("proxy.txt","r"); #file containing list of ip addresses
>>> address = (f.readline()).strip(); # to remove \n at end of line
>>>
>>> while address:
        proxy = urllib2.ProxyHandler({'http': address })
        opener = urllib2.build_opener(proxy)
        urllib2.install_opener(opener)
        urllib2.urlopen('http://www.google.com')
        address = (f.readline()).strip();
ERROR:
Traceback (most recent call last):
File "<pyshell#15>", line 5, in <module>
urllib2.urlopen('http://www.google.com')
File "D:\Programming\Python\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "D:\Programming\Python\lib\urllib2.py", line 394, in open
response = self._open(req, data)
File "D:\Programming\Python\lib\urllib2.py", line 412, in _open
'_open', req)
File "D:\Programming\Python\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "D:\Programming\Python\lib\urllib2.py", line 1199, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "D:\Programming\Python\lib\urllib2.py", line 1174, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
It means that the proxy is unavailable.
Here's a proxy checker that checks several proxies simultaneously:
#!/usr/bin/env python
import fileinput # accept proxies from files or stdin
try:
    from gevent.pool import Pool # $ pip install gevent
    import gevent.monkey; gevent.monkey.patch_all() # patch stdlib
except ImportError: # fallback on using threads
    from multiprocessing.dummy import Pool
try:
    from urllib2 import ProxyHandler, build_opener
except ImportError: # Python 3
    from urllib.request import ProxyHandler, build_opener
def is_proxy_alive(proxy, timeout=5):
    opener = build_opener(ProxyHandler({'http': proxy})) # test redir. and such
    try: # send request, read response headers, close connection
        opener.open("http://example.com", timeout=timeout).close()
    except EnvironmentError:
        return None
    else:
        return proxy
candidate_proxies = (line.strip() for line in fileinput.input())
pool = Pool(20) # use 20 concurrent connections
for proxy in pool.imap_unordered(is_proxy_alive, candidate_proxies):
    if proxy is not None:
        print(proxy)
Usage:
$ python alive-proxies.py proxy.txt
$ echo user:password@ip:port | python alive-proxies.py
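If concurrency is not needed, the loop from the question can also be written sequentially so that an unavailable proxy is skipped instead of aborting the run. This is a Python 3 sketch: read_proxies, can_open_via and working_proxies are hypothetical names, and treating OSError and http.client.HTTPException as "proxy not usable" is an assumption.

```python
import http.client
from urllib.request import ProxyHandler, build_opener


def read_proxies(lines):
    """Yield stripped, non-empty proxy addresses from an iterable of lines."""
    for line in lines:
        address = line.strip()
        if address:
            yield address


def can_open_via(address, url="http://example.com", timeout=5):
    """Return True if url can be opened through the given HTTP proxy."""
    opener = build_opener(ProxyHandler({"http": address}))
    try:
        opener.open(url, timeout=timeout).close()
    except (OSError, http.client.HTTPException):
        return False
    return True


def working_proxies(lines, check=can_open_via):
    """Return the addresses for which check() succeeds, in input order."""
    return [address for address in read_proxies(lines) if check(address)]
```

An open file can be passed directly, for example working_proxies(open("proxy.txt")), since iterating over a file yields its lines.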
I'm trying to dive into Python because I want to write an agent for the Plex Media Server. This agent will access the MyAnimeList.net API with HTTP authentication (more about that here). My username and password work, but I have no idea why I still get a 401 error in response from the server.
Here is some code (I'm using Python 2.5 because Plex requires it):
import urllib2
username = "someUser"
password = "somePass"
url = "http://myanimelist.net/api/anime/search.xml?q=bleach"
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
top_level_url = "http://myanimelist.net/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(urllib2.HTTPHandler, handler)
request = urllib2.Request(url)
print request.get_full_url()
f = urllib2.urlopen(request).read()
print(f)
And this is the response I get:
http://myanimelist.net/api/anime/search.xml?q=bleach
Traceback (most recent call last):
File "C:\Users\Daraku\Desktop\MAL.bundle\Contents\Code\__init__.py", line 16, in <module>
f = urllib2.urlopen(request).read()
File "C:\Python25\lib\urllib2.py", line 121, in urlopen
return _opener.open(url, data)
File "C:\Python25\lib\urllib2.py", line 380, in open
response = meth(req, response)
File "C:\Python25\lib\urllib2.py", line 491, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python25\lib\urllib2.py", line 418, in error
return self._call_chain(*args)
File "C:\Python25\lib\urllib2.py", line 353, in _call_chain
result = func(*args)
File "C:\Python25\lib\urllib2.py", line 499, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 401: Unauthorized
Bear in mind that this is my first time programming in Python, and I'm confused by many of the examples on the web, because something was changed with urllib2 so that, I think, it no longer exists in Python 3.0.
Any ideas? Or any better ways to do this?