I was trying to download images with url's that change but got an error.
url_image="http://www.joblo.com/timthumb.php?src=/posters/images/full/"+str(title_2)+"-poster1.jpg&h=333&w=225"
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
headers = {'User-Agent': user_agent}
req = urllib.request.Request(url_image, None, headers)
print(url_image)
#image, h = urllib.request.urlretrieve(url_image)
with urllib.request.urlopen(req) as response:
the_page = response.read()
#print (the_page)
with open('poster.jpg', 'wb') as f:
f.write(the_page)
Traceback (most recent call last):
File "C:\Users\luke\Desktop\scraper\imager finder.py", line 97, in
with urllib.request.urlopen(req) as response:
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
response = self._open(req, data)
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
'_open', req)
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
result = func(*args)
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1268, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1243, in do_open
r = h.getresponse()
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1174, in getresponse
response.begin()
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 282, in begin
version, status, reason = self._read_status()
File "C:\Users\luke\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 264, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine:
My advice is to use urlib2. In addition, I've written a nice function (I think) that will also allow gzip encoding (reduce bandwidth) if the server supports it. I use this for downloading social media files, but should work for anything.
I would try to debug your code, but since it's just a snippet (and the error messages are formatted badly), it's hard to know exactly where your error is occurring (it's certainly not line 97 in your code snippet).
This isn't as short as it could be, but it's clear and reusable. This is python 2.7, it looks like you're using 3 - in which case you google some other questions that address how to use urllib2 in python 3.
import urllib2
import gzip
from StringIO import StringIO
def download(url):
"""
Download and return the file specified in the URL; attempt to use
gzip encoding if possible.
"""
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
try:
response = urllib2.urlopen(request)
except Exception, e:
raise IOError("%s(%s) %s" % (_ERRORS[1], url, e))
payload = response.read()
if response.info().get('Content-Encoding') == 'gzip':
buf = StringIO(payload)
f = gzip.GzipFile(fileobj=buf)
payload = f.read()
return payload
def save_media(filename, media):
file_handle = open(filename, "wb")
file_handle.write(media)
file_handle.close()
title_2 = "10-cloverfield-lane"
media = download("http://www.joblo.com/timthumb.php?src=/posters/images/full/{}-poster1.jpg&h=333&w=225".format(title_2))
save_media("poster.jpg", media)
Related
I learned how to download a picture from a certain URL with python as:
import urllib
imgurl="http://www.digimouth.com/news/media/2011/09/google-logo.jpg"
resource = urllib.urlopen(imgurl)
output = open("test.jpg","wb")
output.write(resource.read())
output.close()
and it worked well, but when i changed the URL to
imgurl="http://farm1.static.flickr.com/96/242125326_607a826afe_o.jpg"
it did not work, and gave the information
File "face_down.py", line 3, in <module>
resource = urllib2.urlopen(imgurl)
File "D:\Python27\another\Lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "D:\Python27\another\Lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "D:\Python27\another\Lib\urllib2.py", line 449, in _open
'_open', req)
File "D:\Python27\another\Lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "D:\Python27\another\Lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "D:\Python27\another\Lib\urllib2.py", line 1197, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 10060] >
and I tried to open the latter image URL, and it could be shown as the former, I have no idea to solve it~~ help~~~~
You can try using requests module. The response will be some bytes. So, you can iterate over those byte chunks and write to the file.
import requests
url = "http://farm1.static.flickr.com/96/242125326_607a826afe_o.jpg"
r = requests.get(url)
path = "filename.jpg"
with open(path, 'wb') as f:
for chunk in r:
f.write(chunk)
I looked up both of the addresses and the second one does not lead anywhere. That is probably the problem.
import urllib
imgurl="webpage url"
openimg = urllib.urlopen(imgurl) #opens image (prepares it)
img = open("test.jpg","wb") #opens the img to read it
img.write(openimg.read()) #prints it to the console
img.close() #closes img
Try the link again in your webpage and if it turns up with "webpage not available" that is probably the problem.
EDIT: I've majorly edited the content of this post since the original to specify my problem:
I am writing a program to download webcomics, and I'm getting this weird error when downloading a page of the comic. The code I am running essentially boils down to the following line followed by the error. I do not know what is causing this error, and it is confusing me greatly.
>>> urllib.request.urlopen("http://abominable.cc/post/47699281401")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 470, in open
response = meth(req, response)
File "/usr/lib/python3.4/urllib/request.py", line 580, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.4/urllib/request.py", line 502, in error
result = self._call_chain(*args)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 685, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python3.4/urllib/request.py", line 464, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 482, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1211, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.4/urllib/request.py", line 1183, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1172, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 1014, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 37-38: ordinal not in range(128)
The entirety of my program can be found here: https://github.com/nstephenh/pycomic
I was having the same problem. The root cause is that the remote server isn't playing by the rules. HTTP Headers are supposed to be US-ASCII only but apparently the leading http webservers (apache2, nginx) doesn't care and send direct UTF-8 encoded string.
However in http.client the parse_header function fetch the headers as iso-8859, and the default HTTPRedirectHandler in urllib doesn't care to quote the location or URI header, resulting in the aformentioned error.
I was able to 'work around' both thing by overriding the default HTTPRedirectHandler and adding three line to counter the latin1 decoding and add a path quote:
import urllib.request
from urllib.error import HTTPError
from urllib.parse import (
urlparse, quote, urljoin, urlunparse)
class UniRedirectHandler(urllib.request.HTTPRedirectHandler):
# Implementation note: To avoid the server sending us into an
# infinite loop, the request object needs to track what URLs we
# have already seen. Do this by adding a handler-specific
# attribute to the Request object.
def http_error_302(self, req, fp, code, msg, headers):
# Some servers (incorrectly) return multiple Location headers
# (so probably same goes for URI). Use first header.
if "location" in headers:
newurl = headers["location"]
elif "uri" in headers:
newurl = headers["uri"]
else:
return
# fix a possible malformed URL
urlparts = urlparse(newurl)
# For security reasons we don't allow redirection to anything other
# than http, https or ftp.
if urlparts.scheme not in ('http', 'https', 'ftp', ''):
raise HTTPError(
newurl, code,
"%s - Redirection to url '%s' is not allowed" % (msg, newurl),
headers, fp)
if not urlparts.path:
urlparts = list(urlparts)
urlparts[2] = "/"
else:
urlparts = list(urlparts)
# Header should only contain US-ASCII chars, but some servers do send unicode data
# that should be quoted back before reused
# Need to re-encode the string as iso-8859-1 before use of ""quote"" to cancel the effet of parse_header() in http/client.py
urlparts[2] = quote(urlparts[2].encode('iso-8859-1'))
newurl = urlunparse(urlparts)
newurl = urljoin(req.full_url, newurl)
# XXX Probably want to forget about the state of the current
# request, although that might interact poorly with other
# handlers that also use handler-specific request attributes
new = self.redirect_request(req, fp, code, msg, headers, newurl)
if new is None:
return
# loop detection
# .redirect_dict has a key url if url was previously visited.
if hasattr(req, 'redirect_dict'):
visited = new.redirect_dict = req.redirect_dict
if (visited.get(newurl, 0) >= self.max_repeats or
len(visited) >= self.max_redirections):
raise HTTPError(req.full_url, code,
self.inf_msg + msg, headers, fp)
else:
visited = new.redirect_dict = req.redirect_dict = {}
visited[newurl] = visited.get(newurl, 0) + 1
# Don't close the fp until we are sure that we won't use it
# with HTTPError.
fp.read()
fp.close()
return self.parent.open(new, timeout=req.timeout)
http_error_301 = http_error_303 = http_error_307 = http_error_302
[...]
# Change default Redirect Handler in urllib, should be done once at the beginning of the program
opener = urllib.request.build_opener(UniRedirectHandler())
urllib.request.install_opener(opener)
This is python3 code but should be easily adapted for python2 if need be.
I try to send POST data from a Python program to a PHP file that uses basic HTTP authentication. I run this code:
import urllib.parse
from urllib.request import urlopen
path="https://username:password#url_to_my_file.php"
path=path.encode('utf8')
data=urllib.parse.urlencode({"Hello":"There"})
data=data.encode('utf8')
req=urlopen(path,mydata)
req.add_header("Content-type","application/x-www-form-urlencoded")
page=urllib.urlopen(req).read()
I got this error:
req.data=data
AttributeError: 'bytes' object has not attribute 'data'
How can I fix this bug ?
UPDATE:
Following the solution below, I changed my code this way:
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener, Request
import urllib
url="https://www.my_website.com/file.php"
path="http://my_username:my_password#https://www.my_website.com/file.php"
mydata=urllib.parse.urlencode({"Hello":"Test"})
pwmgr = HTTPPasswordMgrWithDefaultRealm()
pwmgr.add_password(None, url, 'my_username', 'my_password')
authhandler = HTTPBasicAuthHandler(pwmgr)
opener = build_opener(authhandler)
req = Request(path, mydata)
req.add_header("Content-type","application/x-www-form-urlencoded")
page = opener.open(req).read()
I got these errors:
Traceback (most recent call last):
File "/usr/local/python3.1.3/lib/python3.1/http/client.py", line 673, in _set_hostport
port = int(host[i+1:])
ValueError: invalid literal for int() with base 10: ''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "a1.py", line 17, in <module>
page = opener.open(req).read()
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 350, in open
response = self._open(req, data)
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 368, in _open
'_open', req)
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 328, in _call_chain
result = func(*args)
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 1112, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/python3.1.3/lib/python3.1/urllib/request.py", line 1065, in do_open
h = http_class(host, timeout=req.timeout) # will parse host:port
File "/usr/local/python3.1.3/lib/python3.1/http/client.py", line 655, in __init__
self._set_hostport(host, port)
File "/usr/local/python3.1.3/lib/python3.1/http/client.py", line 675, in _set_hostport
raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
http.client.InvalidURL: nonnumeric port: ''
You opened the URL twice. First with:
req=urlopen(path,mydata)
Then again with:
page=urllib.urlopen(req).read()
If you wanted to create a separate Request object, do so:
from urllib.request import urlopen, Request
req = Request(path, mydata)
req.add_header("Content-type","application/x-www-form-urlencoded")
page = urlopen(req).read()
Note that you should not encode the URL; it should be a str value.
urllib.request will also not parse authentication information from the URL; you'll need to provide that separately by using a password manager:
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
url = "https://url_to_my_file.php"
pwmgr = HTTPPasswordMgrWithDefaultRealm()
pwmgr.add_password(None, url, 'username', 'password')
authhandler = HTTPBasicAuthHandler(pwmgr)
opener = build_opener(authhandler)
req = Request(path, mydata)
req.add_header("Content-type","application/x-www-form-urlencoded")
page = opener.open(req).read()
I have a bunch of files in wav. I made a simple script to convert them to flac so I can use it with the google speech api. Here is the python code:
import urllib2
url = "https://www.google.com/speech-api/v1/recognize?client=chromium&lang=en-US"
audio = open('somefile.flac','rb').read()
headers={'Content-Type': 'audio/x-flac; rate=16000', 'User-Agent':'Mozilla/5.0'}
request = urllib2.Request(url, data=audio, headers=headers)
response = urllib2.urlopen(request)
print response.read()
However I am getting this error:
Traceback (most recent call last):
File "transcribe.py", line 7, in <module>
response = urllib2.urlopen(request)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 392, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 370, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1194, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1161, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 32] Broken pipe>
I thought at first that it was because the file is too big. But I recorded myself for 5 seconds and it still does the same.
I dont think google ha released the api yet so it's hard to understand why its failing.
Is there any other good speech-to-text api out there that can be used in either Python or Node?
----- Editing for my attempt with requests:
import json
import requests
url = 'https://www.google.com/speech-api/v1/recognize?client=chromium&lang=en-US'
data = {'file': open('file.flac', 'rb')}
headers = {'Content-Type': 'audio/x-flac; rate=16000', 'User-Agent':'Mozilla/5.0'}
r = requests.post(url, data=data, headers=headers)
# r = requests.post(url, files=data, headers=headers) ## does not work either
# r = requests.post(url, data=open('file.flac', 'rb').read(), headers=headers) ## does not work either
print r.text
Produced the same problem as above.
The API Accepts HTTP POST requests. You're using a HTTP GET Request here. This can be confirmed by loading the URI in your code directly into a browser:
HTTP method GET is not supported by this URL
Error 405
Also, i'd recommend using the requests python library. See http://www.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
Lastly, it seems that the API only accepts segments up to 15 seconds long. Perhaps your error is the file is too large? If you can upload an example flac file, perhaps we could diagnose further.
First, I'm not a python programmer. I'm an old C dog that's learned new Java and PHP tricks, but python looks like a pretty cool language.
I'm getting an error that I can't quite follow. The error follows the code below.
import httplib, urllib
url = "pdb-services-beta.nipr.com"
xml = '<?xml version="1.0"?><!DOCTYPE SCB_Request SYSTEM "http://www.nipr.com/html/SCB_XML_Request.dtd"><SCB_Request Request_Type="Create_Report"><SCB_Login_Data CustomerID="someuser" Passwd="somepass" /><SCB_Create_Report_Request Title=""><Producer_List><NIPR_Num_List_XML><NIPR_Num NIPR_Num="8980608" /><NIPR_Num NIPR_Num="7597855" /><NIPR_Num NIPR_Num="10166016" /></NIPR_Num_List_XML></Producer_List></SCB_Create_Report_Request></SCB_Request>'
params = {}
params['xmldata'] = xml
headers = {}
headers['Content-type'] = 'text/xml'
headers['Accept'] = '*/*'
headers['Content-Length'] = "%d" % len(xml)
connection = httplib.HTTPSConnection(url)
connection.set_debuglevel(1)
connection.request("POST", "/pdb-xml-reports/scb_xmlclient.cgi", params, headers)
response = connection.getresponse()
print response.status, response.reason
data = response.read()
print data
connection.close
Here's the error:
Traceback (most recent call last):
File "C:\Python27\tutorial.py", line 14, in connection.request("POST", "/pdb-xml-reports/scb_xmlclient.cgi", params, headers)
File "C:\Python27\lib\httplib.py", line 958, in request self._send_request(method, url, body, headers)
File "C:\Python27\lib\httplib.py", line 992, in _send_request self.endheaders(body)
File "C:\Python27\lib\httplib.py", line 954, in endheaders self._send_output(message_body)
File "C:\Python27\lib\httplib.py", line 818, in _send_output self.send(message_body)
File "C:\Python27\lib\httplib.py", line 790, in send self.sock.sendall(data)
File "C:\Python27\lib\ssl.py", line 229, in sendall v = self.send(data[count:])
TypeError: unhashable type
My log file says that the xmldata parameter is empty.
Any ideas?
I guess params has to be a string when passing to .request, this would explain the error, due to the fact, that a hash is not hashable
Try to encode your params first with
params = urllib.urlencode(params)
You can find another code example too at the bottom of:
http://docs.python.org/release/3.1.5/library/http.client.html
Thanks for the feedback.
I guess I was making this too hard. I went a different route and it seems to work.
import urllib2
URL = "https://pdb-services-beta.nipr.com/pdb-xml-reports/scb_xmlclient.cgi"
DATA = 'xmldata=<?xml version="1.0"?><!DOCTYPE SCB_Request SYSTEM "http://www.nipr.com/html/SCB_XML_Request.dtd"><SCB_Request Request_Type="Create_Report"><SCB_Login_Data CustomerID="someuser" Passwd="somepass" /><SCB_Create_Report_Request Title=""><Producer_List><NIPR_Num_List_XML><NIPR_Num NIPR_Num="8980608" /></NIPR_Num_List_XML></Producer_List></SCB_Create_Report_Request></SCB_Request>'
req = urllib2.Request(url=URL, data=DATA)
f = urllib2.urlopen(req)
print f.read()