I want to fetch an image (GIF format) from a website, so I use Tornado's built-in asynchronous HTTP client to do it. My code is like the following:
import tornado.httpclient
import tornado.ioloop
import tornado.gen
import tornado.web
tornado.httpclient.AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
http_client = tornado.httpclient.AsyncHTTPClient()
class test(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        content = yield http_client.fetch('http://www.baidu.com/img/bdlogo.gif')
        print('=====', type(content.body))

application = tornado.web.Application([
    (r'/', test)
])
application.listen(80)
tornado.ioloop.IOLoop.instance().start()
So when I visit the server, it should fetch a GIF file. However, it raises an exception:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 8: invalid start byte
ERROR:tornado.application:Uncaught exception GET / (127.0.0.1)
HTTPRequest(protocol='http', host='127.0.0.1', method='GET', uri='/', version='HTTP/1.1', remote_ip='127.0.0.1', headers={'Accept-Language': 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3', 'Accept-Encoding': 'gzip, deflate', 'Host': '127.0.0.1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130922 Firefox/17.0', 'Connection': 'keep-alive', 'Cache-Control': 'max-age=0', 'If-None-Match': '"da39a3ee5e6b4b0d3255bfef95601890afd80709"'})
Traceback (most recent call last):
  File "/usr/lib/python3.2/site-packages/tornado/web.py", line 1144, in _when_complete
    if result.result() is not None:
  File "/usr/lib/python3.2/site-packages/tornado/concurrent.py", line 129, in result
    raise_exc_info(self.__exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/usr/lib/python3.2/site-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 550, in inner
    self.set_result(key, result)
  File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 476, in set_result
    self.run()
  File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 505, in run
    yielded = self.gen.throw(*exc_info)
  File "test.py", line 12, in get
    content = yield http_client.fetch('http://www.baidu.com/img/bdlogo.gif')
  File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 496, in run
    next = self.yield_point.get_result()
  File "/usr/lib/python3.2/site-packages/tornado/gen.py", line 395, in get_result
    return self.runner.pop_result(self.key).result()
  File "/usr/lib/python3.2/concurrent/futures/_base.py", line 393, in result
    return self.__get_result()
  File "/usr/lib/python3.2/concurrent/futures/_base.py", line 352, in __get_result
    raise self._exception
tornado.curl_httpclient.CurlError: HTTP 599: Failed writing body (0 != 1024)
ERROR:tornado.access:500 GET / (127.0.0.1) 131.53ms
It seems to attempt to decode my binary file as UTF-8 text, which is unnecessary. If I comment
tornado.httpclient.AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
out, which makes it use the simple HTTP client instead of pycurl, it works well (it tells me that the type of "content" is bytes).
So if it returns a bytes object, why does it try to decode it? I think the problem is pycurl, or the pycurl wrapper in Tornado, right?
My Python version is 3.2.5, with Tornado 3.1.1 and pycurl 7.19.
Thanks!
pycurl 7.19 doesn't support Python 3. Ubuntu (and possibly other Linux distributions) ship a modified version of pycurl that partially works with Python 3, but it doesn't work with Tornado (https://github.com/facebook/tornado/issues/671), and fails with an exception that looks like the one you're seeing here.
Until there's a new version of pycurl that officially supports Python 3 (or you use the change suggested in that Tornado bug report), I'm afraid you'll need to either go back to Python 2.7 or use Tornado's simple_httpclient instead.
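For reference, a minimal sketch of that second workaround: select the pure-Python simple_httpclient explicitly (or simply drop the configure call), and the rest of the code above works unchanged:

import tornado.httpclient

# Use the pure-Python backend instead of the pycurl-based one.
tornado.httpclient.AsyncHTTPClient.configure(
    "tornado.simple_httpclient.SimpleAsyncHTTPClient")
http_client = tornado.httpclient.AsyncHTTPClient()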
Related
I'm trying to download a PDF file using the requests module, the code is as below:
import requests
url = "<url of the pdf>"
r = requests.get(url, stream=True, timeout=(60, 120), headers={'Connection': 'keep-alive','User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136'})
print(r.headers)
print(r.status_code)
try:
    with open('blah.pdf', 'wb') as f:
        for chunk in r:
            # print(chunk)
            f.write(chunk)
except Exception as e:
    print(e)
Output is given below:
{'Cache-Control': 'private', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/pdf', 'Server': 'Microsoft-IIS/7.5', 'X-AspNet-Version': '4.0.30319', 'X-Powered-By': 'ASP.NET', 'Date': 'Wed, 02 Oct 2019 05:17:11 GMT', 'Set-Cookie': 'bbb=rd102o00000000000000000000ffff978433aao80; path=/; Httponly; Secure'}
200
('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
Here is the full stack trace:
Traceback (most recent call last):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 755, in read_chunked
    chunk = self._handle_chunk(amt)
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 709, in _handle_chunk
    self._fp._safe_read(2)  # Toss the CRLF at the end of the chunk.
  File "/storage/anaconda3/lib/python3.7/http/client.py", line 612, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/storage/anaconda3/lib/python3.7/site-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 560, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
    self._original_response.close()
  File "/storage/anaconda3/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    for chunk in r:
  File "/storage/anaconda3/lib/python3.7/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
When I open that PDF in a web browser such as Google Chrome, Chrome's built-in PDF plugin can load it properly, and it is possible to read it in the browser. However, if I try to download it by clicking on the download icon, I get "Failed - Network Error". Firefox can't load/download it either (both Firefox and Chrome are upgraded to the latest version). When I tested it on a Windows machine, Microsoft Edge was able to download the PDF, though...
If I test the above code with some other PDFs, such as this one:
https://adobe.com/content/dam/acom/en/accessibility/products/acrobat/pdfs/acrobat-x-accessibility-checker.pdf
it works perfectly.
I've tried some command-line tools such as curl, wget, and aria2c (with the proper headers set, like a browser request); all failed to download the PDF.
wget output:
connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pdf]
Saving to: ‘blah.pdf’
<pdf_url> [ <=> ] 101.68K 66.1KB/s in 1.5s
2019-10-02 11:29:50 (69.1 KB/s) - Read error at byte 108786 (Success).
The file downloaded using wget is corrupted.
Another thing I've tried is to inspect the transfer using mitmproxy and a chromedriver+selenium combination.
The automated Chrome browser can't load the PDF and shows an error:
502 Bad Gateway
HttpSyntaxException('Malformed chunked body',)
How can I download this PDF using the requests module? Any help will be very much appreciated.
I figured out the issue after a couple of days. The server was closing the connection improperly, so the Python libraries were raising IncompleteRead errors. I managed to download the file using the curl installed on the system, with the --compressed argument and all the necessary headers:
from subprocess import call

pdf_url = ""
pdf_filename = ""

call(["curl", pdf_url,
      '-H', 'Connection: keep-alive',
      '-H', 'Cache-Control: max-age=0',
      '-H', 'Upgrade-Insecure-Requests: 1',
      '-H', 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
      '-H', 'Sec-Fetch-Mode: navigate',
      '-H', 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
      '-H', 'Sec-Fetch-Site: cross-site',
      '-H', 'Accept-Encoding: gzip, deflate, br',
      '-H', 'Accept-Language: en-US,en;q=0.9,bn;q=0.8',
      '-H', 'Cookie: bbb=rd102o00000000000000000000ffff978432aao80',
      '--compressed', '--output', pdf_filename])
This uses the call function from the subprocess module. curl still prints an error message like the one below:
curl: (18) transfer closed with outstanding read data remaining
but the downloaded PDF works and can be opened with any PDF viewer.
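If you'd rather stay in pure requests, a hedged sketch of the same tolerance curl shows here: stream the body and keep whatever arrived before the server dropped the connection. This assumes the bytes received before the disconnect form a usable PDF, as they evidently did for curl:

import requests
from requests.exceptions import ChunkedEncodingError

pdf_url = ""           # same placeholders as above
pdf_filename = "blah.pdf"

r = requests.get(pdf_url, stream=True, timeout=(60, 120))
data = b""
try:
    for chunk in r.iter_content(chunk_size=8192):
        data += chunk
except ChunkedEncodingError:
    pass  # the server closed the connection early; keep the partial body

with open(pdf_filename, 'wb') as f:
    f.write(data)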
I had the same problem as you. I do not know why it happened; it just did. I made this workaround with urllib:
urllib.request.urlretrieve(url, 'foo_file.txt', data=your_queries)
What the urlretrieve function does is fetch the data from the link and make a copy of it under the file name and path you specify as the second argument. You can also change the extension to .pdf, .json, or whatever you need.
You have more info here: https://docs.python.org/3.7/library/urllib.request.html#module-urllib.request
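For example, a minimal sketch for this PDF case, assuming a plain GET with no request body (the URL is a hypothetical placeholder):

import urllib.request

pdf_url = "https://example.com/some.pdf"  # hypothetical URL
urllib.request.urlretrieve(pdf_url, "blah.pdf")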
I am trying to use this Python module: https://github.com/coursera-dl/edx-dl
Please excuse my basic knowledge.
Installed Anaconda 3 on Windows 10, then:
pip install edx-dl
pip install --upgrade youtube-dl
Then, to get the courses, I did:
edx-dl -u user@user.com --list-courses
edx-dl -u user@user.com COURSE_URL
This all worked; however, once downloads actually started, I was getting:
Got SSL/Connection error: HTTP Error 403: Forbidden
Fiddler showed that it was being blocked by Cloudflare, I suspect due to the User-Agent header.
I then installed fake-useragent (https://pypi.python.org/pypi/fake-useragent) and added:
from fake_useragent import UserAgent  # added this

def edx_get_headers():
    """
    Build the Open edX headers to create future requests.
    """
    logging.info('Building initial headers for future requests.')

    headers = {
        'User-Agent': 'edX-downloader/0.01',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Content-Type': 'application/x-www-form-urlencoded;charset=utf-8',
        'Referer': EDX_HOMEPAGE,
        'X-Requested-With': 'XMLHttpRequest',
        'X-CSRFToken': _get_initial_token(EDX_HOMEPAGE),
    }

    ua = UserAgent()  # added this
    headers['User-Agent'] = ua.ie  # added this
It then downloaded a PDF and an XLS but got another error due to request.py adding a header, so I added fake-useragent to requests.py and commented out the default header as below.
from fake_useragent import UserAgent

ub = UserAgent()
self.addheaders = [('User-Agent', ub.ie)]
# self.addheaders = [('User-Agent', self.version), ('Accept', '*/*')]
The new error is below. I can't work out how to troubleshoot further. I suspect it can't find a file/path, possibly due to Windows.
[download] https://youtube.com/watch?v=bKkrDLwDnDE => Downloaded\Implementing_ETL_with_SQL_Server_Integration_Services\02-Module_1__ETL_Processing\01-%(title)s-%(id)s.%(ext)s
Downloading video with URL https://youtube.com/watch?v=bKkrDLwDnDE from YouTube.
Traceback (most recent call last):
  File "edx-dl.py", line 6, in <module>
    edx_dl.main()
  File "c:\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 1080, in main
    download(args, selections, filtered_units, headers)
  File "c:\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 857, in download
    headers)
  File "c:\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 819, in download_unit
    headers)
  File "c:\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 801, in download_video
    skip_or_download(youtube_downloads, headers, args)
  File "c:\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 788, in skip_or_download
    f(url, filename, headers, args)
  File "c:\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 721, in download_url
    download_youtube_url(url, filename, headers, args)
  File "c:\edx-dl-master\edx-dl-master\edx_dl\edx_dl.py", line 761, in download_youtube_url
    execute_command(cmd, args)
  File "c:\edx-dl-master\edx-dl-master\edx_dl\utils.py", line 37, in execute_command
    subprocess.check_call(cmd)
  File "C:\Users\anton\Anaconda3\lib\subprocess.py", line 286, in check_call
    retcode = call(*popenargs, **kwargs)
  File "C:\Users\anton\Anaconda3\lib\subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\anton\Anaconda3\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Users\anton\Anaconda3\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
Same issue as here; however, no resolution or assistance had been provided, so I thought I would try here instead.
https://github.com/coursera-dl/edx-dl/issues/368
Advice on how to learn to troubleshoot this would be appreciated.
Debugged the code and found that it couldn't find youtube-dl.
Checked echo %PATH% and realised I had a path to:
C:...\Anaconda3\ but not to C:...\Anaconda3\Scripts\ (this is the location of youtube_dl.exe).
I had added this path but not rebooted.
Rebooted, and now it's resolved.
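For anyone hitting the same WinError 2, a quick way to confirm this kind of PATH problem from Python itself (a small sketch; shutil.which is available on Python 3.3+):

import shutil

# Prints the full path if the executable can be found, or None if it is
# not on PATH -- which is what subprocess's WinError 2 boils down to.
print(shutil.which("youtube-dl"))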
There is another easy solution with no need for fake-useragent: just use another downloader, like wget.
Install a fresh edx-dl.
If you are on Windows, download wget and save it, for example, on the H drive.
Change the download_url function like this:
def download_url(url, filename, headers, args):
    """
    Downloads the given url in filename.
    """
    if is_youtube_url(url):
        download_youtube_url(url, filename, headers, args)
    else:
        # jcline
        cmd = [r"h:\wget.exe", url, '-c', '-O', filename,
               '--keep-session-cookies', '--no-check-certificate']
        execute_command(cmd, args)
(Source)
import sys
import pdb
import http.client

def PassParse():
    headers = {
        "Accept": " application/json, text/plain, */*",
        "Authorization": " Basic YWRtaW46YXNkZg==",
        "Referer": " http://192.168.1.113:8080/#/apps",
        "Accept-Language": " zh-CN",
        "Accept-Encoding": " gzip, deflate",
        "User-Agent": " Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko LBBROWSER",
        "Host": " 192.168.1.113:8080",
        "DNT": " 1",
        "Connection": " Keep-Alive",
    }

    conn = http.client.HTTPConnection("192.168.1.113:8080")
    conn.request(method="Get", url="/api/v1/login", body=None, headers=headers)
    response = conn.getresponse()
    responseText = response.getheaders("content-lentgh")

    print("succ!^_^!")
    # print(response.status)
    print(responseText)
    conn.close()
Runtime error:
Traceback (most recent call last):
  File "F:\Python\test1-3.4.py", line 32, in <module>
    PassParse();
  File "F:\Python\test1-3.4.py", line 24, in PassParse
    response = conn.getresponse();
  File "E:\program files\Python 3.4.3\lib\http\client.py", line 1171, in getresponse
    response.begin()
  File "E:\program files\Python 3.4.3\lib\http\client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "E:\program files\Python 3.4.3\lib\http\client.py", line 333, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: <html>
It seems that the HTTP server didn't return a valid HTTP response; you can use telnet to check it:
telnet 192.168.1.113 8080
then send:
GET /api/v1/login HTTP/1.1
reference: https://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html
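Equivalently, a small Python sketch (reusing the host and port from the question) that shows the raw bytes the server actually sends back:

import socket

# Open a raw TCP connection and issue the same request telnet would.
sock = socket.create_connection(("192.168.1.113", 8080), timeout=5)
sock.sendall(b"GET /api/v1/login HTTP/1.1\r\nHost: 192.168.1.113:8080\r\n\r\n")

# If the first line printed is not something like "HTTP/1.1 200 OK",
# http.client will raise BadStatusLine exactly as in the traceback above.
print(sock.recv(4096).decode("utf-8", errors="replace"))
sock.close()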
From the Python docs for httplib:
exception httplib.BadStatusLine
A subclass of HTTPException. Raised if a server responds with a HTTP status code that we don’t understand.
Sounds like your API might not be returning something that can be parsed as a valid HTTP response with a valid HTTP status code. You might want to check that the code for your API endpoint is working as expected and is not failing.
Besides that, your code runs fine except for one thing: response.getheader() takes an argument, while response.getheaders() takes no arguments, so Python will complain about that once you resolve the BadStatusLine exception.
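To illustrate the difference once a valid response does come back (note the question also misspells the header name as "content-lentgh"):

# getheader() looks up a single header by name:
content_length = response.getheader("Content-Length")

# getheaders() takes no arguments and returns all (name, value) pairs:
all_headers = response.getheaders()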
I have solved the problem using the following code:

import requests
from requests.auth import HTTPBasicAuth

res = requests.get('http://192.168.1.113:8080/api/v1/login', auth=(username, password))
I am really confused. I'm new to Python, and I am working on a script that scrapes a website for products, on Python 2.7. I am trying to use urllib2 to do this, and when I run the script it prints multiple traceback errors. Suggestions?
Script:
import urllib2, zlib, json

url = 'https://launches.endclothing.com/api/products'
req = urllib2.Request(url)
req.add_header(':host', 'launches.endclothing.com')
req.add_header(':method', 'GET')
req.add_header(':path', '/api/products')
req.add_header(':scheme', 'https')
req.add_header(':version', 'HTTP/1.1')
req.add_header('accept', 'application/json, text/plain, */*')
req.add_header('accept-encoding', 'gzip,deflate')
req.add_header('accept-language', 'en-US,en;q=0.8')
req.add_header('cache-control', 'max-age=0')
req.add_header('cookie', '__/')
req.add_header('user-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36')

resp = urllib2.urlopen(req).read()
resp = zlib.decompress(bytes(bytearray(resp)), 15 + 32)
data = json.loads(resp)

for product in data:
    for attrib in product.keys():
        print str(attrib) + ' :: ' + str(product[attrib])
    print '\n'
Error(s):
C:\Users\Luke>py C:\Users\Luke\Documents\EndBot2.py
Traceback (most recent call last):
  File "C:\Users\Luke\Documents\EndBot2.py", line 5, in <module>
    resp = urllib2.urlopen(req).read()
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 391, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 409, in _open
    '_open', req)
  File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1181, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "C:\Python27\lib\urllib2.py", line 1148, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 1] _ssl.c:499: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error>
You're running into issues with the SSL configuration of your request. I'm sorry, but I won't correct your code, because we're in 2016 and there's a wonderful library you should use instead: requests.
Its usage is pretty simple:
>>> user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
>>> result = requests.get('https://launches.endclothing.com/api/products', headers={'user-agent': user_agent})
>>> result
<Response [200]>
>>> result.json()
[{u'name': u'Adidas Consortium x HighSnobiety Ultraboost', u'colour': u'Grey', u'id': 30, u'releaseDate': u'2016-04-09T00:01:00+0100', …
You'll notice that I changed the user-agent in the previous query to get it working because, weirdly enough, the website refuses API access to requests' default user-agent:
>>> result = requests.get('https://launches.endclothing.com/api/products')
>>> result
<Response [403]>
>>> result.text
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data. What can I do to resolve this? If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware. If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.
Otherwise, now that you've tried requests and your life has changed, you might still run into this issue again. As you can read in many places on the internet, this is related to SNI and outdated libraries, and you might get headaches trying to figure it out. My best advice would be to upgrade to Python 3, as the problem is likely to be solved by installing a fresh vanilla version of Python and the libraries involved.
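If you must stay on Python 2 for now, a commonly suggested workaround is to let urllib3 do TLS through pyOpenSSL, which can send SNI during the handshake. This is a hedged sketch, assuming the extra packages are installed first with pip install pyopenssl ndg-httpsclient pyasn1:

# Monkey-patch urllib3 to use pyOpenSSL for TLS (SNI-capable on Python 2).
import urllib3.contrib.pyopenssl
urllib3.contrib.pyopenssl.inject_into_urllib3()

import requests

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
result = requests.get('https://launches.endclothing.com/api/products',
                      headers={'user-agent': user_agent})
print(result.status_code)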
HTH
I'm building a Tornado web application using the coroutine feature. To test the performance of the app, I used Siege. However, many 503 errors occurred when Siege called the URL. By the way, my app was running on a Raspberry Pi.
The snippet of my app:
import os
import tornado.web
import tornado.gen
import tornado.ioloop
import tornado.options
import tornado.httpclient
tornado.options.define("port", default=8888, help="message...", type=int)
url='http://..../'
class Async(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        client = tornado.httpclient.AsyncHTTPClient()
        response = yield client.fetch(url)
        self.write('%r' % response.body)

def main():
    # Command line
    tornado.options.parse_command_line()

    app = tornado.web.Application(
        [
            (r'/async', Async)
        ],
        debug=True,
        autoreload=True
    )

    app.listen(tornado.options.options.port)
    tornado.ioloop.IOLoop.current().start()

if __name__ == "__main__":
    main()
And the command:
siege 192.168.1.x:8888/async -c 5 -r 5
And the error message:
[E 150822 18:47:15 web:1908] 500 GET /async (192.168.1.3) 2045.97ms
[I 150822 18:47:15 web:1908] 200 GET /async (192.168.1.3) 3317.43ms
[E 150822 18:47:16 web:1496] Uncaught exception GET /async (192.168.1.3)
HTTPServerRequest(protocol='http', host='192.168.1.x:8888', method='GET', uri='/async', version='HTTP/1.1', remote_ip='192.168.1.3', headers={'Host': '192.168.1.9:8888', 'User-Agent': 'Mozilla/5.0 (apple-x86_64-darwin14.4.0) Siege/3.1.0', 'Connection': 'close', 'Accept': '*/*', 'Accept-Encoding': 'gzip'})
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tornado/web.py", line 1415, in _execute
    result = yield result
  File "/usr/lib/python2.7/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python2.7/site-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "server.py", line 19, in get
    response = yield client.fetch(url)
  File "/usr/lib/python2.7/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib/python2.7/site-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
HTTPError: HTTP 503: Service Temporarily Unavailable
[E 150822 18:47:16 web:1908] 500 GET /async (192.168.1.x) 3645.07ms
So, did I omit some settings?
I would very much appreciate it if you could point out what's wrong with my app.
Thank you so much.
I changed the URL to http://www.google.com, and no 503 errors occurred. I also tested my app on a Mac, and it worked well. Therefore, I think those errors came from the site I was fetching data from.
Anyway, thank you so much, SDilmac.
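Since the 503s were coming from the upstream site rather than from Tornado itself, one possible refinement of the handler above is to catch the fetch error instead of letting it surface as a 500. This is only a sketch, not part of the original app (it assumes the same module-level url variable as in the question):

import tornado.gen
import tornado.httpclient
import tornado.web

class Async(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        client = tornado.httpclient.AsyncHTTPClient()
        try:
            response = yield client.fetch(url)
        except tornado.httpclient.HTTPError as e:
            # The upstream site refused or rate-limited the request;
            # report a gateway problem instead of an uncaught exception.
            self.set_status(502)
            self.finish('upstream error: %s' % e)
            return
        self.write('%r' % response.body)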