I have the following code in Python3
import urllib.request
f = urllib.request.urlopen("https://www.okcoin.cn/api/v1/trades.do?since=0")
a = f.read() # there is data here
print(a.decode()) # error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I can get a readable result for https://www.okcoin.cn/api/v1/trades.do?since=0 in a browser. The browser confirms the encoding is UTF-8.
What am I missing?
Thanks
Downloading the data with wget reveals that the data is actually
compressed with gzip. So you need to decompress it first. There’s a
gzip module that should be useful.
Edit: try this.
import urllib.request
import gzip
f = urllib.request.urlopen("https://www.okcoin.cn/api/v1/trades.do?since=0")
a = f.read() # there is data here
uncompressed = gzip.decompress(a)
print(uncompressed.decode())
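A slightly more defensive variant, in case the server ever stops compressing: check the Content-Encoding response header before decompressing. A minimal sketch of the idea:

import urllib.request
import gzip

f = urllib.request.urlopen("https://www.okcoin.cn/api/v1/trades.do?since=0")
body = f.read()
# Only decompress when the server actually sent gzip
if f.headers.get('Content-Encoding') == 'gzip':
    body = gzip.decompress(body)
print(body.decode('utf-8'))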
Why not use the requests module?
import requests
f = requests.get("https://www.okcoin.cn/api/v1/trades.do?since=0")
a = f.text
print(a)
Works fine for me :)
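requests works here because it advertises gzip support in its request headers and transparently decompresses the body before exposing it as .text. A quick sketch to see that, assuming the server still compresses its responses:

import requests

r = requests.get("https://www.okcoin.cn/api/v1/trades.do?since=0")
print(r.request.headers.get('Accept-Encoding'))  # e.g. 'gzip, deflate'
print(r.headers.get('Content-Encoding'))         # 'gzip' if compressed on the wire
print(r.text[:80])                               # already decompressed for you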
As I mentioned in my comment on @Yuval Pruss's answer, the requests module handles compressed data implicitly. urllib3 does the same thing, since it has built-in support for gzip and deflate encoding. Here is a demonstration:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request("GET", "https://www.okcoin.cn/api/v1/trades.do?since=0")
>>> r.headers['content-encoding']
'gzip'
>>>
>>> import json
>>> if r.status == 200:
...     json_data = json.loads(r.data.decode('utf-8'))
...     print(json_data[0])
...
{'date_ms': 1489842827000, 'tid': 7368887285, 'date': 1489842827, 'price': '7236.01', 'amount': '1.081', 'type': 'sell'}
Related
So I'm trying to get a CSV file with requests and save it to my project:
import requests
import pandas as pd
import csv
def get_and_save_countries():
    url = 'https://www.trackcorona.live/api/countries'
    r = requests.get(url)
    data = r.json()
    data = data["data"]
    with open("corona/dash_apps/finished_apps/apicountries.csv", "w", newline="") as f:
        title = "location,country_code,latitude,longitude,confirmed,dead,recovered,updated".split(",")
        cw = csv.DictWriter(f, title, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        cw.writeheader()
        cw.writerows(data)
I've managed that but when I try this:
get_data.get_and_save_countries()
df = pd.read_csv("corona\\dash_apps\\finished_apps\\apicountries.csv")
I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
And I have no idea why. Any help is welcome. Thanks.
Try:
with open("corona/dash_apps/finished_apps/apicountries.csv", "w", newline="", encoding='utf-8') as f:
to explicitly specify UTF-8 as the encoding of the file you are writing.
When you write to a file, the default encoding is locale.getpreferredencoding(False). On Windows that is usually not UTF-8, and even on Linux the terminal can be configured with something other than UTF-8. Pandas defaults to UTF-8 when reading, so pass encoding='utf-8' as another parameter to open when writing.
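A quick sanity check, using only the standard library, to see which default your platform would pick:

import locale

# The encoding open() uses when you don't pass encoding=
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows setups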
I was looking at this codegolf problem, and decided to try taking the python solution and use urllib instead. I modified some sample code for manipulating json with urllib:
import urllib.request
import json
res = urllib.request.urlopen('http://api.stackexchange.com/questions?sort=hot&site=codegolf')
res_body = res.read()
j = json.loads(res_body.decode("utf-8"))
This gives:
➜ codegolf python clickbait.py
Traceback (most recent call last):
  File "clickbait.py", line 7, in <module>
    j = json.loads(res_body.decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
If you go to: http://api.stackexchange.com/questions?sort=hot&site=codegolf and click under "Headers" it says charset=utf-8. Why is it giving me these weird results with urlopen?
res_body is gzipped; decompressing the response is not something urllib takes care of for you (the Stack Exchange API compresses every response it sends). You'll have your data once you decompress what the API server returned.
import urllib.request
import zlib
import json
with urllib.request.urlopen(
        'http://api.stackexchange.com/questions?sort=hot&site=codegolf'
) as res:
    decompressed_data = zlib.decompress(res.read(), 16 + zlib.MAX_WBITS)

j = json.loads(decompressed_data.decode('utf-8'))
print(j)
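As a side note, json.loads has accepted bytes directly since Python 3.6 (it detects the UTF encoding itself), so json.loads(decompressed_data) would also work; the encoding= keyword, by contrast, was removed from json.loads in Python 3.9, which is why the snippet above decodes explicitly.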
I'm getting this response when I open this url:
r = Request(r'http://airdates.tv/')
h = urlopen(r).readline()
print(h)
Response:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xed\xbdkv\xdbH\x96.\xfa\xbbj\x14Q\xaeuJ\xce\xee4E\x82\xa4(9m\xe7\xd2\xd3VZ\xaf2e\xab2k\xf5\xc2\n'
What encoding is this?
Is there a way to decode it based on the standard library?
Thank you in advance for any insight on this matter!
PS: It seems to be gzip.
It's gzip compressed HTML, as you suspected.
Rather than use urllib, use requests, which will decompress the response for you:
import requests
r = requests.get('http://airdates.tv/')
print(r.text)
You can install it with pip install requests, and never look back.
If you really must restrict yourself to the standard library, then decompress it with the gzip module after checking the Content-Encoding response header.

Python 3:

import gzip
import urllib.request

f = urllib.request.urlopen('http://airdates.tv/')
# Determine the content encoding from the response headers
content_encoding = f.headers.get('Content-Encoding')
if content_encoding == 'gzip':
    response = gzip.decompress(f.read())

Python 2:

import gzip
import urllib2
from cStringIO import StringIO

f = urllib2.urlopen('http://airdates.tv/')
content_encoding = f.headers.get('Content-Encoding')
if content_encoding == 'gzip':
    gz = gzip.GzipFile(fileobj=StringIO(f.read()))
    response = gz.read()
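Note that urllib doesn't send an Accept-Encoding: gzip header on its own, so a gzipped body here means the server compresses responses regardless of what the client asks for; checking Content-Encoding as above keeps the code correct either way.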
mhawke's solution (using requests instead of urllib) works perfectly and in most cases should be preferred.
That said, I was looking for a solution that does not require installing 3rd party libraries (hence my choice of urllib over requests).
I found a solution using standard libraries:
import zlib
from urllib.request import Request, urlopen
r = Request(r'http://airdates.tv/')
h = urlopen(r).read()
decomp_gzip = zlib.decompress(h, 16+zlib.MAX_WBITS)
print(decomp_gzip)
Which yields the following response:
b'<!DOCTYPE html>\n (continues...)'
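The 16 + zlib.MAX_WBITS argument tells zlib to expect a gzip wrapper (header and trailing checksum) rather than a bare zlib stream; gzip.decompress(h) is an equivalent, arguably more readable, spelling of the same operation.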
I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
In [1]: import requests
In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
In [3]: response = requests.get(url)
In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
...: f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2 f.write(response.text)
3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
In [5]: import codecs
In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
...: f.write(response.text)
...:
I know it is a codec problem of some kind but I can't seem to get it to work.
You should use response.content in this case:
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
From the documentation:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
So that means: response.text returns the output as a string object; use it when you're downloading a text file, such as an HTML page. And response.content returns the output as a bytes object; use it when you're downloading a binary file, such as a PDF, an audio file, or an image.
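A quick sketch that makes the distinction visible, assuming the URL from the question is still serving a real PDF:

import requests

response = requests.get('http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf')
print(type(response.text))     # <class 'str'>  - decoded text, lossy for binary data
print(type(response.content))  # <class 'bytes'> - raw bytes, what a PDF needs
print(response.content[:5])    # b'%PDF-' is the PDF magic number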
You can also use response.raw instead; use it when the file you're about to download is large. Below is a basic example, which you can also find in the documentation:

import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000  # bytes fetched per iteration
r = requests.get(url, stream=True)
with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)

chunk_size is the chunk size you want to use. If you set it to 2000, requests will download the file 2000 bytes at a time, writing each chunk to the file until the download is finished. This can save RAM. But I'd prefer to use response.content in this case, since your file is small. As you can see, using response.raw is more involved.
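For completeness, the pattern the requests documentation suggests for response.raw pairs it with shutil.copyfileobj, so the file streams to disk without ever being held in memory; a sketch for the same URL, assuming the server doesn't gzip the PDF (r.raw is not decoded for you):

import requests
import shutil

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)
with open('/tmp/metadata.pdf', 'wb') as fd:
    # r.raw is the undecoded socket stream; copy it straight to disk
    shutil.copyfileobj(r.raw, fd)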
Related:
How to download large file in python with requests.py?
How to download image using requests
In Python 3, I find pathlib is the easiest way to do this. Requests' response.content marries up nicely with pathlib's write_bytes.
from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
You can use urllib:
import urllib.request
urllib.request.urlretrieve(url, "filename.pdf")
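Note that the Python docs list urlretrieve under the module's legacy interface, so if you'd rather stay on the non-legacy surface of the standard library, a minimal urlopen equivalent looks like this:

import urllib.request

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
with urllib.request.urlopen(url) as response, open('filename.pdf', 'wb') as out:
    out.write(response.read())  # read the body bytes and write them to disk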
Please note I'm a beginner. If my solution is wrong, please feel free to correct it and/or let me know. I may learn something new too.
My solution:
Change downloadPath according to where you want your file to be saved. Feel free to use an absolute path too.
Save the below as downloadFile.py.
Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension
Remember to add an extension!
Example usage: python downloadFile.py http://www.google.co.uk google.html
import requests
import sys
import os

def downloadFile(url, fileName):
    # Fetch the file and write its raw bytes to disk
    with open(fileName, "wb") as file:
        response = requests.get(url)
        file.write(response.content)

scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
os.makedirs(downloadPath, exist_ok=True)  # make sure the target folder exists
url = sys.argv[1]
fileName = sys.argv[2]

print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, downloadPath + fileName)
print('file downloaded...')
print('exiting program...')
Generally, this should work in Python 3:

import urllib.request
data = urllib.request.urlopen(url).read()

Note that urllib.request has no get function; urlopen and urlretrieve are what the module provides. Remember also that urllib2 does not exist in Python 3; it was merged into urllib.request. If in some mysterious case requests doesn't work (it has happened to me), you can also try the wget package:

import wget
wget.download(url)
Related:
Here's a decent explanation/solution to find and download all PDF files on a webpage:
https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
Regarding Kevin's answer, to write into a tmp folder inside the project it should be like this:

with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

He forgot the . before the path, and of course your tmp folder should already have been created.
I am trying to rewrite code previously written for Python 2.7 into Python 3.4. I get the error zipfile.BadZipFile: File is not a zip file in the line zipfile = ZipFile(StringIO(zipdata)) in the code below.
import csv
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
import pandas as pd
import os
from zipfile import ZipFile
from pprint import pprint, pformat
import urllib.request
import urllib.parse
try:
import urllib.request as urllib2
except ImportError:
import urllib2
my_url = 'http://www.bankofcanada.ca/stats/results/csv'
data = urllib.parse.urlencode({"lookupPage": "lookup_yield_curve.php",
                               "startRange": "1986-01-01",
                               "searchRange": "all"})
# request = urllib2.Request(my_url, data)
# result = urllib2.urlopen(request)
binary_data = data.encode('utf-8')
req = urllib.request.Request(my_url, binary_data)
result = urllib.request.urlopen(req)
zipdata = result.read().decode("utf-8",errors="ignore")
zipfile = ZipFile(StringIO(zipdata))
df = pd.read_csv(zipfile.open(zipfile.namelist()[0]))
df = pd.melt(df, id_vars=['Date'])
df.rename(columns={'variable': 'Maturity'}, inplace=True)
Thank You
You shouldn't be decoding the data you get back in the result. The data is the bytes of the zip file itself, not bytes that are the encoding of a Unicode string. I think your confusion arises because in Python 2 there is no such distinction, but in Python 3 you need a BytesIO, not a StringIO.
So that part of your code should read:

from io import BytesIO

zipdata = result.read()
zipfile = ZipFile(BytesIO(zipdata))
df = pd.read_csv(zipfile.open(zipfile.namelist()[0]))
The data you are getting back is not UTF-8 encoded, so you can't decode it that way. You would have found that out more easily if you hadn't specified errors="ignore", which is seldom a good idea ...
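One extra debugging aid for cases like this: zipfile can tell you up front whether a blob of bytes looks like a zip archive at all. A small sketch of the idea, reusing result and pd from the snippet above:

from io import BytesIO
from zipfile import ZipFile, is_zipfile

zipdata = result.read()
buffer = BytesIO(zipdata)
if is_zipfile(buffer):
    zf = ZipFile(buffer)
    df = pd.read_csv(zf.open(zf.namelist()[0]))
else:
    # A real zip archive starts with the magic bytes b'PK\x03\x04'
    print('Not a zip archive; first bytes:', zipdata[:4])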