I am trying to download a file and write it to disk, but somehow I am lost in encoding/decoding land.
from urllib.request import urlopen

url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    data = response.read()

filename = 'test.txt'
file_ = open(filename, 'wb')
file_.write(data)
file_.close()
Here data is a byte string. If I check the file I find a bunch of strange characters. I tried
import chardet
the_encoding = chardet.detect(data)['encoding']
but this returns None, so I don't really know how the data I downloaded is encoded.
If I just type "http://export.arxiv.org/e-print/supr-con/9608001" into the browser, it downloads a file that I can view with a text editor and it's a perfectly fine .tex file.
Use the python-magic library.
python-magic is a Python interface to the libmagic file type identification library. libmagic identifies file types by checking their headers according to a predefined list of file types. This functionality is exposed to the command line by the Unix command file.
Commented script (works on Windows 10, Python 3.8.6):
# stage #1: read raw data from the url
from urllib.request import urlopen
import gzip

url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    rawdata = response.read()

# stage #2: detect the raw data type by its signature
print("file signature", rawdata[0:2])
import magic
print(magic.from_buffer(rawdata[0:1024]))

# stage #3: decompress the raw data and write it to a file
data = gzip.decompress(rawdata)
filename = 'test.tex'
file_ = open(filename, 'wb')
file_.write(data)
file_.close()

# stage #4: detect the encoding of the data ( == encoding of the written file)
import chardet
print(chardet.detect(data))
Result: .\SO\68307124.py
file signature b'\x1f\x8b'
gzip compressed data, was "9608001.tex", last modified: Thu Aug 8 04:57:44 1996, max compression, from Unix
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
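As a further cross-check (not part of the script above), the HTTP response headers can hint at the payload type before any bytes are inspected; a minimal sketch, where the exact header values the arXiv server returns are an assumption:

# optional cross-check: ask the server what it claims to be sending
from urllib.request import urlopen

url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    # Content-Type may name a tar/gzip type; Content-Encoding, if set,
    # signals transparent compression that browsers undo automatically
    print(response.headers.get("Content-Type"))
    print(response.headers.get("Content-Encoding"))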
Related
Hey, I just did some research and found that I could download images from URLs that end with filename.extension, like 000000.jpeg. I now wonder how I could download a picture whose URL doesn't have any extension.
Here is the URL of the image I want to download: http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api
When I put the URL directly into the browser, it displays an image.
Furthermore, here is what I tried:
from six.moves import urllib
thumbnail='http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api'
img=urllib.request.Request(thumbnail)
pic=urllib.request.urlopen(img)
pic=urllib.request.urlopen(img).read()
Any help will be much appreciated.
This is a way to do it using the HTTP response headers:
import requests
import time

r = requests.get("http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api", stream=True)
ext = r.headers['content-type'].split('/')[-1]  # convert the response's MIME type to an extension (may not work with everything)
with open("%s.%s" % (time.time(), ext), 'wb') as f:  # open the file to write as binary - replace 'wb' with 'w' for text files
    for chunk in r.iter_content(1024):  # iterate over the stream in 1 KB chunks
        f.write(chunk)  # write each chunk to the file
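If the content-type does not split cleanly into a usable extension, the standard-library mimetypes module can do the lookup instead; a small sketch along the same lines (the '.bin' fallback and "thumbnail" filename are just assumptions):

import mimetypes
import requests

r = requests.get("http://books.google.com/books/content?id=i2xKGwAACAAJ&printsec=frontcover&img=1&zoom=1&source=gbs_api", stream=True)
mime = r.headers['content-type'].split(';')[0].strip()  # drop any "; charset=..." suffix
ext = mimetypes.guess_extension(mime) or '.bin'         # fall back to '.bin' if the type is unknown
with open("thumbnail" + ext, 'wb') as f:
    for chunk in r.iter_content(1024):
        f.write(chunk)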
I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from the National Library of Medicine FTP site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.
I have tried the following code:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]
It results in the following error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
data = [row for row in csvfile]
File "<pyshell#4>", line 1, in <listcomp>
data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
I presume I should be working with strings not bytes? Any help with the simple problem, and an explanation as to what is going wrong would be greatly appreciated.
The problem lies in urllib returning bytes. As proof, you can download the csv file with your browser and open it as a regular file, and the problem is gone.
A similar problem was addressed here.
It can be solved by decoding the bytes to strings with the appropriate encoding. For example:
import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
# decode the bytes with the appropriate encoding and split into lines,
# since csv.reader expects an iterable of text lines
csvfile = csv.reader(ftpstream.read().decode('utf-8').splitlines())
data = [row for row in csvfile]
The last line could also be: data = list(csvfile) which can be easier to read.
By the way, since the csv file is very big, this can be slow and memory-consuming. It may be preferable to use a generator, as in the edit below.
EDIT:
Using codecs.iterdecode, as proposed by Steven Rumbalski, so it's not necessary to read the whole file in order to decode it. Memory consumption is reduced and speed is increased.
import csv
import urllib.request
import codecs

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
    print(line)  # do something with line
Note that the full list is not built either, for the same reason.
Even though there is already an accepted answer, I thought I'd add to the body of knowledge by showing how I achieved something similar using the requests package (which is sometimes seen as an alternative to urllib.request).
The basis of using codecs.iterdecode() to solve the original problem is still the same as in the accepted answer.
import codecs
from contextlib import closing
import csv
import requests

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
    for row in reader:
        print(row)
Here we also see the use of streaming provided by the requests package, in order to avoid having to load the entire file over the network into memory first (which could take a long time if the file is large).
I thought it might be useful since it helped me, as I was using requests rather than urllib.request in Python 3.6.
Some of the ideas (e.g. using closing()) were picked up from this similar post.
I had a similar problem using the requests package and csv.
The response content from the POST request was of type bytes.
In order to use the csv library, I first stored the content as an in-memory string file (in my case the size was small), decoded as utf-8.
import io
import csv
import requests

response = requests.post(url, data)
# response.content is something like:
# b'"City","Awb","Total"\r\n"Bucuresti","6733338850003","32.57"\r\n'
csv_bytes = response.content

# build an in-memory string file from the bytes, decoded (utf-8)
str_file = io.StringIO(csv_bytes.decode('utf-8'), newline='\n')

reader = csv.reader(str_file)
for row_list in reader:
    print(row_list)

# Once the file is closed,
# any operation on it (e.g. reading or writing) will raise a ValueError
str_file.close()
Printed something like:
['City', 'Awb', 'Total']
['Bucuresti', '6733338850003', '32.57']
urlopen will return a urllib.response.addinfourl instance for an ftp request.
For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object which can work as context manager...
>>> urllib2.urlopen(url)
<addinfourl at 48868168L whose fp = <addclosehook at 48777416L whose fp = <socket._fileobject object at 0x0000000002E52B88>>>
At this point ftpstream is a file-like object; calling .read() would return the whole contents, but csv.reader requires an iterable of lines in this case.
We can define a generator like so:
def to_lines(f):
    line = f.readline()
    while line:
        # decode each bytes line so csv.reader receives text (Python 3)
        yield line.decode('utf-8')
        line = f.readline()
We can create our csv reader like so:
reader = csv.reader(to_lines(ftpstream))
And with a URL such as:
url = "http://pic.dhe.ibm.com/infocenter/tivihelp/v41r1/topic/com.ibm.ismsaas.doc/reference/CIsImportMinimumSample.csv"
The code:
for row in reader: print(row)
Prints
>>>
['simpleci']
['SCI.APPSERVER']
['SRM_SaaS_ES', 'MXCIImport', 'AddChange', 'EN']
['CI_CINUM']
['unique_identifier1']
['unique_identifier2']
I'm trying to download a tarball file and save it locally with Python. I thought it would be pretty simple with urllib2:
import urllib2
import tarfile

urllib2.urlopen(url, 'compressed_file.tar.gz')
tar = tarfile.open('compressed_file.tar.gz')
print tar.getmembers()
This doesn't work, though. So my question is really simple: what's the way to achieve this using the urllib2 library?
Quoting the docs:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
data may be a string specifying additional data to send to the server, or None if no such data is needed.
Nothing in the urlopen interface documentation says that the second argument is the name of a file where the response should be written.
You need to explicitly write the data read from the response to a file:
r = urllib2.urlopen(url)
CHUNK_SIZE = 1 << 20

with open('compressed_file.tar.gz', 'wb') as f:
    # the line below downloads the whole file to memory at once and dumps it to the file afterwards
    # f.write(r.read())

    # below is the preferable lazy solution - download and write the data in chunks
    while True:
        chunk = r.read(CHUNK_SIZE)
        if not chunk:
            break
        f.write(chunk)
I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
In [1]: import requests
In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
In [3]: response = requests.get(url)
In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
...: f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2 f.write(response.text)
3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
In [5]: import codecs
In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
...: f.write(response.text)
...:
I know it is a codec problem of some kind but I can't seem to get it to work.
You should use response.content in this case:
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
From the documentation:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
So that means: response.text returns the output as a string object; use it when you're downloading a text file, such as an HTML file, etc.
And response.content returns the output as a bytes object; use it when you're downloading a binary file, such as a PDF file, audio file, image, etc.
You can also use response.raw or a streamed request instead. However, use this when the file you're about to download is large. Below is a basic example, which you can also find in the documentation:
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000  # bytes per chunk; pick a value that suits you
r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
chunk_size is the chunk size you want to use. If you set it to 2000, then requests will download the file 2000 bytes at a time, writing each chunk into the file, over and over, until it finishes.
So this can save your RAM. But I'd prefer to use response.content in this case, since your file is small. As you can see, the streaming / response.raw approach is more complex.
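For completeness, the response.raw route mentioned above is usually paired with shutil; a minimal sketch, assuming a streamed requests call:

import shutil
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as f:
    # copy the underlying urllib3 stream straight to disk
    # without holding the whole body in memory
    shutil.copyfileobj(r.raw, f)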
Related:
How to download large file in python with requests.py?
How to download image using requests
In Python 3, I find pathlib is the easiest way to do this. Requests' response.content pairs up nicely with pathlib's write_bytes.
from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
You can use urllib:
import urllib.request
urllib.request.urlretrieve(url, "filename.pdf")
Please note I'm a beginner. If my solution is wrong, please feel free to correct it and/or let me know. I may learn something new too.
My solution:
Change downloadPath according to where you want your file to be saved. Feel free to use an absolute path too.
Save the below as downloadFile.py.
Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension
Remember to add an extension!
Example usage: python downloadFile.py http://www.google.co.uk google.html
import requests
import sys
import os

def downloadFile(url, fileName):
    with open(fileName, "wb") as file:
        response = requests.get(url)
        file.write(response.content)

scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]

print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, downloadPath + fileName)
print('file downloaded...')
print('exiting program...')
Generally, this should work in Python 3:
import urllib.request
..
data = urllib.request.urlopen(url).read()
Remember that the Python 2 urllib and urllib2 modules were reorganized; in Python 3 the equivalent functionality lives in urllib.request.
If in some mysterious cases requests doesn't work (it happened to me), you can also try using
import wget
wget.download(url)
Related:
Here's a decent explanation/solution to find and download all pdf files on a webpage:
https://medium.com/#dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
Regarding Kevin's answer about writing to a tmp folder, it should be like this:
with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
He forgot the . before the path, and of course your tmp folder should already have been created.
I'm using Requests to upload a PDF to an API. The result is stored as "response" below. I'm trying to write that out to an Excel file.
import requests
files = {'f': ('1.pdf', open('1.pdf', 'rb'))}
response = requests.post("https://pdftables.com/api?&format=xlsx-single",files=files)
response.raise_for_status() # ensure we notice bad responses
file = open("out.xls", "w")
file.write(response)
file.close()
I'm getting the error:
file.write(response)
TypeError: expected a character buffer object
I believe all the existing answers contain the relevant information, but I would like to summarize.
The response object that is returned by requests get and post operations contains two useful attributes:
Response attributes
response.text - Contains str with the response text.
response.content - Contains bytes with the raw response content.
You should choose one or the other of these attributes depending on the type of response you expect.
For text-based responses (html, json, yaml, etc) you would use response.text
For binary-based responses (jpg, png, zip, xls, etc) you would use response.content.
Writing response to file
When writing responses to file you need to use the open function with the appropriate file write mode.
For text responses you need to use "w" - plain write mode.
For binary responses you need to use "wb" - binary write mode.
Examples
Text request and save
# Request the HTML for this web page:
response = requests.get("https://stackoverflow.com/questions/31126596/saving-response-from-requests-to-file")

with open("response.txt", "w") as f:
    f.write(response.text)
Binary request and save
# Request the profile picture of the OP:
response = requests.get("https://i.stack.imgur.com/iysmF.jpg?s=32&g=1")

with open("response.jpg", "wb") as f:
    f.write(response.content)
Answering the original question
The original code should work by using wb and response.content:
import requests
files = {'f': ('1.pdf', open('1.pdf', 'rb'))}
response = requests.post("https://pdftables.com/api?&format=xlsx-single",files=files)
response.raise_for_status() # ensure we notice bad responses
file = open("out.xls", "wb")
file.write(response.content)
file.close()
But I would go further and use the with context manager for open.
import requests

with open('1.pdf', 'rb') as file:
    files = {'f': ('1.pdf', file)}
    response = requests.post("https://pdftables.com/api?&format=xlsx-single", files=files)
    response.raise_for_status()  # ensure we notice bad responses

with open("out.xls", "wb") as file:
    file.write(response.content)
You can use response.text to write to a file:
import requests

files = {'f': ('1.pdf', open('1.pdf', 'rb'))}
response = requests.post("https://pdftables.com/api?&format=xlsx-single", files=files)
response.raise_for_status()  # ensure we notice bad responses

with open("resp_text.txt", "w") as file:
    file.write(response.text)
As Peter already pointed out:
In [1]: import requests
In [2]: r = requests.get('https://api.github.com/events')
In [3]: type(r)
Out[3]: requests.models.Response
In [4]: type(r.content)
Out[4]: str
You may also want to check r.text.
Also: https://2.python-requests.org/en/latest/user/quickstart/