Using configparser for text read from URL using urllib2 - python

I have to read a text INI file from a URL. [this is required]
res = urllib2.urlopen(URL)
inifile = res.read()
Then I want to use this the same way I would use any INI file read from disk.
config = ConfigParser.SafeConfigParser()
config.read( inifile )
But it looks like I can't use it this way, since inifile is a string rather than a filename.
Can anybody suggest a way around this?

You want the parser's readfp method -- presumably, you might even be able to get away with:
res = urllib2.urlopen(URL)
config = ConfigParser.SafeConfigParser()
config.readfp(res)
assuming that urllib2.urlopen returns an object that is sufficiently file-like (i.e. it has a readline method). For easier debugging, you could do:
config.readfp(res, URL)
If you have to read the data from a string, you can pack the whole thing into an io.StringIO (or StringIO.StringIO) buffer and read from that:
import io
res = urllib2.urlopen(URL)
inifile_text = res.read().decode('utf-8')  # io.StringIO needs unicode text
inifile = io.StringIO(inifile_text)
config = ConfigParser.SafeConfigParser()
config.readfp(inifile)
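On Python 3, a minimal sketch of the same idea (assuming the INI file is served as UTF-8; the URL below is only a placeholder) can hand the downloaded text straight to configparser.read_string:
import configparser
import urllib.request

URL = "http://example.com/settings.ini"  # placeholder URL for illustration

with urllib.request.urlopen(URL) as res:
    text = res.read().decode("utf-8")    # assumption: the file is UTF-8 encoded

config = configparser.ConfigParser()
config.read_string(text)                 # parse directly from the downloaded string
print(config.sections())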

Related

Python - How to read the content of an URL twice?

I am using urllib.request.urlopen to read the content of an HTML page. Afterwards, I want to write the content to a local file and then do a certain operation on it (e.g. construct a parser for that page, e.g. with BeautifulSoup).
The problem
After reading the content the first time (and writing it into a file), I can't read the content a second time in order to do something with it (e.g. construct a parser on it). It is just empty, and I can't move the cursor (seek(0)) back to the beginning.
import urllib.request
response = urllib.request.urlopen("http://finance.yahoo.com")
file = open( "myTestFile.html", "w")
file.write( response.read() ) # Tried response.readlines(), but that did not help me
#Tried: response.seek() but that did not work
print( response.read() ) # Actually, I want something done here... e.g. construct a parser:
# BeautifulSoup(response).
# Anyway this is an empty result
file.close()
How can I fix it?
Thank you very much!
You cannot read the response twice. But you can easily reuse the saved content:
content = response.read()
file.write(content)
print(content)
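Put together, a minimal sketch of that idea (my expansion of the snippet above; note that response.read() returns bytes in Python 3, so the file is opened in binary mode):
import urllib.request

response = urllib.request.urlopen("http://finance.yahoo.com")
content = response.read()            # read once; this exhausts the response

with open("myTestFile.html", "wb") as f:
    f.write(content)                 # content is bytes, hence mode "wb"

# reuse the same bytes for any further processing, e.g.
# BeautifulSoup(content, "html.parser") if bs4 is installed
print(content[:200])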

Python Error: iterator should return strings, not bytes (did you open the file in text mode?) [duplicate]

I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from the National Library of Medicine FTP site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.
I have tried the following code:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]
It results in the following error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
data = [row for row in csvfile]
File "<pyshell#4>", line 1, in <listcomp>
data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
I presume I should be working with strings, not bytes? Any help with this simple problem, and an explanation of what is going wrong, would be greatly appreciated.
The problem is that urllib returns bytes. As proof, you can download the csv file with your browser and open it as a regular file, and the problem goes away.
A similar problem was addressed here.
It can be solved by decoding the bytes to strings with the appropriate encoding. For example:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream.read().decode('utf-8').splitlines()) # decode with the appropriate encoding; splitlines() so the reader gets lines, not single characters
data = [row for row in csvfile]
The last line could also be data = list(csvfile), which can be easier to read.
By the way, since the csv file is very big, this can be slow and memory-consuming. Maybe it would be preferable to use a generator.
EDIT:
Using codecs.iterdecode, as proposed by Steven Rumbalski, so it's not necessary to read the whole file before decoding. Memory consumption is reduced and speed increased.
import csv
import urllib.request
import codecs
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
    print(line)  # do something with line
Note that the list is not created either, for the same reason.
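If you only need to peek at the first few rows of such a large remote file, one option (my own suggestion, not part of the original answer) is to wrap the same reader in itertools.islice so only the needed lines are fetched and decoded:
import codecs
import csv
import itertools
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
reader = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))

# pull just the header row plus the first five data rows
for row in itertools.islice(reader, 6):
    print(row)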
Even though there is already an accepted answer, I thought I'd add to the body of knowledge by showing how I achieved something similar using the requests package (which is sometimes seen as an alternative to urllib.request).
The basis of using codecs.iterdecode() to solve the original problem is still the same as in the accepted answer.
import codecs
from contextlib import closing
import csv
import requests
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
    for row in reader:
        print(row)
Here we also see the use of streaming provided through the requests package, in order to avoid having to load the entire file over the network into memory first (which could take a long time if the file is large).
I thought it might be useful since it helped me, as I was using requests rather than urllib.request in Python 3.6.
Some of the ideas (e.g. using closing()) are taken from this similar post.
I had a similar problem using the requests package and csv.
The response from a POST request was of type bytes.
In order to use the csv library, I first stored the content as an in-memory string file (in my case the size was small), decoded as UTF-8.
import io
import csv
import requests
response = requests.post(url, data)
# response.content is something like:
# b'"City","Awb","Total"\r\n"Bucuresti","6733338850003","32.57"\r\n'
csv_bytes = response.content
# write in-memory string file from bytes, decoded (utf-8)
str_file = io.StringIO(csv_bytes.decode('utf-8'), newline='\n')
reader = csv.reader(str_file)
for row_list in reader:
    print(row_list)
# Once the file is closed,
# any operation on the file (e.g. reading or writing) will raise a ValueError
str_file.close()
Printed something like:
['City', 'Awb', 'Total']
['Bucuresti', '6733338850003', '32.57']
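An alternative sketch (my own suggestion, not part of the answer above) avoids decoding the whole payload by hand: wrap the bytes in io.TextIOWrapper, which decodes lazily as the csv module reads from it:
import csv
import io
import requests

response = requests.post(url, data)  # url and data as in the snippet above
text_stream = io.TextIOWrapper(io.BytesIO(response.content), encoding='utf-8')

for row_list in csv.reader(text_stream):
    print(row_list)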
urlopen will return a urllib.response.addinfourl instance for an ftp request.
For ftp, file, and data urls and requests explicitly handled by legacy
URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as a context manager...
>>> urllib2.urlopen(url)
<addinfourl at 48868168L whose fp = <addclosehook at 48777416L whose fp = <socket._fileobject object at 0x0000000002E52B88>>>
At this point ftpstream is a file-like object; calling .read() would return the contents, but csv.reader requires an iterable of lines in this case.
Defining a generator like so:
def to_lines(f):
    line = f.readline()
    while line:
        yield line
        line = f.readline()
We can create our csv reader like so:
reader = csv.reader(to_lines(ftpstream))
And with a URL such as:
url = "http://pic.dhe.ibm.com/infocenter/tivihelp/v41r1/topic/com.ibm.ismsaas.doc/reference/CIsImportMinimumSample.csv"
The code:
for row in reader: print row
Prints
>>>
['simpleci']
['SCI.APPSERVER']
['SRM_SaaS_ES', 'MXCIImport', 'AddChange', 'EN']
['CI_CINUM']
['unique_identifier1']
['unique_identifier2']
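Note that in Python 3 this generator would still hand bytes to csv.reader, reproducing the original error, so a hedged adaptation (mine, not from the answer above) decodes each line as it is yielded:
import csv
import urllib.request

def to_text_lines(f, encoding='utf-8'):
    # yield decoded text lines from a binary file-like object
    line = f.readline()
    while line:
        yield line.decode(encoding)
        line = f.readline()

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
for row in csv.reader(to_text_lines(ftpstream)):
    print(row)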

Download a file in python with urllib2 instead of urllib

I'm trying to download a tarball file and save it locally with python. With urllib it's pretty simple:
import urllib
import tarfile

urllib.urlretrieve(url, 'compressed_file.tar.gz')
tar = tarfile.open('compressed_file.tar.gz')
print tar.getmembers()
So my question is really simple: What's the way to achieve this using the urllib2 library?
Quoting docs:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a
Request object.
data may be a string specifying additional data to send to the server, or None if no such data is needed.
Nothing in the urlopen interface documentation says that the second argument is the name of a file where the response should be written.
You need to explicitly write the data read from the response to a file:
r = urllib2.urlopen(url)
CHUNK_SIZE = 1 << 20  # 1 MiB
with open('compressed_file.tar.gz', 'wb') as f:
    # the line below downloads the whole file into memory at once and dumps it to disk afterwards:
    # f.write(r.read())
    # the preferable lazy solution is to download and write the data in chunks:
    while True:
        chunk = r.read(CHUNK_SIZE)
        if not chunk:
            break
        f.write(chunk)
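As a side note (my suggestion, not part of the original answer), the same chunked copy can be written with shutil.copyfileobj, which performs the fixed-size read/write loop internally:
import shutil
import urllib2

r = urllib2.urlopen(url)  # url as above
with open('compressed_file.tar.gz', 'wb') as f:
    shutil.copyfileobj(r, f)  # streams in chunks; never loads the whole file into memory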

How to get image file size in python when fetching from URL (before deciding to save)

import urllib.request, io
url = 'http://www.image.com/image.jpg'
path = io.BytesIO(urllib.request.urlopen(url).read())
I'd like to check the file size of the URL image in the file stream path before saving; how can I do this?
Also, I don't want to rely on Content-Length headers; I'd like to fetch the image into a file stream, check the size, and then save it.
You can get the size of the io.BytesIO() object the same way you can get it for any file object: by seeking to the end and asking for the file position:
path = io.BytesIO(urllib.request.urlopen(url).read())
path.seek(0, 2)  # seek 0 bytes from the end
size = path.tell()
path.seek(0)     # rewind so the data can still be read or saved afterwards
However, you could just as easily take the len() of the bytestring you just read, before inserting it into an in-memory file object:
data = urllib.request.urlopen(url).read()
size = len(data)
path = io.BytesIO(data)
Note that this means your image has already been loaded into memory. You cannot use this to prevent loading too large an image object. For that using the Content-Length header is the only option.
If the server uses a chunked transfer encoding to facilitate streaming (so no content length has been set up front), you can use a loop to limit how much data is read.
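A minimal sketch of that size-capped loop (my illustration; MAX_BYTES and CHUNK are made-up limits, not values from the answer):
import io
import urllib.request

MAX_BYTES = 5 * 1024 * 1024   # hypothetical 5 MiB cap
CHUNK = 64 * 1024

response = urllib.request.urlopen(url)  # url as above
buf = io.BytesIO()
while buf.tell() <= MAX_BYTES:
    chunk = response.read(CHUNK)
    if not chunk:
        break
    buf.write(chunk)

if buf.tell() > MAX_BYTES:
    raise ValueError("image exceeds the allowed size limit")

size = buf.tell()
buf.seek(0)  # rewind before saving or decoding the image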
Try importing urllib.request
import urllib.request, io
url = 'http://www.elsecarrailway.co.uk/images/Events/TeddyBear-3.jpg'
path = urllib.request.urlopen(url)
meta = path.info()
>>> meta.get("Content-Length")
'269898' # i.e. about 269 kB
You could ask the server for the content-length information. Using urllib2 (which I hope is available in your python):
req = urllib2.urlopen(url)
meta = req.info()
length_text = meta.getheader("Content-Length")
try:
    length = int(length_text)
except (TypeError, ValueError):
    # length unknown; you may need to read the data to find out
    length = -1

downloading a file, not the contents

I am trying to automate downloading a .Z file from a website, but the file I get is 2 kB when it should be around 700 kB, and it contains a listing of the page's contents (i.e. all the files available for download). I am able to download it manually without a problem. I have tried urllib and urllib2 and different configurations of each, but each does the same thing. I should add that the urlVar and fileName variables are generated in a different part of the code, but I have given an example of each here to demonstrate.
import urllib2
urlVar = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z"
fileName = "txga1000.14d.Z"
downFile = urllib2.urlopen(urlVar)
with open(fileName, "wb") as f:
    f.write(downFile.read())
At least the urllib2 documentation suggests you should use the Request object. This works for me:
import urllib2
req = urllib2.Request("ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z")
response = urllib2.urlopen(req)
data = response.read()
Data length seems to be 740725.
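To double-check that the download produced a full file rather than a directory listing, a small follow-up sketch (my addition, assuming the data variable from the snippet above) writes it out and compares sizes:
import os

with open("txga1000.14d.Z", "wb") as f:
    f.write(data)  # data from the snippet above

print os.path.getsize("txga1000.14d.Z")  # should be roughly 740725 bytes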
I was able to download what seems like the correct size for your file with the following python2 code:
import urllib2
filename = "txga1000.14d.Z"
url = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/{}".format(filename)
reply = urllib2.urlopen(url)
buf = reply.read()
with open(filename, "wb") as fh:
    fh.write(buf)
Edit: The post above me was answered faster and is much better. I thought I'd post this anyway, since I had already tested and written it out.
