Here is my scenario: I have a zip file that I am downloading with requests into memory rather than writing a file. I am unzipping the data into an object called myzipfile. Inside the zip file is a csv file. I would like to convert each row of the csv data into a dictionary. Here is what I have so far.
import csv
import zipfile
from io import BytesIO
import requests
# other imports etc.

r = requests.get(url=fileurl, headers=headers, stream=True)
filebytes = BytesIO(r.content)
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    mycsv = myzipfile.open(name).read()
    for row in csv.DictReader(mycsv):  # it fails here.
        print(row)
errors:
Traceback (most recent call last):
File "/usr/lib64/python3.7/csv.py", line 98, in fieldnames
self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
Looks like csv.DictReader(mycsv) expects a file object instead of raw data. How do I convert the rows in the mycsv object data (<class 'bytes'>) to a list of dictionaries? I'm trying to accomplish this without writing a file to disk and working directly from csv objects in memory.
The DictReader expects a file or file-like object: we can satisfy this expectation by loading the zipped file into an io.StringIO instance.
Note that StringIO expects its argument to be a str, but reading a file from the zipfile returns bytes, so the data must be decoded. This example assumes the csv was encoded as UTF-8, which is the default for bytes.decode(); if that is not the case, the correct encoding must be passed to decode().
import csv
import io

for name in myzipfile.namelist():
    data = myzipfile.open(name).read().decode()
    mycsv = io.StringIO(data)
    reader = csv.DictReader(mycsv)
    for row in reader:
        print(row)
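If the files are large, a streaming alternative worth considering is io.TextIOWrapper, which decodes the zip member on the fly instead of reading the whole thing into memory first. A minimal sketch, assuming the csv is UTF-8 encoded:

import csv
import io

for name in myzipfile.namelist():
    with myzipfile.open(name) as raw:
        # TextIOWrapper decodes the binary stream lazily, line by line;
        # newline='' is the csv module's recommended setting
        text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
        for row in csv.DictReader(text):
            print(row)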
dict_list = []  # a list
reader = csv.DictReader(open('yourfile.csv', 'r', newline=''))
for line in reader:  # since we used DictReader, each line will be saved as a dictionary
    dict_list.append(line)
I'm working on a new library which will allow the user to parse any file (xlsx, csv, json, tar, zip, txt) into generators.
Now I'm stuck at zip archives: when I try to parse a csv from one, I get
io.UnsupportedOperation: seek immediately after elem.seek(0). The csv file is a simple one, 4x4 rows and columns. If I parse the csv using the csv_parser I get what I want, but trying to parse it from a zip archive... boom. Error!
with open("/Users/ro/Downloads/archive_file/csv.zip", 'r') as my_file_file:
asd = parse_zip(my_file_file)
print asd
Where parse_zip is:
def parse_zip(element):
    """Function for manipulating zip files"""
    try:
        my_zip = zipfile.ZipFile(element, 'r')
    except zipfile.BadZipfile:
        raise err.NestedArchives(element)
    else:
        my_file = my_zip.open('corect_csv.csv')
        # print my_file
        my_mime = csv_tsv_parser.parse_csv_tsv(my_file)
        print list(my_mime)
And parse_csv_tsv is:
def _csv_tsv_parser(element):
    """Helper function for csv and tsv files that returns a generator"""
    for row in element:
        if any(s for s in row):
            yield row

def parse_csv_tsv(elem):
    """Function for manipulating all the csv files"""
    dialect = csv.Sniffer().sniff(elem.readline())
    elem.seek(0)
    data_file = csv.reader(elem, dialect)
    read_data = _csv_tsv_parser(data_file)
    yield '', read_data
Where am I wrong? Is the way I'm opening the file OK or...?
ZipFile.open returns a file-like ZipExtFile object that inherits from io.BufferedIOBase. In the Python versions this code targets (before 3.7), ZipExtFile does not implement seek, hence the exception.
However, ZipExtFile does provide a peek method, which will return a number of bytes without moving the file pointer. So changing
dialect = csv.Sniffer().sniff(elem.readline())
elem.seek(0)
to
num_bytes = 128 # number of bytes to read
dialect = csv.Sniffer().sniff(elem.peek(n=num_bytes))
solves the problem.
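In Python 3 the same idea looks roughly like the sketch below (archive.zip and data.csv are placeholder names). Note that peek() returns bytes, so the sample must be decoded before sniffing, and the lines fed to csv.reader must be decoded as well:

import csv
import zipfile

with zipfile.ZipFile('archive.zip') as my_zip:
    with my_zip.open('data.csv') as elem:
        # peek() returns buffered bytes without advancing the file pointer
        dialect = csv.Sniffer().sniff(elem.peek(128).decode())
        # csv.reader needs str lines, so decode each bytes line on the fly
        for row in csv.reader((line.decode() for line in elem), dialect):
            print(row)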
Sample file object data contains the following:
b'QmFyY29kZSxRdHkKQTIzMjMsMTAKQTIzMjQsMTUKNjUxMDA1OTUzMjkyNSwxMgpBMjMyNCwxCkEyMzI0LDEKQTIzMjMsMTAK'
And the Python file contains the following code:
string_data = BytesIO(base64.decodestring(csv_rec))
read_file = csv.reader(string_data, quotechar='"', delimiter=',')
next(read_file)
When I run the above code in Python, I get the following exception:
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
How can I open bytes data in text mode?
You are almost there. Indeed, csv.reader expects an iterator which returns strings (not bytes). Such an iterator is provided by BytesIO's sibling, io.StringIO.
import base64
import csv
from io import StringIO

csv_rec = b'QmFyY29kZSxRdHkKQTIzMjMsMTAKQTIzMjQsMTUKNjUxMDA1OTUzMjkyNSwxMgpBMjMyNCwxCkEyMzI0LDEKQTIzMjMsMTAK'
# base64.decodebytes replaces the deprecated decodestring (removed in Python 3.9)
bytes_data = base64.decodebytes(csv_rec)
# decode() method is used to decode bytes to string
string_data = StringIO(bytes_data.decode())
read_file = csv.reader(string_data, quotechar='"', delimiter=',')
next(read_file)
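For reference, the sample above decodes to a small Barcode/Qty table, so after next(read_file) has consumed the header row, iterating yields the data rows:

for row in read_file:
    print(row)
# ['A2323', '10']
# ['A2324', '15']
# ...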
I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from the National Library of Medicine ftp site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.
I have tried the following code:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]
It results in the following error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
data = [row for row in csvfile]
File "<pyshell#4>", line 1, in <listcomp>
data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
I presume I should be working with strings not bytes? Any help with this simple problem, and an explanation as to what is going wrong, would be greatly appreciated.
The problem is that urllib returns bytes. As proof, you can download the csv file with your browser and open it as a regular file, and the problem is gone.
A similar problem was addressed here.
It can be solved by decoding the bytes to strings with the appropriate encoding. For example:
import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
# decode with the appropriate encoding, then split into lines: csv.reader
# needs an iterable of strings, and iterating a plain str yields characters
csvfile = csv.reader(ftpstream.read().decode('utf-8').splitlines())
data = [row for row in csvfile]
The last line could also be data = list(csvfile), which is easier to read.
By the way, since the csv file is very big, this approach can be slow and memory-consuming. It might be preferable to use a generator.
EDIT:
Using codecs as proposed by Steven Rumbalski, it's not necessary to read the whole file in order to decode it. Memory consumption is reduced and speed is increased.
import csv
import urllib.request
import codecs

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
    print(line)  # do something with line
Note that the list is not created here either, for the same reason.
Even though there is already an accepted answer, I thought I'd add to the body of knowledge by showing how I achieved something similar using the requests package (which is sometimes seen as an alternative to urllib.request).
The basis of using codecs.iterdecode() to solve the original problem is still the same as in the accepted answer.
import codecs
from contextlib import closing
import csv
import requests

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
    for row in reader:
        print(row)
Here we also see the use of streaming provided through the requests package in order to avoid having to load the entire file over the network into memory first (which could take long if the file is large).
I thought it might be useful since it helped me, as I was using requests rather than urllib.request in Python 3.6.
Some of the ideas (e.g. using closing()) are picked up from this similar post.
I had a similar problem using the requests package and csv.
The response from the post request was of type bytes.
In order to use the csv library, I first stored the data as an in-memory string file (in my case the size was small), decoded as utf-8.
import io
import csv
import requests

response = requests.post(url, data)
# response.content is something like:
# b'"City","Awb","Total"\r\n"Bucuresti","6733338850003","32.57"\r\n'
csv_bytes = response.content

# write in-memory string file from bytes, decoded (utf-8)
str_file = io.StringIO(csv_bytes.decode('utf-8'), newline='\n')

reader = csv.reader(str_file)
for row_list in reader:
    print(row_list)

# Once the file is closed,
# any operation on the file (e.g. reading or writing) will raise a ValueError
str_file.close()
Printed something like:
['City', 'Awb', 'Total']
['Bucuresti', '6733338850003', '32.57']
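Since io.StringIO also supports the context manager protocol, a with statement can handle the closing automatically; for example:

import csv
import io

with io.StringIO(csv_bytes.decode('utf-8')) as str_file:
    for row_list in csv.reader(str_file):
        print(row_list)
# str_file is closed here; further operations on it would raise ValueError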
urlopen will return a urllib.response.addinfourl instance for an ftp request.
For ftp, file, and data URLs and requests explicitly handled by legacy
URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager...
>>> urllib2.urlopen(url)
<addinfourl at 48868168L whose fp = <addclosehook at 48777416L whose fp = <socket._fileobject object at 0x0000000002E52B88>>>
At this point ftpstream is a file-like object; using .read() would return the contents. However, csv.reader requires an iterable in this case:
Defining a generator like so:
def to_lines(f):
    line = f.readline()
    while line:
        yield line
        line = f.readline()
We can create our csv reader like so:
reader = csv.reader(to_lines(ftpstream))
And with a url
url = "http://pic.dhe.ibm.com/infocenter/tivihelp/v41r1/topic/com.ibm.ismsaas.doc/reference/CIsImportMinimumSample.csv"
The code:
for row in reader: print row
Prints
>>>
['simpleci']
['SCI.APPSERVER']
['SRM_SaaS_ES', 'MXCIImport', 'AddChange', 'EN']
['CI_CINUM']
['unique_identifier1']
['unique_identifier2']
I'm trying to loop through rows in a csv file. I get the csv file as a string from a web location. I know how to create a csv.reader using with when the data is stored in a file. What I don't know is how to get the rows using csv.reader without storing the string to a file. I'm using Python 2.7.12.
I've tried to create a StringIO object like this:
from StringIO import StringIO

csv_data = "some_string\nfor_example"
with StringIO(csv_data) as input_file:
    csv_reader = reader(csv_data, delimiter=",", quotechar='"')
However, I'm getting this error:
Traceback (most recent call last):
File "scraper.py", line 228, in <module>
with StringIO(csv_data) as input_file:
AttributeError: StringIO instance has no attribute '__exit__'
I understand that the StringIO class doesn't have an __exit__ method, which is called when the with block finishes doing whatever it does with the object.
My question is: how do I do this correctly? I suppose I could alter the StringIO class by subclassing it and adding an __exit__ method, but I suspect that there is an easier solution.
Update:
Also, I've tried different combinations that came to my mind:
with open(StringIO(csv_data)) as input_file:
with csv_data as input_file:
but, of course, none of those worked.
>>> import csv
>>> csv_data = "some,string\nfor,example"
>>> result = csv.reader(csv_data.splitlines())
>>> list(result)
[['some', 'string'], ['for', 'example']]
You should use the io module instead of the StringIO one, because io.BytesIO (for byte strings) and io.StringIO (for Unicode strings) both support the context manager interface and can be used in with statements:
from io import BytesIO
from csv import reader

csv_data = "some_string\nfor_example"
with BytesIO(csv_data) as input_file:
    csv_reader = reader(input_file, delimiter=",", quotechar='"')
    for row in csv_reader:
        print row
If you like context managers, you can use tempfile instead:
import tempfile
from csv import reader

with tempfile.NamedTemporaryFile(mode='w') as t:
    t.write('csv_data')
    t.seek(0)
    csv_reader = reader(open(t.name), delimiter=",", quotechar='"')
Compared with passing the string's splitlines directly to the csv reader, the advantage is that you can write a file of any size and then safely read it in the csv reader without memory issues.
The file will be closed and deleted automatically.
I am trying to read CSV files from a directory that is not in the same directory as my Python script.
Additionally the CSV files are stored in ZIP folders that have the exact same names (the only difference being one ends with .zip and the other is a .csv).
Currently I am using Python's zipfile and csv libraries to open and get the data from the files; however, I am getting the error:
Traceback (most recent call last):
  File "write_pricing_data.py", line 13, in <module>
    for row in reader:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
My code:
import os, csv
from zipfile import *

folder = r'D:/MarketData/forex'
localFiles = os.listdir(folder)

for file in localFiles:
    zipArchive = ZipFile(folder + '/' + file)
    with zipArchive.open(file[:-4] + '.csv') as csvFile:
        reader = csv.reader(csvFile, delimiter=',')
        for row in reader:
            print(row[0])
How can I resolve this error?
It's a bit of a kludge and I'm sure there's a better way (that just happens to elude me right now). If you don't have embedded new lines, then you can use:
import zipfile, csv

zf = zipfile.ZipFile('testing.csv.zip')
with zf.open('testing.csv', 'r') as fin:
    # Create a generator of decoded lines for input to csv.reader
    # (the csv module is only really happy with ASCII input anyway...)
    lines = (line.decode('ascii') for line in fin)
    for row in csv.reader(lines):
        print(row)
print(row)