What is the 'Python way' of working with a CSV file? If I want to run some methods on the data in a particular column, should I copy the whole thing into an array, or should I pass the open file into a series of methods?
I tried to return the open file and got this error:
ValueError: I/O operation on closed file
Here's the code:
import sys
import os
import csv

def main():
    pass

def openCSVFile(CSVFile, openMode):
    with open(CSVFile, openMode) as csvfile:
        zipreader = csv.reader(csvfile, delimiter=',')
        return zipreader

if __name__ == '__main__':
    zipfile = openCSVFile('propertyOutput.csv', 'rb')
    numRows = sum(1 for row in zipfile)
    print "Rows equals %d." % numRows
Well, there are many ways you could go about manipulating CSV files. It depends
largely on how big your data is and how often you will perform these operations.
I will build on the already good answers and comments to present a somewhat more
complex approach, one that wouldn't be far off from a real-world example.
First of all, I prefer csv.DictReader because most CSV files have a header
row with the column names. csv.DictReader takes advantage of that and lets
you grab each cell's value by its column name.
Also, most of the time you need to perform various validation and normalization
operations on said data, so we're going to associate some functions with specific
columns.
Suppose we have a CSV file with information about products.
e.g.
Product Name,Release Date,Price
foo product,2012/03/23,99.9
awesome product,2013/10/14,40.5
.... and so on ........
Let's write a program to parse it and normalize the values
into appropriate native Python objects.
import csv
import datetime
from decimal import Decimal

def stripper(value):
    # Strip any whitespace from the left and right
    return value.strip()

def to_decimal(value):
    return Decimal(value)

def to_date(value):
    # We expect dates like: "2013/05/23"
    return datetime.datetime.strptime(value, '%Y/%m/%d').date()

OPERATIONS = {
    'Product Name': [stripper],
    'Release Date': [stripper, to_date],
    'Price': [stripper, to_decimal]
}

def parse_csv(filepath):
    with open(filepath, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            for column in row:
                operations = OPERATIONS[column]
                value = row[column]
                for op in operations:
                    value = op(value)
                # Print the cleaned value or store it somewhere
                print value
Things to note:
1) We operate on the CSV on a line-by-line basis. DictReader yields rows
one at a time, which means we can handle arbitrarily large CSV files,
since we never load the whole file into memory.
2) You can go crazy with normalizing the values of a CSV by building special
classes with magic methods, or whatnot. As I said, it depends on the complexity
of your files, the quality of the data and the operations you need to perform
on them.
Have fun.
The csv module provides one row at a time, interpreting its content by splitting it into a list object (or a dict, in the case of DictReader).
Since Python knows how to loop over such an object, if you're just interested in some specific fields, building a list with those fields seems 'Pythonic' to me. Using an iterator is also valid if each item is to be considered separately from the others.
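For instance, a minimal sketch of collecting one column into a list (the filename data.csv and the column index are invented for illustration):

import csv

with open('data.csv') as csvfile:
    reader = csv.reader(csvfile)
    third_column = [row[2] for row in reader]  # keep only the third field of each row

# third_column is now a plain list you can pass to any function
print(third_column)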
You probably need to read PEP 343: The 'with' statement
Relevant quote:
Some standard Python objects now support the context management protocol and can be used with the 'with' statement. File objects are one example:
with open('/etc/passwd', 'r') as f:
    for line in f:
        print line
        ... more processing code ...
After this statement has executed, the file object in f will have been automatically closed, even if the 'for' loop raised an exception part-way through the block.
So your csvfile is closed outside the with statement, and therefore outside the openCSVFile function. You either need to avoid the with statement:
def openCSVFile(CSVFile, openMode):
    csvfile = open(CSVFile, openMode)
    return csv.reader(csvfile, delimiter=',')
or move it to __main__:
def get_csv_reader(filelike):
    return csv.reader(filelike, delimiter=',')

if __name__ == '__main__':
    with open('propertyOutput.csv', 'rb') as csvfile:
        zipfile = get_csv_reader(csvfile)
        numRows = sum(1 for row in zipfile)
        print "Rows equals %d." % numRows
Firstly, the reason you're getting ValueError: I/O operation on closed file is that the with statement, acting as a context manager, operates on the opened file, which is the underlying file object that zipreader is set up to work on. As soon as the with block is exited, the file that was opened is closed, which leaves it unusable for zipreader to read from:
with open(CSVFile, openMode) as csvfile:
    zipreader = csv.reader(csvfile, delimiter=',')
    return zipreader
Generally, acquire the resource and then pass it to a function if needed. So, in your main program, open the file and create the csv.reader, then pass that reader to something, and close the file in the main program once you're actually done with it.
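A minimal sketch of that pattern, reusing the names from the question (process_rows is an invented helper):

import csv

def process_rows(reader):
    # Works on any csv.reader; the caller owns (and closes) the file
    return sum(1 for row in reader)

if __name__ == '__main__':
    with open('propertyOutput.csv') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        print(process_rows(reader))
    # the file is closed here, once we're actually done with it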
Related
I have the following code which I am using to create and add a new row to a csv file.
def calcPrice(data):
    fieldnames = ["ReferenceID","clientName","Date","From","To","Rate","Price"]
    with open('rec2.csv', 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerow(data)
    return
However, it adds the header as a new row every time as well. How can I prevent this?
Here's a link to the gist with the whole code: https://gist.github.com/chriskinyua/5ff8a527b31451ddc7d7cf157c719bba
You could check if the file already exists
import csv
import os

def calcPrice(data):
    filename = 'rec2.csv'
    write_header = not os.path.exists(filename)
    fieldnames = ["ReferenceID","clientName","Date","From","To","Rate","Price"]
    with open(filename, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(data)
Let's assume there's a function we can call that will tell us whether we should write out the header or not, so the code would look like this:
import csv

def calcPrice(data):
    fieldnames = ["ReferenceID","clientName","Date","From","To","Rate","Price"]
    with open('rec2.csv', 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if should_write_header(csvfile):
            writer.writeheader()
        writer.writerow(data)
What will should_write_header look like? Here are three possibilities. For all of them, we will need to import the io module:
import io
The logic of all these functions is the same: we want to work out if the end of the file is the same as the beginning of the file. If that is true, then we want to write the header row.
This function is the most verbose: it finds the current position using the file's tell method, moves to the beginning of the file using its seek method, then runs tell again to see if the reported positions are the same. If they are not, it seeks back to the end of the file before returning the result. We don't simply compare the value of EOF to zero, because the Python docs state that the result of tell for text files does not necessarily correspond to the actual position of the file pointer.
def should_write_header1(fileobj):
    EOF = fileobj.tell()
    fileobj.seek(0, io.SEEK_SET)
    res = fileobj.tell() == EOF
    if not res:
        fileobj.seek(EOF, io.SEEK_SET)
    return res
This version assumes that while the tell method does not necessarily correspond to the position of the file pointer in general, tell will always return zero for an empty file. This will probably work in common cases.
def should_write_header2(fileobj):
    return fileobj.tell() == 0
This version accesses the tell method of the binary stream that TextIOWrapper (the text file object class) wraps. For binary streams, tell is documented to return the actual file pointer position. This removes the uncertainty of should_write_header2, but unfortunately buffer is not guaranteed to exist in all Python implementations, so this isn't portable.
def should_write_header3(fileobj):
    return fileobj.buffer.tell() == 0
So for 100% certainty, use should_write_header1. For less certainty but shorter code, use one of the others. If performance is a concern, favour should_write_header3, because tell on binary streams is faster than tell on text streams.
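As a quick usage sketch, assuming calcPrice is wired up to one of the checks above (the sample row values below are invented):

row = {"ReferenceID": "1", "clientName": "acme", "Date": "2018-01-01",
       "From": "A", "To": "B", "Rate": "1.0", "Price": "9.9"}
calcPrice(row)
calcPrice(row)  # appends a second data row, but writes no second header
with open('rec2.csv') as f:
    print(f.read())  # one header line followed by two data rows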
I have a generator that yields rows from a CSV file one at a time, something like:
import csv

def as_csv(filename):
    with open(filename) as fin:
        yield from csv.reader(fin)
However, I need to also capture the raw string returned from the file, as this needs to be persisted at the same time.
As far as I can tell, the csv built-in can be used on an ad-hoc basis, something like this:
import csv

def as_csv_and_raw(filename):
    with open(filename) as fin:
        for row in fin:
            raw = row.strip()
            values = next(csv.reader([raw]))  # csv.reader is an iterator, not indexable
            yield (values, raw)
... but this has the overhead of creating a new reader and a new iterable for each row of the file, so on files with millions of rows I'm concerned about the performance impact.
It feels like I could create a coroutine that could interact with the primary function, yielding the parsed fields in a way where I can control the input directly without losing it, something like this:
import csv

def as_csv_and_raw(filename):
    with open(filename) as fin:
        reader = raw_to_csv(some_coroutine())
        next(reader)
        for row in fin:
            raw = row.strip()
            fields = reader.send(raw)
            yield fields, raw

def raw_to_csv(data):
    yield from csv.reader(data)

def some_coroutine():
    # what goes here?
    raise NotImplementedError
I haven't really wrapped my head around coroutines and using yield as an expression, so I'm not sure what goes in some_coroutine, but the intent is that each time I send a value in, that value is run through the csv.reader object and I get the set of fields back.
Can someone provide the implementation of some_coroutine, or alternately show me a better mechanism for getting the desired data?
You can use itertools.tee to create two independent iterators from the iterable file object, create a csv.reader from one of them, and then zip the other iterator with it for output:
import csv
from itertools import tee

def as_csv_and_raw(filename):
    with open(filename) as fin:
        row, raw = tee(fin)
        yield from zip(csv.reader(row), raw)
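A brief usage sketch (sample.csv is an invented filename):

for fields, raw in as_csv_and_raw('sample.csv'):
    print(fields)        # the parsed row, e.g. ['a', 'b', 'c']
    print(raw.rstrip())  # the original line text from the file

Since zip pulls from both iterators in lockstep, tee's internal buffer stays small regardless of file size. One caveat: a quoted field that spans multiple physical lines would make the reader consume more lines than raw yields pairs for, throwing the pairing off.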
My code currently writes a dictionary containing scores for a class to a CSV file. This part is done correctly and the scores are written to file; however, the latest dictionary written to the file is not printed back. For example, after the code has been run once, its data is not printed, but once the code has been run a second time, the first batch of data is printed while the new data isn't. Can someone tell me where I am going wrong?
SortedScores = sorted(Class10x1.items(), key = lambda t: t[0], reverse = True) #this sorts the scores in alphabetical order and by the highest score
FileWriter = csv.writer(open('10x1 Class Score.csv', 'a+'))
FileWriter.writerow(SortedScores) #the sorted scores are written to file
print "Okay here are your scores!\n"
I am guessing the problem is somewhere in here, however I cannot quite pinpoint what or where it is. I have tried to solve this by changing the mode the file is read back in with to r, r+ and rb; however, all have the same result.
ReadFile = csv.reader(open("10x1 Class Score.csv", "r"))  # this opens the file using csv.reader in read mode
for row in ReadFile:
    print row
return
From the Input and Output section of the Python docs:
It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter than writing equivalent try-finally blocks:
>>> with open('workfile', 'r') as f:
...     read_data = f.read()
>>> f.closed
True
File objects have some additional methods, such as isatty() and truncate() which are less frequently used; consult the Library Reference for a complete guide to file objects.
I'm not sure why they bury that so far into the documentation, since it is really useful and forgetting it is a very common beginner mistake:
SortedScores = sorted(Class10x1.items(), key=lambda t: t[0], reverse=True)  # this sorts the scores in alphabetical order and by the highest score
with open('10x1 Class Score.csv', 'a+') as f:
    FileWriter = csv.writer(f)
    FileWriter.writerow(SortedScores)  # the sorted scores are written to file
print "Okay here are your scores!\n"
This will close the file for you even if an error is raised, which prevents many ways of losing data.
The reason it did not appear to write to the file is that writer.writerow() doesn't immediately write to the hard drive; it writes to a buffer that is only occasionally emptied into the file on disk, and with a single write statement the buffer never needed emptying.
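If you do want to force the data onto disk at a specific point while keeping the file open, you can flush the buffer yourself; a minimal sketch:

import csv

f = open('10x1 Class Score.csv', 'a+')
FileWriter = csv.writer(f)
FileWriter.writerow(['example', 'row'])  # invented sample data
f.flush()   # push Python's buffer out to the operating system now
f.close()   # close() also flushes, which is what the with block does for you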
Remember to close the file after the operation; otherwise the data will not be saved properly.
Use the with keyword so that Python will handle closing the file for you:
import csv

with open('10x1 Class Score.csv', 'a+') as f:
    csv_writer = csv.writer(f)
    # write something into the file
    ...

# when the above block is done, the file will be automatically closed
# so that the data is saved properly
I am trying to parse a "pseudo-CSV" file with the Python CSV reader, and am having some doubts about how to add some extra logic. The reason I call it a "pseudo-CSV" file is that some of the lines in the input file have text (30-40 chars) before the actual CSV data starts. I am trying to figure out the best way to remove this text.
Currently, I have found 3 options for removing said text:
1) From Python, call grep and sed and pipe the output to a temp file which can then be fed to the csv reader (ugh, I would like to avoid this option).
2) Create a CSV dialect to remove the unwanted text (this option just feels wrong).
3) Extend the File object, implementing the next() function to remove the unwanted text as necessary.
I have no control over how the input file is generated, so it's not an option to modify the generation.
Here is the related code I had when I realized the problem with the input file.
with open('myFile', 'r') as csvfile:
    theReader = csv.reader(csvfile)
    for row in theReader:
        # my logic here
        pass
If I go with option 3 above, the solution is quite straightforward, but
then I won't be able to incorporate the with open() syntax.
So, here is my question (2 actually): Is option 3 the best way to solve this
problem? If so, how can I incorporate it with the with open() syntax?
Edit: Forgot to mention that I'm using Python 2.7 on Linux.
csv.reader accepts an arbitrary iterable besides files:
with open('myFile', 'rb') as csvfile:
    reader = csv.reader(filter_line(line) for line in csvfile)
    for row in reader:
        # my logic here
        pass
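Here filter_line stands in for whatever cleanup your files need; a possible sketch, assuming the leading text is separated from the CSV data by a recognizable marker (the '>>' marker below is invented purely for illustration):

def filter_line(line):
    # Hypothetical cleanup: strip everything up to and including the
    # marker that precedes the actual CSV data, when it is present.
    marker = '>>'  # invented; substitute whatever your files really contain
    if marker in line:
        return line.split(marker, 1)[1]
    return line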
You can just use contextlib and create your own context manager.
import csv
from contextlib import contextmanager

@contextmanager
def csv_factory(filename, mode="r"):
    # setup here
    fileobj = open(filename, mode)
    reader = csv.reader(fileobj)
    try:
        yield reader  # return value for usage in with
    finally:
        fileobj.close()  # clean up here

with csv_factory("myFile") as csvfile:
    for line in csvfile:
        print(line)
I'm using the CSV module to read a tab-delimited file. Code below:
z = csv.reader(open('/home/rv/ncbi-blast-2.2.23+/db/output.blast'), delimiter='\t')
But when I add z.close() to the end of my script, I get an error stating "'csv.reader' object has no attribute 'close'":
z.close()
So how do I close z?
The reader is really just a parser. When you ask it for a line of data, it delegates the reading action to the underlying file object and just converts the result into a set of fields. The reader itself doesn't manage any resources that would need to be cleaned up when you're done using it, so there's no need to close it; it'd be a meaningless operation.
You should make sure to close the underlying file object, though, because that does manage a resource (an open file descriptor) that needs to be cleaned up. Here's the way to do that:
with open('/home/rv/ncbi-blast-2.2.23+/db/output.blast') as f:
    z = csv.reader(f, delimiter='\t')
    # do whatever you need to with z
If you're not familiar with the with statement, it's roughly equivalent to enclosing its contents in a try...finally block that closes the file in the finally part.
Hopefully this doesn't matter (and if it does, you really need to update to a newer version of Python), but the with statement was introduced in Python 2.5, and in that version you would have needed a __future__ import to enable it. If you were working with an even older version of Python, you would have had to do the closing yourself using try...finally.
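For reference, the Python 2.5 enabling import was:

from __future__ import with_statement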
Thanks to Jared for pointing out compatibility issues with the with statement.
You do not close CSV readers directly; instead you should close whatever file-like object is being used. For example, in your case, you'd say:
f = open('/home/rv/ncbi-blast-2.2.23+/db/output.blast')
z = csv.reader(f, delimiter='\t')
...
f.close()
If you are using a recent version of Python, you can use the with statement, e.g.
with open('/home/rv/ncbi-blast-2.2.23+/db/output.blast') as f:
    z = csv.reader(f, delimiter='\t')
    ...
This has the advantage that f will be closed even if you throw an exception or otherwise return inside the with-block, whereas such a case would lead to the file remaining open in the previous example. In other words, it's basically equivalent to a try/finally block, e.g.
f = open('/home/rv/ncbi-blast-2.2.23+/db/output.blast')
try:
    z = csv.reader(f, delimiter='\t')
    ...
finally:
    f.close()
You don't close the result of the reader() method; you close the result of the open() method. So, use two statements: foo = open(...); bar = csv.reader(foo). Then you can call foo.close().
There are no bonus points awarded for doing in one line that which can be more readable and functional in two.
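Spelled out, a minimal sketch of that two-statement version:

import csv

foo = open('/home/rv/ncbi-blast-2.2.23+/db/output.blast')
bar = csv.reader(foo, delimiter='\t')
# ... iterate over bar here ...
foo.close()  # close the file object; the reader itself needs no closing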