I am trying to parse a "pseudo-CSV" file with the python CSV reader, and am having some doubts about how to add some extra logic. The reason I call it a "pseudo-CSV" file is because some of the lines in the input file will have text (30-40 chars) before the actual CSV data starts. I am trying to figure out the best way to remove this text.
Currently, I have found 3 options for removing this text:
1. From Python, call grep and sed and pipe the output to a temp file which can then be fed to the csv reader (ugh, I would like to avoid this option).
2. Create a CSV dialect to remove the unwanted text (this option just feels wrong).
3. Extend the file object, implementing the next() function to remove the unwanted text as necessary.
I have no control over how the input file is generated, so it's not an option to modify the generation.
Here is the related code I had when I realized the problem with the input file.
with open('myFile', 'r') as csvfile:
    theReader = csv.reader(csvfile)
    for row in theReader:
        # my logic here
If I go with option 3 above, the solution is quite straightforward, but then I won't be able to use the with open() syntax.
So, here are my questions (2 actually): Is option 3 the best way to solve this problem? If so, how can I incorporate it with the with open() syntax?
Edit: Forgot to mention that I'm using Python 2.7 on Linux.
csv.reader accepts an arbitrary iterable besides files:
with open('myFile', 'rb') as csvfile:
    reader = csv.reader(filter_line(line) for line in csvfile)
    for row in reader:
        # my logic here
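Note that filter_line is left undefined in the snippet above. A minimal sketch of what it might look like (the ': ' marker and the 40-character bound are assumptions about your file format, not part of the original answer):
def filter_line(line):
    # Hypothetical: assume the extra leading text, when present, is separated
    # from the real CSV data by a known marker -- ': ' here.
    prefix, sep, rest = line.partition(': ')
    if sep and len(prefix) <= 40:  # the junk is said to be 30-40 chars long
        return rest
    return line  # no marker found: the line is plain CSV already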
You can just use contextlib and create your own context manager.
import csv
from contextlib import contextmanager

@contextmanager
def csv_factory(filename, mode="r"):
    # setup here
    fileobj = open(filename, mode)
    reader = csv.reader(fileobj)
    try:
        yield reader  # return value for usage in with
    finally:
        fileobj.close()  # clean up here
with csv_factory("myFile") as csvfile:
    for line in csvfile:
        print(line)
import csv
with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))

import collections
counter = collections.defaultdict(int)
for row in data:
    counter[row[10]] += 1

with open('/pythonwork/thefile_subset11.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for row in data:
        if counter[row[10]] >= 504:
            writer.writerow(row)
This code reads thefile.csv, makes changes, and writes the results to thefile_subset11.csv.
However, when I open the resulting csv in Microsoft Excel, there is an extra blank line after each record!
Is there a way to make it not put an extra blank line?
The csv module controls line endings itself and writes \r\n into the file directly. In Python 3 the file must be opened in untranslated text mode with the parameters 'w', newline='' (empty string), or it will write \r\r\n on Windows, where the default text mode translates each \n into \r\n.
#!python3
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
In Python 2, open outfile in binary mode with 'wb' instead of 'w' to prevent the Windows newline translation. Python 2 also has problems with Unicode and requires other workarounds to write non-ASCII text. See the Python 2 link below, and the UnicodeReader and UnicodeWriter examples at the end of that page, if you have to write Unicode strings to CSVs on Python 2, or look into the third-party unicodecsv module:
#!python2
with open('/pythonwork/thefile_subset11.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
Documentation Links
https://docs.python.org/3/library/csv.html#csv.writer
https://docs.python.org/2/library/csv.html#csv.writer
Opening the file in binary mode "wb" will not work in Python 3+. Or rather, you'd have to convert your data to binary before writing it. That's just a hassle.
Instead, you should keep it in text mode, but override the newline as empty. Like so:
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
Note: It seems this is not the preferred solution because of how the extra line was being added on a Windows system. As stated in the Python documentation:
If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference.
Windows is one such platform where that makes a difference. While changing the line terminator, as I described below, may have fixed the problem, the problem could be avoided altogether by opening the file in binary mode. One might say that solution is more "elegant". "Fiddling" with the line terminator would likely have resulted in code that is unportable between systems, whereas opening a file in binary mode has no effect on a Unix system, i.e. it results in cross-system compatible code.
From Python Docs:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
Original:
As part of the optional parameters for csv.writer, if you are getting extra blank lines you may have to change the lineterminator (info here). The example below is adapted from the csv docs on the Python site. Change it from '\n' to whatever it should be. As this is just a stab in the dark at the problem, it may or may not work, but it's my best guess.
>>> import csv
>>> spamWriter = csv.writer(open('eggs.csv', 'w'), lineterminator='\n')
>>> spamWriter.writerow(['Spam'] * 5 + ['Baked Beans'])
>>> spamWriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
The simple answer is that CSV files should always be opened in binary mode, whether for input or output, as otherwise on Windows there are problems with the line endings. Specifically, on output the csv module will write \r\n (the standard CSV row terminator), and then (in text mode) the runtime will replace the \n with \r\n (the Windows standard line terminator), giving a result of \r\r\n.
Fiddling with the lineterminator is NOT the solution.
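A quick way to see the \r\r\n corruption for yourself is to read the raw bytes back (a minimal Python 2 sketch; demo.csv is a throwaway filename):
import csv

# Text mode on Windows: csv writes "\r\n", the runtime expands the "\n"
# to "\r\n", and each row ends up terminated by "\r\r\n".
with open('demo.csv', 'w') as f:
    csv.writer(f).writerow(['spam', 'eggs'])

# Read the file back in binary mode so no translation hides the damage.
with open('demo.csv', 'rb') as f:
    print repr(f.read())  # 'spam,eggs\r\r\n' on Windows, 'spam,eggs\r\n' on Unix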
A lot of the other answers have become outdated in the ten years since the original question. For Python 3, the answer is right in the documentation:
If csvfile is a file object, it should be opened with newline=''
The footnote explains in more detail:
If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write, an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
Use the approach defined below to write data to the CSV file:
open('outputFile.csv', 'a', newline='')
Just add an additional newline='' parameter inside the open() call:
import csv

def writePhoneSpecsToCSV():
    rowData = ["field1", "field2"]
    with open('outputFile.csv', 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(rowData)
This will write CSV rows without creating additional rows!
I'm writing this answer w.r.t. Python 3, as I initially had the same problem.
I was supposed to get data from an Arduino using PySerial and write it to a .csv file. Each reading in my case ended with '\r\n', so a newline was always separating each line.
In my case, the newline='' option didn't work, because I had typed a space by mistake (newline=' '), which raised an error like:
with open('op.csv', 'a', newline=' ') as csv_file:
ValueError: illegal newline value: ' '
So it seemed that an arbitrary newline value is not accepted here.
Following one of the answers here, I specified the line terminator in the writer object instead, like:
writer = csv.writer(csv_file, delimiter=' ',lineterminator='\r')
and that worked for me for skipping the extra newlines.
with open(destPath+'\\'+csvXML, 'a+') as csvFile:
    writer = csv.writer(csvFile, delimiter=';', lineterminator='\r')
    writer.writerows(xmlList)
The "lineterminator='\r'" permit to pass to next row, without empty row between two.
Borrowing from this answer, it seems like the cleanest solution is to use io.TextIOWrapper. I managed to solve this problem for myself as follows:
from io import TextIOWrapper
...
with open(filename, 'wb') as csvfile, TextIOWrapper(csvfile, encoding='utf-8', newline='') as wrapper:
    csvwriter = csv.writer(wrapper)
    for data_row in data:
        csvwriter.writerow(data_row)
The above answer is not compatible with Python 2. To have compatibility, I suppose one would simply need to wrap all the writing logic in an if block:
if sys.version_info < (3,):
    pass  # Python 2 way of handling CSVs
else:
    pass  # the above logic
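For example, a minimal sketch of such a dispatch (open_csv_for_write is a hypothetical helper name, not from any library):
import csv
import sys

def open_csv_for_write(path):
    # Binary mode on Python 2, text mode with newline='' on Python 3
    if sys.version_info < (3,):
        return open(path, 'wb')
    return open(path, 'w', newline='')

with open_csv_for_write('out.csv') as f:
    csv.writer(f).writerow(['a', 'b'])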
I used writerow:
import csv
from itertools import permutations

def write_csv(writer, var1, var2, var3, var4):
    """
    Write four variables into a csv file.
    """
    writer.writerow([var1, var2, var3, var4])

numbers = set([1, 2, 3, 4, 5, 6, 7, 2, 4, 6, 8, 10, 12, 14, 16])
rules = list(permutations(numbers, 4))
#print(rules)
selection = []
with open("count.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for rule in rules:
        number1, number2, number3, number4 = rule
        if (number1 + number2 + number3 + number4) % 5 == 0:
            #print(rule)
            selection.append(rule)
            write_csv(writer, number1, number2, number3, number4)
When using Python 3, the empty lines can be avoided by using the codecs module. As stated in the documentation, files are opened in binary mode, so no change of the newline kwarg is necessary. I was running into the same issue recently, and this worked for me:
with codecs.open(csv_file, mode='w', encoding='utf-8') as out_csv:
    csv_out_file = csv.DictWriter(out_csv, fieldnames=fieldnames)  # DictWriter requires the column names
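A fuller, self-contained sketch of the same approach for Python 3 (the file name and column names here are illustrative, not from the original answer):
import codecs
import csv

fieldnames = ['id', 'name']
with codecs.open('out.csv', mode='w', encoding='utf-8') as out_csv:
    writer = csv.DictWriter(out_csv, fieldnames=fieldnames)
    writer.writeheader()  # write the header row once
    writer.writerow({'id': 1, 'name': 'café'})  # non-ASCII text is handled by the utf-8 codec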
I have a large CSV file (~250000 rows) and before I work on fully parsing and sorting it I was trying to display only a part of it by writing it to a text file.
csvfile = open(file_path, "rb")
rows = csvfile.readlines()
text_file = open("output.txt", "w")
row_num = 0
while row_num < 20:
    text_file.write(", ".join(rows[row_num]))
    row_num += 1
text_file.close()
I want to iterate through the CSV file and write only a small section of it to a text file, so I can look at how it does this and see if it would be of any use to me. Currently the text file ends up empty.
One way I thought might do this would be to iterate through the file with a for loop that exits after a certain number of iterations, but I could be wrong and I'm not sure how to do this. Any ideas?
There's nothing specifically wrong with what you're doing, but it's not particularly Pythonic. In particular, reading the whole file into memory with readlines() at the start seems pointless if you're only using 20 lines.
Instead you could use a for loop with enumerate and break when necessary.
csvfile = open(file_path, "rb")
text_file = open("output.txt", "w")
for i, row in enumerate(csvfile):
    text_file.write(row)
    if i >= 19:  # enumerate counts from 0, so this stops after 20 lines
        break
text_file.close()
You could further improve this by using with blocks to open the files, rather than closing them explicitly. For example:
with open(file_path, "rb") as csvfile:
#your code here involving csvfile
#now the csvfile is closed!
Also note that Python might not be the best tool for this - you could do it directly from Bash, for example, with just head -n20 csvfile.csv > output.txt.
A simple solution would be to just do:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
file_path = './test.csv'
with open(file_path, 'rb') as csvfile:
    with open('output.txt', 'wb') as textfile:
        for i, row in enumerate(csvfile):
            textfile.write(row)
            if i >= 19:  # stop after 20 lines
                break
Explanation:
with open(file_path, 'rb') as csvfile:
    with open('output.txt', 'wb') as textfile:
Instead of using open and close, it is recommended to use these lines. Just write the code that you want to execute while your file is open inside a new level of indentation.
'rb' and 'wb' are the modes you need to open a file for, respectively, 'reading' and 'writing' in 'binary mode'.
for i, row in enumerate(csvfile):
This line allows you to read your CSV file line by line, and using a tuple (i, row) gives you both the content of the row and its index. That's one of the awesome built-in functions from Python: see the documentation on enumerate for more about it.
Hope this helps!
EDIT: Note that Python has a csv package that can do this without enumerate:
# -*- encoding: utf-8 -*-
import csv
file_path = './test.csv'
with open(file_path, 'rb') as csvfile:
    reader = csv.reader(csvfile)
    with open('output.txt', 'wb') as textfile:
        writer = csv.writer(textfile)
        i = 0
        while i < 20:
            row = next(reader)  # raises StopIteration if the file has fewer rows
            writer.writerow(row)
            i += 1
All we need to use is its reader and writer. They have the functions next (which reads one line) and writerow (which writes one). Note that here, the variable row is not a string but a list of strings, because the reader does the splitting job by itself. It might be faster than the previous solution.
Also, this has the major advantage of allowing you to look anywhere you want in the file, not necessarily from the beginning (just change the bounds for i).
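An alternative sketch (not from the original answers) uses itertools.islice to take the first 20 rows without manual counting; it also stops cleanly if the file has fewer rows:
import csv
from itertools import islice

with open('./test.csv', 'rb') as csvfile, open('output.txt', 'wb') as textfile:
    writer = csv.writer(textfile)
    # islice yields at most 20 rows, then the loop simply ends
    for row in islice(csv.reader(csvfile), 20):
        writer.writerow(row)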
I am generating and parsing CSV files and I'm noticing something odd.
When the CSV gets generated, there is always an empty line at the end, which is causing issues when subsequently parsing them.
My code to generate is as follows:
with open(file, 'wb') as fp:
    a = csv.writer(fp, delimiter=",", quoting=csv.QUOTE_NONE, quotechar='')
    a.writerow(["Id", "Building", "Age", "Gender"])
    results = get_results()
    for val in results:
        if any(val):
            a.writerow(val)
It doesn't show up via the command line, but I do see it in my IDE/text editor
Does anyone know why it is doing this?
Could it possibly be whitespace?
Is the problem the line terminator? It could be as simple as changing one line:
a = csv.writer(fp, delimiter=",", quoting=csv.QUOTE_NONE, quotechar='', lineterminator='\n')
I suspect this is it, since I know that csv.writer defaults to using carriage return + line feed ("\r\n") as the line terminator. The program you are using to read the file might be expecting just a line feed ("\n"). This is common when switching files back and forth between *nix and Windows.
If this doesn't work, then the program you are using to read the file seems to be expecting no line terminator for the last row, and I'm not sure the csv module supports that. For that, you could write the csv to a StringIO, strip() it, and then write that to your file.
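A rough sketch of that StringIO idea (Python 2; file and get_results() are taken from the question, and the rstrip only touches the final terminator):
import csv
from StringIO import StringIO  # Python 2

buf = StringIO()
writer = csv.writer(buf)
for record in get_results():
    writer.writerow(record)

with open(file, 'wb') as fp:
    # strip the trailing row terminator so the last row has no line ending
    fp.write(buf.getvalue().rstrip('\r\n'))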
Also, since you are not quoting anything, is there a reason to use csv at all? Why not:
with open(file, 'wb') as fp:
    fp.write("\n".join([",".join([field for field in record]) for record in get_results()]))
What is the 'Python way' of working with a CSV file? If I want to run some methods on the data in a particular column, should I copy the whole thing into an array, or should I pass the open file into a series of methods?
I tried to return the open file and got this error:
ValueError: I/O operation on closed file
Here's the code:
import sys
import os
import csv

def main():
    pass

def openCSVFile(CSVFile, openMode):
    with open(CSVFile, openMode) as csvfile:
        zipreader = csv.reader(csvfile, delimiter=',')
        return zipreader

if __name__ == '__main__':
    zipfile = openCSVFile('propertyOutput.csv', 'rb')
    numRows = sum(1 for row in zipfile)
    print "Rows equals %d." % numRows
Well, there are many ways you could go about manipulating csv files. It depends largely on how big your data is and how often you will perform these operations. I will build on the already good answers and comments to present a somewhat more complex handling that wouldn't be far off from a real-world example.
First of all, I prefer csv.DictReader because most csv files have a header row with the column names. csv.DictReader takes advantage of that and gives you the opportunity to grab a cell's value by its column name.
Also, most of the time you need to perform various validation and normalization operations on said data, so we're going to associate some functions with specific columns.
Suppose we have a csv with information about products, e.g.:
Product Name,Release Date,Price
foo product,2012/03/23,99.9
awesome product,2013/10/14,40.5
.... and so on ........
Let's write a program to parse it and normalize the values into appropriate native Python objects.
import csv
import datetime
from decimal import Decimal

def stripper(value):
    # Strip any whitespace from the left and right
    return value.strip()

def to_decimal(value):
    return Decimal(value)

def to_date(value):
    # We expect dates like: "2013/05/23"
    return datetime.datetime.strptime(value, '%Y/%m/%d').date()

OPERATIONS = {
    'Product Name': [stripper],
    'Release Date': [stripper, to_date],
    'Price': [stripper, to_decimal]
}

def parse_csv(filepath):
    with open(filepath, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            for column in row:
                operations = OPERATIONS[column]
                value = row[column]
                for op in operations:
                    value = op(value)
                # Print the cleaned value or store it somewhere
                print value
Things to note:
1) We operate on the csv on a line-by-line basis. DictReader yields lines one at a time, and that means we can handle csv files of arbitrary size, since we are not going to load the whole file into memory.
2) You can go crazy with normalizing the values of a csv by building special classes with magic methods or whatnot. As I said, it depends on the complexity of your files, the quality of the data, and the operations you need to perform on them.
Have fun.
The csv module provides one row at a time, interpreting its content by splitting it into a list object (or a dict in the case of DictReader).
Since Python knows how to loop over such an object, if you're just interested in some specific fields, building a list with those fields seems 'Pythonic' to me. Using an iterator is also valid if each item is to be considered separately from the others.
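For instance, a minimal sketch of the list-building approach (the column index 2 and the file name are illustrative):
import csv

with open('propertyOutput.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    # Pull just the third field of every row into a plain list
    third_column = [row[2] for row in reader]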
You probably need to read PEP 343: The 'with' statement
Relevant quote:
Some standard Python objects now support the context management protocol and can be used with the 'with' statement. File objects are one example:
with open('/etc/passwd', 'r') as f:
    for line in f:
        print line
        ... more processing code ...
After this statement has executed, the file object in f will have been automatically closed, even if the 'for' loop raised an exception part-way through the block.
So your csvfile is closed outside the with statement, and outside the openCSVFile function. You either need to not use the with statement:
def openCSVFile(CSVFile, openMode):
    csvfile = open(CSVFile, openMode)
    return csv.reader(csvfile, delimiter=',')
or move it to __main__:
def get_csv_reader(filelike):
    return csv.reader(filelike, delimiter=',')

if __name__ == '__main__':
    with open('propertyOutput.csv', 'rb') as csvfile:
        zipfile = get_csv_reader(csvfile)
        numRows = sum(1 for row in zipfile)
        print "Rows equals %d." % numRows
Firstly, the reason you're getting ValueError: I/O operation on closed file is that the with block acts as a context manager on the opened file, which is the underlying fileobj that zipreader is set to work on. As soon as the with block is exited, the file that was opened is closed, which leaves the file unusable for zipreader to read from:
with open(CSVFile, openMode) as csvfile:
    zipreader = csv.reader(csvfile, delimiter=',')
    return zipreader
Generally, acquire the resource and then pass it to a function if needed. So, in your main program, open the file and create the csv.reader, then pass that to something, and have the file closed in the main program when it makes more sense that "you're done with it now".
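A small sketch of that pattern (count_rows is a hypothetical name; the caller keeps ownership of the file's lifetime):
import csv

def count_rows(reader):
    # Operate on an already-open reader; never open or close files here
    return sum(1 for _ in reader)

if __name__ == '__main__':
    with open('propertyOutput.csv', 'rb') as csvfile:
        print "Rows equals %d." % count_rows(csv.reader(csvfile, delimiter=','))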
I have a text document from which I would like to repeatedly remove the first line of text, every 30 seconds or so.
I have already written (or more accurately copied) the code for the python resettable timer object that allows a function to be called every 30 seconds in a non blocking way if not asked to reset or cancel.
Resettable timer in python repeats until cancelled
(If someone could check that the way I implemented the repeat in that is OK, it would be appreciated, because my Python sometimes crashes while running it :))
I now want to write my function to load a text file, copy all but the first line, and then rewrite it to the same text file. I think I can do it this way... but is it the most efficient?
from collections import deque

def removeLine():
    with open(path, 'rU') as file:
        lines = deque(file)
        try:
            print lines.popleft()
        except IndexError:
            print "Nothing to pop?"
    with open(path, 'w') as file:
        file.writelines(lines)
This works, but is it the best way to do it?
I'd use the fileinput module with inplace=True:
import fileinput

def removeLine():
    inputfile = fileinput.input(path, inplace=True, mode='rU')
    next(inputfile, None)  # skip a line *if present*
    for line in inputfile:
        print line,  # write out again, but without an extra newline
    inputfile.close()
inplace=True causes sys.stdout to be redirected to the open file, so we can simply 'print' the lines.
The next() call is used to skip the first line; giving it a default None suppresses the StopIteration exception for an empty file.
This makes rewriting a large file more efficient as you only need to keep the fileinput readlines buffer in memory.
I don't think a deque is needed at all, even for your solution; just use next() there too, then use list() to catch the remaining lines:
def removeLine():
    with open(path, 'rU') as file:
        next(file, None)  # skip a line *if present*
        lines = list(file)
    with open(path, 'w') as file:
        file.writelines(lines)
but this requires you to read all of the file into memory; don't do that with large files.
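For large files, one common pattern (a sketch, assuming the temporary file lands on the same filesystem so the rename is a single step) is to stream to a temporary file and then replace the original:
import os
from tempfile import NamedTemporaryFile

def removeLine():
    # Stream line by line so the file never has to fit in memory
    dirname = os.path.dirname(path) or '.'
    with open(path, 'rU') as src:
        with NamedTemporaryFile('w', dir=dirname, delete=False) as dst:
            next(src, None)  # skip the first line *if present*
            for line in src:
                dst.write(line)
    os.rename(dst.name, path)  # replace the original in one step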