Using Python's csv module, is it possible to read an entire, large, csv file into a lazy list of lists?
I am asking because in Clojure there are csv parsing modules that will parse a large file and return a lazy sequence (a sequence of sequences). I'm just wondering if that's possible in Python.
Unless I'm misunderstanding you, this is the default behavior, which is the very essence of reading through a csv file:
import csv

def lazy(csvfile):
    with open(csvfile) as f:
        r = csv.reader(f)
        for row in r:
            yield row
gives you back one row at a time.
The csv module's reader is lazy by default.
It will read a line in at a time from the file, parse it to a list, and return that list.
Python's csv.reader and DictReader objects are iterators. A row is produced only when the object's __next__() method is called, i.e. via next() or a for loop.
The csv module does load the data lazily, one row at a time.
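A quick way to see this for yourself (a minimal sketch; 'big.csv' is just a placeholder filename):

import csv

with open('big.csv', newline='') as f:
    reader = csv.reader(f)
    first = next(reader)   # only now is the first line read and parsed
    second = next(reader)  # the second line is read here, not earlier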
Related
I have a generator that yields rows from a CSV file one at a time, something like:
import csv

def as_csv(filename):
    with open(filename) as fin:
        yield from csv.reader(fin)
However, I need to also capture the raw string returned from the file, as this needs to be persisted at the same time.
As far as I can tell, the csv built-in can be used on an ad-hoc basis, something like this:
import csv

def as_csv_and_raw(filename):
    with open(filename) as fin:
        for row in fin:
            raw = row.strip()
            values = next(csv.reader([raw]))
            yield (values, raw)
... but this has the overhead of creating a new reader and a new iterable for each row of the file, so on files with millions of rows I'm concerned about the performance impact.
It feels like I could create a coroutine that could interact with the primary function, yielding the parsed fields in a way where I can control the input directly without losing it, something like this:
import csv

def as_csv_and_raw(filename):
    with open(filename) as fin:
        reader = raw_to_csv(some_coroutine())
        next(reader)
        for row in fin:
            raw = row.strip()
            fields = reader.send(raw)
            yield fields, raw

def raw_to_csv(data):
    yield from csv.reader(data)

def some_coroutine():
    # what goes here?
    raise NotImplementedError
I haven't really wrapped my head around coroutines and using yield as an expression, so I'm not sure what goes in some_coroutine, but the intent is that each time I send a value in, that value is run through the csv.reader object and I get the set of fields back.
Can someone provide the implementation of some_coroutine, or alternately show me a better mechanism for getting the desired data?
You can use itertools.tee to create two independent iterators from the iterable file object, create a csv.reader from one of them, and then zip the other iterator with it for output:
import csv
from itertools import tee

def as_csv_and_raw(filename):
    with open(filename) as fin:
        row, raw = tee(fin)
        yield from zip(csv.reader(row), raw)
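Hypothetical usage (assuming a file named 'data.csv'):

for fields, raw in as_csv_and_raw('data.csv'):
    print(fields)        # the parsed list of values
    print(raw.rstrip())  # the raw line; tee hands it over unchanged

One difference from the strip()-based version: the raw lines produced by tee still carry their trailing newline, so strip them at the point of use if you need them bare.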
I want to generate a log file in which I have to print two lists for about 50 input files. So, there are approximately 100 lists reported in the log file. I tried using pickle.dump, but it adds some strange characters in the beginning of each value. Also, it writes each value in a different line and the enclosing brackets are also not shown.
Here is a sample output from some test code.
import pickle
x=[1,2,3,4]
fp=open('log.csv','w')
pickle.dump(x,fp)
fp.close()
output:
(lp0
I1
aI2
aI3
aI4
a.
I want my log file to report:
list 1 is: [1,2,3,4]
If you want your log file to be readable, you are approaching it the wrong way by using pickle, which "implements binary protocols", i.e. its output is not human-readable.
To get what you want, replace the line
pickle.dump(x,fp)
with
fp.write('list 1 is: ')
fp.write(str(x))
This requires minimal change in the rest of your code. However, good practice would change your code to a better style.
pickle is for storing objects in a form which you could use to recreate the original object. If all you want to do is create a log message, the built-in string conversion is sufficient.
x = [1, 2, 3, 4]
with open('log.csv', 'w') as fp:
    print('list 1 is: {}'.format(x), file=fp)
Python's pickle is used to serialize objects, which is basically a way that an object and its hierarchy can be stored on your computer for use later.
If your goal is to write data to a csv, then read the csv file and output what you read inside of it, then read below.
Writing To A CSV File
import csv

my_list = [1, 2, 3, 4]
with open('yourFile.csv', 'w', newline='') as my_file:
    writer = csv.writer(my_file)
    writer.writerow(my_list)
The writerow() method writes each element of an iterable (each element of your list, in this case) to its own column of a single row. You can run through each one of your lists and write it to its own row in this way. If you want to write multiple rows at once, check out the writerows() method, sketched below.
Your data is flushed and the file closed automatically when the with block exits.
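A minimal writerows() sketch (the filename and data are placeholders):

import csv

rows = [[1, 2, 3, 4], [5, 6, 7, 8]]
with open('yourFile.csv', 'w', newline='') as my_file:
    # each inner list becomes one row in the file
    csv.writer(my_file).writerows(rows)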
Reading A CSV File
import csv

with open('example.csv', newline='') as File:
    reader = csv.reader(File)
    for row in reader:
        print(row)
This will run through all the rows in your csv file and will print it to the console.
I am reading an xlsx file (using openpyxl) and a csv file (using csv.reader). The openpyxl side returns a generator properly: I can iterate over its values after it is returned from a function that dispatches on whether the file is an Excel file or a csv. The problem arises when I do the same thing with a csv file: it returns a generator, but I can't iterate over it, since the csv file appears to be closed once the function returns from inside the with statement. I know it's obvious that the file closes after the with statement has fulfilled its purpose, but why then does the openpyxl version work? Why can I still iterate over the generator for an Excel file? And, my ultimate question: how can I make csv.reader behave the way openpyxl behaves here, i.e. let me iterate over the generator's values?
import csv
from openpyxl import load_workbook

def iter_rows(worksheet):
    """
    Iterate over Excel rows and return an iterator of the row value lists
    """
    for row in worksheet.iter_rows():
        yield [cell.value for cell in row]

def get_rows(filename, file_extension):
    """
    Based on file extension, read the appropriate file format
    """
    # read csv
    if file_extension == 'csv':
        with open(filename) as f:
            return csv.reader(f, delimiter=",")
    # read Excel files with openpyxl
    if file_extension in ['xls', 'xlsx']:
        wb2 = load_workbook(filename)
        worksheet1 = wb2[wb2.get_sheet_names()[0]]
        return iter_rows(worksheet1)
# this works properly
rows = get_rows('excels/ar.xlsx', 'xlsx')
print(rows) # I am: <generator object iter_rows at 0x074D7A58>
print([row for row in rows]) # I am printing each row of the excel file from the generator
# Error: ValueError: I/O operation on closed file
rows = get_rows('excels/ar.csv', 'csv')
print(rows) # I am: <generator object iter_rows at 0x074D7A58>
print([row for row in rows]) # ValueError: I/O operation on closed file
You don't use a with statement with the openpyxl calls. But it seems you already know the problem, i.e. that you are trying to iterate over a file handler after the with block has closed it. Iterate earlier? Or better yet, yield from the reader object:
def get_rows(filename, file_extension):
    """
    Based on file extension, read the appropriate file format
    """
    # read csv
    if file_extension == 'csv':
        with open(filename) as f:
            yield from csv.reader(f, delimiter=",")
    # read Excel files with openpyxl
    if file_extension in ['xls', 'xlsx']:
        wb2 = load_workbook(filename)
        worksheet1 = wb2[wb2.get_sheet_names()[0]]
        yield from iter_rows(worksheet1)
Or, if you are on Python 2:
def get_rows(filename, file_extension):
    """
    Based on file extension, read the appropriate file format
    """
    # read csv
    if file_extension == 'csv':
        with open(filename) as f:
            for row in csv.reader(f, delimiter=","):
                yield row
    # read Excel files with openpyxl
    if file_extension in ['xls', 'xlsx']:
        wb2 = load_workbook(filename)
        worksheet1 = wb2[wb2.get_sheet_names()[0]]
        for row in iter_rows(worksheet1):
            yield row
Also note 2 things:
The addition of the yield from/yield makes the get_rows function a generator function, changing the semantics of the return iter_rows(worksheet1) line. You now want to yield from both branches.
The way you originally wrote get_rows does not return a generator when you have a "csv". A csv.reader object is not a generator (nor, I believe, is a worksheet.iter_rows object, though I don't use openpyxl so I can't say for sure). The reason your "xlsx" branch returns a generator is that you explicitly return the call to iter_rows, which you've defined as a generator function; your "csv" branch returns a csv.reader object. The latter is a lazy iterable, but it is not a generator, as the snippet below shows. Not all iterables are generators; generators were added as a language construct to facilitate writing iterators, and have since expanded to do all sorts of fancy stuff, like coroutines. See this answer to a famous question, which I think is better than the accepted answer.
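A quick check of the distinction (a minimal Python 3 sketch using an in-memory file):

import csv
import io
import types

reader = csv.reader(io.StringIO("a,b\nc,d\n"))
print(iter(reader) is reader)                   # True: the reader is its own iterator
print(isinstance(reader, types.GeneratorType))  # False: it is not a generator

def gen():
    yield 1

print(isinstance(gen(), types.GeneratorType))   # True: generator functions make generators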
The issue has to do with how you're handling the file. Both branches in your function return an iterator, but the CSV branch uses a with statement, which closes the file automatically when you return. That means the iterator you get from csv.reader is useless, since the file it tries to read from is already closed by the time your top-level code can use it.
One way to work around this would be to make your get_rows function a generator. If you yield from the csv.reader instead of returning it, the file will not be closed until it has been fully read (or the generator is discarded). You'll also need to yield from the iter_rows generator you wrote for the other branch.
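With the generator version, the file stays open while the rows are being consumed, so the original top-level code works unchanged:

rows = get_rows('excels/ar.csv', 'csv')
print(rows)                   # now: <generator object get_rows at 0x...>
print([row for row in rows])  # works: the file is closed only once the generator is exhausted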
I am extremely new to Python 3 and I am learning as I go. I figured someone could help me with a basic question: how to store text from a CSV file as a variable to be used later in the code. The idea here would be to import a CSV file into the Python interpreter:
import csv

with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        ...
and then extract the text from that file and store it as a variable (i.e. w = ["csv file text"]) to then be used later in the code to create permutations:
print (list(itertools.permutations(["w"], 2)))
If someone could please help and explain the process, it would be very much appreciated as I am really trying to learn. Please let me know if any more explanation is needed!
itertools.permutations() wants an iterable (e.g. a list) and a length as its arguments, so your data structure needs to reflect that, but you also need to define what you are trying to achieve here. For example, if you wanted to read a CSV file and produce permutations on every individual CSV field you could try this:
import csv
import itertools

with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    w = []
    for row in reader:
        w.extend(row)

print(list(itertools.permutations(w, 2)))
The key thing here is to create a flat list that can be passed to itertools.permutations() - this is done by initialising w to an empty list, and then extending it with the elements/fields from each row of the CSV file.
Note: As pointed out by @martineau, for the reasons explained here, the file should be opened with newline='' when used with the Python 3 csv module.
If you want to use Python 3 (as you state in the question) and to process the CSV file using the standard csv module, you should be careful about how you open the file. So far, your code and the other answers use the Python 2 way of opening a CSV file. Things have changed in Python 3.
As shengy wrote, the CSV file is just a text file, and the csv module gets the elements as strings. Strings in Python 3 are unicode strings. Because of that, you should open the file in text mode and supply the encoding. Because of the nature of CSV processing, you should also pass newline='' when opening the file.
Now extending the explanation of Burhan Khalid: when reading the CSV file, you get the rows as lists of strings. If you want to read the whole CSV file into memory and store it in a variable, you probably want a list of rows (i.e. a list of lists, where the nested lists are the rows). The for loop iterates through the rows; in the same way, the list() function iterates through the sequence of rows and builds a list of its items. To combine that with the wish to store everything in the content variable, you can write:
import csv

with open('some.csv', newline='', encoding='utf_8') as f:
    reader = csv.reader(f)
    content = list(reader)
Now you can do your permutations as you wish. The itertools module is the correct way to do them.
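For example, building on the content variable above (flattening first so the permutations run over individual fields):

import itertools

# content is the list of rows read above; flatten it into one list of fields
fields = [field for row in content for field in row]
print(list(itertools.permutations(fields, 2)))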
import csv

data = csv.DictReader(open('FileName.csv', 'r'))
print data.fieldnames

output = []
for each_row in data:
    row = {}
    try:
        # strip whitespace from the keys and drop 'null' values
        p = dict((k.strip(), v) for k, v in each_row.iteritems() if v.lower() != 'null')
    except AttributeError, e:
        print e
        print each_row
        raise Exception()
    # based on the number of columns
    if p.get('col1'):
        row['col1'] = p['col1']
    if p.get('col2'):
        row['col2'] = p['col2']
    output.append(row)

Finally, all the data is stored in the output variable.
Is this what you need?
import csv

with open('some.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    rows = list(reader)
    print('The csv file had {} rows'.format(len(rows)))
    for row in rows:
        do_stuff(row)
    do_stuff_to_all_rows(rows)
The interesting line is rows = list(reader), which consumes the reader and collects every row of the csv file (each itself a list) into the list rows, in effect giving you a list of lists.
If you had a csv file with three rows, rows would be a list with three elements, each element a row representing each line in the original csv file.
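For instance, with a hypothetical three-line file:

# some.csv contains:
#   a,b
#   c,d
#   e,f
# rows == [['a', 'b'], ['c', 'd'], ['e', 'f']]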
If all you care about is reading the raw text in the file (csv or not), then:
with open('some.csv') as f:
    w = f.read()
is a simple way to end up with w = "csv, file, text\nwithout, caring, about columns\n".
You should try pandas, which works with both Python 2.7 and Python 3.2+:
import pandas as pd

df = pd.read_csv("your_file.csv")
Then you can handle your data easily.
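For example (a few typical DataFrame operations; 'col_name' is a placeholder for one of your column names):

print(df.head())       # the first few rows
print(df['col_name'])  # a single column selected by name
print(df.describe())   # summary statistics for the numeric columns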
There is more in the pandas documentation.
First, a csv file is a text file too, so everything you can do with a text file, you can do with a csv file. That means f.read(), f.readline(), and f.readlines() can all be used; see the documentation for detailed information on these functions.
But, as your file is a csv file, you can utilize the csv module.
# input.csv
# 1,david,enterprise
# 2,jeff,personal

import csv

with open('input.csv') as f:
    reader = csv.reader(f)
    for serial, name, version in reader:
        # The csv module already extracts the information for you
        print serial, name, version
More details about the csv module are in its documentation.
What is the 'Python way' of working with a CSV file? If I want to run some methods on the data in a particular column, should I copy the whole thing into an array, or should I pass the open file into a series of methods?
I tried to return the open file and got this error:
ValueError: I/O operation on closed file
here's the code:
import sys
import os
import csv

def main():
    pass

def openCSVFile(CSVFile, openMode):
    with open(CSVFile, openMode) as csvfile:
        zipreader = csv.reader(csvfile, delimiter=',')
        return zipreader

if __name__ == '__main__':
    zipfile = openCSVFile('propertyOutput.csv', 'rb')
    numRows = sum(1 for row in zipfile)
    print "Rows equals %d." % numRows
Well, there are many ways you could go about manipulating csv files. It depends largely on how big your data is and how often you will perform these operations. I will build on the already good answers and comments to present a somewhat more complex handling that wouldn't be far off from a real-world example.
First of all, I prefer csv.DictReader, because most csv files have a header row with the column names. csv.DictReader takes advantage of that and lets you grab each cell value by its column name. Also, most of the time you need to perform various validation and normalization operations on said data, so we're going to associate some functions with specific columns.
Suppose we have a csv with information about products, e.g.:
Product Name,Release Date,Price
foo product,2012/03/23,99.9
awesome product,2013/10/14,40.5
.... and so on ........
Let's write a program to parse it and normalize the values into appropriate native Python objects.
import csv
import datetime
from decimal import Decimal

def stripper(value):
    # Strip any whitespace from the left and right
    return value.strip()

def to_decimal(value):
    return Decimal(value)

def to_date(value):
    # We expect dates like: "2013/05/23"
    return datetime.datetime.strptime(value, '%Y/%m/%d').date()

OPERATIONS = {
    'Product Name': [stripper],
    'Release Date': [stripper, to_date],
    'Price': [stripper, to_decimal]
}

def parse_csv(filepath):
    with open(filepath, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            for column in row:
                operations = OPERATIONS[column]
                value = row[column]
                for op in operations:
                    value = op(value)
                # Print the cleaned value or store it somewhere
                print value
Things to note:
1) We operate on the csv on a line-by-line basis. DictReader yields lines one at a time, which means we can handle csv files of arbitrary size, since we are not going to load the whole file into memory.
2) You can go crazy with normalizing the values of a csv, by building special classes with magic methods or whatnot. As I said, it depends on the complexity of your files, the quality of the data and the operations you need to perform on them.
Have fun.
The csv module provides one row at a time, understanding its content by splitting it into a list object (or a dict in the case of DictReader). As Python knows how to loop over such an object, if you're only interested in some specific fields, building a list of those fields seems 'Pythonic' to me. Using an iterator is also valid if each item is to be considered separately from the others.
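A minimal sketch of the list-building approach (the filename and column index are placeholders):

import csv

# collect a single column (here, index 2) into a flat list
with open('data.csv', newline='') as f:
    column = [row[2] for row in csv.reader(f)]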
You probably need to read PEP 343: The 'with' statement
Relevant quote:
Some standard Python objects now support the context management protocol and can be used with the 'with' statement. File objects are one example:
with open('/etc/passwd', 'r') as f:
    for line in f:
        print line
        ... more processing code ...
After this statement has executed, the file object in f will have been automatically closed, even if the 'for' loop raised an exception part-way through the block.
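A one-line check of that behaviour (a minimal sketch):

with open('/etc/passwd', 'r') as f:
    data = f.read()
print(f.closed)  # True: the handle was closed when the block exited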
So your csvfile is closed outside the with statement, and outside the openCSVFile function. You either need to not use a with statement:
def openCSVFile(CSVFile, openMode):
    csvfile = open(CSVFile, openMode)
    return csv.reader(csvfile, delimiter=',')
or move it to __main__:
def get_csv_reader(filelike):
    return csv.reader(filelike, delimiter=',')

if __name__ == '__main__':
    with open('propertyOutput.csv', 'rb') as csvfile:
        zipfile = get_csv_reader(csvfile)
        numRows = sum(1 for row in zipfile)
        print "Rows equals %d." % numRows
Firstly, the reason you're getting ValueError: I/O operation on closed file is that the with statement, acting as a context manager, closes the opened file as soon as its block is exited. zipreader is set up to work on that underlying file object, so by the time the function returns, the file it would read from has already been closed, leaving zipreader unusable...
with open(CSVFile, openMode) as csvfile:
    zipreader = csv.reader(csvfile, delimiter=',')
    return zipreader
Generally, acquire the resource and then pass it to a function if needed. So, in your main program, open the file and create the csv.reader, then pass the reader to something, and close the file in the main program when it makes sense that "you're done with it now".
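A minimal Python 3 sketch of that pattern (count_rows is a hypothetical consumer):

import csv

def count_rows(reader):
    # works on any iterable of rows, not just a csv.reader
    return sum(1 for _ in reader)

if __name__ == '__main__':
    # the caller owns the file handle; it is closed when this block exits
    with open('propertyOutput.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        print("Rows equals %d." % count_rows(reader))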