StopIteration error in Python code while reading CSV data

I am writing a program to read a CSV file. I have created a reader object, and calling next() on it gives me the header row. But when I call it again, it raises StopIteration, even though there are rows left in the CSV file. If I do file.seek(0) first, it works fine. Can anyone explain this to me? A snippet of the code is given below:
import csv

with open(file, 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    result = []
    for colname in header[2:]:
        col_index = header.index(colname)
        # f.seek(0)
        next(reader)

You're calling next once for each column (except the first two). So, if you have, say, 10 columns, it's going to try to read 8 rows.
If you have 20 rows, that's not going to raise an exception, but you'll be ignoring the last 12 rows, which you probably don't want. On the other hand, if you have only 5 rows, it's going to raise when trying to read the 6th row.
The reason the f.seek(0) prevents the exception is that it resets the file back to the start before each next, so you just read the header row over and over, ignoring everything else in the file. It doesn't raise anything, but it's not doing anything useful.
What you probably wanted is something like this:
with open(file, 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    result = []
    for row in reader:
        for col_index, colname in enumerate(header[2:], start=2):
            value = row[col_index]
            result.append(do_something_with(value, colname))
This reads every row exactly once, and does something with each column but the first two of each row.
From a comment, what you actually want to do is find the maximum value for each column. So, you do need to iterate over the columns, and then, within each column, iterate over the rows.
A csv.reader is an iterator, which means you can only iterate over it once. So, if you just do this the obvious way, it won't work:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        maxes[colname] = max(float(row[col_index]) for row in reader)
The first column will read whatever's left after reading the header, which is good. The next column will read whatever's left after reading the whole file, which is nothing.
So, how can you fix this?
One way is to re-create the iterator each time through the outer loop:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        with open(file) as f:  # re-open the file for each column
            reader = csv.reader(f)
            next(reader)  # skip the header row
            maxes[colname] = max(float(row[col_index]) for row in reader)
The problem with this is that you're reading the whole file N times, and reading the file off disk is probably by far the slowest thing your program does.
What you were attempting to do with f.seek(0) is a trick that depends on how file objects and csv.reader objects work. While file objects are iterators, they're special, in that they have a way to reset them to the beginning (or to save a position and return to it later). And csv.reader objects are basically simple wrappers around file objects, so if you reset the file, you also reset the reader. (It's not clear that this is guaranteed to work, but if you know how csv works, you can probably convince yourself that in practice it's safe.) So:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        f.seek(0)  # rewind the file, which also resets the reader
        next(reader)  # skip the header row again
        maxes[colname] = max(float(row[col_index]) for row in reader)
This saves you the cost of closing and opening the file each time, but that's not the expensive part; you're still doing the disk reads over and over. And now anyone reading your code has to understand the trick with using file objects as iterators but resetting them, or they won't know how your code works.
So, how can you avoid that?
In general, whenever you need to make multiple passes over an iterator, there are two options. The simple solution is to copy the iterator into a reusable iterable, like a list:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)  # copy the rest of the file into a list
    for col_index, colname in enumerate(header[2:], start=2):
        maxes[colname] = max(float(row[col_index]) for row in rows)
This is not only much simpler than the earlier code, it's also much faster, unless the file is huge. By storing all of the rows in a list, you're reading the whole file into memory at once. If it's too big to fit, your program will fail. Or, worse, if it fits, but only by using virtual memory, your program will swap parts of it in and out of memory every time you go through the loop, thrashing your swap file and making everything slow to a crawl.
The other alternative is to reorganize things so you only have to make one pass. This means you have to put the loop over the rows on the outside, and the loop over the columns on the inside. It requires rethinking the design a bit, and it means you can't just use the simple max function, but the tradeoff is probably worth it:
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    maxes = {colname: float('-inf') for colname in header[2:]}
    for row in reader:
        for col_index, colname in enumerate(header[2:], start=2):
            maxes[colname] = max(maxes[colname], float(row[col_index]))
You can simplify this even further—e.g., use a Counter instead of a plain dict, and a DictReader instead of a plain reader—but it's already simple, readable, and efficient as-is.
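For illustration, here is a sketch of the DictReader version (an assumption: every column after the first two holds numeric data; dict.get with a -inf default stands in for the Counter, since Counter's default of 0 would be wrong for all-negative columns):
import csv

maxes = {}
with open(file, newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        for colname in reader.fieldnames[2:]:
            # start each column from -inf the first time it is seen
            maxes[colname] = max(maxes.get(colname, float('-inf')),
                                 float(row[colname]))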

In my case, I had accidentally removed the data (there was no data) in the .csv, so I got this message. So make sure to check whether there is data in the .csv file or not.

Why didn't you write:
header = next(reader)
in the last line as well? I don't know if this is your problem, but I would start there.


Python Function whose arguments specify the lines to retrieve from CSV

Hello there community!
I've been struggling with this function for a while and I cannot seem to make it work.
I need a function that reads a csv file and takes as arguments: the csv file, and a range of first to last line to be read. I've looked into many threads but none seem to work in my case.
My current function is based on the answer to this post: How to read specific lines of a large csv file
It looks like this:
def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open("./2020.csv", "r") as csvfile:
        for line_number, row in enumerate(csvfile):
            if line_number in lines_set:
                # extract_data is a previously created function that fetches specific columns from the csv
                result.append(extract_data(csvfile.readline()))
    return result
What's happening is it's skipping a row every time, meaning that instead of reading, for example, lines 1 to 4, it reads these four lines: 1, 3, 5 and 7.
Additional question: the csv file has a header line. How can I use next() so that the header is never included?
Thank you very much for all your help!
I recommend you use the CSV reader; it will save you from getting incomplete rows, since a single row can span many lines.
That said, your basic problem is manually iterating your reader inside the for loop, which is already iterating the reader automatically:
import csv

def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open(file, "r") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # only if the CSV has a header AND it should not be counted
        for line_number, row in enumerate(reader):
            if line_number in lines_set:
                result.append(extract_data(row))
    return result
Also, keep in mind that enumerate() starts at 0 by default; pass the start=1 option (or whatever start value you need) to count correctly. Then, range() has a non-inclusive end, so you might want last + 1.
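Applied to the function above, those two fixes might look like this (a sketch, assuming first and last are meant to be 1-based and inclusive):
import csv

def lines(file, first, last):
    lines_set = range(first, last + 1)  # range() excludes the end, so add 1
    result = []
    with open(file, "r") as csv_file:
        reader = csv.reader(csv_file)
        next(reader, None)  # skip the header
        for line_number, row in enumerate(reader, start=1):  # count from 1
            if line_number in lines_set:
                result.append(extract_data(row))
    return result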
Or... do away with range() entirely since < and > are sufficient, and maybe clearer:
import csv

start = 3
end = 6
with open('input.csv') as file:
    reader = csv.reader(file)
    next(reader)  # skip header row
    for i, row in enumerate(reader, start=1):
        if i < start or i > end:
            continue
        print(row)

Python list from reader

New to Python so help much appreciated. Trying to create a list from a csv file but I get back an empty list. The code is:
import csv

f = open('f04.csv', 'r')
fs = csv.reader(f)
for row in fs:
    print(row)
print("\n create list:")
rawlist = list(fs)
print(rawlist)
The output is
['0.161872459', '0.007599374', '0.356000303']
['0.252876406', '0.174271357', '0.186588529']
['0.002431561', '0.304828772', '0.413916445']
['0.317126652', '0.205842315', '0.007455314']
['0.035314146', '0.033238938', '0.037165931']
['0.701855497', '0.173086947', '0.040612888']
['0.576945162', '0.846018396', '0.046673223']
['0.170591115', '0.587155632', '0.177219448']
['0.83711092', '0.419058334', '0.829849144']
create list:
[]
As you can see, the reader works, but when I try to create a list it comes back empty. Thanks in advance.
Iterators such as files can only be iterated once. You need to re-initialize them to be able to restart reading from the beginning. Try:
import csv

f = open('f04.csv', 'r')
fs = csv.reader(f)
rawlist = []
for row in fs:
    print(row)
    rawlist.append(row)
print(rawlist)
You can only move through the iterator once. You can, however, reset the iterator by using seek (see the docs: https://docs.python.org/3/library/io.html#io.IOBase.seek).
Your example would then only need one additional line of code, becoming:
import csv

f = open('f04.csv', 'r')
fs = csv.reader(f)
for row in fs:
    print(row)
f.seek(0)  # rewind the file, which also resets the reader
print("\n create list:")
rawlist = list(fs)
print(rawlist)
Files cannot be iterated twice without rewinding, and csv.reader returns an iterator that cannot be iterated twice at all.
The easy and "right" way is to load the whole file into memory:
import csv

with open('f04.csv') as f:  # the with statement ensures the file is closed right away
    rawlist = list(csv.reader(f))
# from here on, `rawlist` contains everything you had in the file
for row in rawlist:
    print(row)
print(rawlist)
This will work as long as your file is not huge.
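If the file is huge, the alternative is to do all the work in a single pass so nothing has to be kept in memory. A minimal sketch (process_row is a hypothetical placeholder for whatever you do with each row):
import csv

with open('f04.csv') as f:
    for row in csv.reader(f):
        process_row(row)  # hypothetical: handle each row as it is read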

How to skip the first iteration's elements?

with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        pass
How can I skip the first iteration's elements in my example?
Example data:
000, 000, 000,
666, 444, 555,
111, 111, 111,
I need to skip 000, 000, 000,
Use next to skip a row:
with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    next(reader, None)  # read and skip the first line
    for row in reader:
        pass
Be careful to give the second argument to next, which is what it returns if the iterator is already empty (in the case of an empty file), even though you don't care about its value. Otherwise StopIteration will be raised, which can lead to really unexpected behaviour if the function this code lives in happens to be called from inside a generator expression somewhere else...
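To illustrate the difference (assuming Python 3.7+, where PEP 479 turns a StopIteration that escapes inside a generator into a RuntimeError):
import csv

def rows_without_header(path):  # hypothetical helper for illustration
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # on an empty file this raises StopIteration,
                      # which becomes a RuntimeError inside this generator
        yield from reader

def rows_without_header_safe(path):
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # returns None on an empty file; no exception
        yield from reader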
You can just call next() outside the for loop to skip the line:
with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)  # reader.next() in Python 2
    for row in reader:
        doSomething()
Try reading the rows into a list first (a csv.reader has no len() and cannot be indexed):
rows = list(reader)
for i in range(1, len(rows)):
    row = rows[i]
I assume that your files are not extremely large. If that assumption is wrong, forget about this solution. If it is right, the best solution, from the semantic point of view, in my opinion is:
with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    rows = list(reader)
for r in rows[1:]:
    do_stuff_with_row(r)
This separates your task into two steps:
reading the file, building the list of rows
iterating through a sub-set of the rows
It is not the most memory-efficient solution, but in most application scenarios this does not matter at all. If memory-efficiency is of interest to you, use a next()-based solution. Note that this approach here can be improved using itertools, especially itertools.islice(iterable, start, stop[, step]).
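For example, a sketch of the islice-based version, which skips the first row without building the full list:
import csv
from itertools import islice

with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    for r in islice(reader, 1, None):  # skip row 0, yield the rest lazily
        do_stuff_with_row(r)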
try this:
with open(args[0]) as f:
    index = 0
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        index += 1
        if index == 1:
            continue  # skip the first row
        # do something

Trouble with Python order of operations/loop

I have some code that is meant to convert CSV files into tab delimited files. My problem is that I cannot figure out how to write the correct values in the correct order. Here is my code:
import csv
import os

for file in import_dir:
    data = csv.reader(open(file))
    fields = next(data)
    new_file = export_dir + os.path.basename(file)
    tab_file = open(new_file, 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write(item['name'] + '\t' + item['order_num']...)
        tab_file.write('\n' + item['amt_due'] + '\t' + item['due_date']...)
Now, since both my write statements are in the for row in data loop, my headers are being written multiple times over. If I outdent the first write statement, I'll have an obvious formatting error. If I move the second write statement above the first and then outdent, my data will be out of order. What can I do to make sure that the first write statement gets written once as a header, and the second gets written for each line in the CSV file? How do I extract the first 'write' statement outside of the loop without breaking the dictionary? Thanks!
The csv module contains methods for writing as well as reading, making this pretty trivial:
import csv

with open("test.csv", newline="") as file, open("test_tab.csv", "w", newline="") as out:
    reader = csv.reader(file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    for row in reader:
        writer.writerow(row)
No need to do it all yourself. Note my use of the with statement, which should always be used when working with files in Python.
Edit: Naturally, if you want to select specific values, you can do that easily enough. You appear to be making your own dictionary to select the values - again, the csv module provides DictReader to do that for you:
import csv

with open("test.csv", newline="") as file, open("test_tab.csv", "w", newline="") as out:
    reader = csv.DictReader(file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    for row in reader:
        writer.writerow([row["name"], row["order_num"], ...])
As kirelagin points out in the comments, writer.writerows() could also be used, here with a generator expression:
writer.writerows([row["name"], row["order_num"], ...] for row in reader)
Extract the code that writes the headers outside the main loop, in such a way that it only gets written exactly once at the beginning.
Also, consider using the CSV module for writing CSV files (not just for reading), don't reinvent the wheel!
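A sketch of that structure, with the header written exactly once before the row loop (the column names come from the question, which elides the full list, so treat them as illustrative):
import csv

with open("test.csv", newline="") as in_file, open("test_tab.csv", "w", newline="") as out:
    reader = csv.DictReader(in_file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    writer.writerow(["name", "order_num"])  # header row, written once
    for row in reader:
        writer.writerow([row["name"], row["order_num"]])  # one line per data row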
OK, so I figured it out, but it's not the most elegant solution. Basically, I just ran the first loop, wrote to the file, then ran it a second time and appended the results. See my code below. I would love any input on a better way to accomplish what I've done here. Thanks!
for file in import_dir:
    data = csv.reader(open(file))
    fields = next(data)
    new_file = export_dir + os.path.basename(file)
    tab_file = open(new_file, 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write(item['name'] + '\t' + item['order_num']...)
    tab_file.close()
for file in import_dir:
    data = csv.reader(open(file))
    fields = next(data)
    new_file = export_dir + os.path.basename(file)
    tab_file = open(new_file, 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write('\n' + item['amt_due'] + '\t' + item['due_date']...)
    tab_file.close()
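For what it's worth, the two passes can be collapsed into one read by buffering the two output sections (a sketch; the columns elided in the question stay elided, and the in-memory buffering is an assumption):
import csv
import os

for file in import_dir:
    with open(file) as in_f:
        data = csv.reader(in_f)
        fields = next(data)
        first_part, second_part = [], []
        for row in data:
            item = {name: value.strip() for name, value in zip(fields, row)}
            first_part.append(item['name'] + '\t' + item['order_num'])  # further columns elided
            second_part.append('\n' + item['amt_due'] + '\t' + item['due_date'])  # further columns elided
    with open(export_dir + os.path.basename(file), 'a+') as tab_file:
        tab_file.write(''.join(first_part + second_part))
This writes the same output as the two-loop version but reads each input file only once.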

Have csv.reader tell when it is on the last line

Apparently some CSV output implementation somewhere truncates field separators from the right on the last row, and only the last row in the file, when the fields are null.
Example input csv, fields 'c' and 'd' are nullable:
a|b|c|d
1|2||
1|2|3|4
3|4||
2|3
In something like the script below, how can I tell whether I am on the last line so I know how to handle it appropriately?
import csv

reader = csv.reader(open('somefile.csv'), delimiter='|', quotechar=None)
header = next(reader)
for line_num, row in enumerate(reader):
    assert len(row) == len(header)
    ....
Basically you only know you've run out after you've run out. So you could wrap the reader iterator, e.g. as follows:
def isLast(itr):
    old = next(itr)
    for new in itr:
        yield False, old
        old = new
    yield True, old
and change your code to:
for line_num, (is_last, row) in enumerate(isLast(reader)):
    if not is_last: assert len(row) == len(header)
etc.
I am aware it is an old question, but I came up with a different answer than the ones presented. The reader object already increments the line_num attribute as you iterate through it. I first get the total number of lines using row_count, then compare it with line_num.
import csv

def row_count(filename):
    with open(filename) as in_file:
        return sum(1 for _ in in_file)

in_filename = 'somefile.csv'
reader = csv.reader(open(in_filename), delimiter='|')
last_line_number = row_count(in_filename)
for row in reader:
    if last_line_number == reader.line_num:
        print("It is the last line: %s" % row)
If you have an expectation of a fixed number of columns in each row, then you should be defensive against:
(1) ANY row being shorter -- e.g. a writer (SQL Server / Query Analyzer IIRC) may omit trailing NULLs at random; users may fiddle with the file using a text editor, including leaving blank lines.
(2) ANY row being longer -- e.g. commas not quoted properly.
You don't need any fancy tricks. Just an old-fashioned if-test in your row-reading loop:
for row in csv.reader(...):
    ncols = len(row)
    if ncols != expected_cols:
        appropriate_action()
If you want to get exactly the last row, try this code:
with open("\\".join([myPath, files]), 'r') as f:
    lines = f.readlines()  # read once; a second readlines() would return nothing
print(lines[-1])  # or your own manipulations
If you want to continue working with values from the row, do the following:
lines[-1].split(",")[0]  # this lets you get columns by their index
Just extend the row to the length of the header:
for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    ...
Could you not just catch the error when the csv reader reads the last line? Something like:
try:
    ...  # do your stuff here
except StopIteration:
    pass
See the following Stack Overflow question for an example of how to use try/except: Python CSV DictReader/Writer issues
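A sketch of that pattern; it only makes sense if you drive the reader manually with next() instead of a for loop:
import csv

with open('somefile.csv', newline='') as f:
    reader = csv.reader(f, delimiter='|')
    header = next(reader)
    try:
        while True:
            row = next(reader)
            # ... do your stuff here ...
    except StopIteration:
        pass  # the reader has run out of lines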
If you use for row in reader:, it will just stop the loop after the last item has been read.
