import csv

with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        pass
How can I skip the first row (the first iteration) in my example?
Example data:
000, 000, 000,
666, 444, 555,
111, 111, 111,
I need to skip the row 000, 000, 000,
Use next() to skip a row:
with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    next(reader, None)  # read and skip the first line
    for row in reader:
        pass
Be careful to pass the second argument to next(): it is what next() returns if the iterator is already empty (in case of an empty file), even though you don't care about the value. Otherwise StopIteration will be raised, which can lead to really unexpected behaviour if the function this code lives in happens to be called from inside a generator expression somewhere else...
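A quick illustration of the difference, as a minimal sketch using an in-memory file:

import csv
import io

reader = csv.reader(io.StringIO(''))  # an empty "file"
print(next(reader, None))  # prints None instead of raising
# next(reader)  # this would raise StopIteration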
You can just call next() outside the for loop to skip the line:
with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)
    for row in reader:
        doSomething()
Or try reading everything into a list so the rows can be indexed, starting from 1 to skip the first row:

with open(args[0]) as f:
    rows = list(csv.reader(f, delimiter=','))
for i in range(1, len(rows)):
    row = rows[i]
I assume that your files are not extremely large. If that assumption is wrong, forget about this solution. If it is right, the best solution, from the semantic point of view, in my opinion is:
with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    rows = list(reader)
for r in rows[1:]:
    do_stuff_with_row(r)
This separates your task into two steps:
- reading the file and building the list of rows
- iterating through a subset of the rows
It is not the most memory-efficient solution, but in most application scenarios that does not matter at all. If memory efficiency is of interest to you, use a next()-based solution. Note that this approach can be improved using itertools, especially itertools.islice(iterable, start, stop[, step]).
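For example, a sketch of an islice-based variant that skips the first row without reading the whole file into memory (do_stuff_with_row stands in for your own processing, as above):

import csv
from itertools import islice

with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    for r in islice(reader, 1, None):  # start at row 1, i.e. skip row 0
        do_stuff_with_row(r)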
Try this:

with open(args[0]) as f:
    reader = csv.reader(f, delimiter=',')
    index = 0
    for row in reader:
        index += 1
        if index == 1:
            continue  # skip the first row
        # do something with row
Hello there community!
I've been struggling with this function for a while and I cannot seem to make it work.
I need a function that reads a csv file and takes as arguments: the csv file, and a range of first to last line to be read. I've looked into many threads but none seem to work in my case.
My current function is based on the answer to this post: How to read specific lines of a large csv file
It looks like this:
def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open("./2020.csv", "r") as csvfile:
        for line_number, row in enumerate(csvfile):
            if line_number in lines_set:
                result.append(extract_data(csvfile.readline()))  # extract_data is a previously created function that fetches specific columns from the csv
    return result
What's happening is that it skips a row every time, meaning that instead of reading, for example, lines 1 to 4, it reads these four lines: 1, 3, 5 and 7.
Additional question: the csv file has a header line. How can I use next() so that the header is never included?
Thank you very much for all your help!
I recommend you use the CSV reader; it will save you from getting an incomplete row, since a single row can span many lines.
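For example, a quoted field may legally contain a newline; the csv module keeps it inside one row, while naive line-by-line reading would split it (a minimal sketch with inline data):

import csv
import io

data = 'name,notes\nwidget,"line one\nline two"\n'
for row in csv.reader(io.StringIO(data)):
    print(row)  # the quoted newline stays inside a single row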
That said, your basic problem is manually advancing the file with csvfile.readline() inside the for loop, which is already iterating over the file, so every other line gets skipped:
import csv

def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open("./2020.csv", "r") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # only if the CSV has a header AND it should not be counted
        for line_number, row in enumerate(reader):
            if line_number in lines_set:
                result.append(extract_data(row))
    return result
Also, keep in mind that enumerate() starts at 0 by default; pass the start=1 option (or whatever start value you need) to count the way you expect. Then, range() has a non-inclusive end, so you may want last + 1.
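Putting those two fixes together, a corrected version of the function might look like this (a sketch, assuming first and last are meant to be 1-based and inclusive, and keeping the question's extract_data helper):

import csv

def lines(file, first, last):
    lines_set = range(first, last + 1)  # inclusive end
    result = []
    with open("./2020.csv", "r") as csv_file:
        reader = csv.reader(csv_file)
        next(reader, None)  # skip the header row safely
        for line_number, row in enumerate(reader, start=1):
            if line_number in lines_set:
                result.append(extract_data(row))
    return result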
Or... do away with range() entirely since < and > are sufficient, and maybe clearer:
import csv

start = 3
end = 6

with open('input.csv') as file:
    reader = csv.reader(file)
    next(reader)  # skip header row
    for i, row in enumerate(reader, start=1):
        if i < start or i > end:
            continue
        print(row)
I am trying to do the following:
reader = csv.DictReader(open(self.file_path), delimiter='|')
reader_length = sum(1 for item in reader)
for line in reader:
    print(line)
However, the reader_length line makes the reader itself unreadable afterwards. Note that I do not want to call list() on the reader, as the file is too big to read entirely into memory on my machine.
Use enumerate with a start value of 1; when you get to the end of the file you will have the line count:
for count, line in enumerate(reader, 1):
    # do work
    print(count)
Or, if you need the count at the start for some reason, sum using a generator expression and seek back to the start of the file:
with open(self.file_path) as f:
    reader = csv.DictReader(f, delimiter='|')
    count = sum(1 for _ in reader)
    f.seek(0)
    reader = csv.DictReader(f, delimiter='|')
    for line in reader:
        print(line)
reader = list(csv.DictReader(open(self.file_path), delimiter='|'))
print(len(reader))

is one way to do this, I suppose.
Another way to do it would be:
reader = csv.DictReader(open(self.file_path), delimiter='|')
for i, row in enumerate(reader):
    ...
num_rows = i + 1
I am writing a program to read a csv file. I have created a reader object, and calling next() on it gives me the header row. But when I call it again it gives a StopIteration error, although there are rows left in the csv file. When I do file.seek(0) first, it works fine. Can anyone explain this to me? A snapshot of the code is given below:
with open(file, 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    result = []
    for colname in header[2:]:
        col_index = header.index(colname)
        # f.seek(0)
        next(reader)
You're calling next once for each column (except the first two). So, if you have, say, 10 columns, it's going to try to read 8 rows.
If you have 20 rows, that's not going to raise an exception, but you'll be ignoring the last 12 rows, which you probably don't want. On the other hand, if you have only 5 rows, it's going to raise when trying to read the 6th row.
The reason the f.seek(0) prevents the exception is that it resets the file back to the start before each next, so you just read the header row over and over, ignoring everything else in the file. It doesn't raise anything, but it's not doing anything useful.
What you probably wanted is something like this:
with open(file, 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    result = []
    for row in reader:
        for col_index, colname in enumerate(header[2:], start=2):
            value = row[col_index]
            result.append(do_something_with(value, colname))
This reads every row exactly once, and does something with each column but the first two of each row.
From a comment, what you actually want to do is find the maximum value for each column. So you do need to iterate over the columns, and then, within each column, you need to iterate over the rows.
A csv.reader is an iterator, which means you can only iterate over it once. So, if you just do this the obvious way, it won't work:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        maxes[colname] = max(float(row[col_index]) for row in reader)
The first column will read whatever's left after reading the header, which is good. The next column will read whatever's left after reading the whole file, which is nothing.
So, how can you fix this?
One way is to re-create the iterator each time through the outer loop:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        with open(file) as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            maxes[colname] = max(float(row[col_index]) for row in reader)
The problem with this is that you're reading the whole file N times, and reading the file off disk is probably by far the slowest thing your program does.
What you were attempting to do with f.seek(0) is a trick that depends on how file objects and csv.reader objects work. While file objects are iterators, they're special, in that they have a way to reset them to the beginning (or to save a position and return to it later). And csv.reader objects are basically simple wrappers around file objects, so if you reset the file, you also reset the reader. (It's not clear that this is guaranteed to work, but if you know how csv works, you can probably convince yourself that in practice it's safe.) So:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for col_index, colname in enumerate(header[2:], start=2):
        f.seek(0)
        next(reader)  # skip the header row again
        maxes[colname] = max(float(row[col_index]) for row in reader)
This saves you the cost of closing and opening the file each time, but that's not the expensive part; you're still doing the disk reads over and over. And now anyone reading your code has to understand the trick with using file objects as iterators but resetting them, or they won't know how your code works.
So, how can you avoid that?
In general, whenever you need to make multiple passes over an iterator, there are two options. The simple solution is to copy the iterator into a reusable iterable, like a list:
maxes = {}
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)
for col_index, colname in enumerate(header[2:], start=2):
    maxes[colname] = max(float(row[col_index]) for row in rows)
This is not only much simpler than the earlier code, it's also much faster. Unless the file is huge. By storing all of the rows in a list, you're reading the whole file into memory at once. If it's too big to fit, your program will fail. Or, worse, if it fits, but only by using virtual memory, your program will swap parts of it in and out of memory every time you go through the loop, thrashing your swapfile and making everything slow to a crawl.
The other alternative is to reorganize things so you only have to make one pass. This means you have to put the loop over the rows on the outside, and the loop over the columns on the inside. It requires rethinking the design a bit, and it means you can't just use the simple max function, but the tradeoff is probably worth it:
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    maxes = {colname: float('-inf') for colname in header[2:]}
    for row in reader:
        for col_index, colname in enumerate(header[2:], start=2):
            maxes[colname] = max(maxes[colname], float(row[col_index]))
You can simplify this even further, e.g. by using a Counter instead of a plain dict, and a DictReader instead of a plain reader, but it's already simple, readable, and efficient as-is.
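For instance, a DictReader-based variant might look like this (a sketch; it assumes, as above, that the first two columns are skipped and every remaining field parses as a float):

import csv

maxes = {}
with open(file) as f:
    reader = csv.DictReader(f)
    skipped = set(reader.fieldnames[:2])  # ignore the first two columns
    for row in reader:
        for colname, value in row.items():
            if colname not in skipped:
                maxes[colname] = max(maxes.get(colname, float('-inf')), float(value))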
In my case, I had accidentally removed the data (there was no data) in the .csv file, so I got this message. So make sure to check whether there is data in the .csv file or not.
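A simple guard for that case (a sketch using os.path.getsize, which only catches a completely empty file):

import csv
import os

path = 'somefile.csv'
if os.path.getsize(path) == 0:
    print('%s is empty' % path)
else:
    with open(path) as f:
        reader = csv.reader(f)
        header = next(reader)  # safe: the file has at least one line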
Why didn't you write:
header = next(reader)
in the last line as well? I don't know if this is your problem, but I would start there.
I've run into a strange problem while writing some code for my personal use. I'll let my code do the talking...
def getValues(self, reader):
    for row in reader:
        ...  # does stuff
    return assetName, efficiencyRating

def handleSave(self, assetName, reader):
    outputFile = open(self.outFilename, 'w')
    for row in reader:
        ...  # does other stuff
    outputFile.close()
    return

def handleCalc(self):
    reader = csv.reader(open(self.filename), delimiter=',', quotechar='"')
    assetName, efficiencyRating = self.getValues(reader)
    self.handleSave(assetName, reader)
This is just a portion of the code (obviously). The problem I'm having is in handleSave, trying to loop through reader: it doesn't appear to ever enter the loop. I'm really not sure what is happening. The loop in getValues behaves as expected.
Can someone explain what is happening? What have I done wrong? What should I do to fix this?
Once you've iterated through an iterator once, you can't iterate through it again.
One way you can solve this is before you call handleSave, rewind the file and create a new reader:
f = open(self.filename)
reader = csv.reader(f, delimiter=',', quotechar='"')
assetName, efficiencyRating = self.getValues(reader)

f.seek(0)  # rewind file
reader = csv.reader(f, delimiter=',', quotechar='"')
self.handleSave(assetName, reader)
Alternatively, you can read the data into a list:
rows = list(reader)
And then iterate through rows rather than reader.
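Another option for making two passes over an iterator is itertools.tee (not mentioned above); note that it buffers every row the slower copy hasn't seen yet, so exhausting one copy first costs about as much memory as building a list:

import csv
import itertools

f = open(self.filename)
reader = csv.reader(f, delimiter=',', quotechar='"')
first_pass, second_pass = itertools.tee(reader)
count = sum(1 for _ in first_pass)  # exhausts the first copy
for row in second_pass:             # the second copy still yields every row
    ...  # does other stuff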
As a side note, the convention in Python is for names to be lowercase, separated by underscores, rather than camel case. (e.g. get_values rather than getValues, handle_save rather than handleSave)
The csv module's reader acts on an iterator, and as you have already iterated over it once in your getValues method, it is exhausted. Unfortunately, I don't see any better method than building the reader again.
Move the csv.reader call into your methods and pass the filename in, so each method does its own
csv.reader(open(self.filename), delimiter=',', quotechar='"')
Or create a new file object each time, or seek(0) to reset it, and pass that as the argument to csv.reader. That should help.
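A sketch of that suggestion, reusing the question's names (getValues and handleSave are the question's own methods; the class name and the _make_reader helper are hypothetical):

import csv

class AssetReport:  # hypothetical class name
    def _make_reader(self):
        # Hypothetical helper: opens a fresh file object and builds a new
        # reader on every call, so no method receives an exhausted iterator.
        return csv.reader(open(self.filename), delimiter=',', quotechar='"')

    def handleCalc(self):
        assetName, efficiencyRating = self.getValues(self._make_reader())
        self.handleSave(assetName, self._make_reader())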
Apparently some csv output implementation somewhere truncates field separators from the right on the last row and only the last row in the file when the fields are null.
Example input csv, fields 'c' and 'd' are nullable:
a|b|c|d
1|2||
1|2|3|4
3|4||
2|3
In something like the script below, how can I tell whether I am on the last line so I know how to handle it appropriately?
import csv

reader = csv.reader(open('somefile.csv'), delimiter='|', quotechar=None)
header = next(reader)
for line_num, row in enumerate(reader):
    assert len(row) == len(header)
    ...
Basically you only know you've run out after you've run out. So you could wrap the reader iterator, e.g. as follows:
def isLast(itr):
    old = next(itr)
    for new in itr:
        yield False, old
        old = new
    yield True, old
and change your code to:
for line_num, (is_last, row) in enumerate(isLast(reader)):
    if not is_last:
        assert len(row) == len(header)
etc.
I am aware this is an old question, but I came up with a different answer from the ones presented. The reader object already increments its line_num attribute as you iterate through it. So I get the total number of lines first using row_count, then compare it with line_num.
import csv

def row_count(filename):
    with open(filename) as in_file:
        return sum(1 for _ in in_file)

in_filename = 'somefile.csv'
reader = csv.reader(open(in_filename), delimiter='|')
last_line_number = row_count(in_filename)
for row in reader:
    if last_line_number == reader.line_num:
        print("It is the last line: %s" % row)
If you have an expectation of a fixed number of columns in each row, then you should be defensive against:
(1) ANY row being shorter -- e.g. a writer (SQL Server / Query Analyzer IIRC) may omit trailing NULLs at random; users may fiddle with the file using a text editor, including leaving blank lines.
(2) ANY row being longer -- e.g. commas not quoted properly.
You don't need any fancy tricks. Just an old-fashioned if-test in your row-reading loop:
for row in csv.reader(...):
    ncols = len(row)
    if ncols != expected_cols:
        appropriate_action()
If you want to get exactly the last row, try this code:

with open("\\".join([myPath, files]), 'r') as f:
    last_line = f.readlines()[-1]  # or your own manipulations
print(last_line)

If you want to continue working with values from that row, do the following:

last_line.split(",")[0]  # this lets you get columns by their index
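If the file is large, readlines() pulls everything into memory just to use the last element. A deque with maxlen=1 keeps only the final row instead (a variant not in the original answer):

import collections
import csv

with open('somefile.csv') as f:
    reader = csv.reader(f, delimiter='|')
    last_row = collections.deque(reader, maxlen=1)[0]  # only the final row is retained
print(last_row)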
Just extend the row to the length of the header:

for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    ...
Could you not just catch the error when the csv reader reads past the last line, in a

try:
    ...  # do your stuff here ...
except StopIteration:
    pass

condition?
See the following Python code on Stack Overflow for an example of how to use try/except: Python CSV DictReader/Writer issues
If you use for row in reader:, it will just stop the loop after the last item has been read.
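To make the two behaviours concrete, here is a minimal sketch contrasting manual iteration, where the except clause is needed, with a for loop, where it is not:

import csv

with open('somefile.csv') as f:
    reader = csv.reader(f, delimiter='|')
    try:
        while True:
            row = next(reader)  # manual iteration: StopIteration escapes
            # ... do your stuff here ...
    except StopIteration:
        pass  # reached the end of the file

with open('somefile.csv') as f:
    for row in csv.reader(f, delimiter='|'):
        pass  # the for loop absorbs StopIteration and simply ends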