Apparently some csv output implementation somewhere truncates field separators from the right on the last row and only the last row in the file when the fields are null.
Example input csv, fields 'c' and 'd' are nullable:
a|b|c|d
1|2||
1|2|3|4
3|4||
2|3
In something like the script below, how can I tell whether I am on the last line so I know how to handle it appropriately?
import csv
reader = csv.reader(open('somefile.csv'), delimiter='|', quotechar=None)
header = next(reader)
for line_num, row in enumerate(reader):
    assert len(row) == len(header)
    ....
Basically you only know you've run out after you've run out. So you could wrap the reader iterator, e.g. as follows:
def isLast(itr):
    old = next(itr)
    for new in itr:
        yield False, old
        old = new
    yield True, old
and change your code to:
for line_num, (is_last, row) in enumerate(isLast(reader)):
    if not is_last: assert len(row) == len(header)
etc.
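For reference, here is a minimal Python 3 sketch of the same look-ahead idea. The helper name with_last_flag and the in-memory sample are made up for illustration, and padding the truncated final row is just one possible way to "handle it appropriately":

```python
import csv
import io


def with_last_flag(iterable):
    """Yield (is_last, item) pairs by looking one item ahead."""
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for current in iterator:
        yield False, previous
        previous = current
    yield True, previous


# Sample data from the question; the final row lost its trailing separators.
sample = io.StringIO("a|b|c|d\n1|2||\n1|2|3|4\n3|4||\n2|3\n")
reader = csv.reader(sample, delimiter="|")
header = next(reader)

rows = []
for is_last, row in with_last_flag(reader):
    if is_last and len(row) < len(header):
        # Pad only the truncated final row with empty fields.
        row = row + [""] * (len(header) - len(row))
    assert len(row) == len(header)
    rows.append(row)

print(rows[-1])
```

Catching StopIteration inside the helper keeps it safe on an empty file, where a bare next() inside a generator would surface as a RuntimeError on recent Python versions.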
I am aware this is an old question, but I came up with a different answer from the ones presented. The reader object already increments its line_num attribute as you iterate through it. So I first get the total number of lines using row_count, then compare it with line_num.
import csv
def row_count(filename):
    with open(filename) as in_file:
        return sum(1 for _ in in_file)
in_filename = 'somefile.csv'
reader = csv.reader(open(in_filename), delimiter='|')
last_line_number = row_count(in_filename)
for row in reader:
    if last_line_number == reader.line_num:
        print("It is the last line: %s" % row)
If you have an expectation of a fixed number of columns in each row, then you should be defensive against:
(1) ANY row being shorter -- e.g. a writer (SQL Server / Query Analyzer IIRC) may omit trailing NULLs at random; users may fiddle with the file using a text editor, including leaving blank lines.
(2) ANY row being longer -- e.g. commas not quoted properly.
You don't need any fancy tricks. Just an old-fashioned if-test in your row-reading loop:
for row in csv.reader(...):
    ncols = len(row)
    if ncols != expected_cols:
        appropriate_action()
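As a rough illustration of that if-test, the sketch below collects every row whose length is off, together with its line number, instead of failing on the first one. The sample data and the names good_rows and problems are invented for the example; what counts as an "appropriate action" is up to you.

```python
import csv
import io

# Hypothetical sample: one short row, one blank line, one long row.
sample = io.StringIO("a,b,c\n1,2,3\n4,5\n\n6,7,8,9\n")
reader = csv.reader(sample)
expected_cols = len(next(reader))

good_rows, problems = [], []
for row in reader:
    if len(row) != expected_cols:
        # reader.line_num is the physical line number of the row just read.
        problems.append((reader.line_num, row))
        continue
    good_rows.append(row)

print(problems)
```

Note that csv.reader returns a blank line as an empty list, so it is caught by the same length check.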
If you want to get exactly the last row, try this code:
with open("\\".join([myPath,files]), 'r') as f:
    print(f.readlines()[-1])  # or your own manipulations
If you want to continue working with values from row do the following:
f.readlines()[-1].split(",")[0] #this would let you get columns by their index
Just extend the row to the length of the header:
for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    ...
Could you not just catch the StopIteration exception that the csv reader raises once it has passed the last line, in a
try:
    ... do your stuff here...
except StopIteration:
    ... handle the end of the file here...
condition?
See the following Python code on Stack Overflow for an example of how to use try/except: Python CSV DictReader/Writer issues
If you use for row in reader:, it will just stop the loop after the last item has been read.
Hello there community!
I've been struggling with this function for a while and I cannot seem to make it work.
I need a function that reads a csv file and takes as arguments: the csv file, and a range of first to last line to be read. I've looked into many threads but none seem to work in my case.
My current function is based on the answer to this post: How to read specific lines of a large csv file
It looks like this:
def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open("./2020.csv", "r") as csvfile:
        for line_number, row in enumerate(csvfile):
            if line_number in lines_set:
                result.append(extract_data(csvfile.readline())) #this extract_data is a previously created function that fetches specific columns from the csv
    return result
What's happening is that it skips a row every time, meaning that instead of reading, for example, lines 1 to 4, it reads these four lines: 1, 3, 5 and 7.
Additional question: The csv file has a header line. How can I use the next method so that the header is never included?
Thank you very much for all your help!
I recommend you use the CSV reader; it'll save you from getting incomplete rows, since a row can span many lines.
That said, your basic problem is manually iterating your reader inside the for-loop, which is already automatically iterating your reader:
import csv
def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open("./2020.csv", "r") as csv_file:
        reader = csv.reader(csv_file)
        next(reader) # only if CSV has a header AND it should not be counted
        for line_number, row in enumerate(reader):
            if line_number in lines_set:
                result.append(extract_data(row))
    return result
Also, keep in mind that enumerate() starts at 0 by default; pass start=1 (or whatever start value you need) to count correctly. In addition, range() has a non-inclusive end, so you might want last + 1.
Or... do away with range() entirely since < and > are sufficient, and maybe clearer:
import csv
start = 3
end = 6
with open('input.csv') as file:
    reader = csv.reader(file)
    next(reader) # Skip header row
    for i, row in enumerate(reader, start=1):
        if i < start or i > end:
            continue
        print(row)
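Another option, not mentioned in the answers above, is itertools.islice, which also stops reading as soon as the last requested row has been consumed. This is only a sketch: the function name read_rows and the in-memory sample are assumptions, and it treats first and last as 1-based, inclusive data-row numbers.

```python
import csv
import io
from itertools import islice


def read_rows(csv_file, first, last):
    """Return data rows first..last (1-based, inclusive), skipping the header."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    # islice uses 0-based, end-exclusive positions.
    return list(islice(reader, first - 1, last))


sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
print(read_rows(sample, 2, 4))
```

With a real file you would pass an object opened with open(path, newline="") instead of the StringIO sample.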
Hope you can help. I'm trying to iterate over a .csv file and delete rows where the first character of the first item is a #.
Whilst my code does indeed delete the necessary rows, I'm being presented with a "string index out of range" error.
My code is as below:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if (row[0][0]) != '#':
        writer.writerow(row)
input.close()
output.close()
As far as I can tell, I have no empty rows that I'm trying to iterate over.
Check if the string is empty with if row[0] before trying to index:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if row[0] and row[0][0] != '#': # here
        writer.writerow(row)
input.close()
output.close()
Or simply use if not row[0].startswith('#') as your condition; startswith is safe on an empty string.
You are likely running into an empty string.
Perhaps try
if row and row[0] and row[0][0] != '#':
Then why not make sure you never bump into any of those, even if they exist, by first checking whether the row is empty, like so:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if row:
        if (row[0][0]) != '#':
            writer.writerow(row)
    else:
        continue
input.close()
output.close()
Also when working with *.csv files it is good to have a look at them in a text editor to make sure the delimiters and end_of_line characters are like you think they are. The sniffer is also a good read.
Cheers
Why not provide working code (imports included) and wrap as usual the physical resources in context managers?
Like so:
#! /usr/bin/env python
"""Only strip those rows that start with hash (#)."""
import csv
IN_F_PATH = "/home/stephen/Desktop/paths_output.csv"
OUT_F_PATH = "/home/stephen/Desktop/paths_output2.csv"
with open(IN_F_PATH, 'rb') as i_f, open(OUT_F_PATH, "wb") as o_f:
    writer = csv.writer(o_f)
    for row in csv.reader(i_f):
        if row and row[0].startswith('#'):
            continue
        writer.writerow(row)
Some notes:
The files are closed automatically on leaving the context blocks,
the names are better chosen, as input is the name of a built-in function ...
you may want to keep empty lines; I only read in the question that you want to strip comment lines, so the code detects these and continues.
it is row[0] that holds the first column's string, and checking whether it starts with # maps naturally onto the string method startswith.
In case you also want to strip empty lines, then one could use the following condition to continue instead:
if not row or row and row[0].startswith('#'):
and you should be ready to go.
HTH
To answer a comment on the code line above, which also causes blank input "lines" to be skipped:
In Python, boolean expressions are evaluated left to right with short-circuiting (lazy evaluation), so:
>>> row = ["#", "42"]
>>> if not row or row and row[0].startswith("#"):
...     print("Empty or comment!")
...
Empty or comment!
>>> row = []
>>> if not row or row and row[0].startswith("#"):
...     print("Empty or comment!")
...
Empty or comment!
I suspect that there are lines with an empty first cell, so row[0][0] tries to access the first character of the empty string.
You should try:
for row in csv.reader(input):
    if not row[0].startswith('#'):
        writer.writerow(row)
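Pulling the suggestions in this thread together, a Python 3 version might look roughly like the sketch below. The function name copy_without_comments and the in-memory sample are made up for illustration; with real files you would open both with newline="" in text mode rather than 'rb'/'wb'.

```python
import csv
import io


def copy_without_comments(src, dst):
    """Copy CSV rows from src to dst, dropping rows whose first cell starts with '#'."""
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # `row and ...` guards against blank lines, which csv.reader yields as [].
        # startswith() is safe on an empty first cell, unlike row[0][0].
        if row and row[0].startswith("#"):
            continue
        writer.writerow(row)


src = io.StringIO("#comment,x\n,empty first cell\nkeep,me\n")
dst = io.StringIO(newline="")
copy_without_comments(src, dst)
print(dst.getvalue())
```

The row with an empty first cell is kept here; whether that is what you want depends on your data.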
I have a csv file in which I need to add a zero in front of a number if it has fewer than 4 digits.
I only have to update a particular column:
import csv
f = open('csvpatpos.csv')
csv_f = csv.reader(f)
for row in csv_f:
    print row[5]
then I want to parse through that row and add a 0 to the front of any number that is shorter than 4 digits. And then input it into a new csv file with the adjusted data.
You want to use string formatting for these things:
>>> '{:04}'.format(99)
'0099'
Format String Syntax documentation
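One caveat worth spelling out: the csv module hands you strings, not integers, and '{:04}' applied directly to a string does not zero-pad on the left (depending on the Python version it either raises ValueError or pads on the wrong side). A small sketch of three ways around that, using a made-up value:

```python
value = "99"  # csv.reader yields strings, not ints

as_int = "{:04}".format(int(value))  # convert first, then use numeric formatting
aligned = "{:0>4}".format(value)     # explicit fill character and right alignment
filled = value.zfill(4)              # str.zfill pads digit strings with zeros

print(as_int, aligned, filled)
```

The int() route fails on non-numeric cells, so the string-based variants are usually more forgiving for CSV data.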
When you think about parsing, you either need to think about regex or pyparsing. In this case, regex would perform the parsing quite easily.
But that's not all, once you are able to parse the numbers, you need to zero fill it. For that purpose, you need to use str.format for padding and justifying the string accordingly.
Consider your string
st = "parse through that row and add a 0 to the front of any number that is shorter than 4 digits."
In the above lines, you can do something like
Implementation
import re
parts = re.split(r"(\d+)", st)
''.join("{:0>4}".format(elem) if elem.isdigit() else elem for elem in parts)
Output
'parse through that row and add a 0000 to the front of any number that is shorter than 0004 digits.'
The following code will read in the given csv file, iterate through each row and each item in each row, and output it to a new csv file.
import csv
import os
f = open('csvpatpos.csv')
# open temp .csv file for output
out = open('csvtemp.csv','w')
csv_f = csv.reader(f)
for row in csv_f:
    # create a temporary list for this row
    temp_row = []
    # iterate through all of the items in the row
    for item in row:
        # add the zero filled value of each temporary item to the list
        temp_row.append(item.zfill(4))
    # join the current temporary list with commas and write it to the out file
    out.write(','.join(temp_row) + '\n')
out.close()
f.close()
Your results will be in csvtemp.csv. If you want to save the data with the original filename, just add the following code to the end of the script
# remove original file
os.remove('csvpatpos.csv')
# rename temp file to original file name
os.rename('csvtemp.csv','csvpatpos.csv')
Pythonic Version
The code above is very verbose in order to make it understandable. Here is the code refactored to make it more Pythonic:
import csv
new_rows = []
with open('csvpatpos.csv','r') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        row = [ x.zfill(4) for x in row ]
        new_rows.append(row)
with open('csvpatpos.csv','wb') as f:
    csv_f = csv.writer(f)
    csv_f.writerows(new_rows)
Will leave you with two hints:
s = "486"
s.isdigit() == True
for finding what things are numbers.
And
s = "486"
s.zfill(4) == "0486"
for filling in zeroes.
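Putting the two hints together, a Python 3 sketch that pads only one column, and only when the cell is all digits, could look like this. The helper name pad_column and the sample data are assumptions; the question pads column 5, while the sample below pads column 1 to stay short.

```python
import csv
import io


def pad_column(rows, index, width=4):
    """Yield rows with the cell at `index` zero-padded when it is all digits."""
    for row in rows:
        row = list(row)
        if index < len(row) and row[index].isdigit():
            row[index] = row[index].zfill(width)
        yield row


# Hypothetical data: a header, a short number, a long number, a non-number.
src = io.StringIO("name,code\nalpha,7\nbeta,1234\ngamma,n/a\n")
dst = io.StringIO(newline="")
csv.writer(dst).writerows(pad_column(csv.reader(src), index=1))
print(dst.getvalue())
```

The isdigit() check leaves the header and non-numeric cells untouched, which a blanket zfill over every item would not.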
I have a CSV file, which I created using an HTML export from a Check Point firewall policy.
Each rule is represented as several lines, in some cases. That occurs when a rule has several address sources, destinations or services.
I need the output to have each rule described in only one line.
It's easy to distinguish when each rule begins. In the first column, there's the rule ID, which is a number.
Here's an example. In green are marked the strings that should be moved:
http://i.imgur.com/i785sDi.jpg
Let me show you an example:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp;accept;
;;;;igmp;;
2;Testing;fwgcluster;fwgcluster;FireWall;accept;
;;fwmgmpe;fwmgmpe;ssh;;
;;fwmgm;fwmgm;;;
What I need, explained in pseudocode, is this:
Read the first column of the next line. If there's a number:
    Evaluate the first column of the next line. If there's no number there, concatenate (separating with a comma) \
    the strings in the columns of this line with the last one and eliminate the text in the current one
The output should be something like this:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp-igmp;accept;
;;;;;;
2;Testing;fwgcluster-fwmgmpe-fwmgm;fwgcluster-fwmgmpe-fwmgm;FireWall-ssh;accept;
;;;;;;
The empty lines are there only to be more clear, I don't actually need them.
Thanks!
This should get you started
import csv
with open('data.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';')
    for r in reader:
        print r
EDIT: Given your required output, this should get you nearly there. It's a bit crude but does the majority of what you need. It checks the 'NO.' key, and if it has a value it starts a new record. If not, it joins any other data in the row with the equivalent data in the record. Finally, when a new record is created the old one is appended to the result; this also happens at the end to catch the last item.
import csv
result, record = [], None
with open('data2.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';', lineterminator='\n')
    for r in reader:
        if r['NO.']:
            if record:
                result.append(record)
            record = r
        else:
            for key in r.keys():
                if r[key]:
                    record[key] = '-'.join([record[key], r[key]])
    if record:
        result.append(record)
print result
Graeme, thanks again, just before your edit I solved it with the following code.
But you got me looking in the right direction!
If anyone needs it, here it is:
import csv
# adjust these 3 lines
WRITE_EMPTIES = False
INFILE = "input.csv"
OUTFILE = "output.csv"
with open(INFILE, "r") as in_file:
    r = csv.reader(in_file, delimiter=";")
    with open(OUTFILE, "wb") as out_file:
        previous = None
        empties_to_write = 0
        out_writer = csv.writer(out_file, delimiter=";")
        for i, row in enumerate(r):
            first_val = row[0].strip()
            if first_val:
                if previous:
                    out_writer.writerow(previous)
                    if WRITE_EMPTIES and empties_to_write:
                        out_writer.writerows(
                            [["" for _ in previous]] * empties_to_write
                        )
                    empties_to_write = 0
                previous = row
            else:  # append sub-portions to each other
                previous = [
                    "|".join(
                        subitem
                        for subitem in existing.split(",") + [new]
                        if subitem
                    )
                    for existing, new in zip(previous, row)
                ]
                empties_to_write += 1
        if previous:  # take care of the last row
            out_writer.writerow(previous)
            if WRITE_EMPTIES and empties_to_write:
                out_writer.writerows(
                    [["" for _ in previous]] * empties_to_write
                )
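For anyone on Python 3, here is a compact sketch of the same merging idea that produces the '-' separated output shown in the question. The function name merge_continuations is made up, the data is the question's own example held in memory, and it assumes a continuation row never has more columns than the rule it belongs to.

```python
import csv
import io


def merge_continuations(rows, joiner="-"):
    """Fold rows with an empty first cell into the preceding numbered row."""
    merged = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        if row[0].strip() or not merged:
            merged.append(list(row))  # header or start of a new rule
            continue
        current = merged[-1]
        for i, cell in enumerate(row[:len(current)]):
            if cell:
                # Skip empty parts so no leading or doubled joiner appears.
                current[i] = joiner.join(part for part in (current[i], cell) if part)
    return merged


sample = io.StringIO(
    "NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;\n"
    "1;;fwgcluster;mcast_vrrp;vrrp;accept;\n"
    ";;;;igmp;;\n"
    "2;Testing;fwgcluster;fwgcluster;FireWall;accept;\n"
    ";;fwmgmpe;fwmgmpe;ssh;;\n"
    ";;fwmgm;fwmgm;;;\n"
)
result = merge_continuations(csv.reader(sample, delimiter=";"))
out = io.StringIO(newline="")
csv.writer(out, delimiter=";", lineterminator="\n").writerows(result)
print(out.getvalue())
```

The header row is kept because its first cell is non-empty, so it is treated like the start of a rule.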
I have csv files with unwanted first characters in the header row except the first column.
The while loop strips the first character from the headers and writes the new header row to a new file (exit by counter). The else statement then writes the rest of the rows to the new file. The problem is that the else statement begins with the header row and writes it a second time. Is there a way to have else begin on the next line without breaking the for iterator? The actual files are 21 columns by 400,000+ rows. The unwanted character is a single space, but I used * in the example below to make it easier to see. Thanks for any help!
file.csv =
a,*b,*c,*d
1,2,3,4
import csv
reader = csv.reader(open('file.csv', 'rb'))
writer = csv.writer(open('file2.csv','wb'))
count = 0
for row in reader:
    while (count <= 0):
        row[1]=row[1][1:]
        row[2]=row[2][1:]
        row[3]=row[3][1:]
        writer.writerow([row[0], row[1], row[2], row[3]])
        count = count + 1
    else:
        writer.writerow([row[0], row[1], row[2], row[3]])
If you only want to change the header and copy the remaining lines without change:
with open('file.csv', 'r') as src, open('file2.csv', 'w') as dst:
    dst.write(next(src).replace(" ", "")) # delete whitespaces from header
    dst.writelines(line for line in src)
If you want to do additional transformations you can do something like this or this question.
If all you want to do is remove spaces, you can use:
string.replace(" ", "")
Hmm... It seems like your logic might be a bit backward. A bit cleaner, I think, to check if you're on the first row first. Also, a slightly more idiomatic way to remove spaces is to use string's lstrip method with no arguments to remove leading whitespace.
Why not use enumerate and check if your row is the header?
import csv
reader = csv.reader(open('file.csv', 'rb'))
writer = csv.writer(open('file2.csv','wb'))
for i, row in enumerate(reader):
    if i == 0:
        writer.writerow([row[0],
                         row[1].lstrip(),
                         row[2].lstrip(),
                         row[3].lstrip()])
    else:
        writer.writerow([row[0], row[1], row[2], row[3]])
If you have 21 columns, you don't want to write row[0], ... , row[20]. Plus, you want to close your files after opening them. .next() gets your header. And strip() lets you flexibly remove unwanted leading and trailing characters.
import csv
file = 'file1.csv'
newfile = open('file2.csv','wb')
writer = csv.writer(newfile)
with open(file, 'rb') as f:
    reader = csv.reader(f)
    header = reader.next()
    newheader = []
    for c in header:
        newheader.append(c.strip(' '))
    writer.writerow(newheader)
    for r in reader:
        writer.writerow(r)
newfile.close()
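In Python 3 the same approach can be sketched as below. The function name copy_with_clean_header and the in-memory sample are assumptions for illustration; real files would be opened with open(path, newline="") rather than in binary mode.

```python
import csv
import io


def copy_with_clean_header(src, dst):
    """Strip whitespace from header cells, then copy the remaining rows unchanged."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to copy
    writer.writerow([name.strip() for name in header])
    writer.writerows(reader)  # the reader resumes right after the header


src = io.StringIO("a, b, c, d\n1,2,3,4\n5,6,7,8\n")
dst = io.StringIO(newline="")
copy_with_clean_header(src, dst)
print(dst.getvalue())
```

Because the header is consumed with next() before the bulk copy, it cannot be written twice, and writerows streams the remaining 400,000+ rows without holding them in memory.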