Python loops through CSV, but writes header row twice

I have CSV files with an unwanted first character in every header cell except the first column's.
The while loop strips the first character from the headers and writes the new header row to a new file (exiting via the counter). The else clause then writes the rest of the rows to the new file. The problem is that the else clause begins with the header row and writes it a second time. Is there a way to have else begin on the next line without breaking the for iterator? The actual files are 21 columns by 400,000+ rows. The unwanted character is a single space, but I used * in the example below to make it easier to see. Thanks for any help!
file.csv =
a,*b,*c,*d
1,2,3,4
import csv
reader = csv.reader(open('file.csv', 'rb'))
writer = csv.writer(open('file2.csv','wb'))
count = 0
for row in reader:
    while (count <= 0):
        row[1]=row[1][1:]
        row[2]=row[2][1:]
        row[3]=row[3][1:]
        writer.writerow([row[0], row[1], row[2], row[3]])
        count = count + 1
    else:
        writer.writerow([row[0], row[1], row[2], row[3]])

If you only want to change the header and copy the remaining lines without change:
with open('file.csv', 'r') as src, open('file2.csv', 'w') as dst:
    dst.write(next(src).replace(" ", "")) # delete whitespaces from header
    dst.writelines(line for line in src)
If you want to do additional transformations you can do something like this or this question.

If all you want to do is remove spaces, you can use:
string.replace(" ", "")

Hmm... It seems like your logic might be a bit backward. A bit cleaner, I think, to check if you're on the first row first. Also, a slightly more idiomatic way to remove spaces is to use string's lstrip method with no arguments to remove leading whitespace.
Why not use enumerate and check if your row is the header?
import csv
reader = csv.reader(open('file.csv', 'rb'))
writer = csv.writer(open('file2.csv','wb'))
for i, row in enumerate(reader):
    if i == 0:
        writer.writerow([row[0],
                         row[1].lstrip(),
                         row[2].lstrip(),
                         row[3].lstrip()])
    else:
        writer.writerow([row[0], row[1], row[2], row[3]])

If you have 21 columns, you don't want to write row[0], ... , row[20]. Plus, you want to close your files after opening them. .next() gets your header. And strip() lets you flexibly remove unwanted leading and trailing characters.
import csv
file = 'file1.csv'
newfile = open('file2.csv','wb')
writer = csv.writer(newfile)
with open(file, 'rb') as f:
    reader = csv.reader(f)
    header = reader.next()
    newheader = []
    for c in header:
        newheader.append(c.strip(' '))
    writer.writerow(newheader)
    for r in reader:
        writer.writerow(r)
newfile.close()
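For anyone on Python 3, the same idea (clean the header once, then copy everything else unchanged) could be sketched roughly as below. The function name clean_header and the in-memory sample data are made up for illustration; with real files you would open both with newline="" as the csv documentation recommends.

```python
import csv
import io


def clean_header(src, dst):
    """Copy CSV rows from src to dst, stripping spaces from header cells only."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)  # None when the input is empty
    if header is None:
        return
    writer.writerow([cell.strip(" ") for cell in header])
    writer.writerows(reader)  # the remaining rows are copied unchanged


# Hypothetical sample data held in memory instead of real files.
src = io.StringIO("a, b, c, d\n1,2,3,4\n")
dst = io.StringIO()
clean_header(src, dst)
print(dst.getvalue())
```

Because the header is consumed with next() before the loop, no counter or first-row flag is needed, and the list comprehension works for any number of columns.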

Related

Remove the last empty line in CSV file

nf=open(Output_File,'w+')
with open(Input_File,'read') as f:
    for row in f:
        Current_line = str(row)
        Reformated_line=str(','.join(Current_line.split('|')[1:-1]))
        nf.write(Reformated_line+ "\n")
I'm trying to read Input file which is in Table Format and write it in a CSV file, but my Output contains one last empty line also. How can I remove the last empty line in CSV?
It sounds like you have an empty line in your input file. From your comments, you actually have a non-empty line that has no | characters in it. In either case, it is easy enough to check for an empty result line.
Try this:
#UNTESTED
nf=open(Output_File,'w+')
with open(Input_File,'read') as f:
    for row in f:
        Current_line = str(row)
        Reformated_line=str(','.join(Current_line.split('|')[1:-1]))
        if Reformated_line:
            nf.write(Reformated_line+ "\n")
Other notes:
You should use with consistently. Open both files the same way.
str(row) is a no-op. row is already a str.
str(','.join(...)) is similarly redundant.
open(..., 'read') is not a valid use of the mode parameter to open(). You should use r or even omit the parameter altogether.
I prefer not to introduce new names when changing the format of existing data. That is, I prefer row = row.split() over Reformatted_line = row.split().
Here is a version that incorporates these and other suggestions:
with open(Input_File) as inf, open(Output_File, 'w+') as outf:
    for row in inf:
        row = ','.join(row.split('|')[1:-1])
        if row:
            outf.write(row + "\n")
Just a question of reordering things a little:
first = True
with open(Input_File,'r') as f, open(Output_File,'w+') as nf:
    for row in f:
        Current_line = str(row)
        Reformated_line=str(','.join(Current_line.split('|')[1:-1]))
        if not first:
            nf.write('\n')
        else:
            first = False
        nf.write(Reformated_line)
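The two ideas above (skip lines that convert to nothing, and avoid a trailing newline) can be combined in a small generator. This is only a sketch in Python 3; the name table_to_csv_lines and the sample lines are assumptions for illustration, not taken from the question's actual input file.

```python
def table_to_csv_lines(lines):
    """Yield the comma-joined inner fields of '|'-delimited lines, skipping blanks."""
    for line in lines:
        # Drop the newline, split on '|', and discard the outer (border) pieces.
        row = ",".join(line.rstrip("\n").split("|")[1:-1])
        if row:  # lines without any inner fields produce an empty string
            yield row


# Hypothetical table-formatted input, including a blank line and a border line.
sample = ["|a|b|c|\n", "|1|2|3|\n", "\n", "+-----+\n"]
converted = list(table_to_csv_lines(sample))
text = "\n".join(converted)  # join() puts newlines between rows only
print(text)
```

Joining with "\n" rather than appending "\n" to every row is what keeps the output from ending in an empty line.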

Python "String Index Out of Range" during for row operation

Hope you can help. I'm trying to iterate over a .csv file and delete rows where the first character of the first item is a #.
Whilst my code does indeed delete the necessary rows, I'm being presented with a "string index out of range" error.
My code is as below:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if (row[0][0]) != '#':
        writer.writerow(row)
input.close()
output.close()
As far as I can tell, I have no empty rows that I'm trying to iterate over.
Check if the string is empty with if row[0] before trying to index:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if row[0] and row[0][0] != '#': # here
        writer.writerow(row)
input.close()
output.close()
Or simply use if not row[0].startswith('#') as your condition, which also works when row[0] is an empty string.
You are likely running into an empty string.
Perhaps try
if row and row[0][0] != '#':
Then why don't you make sure you don't bump into any of those even if they exist by checking if the line is empty first like so:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if row:
        if (row[0][0]) != '#':
            writer.writerow(row)
    else:
        continue
input.close()
output.close()
Also when working with *.csv files it is good to have a look at them in a text editor to make sure the delimiters and end_of_line characters are like you think they are. The sniffer is also a good read.
Cheers
Why not provide working code (imports included) and wrap as usual the physical resources in context managers?
Like so:
#! /usr/bin/env python
"""Only strip those rows that start with hash (#)."""
import csv
IN_F_PATH = "/home/stephen/Desktop/paths_output.csv"
OUT_F_PATH = "/home/stephen/Desktop/paths_output2.csv"
with open(IN_F_PATH, 'rb') as i_f, open(OUT_F_PATH, "wb") as o_f:
    writer = csv.writer(o_f)
    for row in csv.reader(i_f):
        if row and row[0].startswith('#'):
            continue
        writer.writerow(row)
Some notes:
The closing of the files is automated by leaving the context block;
the names are better chosen, as input is the name of a built-in function;
you may want to keep empty lines: I only read in the question that you want to strip comment lines, so the code detects these and continues;
row[0] is the first column's string, and testing whether it starts with # maps naturally onto the string method startswith.
In case you also want to strip empty lines, then one could use the following condition to continue instead:
if not row or row and row[0].startswith('#'):
and you should be ready to go.
HTH
To answer a comment on the code line above, which also causes blank input lines to be skipped:
In Python, boolean expressions are evaluated left to right and short-circuit, so:
>>> row = ["#", "42"]
>>> if not row or row and row[0].startswith("#"):
...     print "Empty or comment!"
...
Empty or comment!
>>> row = []
>>> if not row or row and row[0].startswith("#"):
...     print "Empty or comment!"
...
Empty or comment!
I suspect that there are lines with an empty first cell, so row[0][0] tries to access the first character of the empty string.
You should try:
for row in csv.reader(input):
    if not row[0].startswith('#'):
        writer.writerow(row)
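To see the different edge cases side by side (a comment row, a row whose first cell is empty, and a completely blank line), here is a small Python 3 sketch. The helper name drop_comment_rows and the sample data are invented for this example; it keeps blank lines, as one of the answers above does.

```python
import csv
import io


def drop_comment_rows(rows):
    """Yield every row whose first cell does not start with '#'."""
    for row in rows:
        # startswith() is safe on an empty string, unlike row[0][0];
        # the `row and` guard covers completely blank lines (row == []).
        if row and row[0].startswith("#"):
            continue
        yield row


# Hypothetical input: comment row, empty first cell, blank line, normal row.
data = io.StringIO("#comment,x\n,empty first cell\n\nkeep,me\n")
kept = list(drop_comment_rows(csv.reader(data)))
print(kept)
```

Indexing row[0][0] would raise "string index out of range" on the second line of this sample, and row[0] alone would raise an IndexError on the blank line, which is why both guards are there.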

Python to not count the header in a csv file

I have Python code to edit a column in a CSV file. It removes the leading zeros from the integers in column 5 (row[5]). Then it pads the integer with zeros on the left if it has 3 digits or fewer, so that it has a total of 4 digits or more.
The problem I'm having is that it doesn't like the title row, which is not an integer. Does anyone know how I can keep the header but adjust the code so that it doesn't look at the first line of the CSV file?
Here is the code:
import csv
import re
import os
import sys
with open('', 'r') as infile, open('', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    #firstline = True
    #for row in outfile:
    #    if outfile:
    #        firstline = False
    for row in reader:
        # strip all 0's from the front
        stripped_value = re.sub(r'^0+', '', row[5])
        # pad zeros on the left to smaller numbers to make them 4 digits
        row[5] = '%04d'%int(stripped_value)
        writer.writerow(row)
Add this before the loop:
# Python 2.x
writer.writerow(reader.next())
# Python 3.x
writer.writerow(next(reader))
It will get the first line and return it. And then you are writing it to the output.
However, in my opinion you should make the code inside the loop resistant to non-numbers on that column (like in Al.Sal answer).
You could use an exception handler. The try is incredibly cheap; since you'd only have one header, the more expensive except won't get called enough to impact performance. Also, you would have a good way to handle non-number rows later on.
for row in reader:
    # strip all 0's from the front
    stripped_value = re.sub(r'^0+', '', row[5])
    # pad zeros on the left to smaller numbers to make them 4 digits
    try:
        row[5] = '%04d'%int(stripped_value)
    except ValueError:
        pass # Or do something, to avoid passing it silently
    writer.writerow(row)
Your code snippet with correct indentation:
import csv
import re
import os
import sys
with open('', 'r') as infile, open('', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # strip all 0's from the front
        stripped_value = re.sub(r'^0+', '', row[5])
        # pad zeros on the left to smaller numbers to make them 4 digits
        try:
            row[5] = '%04d'%int(stripped_value)
        except ValueError:
            pass # Or do something, to avoid passing it silently
        writer.writerow(row)
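Both suggestions (write the header through with next() and keep the loop tolerant of non-numbers) can live in the same function. The following is a rough Python 3 sketch; pad_column, its index and width parameters, and the sample rows are all assumptions for illustration. It uses int() plus zfill() in place of the regex, which gives the same result for plain non-negative integers.

```python
import csv
import io


def pad_column(src, dst, index=5, width=4):
    """Copy CSV rows, zero-padding the integer in column `index` to `width` digits."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)  # None when the input is empty
    if header is not None:
        writer.writerow(header)  # the header is written back untouched
    for row in reader:
        try:
            # int() discards leading zeros; zfill() pads back up to `width`
            row[index] = str(int(row[index])).zfill(width)
        except (ValueError, IndexError):
            pass  # non-numeric cell or short row: leave it unchanged
        writer.writerow(row)


# Hypothetical six-column sample held in memory.
src = io.StringIO("a,b,c,d,e,code\n1,2,3,4,5,00012\n1,2,3,4,5,7\n1,2,3,4,5,123456\n")
dst = io.StringIO()
pad_column(src, dst)
print(dst.getvalue())
```

Values that already have more than four digits are left at their full length, matching the "4 digits or more" requirement in the question.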

Consolidate several lines of a CSV file with firewall rules, in order to parse them easier?

I have a CSV file, which I created using an HTML export from a Check Point firewall policy.
Each rule is represented as several lines, in some cases. That occurs when a rule has several address sources, destinations or services.
I need the output to have each rule described in only one line.
It's easy to distinguish when each rule begins. In the first column, there's the rule ID, which is a number.
Here's an example. In green are marked the strings that should be moved:
http://i.imgur.com/i785sDi.jpg
Let me show you an example:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp;accept;
;;;;igmp;;
2;Testing;fwgcluster;fwgcluster;FireWall;accept;
;;fwmgmpe;fwmgmpe;ssh;;
;;fwmgm;fwmgm;;;
What I need, explained in pseudocode, is this:
Read the first column of the next line. If there's a number:
    Evaluate the first column of the next line. If there's no number there, concatenate (separating with a comma) \
    the strings in the columns of this line with the last one and eliminate the text in the current one
The output should be something like this:
NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp-igmp;accept;
;;;;;;
2;Testing;fwgcluster-fwmgmpe-fwmgm;fwgcluster-fwmgmpe-fwmgm;FireWall-ssh;accept;
;;;;;;
The empty lines are there only to be more clear, I don't actually need them.
Thanks!
This should get you started
import csv
with open('data.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';')
    for r in reader:
        print r
EDIT: Given your required output, this should get you nearly there. It's a bit crude but does the majority of what you need. It checks for the 'NO.' key and, if it has a value, it will start a record. If not, it will join any other data in the row with the equivalent data in the record. Finally, when a new record is created the old one is appended to the result; this also happens at the end to catch the last item.
import csv
result, record = [], None
with open('data2.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=';', lineterminator='\n')
    for r in reader:
        if r['NO.']:
            if record:
                result.append(record)
            record = r
        else:
            for key in r.keys():
                if r[key]:
                    record[key] = '-'.join([record[key], r[key]])
    if record:
        result.append(record)
print result
Graeme, thanks again, just before your edit I solved it with the following code.
But you got me looking in the right direction!
If anyone needs it, here it is:
import csv
# adjust these 3 lines
WRITE_EMPTIES = False
INFILE = "input.csv"
OUTFILE = "output.csv"
with open(INFILE, "r") as in_file:
    r = csv.reader(in_file, delimiter=";")
    with open(OUTFILE, "wb") as out_file:
        previous = None
        empties_to_write = 0
        out_writer = csv.writer(out_file, delimiter=";")
        for i, row in enumerate(r):
            first_val = row[0].strip()
            if first_val:
                if previous:
                    out_writer.writerow(previous)
                    if WRITE_EMPTIES and empties_to_write:
                        out_writer.writerows(
                            [["" for _ in previous]] * empties_to_write
                        )
                    empties_to_write = 0
                previous = row
            else: # append sub-portions to each other
                previous = [
                    "|".join(
                        subitem
                        for subitem in existing.split(",") + [new]
                        if subitem
                    )
                    for existing, new in zip(previous, row)
                ]
                empties_to_write += 1
        if previous: # take care of the last row
            out_writer.writerow(previous)
            if WRITE_EMPTIES and empties_to_write:
                out_writer.writerows(
                    [["" for _ in previous]] * empties_to_write
                )
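For comparison, here is a compact Python 3 sketch of the same merging idea, run against the sample from the question. The generator name consolidate and its sep parameter are hypothetical. Like the answers above, it treats any non-empty first column as the start of a new rule rather than checking for an actual number, so the header row simply passes through as its own record.

```python
import csv
import io


def consolidate(rows, sep="-"):
    """Merge continuation rows (empty first column) into the preceding rule."""
    current = None
    for row in rows:
        if not row:
            continue  # skip completely blank lines
        if row[0].strip() or current is None:
            if current is not None:
                yield current  # the previous rule is complete
            current = list(row)
        else:
            for i, cell in enumerate(row):
                if cell and i < len(current):
                    # join only non-empty parts, so no leading separator appears
                    current[i] = sep.join(p for p in (current[i], cell) if p)
    if current is not None:
        yield current  # do not forget the last rule


raw = """NO.;NAME;SOURCE;DESTINATION;SERVICE;ACTION;
1;;fwgcluster;mcast_vrrp;vrrp;accept;
;;;;igmp;;
2;Testing;fwgcluster;fwgcluster;FireWall;accept;
;;fwmgmpe;fwmgmpe;ssh;;
;;fwmgm;fwmgm;;;
"""
merged = list(consolidate(csv.reader(io.StringIO(raw), delimiter=";")))
out = io.StringIO()
csv.writer(out, delimiter=";", lineterminator="\n").writerows(merged)
print(out.getvalue())
```

Writing the generator against any iterable of rows keeps the file handling separate, so the same function can be fed from csv.reader on a real file.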

Have csv.reader tell when it is on the last line

Apparently some csv output implementation somewhere truncates field separators from the right on the last row and only the last row in the file when the fields are null.
Example input csv, fields 'c' and 'd' are nullable:
a|b|c|d
1|2||
1|2|3|4
3|4||
2|3
In something like the script below, how can I tell whether I am on the last line so I know how to handle it appropriately?
import csv
reader = csv.reader(open('somefile.csv'), delimiter='|', quotechar=None)
header = reader.next()
for line_num, row in enumerate(reader):
    assert len(row) == len(header)
    ....
Basically you only know you've run out after you've run out. So you could wrap the reader iterator, e.g. as follows:
def isLast(itr):
    old = itr.next()
    for new in itr:
        yield False, old
        old = new
    yield True, old
and change your code to:
for line_num, (is_last, row) in enumerate(isLast(reader)):
    if not is_last: assert len(row) == len(header)
etc.
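A Python 3 version of this look-ahead wrapper might look like the sketch below, using the sample file from the question. The name with_last_flag is made up, and padding the short last row with empty strings is just one possible way to "handle it appropriately". Inside a generator, the initial next() has to be guarded, because a StopIteration escaping a generator body is turned into a RuntimeError in current Python 3.

```python
import csv
import io


def with_last_flag(iterable):
    """Yield (is_last, item) pairs by staying one item ahead of the consumer."""
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for item in iterator:
        yield False, previous
        previous = item
    yield True, previous


sample = "a|b|c|d\n1|2||\n1|2|3|4\n3|4||\n2|3\n"
reader = csv.reader(io.StringIO(sample), delimiter="|")
header = next(reader)
rows = []
for is_last, row in with_last_flag(reader):
    if is_last:
        # restore the truncated trailing fields on the final row
        row += [""] * (len(header) - len(row))
    assert len(row) == len(header)
    rows.append(row)
print(rows)
```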
I am aware it is an old question, but I came up with a different answer than the ones presented. The reader object already increments the line_num attribute as you iterate through it. Then I get the total number of lines at first using row_count, then I compare it with the line_num.
import csv
def row_count(filename):
    with open(filename) as in_file:
        return sum(1 for _ in in_file)
in_filename = 'somefile.csv'
reader = csv.reader(open(in_filename), delimiter='|')
last_line_number = row_count(in_filename)
for row in reader:
    if last_line_number == reader.line_num:
        print "It is the last line: %s" % row
If you have an expectation of a fixed number of columns in each row, then you should be defensive against:
(1) ANY row being shorter -- e.g. a writer (SQL Server / Query Analyzer IIRC) may omit trailing NULLs at random; users may fiddle with the file using a text editor, including leaving blank lines.
(2) ANY row being longer -- e.g. commas not quoted properly.
You don't need any fancy tricks. Just an old-fashioned if-test in your row-reading loop:
for row in csv.reader(...):
    ncols = len(row)
    if ncols != expected_cols:
        appropriate_action()
If you want to get exactly the last row, try this code:
with open("\\".join([myPath,files]), 'r') as f:
    print f.readlines()[-1] #or your own manipulations
If you want to continue working with values from the row, do the following:
f.readlines()[-1].split(",")[0] #this would let you get columns by their index
Just extend the row to the length of the header:
for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    ...
Could you not just catch the error when the csv reader runs past the last line, in a
try:
    ... do your stuff here...
except StopIteration:
    ... handle the end of the file here...
condition?
See the following Python code on Stack Overflow for an example of how to use try/except: Python CSV DictReader/Writer issues
If you use for row in reader:, it will just stop the loop after the last item has been read.
