Hope you can help. I'm trying to iterate over a .csv file and delete rows where the first character of the first item is a #.
Whilst my code does indeed delete the necessary rows, I'm being presented with a "string index out of range" error.
My code is as below:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if (row[0][0]) != '#':
        writer.writerow(row)
input.close()
output.close()
As far as I can tell, I have no empty rows that I'm trying to iterate over.
Check if the string is empty with if row[0] before trying to index:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if row[0] and row[0][0] != '#': # here
        writer.writerow(row)
input.close()
output.close()
Or simply use if not row[0].startswith('#') as your condition; startswith is safe to call on an empty string.
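A minimal Python 3 sketch of the startswith approach, run against an in-memory sample instead of the real file. The helper name drop_comment_rows and the sample data are made up for illustration.

```python
import csv
import io


def drop_comment_rows(lines):
    """Return the parsed rows whose first field does not start with '#'."""
    kept = []
    for row in csv.reader(lines):
        # Blank lines parse to [] and are kept unchanged here.
        # ''.startswith('#') is simply False, so an empty first field is safe.
        if row and row[0].startswith('#'):
            continue
        kept.append(row)
    return kept


sample = io.StringIO('#comment,1\n,2\nvalue,3\n')
print(drop_comment_rows(sample))
```

The row with the empty first field survives without raising, which is the case that broke row[0][0].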
You are likely running into an empty string.
Perhaps try
if row and row[0] and row[0][0] != '#':
Then why don't you make sure you don't bump into any of those, even if they exist, by checking whether the row (and its first field) is empty first, like so:
input=open("/home/stephen/Desktop/paths_output.csv", 'rb')
output=open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    # skip blank lines and rows whose first field is empty
    if row and row[0]:
        if (row[0][0]) != '#':
            writer.writerow(row)
    else:
        continue
input.close()
output.close()
Also, when working with *.csv files it is good to have a look at them in a text editor to make sure the delimiters and end-of-line characters are what you think they are. The documentation for csv.Sniffer is also a good read.
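As a rough illustration of the sniffer, here is a Python 3 sketch that guesses the delimiter of a small sample. The sample text and the candidate delimiter list are assumptions for the example; sniffing is heuristic and can guess wrong on unusual files.

```python
import csv

# Invented sample; a real script would read the first few KB of the file.
sample = "name;path;size\nfoo;/tmp/foo;12\nbar;/tmp/bar;7\n"

# Restrict the candidates so the sniffer only considers plausible delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";,|\t")
print(repr(dialect.delimiter))

# The detected dialect can be passed straight to csv.reader.
rows = list(csv.reader(sample.splitlines(), dialect))
print(rows)
```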
Cheers
Why not provide working code (imports included) and wrap as usual the physical resources in context managers?
Like so:
#! /usr/bin/env python
"""Only strip those rows that start with hash (#)."""
import csv
IN_F_PATH = "/home/stephen/Desktop/paths_output.csv"
OUT_F_PATH = "/home/stephen/Desktop/paths_output2.csv"
with open(IN_F_PATH, 'rb') as i_f, open(OUT_F_PATH, "wb") as o_f:
    writer = csv.writer(o_f)
    for row in csv.reader(i_f):
        if row and row[0].startswith('#'):
            continue
        writer.writerow(row)
Some notes:
The closing of the files is automated by leaving the context blocks,
the names are better chosen, as input is the name of a built-in function (shadowing it is best avoided) ...
you may want to keep empty lines; I only read in the question that you want to strip comment lines, so the code detects these and continues.
row[0] is the first column's string, and testing whether it starts with # maps directly onto the matching string method, startswith.
In case you also want to strip empty lines, then one could use the following condition to continue instead:
if not row or row and row[0].startswith('#'):
and you should be ready to go.
HTH
To answer a comment on the code line above, which also causes blank input lines to be skipped:
In Python, boolean expressions are evaluated left to right and short-circuit (lazy evaluation), so row[0] is never touched when row is empty:
>>> row = ["#", "42"]
>>> if not row or row and row[0].startswith("#"):
...     print "Empty or comment!"
...
Empty or comment!
>>> row = []
>>> if not row or row and row[0].startswith("#"):
...     print "Empty or comment!"
...
Empty or comment!
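The same short-circuit behaviour, sketched for Python 3 as a small predicate. The name is_blank_or_comment is hypothetical; note that the extra `row and` in the condition above is not needed, because the right-hand side of `or` is only evaluated when `not row` is false.

```python
def is_blank_or_comment(row):
    # `not row` is checked first; for an empty list the `or` short-circuits
    # and row[0] is never evaluated, so no IndexError can occur.
    return not row or row[0].startswith('#')


for row in ([], ['#', '42'], ['', '42'], ['data', '42']):
    print(row, is_blank_or_comment(row))
```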
I suspect that there are lines with an empty first cell, so row[0][0] tries to access the first character of the empty string.
You should try:
for row in csv.reader(input):
    if not row[0].startswith('#'):
        writer.writerow(row)
Related
I am trying to build a small crawler to grab Twitter handles. I cannot for the life of me get around an error I keep having. It seems to be the exact same error for re.search, re.findall and re.finditer. The error is TypeError: expected string or buffer.
The data from the CSV is structured as follows:
30,"texg",#handle,,,,,,,,
Note that the print row works fine, the test = re.... errors out before getting to the print line.
def read_urls(filename):
    f = open(filename, 'rb')
    reader = csv.reader(f)
    data = open('Data.txt', 'w')
    dict1 = {}
    for row in reader:
        print row
        test = re.search(r'#(\w+)', row)
        print test.group(1)
Also note that I have been working through this problem in a number of different threads, but none of the solutions explained there have worked. It just seems like re isn't able to read the row...
Take a look at your code carefully:
for row in reader:
    print row
    test = re.search(r'#(\w+)', row)
    print test.group(1)
Note that row is a list not a string and according to search documentation:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
That means you should create a string and check whether test is not None
for row in reader:
    print row
    test = re.search(r'#(\w+)', ''.join(row))
    if test:
        print test.group(1)
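Also, if you are on Python 3, open the file in text mode (without the b flag), because the csv module there expects strings rather than bytes, like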
f = open(filename, 'r')
You're trying to read a list after you run the file through the reader.
import re
f = open('file1.txt', 'r')
for row in f:
    print(row)
    test = re.search(r'#(\w+)', row)
    if test:  # re.search returns None when the line has no handle
        print(test.group(1))
f.close()
https://repl.it/JCng/1
If you want to use the CSV reader, you can loop through the list.
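A hedged Python 3 sketch of that idea: keep csv.reader, and search each field (a string) rather than the row (a list). The function name find_handles is invented for the example; the sample line is the one from the question.

```python
import csv
import re

HANDLE_RE = re.compile(r'#(\w+)')


def find_handles(lines):
    """Collect every '#handle' found in any field of any row."""
    handles = []
    for row in csv.reader(lines):
        for field in row:
            match = HANDLE_RE.search(field)
            if match:  # search returns None when the field has no handle
                handles.append(match.group(1))
    return handles


print(find_handles(['30,"texg",#handle,,,,,,,,']))
```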
Here is a section of the log file I want to parse:
And here is the code I am writing:
import csv
with open('Coplanarity_PedCheck.log','rt') as tsvin, open('YX2.csv', 'wt') as csvout:
    read_tsvin = csv.reader(tsvin, delimiter='\t')
    for row in read_tsvin:
        print(row)
        filters = row[0]
        if "#Log File Initialized!" in filters:
            print(row)
            datetime = row[0]
            print("looklook",datetime[23:46])
            csvout.write(datetime[23:46]+",")
            BS = row[0]
            print("looklook",BS[17:21])
            csvout.write(datetime[17:21]+",")
            csvout.write("\n")
    csvout.close()
I need to get the date and time information from row1, then get "left" from row2, then need to skip section 4. How should I do it?
Since csv.reader makes row1 a list with only one element, I converted it to a string again to slice out the datetime info I need. But I think that is not efficient.
I did same thing for row2, then I want to skip row 3-6, but I don't know how.
Also, csv.reader converts my float data into text, how can I convert them back before I write them into another file?
You are going to want to learn to use regular expressions.
For example, you could do something like this:
import csv
import re
with open('Coplanarity_PedCheck.log','rt') as tsvin, open('YX2.csv', 'wt') as csvout:
    read_tsvin = csv.reader(tsvin, delimiter='\t')
    # Get the first line and use the first field
    header = next(read_tsvin)[0]
    m = re.search(r'\[([0-9: -]+)\]', header)
    datetime = m.group(1)
    csvout.write(datetime + ',')
    # Find if 'Left' is in line 2
    direction = next(read_tsvin)[0]
    m = re.search('Left', direction)
    if m:
        # If 'Left' is found, write the matched text
        csvout.write(m.group(0))
    csvout.write('\n')
    # Skip lines 3-6
    for i in range(4):
        next(read_tsvin)
    # Loop over the rest of the rows
    for row in read_tsvin:
        # Parse the data
        pass
# No explicit close is needed: the with block closes both files
Modifying this to look for a line containing '#Log File Initialized!' rather than hard coding for the first line would be fairly simple using regular expressions. Take a look at the regular expression documentation
This probably isn't exactly what you want to do, but rather a suggestion for a good starting point.
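On the float question: csv.reader always yields strings, so numeric fields have to be converted explicitly. A small Python 3 sketch, where to_number is a hypothetical helper and the tab-separated sample line is invented, since the actual log layout is not shown:

```python
import csv


def to_number(text):
    """Return float(text) if it parses as a number, else the text unchanged."""
    try:
        return float(text)
    except ValueError:
        return text


# Invented sample line standing in for one data row of the log.
sample = ['probe\t1.25\t-0.5\tn/a']
for row in csv.reader(sample, delimiter='\t'):
    print([to_number(field) for field in row])
```

When writing the values back out, they need to be turned into strings again, for example with str() or a format specification.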
I have a csv file in which I need to add a zero in front of a number if it has fewer than 4 digits.
I only have to update a particular row:
import csv
f = open('csvpatpos.csv')
csv_f = csv.reader(f)
for row in csv_f:
    print row[5]
then I want to parse through that row and add a 0 to the front of any number that is shorter than 4 digits. And then input it into a new csv file with the adjusted data.
You want to use string formatting for these things:
>>> '{:04}'.format(99)
'0099'
Format String Syntax documentation
When you think about parsing, you either need to think about regex or pyparsing. In this case, regex would perform the parsing quite easily.
But that's not all, once you are able to parse the numbers, you need to zero fill it. For that purpose, you need to use str.format for padding and justifying the string accordingly.
Consider your string
st = "parse through that row and add a 0 to the front of any number that is shorter than 4 digits."
In the above lines, you can do something like
Implementation
import re
# \d+ captures each whole run of digits, so longer numbers are not cut into pieces
parts = re.split(r"(\d+)", st)
''.join("{:>04}".format(elem) if elem.isdigit() else elem for elem in parts)
Output
'parse through that row and add a 0000 to the front of any number that is shorter than 0004 digits.'
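An alternative sketch of the same idea for Python 3, using re.sub with a replacement function instead of split and join. The name pad_numbers and the sample string are illustrative only.

```python
import re


def pad_numbers(text, width=4):
    # \d+ matches each whole run of digits; zfill leaves longer runs untouched.
    return re.sub(r'\d+', lambda m: m.group(0).zfill(width), text)


print(pad_numbers('ids: 7, 486, 12345'))
```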
The following code will read in the given csv file, iterate through each row and each item in each row, and output it to a new csv file.
import csv
import os
f = open('csvpatpos.csv')
# open temp .csv file for output
out = open('csvtemp.csv','w')
csv_f = csv.reader(f)
for row in csv_f:
    # create a temporary list for this row
    temp_row = []
    # iterate through all of the items in the row
    for item in row:
        # add the zero filled value of each temporary item to the list
        temp_row.append(item.zfill(4))
    # join the current temporary list with commas and write it to the out file
    out.write(','.join(temp_row) + '\n')
out.close()
f.close()
Your results will be in csvtemp.csv. If you want to save the data with the original filename, just add the following code to the end of the script
# remove original file
os.remove('csvpatpos.csv')
# rename temp file to original file name
os.rename('csvtemp.csv','csvpatpos.csv')
Pythonic Version
The code above is very verbose in order to make it understandable. Here is the code refactored to make it more Pythonic
import csv
new_rows = []
with open('csvpatpos.csv','r') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        row = [ x.zfill(4) for x in row ]
        new_rows.append(row)
with open('csvpatpos.csv','wb') as f:
    csv_f = csv.writer(f)
    csv_f.writerows(new_rows)
Will leave you with two hints:
s = "486"
s.isdigit() == True
for finding what things are numbers.
And
s = "486"
s.zfill(4) == "0486"
for filling in zeroes.
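Putting the two hints together, here is a minimal Python 3 sketch that pads only one column and only when the value is purely digits. The function name pad_column and the in-memory sample are assumptions; column index 5 is taken from the question's row[5].

```python
import csv
import io


def pad_column(rows, index, width=4):
    """Yield copies of rows with the digit-only value at `index` zero-filled."""
    for row in rows:
        row = list(row)
        if index < len(row) and row[index].isdigit():
            row[index] = row[index].zfill(width)
        yield row


# In-memory stand-ins for the input and output files.
source = io.StringIO('a,b,c,d,e,99\na,b,c,d,e,12345\na,b,c,d,e,n/a\n')
out = io.StringIO()
csv.writer(out, lineterminator='\n').writerows(pad_column(csv.reader(source), 5))
print(out.getvalue())
```

Values that already have four or more digits, and values that are not numbers, pass through unchanged.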
I have the following output from a csv file:
word1|word2|word3|word4|word5|word6|01:12|word8
word1|word2|word3|word4|word5|word6|03:12|word8
word1|word2|word3|word4|word5|word6|01:12|word8
What I need to do is change the time string so that it looks like this: 00:01:12.
My idea is to extract the list item [7] and add "00:" as a string to the front.
import csv
with open('temp', 'r') as f:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        fixed_time = (str("00:") + row[7])
        begin = row[:6]
        end = row[:8]
        print begin + fixed_time +end
I get this error message:
TypeError: can only concatenate list (not "str") to list.
I also had a look at this post:
how to change [1,2,3,4] to '1234' using python
I need to know whether my approach to the solution is the right way. Maybe I need to use split or something else for this.
Thanks for any help
The line that's throwing the exception is
print begin + fixed_time +end
because begin and end are both lists and fixed_time is a string. Whenever you take a slice of a list (that's the row[:6] and row[:8] parts), a list is returned. If you just want to print it out, you can do
print begin, fixed_time, end
and you won't get an error.
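If the three pieces should instead be joined into one row, the string has to be wrapped in a list so that all operands of + are lists. A short Python 3 sketch using one line of the sample data; note that the time sits at index 6, and the tail is row[7:] rather than row[:8].

```python
row = 'word1|word2|word3|word4|word5|word6|01:12|word8'.split('|')

fixed_time = '00:' + row[6]
# Wrap the string in a list so all three operands are lists.
new_row = row[:6] + [fixed_time] + row[7:]
print('|'.join(new_row))
```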
Corrected code:
I'm opening a new file for writing (I'm calling it 'final', but you can call it whatever you want), and I'm just writing everything to it with the one modification. It's easiest to just change the one element of the list that has the line (row[6] here), and use '|'.join to write a pipe character between each column.
import csv
with open('temp', 'r') as f, open('final', 'w') as fw:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        # just change the element in the row to have the extra zeros
        row[6] = '00:' + row[6]
        # write the row back out, separated by | characters, and a new line.
        fw.write('|'.join(row) + '\n')
you can use regex for that:
>>> txt = """\
... word1|word2|word3|word4|word5|word6|01:12|word8
... word1|word2|word3|word4|word5|word6|03:12|word8
... word1|word2|word3|word4|word5|word6|01:12|word8"""
>>> import re
>>> print(re.sub(r'\|(\d\d:\d\d)\|', r'|00:\1|', txt))
word1|word2|word3|word4|word5|word6|00:01:12|word8
word1|word2|word3|word4|word5|word6|00:03:12|word8
word1|word2|word3|word4|word5|word6|00:01:12|word8
Apparently some csv output implementation somewhere truncates field separators from the right on the last row and only the last row in the file when the fields are null.
Example input csv, fields 'c' and 'd' are nullable:
a|b|c|d
1|2||
1|2|3|4
3|4||
2|3
In something like the script below, how can I tell whether I am on the last line so I know how to handle it appropriately?
import csv
reader = csv.reader(open('somefile.csv'), delimiter='|', quotechar=None)
header = reader.next()
for line_num, row in enumerate(reader):
    assert len(row) == len(header)
    ....
Basically you only know you've run out after you've run out. So you could wrap the reader iterator, e.g. as follows:
def isLast(itr):
    old = itr.next()
    for new in itr:
        yield False, old
        old = new
    yield True, old
and change your code to:
for line_num, (is_last, row) in enumerate(isLast(reader)):
    if not is_last: assert len(row) == len(header)
    etc.
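For reference, a Python 3 variant of the same look-ahead wrapper, offered as a sketch. It uses the built-in next() instead of the .next() method and returns cleanly on an empty input; the name mark_last is made up here.

```python
def mark_last(iterable):
    """Yield (is_last, item) pairs; yields nothing for an empty iterable."""
    iterator = iter(iterable)
    try:
        previous = next(iterator)
    except StopIteration:
        return
    for current in iterator:
        # There is at least one more item, so `previous` was not the last one.
        yield False, previous
        previous = current
    yield True, previous


print(list(mark_last(['a', 'b', 'c'])))
```

Because it accepts any iterable, it can wrap a csv.reader object directly.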
I am aware it is an old question, but I came up with a different answer than the ones presented. The reader object already increments its line_num attribute as you iterate through it. So I first get the total number of lines using row_count, and then compare it with line_num while iterating.
import csv
def row_count(filename):
    with open(filename) as in_file:
        return sum(1 for _ in in_file)
in_filename = 'somefile.csv'
reader = csv.reader(open(in_filename), delimiter='|')
last_line_number = row_count(in_filename)
for row in reader:
    if last_line_number == reader.line_num:
        print "It is the last line: %s" % row
If you have an expectation of a fixed number of columns in each row, then you should be defensive against:
(1) ANY row being shorter -- e.g. a writer (SQL Server / Query Analyzer IIRC) may omit trailing NULLs at random; users may fiddle with the file using a text editor, including leaving blank lines.
(2) ANY row being longer -- e.g. commas not quoted properly.
You don't need any fancy tricks. Just an old-fashioned if-test in your row-reading loop:
for row in csv.reader(...):
    ncols = len(row)
    if ncols != expected_cols:
        appropriate_action()
if you want to get exactly the last row try this code:
with open("\\".join([myPath,files]), 'r') as f:
    print f.readlines()[-1] #or your own manipulations
If you want to continue working with values from row do the following:
f.readlines()[-1].split(",")[0] #this would let you get columns by their index
Just extend the row to the length of the header:
for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    ...
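A slightly more defensive Python 3 sketch of the same padding idea. It fills short rows and, as an added assumption not in the answer above, refuses rows that are longer than the header. The helper name normalize_row is hypothetical.

```python
def normalize_row(row, width, fill=''):
    """Pad a short row with `fill`; raise if the row is too long."""
    if len(row) > width:
        raise ValueError('row has %d fields, expected %d' % (len(row), width))
    return row + [fill] * (width - len(row))


header = ['a', 'b', 'c', 'd']
print(normalize_row(['2', '3'], len(header)))
```

This handles a truncated row wherever it appears, so the loop no longer needs to know whether it is on the last line.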
Could you not just catch the error raised when the csv reader runs past the last line, in a
try:
    ... do your stuff here...
except StopIteration:
    ... handle the end of the file here...
block?
See the following Python code on Stack Overflow for an example of how to use try/except: Python CSV DictReader/Writer issues
If you use for row in reader:, it will just stop the loop after the last item has been read.