Print lines of a CSV file that contain a specified keyword - Python

I'm new to Python, but I would like to do some data analysis on some CSV files. I'd like to print only those lines from a CSV file that include certain keywords. I use the first block below to print all valid lines; from these lines I would like to print the ones that include keywords. Thanks for your help.
import csv
import sys

csv.field_size_limit(sys.maxsize)
invalids = 0
valids = 0
for f in ['1.csv']:
    reader = csv.reader(open(f, 'rU'), delimiter='|', quotechar='\\')
    for row in reader:
        try:
            print row[2]
            valids += 1
        except:
            invalids += 1
print 'parsed %s records. ignored %s' % (valids, invalids)
With keywords:
for w in ['ford', 'hyundai','honda', 'jeep', 'maserati','audi','jaguar', 'volkswagen','chevrolet','chrysler']:
I guess I need to filter my top code with an if statement, but I've been struggling with this for hours and can't seem to get it to work.

Your guess is correct. All you need to do is filter the lines with an if statement, checking whether each field matches a keyword. Here is how you do it (I've also made some improvements to your code and explained them in the comments):
# First, create a set of the keywords. Sets are faster than a list for
# checking if they contain an element. The curly brackets create a set.
keywords = {'ford', 'hyundai', 'honda', 'jeep', 'maserati', 'audi', 'jaguar',
            'volkswagen', 'chevrolet', 'chrysler'}

csv.field_size_limit(sys.maxsize)
invalids = 0
valids = 0
for filename in ['1.csv']:
    # The with statement in Python makes sure that your file is properly closed
    # (automatically) when an error occurs. This is a common idiom.
    # In addition, CSV files should be opened only in 'rb' mode.
    with open(filename, 'rb') as f:
        reader = csv.reader(f, delimiter='|', quotechar='\\')
        for row in reader:
            try:
                print row[2]
                valids += 1
            # Don't use bare except clauses. It will catch
            # exceptions you don't want or intend to catch.
            except IndexError:
                invalids += 1
            # The filtering is done here.
            for field in row:
                if field in keywords:
                    print row
                    break
# Prefer the str.format() method over the old style string formatting.
print 'parsed {0} records. ignored {1}'.format(valids, invalids)
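One caveat: field in keywords only matches when an entire field equals a keyword. If a keyword can appear anywhere inside a field (e.g. a field like 'ford focus'), a substring test is needed instead. A minimal sketch of that variant; the lower() call is an assumption, in case the file mixes upper and lower case:

for row in reader:
    # Hypothetical variant: substring match instead of exact field match.
    if any(keyword in field.lower() for field in row for keyword in keywords):
        print row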

Related

Regex commands not finding designated strings

I am trying to build a small crawler to grab Twitter handles. I cannot for the life of me get around an error I keep having. It seems to be the exact same error for re.search, re.findall and re.finditer. The error is TypeError: expected string or buffer.
The data is structured as followed from the CSV:
30,"texg",#handle,,,,,,,,
Note that the print row works fine; the test = re.... line errors out before getting to the print line.
def read_urls(filename):
    f = open(filename, 'rb')
    reader = csv.reader(f)
    data = open('Data.txt', 'w')
    dict1 = {}
    for row in reader:
        print row
        test = re.search(r'#(\w+)', row)
        print test.group(1)
Also note that I have been working through this problem in a number of different threads, but none of the solutions explained there have worked. It just seems like re isn't able to read the row call...
Take a look at your code carefully:
for row in reader:
    print row
    test = re.search(r'#(\w+)', row)
    print test.group(1)
Note that row is a list, not a string, and according to the re.search documentation:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
That means you should create a string from the list and check whether test is not None:
for row in reader:
    print row
    test = re.search(r'#(\w+)', ''.join(row))
    if test:
        print test.group(1)
Also, open the file without the b flag, like:
f = open(filename, 'r')
You're trying to read a list after you run the file through the reader.
import re

f = open('file1.txt', 'r')
for row in f:
    print(row)
    test = re.search(r'#(\w+)', row)
    if test:  # guard against lines with no match
        print(test.group(1))
f.close()
https://repl.it/JCng/1
If you want to use the CSV reader, you can loop through the list.
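For instance, a minimal sketch of that approach, running the regex on each field of each row (the file name file1.txt is carried over from the snippet above):

import csv
import re

with open('file1.txt', 'r') as f:
    for row in csv.reader(f):
        # row is a list, so search each field individually
        for field in row:
            test = re.search(r'#(\w+)', field)
            if test:
                print(test.group(1))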

Python "String Index Out of Range" during for row operation

Hope you can help. I'm trying to iterate over a .csv file and delete rows where the first character of the first item is a #.
Whilst my code does indeed delete the necessary rows, I'm being presented with a "string index out of range" error.
My code is as below:
input = open("/home/stephen/Desktop/paths_output.csv", 'rb')
output = open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if row[0][0] != '#':
        writer.writerow(row)
input.close()
output.close()
As far as I can tell, I have no empty rows that I'm trying to iterate over.
Check if the string is empty with if row[0] before trying to index:
input = open("/home/stephen/Desktop/paths_output.csv", 'rb')
output = open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if row[0] and row[0][0] != '#': # here
        writer.writerow(row)
input.close()
output.close()
Or simply use not row[0].startswith('#') as your condition; startswith() is safe even when the first field is an empty string.
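A quick interpreter check confirms this:
>>> ''.startswith('#')
False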
You are likely running into an empty string.
Perhaps try
if row and row[0] and row[0][0] != '#':
Then why don't you make sure you don't bump into any of those, even if they exist, by checking whether the line is empty first, like so:
input = open("/home/stephen/Desktop/paths_output.csv", 'rb')
output = open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)
for row in csv.reader(input):
    if row:
        if row[0][0] != '#':
            writer.writerow(row)
    else:
        continue
input.close()
output.close()
Also, when working with *.csv files it is good to have a look at them in a text editor to make sure the delimiters and end-of-line characters are what you think they are. The csv.Sniffer is also a good read; a sketch follows below.
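As a hedged illustration of that Sniffer suggestion, something along these lines (the 1024-byte sample size is an arbitrary choice):

import csv

with open("/home/stephen/Desktop/paths_output.csv", 'rb') as f:
    # Let csv.Sniffer guess the delimiter and quoting from a sample.
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    for row in csv.reader(f, dialect):
        print row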
Cheers
Why not provide working code (imports included) and, as usual, wrap the physical resources in context managers?
Like so:
#! /usr/bin/env python
"""Only strip those rows that start with hash (#)."""
import csv

IN_F_PATH = "/home/stephen/Desktop/paths_output.csv"
OUT_F_PATH = "/home/stephen/Desktop/paths_output2.csv"

with open(IN_F_PATH, 'rb') as i_f, open(OUT_F_PATH, "wb") as o_f:
    writer = csv.writer(o_f)
    for row in csv.reader(i_f):
        if row and row[0].startswith('#'):
            continue
        writer.writerow(row)
Some notes:
The closing of the files is automated by leaving the context blocks,
the names are better chosen, as input shadows a Python built-in ...
you may want to keep empty lines; I only read from the question that you want to strip comment lines, so detect those and continue,
it is row[0] that holds the first column's string, and "starts with #" maps natively to the best matching simple string method, startswith().
In case you might also want to strip empty lines, one could use the following condition to continue instead:
if not row or row and row[0].startswith('#'):
and you should be ready to go.
HTH
To answer a comment: the above condition also causes blank input lines to be skipped.
In Python, boolean expressions are evaluated left to right and short-circuit, so:
>>> row = ["#", "42"]
>>> if not row or row and row[0].startswith("#"):
... print "Empty or comment!"
...
Empty or comment!
>>> row = []
>>> if not row or row and row[0].startswith("#"):
... print "Empty or comment!"
...
Empty or comment!
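In fact, thanks to the short circuit, the row and part of the condition is redundant; the simpler form behaves the same on an empty list:
>>> row = []
>>> if not row or row[0].startswith("#"):
...     print "Empty or comment!"
...
Empty or comment!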
I suspect that there are lines with an empty first cell, so row[0][0] tries to access the first character of the empty string.
You should try:
for row in csv.reader(input):
    if not row[0].startswith('#'):
        writer.writerow(row)

Python, appending printed output to excel file

I was hoping someone might be able to point me in the right direction, or give an example of how I can put the following script output into an Excel spreadsheet using xlwt. My script prints the following text on screen as required, but I was hoping to put this output into Excel as two columns of time and value. Here's the printed output:
07:16:33.354 1
07:16:33.359 1
07:16:33.364 1
07:16:33.368 1
My script so far is below.
import re

f = open("C:\Results\16.txt", "r")
searchlines = f.readlines()
searchstrings = ['Indicator']
timestampline = None
timestamp = None
f.close()
a = 0
tot = 0

while a < len(searchstrings):
    for i, line in enumerate(searchlines):
        for word in searchstrings:
            if word in line:
                timestampline = searchlines[i-33]
                for l in searchlines[i:i+1]:  # print timestampline, l,
                    # print
                    for i in line:
                        str = timestampline
                        match = re.search(r'\d{2}:\d{2}:\d{2}.\d{3}', str)
                        if match:
                            value = line.split()
                            print '\t', match.group(), '\t', value[5],
                            print
                        print
                tot = tot + 1
                break
    print 'total count for', '"', searchstrings[a], '"', 'is', tot
    tot = 0
    a = a + 1
I have had a few goes using xlwt or the CSV writer, but each time I hit a wall and revert back to my above script and try again. I am hoping to print match.group() and value[5] into two different columns of an Excel worksheet.
Thanks for your time...
MikG
What kind of problems do you have with xlwt? Personally, I find it very easy to use, remembering the basic workflow:
import xlwt
create your spreadsheet using e.g.
my_xls = xlwt.Workbook(encoding=your_char_encoding),
which returns a spreadsheet handle to use for adding sheets and saving the whole file
add a sheet to the created spreadsheet with e.g.
my_sheet = my_xls.add_sheet("sheet name")
now, having the sheet object, you can write to its cells using my_sheet.write(row, column, value):
my_sheet.write(0, 0, "First column title")
my_sheet.write(0, 1, "Second column title")
save the whole thing using my_xls.save('file_name.xls'):
my_xls.save("results.xls")
It's the simplest of working examples; your code should of course use sheet.write(row, column, value) within the loop printing data, e.g.:
import re
import xlwt

f = open("C:\Results\VAMOS_RxQual_Build_Update_Fri_04-11.08-16.txt", "r")
searchlines = f.readlines()
searchstrings = ['TSC Set 2 Indicator']
timestampline = None
timestamp = None
f.close()
a = 0
tot = 0

my_xls = xlwt.Workbook(encoding="utf-8")      # begin your whole mighty xls thing
my_sheet = my_xls.add_sheet("Results")        # add a sheet to it

row_num = 0                                   # let it be current row number
my_sheet.write(row_num, 0, "match.group()")   # here go column headers,
my_sheet.write(row_num, 1, "value[5]")        # change them to your needs
row_num += 1                                  # let's change to next row

while a < len(searchstrings):
    for i, line in enumerate(searchlines):
        for word in searchstrings:
            if word in line:
                timestampline = searchlines[i-33]
                for l in searchlines[i:i+1]:  # print timestampline, l,
                    # print
                    for i in line:
                        str = timestampline
                        match = re.search(r'\d{2}:\d{2}:\d{2}.\d{3}', str)
                        if match:
                            value = line.split()
                            print '\t', match.group(), '\t', value[5],
                            # here goes cell writing:
                            my_sheet.write(row_num, 0, match.group())
                            my_sheet.write(row_num, 1, value[5])
                            row_num += 1
                            # and that's it...
                            print
                        print
                tot = tot + 1
                break
    print 'total count for', '"', searchstrings[a], '"', 'is', tot
    tot = 0
    a = a + 1

# don't forget to save your file!
my_xls.save("results.xls")
A catch:
native date/time data writing to xls was a nightmare to me, as Excel internally doesn't store date/time data in an obvious way (or I couldn't figure it out); see the sketch after these notes,
be careful about the data types you're writing into cells. For simple reporting at the beginning it's enough to pass everything as a string,
later you should find the xlwt documentation quite useful.
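For what it's worth, here is a minimal sketch of one way date/time values can be written with xlwt, by attaching a number format to the cell style; the format string is an assumption and may need tweaking:

import datetime
import xlwt

book = xlwt.Workbook()
sheet = book.add_sheet("dates")
# The num_format_str below is an assumed format; adjust to taste.
style = xlwt.easyxf(num_format_str='hh:mm:ss.000')
sheet.write(0, 0, datetime.datetime.now(), style)
book.save("dates.xls")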
Happy XLWTing!

CSV parsing in Python

I want to parse a csv file which is in the following format:
Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3
and would like to turn this into tab-separated format like the following:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The number of TestAttributes varies from test to test. For some tests there are only 3 values, for others 7, etc. Also, as in the TestName4 example, some tests are executed more than once, and each execution has its own TestAttributeValue line. (In the example, TestName4 is executed 3 times, hence we have 3 value lines.)
I am new to Python and do not have much knowledge, but I would like to parse the CSV file with Python. I checked Python's csv library and could not be sure whether it will be enough for me or whether I should write my own string parser. Could you please help me?
Best
I'd go with a solution using the itertools.groupby function and the csv module. Please have a close look at the itertools documentation -- you can use it more often than you think!
I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:
import csv
from itertools import groupby

with open('my_data.csv') as ifile, open('my_out_data.csv', 'wb') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')

    # Skip info line
    next(reader)

    # Group datasets by the condition if len(row) > 0 or not, then filter
    # out all empty lines
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])
Output, given that the supplied data is stored in my_data.csv:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
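As an aside, here is a small demonstration of what the groupby call in the code above does with the blank-line condition (the toy rows are made up):
>>> from itertools import groupby
>>> rows = [['a', '1'], [], ['b', '2'], ['c', '3']]
>>> for key, group in groupby(rows, lambda x: bool(len(x))):
...     print key, list(group)
...
True [['a', '1']]
False [[]]
True [['b', '2'], ['c', '3']]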
The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:
import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find maximum column length
        max_len = max([len(col) for col in section])
        # build and print each row
        for i in xrange(max_len):
            f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
        f_out.write('\n')

# Note: csv.reader objects are not context managers, so wrap the
# input file in the with statement and build the reader from it.
with open(in_path, 'r') as in_file, open(out_path, 'w') as f_out:
    f_in = csv.reader(in_file)
    line = f_in.next()
    section = []
    for line in f_in:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        else:
            # add line to section
            section.append(line)
    # print the last section
    print_section(section, f_out)
Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.
The basic idea here is that we import the file into a list of lists, then write that list of lists back out, using a list comprehension to transpose it (as well as adding in blank elements when the columns are uneven).
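A minimal sketch of that pad-and-transpose idea in isolation, using itertools.izip_longest to supply the blank elements (the toy data is made up):
>>> from itertools import izip_longest
>>> section = [['a1', 'a2', 'a3'], ['b1', 'b2']]
>>> for out_row in izip_longest(*section, fillvalue=''):
...     print '\t'.join(out_row)
...
a1	b1
a2	b2
a3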

Have csv.reader tell when it is on the last line

Apparently some CSV output implementation somewhere truncates trailing field separators on the last row (and only the last row in the file) when the fields are null.
Example input csv, fields 'c' and 'd' are nullable:
a|b|c|d
1|2||
1|2|3|4
3|4||
2|3
In something like the script below, how can I tell whether I am on the last line so I know how to handle it appropriately?
import csv

reader = csv.reader(open('somefile.csv'), delimiter='|', quotechar=None)
header = reader.next()
for line_num, row in enumerate(reader):
    assert len(row) == len(header)
    ....
Basically you only know you've run out after you've run out. So you could wrap the reader iterator, e.g. as follows:
def isLast(itr):
    old = itr.next()
    for new in itr:
        yield False, old
        old = new
    yield True, old
and change your code to:
for line_num, (is_last, row) in enumerate(isLast(reader)):
    if not is_last: assert len(row) == len(header)
etc.
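A quick check of the wrapper's behaviour on a plain iterator:
>>> list(isLast(iter([1, 2, 3])))
[(False, 1), (False, 2), (True, 3)]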
I am aware it is an old question, but I came up with a different answer than the ones presented. The reader object already increments the line_num attribute as you iterate through it. I get the total number of lines first using row_count, then compare it with line_num.
import csv

def row_count(filename):
    with open(filename) as in_file:
        return sum(1 for _ in in_file)

in_filename = 'somefile.csv'
reader = csv.reader(open(in_filename), delimiter='|')
last_line_number = row_count(in_filename)

for row in reader:
    if last_line_number == reader.line_num:
        print "It is the last line: %s" % row
If you have an expectation of a fixed number of columns in each row, then you should be defensive against:
(1) ANY row being shorter -- e.g. a writer (SQL Server / Query Analyzer IIRC) may omit trailing NULLs at random; users may fiddle with the file using a text editor, including leaving blank lines.
(2) ANY row being longer -- e.g. commas not quoted properly.
You don't need any fancy tricks. Just an old-fashioned if-test in your row-reading loop:
for row in csv.reader(...):
    ncols = len(row)
    if ncols != expected_cols:
        appropriate_action()
If you want to get exactly the last row, try this code:
with open("\\".join([myPath, files]), 'r') as f:
    print f.readlines()[-1]  # or your own manipulations
If you want to continue working with values from the row, do the following:
f.readlines()[-1].split(",")[0]  # this would let you get columns by their index
Just extend the row to the length of the header:
for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    ...
Could you not just catch the error when the csv reader reads the last line, in a
try:
    ... do your stuff here ...
except StopIteration:
    ...
block? See the following Python code on Stack Overflow for an example of how to use try/except: Python CSV DictReader/Writer issues
If you use for row in reader:, it will just stop the loop after the last item has been read.
