I am trying to build a small crawler to grab Twitter handles. I cannot for the life of me get around an error I keep hitting. It seems to be the exact same error for re.search, re.findall, and re.finditer: TypeError: expected string or buffer.
The data is structured as follows in the CSV:
30,"texg",#handle,,,,,,,,
Note that the print row line works fine; the test = re.search(...) line errors out before the final print is reached.
def read_urls(filename):
    f = open(filename, 'rb')
    reader = csv.reader(f)
    data = open('Data.txt', 'w')
    dict1 = {}
    for row in reader:
        print row
        test = re.search(r'#(\w+)', row)
        print test.group(1)
Also note that I have been working through this problem across a number of different threads, but none of the solutions explained there have worked. It just seems like re isn't able to read the row value...
Take a look at your code carefully:
for row in reader:
    print row
    test = re.search(r'#(\w+)', row)
    print test.group(1)
Note that row is a list, not a string, and according to the search documentation:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
That means you should build a string first and then check whether test is not None:
for row in reader:
    print row
    test = re.search(r'#(\w+)', ''.join(row))
    if test:
        print test.group(1)
Also, open the file without the b flag, like
f = open(filename, 'r')
You're trying to run re over a list after passing the file through the csv reader. If you iterate over the file object directly, each row is a string:
import re

f = open('file1.txt', 'r')
for row in f:
    print(row)
    test = re.search(r'#(\w+)', row)
    if test:  # guard: lines without a handle would otherwise raise AttributeError
        print(test.group(1))
f.close()
https://repl.it/JCng/1
If you want to use the CSV reader, you can loop through the list.
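A minimal sketch of that approach, assuming a file shaped like the question's sample (the filename sample.csv is just for illustration):

```python
import csv
import re

# Write a one-row sample matching the question's data layout (illustrative only).
with open('sample.csv', 'w') as f:
    f.write('30,"texg",#handle,,,,,,,,\n')

with open('sample.csv') as f:
    for row in csv.reader(f):      # each row is a list of cells
        for cell in row:           # so search each cell, not the list itself
            m = re.search(r'#(\w+)', cell)
            if m:
                print(m.group(1))  # -> handle
```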
Hope you can help. I'm trying to iterate over a .csv file and delete rows where the first character of the first item is a #.
Whilst my code does indeed delete the necessary rows, I'm being presented with a "string index out of range" error.
My code is as below:
input = open("/home/stephen/Desktop/paths_output.csv", 'rb')
output = open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)

for row in csv.reader(input):
    if row[0][0] != '#':
        writer.writerow(row)

input.close()
output.close()
As far as I can tell, I have no empty rows that I'm trying to iterate over.
Check whether the string is empty with if row[0] before trying to index it:
input = open("/home/stephen/Desktop/paths_output.csv", 'rb')
output = open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)

for row in csv.reader(input):
    if row[0] and row[0][0] != '#':  # here
        writer.writerow(row)

input.close()
output.close()
Or simply use if not row[0].startswith('#') as your condition; startswith returns False for an empty string, so no extra check is needed.
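As a quick check of why startswith avoids the error: unlike indexing, it is safe on an empty string:

```python
# Indexing an empty string raises IndexError; startswith simply returns False.
print(''.startswith('#'))          # False
print('#comment'.startswith('#'))  # True
try:
    ''[0]
except IndexError:
    print('indexing an empty string raises IndexError')
```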
You are likely running into an empty string.
Perhaps try
if row and row[0][0] != '#':
Then why don't you make sure you don't bump into any of those, even if they exist, by checking whether the line is empty first, like so:
input = open("/home/stephen/Desktop/paths_output.csv", 'rb')
output = open("/home/stephen/Desktop/paths_output2.csv", "wb")
writer = csv.writer(output)

for row in csv.reader(input):
    if row:
        if row[0][0] != '#':
            writer.writerow(row)
    else:
        continue

input.close()
output.close()
Also, when working with *.csv files it is good to have a look at them in a text editor to make sure the delimiters and end-of-line characters are what you think they are. The Sniffer is also a good read.
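For reference, a minimal Sniffer sketch (the sample string here is made up):

```python
import csv

# Sniffer guesses the dialect (delimiter, quoting) from a sample of the file.
sample = 'name;path\nfoo;/tmp/foo\nbar;/tmp/bar\n'
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ';'
```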
Cheers
Why not provide working code (imports included) and, as usual, wrap the physical resources in context managers?
Like so:
#! /usr/bin/env python
"""Only strip those rows that start with hash (#)."""
import csv

IN_F_PATH = "/home/stephen/Desktop/paths_output.csv"
OUT_F_PATH = "/home/stephen/Desktop/paths_output2.csv"
with open(IN_F_PATH, 'rb') as i_f, open(OUT_F_PATH, "wb") as o_f:
    writer = csv.writer(o_f)
    for row in csv.reader(i_f):
        if row and row[0].startswith('#'):
            continue
        writer.writerow(row)
Some notes:
The closing of the files is automated by leaving the context blocks,
the names are better chosen, as input shadows, well, a builtin ...
you may want to keep empty lines; I only read in the question that you want to strip comment lines, so detect those and continue.
it is row[0] that is the first column's string, and "starts with #" maps naturally to the simple string method startswith.
In case you also want to strip empty lines, one could use the following condition to continue instead:
if not row or row and row[0].startswith('#'):
and you should be ready to go.
HTH
To answer a comment on the above code line, which also causes blank input lines to be skipped:
In Python, boolean expressions are evaluated left to right with short-circuiting, so:
>>> row = ["#", "42"]
>>> if not row or row and row[0].startswith("#"):
...     print "Empty or comment!"
...
Empty or comment!
>>> row = []
>>> if not row or row and row[0].startswith("#"):
...     print "Empty or comment!"
...
Empty or comment!
I suspect that there are lines with an empty first cell, so row[0][0] tries to access the first character of the empty string.
You should try:
for row in csv.reader(input):
    if not row[0].startswith('#'):
        writer.writerow(row)
Here is a section of the log file I want to parse:
And here is the code I am writing:
import csv

with open('Coplanarity_PedCheck.log', 'rt') as tsvin, open('YX2.csv', 'wt') as csvout:
    read_tsvin = csv.reader(tsvin, delimiter='\t')
    for row in read_tsvin:
        print(row)
        filters = row[0]
        if "#Log File Initialized!" in filters:
            print(row)
            datetime = row[0]
            print("looklook", datetime[23:46])
            csvout.write(datetime[23:46] + ",")
            BS = row[0]
            print("looklook", BS[17:21])
            csvout.write(datetime[17:21] + ",")
            csvout.write("\n")
csvout.close()
I need to get the date and time information from row 1, then get "left" from row 2, then skip section 4. How should I do it?
Since csv.reader makes row 1 a list with only one element, I converted it back to a string to slice out the datetime info I need, but I think that is not efficient.
I did the same thing for row 2; then I want to skip rows 3-6, but I don't know how.
Also, csv.reader gives me my float data back as text; how can I convert it back before I write it to another file?
You are going to want to learn to use regular expressions.
For example, you could do something like this:
import csv
import re

with open('Coplanarity_PedCheck.log', 'rt') as tsvin, open('YX2.csv', 'wt') as csvout:
    read_tsvin = csv.reader(tsvin, delimiter='\t')
    # Get the first line and search its first field for the [date time] stamp
    header = next(read_tsvin)[0]
    m = re.search(r'\[([0-9: -]+)\]', header)
    if m:
        csvout.write(m.group(1) + ',')
    # Find whether 'Left' is in line 2
    direction = next(read_tsvin)[0]
    m = re.search('Left', direction)
    if m:
        # If a match is found, write it out
        csvout.write(m.group(0))
    csvout.write('\n')
    # Skip lines 3-6
    for i in range(4):
        next(read_tsvin)
    # Loop over the rest of the rows
    for row in read_tsvin:
        # Parse the data
        pass
Modifying this to look for a line containing '#Log File Initialized!' rather than hard-coding the first line would be fairly simple using regular expressions. Take a look at the regular expression documentation.
This probably isn't exactly what you want to do, but rather a suggestion for a good starting point.
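The question also asked about numeric fields: the csv module always returns fields as strings, so any floats have to be converted back explicitly. A minimal sketch with made-up row contents:

```python
# csv gives every field back as a string; convert numeric cells yourself.
row = ['1.25', '-3.0', 'label']
values = [float(x) for x in row[:2]]
print(values)  # [1.25, -3.0]
```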
I have the following output from a csv file:
word1|word2|word3|word4|word5|word6|01:12|word8
word1|word2|word3|word4|word5|word6|03:12|word8
word1|word2|word3|word4|word5|word6|01:12|word8
What I need to do is change the time string to look like 00:01:12.
My idea is to extract list item [7] and prepend the string "00:".
import csv

with open('temp', 'r') as f:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        fixed_time = (str("00:") + row[7])
        begin = row[:6]
        end = row[:8]
        print begin + fixed_time + end
I get this error message:
TypeError: can only concatenate list (not "str") to list
I also had a look at this post:
how to change [1,2,3,4] to '1234' using python
I need to know if my approach to the solution is the right way, or whether I should use split or something else for this.
Thanks for any help!
The line that's throwing the exception is
print begin + fixed_time +end
because begin and end are both lists and fixed_time is a string. Whenever you take a slice of a list (that's the row[:6] and row[:8] parts), a list is returned. If you just want to print it out, you can do
print begin, fixed_time, end
and you won't get an error.
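To see the difference in isolation, using a row shaped like the question's data:

```python
# A slice of a list is still a list, so list + str raises TypeError.
row = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', '01:12', 'word8']
begin = row[:6]              # a list
fixed_time = '00:' + row[6]  # a plain string
try:
    begin + fixed_time       # list + str is not allowed
except TypeError as e:
    print('TypeError:', e)
print(begin, fixed_time)     # printing with commas prints each item separately
```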
Corrected code:
I'm opening a new file for writing (I'm calling it 'final', but you can call it whatever you want), and I'm just writing everything to it with the one modification. It's easiest to just change the one element of the list that has the line (row[6] here), and use '|'.join to write a pipe character between each column.
import csv

with open('temp', 'r') as f, open('final', 'w') as fw:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        # just change the element in the row to have the extra zeros
        row[6] = '00:' + row[6]
        # write the row back out, separated by | characters, and a newline
        fw.write('|'.join(row) + '\n')
You can use a regex for that:
>>> txt = """\
... word1|word2|word3|word4|word5|word6|01:12|word8
... word1|word2|word3|word4|word5|word6|03:12|word8
... word1|word2|word3|word4|word5|word6|01:12|word8"""
>>> import re
>>> print(re.sub(r'\|(\d\d:\d\d)\|', r'|00:\1|', txt))
word1|word2|word3|word4|word5|word6|00:01:12|word8
word1|word2|word3|word4|word5|word6|00:03:12|word8
word1|word2|word3|word4|word5|word6|00:01:12|word8
I want to parse a csv file which is in the following format:
Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3
and would like to turn it into tab-separated format like the following:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
The number of TestAttributes varies from test to test. For some tests there are only 3 values, for others 7, etc. Also, as in the TestName4 example, some tests are executed more than once, and each execution gets its own TestAttributeValue line (in the example, TestName4 is executed 3 times, hence the 3 value lines).
I am new to Python and do not have much knowledge, but I would like to parse the csv file with Python. I checked Python's csv library and could not be sure whether it will be enough, or whether I should write my own string parser. Could you please help me?
Best
I'd use a solution using the itertools.groupby function and the csv module. Please have a close look at the documentation of itertools -- you can use it more often than you think!
I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:
import csv
from itertools import groupby

with open('my_data.csv') as ifile, open('my_out_data.csv', 'wb') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')
    # Skip info line
    next(reader)
    # Group datasets by the condition len(row) > 0, then filter
    # out all empty lines
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])
Output, given that the supplied data is stored in my_data.csv:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
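To see what the groupby call above contributes, here is a tiny standalone demo (the row values are made up):

```python
from itertools import groupby

# groupby splits the row stream into runs keyed by "is the row non-empty",
# which is how the answer above separates datasets at blank lines.
rows = [['a'], ['b'], [], ['c']]
groups = [(key, list(run)) for key, run in groupby(rows, key=bool)]
print(groups)  # [(True, [['a'], ['b']]), (False, [[]]), (True, [['c']])]
```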
The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:
import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find maximum column length
        max_len = max([len(col) for col in section])
        # build and print each row
        for i in xrange(max_len):
            f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
        f_out.write('\n')

# csv.reader is not a context manager, so open the file in the with statement
with open(in_path, 'r') as csv_f, open(out_path, 'w') as f_out:
    f_in = csv.reader(csv_f)
    line = f_in.next()
    section = []
    for line in f_in:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        else:
            # add line to section
            section.append(line)
    # print the last section
    print_section(section, f_out)
Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.
The basic idea here is that we import the file into a list of lists, then write that list of lists back out, using a list comprehension to transpose it (as well as adding blank elements when the columns are uneven).
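The transpose-with-padding step can also be written with itertools.zip_longest, shown here on made-up rows:

```python
from itertools import zip_longest

# zip_longest(*rows) transposes rows to columns, padding short rows with ''
rows = [['a1', 'a2', 'a3'],
        ['v1', 'v2']]
for out_row in zip_longest(*rows, fillvalue=''):
    print('\t'.join(out_row))
```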
My .csv has two columns: SKU and LongDesc. I want to search the rows, specifically the LongDesc column, for specific strings. If a string is found, a variable will increase; it increases once for each time a string is found. If the string is not found, the variable will remain the same.
I started out with this:
import csv

answer = 0
with open('sku.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        def test(x):
            while 'x' in row:
                answer == answer + 1
            return answer
print test('e')
I'm trying to search for the string "e" within the file, but the only result I get is zero. I'm clearly not coding this correctly: the reader isn't checking each row, and it's not searching for the right string.
import csv

def main():
    find_text = 'this'
    with open('sku.csv') as f:
        reader = csv.reader(f)
        found = sum(1 for sku, descr in reader if descr.find(find_text) > -1)
    print(found)

if __name__ == "__main__":
    main()
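A quick way to try the counting expression without a file on disk, using io.StringIO and made-up rows:

```python
import csv
import io

# Two-column SKU data as described in the question (values are made up).
data = io.StringIO('A1,this is a desc\nA2,other desc\nA3,this too\n')
found = sum(1 for sku, descr in csv.reader(data) if 'this' in descr)
print(found)  # 2
```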