Python csv reader incomplete file line iteration - python

Here is my problem. I need to parse a comma separated file and I've got my code working how I would like, however while testing it and attempting to break things I've come across a problem.
Here is the example code:
import csv
compareList=["testfield1","testfield2","testfield3","testfield4"]
z=open("testFile",'r')
testFile=open("testFileOut",'w')  # output file; the name is assumed, the original never opens it
x=csv.reader(z,quotechar='\'')
testDic={}
iter=0
for lineList in x:
    try:
        # copy one field per expected column; a short line raises IndexError
        for item in compareList:
            testDic[item]=lineList[iter]
            iter+=1
        iter=0
    except IndexError:
        # blank out the whole line
        iter=0
        lineList=[]
        for item in compareList:
            lineList.append("")  # was testList, which is never defined
            testDic[item]=lineList[iter]
            iter+=1
        iter=0
    for item in compareList:
        testFile.write(testDic[item])
        if compareList.index(item)!=len(compareList)-1:
            testFile.write(",")
    testFile.write('\n')
testFile.close()
z.close()
This is supposed to check that each line of the CSV file has as many fields as compareList has items. If a line does not match that length, the line is replaced with empty values (just commas) matching the length of compareList.
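The intended behaviour described above can be sketched compactly. This is only an illustration written for Python 3, not the code from the question: `normalize_rows` and `EXPECTED_FIELDS` are hypothetical names, the reader uses the default quotechar, and rows that are too long are blanked as well as rows that are too short.

```python
import csv
import io

EXPECTED_FIELDS = 4  # stands in for len(compareList) in the question


def normalize_rows(source, expected=EXPECTED_FIELDS):
    """Yield each row unchanged if it has `expected` fields, else a row of empty strings."""
    for row in csv.reader(source):
        if len(row) == expected:
            yield row
        else:
            yield [""] * expected


# In-memory stand-in for the input file: the first line is one field short.
sample = io.StringIO(',"sometext",343434\n,,"moretext",343434\n')
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(normalize_rows(sample))
print(out.getvalue())
```

Letting csv.writer produce the output avoids the manual bookkeeping of where the commas go.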
Here is an example of what is in the file:
,,"sometext",343434
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
The code works just fine if a line is missing an item. So the output for a file containing:
,"sometext",343434
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
will look like this:
,,,,
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
The problem I have induced is when the line looks something like this:
,"sometext",343434
,,"moretext",343434
,,"St,'",uff",4543343
,,"morestuff",3434354
The output for this file will be:
,,,,
,,"moretext",343434
,,,,
So it will apply the change as expected and null out lines 1 and 3, but it just stops processing at that line. I've been pulling my hair out trying to figure out what is going on here, with no luck.
As always I greatly appreciate any help you are willing to give.

Print each row returned by csv.reader to see what the problem is:
>>> import csv
>>> z=open("testFile",'r')
>>> x=csv.reader(z,quotechar='\'')
>>> for lineList in x:
...     print lineList
...
['', '"sometext"', '343434']
['', '', '"moretext"', '343434']
['', '', '"St', '",uff",4543343\n,,"morestuff",3434354\n']
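The last two lines of the file are a single row for csv.reader: with quotechar='\'', the single quote in the third line opens a quoted field that is never closed, so the reader keeps consuming input until the end of the file.
Now remove quotechar='\'':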
>>> import csv
>>> z=open("testFile",'r')
>>> x=csv.reader(z)
>>> for lineList in x:
...     print lineList
...
['', 'sometext', '343434']
['', '', 'moretext', '343434']
['', '', "St,'", 'uff"', '4543343']
['', '', 'morestuff', '3434354']
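The same comparison can be reproduced without a file on disk. The following is a minimal sketch for Python 3; `SAMPLE` and `parse` are names made up for this illustration, and the data is the four-line file from the question held in memory.

```python
import csv
import io

# The problem file from the question, as an in-memory string.
SAMPLE = (
    ',"sometext",343434\n'
    ',,"moretext",343434\n'
    ',,"St,\'",uff",4543343\n'
    ',,"morestuff",3434354\n'
)


def parse(text, **reader_options):
    """Parse `text` with csv.reader and return all rows as a list."""
    return list(csv.reader(io.StringIO(text), **reader_options))


rows_single_quote = parse(SAMPLE, quotechar="'")  # the question's setting
rows_default = parse(SAMPLE)                      # default quotechar is '"'

print(len(rows_single_quote))  # fewer rows: the stray ' swallows the last line
print(len(rows_default))       # one row per input line
```

Counting the rows is a cheap sanity check: if the reader returns fewer rows than the file has lines, some quoted field probably spans more than one line.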

Related

Python script using import RE to put list of words into bracket

I would like to split the following string into a list. I have tried:
import re
mystr = """
MA1-ETLP-01
MA1-ETLP-02
MA1-ETLP-03
MA1-ETLP-04
MA1-ETLP-05
"""
wordList = re.sub("[^\w]"," ",mystr).split()
print wordList
I get the output:
['MA1', 'ETLP', '01', 'MA1', 'ETLP', '02', 'MA1', 'ETLP', '03', 'MA1', 'ETLP', '04', 'MA1', 'ETLP', '05']
I want it to look more like:
['MA1-ETLP-01', 'MA1-ETLP-02', 'MA1-ETLP-03', 'MA1-ETLP-04', 'MA1-ETLP-05']
How can I achieve the second output?
You don't need a regular expression for that. Just send the string to split().
>>> mystr = """
...
...
... MA1-ETLP-01
... MA1-ETLP-02
... MA1-ETLP-03
... MA1-ETLP-04
... MA1-ETLP-05
...
... """
>>> mystr.split()
['MA1-ETLP-01', 'MA1-ETLP-02', 'MA1-ETLP-03', 'MA1-ETLP-04', 'MA1-ETLP-05']
The following will do the trick:
mystr.split()
If the lines can contain spaces, use splitlines() instead of split() and filter out the empty lines:
mystr = """
MA1-ETLP-01
MA1-ETLP-02
MA1-ETLP-03
MA1-ETLP-04
MA1-ETLP-05
"""
print([line for line in mystr.splitlines() if line])
Based on the script name OpenFileAndFormat, it seems you are reading from a file. If so, you need not split anything: read the file line by line into a list, stripping newlines and filtering out empty lines:
with open("your_file") as f:
    lines = [line for line in map(str.strip, f) if line]
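To make the difference between the two approaches concrete, here is a small sketch for Python 3. The input is invented for the illustration: one line deliberately contains spaces instead of hyphens.

```python
text = """
MA1-ETLP-01
MA1 ETLP 02

MA1-ETLP-03
"""

# split() breaks on any whitespace, so the line with spaces falls apart.
by_whitespace = text.split()

# splitlines() keeps each line whole; blank lines are filtered out afterwards.
by_line = [line for line in text.splitlines() if line.strip()]

print(by_whitespace)
print(by_line)
```

When every entry is guaranteed to be free of spaces, the two approaches give the same list and split() is the shorter choice.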

Python - Get specific lines from file

How can I get specific lines from a file in Python? I know how to read files and get it in a list etc, but this is a bit harder for me. Let me explain what I need:
I have a file that looks like this:
lcl|AF033819.3_cds_AAC82593.1_1 [gene=gag] [protein=Gag] [protein_id=AAC82593.1] [location=336..1838]
ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAG
GGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAA
TCACTCTTTGGCAACGACCCCTCGTCACAATAA
lcl|AF033819.3_cds_AAC82598.2_2 [gene=pol] [protein=Pol] [partial=5'] [protein_id=AAC82598.2] [location=<1631..4642]
TTTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGACCAGAGCCA
ACAGCCCCACCAGAAGAGAGCTTCAGGTCTGGGGTAGAGACAACAACTCCCCCTCAGAAGCAGGAGCCGA
lcl|AF033819.3_cds_AAC82594.1_3 [gene=vif] [protein=Vif] [protein_id=AAC82594.1] [location=4587..5165]
ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAAAGTT
TAGTAAAACACCATATGTATGTTTCAGGGAAAGCTAGGGGATGGTTTTATAGACATCACTATGAAAGCCC
I need to remove every line that contains:
lcl|AF033819.3_cds_AAC82594.1_3 [gene=vif] [protein=Vif] [protein_id=AAC82594.1] [location=4587..5165]
All the letters I need to store in a list, file, etc. I know how that works. Can anyone help me with the code in Python? How do I only delete lines that contain:
lcl
The answer is to use regular expressions. It will be something like this:
>>> import re
>>> a = 'beginlcl|AF033819.3_cds_AAC82593.1_1 [gene=gag] [protein=Gag] [protein_id=AAC82593.1] [location=336..1838]end'
>>> re.sub('lcl.*?location.*?\]', '', a)
'beginend'
Why not use startswith()?
with open('lcl.txt', 'r') as f:
    for line in f.readlines():
        if line.startswith("lcl|"):
            print ("lcl line dropping it")
            continue
        else:
            print (line)
Result:
lcl line dropping it
ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGAAAAAATTCGGTTAAGGCCAG GGGGAAAGAAAAAATATAAATTAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCACTCTTTGGCAACGACCCCTCGTCACAATAA
lcl line dropping it
TTTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGACCAGAGCCA ACAGCCCCACCAGAAGAGAGCTTCAGGTCTGGGGTAGAGACAACAACTCCCCCTCAGAAGCAGGAGCCGA
lcl line dropping it
ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAAAGTT TAGTAAAACACCATATGTATGTTTCAGGGAAAGCTAGGGGATGGTTTTATAGACATCACTATGAAAGCCC
Note: I am assuming that there are newlines in the right places here!
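Since the question asks for the remaining lines to be stored in a list, the startswith() idea can be sketched as a function that collects them instead of printing. This is an illustration for Python 3 only: `sequence_lines` is a hypothetical helper, and the sample data is a shortened version of the file in the question.

```python
def sequence_lines(lines, header_prefix="lcl|"):
    """Return the non-empty lines that do not start with `header_prefix`."""
    kept = []
    for line in lines:
        line = line.strip()  # drop the trailing newline
        if line and not line.startswith(header_prefix):
            kept.append(line)
    return kept


# Shortened stand-in for the file; any iterable of lines works, including an open file.
sample = [
    "lcl|AF033819.3_cds_AAC82593.1_1 [gene=gag] [protein=Gag]\n",
    "ATGGGTGCGAGAGCG\n",
    "GGGGAAAGAAAAAAT\n",
    "lcl|AF033819.3_cds_AAC82598.2_2 [gene=pol] [protein=Pol]\n",
    "TTTTTTAGGGAAGAT\n",
]
print(sequence_lines(sample))
```

With a real file, the call would look like `with open('lcl.txt') as f: kept = sequence_lines(f)`.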

IndexError: list index out of range; split is causing the issue probably

I am just two days into Python and new to this forum, so please bear with me if my question looks silly. I tried searching Stack Overflow, but couldn't relate the information I found to my problem.
Please help me on this
>>> import re
>>> file=open("shortcuts","r")
>>> for i in file:
...     i=i.split(' ',2)
...     if i[1] == '^/giftfordad$':
...         print i[1]
...
^/giftfordad$
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
IndexError: list index out of range
I get my desired output, but along with a list index out of range error.
My file shortcuts contains four columns, delimited by spaces.
Also, please help me find the pattern when the value resides in a variable.
For example
var1=giftcard
How to find ^/var$ in my file. Thanks for help in advance.
Following Jon's suggestion, I arrived at the code below, but I am still looking for an answer: it gives me an empty result. I think the regex needs some tweak for the ^/ and $ characters.
>>> import re
>>> with open('shortcut','r') as fin:
...     lines = (line for line in fin if line.strip() and not line.strip().startswith('#'))
...     for line in lines:
...         if re.match("^/giftfordad$",line):
...             print line
With a shell command I get the answer easily. Can somebody please write or correct this piece of code so that it achieves the result I'm looking for? Many thanks.
$grep "\^\/giftfordad\\$" shortcut
RewriteRule ^/giftfordad$ /home/sathya/?id=456 [R,L]
In this:
i=i.split(' ',2)
if i[1] == '^/giftfordad$':
The IndexError implies that for some line (a blank one, for example) i is a list of length 1, i.e. there was no ' ' character to split on, so i[1] does not exist.
Also, it looks like you might be trying to use a regular expression and if i[1] == '^/giftfordad$' is not the way Python does those. That comparison would be written as:
if i[1] == '/giftfordad':
However, that's a completely valid string if you're grabbing it from a file of a list of regular expressions ;)
Just seen your example:
If you're processing an .htaccess like file, you'll want to ignore blank lines and presumably commented lines...
with open('shortcuts') as fin:
    lines = (line for line in fin if line.strip() and not line.strip().startswith('#'))
    for line in lines:
        stuff = line.split(' ', 2)
        # etc...
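For the follow-up question about a value held in a variable, one possible sketch for Python 3 is below. It assumes the pattern is always the second whitespace-separated column and compares it as plain text, because `^` and `$` in the file are literal characters here, whereas re.match("^/giftfordad$", line) treats them as anchors and therefore never matches a line that starts with RewriteRule. `find_rules` and the sample lines are made up for the illustration.

```python
def find_rules(lines, name):
    """Return the lines whose second column is the literal text ^/<name>$."""
    wanted = "^/" + name + "$"  # built from the variable; plain text, not a regex
    matches = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank and commented lines
        parts = stripped.split()
        if len(parts) > 1 and parts[1] == wanted:
            matches.append(stripped)
    return matches


sample = [
    "# shortcuts\n",
    "\n",
    "RewriteRule ^/tv$ /home/sathya?id=123 [R=301,L]\n",
    "RewriteRule ^/giftfordad$ /home/sathya/?id=456 [R,L]\n",
]
var1 = "giftfordad"
print(find_rules(sample, var1))
```

If a regular expression is really wanted, re.escape(wanted) turns the literal text into a pattern that matches it character for character.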
I took the line you gave as an example, and here is what I found, hoping that fits your expectations (I understood you wanted to replace ^/tv$ by ^/giftfordad$ in column #2):
>>> s = 'RewriteRule ^/tv$ /home/sathya?id=123 [R=301,L]'
>>> parts = s.split()
>>> parts
['RewriteRule', '^/tv$', '/home/sathya?id=123', '[R=301,L]']
>>> if len(parts) > 1:
...     part = parts[1]
...     if not "^/giftfordad$" in part:
...         print ' '.join([parts[0]] + ["^/giftfordad$"] + parts[2:])
...     else:
...         print s
...
RewriteRule ^/giftfordad$ /home/sathya?id=123 [R=301,L]
The line with join is the most complex: I recreate a list by concatenating:
the 1st column unchanged
the 2nd column replaced by ^/giftfordad$
the rest of the columns unchanged
join is then used to join all these elements as a string.
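The concatenate-and-join step described above can be wrapped in a small function. This is a sketch for Python 3, and `replace_second_column` is a hypothetical name; note that split() followed by ' '.join() collapses runs of whitespace into single spaces.

```python
def replace_second_column(line, new_value):
    """Return `line` with its second whitespace-separated column replaced."""
    parts = line.split()
    if len(parts) < 2:
        return line  # nothing to replace on blank or one-column lines
    # first column unchanged + new second column + remaining columns unchanged
    return " ".join([parts[0], new_value] + parts[2:])


s = 'RewriteRule ^/tv$ /home/sathya?id=123 [R=301,L]'
print(replace_second_column(s, '^/giftfordad$'))
```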

How to split a string based on comma as separator with comma within double quotes remaining as it is in python

I want to split a string on commas, but commas inside double quotes should be kept as they are. For that I wrote the following code, but it does not seem to work. Can someone please help me figure out what the error is?
>>> from csv import reader
>>> l='k,<livesIn> "Dayton,_Ohio"'
>>> l1=[]
>>> l1.append(l)
>>> for line1 in reader(l1):
...     print line1
...
The output which I am getting is:
['k', '<livesIn> "Dayton', '_Ohio"']
Whereas I want the output as: ['k', '<livesIn> "Dayton,_Ohio"'] i.e. I don't want "Dayton,_Ohio" to get separated.
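csv.reader only treats a double quote as a quoting character when it is the first character of a field. Here the quote appears in the middle of the field, so the comma inside it still splits the field. One way around that is to rejoin everything after the first field: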
>>> from csv import reader
>>> l='k,<livesIn> "Dayton,_Ohio"'
>>> l1=[]
>>> l1.append(l)
>>> for line in reader(l1):
...     print list((line[0], ','.join(line[1:])))
...
['k', '<livesIn> "Dayton,_Ohio"']
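If only the first comma ever acts as a separator, as in this example, the csv module is not needed at all. The sketch below (Python 3) relies on that assumption, which is not stated in the question; it uses the maxsplit argument of str.split.

```python
l = 'k,<livesIn> "Dayton,_Ohio"'

# maxsplit=1: split at the first comma only, leave later commas alone.
fields = l.split(',', 1)
print(fields)
```

For input where several fields can contain quoted commas, this shortcut no longer applies and the quotes would have to start their fields for csv.reader to honour them.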

Using the split function in Python

I am working with the CSV module, and I am writing a simple program which takes the names of several authors listed in the file, and formats them in this manner: john.doe
So far, I've achieved the results that I want, but I am having trouble getting the code to exclude titles such as "Mr.", "Mrs.", etc. I've been thinking about using the split function, but I am not sure if this would be a good use for it.
Any suggestions? Thanks in advance!
Here's my code so far:
import csv
books = csv.reader(open("books.csv","rU"))
for row in books:
    print '.'.join ([item.lower() for item in [row[index] for index in (1, 0)]])
It depends on how messy the strings are; in the worst case this regexp-based solution should do the job:
import re
x=re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)
x.sub("", text)
(I'm using re.compile() here because re.sub in Python 2.6 doesn't accept the flags= keyword argument; it was added in Python 2.7.)
UPDATE: I wrote some code to test that and, although I wasn't able to figure out a way to automate results checking, it looks like that's working fine.. This is the test code:
import re
x=re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)
names = ["".join([a,b,c,d]) for a in ['', ' ', ' ', '..', 'X'] for b in ['mr', 'Mr', 'miss', 'Miss', 'mrs', 'Mrs', 'ms', 'Ms'] for c in ['', '.', '. ', ' '] for d in ['Aaaaa', 'Aaaa Bbbb', 'Aaa Bbb Ccc', ' aa ']]
print "\n".join([" => ".join((n,x.sub('',n))) for n in names])
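One way to automate the checking is to pair a few inputs with the results expected from them and compare. The sketch below is written for Python 3 and reuses the same pattern; `strip_title` and the test cases are chosen for this illustration and are far from exhaustive.

```python
import re

TITLE_RE = re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)


def strip_title(name):
    """Remove a leading Mr/Mrs/Ms/Miss, with optional dots or spaces, from `name`."""
    return TITLE_RE.sub("", name)


# (input, expected output) pairs
cases = [
    ("Mr. John", "John"),
    ("mrs Jane Doe", "Jane Doe"),
    ("  Miss.  Ann", "Ann"),
    ("Mr.John", "John"),
    ("Msgr Smith", "Msgr Smith"),      # no dot or space after "Ms": left alone
    ("XmrAaaaa", "XmrAaaaa"),          # title not at the start: left alone
]
for raw, expected in cases:
    assert strip_title(raw) == expected, (raw, strip_title(raw))
print("all cases passed")
```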
Depending on the complexity of your data and the scope of your needs you may be able to get away with something as simple as stripping titles from the lines in the csv using replace() as you iterate over them.
Something along the lines of:
titles = ["Mr.", "Mrs.", "Ms", "Dr"] #and so on
for line in lines:
    line_data = line
    for title in titles:
        line_data = line_data.replace(title,"")
    #your code for processing the line
This may not be the most efficient method, but depending on your needs may be a good fit.
How this could work with the code you posted (I am guessing the Mr./Mrs. is part of column 1, the first name):
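import csv
titles = ["Mr.", "Mrs.", "Ms", "Dr"] #and so on
books = csv.reader(open("books.csv","rU"))
for row in books:
    first_name = row[1]
    last_name = row[0]
    for title in titles:
        first_name = first_name.replace(title,"")
    # strip() drops the space left behind once the title is removed
    print '.'.join((first_name.strip(), last_name))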
