How do you make tables with previously stored strings? - python

So the question basically gives me 19 DNA sequences and wants me to makea basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third is the number of "A"'s, 4th is "G"'s, 5th is "C", 6th is "T", 7th is %GC, 8th is whether or not it has "TGA" in the sequence. Then I get all these values and write a table to "dna_stats.txt"
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
if line.startswith(">"):
seq+=1
continue
Acount+=line.count("A")
Ccount+=line.count("C")
Gcount+=line.count("G")
Tcount+=line.count("T")
genomeSize=Acount+Gcount+Ccount+Tcount
percentGC=(Gcount+Ccount)*100.00/genomeSize
print "sequence", seq
print "Length of Sequence",len(line)
print Acount,Ccount,Gcount,Tcount
print "Percent of GC","%.2f"%(percentGC)
if "TGA" in line:
print "Yes"
else:
print "No"
fh2 = open("dna_stats.txt","w")
for line in alllines:
splitlines = line.split()
lenstr=str(len(line))
seqstr = str(seq)
fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1,2,3,4,5,etc. to represent all of the sequences. I tried it with the other variables and it just got the total amounts of the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7

Well, your biggest issue:
for line in alllines: #1
...
fh2 = open("dna_stats.txt","w")
for line in alllines: #2
....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.

This puts the info in a dictionary as you go and allows for DNA sequences to go over multiple lines
from __future__ import division # ensure things like 1/2 is 0.5 rather than 0
from collections import defaultdict
fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")
seq=0
data = dict()
for line in alllines:
if line.startswith(">"):
seq+=1
data[seq]=defaultdict(int) #default value will be zero if key is not present hence we can do +=1 without originally initializing to zero
data[seq]['seq']=seq
previous_line_end = "" #TGA might be split accross line
continue
data[seq]['Acount']+=line.count("A")
data[seq]['Ccount']+=line.count("C")
data[seq]['Gcount']+=line.count("G")
data[seq]['Tcount']+=line.count("T")
data[seq]['genomeSize']+=data[seq]['Acount']+data[seq]['Gcount']+data[seq]['Ccount']+data[seq]['Tcount']
line_over = previous_line_end + line[:3]
data[seq]['hasTGA']= data[seq]['hasTGA'] or ("TGA" in line) or (TGA in line_over)
previous_line_end = str.strip(line[-4:]) #save previous_line_end for next line removing new line character.
for seq in data.keys():
data[seq]['percentGC']=(data[seq]['Gcount']+data[seq]['Ccount'])*100.00/data[seq]['genomeSize']
s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Ccount)d, %(Tcount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s'
fh2.write(s % data[seq])
fh.close()
fh2.close()

Related

how to replace (or delete) a part of string from txt file in python

i am very new in python (and programming in general) and here is my issue. i would like to replace (or delete) a part of a string from a txt file which contains hundreds or thousands of lines. each line starts with the very same string which i want to delete.
i have not found a method to delete it so i tried a replace it with empty string but for some reason it doesn't work.
here is what i have written:
file = "C:/Users/experimental/Desktop/testfile siera.txt"
siera_log = open(file)
text_to_replace = "Chart: Bar Backtest: NQU8-CME [CB] 1 Min #1 | Study: free dll = 0 |"
for each_line in siera_log:
new_line = each_line.replace("text_to_replace", " ")
print(new_line)
when i print it to check if it was done, i can see that the lines are as they were before. no change was made.
can anyone help me to find out why?
each line starts with the very same string which i want to delete.
The problem is you're passing a string "text_to_replace" rather than the variable text_to_replace.
But, for this specific problem, you could just remove the first n characters from each line:
text_to_replace = "Chart: Bar Backtest: NQU8-CME [CB] 1 Min #1 | Study: free dll = 0 |"
n = len(text_to_replace)
for each_line in siera_log:
new_line = each_line[n:]
print(new_line)
If you quote a variable it becomes a string literal and won't be evaluated as a variable.
Change your line for replacement to:
new_line = each_line.replace(text_to_replace, " ")

Python get previous few elements if match checked element

I have some structured data in a text file:
Parse.txt
name1
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
gggggggg
some of the detail4s do not have data and would be replaced by "-":
name2
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
-
How do i parse the data to get the elements below detail1, detail2 and detail3 of only the data with empty detail4s?
So far i have a partially working code but the problem is that it gets each item 40 times. Please help.
Code:
data = []
with open("parse.txt","r",encoding="utf-8") as text_file:
for line in text_file:
data.append(line)
det4li = []
finali= []
for elem,det4 in zip(data,data[1:]):
if "detail4" in elem:
det4li .append(det4)
if "-" in det4:
for elem1,det1,det2,det3 in zip(data,data[1:],data[3:],data[5:]):
if "detail1:" in elem1:
finali.append(det1.strip() + "," + det2.strip() + "," + det3)
Current Output: 40 records of dddddddd,eeeeeeee,ffffffff
Desired Output: dddddddd,eeeeeeee,ffffffff
Don't try to look ahead. Look behind, by storing preceding data:
final = []
with open("parse.txt","r",encoding="utf-8") as text_file:
section = {}
last_header = None
for line in text_file:
line = line.strip()
if line.startswith('detail'):
# detail line, record for later use
last_header = line.rstrip(':')
elif not last_header:
# name line, store as such
section['name'] = line
else:
section[last_header] = line
if last_header == 'detail4':
# section complete, process
if line == '-':
# A section we want to keep
final.append(section)
# reset section data
section, last_header = {}, None
This has the added advantage that you now don't need to read the whole file into memory. If you turn this into a generator (by putting it into a function and replacing the final.append(section) line with yield section), you can even process those matching sections as you read the file without sacrificing readability.

Print match and line after match

I have this file containing 82 pairs of IDs:
EmuJ_000063620.1 EgrG_000063620.1 253 253
EmuJ_000065200.1 EgrG_000065200.1 128 128
EmuJ_000081200.1 EgrG_000081200.1 1213 1213
EmuJ_000096200.1 EgrG_000096200.1 295 298
EmuJ_000114700.1 EgrG_000114700.1 153 153
EmuJ_000133800.1 EgrG_000133800.1 153 153
EmuJ_000139900.1 EgrG_000144400.1 2937 2937
EmuJ_000164600.1 EgrG_000164600.1 167 167
and I have two other files with the sequences for EmuJ_* IDs and EgrG_* IDs as follows:
EgrG_sequences.fasta:
>EgrG_000632500.1
MKKKSHRKSPEGNHSLTKAANKDTAKCNEERGRNIGQSNEEENATRSEKDREGDEDRNLREYVISIAQKYYPHLVSCMRQDDDNQASADARGADGANDEEHCPKHCPRLNAQKYYLYSATCNHHCEDSQASCDEEGDGKRLLKQCLLWLTERYYPSLAARIRQCNDDQASSNAHGADETDDGDRRLKQALLLFAKKLYPCVTTCIRHCVADHTSHDARGVDEEVDGEQLLKQCLHSSAQKFYPRLAACVCHCDADHASTETCGALGVGNAERCPQQCPCLCAQQYYVQSATCVHHCDNEQSSPETRGVKEDVDVEQLLKQCLLMFAEKFHPTLAAGIRSCADDESSHVASVEGEDDADKQRLKQYLLLFAQKYYPHLIAYIQKRDDDQSSSSVRDSGEEANEEEERLKQCLLLFAQKLYPRLVAYTGRCDSNQSTSDGCSVDGEEAEKHYLKQSLLLLAQKYYPSLAAYLRQFDDNQSSSDVRSVDEEEAEKRHLKQGLLFFAEKYYPSLATYIRRCDDDQSSSDARVVDEVDDEDRRLKQGLLLLAQKYYPPLANYIRHSQSSFNVCGADEKEDEEHCLNQLPRLCAQEAYIRSSSCSHHCDDDQASNDTLVVDKEEEEKYRLKQGLLLLAQKFYPPLATCIHQCDDQSSHDTRGVDEEEAEEQLLKKCLLMFAEKFYPSLAATIHHSVYDQASFDMRDVDTENDETHCLSLSAENYSTASTTCIHHSDGDQSTSDACGVEEGDVEEQRLKRGLLLLAQKYYPSLAAYICQCDDYQPSSDVCGVGEEDTGEERLKQCLLLFAKKFYPSLASRNSQCGDNLILNDEVVGETVINSDTDTDEVTPVEKSTAVCDEVDEVPFKYVGSPTPLSDVDVDSLEKVIPPNDLTAHSSFQNSLDHSVEGGYPDRAFYIGRHTVESADSTAPLSKSSSTKLYFSNTDEFPTEEEVSSPIAPLSIQRRIRIYLEDLENVRKVSLIPLCKTDKFGNPQEEIIIDSNLDDDTDESKLSSVDVEFTMEQADATPLDLEAQDEDLKNCVAIILKHIWSELMECIRREGLSDVYELSLGDRRIEVPQDDVCLVR*
>EgrG_000006700.1
MTDTKGPDESYFEKEAFSSLPQPVDSPSASATDTDRIPVVAVSLPVSSGSIDVNCNCSCYLIICETKLIIDYQMTRKW*
and so on. The same for EmuJ_sequences.fasta
I need to get the sequences for each pair and write one after the other maintaining the order like this:
>EmuJ_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYENQCRLDTEVKRLCSNIQTFNRQVDMWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EgrG_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYDNQCRLDTEVKRLCSNIQTFNCQVDLWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EmuJ_000065200.1
MLCLITPFPSVVPVCVRTCVCMCPCPLLLILYTWSAYLVPFSLPLCLYAHFHIRFLPPFSSLSIPRFLTHSLFLPSYPPLTMLRMKKSLAPCPAERR*
>EgrG_000065200.1
MLCLVTSFPSAVPVCMRTCVCMCSCPLLLILYTWSAYLVPFSLPLCLYTHLHIRFLPPFPSLAIPRFLTHPLFLPTSLYVADKKEPSAMPRRASLRQMLLIVLLQELH*
>EmuJ_000081200.1
MNSLRIFAVVITCLMVVGFSYSIHPTFPSYQSVVWHSSANTGYECRDGICGYRCSNPWCHGFGSILHPQMGVQEMWGSAAHGRHAHSRAMTEFLAKASPEDVTMLIESTPNIDEVITSLDGEAVTILINKLPNLRRVMEELKPQTKMHIVSKLCGKVGSAMEWTEARRNDGSGMWNEYGSGWEGIDAIQDLEAEVIMRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPTPATSVSSMDPLVLRSIILNMPNLNDILMQVDPVYLQSALVHVPGFGAYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQRMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPRAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPTRFANVLLHVKPDYVRYIIEKHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPRVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAATASATEEPTVTVDYANLLRSKIPLIDNVIKMSDPEKVAILRDNLLDVSRILVNLDPTMLRNINSIIFNATKMLNELSVFLVEYPLEYLHKEGKSGVAVNKSEQVGTTGENGVSSIAVEKLQMVLLKIPLFDQFLKWIDQKKLHELLNKIPTLLEVIATANQETLDKINSLLHDAIATMNTAKKLIVTGICRKLAEEGKLRLPRVCPSAST*
>EgrG_000081200.1
MNLLRIFAVVITCLIVVGFGYPTHPTFPSYQTAVWHSSANTGYRCRAGICGYRCSSPWCHGFESALYPQMAMQEMWGSGAHGRHAHSRTMTEFLMKASPEDLTMLIESTPNIDEVITSLDSEAIIILINKLPNLRRVMEKLKPQTKMHIVSKLCDKVGNAMEWAGARRNDGSGMWNEYGSVWEGIDAIQDLEAEMITRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPMPATLVSNMDPHVLRSIILNMPNLDDILMQVDPVHLQSALMYVPGFGTYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQWMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPKAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPARFANVLLHVKPDYVRYIIENHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPNVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAPTASVTEASMVTVDYAHLLRSKIPLIDNVIKMSDPAKVAILRDNLLDVGTTDENGVSSITVEKLQMVLLKIPLFDQFLNWIDSKKLHALLQKIPTLLEVIATANQEALDKINLLLHDAIATMNTAKKLIVTSICRKLAEEGKLRLPRVCPSTST*
And so on.
I wrote a script in bash to do this and it worked like I wanted, it was very simple. Now I'm trying to do the same in Python (which I'm learning), but I'm having a hard time to do the same in a pythonic way.
I've tried this, but I've got only the first pair and then it stopped:
rbh=open('rbh_res_eg-not-sec.txt', 'r')
ems=open('em_seq.fasta', 'r')
egs=open('eg_seq.fasta', 'r')
for l in rbh:
emid=l.split('\t')[0]
egid=l.split('\t')[1]
# ids=emid+'\n'+egid
# print ids # just to check if split worked
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
for lg in egs:
if egid in lg:
print lg.strip()
print next(egs).strip()
I've tried some variations but I've got only the IDs, without the sequences.
So, how can I find the ID in the sequence file, print it and the line after it (the line with sequence referring to the ID)?
Please, let me know if I explained it clearly.
Iterating over a file moves the file pointer until it reaches the end of the file (the last line), so after the first iteration of your outer loop, the ems and egs files are exhausted.
The quick&dirty workaround would be to reset the ems and egs pointers to zero at the end of the outer loop, ie:
for line in rbh:
# no need to split twice
parts = line.split("\t")
emid, egid = parts[0].strip(), parts[1].strip()
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
ems.seek(0) # reset the file pointer
for lg in egs:
if egid in lg:
print lg.strip()
print next(egs).strip()
egs.seek(0) # reset the file pointer
Note that calling next(iterator) while already iterating over iterator will consume one more item of the iterator, as illustrated here:
>>> it = iter(range(20))
>>> for x in it:
... print x, next(it)
...
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
16 17
18 19
As you can see, we don't iter on each element of our range here... Given your file format it should not be a huge problem but I thought I'd still warn you about it.
Now your algorithm is far from efficient - for each line of the rbh file it will read scan the whole ems and egs files again and again.
_NB : the following assumes that each emid / egid will appear at most once in the fasta files._
If your ems and egs files are not too large and you have enough available memory, you could load them into a pair of dicts and do a mere dict lookup (which is O(1) and possibly one of the most optimized operation in Python)
# warning: totally untested code
def fastamap(path):
d = dict()
with open(path) as f:
for num, line in enumerate(f, 1):
line = line.strip()
# skip empty lines.
if not line:
continue
# sanity check: we should only see
# lines starting with ">", the "value"
# lines being consumed by the `next(f)` call
if not line.startswith(">"):
raise ValueError(
"in file %s: line %s doesn't start with '>'" % (
path, num
))
# ok, proceed
d[line.lstrip(">")] = next(f).strip()
return d
ems = fastamap('em_seq.fasta')
egs = fastamap('eg_seq.fasta')
with open('rbh_res_eg-not-sec.txt') as rhb:
for line in rhb:
parts = line.split("\t")
emid, egid = parts[0].strip(), parts[1].strip()
if emid in ems:
print emid
print ems[emid]
if egid in egs:
print egid
print egs[egid]
If this doesn't fly because of memory issues, well bad luck you're stuck with sequential scan (unless you want to use some database system but this might be a bit overkill), but - always assuming the emid/egid each only appears once in the fasta files - you can at least exit the inner loops once you've find your target:
for l in rbh:
# no need to split twice, you can just unpack
emid, egid = l.split('\t')
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
break # no need to go further
ems.seek(0) # reset the file pointer
# etc...

python character count problems

I'm trying to count a line 8 characters or less at a time and have it count how many times lower case "f" shows up. The value for how many times f shows up keeps showing zero. Text1.txt has lower case f"" one time on line 1 and 4 times on line 2.
with open("text1.txt","r+") as r:
while True:
cCount = r.readlines(1)
charSet = cCount.count("f")
print charSet
if not cCount:
break
if charSet == 1:
print("hello")
Where has my python logic failed.
Try this:
with open("text1.txt","r") as r:
for line in r:
print(line.count("f"))
this is the proper way to iterate over a file
EDIT: to change " fghfghf" to "3ghgh"
with open("text1.txt","r") as r:
for line in r:
if line.count("f")==3:
print("3"+line.replace("f",""))

matching and dispalying specific lines through python

I have 15 lines in a log file and i want to read the 4th and 10 th line for example through python and display them on output saying this string is found :
abc
def
aaa
aaa
aasd
dsfsfs
dssfsd
sdfsds
sfdsf
ssddfs
sdsf
f
dsf
s
d
please suggest through code how to achieve this in python .
just to elaborate more on this example the first (string or line is unique) and can be found easily in logfile the next String B comes within 40 lines of the first one but this one occurs at lots of places in the log file so i need to read this string withing the first 40 lines after reading string A and print the same that these strings were found.
Also I cant use with command of python as this gives me errors like 'with' will become a reserved keyword in Python 2.6. I am using Python 2.5
You can use this:
fp = open("file")
for i, line in enumerate(fp):
if i == 3:
print line
elif i == 9:
print line
break
fp.close()
def bar(start,end,search_term):
with open("foo.txt") as fil:
if search_term in fil.readlines()[start,end]:
print search_term + " has found"
>>>bar(4, 10, "dsfsfs")
"dsfsfs has found"
#list of random characters
from random import randint
a = list(chr(randint(0,100)) for x in xrange(100))
#look for this
lookfor = 'b'
for element in xrange(100):
if lookfor==a[element]:
print a[element],'on',element
#b on 33
#b on 34
is one easy to read and simple way to do it. Can you give part of your log file as an example? There are other ways that may work better :).
after edits by author:
The easiest thing you can do then is:
looking_for = 'findthis' i = 1 for line in open('filename.txt','r'):
if looking_for == line:
print i, line
i+=1
it's efficient and easy :)

Categories

Resources