I am new to python and I am trying make a program that reads a file, and puts the information in its own vectors. the file is an xyz file that looks like this:
45
Fe -0.055 0.033 -0.047
N -0.012 -1.496 1.451
N 0.015 -1.462 -1.372
N 0.000 1.386 1.481
N 0.070 1.417 -1.339
C -0.096 -1.304 2.825
C 0.028 -1.241 -2.739
C -0.066 -2.872 1.251
C -0.0159 -2.838 -1.205
Starting from the 3rd line I need to place each in its own vectors, so far I have this:
file=open("Question4.xyz","r+")
A = []
B = []
C = []
D = []
counter=0
for line in file:
if counter>2: #information on particles start on the 2nd line
a,b,c,d=line.split()
A.append(a)
B.append(float(b))
C.append(float(c))
D.append(float(d))
counter=counter+1
I am getting this error:
File "<pyshell#72>", line 3, in <module>
a,b,c,d=line.split()
ValueError: need more than 0 values to unpack
Any ideas on where I am going wrong?
Thanks in advance!
It looks like you have lines in your that doesn't actually result in 4 items on splitting. Add a condition for that.
for line in file:
spl = line.strip().split()
if len(spl) == 4: # this will take care of both empty lines and
# lines containing greater than or less than four items
a, b, c, d = spl
A.append(a)
B.append(float(b))
C.append(float(c))
D.append(float(d))
Would you happen to have an empty line somewhere, by any chance (or with only a '\n') ?
You could force
if counter >= 2:
if line.strip():
(a,b,c,d) = line.strip().split()
An advantage of not checking whether your split line has a len of 4 is that it won't silently skip the line if it doesn't have the right number of fields (like you experienced yourself with the empty lines at the end of your files): you'll get an exception instead, which forces you to double-check your input (or your logic).
Related
I have a list of 3-gram terms of around 10000 in a .txt file. I want to match these terms within multiple .GHC files under a directory and count the occurrences of each of the terms.
One of these files looks like this:
ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b kernel32.dll+0xb511 kernel32.dll+0x16d4f
I want the resulting output to be like this in a dataframe:
N_gram_term_1 N_gram_term_2 ............ N_gram_term_n
2 1 0
3 2 4
3 0 3
the 2nd line here indicates that N_gram_term_1 has appeared 2 times in one file and N_gram_term_2 1 time and so on.
the 3rd line indicates that N_gram_term_1 has appeared 3 times in second file and N_gram_term_2 2 times and so on.
If I need to be more clear about something, please let me know.
I am sure you have implementations for this purpoes, perhaps in sklearn. A simple implementation from scratch, though would be:
import sys
d = {} # dictionary that will have 1st key = file and 2 key = 3gram
for file in sys.argv[1:]: # These are all files to be analyzed
d[file] = {} # The value here is a nested dictionary
with open(file) as f: # Opening each file at a time
for line in f: # going through every row of the file
g = line.strip()
if g in d[file]:
d[file][g] +=1
else:
d[file][g] = 1
import pandas
print(pandas.DataFrame(d).T)
I have this file containing 82 pairs of IDs:
EmuJ_000063620.1 EgrG_000063620.1 253 253
EmuJ_000065200.1 EgrG_000065200.1 128 128
EmuJ_000081200.1 EgrG_000081200.1 1213 1213
EmuJ_000096200.1 EgrG_000096200.1 295 298
EmuJ_000114700.1 EgrG_000114700.1 153 153
EmuJ_000133800.1 EgrG_000133800.1 153 153
EmuJ_000139900.1 EgrG_000144400.1 2937 2937
EmuJ_000164600.1 EgrG_000164600.1 167 167
and I have two other files with the sequences for EmuJ_* IDs and EgrG_* IDs as follows:
EgrG_sequences.fasta:
>EgrG_000632500.1
MKKKSHRKSPEGNHSLTKAANKDTAKCNEERGRNIGQSNEEENATRSEKDREGDEDRNLREYVISIAQKYYPHLVSCMRQDDDNQASADARGADGANDEEHCPKHCPRLNAQKYYLYSATCNHHCEDSQASCDEEGDGKRLLKQCLLWLTERYYPSLAARIRQCNDDQASSNAHGADETDDGDRRLKQALLLFAKKLYPCVTTCIRHCVADHTSHDARGVDEEVDGEQLLKQCLHSSAQKFYPRLAACVCHCDADHASTETCGALGVGNAERCPQQCPCLCAQQYYVQSATCVHHCDNEQSSPETRGVKEDVDVEQLLKQCLLMFAEKFHPTLAAGIRSCADDESSHVASVEGEDDADKQRLKQYLLLFAQKYYPHLIAYIQKRDDDQSSSSVRDSGEEANEEEERLKQCLLLFAQKLYPRLVAYTGRCDSNQSTSDGCSVDGEEAEKHYLKQSLLLLAQKYYPSLAAYLRQFDDNQSSSDVRSVDEEEAEKRHLKQGLLFFAEKYYPSLATYIRRCDDDQSSSDARVVDEVDDEDRRLKQGLLLLAQKYYPPLANYIRHSQSSFNVCGADEKEDEEHCLNQLPRLCAQEAYIRSSSCSHHCDDDQASNDTLVVDKEEEEKYRLKQGLLLLAQKFYPPLATCIHQCDDQSSHDTRGVDEEEAEEQLLKKCLLMFAEKFYPSLAATIHHSVYDQASFDMRDVDTENDETHCLSLSAENYSTASTTCIHHSDGDQSTSDACGVEEGDVEEQRLKRGLLLLAQKYYPSLAAYICQCDDYQPSSDVCGVGEEDTGEERLKQCLLLFAKKFYPSLASRNSQCGDNLILNDEVVGETVINSDTDTDEVTPVEKSTAVCDEVDEVPFKYVGSPTPLSDVDVDSLEKVIPPNDLTAHSSFQNSLDHSVEGGYPDRAFYIGRHTVESADSTAPLSKSSSTKLYFSNTDEFPTEEEVSSPIAPLSIQRRIRIYLEDLENVRKVSLIPLCKTDKFGNPQEEIIIDSNLDDDTDESKLSSVDVEFTMEQADATPLDLEAQDEDLKNCVAIILKHIWSELMECIRREGLSDVYELSLGDRRIEVPQDDVCLVR*
>EgrG_000006700.1
MTDTKGPDESYFEKEAFSSLPQPVDSPSASATDTDRIPVVAVSLPVSSGSIDVNCNCSCYLIICETKLIIDYQMTRKW*
and so on. The same for EmuJ_sequences.fasta
I need to get the sequences for each pair and write one after the other maintaining the order like this:
>EmuJ_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYENQCRLDTEVKRLCSNIQTFNRQVDMWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EgrG_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYDNQCRLDTEVKRLCSNIQTFNCQVDLWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EmuJ_000065200.1
MLCLITPFPSVVPVCVRTCVCMCPCPLLLILYTWSAYLVPFSLPLCLYAHFHIRFLPPFSSLSIPRFLTHSLFLPSYPPLTMLRMKKSLAPCPAERR*
>EgrG_000065200.1
MLCLVTSFPSAVPVCMRTCVCMCSCPLLLILYTWSAYLVPFSLPLCLYTHLHIRFLPPFPSLAIPRFLTHPLFLPTSLYVADKKEPSAMPRRASLRQMLLIVLLQELH*
>EmuJ_000081200.1
MNSLRIFAVVITCLMVVGFSYSIHPTFPSYQSVVWHSSANTGYECRDGICGYRCSNPWCHGFGSILHPQMGVQEMWGSAAHGRHAHSRAMTEFLAKASPEDVTMLIESTPNIDEVITSLDGEAVTILINKLPNLRRVMEELKPQTKMHIVSKLCGKVGSAMEWTEARRNDGSGMWNEYGSGWEGIDAIQDLEAEVIMRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPTPATSVSSMDPLVLRSIILNMPNLNDILMQVDPVYLQSALVHVPGFGAYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQRMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPRAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPTRFANVLLHVKPDYVRYIIEKHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPRVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAATASATEEPTVTVDYANLLRSKIPLIDNVIKMSDPEKVAILRDNLLDVSRILVNLDPTMLRNINSIIFNATKMLNELSVFLVEYPLEYLHKEGKSGVAVNKSEQVGTTGENGVSSIAVEKLQMVLLKIPLFDQFLKWIDQKKLHELLNKIPTLLEVIATANQETLDKINSLLHDAIATMNTAKKLIVTGICRKLAEEGKLRLPRVCPSAST*
>EgrG_000081200.1
MNLLRIFAVVITCLIVVGFGYPTHPTFPSYQTAVWHSSANTGYRCRAGICGYRCSSPWCHGFESALYPQMAMQEMWGSGAHGRHAHSRTMTEFLMKASPEDLTMLIESTPNIDEVITSLDSEAIIILINKLPNLRRVMEKLKPQTKMHIVSKLCDKVGNAMEWAGARRNDGSGMWNEYGSVWEGIDAIQDLEAEMITRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPMPATLVSNMDPHVLRSIILNMPNLDDILMQVDPVHLQSALMYVPGFGTYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQWMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPKAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPARFANVLLHVKPDYVRYIIENHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPNVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAPTASVTEASMVTVDYAHLLRSKIPLIDNVIKMSDPAKVAILRDNLLDVGTTDENGVSSITVEKLQMVLLKIPLFDQFLNWIDSKKLHALLQKIPTLLEVIATANQEALDKINLLLHDAIATMNTAKKLIVTSICRKLAEEGKLRLPRVCPSTST*
And so on.
I wrote a script in bash to do this and it worked like I wanted, it was very simple. Now I'm trying to do the same in Python (which I'm learning), but I'm having a hard time to do the same in a pythonic way.
I've tried this, but I've got only the first pair and then it stopped:
rbh=open('rbh_res_eg-not-sec.txt', 'r')
ems=open('em_seq.fasta', 'r')
egs=open('eg_seq.fasta', 'r')
for l in rbh:
emid=l.split('\t')[0]
egid=l.split('\t')[1]
# ids=emid+'\n'+egid
# print ids # just to check if split worked
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
for lg in egs:
if egid in lg:
print lg.strip()
print next(egs).strip()
I've tried some variations but I've got only the IDs, without the sequences.
So, how can I find the ID in the sequence file, print it and the line after it (the line with sequence referring to the ID)?
Please, let me know if I explained it clearly.
Iterating over a file moves the file pointer until it reaches the end of the file (the last line), so after the first iteration of your outer loop, the ems and egs files are exhausted.
The quick&dirty workaround would be to reset the ems and egs pointers to zero at the end of the outer loop, ie:
for line in rbh:
# no need to split twice
parts = line.split("\t")
emid, egid = parts[0].strip(), parts[1].strip()
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
ems.seek(0) # reset the file pointer
for lg in egs:
if egid in lg:
print lg.strip()
print next(egs).strip()
egs.seek(0) # reset the file pointer
Note that calling next(iterator) while already iterating over iterator will consume one more item of the iterator, as illustrated here:
>>> it = iter(range(20))
>>> for x in it:
... print x, next(it)
...
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
16 17
18 19
As you can see, we don't iter on each element of our range here... Given your file format it should not be a huge problem but I thought I'd still warn you about it.
Now your algorithm is far from efficient - for each line of the rbh file it will read scan the whole ems and egs files again and again.
_NB : the following assumes that each emid / egid will appear at most once in the fasta files._
If your ems and egs files are not too large and you have enough available memory, you could load them into a pair of dicts and do a mere dict lookup (which is O(1) and possibly one of the most optimized operation in Python)
# warning: totally untested code
def fastamap(path):
d = dict()
with open(path) as f:
for num, line in enumerate(f, 1):
line = line.strip()
# skip empty lines.
if not line:
continue
# sanity check: we should only see
# lines starting with ">", the "value"
# lines being consumed by the `next(f)` call
if not line.startswith(">"):
raise ValueError(
"in file %s: line %s doesn't start with '>'" % (
path, num
))
# ok, proceed
d[line.lstrip(">")] = next(f).strip()
return d
ems = fastamap('em_seq.fasta')
egs = fastamap('eg_seq.fasta')
with open('rbh_res_eg-not-sec.txt') as rhb:
for line in rhb:
parts = line.split("\t")
emid, egid = parts[0].strip(), parts[1].strip()
if emid in ems:
print emid
print ems[emid]
if egid in egs:
print egid
print egs[egid]
If this doesn't fly because of memory issues, well bad luck you're stuck with sequential scan (unless you want to use some database system but this might be a bit overkill), but - always assuming the emid/egid each only appears once in the fasta files - you can at least exit the inner loops once you've find your target:
for l in rbh:
# no need to split twice, you can just unpack
emid, egid = l.split('\t')
for lm in ems:
if emid in lm:
print lm.strip()
print next(ems).strip()
break # no need to go further
ems.seek(0) # reset the file pointer
# etc...
I'm fairly new to python and made something that had this output:
(The text is in a csv file so so:
1,A
2,B
3,C etc)
Number Letter
1 A
2 B
3 C
26 Z
Unfortunately, I spent a good amount of time making it using a complicated method in which I manually made spaces like this:
Updated Code rn
fx = int(input('Number?\n'))
f=open('nums.txt','r')
lines=f.readlines()
line = lines[fx - 1]
with open('nums.txt','r') as f:
for i, line in enumerate(f):
if i >= 5:
break
NUM, LTR, SMB = line.rsplit(',', 1)
print(NUM.ljust(13) + LTR.ljust(13) + SMB)
How do I get it to make 3 columns? Right now it comes up with a
ValueError: not enough values to unpack (expected 3, got 2)
So is there a simpler method of achieving this that doesn't move the strings around like this:
Number Letter
1 A
2 B
3 C
26 Z #< string moves with spaces.
For simple alignment, you can use ljust or rjust. There is also no need to read the entire file for each line you want to process:
with open('numberletter','r') as f:
for i, line in enumerate(f):
if i >= 5:
break
number, letter = line.rsplit(',', 1)
print(number.ljust(13) + letter)
For more complex output formatting, look at str.format() and the formatting syntax
You can use sys module for that.
import sys
a=[1,"A"]
sys.stdout.write("%-6s %-50s " % (a[0],a[1]))
I am a beginning python user (trying to learn for bioinformatics) and I am having difficulties in getting my final 'for loop' correct. I have used a web-based bioinformatic program to assess the subcellular localization of certain proteins (protein names and sequences contained within ORFs) and I am trying to parse the results (contained within targetp). The web-based program that I've used truncates the names of the proteins (and does not include sequences), and I would like to parse my results file such that I have the complete name and sequence of each protein in FASTA format (this entails having a '>' + the protein name on one line, and the protein sequence on the subsequent line). I think that everything is going well until the last block of code; I end up with the proper protein names, but they are all appended to the same sequence. I know that there must be something simple that I am doing wrong, but I just can't figure it out. Any ideas?
Thanks!
The ORFs file looks like this (it's FASTA, but the " shouldn't be there, only >):
">HsaNP_000700 branched chain keto acid dehydrogenase E1, alpha polypeptide
MAVAIAAARVWRLNRGLSQAALLLLRQPGARGLARSHPPRQQQQFSSLDDKPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKEKVLKLYKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDNTDLVFGQYREAGVLMYRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKERHFVTISSPLATQIPQAVGAAYAAKRANANRVVICYFGEGAASEGDAHAGFNFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPGYGIMSIRVDGNDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYRSVDEVNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERKPKPNPNLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK
">HsaNP_060914 pyruvate dehydrogenase phosphatase precursor
MPAPTQLFFPLIRNCELSRIYGTACYCHHKHLCCSSSYIPQSRLRYTPHPAYATFCRPKENWWQYTQGRRYASTPQKFYLTPPQVNSILKANEYSFKVPEFDGKNVSSILGFDSNQLPANAPIEDRRSAATCLQTRGMLLGVFDGHAGCACSQAVSERLFYYIAVSLLPHETLLEIENAVESGRALLPILQWHKHPNDYFSKEASKLYFNSLRTYWQELIDLNTGESTDIDVKEALINAFKRLDNDISLEAQVGDPNSFLNYLVLRVAFSGATACVAHVDGVDLHVANTGDSRAMLGVQEEDGSWSAVTLSNDHNAQNERELERLKLEHPKSEAKSVVKQDRLLGLLMPFRAFGDVKFKWSIDLQKRVIESGPDQLNDNEYTKFIPPNYHTPPYLTAEPEVTYHRLRPQDKFLVLATDGLWETMHRQDVVRIVGEYLTGMHHQQPIAVGGYKVTLGQMHGLLTERRTKMSSVFEDQNAATHLIRHAVGNNEFGTVDHERLSKMLSLPEELARMYRDDITIIVVQFNSHVVGAYQNQE
The targetp file looks like this (the M is in position 57, but the formatting here throws this off):
HsaNP_000700 445 0.939 0.020 0.089 M 1
HsaNP_060914 537 0.309 0.073 0.629 _ 4
The leftmost column in targetp is the identifier (part of the header line in each protein sequence above), and I want to return only entries with an 'M' (i.e., not '_') in position 57, along with the protein name from ORFs (header line).
My script is:
#!/usr/bin/python
ORFs = open('Human.MitoCarta.fasta', 'U')
targetp = open('MitoCarta_TargetP_combined.out', 'U')
report = targetp.readlines()
protfile = open('mitocarta_no_mTP.fasta','w')
protid = []
seqdict = {}
for seq in ORFs:
seq = seq.rstrip()
if seq[0] == '':
continue
if seq[0] == '>':
name = seq[1:]
seqdict[name] = ''
continue
seqdict[name] += seq
for entry in report:
if entry.startswith('HsaNP'):
if entry[57] != 'M':
protid.append(entry[0:20])
protid = [x.strip(' ') for x in protid]
nameslist = seqdict.keys()
c = 0
for i in protid:
if i in nameslist[c]:
protfile.write('>%s\n%s\n\n' % (nameslist[c], seqdict[name]))
c += 1
protfile.close()
Yes, you are writing nameslist[c] and seqdict[name] but you never change 'name'. So you need to change 'name' if you want to get the different sequences. You should write:
protfile.write('>%s\n%s\n\n' % (nameslist[c], seqdict[nameslist[c]]))
That way you should get it right.
So the question basically gives me 19 DNA sequences and wants me to makea basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third is the number of "A"'s, 4th is "G"'s, 5th is "C", 6th is "T", 7th is %GC, 8th is whether or not it has "TGA" in the sequence. Then I get all these values and write a table to "dna_stats.txt"
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
if line.startswith(">"):
seq+=1
continue
Acount+=line.count("A")
Ccount+=line.count("C")
Gcount+=line.count("G")
Tcount+=line.count("T")
genomeSize=Acount+Gcount+Ccount+Tcount
percentGC=(Gcount+Ccount)*100.00/genomeSize
print "sequence", seq
print "Length of Sequence",len(line)
print Acount,Ccount,Gcount,Tcount
print "Percent of GC","%.2f"%(percentGC)
if "TGA" in line:
print "Yes"
else:
print "No"
fh2 = open("dna_stats.txt","w")
for line in alllines:
splitlines = line.split()
lenstr=str(len(line))
seqstr = str(seq)
fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1,2,3,4,5,etc. to represent all of the sequences. I tried it with the other variables and it just got the total amounts of the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
...
fh2 = open("dna_stats.txt","w")
for line in alllines: #2
....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.
This puts the info in a dictionary as you go and allows for DNA sequences to go over multiple lines
from __future__ import division # ensure things like 1/2 is 0.5 rather than 0
from collections import defaultdict
fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")
seq=0
data = dict()
for line in alllines:
if line.startswith(">"):
seq+=1
data[seq]=defaultdict(int) #default value will be zero if key is not present hence we can do +=1 without originally initializing to zero
data[seq]['seq']=seq
previous_line_end = "" #TGA might be split accross line
continue
data[seq]['Acount']+=line.count("A")
data[seq]['Ccount']+=line.count("C")
data[seq]['Gcount']+=line.count("G")
data[seq]['Tcount']+=line.count("T")
data[seq]['genomeSize']+=data[seq]['Acount']+data[seq]['Gcount']+data[seq]['Ccount']+data[seq]['Tcount']
line_over = previous_line_end + line[:3]
data[seq]['hasTGA']= data[seq]['hasTGA'] or ("TGA" in line) or (TGA in line_over)
previous_line_end = str.strip(line[-4:]) #save previous_line_end for next line removing new line character.
for seq in data.keys():
data[seq]['percentGC']=(data[seq]['Gcount']+data[seq]['Ccount'])*100.00/data[seq]['genomeSize']
s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Ccount)d, %(Tcount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s'
fh2.write(s % data[seq])
fh.close()
fh2.close()