reporting the best alignment with pairwise2 - python

I have a FASTQ file of reads, say "reads.fastq". I want to align the sequences to a reference sequence saved as a FASTA file, ref.faa. I am using the following code for this:
import Bio.SeqIO
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# read all reads into memory
reads_array = []
for x in Bio.SeqIO.parse("reads.fastq", "fastq"):
    reads_array.append(x)

# the reference FASTA holds a single record
for x in Bio.SeqIO.parse("ref.faa", "fasta"):
    refseq = x

result = open("alignments_G10_final", "w")
aligned_reads = []
for x in reads_array:
    alignments = pairwise2.align.globalms(str(refseq.seq).upper(), str(x.seq), 2, -1, -5, -0.05)
    for a in alignments:
        result.write(format_alignment(*a))
    aligned_reads.append(x)
But I want to report only the best alignment for each read. How can I choose this alignment using the scores in a[2]? I want to keep the alignment(s) with the highest value of a[2].

You could sort the alignments according to the score in a[2] and keep the highest-scoring one:
import operator

for x in reads_array:
    alignments = pairwise2.align.globalms(
        str(refseq.seq).upper(), str(x.seq), 2, -1, -5, -0.05)
    # sort by score (index 2), best alignment first
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    result.write(format_alignment(*sorted_alignments[0]))
    aligned_reads.append(x)
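If you only need the single best alignment per read, max() with the same key avoids sorting the whole list (a minimal sketch based on the code above):
    # max() scans the list once instead of sorting it
    best = max(alignments, key=operator.itemgetter(2))
    result.write(format_alignment(*best))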

I know this is an old question, but for anyone still looking for the correct answer, add the one_alignment_only=True argument to your align call:
alignments = pairwise2.align.globalms(str(refseq.seq).upper(),
                                      str(x.seq),
                                      2, -1, -5, -0.05,
                                      one_alignment_only=True)
I had to do some digging around in the documentation to find it, but this gives back the optimal score.

Difflib displays wrong output

I'm trying to compare two files containing key-value pairs. Here is an example of such pairs:
ORIGINAL:
z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.304856846498851e-06, 1.304754136591639e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733914e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384797e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.0045412539447373e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664106e-06, 1.3049327764458588e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n
FILE 2:
z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.3048568464988513e-06, 1.3047541365916392e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733912e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384799e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.004541253944737e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664108e-06, 1.304932776445859e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n
When I run both files through difflib.Differ().compare() I get the following output, which is wrong:
'- z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.304856846498851e-06, 1.304754136591639e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733914e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384797e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.0045412539447373e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664106e-06, 1.3049327764458588e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n', '+ z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.3048568464988513e-06, 1.3047541365916392e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733912e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384799e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.004541253944737e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664108e-06, 1.304932776445859e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n'
If you look closely you'll see that the lines from the two files are extremely similar, with only minor differences. For some reason difflib doesn't recognize the matching characters and just reports the whole line as a difference.
Does anyone have a solution for this problem?
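For reference, this is a minimal way to reproduce the comparison (file1.txt and file2.txt are placeholder names for the two files):
import difflib

with open("file1.txt") as f1, open("file2.txt") as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

# Differ.compare() works line by line; here it reports the whole line
# as removed ('-') and added ('+') instead of marking the changed characters
for line in difflib.Differ().compare(lines1, lines2):
    print(line)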

applying the Similar function in Gensim.Doc2Vec

I am trying to get the doc2vec function to work in python 3.
I have the following code:
import gensim
from nltk.tokenize import word_tokenize

# data is a pandas DataFrame loaded elsewhere
tekstdata = [[index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])]
             for index, row in data.iterrows()]

def prep(x):
    low = x.lower()
    return word_tokenize(low)

def cleanMuch(data, clean):
    output = []
    for x, y in data:
        z = clean(y)
        output.append([str(x), z])
    return output

tekstdata = cleanMuch(tekstdata, prep)

def tagdocs(docs):
    output = []
    for x, y in docs:
        output.append(gensim.models.doc2vec.TaggedDocument(y, x))
    return output

tekstdata = tagdocs(tekstdata)
print(tekstdata[100])

vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size=100, window=4, min_count=3, iter=2)

ranks = []
second_ranks = []
for x, y in tekstdata:
    print(x)
    print(y)
    inferred_vector = vectorModel.infer_vector(y)
    sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab=None)
    rank = [docid for docid, sim in sims].index(y)
    ranks.append(rank)
Everything works, as far as I can understand, until the rank line.
The error I get is that the value is not in my list, e.g. the documents I am putting in do not have '10' in the list:
File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
rank = [docid for docid, sim in sims].index(y)
ValueError: '10' is not in list
It seems to me that it is the most_similar call that does not work.
The model trains on my data (1000 documents) and builds a vocabulary which is tagged.
The documentation I have mainly used is this:
Gensim documentation
Tutorial
I hope that someone can help. If any additional info is needed, please let me know.
best
Niels
If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and if it matches what you expect?
It's not clear from your code excerpts that tagdocs() is ever called, and thus unclear what form tekstdata is in when provided to Doc2Vec. The intent is a bit convoluted, and there's nothing to display what the data appears as in its raw, original form.
But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tag of '10', it will be seen as ['1', '0'] – and len(vectorModel.docvecs.doctags) will be just 10 (for the 10 single-digit strings).
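A minimal sketch of what that fix might look like in the asker's tagdocs() helper (the list-wrapping of the tag is the illustration here, not part of the original post):
def tagdocs(docs):
    output = []
    for x, y in docs:
        # tags must be a list of tags, e.g. ['10'], not the bare string '10'
        output.append(gensim.models.doc2vec.TaggedDocument(words=y, tags=[str(x)]))
    return output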
Separate comments on your setup:
1000 documents is pretty small for Doc2Vec, where most published results use tens-of-thousands to millions of documents
an iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets)
infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025); see the sketch below
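For example, a hedged sketch of such a call against the older gensim API used above, where words is the token list for one document (the parameter values are illustrative only):
# steps and alpha here are illustrative values, not recommendations from the original answer
inferred_vector = vectorModel.infer_vector(words, steps=100, alpha=0.025)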

read multi-line list from file

I have a file with data like:
POTENTIAL
TYPE 1
-5.19998150116627E+07 -5.09571848744513E+07 -4.99354600752570E+07 -4.89342214499422E+07 -4.79530582388520E+07
-4.69915679183017E+07 -4.60493560354389E+07 -4.51260360464197E+07 -4.42212291578282E+07 -4.33345641712756E+07
-4.24656773311163E+07 -4.16142121752159E+07 -4.07798193887125E+07 -3.99621566607090E+07 -3.91608885438409E+07
-3.83756863166569E+07
-8.99995987594328E+07 -8.81884626368405E+07 -8.64137733336537E+07 -8.46747974037847E+07 -8.29708161608188E+07
-8.13011253809965E+07 -7.96650350121689E+07 -7.80618688886128E+07 -7.64909644515842E+07 -7.49516724754953E+07
-7.34433567996002E+07 -7.19653940650832E+07 -7.05171734574350E+07 -6.90980964540154E+07 -6.77075765766936E+07
-6.63450391494693E+07
Note, as per Nsh's comment, these data are not on a single line. There are always 5 values per line and, as in this example, 4 rows, with only one value in the 4th row. So I have 16 floats spread over 4 lines, and I always know the total count (i.e. 16 in this case).
My aim is to read them as lists (please let me know if there is a better way). The row with the single entry denotes the end of a list (e.g. list[1] ends with -3.83756863166569E+07).
I tried to read it as:
if line.startswith("POTENTIAL"):
    lines = f.readline()
    if lines.startswith("TYPE "):
        lines = f.readline()
        lines = lines.split()
        lines = [float(i) for i in lines]
        pots.append(lines)
        print(pots)
which gives result:
[[-51999815.0116627, -50957184.8744513, -49935460.075257, -48934221.4499422, -47953058.238852]]
i.e. just the first line from the list, and not going any further.
My aim is to get them as different lists (possibly) as:
pots[1]=[-5.19998150116627E+07....-3.83756863166569E+07]
pots[2]=[-8.99995987594328E+07....-6.63450391494693E+07]
I have searched Google extensively (the present state itself is from another SO question), but due to my inexperience, I can't solve my problem.
Kindly help.
Use + instead of append; it will add the elements of lines to pots:
pots = pots + lines
I also didn't see the initialization at the start:
pots = []
It is needed in this case...
ITEMS_PER_LIST = 16
lists = [[]]  # list of lists with an initialized first sublist

with open('data.txt') as f:
    for line in f:
        if line.startswith(("POTENTIAL", "TYPE")):
            continue
        if len(lists[-1]) == ITEMS_PER_LIST:
            lists.append([])  # create new list
        lists[-1].extend([float(i) for i in line.split()])
Additional tweaks are required to validate headers.
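Assuming the snippet above is run on the sample data shown in the question, a quick sanity check might look like this (not part of the original answer):
print(len(lists))     # 2 sublists
print(len(lists[0]))  # 16 floats in the first sublist
print(lists[0][-1])   # -38375686.3166569, i.e. -3.83756863166569E+07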

Format a python list and search for patterns

I am getting rows from a spreadsheet with a mixture of numbers, text and dates.
I want to find elements within the list, some numbers and some text.
For example:
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(str, sg)
#sg = map(unicode, sg)  #option?
if any("-1" in s for s in sg):
    #do something if matched
I don't feel this is the correct way to do this. I am also trying to match values like -1.5 and -1.5C, and other unexpected strings like OPEN15 compared to 15.
I have also looked at:
sg.index("-1")
If it succeeds then it's a match (only good for exact matches).
Some help would be appreciated
If you want to call a function for each case, I would do it this way:
def stub1(elem):
    #do something for match of type '-1'
    return

def stub2(elem):
    #do something for match of type 'SD4'
    return

def stub3(elem):
    #do something for match of type 'OPEN15'
    return

sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(unicode, sg)
patterns = {u"-1": stub1, u"SD4": stub2, u"OPEN15": stub3}  # add more if you want
for elem in sg:
    for k, stub in patterns.iteritems():
        if k in elem:
            stub(elem)
            break
Where stub1, stub2, ... are the functions that contain the code for each case.
Each will be called (at most once per string) if the string contains a matching substring.
What do you mean by "I don't feel this is the correct way to do this"? Are you not getting the result you expect? Is it too slow?
Maybe you can organize your data by columns instead of rows and apply more specific filters. If you are looking for speed, I'd suggest using the numpy module, which has a very interesting function called select().
Scipy select example
By transforming all your rows into a numpy array, you can test several columns in one pass. This function is amazingly efficient and powerful! Basically it's used like this:
import numpy as np

a = np.array(...)  # your data as a numpy array
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]
result = np.select(conds, actions, default=0)
All values in a will be transformed as follow:
A value 100 will be added to any value of a which is smaller than 10
Any value in a which is a multiple of 3, will be divided by 3
Any value above 25 will be multiplied by 10
Any other value, not matching the previous conditions, will be set to 0
Both conds and actions are lists, and must have the same number of elements. The first condition in conds is paired with the first action in actions.
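For instance, with a small concrete array (a minimal sketch to show the pairing; the values are arbitrary):
import numpy as np

a = np.array([2, 12, 26, 7])
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]

# conditions are checked in order, so 2 and 7 hit "< 10" before anything else
print(np.select(conds, actions, default=0))  # [102.   4. 260. 107.]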
It could be used to determine the index in a vector for a particular value (even though this should be done using the nonzero() numpy function).
a = np.array(....)
conds = [a <= target, a > target]
actions = [1, 0]
index = np.select(conds, actions).sum()
This is probably a stupid way of getting an index, but it demonstrates how we can use select()... and it works :-)
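Going back to the original concern about values like -1.5, -1.5C and OPEN15: if exact tokens need to be told apart from longer strings that merely contain them, a regular-expression check is one option (a sketch using the standard re module, not part of the answers above):
import re

sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(unicode, sg)

# matches '-1', '-1.5', '-1.5C' but not '-10' or 'OPEN15'
pattern = re.compile(r'^-1(\.\d+)?[A-Za-z]?$')
matches = [s for s in sg if pattern.match(s)]
print(matches)  # [u'-1', u'-1']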

Append Rows of Different Lengths to the Same Variable

I am trying to append a lengthy list of rows to the same variable. It works great for the first thousand or so iterations in the loop (all of which have the same lengths), but then, near the end of the file, the rows get a bit shorter, and while I still want to append them, I am not sure how to handle it.
The script gives me an out of range error, as expected.
Here is what the part of code in question looks like:
ii = 0
NNCat = []
NNCatelogue = []
while ii <= len(lines):
    NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii],
             nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
    NNCatelogue.append(NNCat)
    ii = ii + 1
print NNCatelogue, ii
Any help on this would be greatly appreciated!
I'll answer the question you didn't ask first ;) : how can this code be more pythonic?
Instead of
ii = 0
NNCat = []
NNCatelogue = []
while ii <= len(lines):
    NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii],
             nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
    NNCatelogue.append(NNCat)
    ii = ii + 1
you should do
NNCat = []
NNCatelogue = []
for ii, line in enumerate(lines):
    NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii],
             nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
    NNCatelogue.append(NNCat)
During each pass ii will be incremented by one for you and line will be the current line.
As for your short lines, you have two choices
Use a special value (such as None) to fill in when you don't have a real value
check the length of nn1, nn2, ..., nn11 to see if they are large enough
The second solution will be much more verbose, hard to maintain, and confusing. I strongly recommend using None (or another special value you create yourself) as a placeholder when there is no data.
def gvop(vals, indx):  # get value or padding
    return vals[indx] if indx < len(vals) else None

NNCatelogue = [(gvop(ev_id, ii), gvop(nn1, ii), gvop(nn2, ii), gvop(nn3, ii), gvop(nn4, ii),
                gvop(nn5, ii), gvop(nn6, ii), gvop(nn7, ii), gvop(nn8, ii), gvop(nn9, ii),
                gvop(nn10, ii), gvop(nn11, ii)) for ii in xrange(0, len(lines))]
By defining this other function to return either the correct value or padding, you can ensure rows are the same length. You can change the padding to anything, if None is not what you want.
Then the list comp creates a list of tuples as before, except containing padding in cases where some of the lines in the input are shorter.
from itertools import izip_longest
NNCatelogue = list(izip_longest(ev_id, nn1, nn2, ... nn11, fillvalue=None))
See here for the documentation of izip_longest. Do yourself a favour and skip the list() around the iterator if you don't need it. In many cases you can use the iterator just as well as the list, and you save a lot of memory, especially with the long lists that you're grouping together here.
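A tiny illustration of the padding behaviour (a sketch with made-up lists, not the asker's data):
from itertools import izip_longest  # zip_longest in Python 3

ev_id = [1, 2, 3]
nn1 = ['a', 'b']  # shorter list

print(list(izip_longest(ev_id, nn1, fillvalue=None)))
# [(1, 'a'), (2, 'b'), (3, None)]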
