Related
In my LIST(not dictionary) I have these strings:
"K:60",
"M:37",
"M_4:47",
"M_5:89",
"M_6:91",
"N:15",
"O:24",
"P:50",
"Q:50",
"Q_7:89"
in output I need to have
"K:60",
"M_6:91",
"N:15",
"O:24",
"P:50",
"Q_7:89"
What is the possible decision?
Or even maybe, how to take tag with the maximum among strings with the same tag.
Use re.split and list comprehension as shown below. Use the fact that when the dictionary dct is created, only the last value is kept for each repeated key.
import re
lst = [
"K:60",
"M:37",
"M_4:47",
"M_5:89",
"M_6:91",
"N:15",
"O:24",
"P:50",
"Q:50",
"Q_7:89"
]
dct = dict([ (re.split(r'[:_]', s)[0], s) for s in lst])
lst_uniq = list(dct.values())
print(lst_uniq)
# ['K:60', 'M_6:91', 'N:15', 'O:24', 'P:50', 'Q_7:89']
Probably far from the cleanest but here is a method quite easy to understand.
l = ["K:60", "M:37", "M_4:47", "M_5:89", "M_6:91", "N:15", "O:24", "P:50", "Q:50", "Q_7:89"]
reponse = []
val = []
complete_val = []
for x in l:
if x[0] not in reponse:
reponse.append(x[0])
complete_val.append(x.split(':')[0])
val.append(int(x.split(':')[1]))
elif int(x.split(':')[1]) > val[reponse.index(x[0])]:
val[reponse.index(x[0])] = int(x.split(':')[1])
for x in range(len(complete_val)):
print(str(complete_val[x]) + ":" + str(val[x]))
K:60
M:91
N:15
O:24
P:50
Q:89
I do not see any straight-forward technique. Other than iterating on entire thing and computing yourself, I do not see if any built-in can be used. I have written this where you do not require your values to be sorted in your input.
But I like the answer posted by Timur Shtatland, you can make us of that if your values are already sorted in input.
intermediate = {}
for item in a:
key, val = item.split(':')
key = key.split('_')[0]
val = int(val)
if intermediate.get(key, (float('-inf'), None))[0] < val:
intermediate[key] = (val, item)
ans = [x[1] for x in intermediate.values()]
print(ans)
which gives:
['K:60', 'M_6:91', 'N:15', 'O:24', 'P:50', 'Q_7:89']
I am quite new to python so still getting to grips with the language.
I have the following function which takes a string and apply it to an algorithm which tells us if it aligns to models 1, 2, 3, 4, or 5.
Currently this piece of code:
def apply_text(text):
test_str = [text]
test_new = tfidf_m.transform(test_str)
prediction = 0
for m in range(0,5):
percentage = '{P:.1%}'.format(M=cat[m], P=lr_m[m].predict_proba(test_new)[0][1])
print(percentage)
And running the following function: apply_text('Terrible idea.')
Gives the following output:
71.4%
33.1%
2.9%
1.6%
4.9%
With Model 1 = 71.4%, Model 2 = 33.1%, ... Model 5 = 4.9%.
I want to only output the Model number where there is the highest percentage. So in the above example, the answer would be 1 as this has 71.4%.
As the output is a string type I am finding it difficult to find ways of converting this to an int and then comparing each value (probably in a loop of some sort) to obtain the maximum value
I think you want to save the percentages along with the model number, sort it and then return the highest.
This can be done by something like this:
def apply_text(text):
test_str = [text]
test_new = tfidf_m.transform(test_str)
prediction = 0
percentage_list = []
for m in range(0,5):
percentage = '{P:.1}'.format(M=cat[m], P=lr_m[m].predict_proba(test_new)[0][1])
percentage_list.append([m+1, float(percentage)])
percentage_list.sort(reverse=True, key=lambda a: a[1])
return percentage_list[0][0]
Things to note:
Sorting in reverse order as default is ascending. You could skip reversing and access the last element of precentage_list by accessing -1 element
The key function is used as we need to sort using the percentage
Try putting values in a list then you can utilize list methods:
percentage = []
for m in range(0, 5):
percentage.append('{P:.1%}'.format(M=cat[m], P=lr_m[m].predict_proba(test_new)[0][1]))
print(*percentage, sep='\n')
print('Max on model', percentage.index(max(percentage)))
Or using a dictionary:
percentage = {}
for m in range(0, 5):
percentage['Model ' + str(m)] = '{P:.1%}'.format(M=cat[m], P=lr_m[m].predict_proba(test_new)[0][1])
print(*percentage, sep='\n')
print('Max on', max(percentage.keys(), key=(lambda key: percentage[key])))
I have a dictionary which looks like this:
cq={'A1_B2M_01':2.04, 'A2_B2M_01':2.58, 'A3_B2M_01':2.80, 'B1_B2M_02':5.00,
'B2_B2M_02':4.30, 'B2_B2M_02':2.40 etc.}
I need to calculate mean of triplets, where the keys[2:] agree. So, I would ideally like to get another dictionary which will be:
new={'_B2M_01': 2.47, '_B2M_02': 3.9}
The data is/should be in triplets so in theory I could just get the means of the consecutive values, but first of all, I have it in a dictionary so the keys/values will likely get reordered, besides I'd rather stick to the names, as a quality check for the triplets assigned to names (I will later add a bit showing error message when there will be more than three per group).
I've tried creating a dictionary where the keys would be _B2M_01 and _B2M_02 and then loop through the original dictionary to first append all the values that are assigned to these groups of keys so I could later calculate an average, but I am getting errors even in the first step and anyway, I am not sure if this is the most effective way to do this...
cq={'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
trips=set([x[2:] for x in cq.keys()])
new={}
for each in trips:
for k,v in cq.iteritems():
if k[2:]==each:
new[each].append(v)
Traceback (most recent call last):
File "<pyshell#28>", line 4, in <module>
new[each].append(v)
KeyError: '_B2M_01'
I would be very grateful for any suggestions. It seems like a fairly easy operation but I got stuck.
An alternative result which would be even better would be to get a dictionary which contains all the names used as in cq, but with values being the means of the group. So the end result would be:
final={'A1_B2M_01':2.47, 'A2_B2M_01':2.47, 'A3_B2M_01':2.47, 'B1_B2M_02':3.9,
'B2_B2M_02':3.9, 'B2_B2M_02':3.9}
Something like this should work. You can probably make it a little more elegant.
cq = {'A1_B2M_01':2.04, 'A2_B2M_01':2.58, 'A3_B2M_01':2.80, 'B1_B2M_02':5.00, 'B2_B2M_02':4.30, 'B2_B2M_02':2.40 }
sum = {}
count = {}
mean = {}
for k in cq:
if k[2:] in sum:
sum[k[2:]] += cq[k]
count[k[2:]] += 1
else:
sum[k[2:]] = cq[k]
count[k[2:]] = 1
for k in sum:
mean[k] = sum[k] / count[k]
cq={'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
sums = dict()
for k, v in cq.iteritems():
_, p2 = k.split('_', 1)
if p2 not in sums:
sums[p2] = [0, 0]
sums[p2][0] += v
sums[p2][1] += 1
res = {}
for k, v in sums.iteritems():
res[k] = v[0]/float(v[1])
print res
also could be done with one iteration
Grouping:
SEPARATOR = '_'
cq={'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
groups = {}
for key in cq:
group_key = SEPARATOR.join(key.split(SEPARATOR)[1:])
if group_key in groups:
groups[group_key].append(cq[key])
else:
groups[group_key] = [cq[key]]
Generate means:
def means(groups):
for group, group_vals in groups.iteritems():
yield (group, float(sum(group_vals)) / len(group_vals),)
print list(means(groups))
I have to write a script to translate this sequence:
dict = {"TTT":"F|Phe","TTC":"F|Phe","TTA":"L|Leu","TTG":"L|Leu","TCT":"S|Ser","TCC":"S|Ser",
"TCA":"S|Ser","TCG":"S|Ser", "TAT":"Y|Tyr","TAC":"Y|Tyr","TAA":"*|Stp","TAG":"*|Stp",
"TGT":"C|Cys","TGC":"C|Cys","TGA":"*|Stp","TGG":"W|Trp", "CTT":"L|Leu","CTC":"L|Leu",
"CTA":"L|Leu","CTG":"L|Leu","CCT":"P|Pro","CCC":"P|Pro","CCA":"P|Pro","CCG":"P|Pro",
"CAT":"H|His","CAC":"H|His","CAA":"Q|Gln","CAG":"Q|Gln","CGT":"R|Arg","CGC":"R|Arg",
"CGA":"R|Arg","CGG":"R|Arg", "ATT":"I|Ile","ATC":"I|Ile","ATA":"I|Ile","ATG":"M|Met",
"ACT":"T|Thr","ACC":"T|Thr","ACA":"T|Thr","ACG":"T|Thr", "AAT":"N|Asn","AAC":"N|Asn",
"AAA":"K|Lys","AAG":"K|Lys","AGT":"S|Ser","AGC":"S|Ser","AGA":"R|Arg","AGG":"R|Arg",
"GTT":"V|Val","GTC":"V|Val","GTA":"V|Val","GTG":"V|Val","GCT":"A|Ala","GCC":"A|Ala",
"GCA":"A|Ala","GCG":"A|Ala", "GAT":"D|Asp","GAC":"D|Asp","GAA":"E|Glu",
"GAG":"E|Glu","GGT":"G|Gly","GGC":"G|Gly","GGA":"G|Gly","GGG":"G|Gly"}
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
a=""
for y in range( 0, len ( seq)):
c=(seq[y:y+3])
#print(c)
for k, v in dict.items():
if seq[y:y+3] == k:
alle_amino = v[::3] #alle aminozuren op rijtje, a1.1 -a2.1- a.3.1-a1.2 enzo
print (v)
With this script I get the amino acids from the 3 frames under each other, but how can I sort this and get all the amino acids from frame 1 next to each other, and all the amino acids from frame 2 next to each other, and the same for frame 3?
for example , my results must be :
+3 SerIleLeuAlaStpProLysTrpGluProProTyrValAlaStpProIleTyrIleTyrTle
+2 PheAsnThrSerMetThrLysValGlyThrProLeuArgSerMetThrHisIleTyrIleTyr
+1 PheGlnTyrStpHisAspGlnSerGlyAsnProLeuThrStpHisAspProTyrIleTyrIle
TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA
I use Python 3.
i had one more question : can i make this results by some changes in mine own script ?
You can use (Note this would be ridiculously much more easier using biopython translate method):
dictio = {your dictionary here}
def translate(seq):
x = 0
aaseq = []
while True:
try:
aaseq.append(dicti[seq[x:x+3]])
x += 3
except (IndexError, KeyError):
break
return aaseq
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
for frame in range(3):
print('+%i' %(frame+1), ''.join(item.split('|')[1] for item in translate(seq[frame:])))
Note I changed the name of your dictionary with dicti (not to overwrite dict).
Some comments to help you understand:
translate takes you sequence and returns it in the form of a list in which each item corresponds to the amino acid translation of the triplet coding that position. Like:
aaseq = ["L|Leu","L|Leu","P|Pro", ....]
you could process more this data (get only one or three letters code) inside translate or return it as it is to be processed latter as I have done.
translate is called in
''.join(item.split('|')[1] for item in translate(seq[frame:]))
for each frame. For frame value being 0, 1 or 2 it sends seq[frame:] as a parameter to translate. That is, you are sending the sequences corresponding to the three different reading frames processing them in series. Then, in
''.join(item.split('|')[1]
I split the one and three-letters codes for each amino acid and take the one at index 1 (the second). Then they are joined in a single string
Not too pretty, but does what you want
dct = {"TTT":"F|Phe","TTC":"F|Phe","TTA":"L|Leu","TTG":"L|Leu","TCT":"S|Ser","TCC":"S|Ser",
"TCA":"S|Ser","TCG":"S|Ser", "TAT":"Y|Tyr","TAC":"Y|Tyr","TAA":"*|Stp","TAG":"*|Stp",
"TGT":"C|Cys","TGC":"C|Cys","TGA":"*|Stp","TGG":"W|Trp", "CTT":"L|Leu","CTC":"L|Leu",
"CTA":"L|Leu","CTG":"L|Leu","CCT":"P|Pro","CCC":"P|Pro","CCA":"P|Pro","CCG":"P|Pro",
"CAT":"H|His","CAC":"H|His","CAA":"Q|Gln","CAG":"Q|Gln","CGT":"R|Arg","CGC":"R|Arg",
"CGA":"R|Arg","CGG":"R|Arg", "ATT":"I|Ile","ATC":"I|Ile","ATA":"I|Ile","ATG":"M|Met",
"ACT":"T|Thr","ACC":"T|Thr","ACA":"T|Thr","ACG":"T|Thr", "AAT":"N|Asn","AAC":"N|Asn",
"AAA":"K|Lys","AAG":"K|Lys","AGT":"S|Ser","AGC":"S|Ser","AGA":"R|Arg","AGG":"R|Arg",
"GTT":"V|Val","GTC":"V|Val","GTA":"V|Val","GTG":"V|Val","GCT":"A|Ala","GCC":"A|Ala",
"GCA":"A|Ala","GCG":"A|Ala", "GAT":"D|Asp","GAC":"D|Asp","GAA":"E|Glu",
"GAG":"E|Glu","GGT":"G|Gly","GGC":"G|Gly","GGA":"G|Gly","GGG":"G|Gly"}
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
def get_amino_list(s):
for y in range(3):
yield [s[x:x+3] for x in range(y, len(s) - 2, 3)]
for n, amn in enumerate(get_amino_list(seq), 1):
print ("+%d " % n + "".join(dct[x][2:] for x in amn))
print(seq)
Here's my solution. I've called your "dict" variable "aminos". The function method3 returns a list of the values to the right of the "|". To merge them into a single string, just join them on "".
From looking at your code, I believe that your aminos dict contains all possible three-letter combinations. Therefore, I've removed the checks that verify this. It should run a lot faster as a result.
def overlapping_groups(seq, group_len=3):
"""Returns `N` adjacent items from an iterable in a sliding window style
"""
for i in range(len(seq)-group_len):
yield seq[i:i+group_len]
def method3(seq, aminos):
return [aminos[k][2:] for k in overlapping_groups(seq, 3)]
for i in range(3):
print("%d: %s" % (i, "".join(method3(seq[i:], aminos))))
I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data by grouping on id and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.
Is there a way I can avoid Python loops and do this in a vectorized manner?
I've been seeing some very similar questions on stack overflow the last few days. The following code is very similar to the implementation of numpy.unique and because it takes advantage of the underlying numpy machinery, it is most likely going to be faster than anything you can do in a python loop.
import numpy as np
def group_min(groups, data):
# sort with major key groups, minor key data
order = np.lexsort((data, groups))
groups = groups[order] # this is only needed if groups is unsorted
data = data[order]
# construct an index which marks borders between groups
index = np.empty(len(groups), 'bool')
index[0] = True
index[1:] = groups[1:] != groups[:-1]
return data[index]
#max is very similar
def group_max(groups, data):
order = np.lexsort((data, groups))
groups = groups[order] #this is only needed if groups is unsorted
data = data[order]
index = np.empty(len(groups), 'bool')
index[-1] = True
index[:-1] = groups[1:] != groups[:-1]
return data[index]
In pure Python:
from itertools import groupby, imap, izip
from operator import itemgetter as ig
print [max(imap(ig(1), g)) for k, g in groupby(izip(id, data), key=ig(0))]
# -> [7, 10, 1]
A variation:
print [data[id==i].max() for i, _ in groupby(id)]
# -> [7, 10, 1]
Based on #Bago's answer:
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print data[np.r_[np.diff(id), True].astype(np.bool)]
# -> [ 7 10 1]
If pandas is installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print df.groupby('id')['data'].max()
# id
# 1 7
# 2 10
# 3 1
I'm fairly new to Python and Numpy but, it seems like you can use the .at method of ufuncs rather than reduceat:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.empty(data_id[-1]+1) # might want to use max(data_id) and zeros instead
np.maximum.at(ans,data_id,data_val)
For example:
data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668,
0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868,
0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416,
0.33184318, 0.10936647])
ans = array([ 0.98969952, 0.84044947, 0.63460516, 0.92042078, 0.75738113,
0.37976055])
Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge...presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).
I should point out that the data_id doesn't actually need to be sorted.
with only numpy and without loops:
id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])
(pd_group.values == np_group.values).all() # TRUE
Ive packaged a version of my previous answer in the numpy_indexed package; its nice to have this all wrapped up and tested in a neat interface; plus it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
A slightly faster and more general answer than the already accepted one; like the answer by joeln it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only demands that the keys are sortable, rather than being ints in a specific range. The accepted answer may still be faster though, considering the max/min isn't explicitly computed. The ability to ignore nans of the accepted solution is neat; but one may also simply assign nan values a dummy key.
import numpy as np
def group(key, value, operator=np.add):
"""
group the values by key
any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.substract, and so on)
returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
"""
#upcast to numpy arrays
key = np.asarray(key)
value = np.asarray(value)
#first, sort by key
I = np.argsort(key)
key = key[I]
value = value[I]
#the slicing points of the bins to sum over
slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1))
#first entry of each bin is a unique key
unique_keys = key[slices]
#reduce over the slices specified by index
per_key_sum = operator.reduceat(value, slices)
#number of counts per key is the difference of our slice points. cap off with number of keys for last bin
key_count = np.diff(np.append(slices, len(key)))
return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, reduced_values, key_count = group(names, values)
print 'per group mean'
print reduced_values / key_count
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print 'per group min'
print reduced_values
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print 'per group max'
print reduced_values
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes the max of, in this case, another list comprehension. So moving to that inner list comprehension: enumerate(data) returns both the index and value from data, if id[val] == k picks out the data members corresponding to id k.
This iterates over the full data list for each id. With some preprocessing into sublists, it might be possible to speed it up, but it won't be a one-liner then.
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r then r[o] = x will fill r with the latest value x for each value of o, such that r[[0, 0]] = [1, 2] will return r[0] = 2. It requires that your groups are integers from 0 to number of groups - 1, as for numpy.bincount, and that there is a value for every group:
def group_min(groups, data):
n_groups = np.max(groups) + 1
result = np.empty(n_groups)
order = np.argsort(data)[::-1]
result[groups.take(order)] = data.take(order)
return result
def group_max(groups, data):
n_groups = np.max(groups) + 1
result = np.empty(n_groups)
order = np.argsort(data)
result[groups.take(order)] = data.take(order)
return result