I'm working on a Python program to compute a numerical coding of the mutated residues and positions in a set of strings (protein sequences). The sequences are stored in a FASTA-format file, with each protein sequence separated by a comma. I'm trying to find the positions and residues that are mutated.
My fasta file is as follows:
MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN
Example:
The following figure (based on another set of fasta file) will explain the algorithm behind this. In this figure first box represents alignment of input file sequences. The last box represents the output file. How can I do this with my fasta file in Python?
example input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
PROTEIN SEQUENCE ALIGNMENT
positions          1 2 3 4 5 6
protein sequence1  M T A Q D D
protein sequence2  M T A Q D D
protein sequence3  M T S Q E D
protein sequence4  M T A Q D D
protein sequence5  M K A Q H D
DISCARD NON-VARIABLE REGION
positions          2 3 5
protein sequence1  T A D
protein sequence2  T A D
protein sequence3  T S E
protein sequence4  T A D
protein sequence5  K A H
MUTATED RESIDUE IS SPLIT INTO SEPARATE COLUMNS
positions          2 2 3 3 5 5 5
protein sequence1  T A D
protein sequence2  T A D
protein sequence3  T S E
protein sequence4  T A D
protein sequence5  K A H
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
Here are two ways I have tried to do it:
ls= 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
a=set().union(*pos)
alle = sorted(set().union(*pos))
print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))
(Here I get columns for both mutated and non-mutated residues, but I want columns only for the mutated residues.)
from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val:(df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print(df.loc[:, ~df.all()])
(This produces output, but the columns are not in order, i.e. first 2K, then 2T, then 3A, and so on.)
How should I be doing this?
The function get_dummies gets you most of the way:
In [11]: s
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
And those columns which have differing values:
In [21]: (df.iloc[0] != df).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting these together:
In [31]: I = df.columns[(df.iloc[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]
In [33]: df[[]].join(J)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Note: I created the initial DataFrame as follows; depending on your situation, this may be done more efficiently:
df = pd.DataFrame([list(s) for s in 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')])
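The steps above can be put together as one small function. This is only a sketch, assuming a recent pandas version and sequences of equal length with at least one variable position; the function name encode_variable_positions is made up for illustration. It relabels the columns so that positions are 1-based, as in the desired output of the question.

```python
import pandas as pd

def encode_variable_positions(sequences):
    """One-hot encode only the positions where the sequences differ.

    Hypothetical helper: assumes equal-length sequences and at least
    one variable position (pd.concat fails on an empty list).
    """
    # One row per sequence, one column per residue position.
    df = pd.DataFrame([list(s) for s in sequences])
    # Switch from 0-based column labels to 1-based positions.
    df.columns = df.columns + 1
    # Keep only the positions that hold more than one distinct residue.
    variable = [c for c in df.columns if df[c].nunique() > 1]
    # One block of 0/1 indicator columns per variable position.
    blocks = [
        pd.get_dummies(df[c], prefix=str(c), prefix_sep="", dtype=int)
        for c in variable
    ]
    return pd.concat(blocks, axis=1)

coded = encode_variable_positions("MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD".split(","))
print(coded)
```

Within each position the residue columns come out in alphabetical order (2K before 2T), because get_dummies sorts the categories.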
Related
I have a dataset:
list1 list2
0 [1,3,4] [4,3,2]
1 [1,3,2] [0,4,6]
2 [4,5,8] NA
3 [6,3,7] [8,2,3]
Is there a way to count the common elements for each pair of rows?
Expected output: intersection_0 compares row 0 of list1 with every row of list2, intersection_1 compares row 1 of list1 with every row of list2, and so on.
Intersection_0 intersection_1 intersection_2 intersection_3
1 2 1 1
1 0 1 1
0 0 0 0
1 2 0 1
For the intersection I was trying:
df['intersection'] = [len(set(a).intersection(b)) for a, b in zip(df.list1, df.list2)]
Is there a better or faster way to achieve this? Thank you in advance.
The double loop would go like this:
import numpy as np
import pandas as pd
intersections = []
for l2 in df['list2']:
    intersection = []
    for l1 in df['list1']:
        try:
            i = len(np.intersect1d(l1, l2))
        except Exception:
            # rows that cannot be compared (e.g. NA) count as no overlap
            i = 0
        intersection.append(i)
    intersections.append(intersection)
out = pd.DataFrame(intersections)
Output:
0 1 2 3
0 2 2 1 1
1 1 0 1 1
2 0 0 0 0
3 1 2 1 1
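The same table can also be built with plain Python sets instead of np.intersect1d. This is a rough sketch rather than a benchmarked solution: the helper as_set is a made-up name, and the missing entry is represented as None here, which is an assumption about how the NA is stored.

```python
import pandas as pd

df = pd.DataFrame({
    "list1": [[1, 3, 4], [1, 3, 2], [4, 5, 8], [6, 3, 7]],
    "list2": [[4, 3, 2], [0, 4, 6], None, [8, 2, 3]],  # None stands in for NA
})

def as_set(value):
    """Hypothetical helper: turn a list into a set, anything else into an empty set."""
    return set(value) if isinstance(value, (list, tuple, set)) else set()

# Convert every cell once, so the double loop only does set intersections.
sets1 = [as_set(v) for v in df["list1"]]
sets2 = [as_set(v) for v in df["list2"]]

# Row r, column c holds the overlap between list2[r] and list1[c].
out = pd.DataFrame(
    [[len(s1 & s2) for s1 in sets1] for s2 in sets2],
    columns=[f"intersection_{i}" for i in range(len(sets1))],
)
print(out)
```

Converting each list to a set only once avoids repeating that work inside the inner loop.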
matrix = []
for index, value in enumerate(['A','C','G','T']):
    matrix.append([])
    matrix[index].append(value + ':')
    for i in range(len(lines[0])):
        total = 0
        for sequence in lines:
            if sequence[i] == value:
                total += 1
        matrix[index].append(total)
unity = ''
for i in range(len(lines[0])):
    column = []
    for row in matrix:
        column.append(row[1:][i])
    maximum = column.index(max(column))
    unity += ['A', 'C', 'G', 'T'][maximum]
print("Unity: " + unity)
for row in matrix:
    print(' '.join(map(str, row)))
OUTPUT:
Unity: GGCTACGC
A: 1 2 0 2 3 2 0 0
C: 0 1 4 2 1 3 2 4
G: 3 3 2 0 1 2 4 1
T: 3 1 1 3 2 0 1 2
With this code I get this matrix but I want to form the matrix like this:
A C G T
G: 1 0 3 3
G: 2 1 3 1
C: 0 4 2 1
T: 2 2 0 3
A: 3 1 1 2
C: 2 3 2 0
G: 0 2 4 1
C: 0 4 1 2
But I don't know how. I hope someone can help me. Thanks already for the answers.
The sequences are:
AGCTACGT
TAGCTAGC
TAGCTACG
GCTAGCGC
TGCTAGCC
GGCTACGT
GTCACGTC
You need to transpose your matrix. I've added comments in the code below to explain what has been changed to produce the table.
matrix = []
for index, value in enumerate(['A','C','G','T']):
    matrix.append([])
    # Don't put colons in column headers
    matrix[index].append(value)
    for i in range(len(lines[0])):
        total = 0
        for sequence in lines:
            if sequence[i] == value:
                total += 1
        matrix[index].append(total)
unity = ''
for i in range(len(lines[0])):
    column = []
    for row in matrix:
        column.append(row[1:][i])
    maximum = column.index(max(column))
    unity += ['A', 'C', 'G', 'T'][maximum]
# Transpose matrix
matrix = list(map(list, zip(*matrix)))
# Print header with tabs to make it look pretty
print('\t' + '\t'.join(matrix[0]))
# Print rows in matrix
for row, unit in zip(matrix[1:], unity):
    print(unit + ':\t' + '\t'.join(map(str, row)))
The following will be printed:
A C G T
G: 1 0 3 3
G: 2 1 3 1
C: 0 4 2 1
T: 2 2 0 3
A: 3 1 1 2
C: 2 3 2 0
G: 0 2 4 1
C: 0 4 1 2
I think the best way is to convert your matrix to a pandas DataFrame and then use the transpose function.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html
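A minimal sketch of the pandas route, assuming the seven sequences listed in the question are held in a list called lines. The variable names counts, table and consensus are chosen here for illustration.

```python
import pandas as pd

lines = ["AGCTACGT", "TAGCTAGC", "TAGCTACG", "GCTAGCGC",
         "TGCTAGCC", "GGCTACGT", "GTCACGTC"]
bases = ["A", "C", "G", "T"]

# Rows are bases, columns are sequence positions (same layout as the original matrix).
counts = pd.DataFrame(
    [[sum(seq[i] == base for seq in lines) for i in range(len(lines[0]))]
     for base in bases],
    index=bases,
)

# Transpose: rows become positions, columns become bases.
table = counts.T

# Most frequent base per position; ties go to the first base in A, C, G, T order.
consensus = table.idxmax(axis=1)
table.index = [base + ":" for base in consensus]
print(table)
```

Here idxmax plays the role of column.index(max(column)) in the loop version, since both return the first maximum.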
I have two dataframes net and M.
net =
i j d
0 5 3 3
1 2 0 2
2 3 2 1
3 4 5 2
4 0 1 3
5 0 3 4
M =
0 1 2 3 4 5
0 0 3 2 4 1 5
1 3 0 2 0 3 3
2 2 2 0 1 1 4
3 4 0 1 0 3 3
4 1 3 1 3 0 2
5 5 3 4 3 2 0
For each value in net['d'], I want to find the cells of M that hold the same value, choose one of those cells at random, and create a new dataframe containing the coordinates of that cell. For instance
net['d'][0] = 3
so in M I find:
M[0][1]
M[1][0]
M[1][4]
M[1][5]
...
In the end, net1 would look something like this:
net1 =
i1 j1 d1
0 1 5 3
1 5 4 2
2 2 3 1
3 1 2 2
4 1 5 3
5 3 0 4
This is what I am doing:
I1 = []
J1 = []
for i in net.index:
    tmp = net['d'][i]
    ds = np.where(M == tmp)
    size = len(ds[0])
    ind = np.random.randint(size)  # pick one random location whose value equals tmp
    h = ds[0][ind]
    w = ds[1][ind]
    I1.append(h)
    J1.append(w)
net1 = pd.DataFrame()
net1['i1'] = I1
net1['j1'] = J1
net1['d1'] = net['d']
What is the best way to avoid that loop?
You can stack the columns of M and then just sample it with replacement.
import numpy as np
import pandas as pd
net = pd.DataFrame({'i':[5,2,3,4,0,0],
                    'j':[3,0,2,5,1,3],
                    'd':[3,2,1,2,3,4]})
M = pd.DataFrame({0:[0,3,2,4,1,5],
                  1:[3,0,2,0,3,3],
                  2:[2,2,0,1,1,4],
                  3:[4,0,1,0,3,3],
                  4:[1,3,1,3,0,2],
                  5:[5,3,4,3,2,0]})
def random_net(net, M):
    # make long table and rename columns
    net1 = M.stack().reset_index()
    net1.columns = ['i1', 'j1', 'd1']
    # get size of each group for random mapping
    net1_id_length = net1.groupby('d1').size()
    # add id column to uniquely identify row in net
    net_copy = net.copy()
    # first map gets size of each group and second gets random integer
    net_copy['id'] = net_copy['d'].map(net1_id_length).map(np.random.randint)
    net1['id'] = net1.groupby('d1').cumcount()
    # make for easy lookup
    net_copy = net_copy.set_index(['d', 'id'])
    net1 = net1.set_index(['d1', 'id'])
    # choose from net1 only those from original net
    return net1.reindex(net_copy.index).reset_index('d').reset_index(drop=True).rename(columns={'d':'d1'})
random_net(net, M)
output
d1 i1 j1
0 3 5 1
1 2 0 2
2 1 3 2
3 2 1 2
4 3 3 5
5 4 0 3
Timings on 6 million rows
n = 1000000
net = pd.DataFrame({'i':[5,2,3,4,0,0] * n,
'j':[3,0,2,5,1,3] * n,
'd':[3,2,1,2,3,4] * n})
M = pd.DataFrame({0:[0,3,2,4,1,5],
1:[3,0,2,0,3,3],
2:[2,2,0,1,1,4],
3:[4,0,1,0,3,3],
4:[1,3,1,3,0,2],
5:[5,3,4,3,2,0]})
%timeit random_net(net, M)
1 loop, best of 3: 13.7 s per loop
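Another option is to loop over the distinct values of d rather than over the rows of net, and draw all the random cells for one value in a single call. This is an untimed sketch, not a claim that it is faster than the answer above. It assumes that every value in net['d'] occurs somewhere in M, and that positional row/column numbers are acceptable as coordinates (they coincide with the labels for the M shown here). The function name random_cells and the seed parameter are made up for illustration.

```python
import numpy as np
import pandas as pd

net = pd.DataFrame({'i': [5, 2, 3, 4, 0, 0],
                    'j': [3, 0, 2, 5, 1, 3],
                    'd': [3, 2, 1, 2, 3, 4]})
M = pd.DataFrame({0: [0, 3, 2, 4, 1, 5],
                  1: [3, 0, 2, 0, 3, 3],
                  2: [2, 2, 0, 1, 1, 4],
                  3: [4, 0, 1, 0, 3, 3],
                  4: [1, 3, 1, 3, 0, 2],
                  5: [5, 3, 4, 3, 2, 0]})

def random_cells(net, M, seed=None):
    """Hypothetical helper: for each d in net, pick a random cell of M equal to d."""
    rng = np.random.default_rng(seed)
    values = M.to_numpy()
    d = net['d'].to_numpy()
    i1 = np.empty(len(net), dtype=int)
    j1 = np.empty(len(net), dtype=int)
    for value in np.unique(d):
        # All (row, column) positions in M holding this value.
        cells = np.argwhere(values == value)
        # All rows of net that ask for this value.
        rows = np.flatnonzero(d == value)
        # One independent random draw (with replacement) per requesting row.
        picks = rng.integers(len(cells), size=len(rows))
        i1[rows] = cells[picks, 0]
        j1[rows] = cells[picks, 1]
    return pd.DataFrame({'i1': i1, 'j1': j1, 'd1': d})

net1 = random_cells(net, M, seed=0)
print(net1)
```

The Python-level loop runs once per distinct distance, so its length does not grow with the number of rows in net.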
I want to create a matrix like this:
33 12 23 42 11 32 43 22
33 − 1 1 1 0 0 1 1
12 1 − 1 1 0 0 1 1
23 1 1 − 1 1 1 0 0
42 1 1 1 − 1 1 0 0
11 0 0 1 1 − 1 1 1
32 0 0 1 1 1 − 1 1
43 1 1 0 0 1 1 − 1
22 1 1 0 0 1 1 1 −
I want to query by horizontal or vertical titles, so I created the matrix by:
a = np.matrix('99 33 12 23 42 11 32 43 22;33 99 1 1 1 0 0 1 1;12 1 99 1 1 0 0 1 1;23 1 1 99 1 1 1 0 0;42 1 1 1 99 1 1 0 0;11 0 0 1 1 99 1 1 1;32 0 0 1 1 1 99 1 1;43 1 1 0 0 1 1 99 1;22 1 1 0 0 1 1 1 99')
I want to look up a value by its labels, e.g. a[23][11] should give 1.
So is there a way to create a 2D dictionary, so that a[23][11] = 1?
Thanks
You're clearly asking for something outside of numpy.
A defaultdict with dict as the default_factory gives a sense of the 2D dictionary you want:
>>> from collections import defaultdict
>>> a = defaultdict(dict)
>>> a[23][11] = 1
>>> a[23]
{11: 1}
>>> a[23][11]
1
Another possibility is to use tuples as the dictionary keys:
{(33, 12): 1, (23, 12): 1, ...}
scipy.sparse has a sparse matrix format that stores its values in such a dictionary. With your values such a matrix would represent a 50x50 matrix with mostly 0 values, and just 1's at these selected coordinates.
Keep in mind that the keys of a dictionary are not sorted; an ordinary dict only preserves insertion order (guaranteed since Python 3.7).
What are you going to be doing with this data? A dictionary, whether tuple-keyed or nested, is good for one kind of usage, but bad for others. A matrix such as your sample is better for other things, like operations along rows or columns. The dictionary format largely obscures that kind of ordered layout.
Are you looking for a dictionary with pairs as keys?
d = {}
d[33, 12] = 1
d[33, 23] = 1
# etc
Note that in python d[a, b] is just syntactic sugar for d[(a, b)]
If I understand correctly, you just want to label your rows and columns. To stay within the NumPy array framework, a simple solution is to create a mapping between the labels and the array positions. I am also going to assume that it is OK to convert the labels into strings, as they can be anything (though integers would also be fine).
from scipy.linalg import circulant
l = {str(x): ind for ind, x in enumerate((33, 12, 23, 42, 11, 32, 43, 22))}
a = circulant([99, 1, 1, 1, 0, 0, 1, 1])
a[l['32'], l['23']]
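Both ideas can be tried on the matrix from the question itself. The sketch below copies the values from the np.matrix string (without the header row and column) and shows a label-to-position mapping next to a nested dictionary; the names labels, values, position and nested are chosen here for illustration.

```python
import numpy as np

labels = [33, 12, 23, 42, 11, 32, 43, 22]
# Body of the matrix from the question; 99 marks the diagonal.
values = np.array([
    [99, 1, 1, 1, 0, 0, 1, 1],
    [1, 99, 1, 1, 0, 0, 1, 1],
    [1, 1, 99, 1, 1, 1, 0, 0],
    [1, 1, 1, 99, 1, 1, 0, 0],
    [0, 0, 1, 1, 99, 1, 1, 1],
    [0, 0, 1, 1, 1, 99, 1, 1],
    [1, 1, 0, 0, 1, 1, 99, 1],
    [1, 1, 0, 0, 1, 1, 1, 99],
])

# Option 1: keep the array and translate labels to positions.
position = {label: k for k, label in enumerate(labels)}
print(values[position[23], position[11]])

# Option 2: a nested dictionary, so that nested[row_label][col_label] works.
nested = {
    row: {col: int(values[r, c]) for c, col in enumerate(labels)}
    for r, row in enumerate(labels)
}
print(nested[23][11])
```

The first option keeps row and column operations cheap; the second gives exactly the a[23][11] syntax asked for, at the cost of losing the array layout.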
What is the best way to implement the Apriori algorithm in pandas? So far I am stuck on extracting the patterns using for loops. Everything from the for loop onward does not work. Is there a vectorized way to do this in pandas?
import pandas as pd
import numpy as np
trans=pd.read_table('output.txt', header=None,index_col=0)
def apriori(trans, support=4):
    ts=pd.get_dummies(trans.unstack().dropna()).groupby(level=1).sum()
    #user input
    collen, rowlen =ts.shape
    #max length of items
    tssum=ts.sum(axis=1)
    maxlen=tssum.loc[tssum.idxmax()]
    items=list(ts.columns)
    results=[]
    #loop through items
    for c in range(1, maxlen):
        #generate patterns
        pattern=[]
        for n in len(pattern):
            #calculate support
            pattern=['supp']=pattern.sum/rowlen
            #filter by support level
            Condit=pattern['supp']> support
            pattern=pattern[Condit]
            results.append(pattern)
    return results
results =apriori(trans)
print results
When I input this with support 3
a b c d e
0
11 1 1 1 0 0
666 1 0 0 1 1
10101 0 1 1 1 0
1010 1 1 1 1 0
414147 0 1 1 0 0
10101 1 1 0 1 0
1242 0 0 0 1 1
101 1 1 1 1 0
411 0 0 1 1 1
444 1 1 1 0 0
it should output something like
Pattern support
a 6
b 7
c 7
d 7
e 3
a,b 5
a,c 4
a,d 4
Assuming I understand what you're after, maybe
from itertools import combinations
def get_support(df):
    pp = []
    for cnum in range(1, len(df.columns)+1):
        for cols in combinations(df, cnum):
            s = df[list(cols)].all(axis=1).sum()
            pp.append([",".join(cols), s])
    sdf = pd.DataFrame(pp, columns=["Pattern", "Support"])
    return sdf
would get you started:
>>> s = get_support(df)
>>> s[s.Support >= 3]
Pattern Support
0 a 6
1 b 7
2 c 7
3 d 7
4 e 3
5 a,b 5
6 a,c 4
7 a,d 4
9 b,c 6
10 b,d 4
12 c,d 4
14 d,e 3
15 a,b,c 4
16 a,b,d 3
21 b,c,d 3
[15 rows x 2 columns]
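get_support counts every combination of columns and filters afterwards. The pruning step that gives Apriori its name can be sketched as a level-wise search, where an itemset of size k is only counted if all of its subsets of size k-1 already met the threshold. The function name frequent_itemsets is made up for this sketch, and the table from the question is typed in with a default index.

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1], [0, 1, 1, 1, 0], [1, 1, 1, 1, 0],
     [0, 1, 1, 0, 0], [1, 1, 0, 1, 0], [0, 0, 0, 1, 1], [1, 1, 1, 1, 0],
     [0, 0, 1, 1, 1], [1, 1, 1, 0, 0]],
    columns=list("abcde"),
)

def frequent_itemsets(df, min_support):
    """Hypothetical helper: level-wise search with subset pruning."""
    def count(items):
        # Number of rows in which every item of the set is present.
        return int(df[list(items)].all(axis=1).sum())

    order = list(df.columns)
    frequent = [(c,) for c in order if count((c,)) >= min_support]
    rows = [(",".join(s), count(s)) for s in frequent]
    size = 1
    while frequent:
        size += 1
        previous = set(frequent)
        # Only items that still appear in some frequent set can be extended.
        items = [c for c in order if any(c in s for s in frequent)]
        frequent = []
        for candidate in combinations(items, size):
            # Prune: every (size - 1)-subset must itself be frequent.
            if all(sub in previous for sub in combinations(candidate, size - 1)):
                support = count(candidate)
                if support >= min_support:
                    frequent.append(candidate)
                    rows.append((",".join(candidate), support))
    return pd.DataFrame(rows, columns=["Pattern", "Support"])

result = frequent_itemsets(df, min_support=3)
print(result)
```

On this table the candidate a,b,c,d is never counted, because its subset a,c,d falls below the threshold.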
This version adds support, confidence, and lift calculations.
from itertools import combinations
import pandas as pd
def apriori(data, set_length=2):
    df_supports = []
    dataset_size = len(data)
    for combination_number in range(1, set_length+1):
        for cols in combinations(data.columns, combination_number):
            # A is the first item of the pattern, B is the last one
            both = data[list(cols)].all(axis=1).sum() * 1.0
            count_a = len(data[data[cols[0]]==1])
            count_b = len(data[data[cols[-1]]==1])
            supports = both / dataset_size
            confidenceAB = both / count_a
            confidenceBA = both / count_b
            liftAB = confidenceAB * dataset_size / count_b
            liftBA = confidenceBA * dataset_size / count_a
            df_supports.append([",".join(cols), supports, confidenceAB, confidenceBA, liftAB, liftBA])
    df_supports = pd.DataFrame(df_supports, columns=['Pattern', 'Support', 'ConfidenceAB', 'ConfidenceBA', 'liftAB', 'liftBA'])
    # sort_values returns a new frame, so keep the result
    df_supports = df_supports.sort_values(by='Support', ascending=False)
    return df_supports