I'm writing a Python program that computes a numerical coding of the mutated residues and their positions in a set of strings. The strings are protein sequences, stored in a FASTA-format file with the sequences separated by commas. Sequence lengths may differ between proteins. I am trying to find the positions and residues that are mutated.
I used the following code to get them:
a = 'AGFESPKLH'
b = 'KGFEHMKLH'
# report every position where the two sequences differ
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
But I want to read the sequences from an input file. The following figure illustrates my project: the first box represents the alignment of the input sequences, and the last box represents the output file.
How can I do this in Python?
Please help me.
Thank you, everyone, for your time.
example:
input file
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
positions 1 2 3 4 5 6 1 2 3 4 5 6
protein sequence1 M T A Q D D T A D
protein sequence2 M T A Q D D T A D
protein sequence3 M T S Q E D T S E
protein sequence4 M T A Q D D T A D
protein sequence5 M K A Q H D K A H
PROTEIN SEQUENCE ALIGNMENT DISCARD NON-VARIABLE REGION
positions 2 2 3 3 5 5 5
protein sequence1 T A D
protein sequence2 T A D
protein sequence3 T S E
protein sequence4 T A D
protein sequence5 K A H
EACH MUTATED RESIDUE IS SPLIT INTO A SEPARATE COLUMN
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
If you are going to work with tabular data, consider pandas:
import pandas as pd
data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'
# one row per sequence, one column per position
df = pd.DataFrame([list(row) for row in data.split(',')])
# one 0/1 column per (position, residue) pair
print(pd.DataFrame({str(col) + val: (df[col] == val).astype(int)
                    for col in df.columns for val in sorted(set(df[col]))}))
output:
0M 1K 1T 2A 2S 3Q 4D 4E 4H 5D
0 1 0 1 1 0 1 1 0 0 1
1 1 0 1 1 0 1 1 0 0 1
2 1 0 1 0 1 1 0 1 0 1
3 1 0 1 1 0 1 1 0 0 1
4 1 1 0 1 0 1 0 0 1 1
If you want to drop the columns with all ones:
# df here is the 0/1 indicator frame printed above, not the raw sequences
print(df.loc[:, ~df.all()])
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Something like this?
ls = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
# each sequence becomes a set of (position, residue) pairs, positions starting at 1
pos = [set(enumerate(x, 1)) for x in ls]
alle = sorted(set().union(*pos))
print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))
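The snippet above codes every position, including the ones that never change. A minimal pure-Python sketch that also discards the non-variable positions, as the question's figure does, could look like this. The function name `code_variable_positions` is made up for illustration, and the sketch assumes the sequences are already aligned and of equal length (`zip` silently truncates to the shortest one).

```python
def code_variable_positions(sequences):
    """Return (header, rows) for the positions where the sequences differ.

    Hypothetical helper: assumes aligned sequences of equal length.
    """
    keys = []
    # zip(*sequences) walks the alignment column by column; positions are 1-based
    for pos, column in enumerate(zip(*sequences), 1):
        residues = sorted(set(column))
        if len(residues) > 1:  # keep only positions with more than one residue
            keys.extend((pos, residue) for residue in residues)
    header = [f"{pos}{residue}" for pos, residue in keys]
    rows = [[1 if seq[pos - 1] == residue else 0 for pos, residue in keys]
            for seq in sequences]
    return header, rows


seqs = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
header, rows = code_variable_positions(seqs)
print('\t'.join(header))
for row in rows:
    print('\t'.join(map(str, row)))
```

Residues within a position come out in alphabetical order (2K before 2T), which differs from the column order drawn in the question but carries the same information.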
protein_sequence = "MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN"
# Parse the file
proteins = protein_sequence.split(",")
# For each protein sequence, remove duplicate residues (this discards position information)
proteins = ["".join(set(x)) for x in proteins]
# Create result
result = []
key_set = ['T', 'K', 'A', 'S', 'D', 'E', 'K', 'R', 'D', 'N', 'E', 'Y', 'M', 'L', 'P', 'N', 'Q']
for protein in proteins:
    # start with every residue in key_set marked as absent
    local_dict = dict(zip(key_set, [0] * len(key_set)))
    # Split the protein into amino acids
    components = list(protein)
    for amino_acid in components:
        local_dict[amino_acid] = 1
    result.append((protein, local_dict))
You can use the pandas function get_dummies to do most of the hard work:
In [11]: s # a pandas Series (DataFrame's column)
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
To put your data into a DataFrame you could use:
df = pd.DataFrame([list(s) for s in 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')])
In [20]: df
Out[20]:
0 1 2 3 4 5
0 M T A Q D D
1 M T A Q D D
2 M T S Q E D
3 M T A Q D D
4 M K A Q H D
And to find those columns which have differing values:
In [21]: (df != df.iloc[0]).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting this all together:
In [31]: I = df.columns[(df != df.iloc[0]).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=str(i), prefix_sep='', dtype=int) for i in I]
In [33]: pd.concat(J, axis=1)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
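Since the question asks for file input and 1-based positions, here is a hedged sketch that wraps the steps above in one function. The name `encode_variable_positions`, the `sequenceN` row labels, and the use of `io.StringIO` as a stand-in for an open file are all assumptions made for illustration; the sketch also assumes aligned sequences of equal length.

```python
import io

import pandas as pd


def encode_variable_positions(handle):
    """Read comma-separated sequences from a file-like object and return
    a 0/1 table with one column per (position, residue) at variable positions.

    Hypothetical helper; assumes aligned sequences of equal length.
    """
    text = handle.read()
    seqs = [s.strip() for s in text.split(',') if s.strip()]
    df = pd.DataFrame([list(s) for s in seqs])
    # label columns with 1-based positions and rows with sequence names
    df.columns = [str(i + 1) for i in range(df.shape[1])]
    df.index = ['sequence%d' % (i + 1) for i in range(len(df))]
    # keep only the positions where more than one residue occurs
    variable = df.loc[:, df.nunique() > 1]
    return pd.get_dummies(variable, prefix_sep='', dtype=int)


# io.StringIO stands in for open('your_file.txt') in this sketch
coded = encode_variable_positions(io.StringIO('MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD\n'))
print(coded)
```

The result can then be written out with something like `coded.to_csv(path, sep='\t')`, where `path` is whatever output file you choose.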
Suppose I have a dataframe that looks like this:
A B C D E F G H
1 0 1 0 1 1 1 1
0 1 1 1 1 0 0 0
1 0 1 0 0 0 0 0
1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1
I want to choose the maximum number of data points, given the constraint that each selected column has 1s wherever the other selected columns have 1s. We can also prune rows to get this result, but the objective is to keep as many data points as possible (where a data point is defined as one item/cell in the dataframe).
The expected output is unclear, but you can use aggregation to identify the similar columns and count the total number of 1s:
out = (df.T
         .assign(count=df.sum())
         .reset_index()
         .groupby(list(df.index))
         .agg({'index': list, 'count': 'sum'})
      )
Output:
index count
0 1 2 3 4
0 1 0 1 1 [B, D] 6
1 0 0 1 1 [F, G, H] 9
1 0 1 1 1 [A] 4
1 1 0 1 1 [E] 4
1 1 1 0 1 [C] 4
You can then get the columns giving the max count:
out.loc[out['count'].idxmax(), 'index']
Output: ['F', 'G', 'H']
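To make the grouping idea easier to follow, here is the same logic sketched with a plain dictionary instead of `groupby`. The frame is rebuilt from the question's table; the variable names (`groups`, `counts`, `best_pattern`) are made up for this sketch, and it only reproduces the "identical columns" interpretation used above, not a general solution to the optimisation problem.

```python
import pandas as pd

# the example frame from the question
df = pd.DataFrame(
    [[1, 0, 1, 0, 1, 1, 1, 1],
     [0, 1, 1, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0, 0, 0],
     [1, 1, 0, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1, 1]],
    columns=list('ABCDEFGH'),
)

# group column names by their exact 0/1 pattern
groups = {}
for col in df.columns:
    groups.setdefault(tuple(df[col]), []).append(col)

# total number of 1s contributed by each group of identical columns
counts = {pattern: sum(pattern) * len(cols) for pattern, cols in groups.items()}

best_pattern = max(counts, key=counts.get)
print(groups[best_pattern], counts[best_pattern])
```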
I am new to pandas. I am trying to turn the items of a list column into columns of the dataframe. I have struggled for hours but could not do it.
MWE
import numpy as np
import pandas as pd
df = pd.DataFrame({
'X': [10,20,30,40,50],
'Y': [list('abd'), list(), list('ab'),list('abefc'),list('e')]
})
print(df)
X Y
0 10 [a, b, d]
1 20 []
2 30 [a, b]
3 40 [a, b, e, f, c]
4 50 [e]
How can I get a result like this?
X a b c d e
0 10 1 1 0 1 0
1 20 0 0 0 0 0
2 30 1 1 0 0 0
3 40 1 1 1 0 1
4 50 0 0 0 0 1
MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df[mlb.classes_] = mlb.fit_transform(df['Y'])
Pandas alternative
df.join(df['Y'].explode().str.get_dummies().groupby(level=0).max())
X Y a b c d e f
0 10 [a, b, d] 1 1 0 1 0 0
1 20 [] 0 0 0 0 0 0
2 30 [a, b] 1 1 0 0 0 0
3 40 [a, b, e, f, c] 1 1 1 0 1 1
4 50 [e] 0 0 0 0 1 0
You can try pandas.Series.str.get_dummies
out = df[['X']].join(df['Y'].apply(','.join).str.get_dummies(sep=','))
print(out)
X a b c d e f
0 10 1 1 0 1 0 0
1 20 0 0 0 0 0 0
2 30 1 1 0 0 0 0
3 40 1 1 1 0 1 1
4 50 0 0 0 0 1 0
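The outputs above include an `f` column, while the result requested in the question only lists `a` to `e`. If a fixed set of columns is wanted, one option is to `reindex` the dummy columns. This is a sketch; the `wanted` list is an assumption taken from the question's expected output.

```python
import pandas as pd

df = pd.DataFrame({
    'X': [10, 20, 30, 40, 50],
    'Y': [list('abd'), list(), list('ab'), list('abefc'), list('e')],
})

wanted = list('abcde')  # assumption: the columns shown in the question's expected output

# one 0/1 column per distinct item in the lists
dummies = df['Y'].apply(','.join).str.get_dummies(sep=',')

# keep only the wanted columns; any wanted column that never occurs is filled with 0
out = df[['X']].join(dummies.reindex(columns=wanted, fill_value=0))
print(out)
```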
My straightforward solution:
For each candidate column, check whether it appears in the row's Y list; write 1 if it does and 0 otherwise:
for col in ['a', 'b', 'c', 'd', 'e']:
    df[col] = pd.Series([1 if col in df["Y"][x] else 0 for x in range(len(df.index))])
df = df.drop('Y', axis=1)
print(df)
Edit: Okay, the groupby is cleaner
Suppose I have the following dataframe:
A B C D E F
1 1 1 0 0 0
0 0 0 0 0 0
1 1 0.9 1 0 0
0 0 0 0 -1.95 0
0 0 0 0 2.75 0
1 1 1 1 1 1
I want to select the rows whose values in columns C, D, E and F consist only of zeros and ones, with both a 0 and a 1 present. For this example, the expected output is
A B C D E F
1 1 1 0 0 0
How can I do this with considering a range of columns in pandas?
Thanks in advance.
Let's try boolean indexing with loc to filter the rows:
c = ['C', 'D', 'E', 'F']
df.loc[df[c].isin([0, 1]).all(axis=1) & df[c].eq(0).any(axis=1) & df[c].eq(1).any(axis=1)]
Result:
A B C D E F
0 1 1 1.0 0 0.0 0
Try apply and loc, restricted to the columns of interest:
cols = ['C', 'D', 'E', 'F']
print(df.loc[df[cols].apply(lambda x: sorted(x.drop_duplicates().tolist()) == [0, 1], axis=1)])
Output:
A B C D E F
0 1 1 1.0 0 0.0 0
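Because the question asks about a range of columns, the column list can also be replaced by a label slice. The following is a self-contained sketch with the question's data; the names `sub`, `mask` and `selected` are made up for illustration, and note that a label slice such as `'C':'F'` includes both end points.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 0, 1, 0, 0, 1],
    'B': [1, 0, 1, 0, 0, 1],
    'C': [1, 0, 0.9, 0, 0, 1],
    'D': [0, 0, 1, 0, 0, 1],
    'E': [0, 0, 0, -1.95, 2.75, 1],
    'F': [0, 0, 0, 0, 0, 1],
})

# label-based slice: every column from C through F, inclusive
sub = df.loc[:, 'C':'F']

mask = (
    sub.isin([0, 1]).all(axis=1)  # nothing other than 0 or 1 in the row
    & sub.eq(0).any(axis=1)       # at least one 0
    & sub.eq(1).any(axis=1)       # at least one 1
)
selected = df[mask]
print(selected)
```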
Hi everybody, I need some help with Python.
I'm working with an Excel file with several rows. Some of these rows have a zero value in every column, so I need to delete those rows.
In
id a b c d
a 0 1 5 0
b 0 0 0 0
c 0 0 0 0
d 0 0 0 1
e 1 0 0 1
Out
id a b c d
a 0 1 5 0
d 0 0 0 1
e 1 0 0 1
I first thought of showing only the rows that do not contain zeros, but it does not work: it removes every row that contains a zero anywhere, not just the all-zero rows.
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx,'Sheet1')
df_zero = df[(df.OTC != 0) & (df.TM != 0) & (df.Lease != 0) & (df.Maint != 0) & (df.Support != 0) & (df.Other != 0)]
Then I thought of just showing the rows that are all zeros:
In
id a b c d
a 0 1 5 0
b 0 0 0 0
c 0 0 0 0
d 0 0 0 1
e 1 0 0 1
Out
id a b c d
b 0 0 0 0
c 0 0 0 0
So I made a small change and now have something like this:
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx,'Sheet1')
df_zero = df[(df.OTC == 0) & (df.TM == 0) & (df.Lease == 0) & (df.Maint == 0) & (df.Support == 0) & (df.Other == 0)]
This way I get only the rows with all zeros. I need a way to remove these 2 rows from the original input and get the output without them. Thanks, and sorry for the bad English; I'm working on that too.
Given your input, you can group by whether all the columns are zero or not and then access each group, e.g.:
groups = df.groupby((df.drop('id', axis= 1) == 0).all(axis=1))
all_zero = groups.get_group(True)
non_all_zero = groups.get_group(False)
For this dataframe:
df
Out:
id a b c d e
0 a 2 0 2 0 1
1 b 1 0 1 1 1
2 c 1 0 0 0 1
3 d 2 0 2 0 2
4 e 0 0 0 0 2
5 f 0 0 0 0 0
6 g 0 2 1 0 2
7 h 0 0 0 0 0
8 i 1 2 2 0 2
9 j 2 2 1 2 1
Temporarily set the index:
df = df.set_index('id')
Drop rows containing all zeros and reset the index:
df = df[~(df==0).all(axis=1)].reset_index()
df
Out:
id a b c d e
0 a 2 0 2 0 1
1 b 1 0 1 1 1
2 c 1 0 0 0 1
3 d 2 0 2 0 2
4 e 0 0 0 0 2
5 g 0 2 1 0 2
6 i 1 2 2 0 2
7 j 2 2 1 2 1
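For reference, the same filter can be written without touching the index by building a boolean mask from the value columns only. This sketch uses the small table from the question in place of the Excel sheet; `values`, `keep`, `non_zero` and `all_zero` are names chosen for illustration.

```python
import pandas as pd

# the small example from the question, standing in for the Excel sheet
df = pd.DataFrame({
    'id': list('abcde'),
    'a': [0, 0, 0, 0, 1],
    'b': [1, 0, 0, 0, 0],
    'c': [5, 0, 0, 0, 0],
    'd': [0, 0, 0, 1, 1],
})

values = df.drop(columns='id')     # compare only the numeric columns
keep = (values != 0).any(axis=1)   # True where at least one value is non-zero

non_zero = df[keep]    # rows to keep
all_zero = df[~keep]   # rows made up entirely of zeros
print(non_zero)
```

With a real sheet, `values` would instead select whichever columns hold the numbers, for example `df[['OTC', 'TM', 'Lease', 'Maint', 'Support', 'Other']]` with the column names from the question.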
I have a DataFrame of authors and their papers:
author paper
0 A z
1 B z
2 C z
3 D y
4 E y
5 C y
6 F x
7 G x
8 G w
9 B w
I want to get a matrix of how many papers each pair of authors has together.
A B C D E F G
A
B 1
C 1 1
D 1 0 1
E 0 0 1 1
F 0 0 0 0 0
G 0 1 0 0 0 1
Is there a way to transform the DataFrame using pandas to get this result? Or is there a more efficient way (for example with numpy) to do this so that it is scalable?
get_dummies, which I first reached for, isn't as convenient here as hoped; it needed an extra groupby. Instead, it's simpler to add a dummy column or use a custom aggfunc. For example, suppose we start from a df like this (note that I've added an extra paper a so that at least one pair has written more than one paper together):
>>> df
author paper
0 A z
1 B z
2 C z
[...]
10 A a
11 B a
We can add a dummy tick column, pivot, and then use the "it's simply a dot product" observation from this question:
>>> df["dummy"] = 1
>>> dm = df.pivot(index="author", columns="paper", values="dummy").fillna(0)
>>> dout = dm.dot(dm.T)
>>> dout
author A B C D E F G
author
A 2 2 1 0 0 0 0
B 2 3 1 0 0 0 1
C 1 1 2 1 1 0 0
D 0 0 1 1 1 0 0
E 0 0 1 1 1 0 0
F 0 0 0 0 0 1 1
G 0 1 0 0 0 1 2
where the diagonal counts how many papers an author has written. If you really want to obliterate the diagonal and above, we can do that too:
>>> dout = pd.DataFrame(np.tril(dout.to_numpy(), k=-1), index=dout.index, columns=dout.columns)
>>> dout
author A B C D E F G
author
A 0 0 0 0 0 0 0
B 2 0 0 0 0 0 0
C 1 1 0 0 0 0 0
D 0 0 1 0 0 0 0
E 0 0 1 1 0 0 0
F 0 0 0 0 0 0 0
G 0 1 0 0 0 1 0
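As a variation on the dummy-column pivot, `pd.crosstab` builds the author-by-paper count table directly, so no helper column is needed. This is a sketch on the extended example (including the extra paper `a`); `incidence` and `co` are names chosen here.

```python
import pandas as pd

# the example data, including the extra paper 'a' written by A and B
df = pd.DataFrame({
    'author': list('ABCDECFGGBAB'),
    'paper': list('zzzyyyxxwwaa'),
})

# author-by-paper table: 1 if the author wrote the paper, else 0
incidence = pd.crosstab(df['author'], df['paper'])

# the dot product with its transpose counts shared papers for every author pair;
# the diagonal holds each author's total number of papers
co = incidence.dot(incidence.T)
print(co)
```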