Count and sort co-occurrence matrix - python

I have a co-occurrence matrix in pandas. How do I get the co-occurrence values of all the combinations, sorted in descending order, without looping?
(I didn't write the values on the other side of the diagonal, but they are there, and hold mirrored values)
Input:
A B C D E F
A 0 1 0 1 2 0
B 0 3 1 1 1
C 0 1 8 9
D 0 2 6
E 0 9
F 0
Output:
CF 9
EF 9
CE 8
DF 6
BC 3
AE 2
DE 2
AB 1
AD 1
BD 1
BE 1
BF 1
CD 1
AC 0
AF 0

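import numpy as np
import pandas as pd

# Row/column positions of the upper triangle, excluding the diagonal (k=1)
i, j = np.triu_indices(len(df), 1)
pd.Series(
    df.values[i, j], df.index[i] + df.index[j]
).sort_values(ascending=False)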
EF 9
CF 9
CE 8
DF 6
BC 3
DE 2
AE 2
CD 1
BF 1
BE 1
BD 1
AD 1
AB 1
AF 0
AC 0
dtype: object
Setup
txt = """\
A B C D E F
A 0 1 0 1 2 0
B 0 3 1 1 1
C 0 1 8 9
D 0 2 6
E 0 9
F 0"""
from io import StringIO
df = pd.read_fwf(StringIO(txt), index_col=0).fillna('')
df
A B C D E F
A 0 1 0 1 2 0
B 0 3 1 1 1
C 0 1 8 9
D 0 2 6
E 0 9
F 0
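For readers who want to try the upper-triangle approach without the fixed-width parsing step, here is a minimal self-contained sketch. It assumes the matrix is fully numeric and symmetric; the names `labels`, `upper`, `full` and `pairs` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

labels = list("ABCDEF")
# Upper-triangle values from the question; the lower triangle mirrors them.
upper = np.array([
    [0, 1, 0, 1, 2, 0],
    [0, 0, 3, 1, 1, 1],
    [0, 0, 0, 1, 8, 9],
    [0, 0, 0, 0, 2, 6],
    [0, 0, 0, 0, 0, 9],
    [0, 0, 0, 0, 0, 0],
])
full = pd.DataFrame(upper + upper.T, index=labels, columns=labels)

# k=1 skips the diagonal, so each unordered pair appears exactly once.
i, j = np.triu_indices(len(full), k=1)
pairs = pd.Series(full.to_numpy()[i, j], index=full.index[i] + full.index[j])
pairs = pairs.sort_values(ascending=False)
print(pairs)
```

The order of tied pairs (for example CF and EF) depends on the sort algorithm, which is why the two listings above differ in their tie order.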

You can loop through the rows and columns using combinations from itertools and collect the results in a list.
from itertools import combinations
explode_list = [[r + c, df.loc[r, c]] for r, c in combinations(df.columns, 2)]
Output
[['AB', 1],
['AC', 0],
...
]
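The list built this way is not yet sorted. A short sketch of the remaining step, using a small hypothetical 3x3 symmetric matrix (the data and the name `pairs` are made up for illustration):

```python
from itertools import combinations

import pandas as pd

# Small hypothetical symmetric matrix, used only for illustration.
df = pd.DataFrame([[0, 1, 4], [1, 0, 2], [4, 2, 0]],
                  index=list("ABC"), columns=list("ABC"))

# One (label, value) tuple per unordered pair of columns.
pairs = [(r + c, df.loc[r, c]) for r, c in combinations(df.columns, 2)]
# Sort by the co-occurrence value, largest first.
pairs.sort(key=lambda item: item[1], reverse=True)
print(pairs)
```

This still loops in Python, so it is likely to be slower than the vectorised approach on large matrices.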

Related

Drop rows of a Pandas dataframe if the value of range columns 0

Dataframe:
df = pd.DataFrame({'a':['NA','W','Q','M'], 'b':[0,0,4,2], 'c':[0,12,0,2], 'd':[22, 3, 34, 12], 'e':[0,0,3,6], 'f':[0,2,0,0], 'h':[0,1,1,0] })
df
a b c d e f h
0 NA 0 0 22 0 0 0
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
I want to drop the entire row if column b and all columns from e onward contain 0.
Basically I want to get something like this:
a b c d e f h
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
To test column b together with the columns from e to the end, select the latter with DataFrame.loc, add column b with DataFrame.assign, test for values not equal to 0 with DataFrame.ne, check whether any value in each row matches (meaning the row is not all 0) with DataFrame.any, and finally filter by boolean indexing:
df = df[df.loc[:, 'e':].assign(b = df['b']).ne(0).any(axis=1)]
print (df)
a b c d e f h
1 W 0 12 3 0 2 1
2 Q 4 0 34 3 0 1
3 M 2 2 12 6 0 0
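The same filter can also be phrased the other way round: build a mask of rows where every tested column equals 0, then negate it. A minimal sketch with the question's data; the names `cols`, `all_zero` and `result` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': ['NA', 'W', 'Q', 'M'], 'b': [0, 0, 4, 2],
                   'c': [0, 12, 0, 2], 'd': [22, 3, 34, 12],
                   'e': [0, 0, 3, 6], 'f': [0, 2, 0, 0], 'h': [0, 1, 1, 0]})

# Columns to test: b plus everything from e to the last column.
cols = ['b'] + list(df.loc[:, 'e':].columns)
# True for rows where all tested columns are 0.
all_zero = df[cols].eq(0).all(axis=1)
# Keep the rows that are NOT all zero.
result = df[~all_zero]
print(result)
```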

rename index using index and name column

I have the dataframe df
import numpy as np
import pandas as pd
b=np.array([0,1,2,2,0,1,2,2,3,4,4,4,5,6,0,1,0,0]).reshape(-1,1)
c=np.array(['a','a','a','a','b','b','b','b','b','b','b','b','b','b','c','c','d','e']).reshape(-1,1)
df = pd.DataFrame(np.hstack([b,c]),columns=['Start','File'])
df
Out[22]:
Start File
0 0 a
1 1 a
2 2 a
3 2 a
4 0 b
5 1 b
6 2 b
7 2 b
8 3 b
9 4 b
10 4 b
11 4 b
12 5 b
13 6 b
14 0 c
15 1 c
16 0 d
17 0 e
I would like to rename the index using index_File,
in order to have 0_a, 1_a, ... 17_e as indices.
You can use set_index, with or without inplace=True:
df.set_index(df.File.radd(df.index.astype(str) + '_'))
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
At the expense of a few more characters of code, we can speed this up and get rid of the unnecessary index name:
df.set_index(df.File.values.__radd__(df.index.astype(str) + '_'))
Start File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
You can directly assign to the index, first by converting the default index to str using astype and then concatenate the str as usual:
In[41]:
df.index = df.index.astype(str) + '_' + df['File']
df
Out[41]:
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
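When the index is built from a Series named File, that name carries over as the index name, as in the output above. One way to avoid it, sketched here on a shortened hypothetical frame, is to build the labels as a plain list; the name `renamed` is illustrative.

```python
import pandas as pd

# Shortened stand-in for the question's frame.
df = pd.DataFrame({'Start': [0, 1, 0], 'File': ['a', 'a', 'b']})

renamed = df.copy()
# A plain list carries no name, so the new index has no header line.
renamed.index = [f"{i}_{f}" for i, f in zip(df.index, df['File'])]
print(renamed)
```

Calling rename_axis(None) on the result of the other approaches should have the same effect.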

Transforming a 'repetitive' 2D-array into a matrix using python

I have a text file that includes information in the form of:
A 0
B 1
C 4
D 0
E 1
A 0
B 0
C 2
D 1
E 1
A 1
B 0
C 2
D 0
E 0
...
Note that the total number of ABCDE cycles (here only 3 shown) is not known without counting them.
I would like, using Python, to transform this into a matrix that has the form:
A 0 0 1 ...
B 1 0 0 ...
C 4 2 2 ...
D 0 1 0 ...
E 1 1 0 ...
I am not sure what the best way to do this kind of transformation is. Does anyone have a Python script that does this? Are there any functions in NumPy or pandas that would make this easy? Or should I do it without NumPy or pandas?
Many thanks in advance for your help!
Pandas solution:
import pandas as pd
from io import StringIO
temp=u"""
A 0
B 1
C 4
D 0
E 1
A 0
B 0
C 2
D 1
E 1
A 1
B 0
C 2
D 0
E 0"""
#after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), sep=r"\s+", header=None)
print (df)
0 1
0 A 0
1 B 1
2 C 4
3 D 0
4 E 1
5 A 0
6 B 0
7 C 2
8 D 1
9 E 1
10 A 1
11 B 0
12 C 2
13 D 0
14 E 0
# cumcount numbers each label's occurrences 0, 1, 2, ...; those become the columns
df = df.set_index([0, df.groupby(0).cumcount()])[1].unstack()
print (df)
0 1 2
0
A 0 0 1
B 1 0 0
C 4 2 2
D 0 1 0
E 1 1 0
option 1
add an index level and unstack
s.index = [s.index, np.arange(len(s)) // 5]
s.unstack()
option 2
reconstruct
# each run of 5 values is one A..E cycle, so reshape to (cycles, 5) and transpose
pd.DataFrame(s.values.reshape(-1, 5).T, s.index[:5])
setup
I assumed a series with an index as the first column.
import numpy as np
import pandas as pd
from io import StringIO
txt = """A 0
B 1
C 4
D 0
E 1
A 0
B 0
C 2
D 1
E 1
A 1
B 0
C 2
D 0
E 0"""
s = pd.read_csv(StringIO(txt), sep=r"\s+", header=None, index_col=0).squeeze("columns")
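The reshape idea can be checked on its own. The sketch below assumes the labels always repeat in the same order and that the file contains only complete cycles; `labels`, `values`, `n_labels` and `result` are illustrative names, not from the answers above.

```python
import numpy as np
import pandas as pd

labels = ['A', 'B', 'C', 'D', 'E'] * 3
values = [0, 1, 4, 0, 1,
          0, 0, 2, 1, 1,
          1, 0, 2, 0, 0]
n_labels = 5

# -1 lets NumPy infer the number of cycles; each row is then one A..E cycle.
cycles = np.array(values).reshape(-1, n_labels)
# Transpose so that each label becomes a row and each cycle a column.
result = pd.DataFrame(cycles.T, index=labels[:n_labels])
print(result)
```

Because the cycle count is inferred, it does not need to be known in advance. An incomplete final cycle would make the reshape raise a ValueError.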

Pivoting a table with hierarchical index

This is a simple problem but for some reason I am not able to find an easy solution.
I have a hierarchically indexed Series, for example:
import itertools
import pandas as pd
from numpy.random import randint

s = pd.Series(data=randint(0, 3, 45),
              index=pd.MultiIndex.from_tuples(list(itertools.product('pqr', [0, 1, 2], 'abcde')),
                                              names=['Index1', 'Index2', 'Index3']), name='P')
s = s.map({0:'A', 1:'B', 2:'C'})
So it looks like
Index1 Index2 Index3
p 0 a A
b A
c C
d B
e C
1 a B
b C
c C
d B
e B
q 0 a B
b C
c C
d C
e C
1 a A
b A
c B
d C
e A
I want to do a frequency count by value so that the output looks like
Index1 Index2 P
p 0 A 2
B 1
C 2
1 A 0
B 3
C 2
q 0 A 0
B 1
C 4
1 A 3
B 1
C 1
You can apply value_counts to the Series groupby:
In [11]: s.groupby(level=[0, 1]).value_counts() # equiv .apply(pd.value_counts)
Out[11]:
Index1 Index2
p 0 C 2
A 2
B 1
1 B 3
A 2
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 B 2
C 2
A 1
2 C 3
B 1
A 1
r 0 A 3
B 1
C 1
1 B 3
C 2
2 B 3
C 1
A 1
dtype: int64
If you want to include the 0s (which the above won't), you could use crosstab:
In [21]: ct = pd.crosstab(index=[s.index.get_level_values(0), s.index.get_level_values(1)],
                          columns=s.values,
                          rownames=s.index.names[:2],
                          colnames=s.index.names[2:3])
In [22]: ct
Out[22]:
Index3 A B C
Index1 Index2
p 0 2 1 2
1 2 3 0
2 3 1 1
q 0 3 1 1
1 1 2 2
2 1 1 3
r 0 3 1 1
1 0 3 2
2 1 3 1
In [23]: ct.stack()
Out[23]:
Index1 Index2 Index3
p 0 A 2
B 1
C 2
1 A 2
B 3
C 0
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 A 1
B 2
C 2
2 A 1
B 1
C 3
r 0 A 3
B 1
C 1
1 A 0
B 3
C 2
2 A 1
B 3
C 1
dtype: int64
Which may be slightly faster...
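Another way to bring the zero counts back, sketched under the assumption that every value occurs at least once somewhere in the Series, is to unstack the value_counts result with a fill value and stack it again. The small deterministic Series below is made up for illustration.

```python
import itertools

import pandas as pd

index = pd.MultiIndex.from_tuples(
    list(itertools.product('pq', [0, 1], 'abc')),
    names=['Index1', 'Index2', 'Index3'])
# Fixed values instead of random ones, so the counts are reproducible.
s = pd.Series(list('AACBBBCCCABA'), index=index, name='P')

counts = (s.groupby(level=[0, 1]).value_counts()
           .unstack(fill_value=0)   # missing combinations become 0
           .stack())                # back to one row per (Index1, Index2, value)
print(counts)
```

A value that never appears anywhere in the Series would still be missing, since unstack can only create columns for values it has seen.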

numerical coding of mutated residues and positions

I'm writing a Python program that has to compute a numerical coding of mutated residues and positions for a set of strings. These strings are protein sequences. The sequences are stored in a FASTA-format file, with the protein sequences separated by commas. The sequence lengths may differ between proteins. I tried to find the positions and residues that are mutated.
I used following code for getting this.
a = 'AGFESPKLH'
b = 'KGFEHMKLH'
for i in range(len(a)):
    if a[i] != b[i]:
        print(i, a[i], b[i])
But I want to use the sequence file as the input file. The following figure describes my project. In this figure, the first box represents the alignment of the input file sequences, and the last box represents the output file.
How can I do this in Python?
Please help me.
Thank you, everyone, for your time.
example:
input file
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
positions 1 2 3 4 5 6 1 2 3 4 5 6
protein sequence1 M T A Q D D T A D
protein sequence2 M T A Q D D T A D
protein sequence3 M T S Q E D T S E
protein sequence4 M T A Q D D T A D
protein sequence5 M K A Q H D K A H
PROTEIN SEQUENCE ALIGNMENT DISCARD NON-VARIABLE REGION
positions 2 2 3 3 5 5 5
protein sequence1 T A D
protein sequence2 T A D
protein sequence3 T S E
protein sequence4 T A D
protein sequence5 K A H
MUTATED RESIDUE IS SPLITED TO SEPARATE COLUMN
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
If you are to work with tabular data, consider pandas:
import pandas as pd
data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'
df = pd.DataFrame([list(row) for row in data.split(',')])
# One 0/1 column per (position, residue) pair; positions are 0-based here
dummies = pd.DataFrame({str(col) + val: (df[col] == val).astype(int)
                        for col in df.columns for val in sorted(set(df[col]))})
print(dummies)
output:
0M 1K 1T 2A 2S 3Q 4D 4E 4H 5D
0 1 0 1 1 0 1 1 0 0 1
1 1 0 1 1 0 1 1 0 0 1
2 1 0 1 0 1 1 0 1 0 1
3 1 0 1 1 0 1 1 0 0 1
4 1 1 0 1 0 1 0 0 1 1
If you want to drop the columns with all ones:
print(dummies.loc[:, ~dummies.all()])
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Something like this?
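ls = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
# Each sequence becomes a set of (1-based position, residue) pairs
pos = [set(enumerate(x, 1)) for x in ls]
# Sorted union of every (position, residue) pair seen in any sequence
alle = sorted(set().union(*pos))
print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))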
protein_sequence = "MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN"
#Parse the file
proteins = protein_sequence.split(",")
#For each protein sequence remove the duplicates
proteins = map(lambda x:"".join(set(list(x))), proteins)
#Create result
result = []
key_set = ['T', 'K', 'A', 'S', 'D', 'E', 'K', 'R', 'D', 'N', 'E', 'Y', 'M', 'L', 'P', 'N', 'Q']
for protein in proteins:
    local_dict = dict(zip(key_set, [0] * len(key_set)))
    #Split the protein in amino acid
    components = list(protein)
    for amino_acid in components:
        local_dict[amino_acid] = 1
    result.append((protein, local_dict))
You can use the pandas function get_dummies to do most of the hard work:
In [11]: s # a pandas Series (DataFrame's column)
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='', dtype=int)
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
To put your data into a DataFrame you could use:
df = pd.DataFrame([list(seq) for seq in 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')])
In [20]: df
Out[20]:
0 1 2 3 4 5
0 M T A Q D D
1 M T A Q D D
2 M T S Q E D
3 M T A Q D D
4 M K A Q H D
And to find those columns which have differing values:
In [21]: (df.iloc[0] != df).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting this all together:
In [31]: I = df.columns[(df.iloc[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='', dtype=int) for i in I]
In [33]: df[[]].join(J)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
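The output above uses 0-based positions, while the question asks for 1-based labels such as 2T and 5H. A minimal sketch of the whole pipeline with 1-based positions follows. It assumes all sequences have the same length; the row labels and the names `variable` and `coded` are illustrative.

```python
import pandas as pd

data = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'
df = pd.DataFrame([list(seq) for seq in data.split(',')])

# Shift the column labels so that positions start at 1, as in the question.
df.columns = df.columns + 1
df.index = [f"sequence{n}" for n in range(1, len(df) + 1)]

# Keep only positions where more than one residue occurs.
variable = df.loc[:, df.nunique() > 1]

# One 0/1 column per (position, residue) pair, named like '2T' or '5H'.
coded = pd.get_dummies(variable,
                       prefix=[str(c) for c in variable.columns],
                       prefix_sep='', dtype=int)
print(coded)
```

Within each position, get_dummies orders the residues alphabetically, so 2K comes before 2T rather than in the order shown in the question.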
