I have a dataframe like this
a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0
and the original corresponding string
(a:0,b:0,c:0,d:0,e:0,f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0
The algorithm I thought was like this.
In row mut1, we can see that f,g,h,i,j,k,l,m have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,(f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0):0
In row mut2, we can see that f,g,h,i,j have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,((f:0,g:0,h:0,i:0,j:0):0,k:0,l:0,m:0):0):0
Until mut10, it continues to cluster samples in f,g,h,i,j,k,l,m.
And the output will be
(a:0,b:0,c:0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
(For a row with one "1", just skip the process)
From mut11, it starts to cluster samples a,b,c,d,e
and similarly, the final output will be
(((a:0,b:0):0,c:0):0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
So the algorithm is
Cluster the samples with the same features.
After clustering, add ":0" behind the closing parenthesis.
Any suggestions on this process?
*p.s. I have posted a similar question,
Creating a newick format from dataframe with 0 and 1
but this one is more detailed.
Your question asks for a solution in Python, which I'm not familiar with. Hopefully, the following procedure in R will be helpful as well.
What your question describes is a matrix representation of a tree. Such a tree can be retrieved from the matrix with a maximum parsimony method using the phangorn package. To manipulate trees in R, the newick format is useful. Newick differs from the tree representation in your question only by ending with a semicolon.
First, prepare a starting tree in phylo format.
library(phangorn)
tree0 <- read.tree(text = "(a,b,c,d,e,f,g,h,i,j,k,l,m);")
Second, convert your data.frame to a phyDat object, where the rows represent samples and the columns represent features. The phyDat object also needs to know which levels are present in the data, which are 0 and 1 in this case. Combining the starting tree with the data, we calculate the maximum parsimony tree.
dat0 = read.table(text = " a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0")
dat1 <- phyDat(data = t(dat0),
type = "USER",
levels = c(0, 1))
tree1 <- optim.parsimony(tree = tree0, data = dat1)
plot(tree1)
The tree is now a cladogram with no branch lengths. An object of class phylo is effectively a list, so the zero branch lengths can be added as an extra element.
tree2 <- tree1
tree2$edge.length <- rep(0, nrow(tree2$edge))
Last, we write the tree into a character vector in newick format and remove the semicolon at the end to match the requirement.
tree3 <- write.tree(tree2)
tree3 <- sub(";", "", tree3)
tree3
# [1] "((e:0,d:0):0,(c:0,(b:0,a:0):0):0,((m:0,(l:0,k:0):0):0,((i:0,h:0):0,j:0,(g:0,f:0):0):0):0)"
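Since the question asks for Python, the clustering rule it describes can also be sketched directly. This is only a minimal sketch of that rule, not the parsimony method used above. The names `leaves`, `insert_clade`, `fmt` and `to_newick` are made up for illustration, and the code assumes that the groups defined by the rows are either nested or disjoint (it raises an error otherwise). Note that it applies the rule to every row with more than one 1, so row mut11 also wraps a..e in its own group, which gives one more pair of parentheses than the final string shown in the question.

```python
def leaves(node):
    """Return the set of leaf names below a node (a str leaf or a list of nodes)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def insert_clade(node, clade):
    """Wrap the children of `node` that make up `clade` in a new nested list."""
    # Descend first if an existing subgroup already contains the whole clade.
    for child in node:
        if isinstance(child, list) and clade <= leaves(child):
            insert_clade(child, clade)
            return
    inside = [child for child in node if leaves(child) <= clade]
    if set().union(*(leaves(child) for child in inside)) != clade:
        raise ValueError(f"group {sorted(clade)} conflicts with an earlier group")
    if len(inside) == len(node):
        return  # this exact group already exists
    # Put the new group where its first member was, keeping the column order.
    first = node.index(inside[0])
    rest = [child for child in node if not leaves(child) <= clade]
    rest.insert(first, inside)
    node[:] = rest


def fmt(node):
    """Serialise a node, appending ':0' after every leaf and closing parenthesis."""
    if isinstance(node, str):
        return f"{node}:0"
    return "(" + ",".join(fmt(child) for child in node) + "):0"


def to_newick(columns, rows):
    root = list(columns)
    clades = [{c for c, bit in zip(columns, row) if bit} for row in rows]
    # Larger groups first so smaller ones nest inside; rows with a single 1 are skipped.
    for clade in sorted((c for c in clades if len(c) > 1), key=len, reverse=True):
        insert_clade(root, clade)
    return fmt(root)


columns = list("abcdefghijklm")
matrix = [
    "0000011111111", "0000011111000", "0000011000000", "0000010000000",
    "0000000110000", "0000000100000", "0000000001000", "0000000000111",
    "0000000000110", "0000000000001", "1111100000000", "1110000000000",
    "1100000000000", "1000000000000", "0001000000000", "0000100000000",
]
rows = [[int(ch) for ch in line] for line in matrix]
newick = to_newick(columns, rows)
print(newick)
```

With a pandas DataFrame, `list(df.columns)` and `df.values.tolist()` could be passed in place of `columns` and `rows`.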
I need to transform a df into another one. The original (df1) looks like this:
value
A--A 4
A--B 2
A--C 1
B--B 2
C--C 3
D--B 2
E--E 6
Then I have this other df2, filled with 0:
A B C D E
A 0 0 0 0 0
B 0 0 0 0 0
C 0 0 0 0 0
D 0 0 0 0 0
E 0 0 0 0 0
F 0 0 0 0 0
G 0 0 0 0 0
I need to convert it to a final df3, taking the values from the pairs in the index of df1 (separated by "--") and filling it like this:
A B C D E
A 4 2 1 0 0
B 2 2 0 2 0
C 1 0 3 0 0
D 0 2 0 0 0
E 0 0 0 0 6
F 0 0 0 0 0
G 0 0 0 0 0
There can be pairs in df2 that do not exist in df1. In that case the cell stays 0. Any suggestions?
You can create this from df itself. First, set df.index to a MultiIndex using str.split, and then unstack and reindex.
df.index = pd.MultiIndex.from_arrays(zip(*df.index.str.split('--')))
(df['value'].unstack()
.reindex(index=df2.index, columns=df2.columns)
.fillna(0, downcast='infer'))
A B C D E
A 4 2 1 0 0
B 0 2 0 0 0
C 0 0 3 0 0
D 0 2 0 0 0
E 0 0 0 0 6
F 0 0 0 0 0
G 0 0 0 0 0
If you know what rows and columns you want to use, you don't even need df2.
(df['value'].unstack()
.reindex(index=list('ABCDEFG'), columns=list('ABCDE'))
.fillna(0, downcast='infer'))
A B C D E
A 4 2 1 0 0
B 0 2 0 0 0
C 0 0 3 0 0
D 0 2 0 0 0
E 0 0 0 0 6
F 0 0 0 0 0
G 0 0 0 0 0
As per OP's comment, to keep the result symmetric, unstack your table so that NaNs are preserved, then fillna with the transpose:
v = (df['value'].unstack()
.reindex(index=df2.index, columns=df2.columns))
v.fillna(v.T.reindex_like(v)).fillna(0, downcast='infer')
A B C D E
A 4 2 1 0 0
B 2 2 0 2 0
C 1 0 3 0 0
D 0 2 0 0 0
E 0 0 0 0 6
F 0 0 0 0 0
G 0 0 0 0 0
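For reference, the symmetric version can be put together as a self-contained sketch. The construction of `df` below is reconstructed from the question, and the variable names `rows`, `cols`, `pairs` and `result` are only illustrative. Because the `downcast` argument of `fillna` is deprecated in recent pandas releases, this sketch uses `.fillna(0).astype(int)` instead; that is a substitution, not the code from the answer.

```python
import pandas as pd

# df1 from the question: one value per "X--Y" pair.
df = pd.DataFrame(
    {"value": [4, 2, 1, 2, 3, 2, 6]},
    index=["A--A", "A--B", "A--C", "B--B", "C--C", "D--B", "E--E"],
)
rows, cols = list("ABCDEFG"), list("ABCDE")  # the shape of df2

# Split each "X--Y" label into a two-level index, then spread level 2 into columns.
pairs = pd.MultiIndex.from_tuples([tuple(label.split("--")) for label in df.index])
v = (pd.Series(df["value"].to_numpy(), index=pairs)
       .unstack()
       .reindex(index=rows, columns=cols))

# Fill missing cells from the mirrored position, then set the remainder to 0.
result = v.fillna(v.T.reindex_like(v)).fillna(0).astype(int)
print(result)
```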
I would like to 'OR' between row and row+1
for example,
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
and the expected output will be like this
result 0 1 1 0 1 1
I know only how to sum it.
df.loc['result'] = df.sum()
but in this case I would like to do OR
thank you in advance
You can apply any over the first axis.
>>> df
>>>
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
>>>
>>> df.loc['result'] = df.any(axis=0).astype(int)
>>> df
>>>
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
result 0 1 1 0 1 1 1
... assuming that in your output you forgot the last column.
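The same idea as a self-contained sketch, with the frame rebuilt from the question. It assumes the data contains only 0 and 1; under that assumption the column-wise maximum gives the same answer as `any`, which is shown as a cross-check. The names `result` and `alt` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 1, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 0, 0],
     [0, 0, 1, 0, 1, 0, 1]],
    index=["r0", "r1", "r2"],
    columns=list("ABCDEFG"),
)

# Logical OR down each column: True if any row holds a non-zero value.
result = df.any(axis=0).astype(int)

# For strictly 0/1 data the column maximum is equivalent.
alt = df.max(axis=0)

print(result.tolist())
```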
I have an empty DataFrame with a Multi-Index on both index and columns. I also have a list of string pairs that are coordinates on the second index level. Since all of my second-level labels are unique, I am hoping to locate the coordinates and set values using only my list of strings. Take a look at the example below.
df=
DNA Cat2 ....
Item A B C D E F F H I J
DNA Item
Cat2 A 0 0 0 0 0 0 0 0 0 0
B 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0
....
str_cord = [(A,B),(A,H),(A,I),(B,H),(B,I),(H,I)]
#and my output should be like below.
df_result=
DNA Cat2 ....
Item A B C D E F F H I J
DNA Item
Cat2 A 0 1 0 0 0 0 0 1 1 0
B 0 0 0 0 0 0 0 1 1 0
C 0 0 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0 1 0
....
It looks kind of complicated, but all I want to do is use each pair in str_cord, such as str_cord[0], as a coordinate in df_result. I tried .loc, but it seems I need to supply the level-1 index as well. I am looking for a way to find the coordinates with the level-2 strings only, without having to specify Multi-Index level 1. I hope that makes sense, and thanks in advance! (The data itself is very big, so the solution should be as efficient as possible.)
You can use:
idx = pd.IndexSlice
for i, j in str_cord:
    df.loc[idx[:, i], idx[:, j]] = 1
Sample:
L = list('ABCDEFGHIJ')
mux = pd.MultiIndex.from_product([['Cat1','Cat2'], L])
df = pd.DataFrame(0, index=mux, columns=mux)
print (df)
Cat1 Cat2
A B C D E F G H I J A B C D E F G H I J
Cat1 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Cat2 A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
str_cord = [('A','B'),('A','H'),('A','I'),('B','H'),('B','I'),('H','I')]
idx = pd.IndexSlice
for i, j in str_cord:
    df.loc[idx[:, i], idx[:, j]] = 1
print (df)
Cat1 Cat2
A B C D E F G H I J A B C D E F G H I J
Cat1 A 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 1 0
B 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Cat2 A 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 1 0
B 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
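Because the question mentions that the data is very big, here is a hedged sketch of one way to avoid the Python-level loop over `.loc`. It is an alternative technique, not the code above: it marks the pairs once in a small matrix over the unique second-level labels and then expands that matrix to the full shape with NumPy indexing. It assumes every second-level label appears in `labels`; the names `labels`, `small`, `r`, `c` and `result` are made up for illustration, and whether it is actually faster depends on the data.

```python
import numpy as np
import pandas as pd

L = list("ABCDEFGHIJ")
mux = pd.MultiIndex.from_product([["Cat1", "Cat2"], L])
df = pd.DataFrame(0, index=mux, columns=mux)
str_cord = [("A", "B"), ("A", "H"), ("A", "I"), ("B", "H"), ("B", "I"), ("H", "I")]

# Mark each (row label, column label) pair once in a small label-by-label matrix.
labels = pd.Index(L)
small = np.zeros((len(labels), len(labels)), dtype=int)
row_labels = [i for i, _ in str_cord]
col_labels = [j for _, j in str_cord]
small[labels.get_indexer(row_labels), labels.get_indexer(col_labels)] = 1

# Map every row and column of df to the position of its second-level label,
# then expand the small matrix to the full shape in one indexing step.
r = labels.get_indexer(df.index.get_level_values(1))
c = labels.get_indexer(df.columns.get_level_values(1))
result = pd.DataFrame(small[np.ix_(r, c)], index=df.index, columns=df.columns)
print(result)
```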
I have this simple dataframe in a data.csv file:
I,C,v
a,b,1
b,a,2
e,a,1
e,c,0
b,d,1
a,e,1
b,f,0
I would like to pivot it, and then return a square table (as a matrix). So far I've read the dataframe and built a pivot table with:
df = pd.read_csv('data.csv')
d = pd.pivot_table(df,index='I',columns='C',values='v')
d.fillna(0,inplace=True)
correctly obtaining:
C a b c d e f
I
a 0 1 0 0 1 0
b 2 0 0 1 0 0
e 1 0 0 0 0 0
Now I would like to return a square table with the missing columns indices in the rows, so that the resulting table would be:
C a b c d e f
I
a 0 1 0 0 1 0
b 2 0 0 1 0 0
c 0 0 0 0 0 0
d 0 0 0 0 0 0
e 1 0 0 0 0 0
f 0 0 0 0 0 0
reindex can add rows and columns, and fill missing values with 0:
index = d.index.union(d.columns)
d = d.reindex(index=index, columns=index, fill_value=0)
yields
a b c d e f
a 0 1 0 0 1 0
b 2 0 0 1 0 0
c 0 0 0 0 0 0
d 0 0 0 0 0 0
e 1 0 0 0 0 0
f 0 0 0 0 0 0
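Putting the question and the answer together gives the following self-contained sketch. To keep it runnable without a file, the CSV text is read from an in-memory string instead of data.csv, and a final `astype(int)` is added because `pivot_table` averages values by default and so returns floats. The names `csv_text` and `square` are only illustrative.

```python
import io

import pandas as pd

csv_text = """I,C,v
a,b,1
b,a,2
e,a,1
e,c,0
b,d,1
a,e,1
b,f,0
"""

df = pd.read_csv(io.StringIO(csv_text))
d = pd.pivot_table(df, index="I", columns="C", values="v").fillna(0)

# Use the union of row and column labels on both axes to make the table square.
index = d.index.union(d.columns)
square = d.reindex(index=index, columns=index, fill_value=0).astype(int)
print(square)
```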