I have a dataframe like this
a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0
and the original corresponding string
(a:0,b:0,c:0,d:0,e:0,f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0
The algorithm I thought of is as follows.
In row mut1, we can see that f,g,h,i,j,k,l,m have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,(f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0):0
In row mut2, we can see that f,g,h,i,j have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,((f:0,g:0,h:0,i:0,j:0):0,k:0,l:0,m:0):0):0
Until mut10, it continues to cluster samples in f,g,h,i,j,k,l,m.
And the output will be
(a:0,b:0,c:0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
(For a row with one "1", just skip the process)
From mut11, it starts to cluster samples a,b,c,d,e,
and similarly, the final output will be
(((a:0,b:0):0,c:0):0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
So the algorithm is:
Cluster the samples that share the same feature.
After clustering, add ":0" after the closing parenthesis.
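To make this concrete, here is a rough, untested Python sketch of the containment-based clustering I have in mind (pandas assumed; the to_newick name is only for illustration). Note that it keeps the complementary a,b,c,d,e clade explicit at the root, whereas the final output above collapses it:
import pandas as pd

def to_newick(df):
    # df: 0/1 table with mutations as rows and samples as columns, e.g.
    # df = pd.read_csv("mutations.txt", sep=r"\s+", index_col=0)  # hypothetical file
    samples = list(df.columns)
    # Each row with two or more 1s defines a clade of the samples marked 1.
    clades = sorted(
        {frozenset(df.columns[df.loc[r] == 1])
         for r in df.index if df.loc[r].sum() >= 2},
        key=len, reverse=True)
    root = frozenset(samples)
    children = {c: [] for c in [root] + clades}

    def parent_of(member_set):
        # Smallest clade (or the root) that strictly contains member_set.
        return min((c for c in [root] + clades if member_set < c), key=len)

    for c in clades:
        children[parent_of(c)].append(c)
    for s in samples:
        children[parent_of(frozenset([s]))].append(s)

    def render(node):
        if isinstance(node, str):  # a single sample
            return f"{node}:0"
        return "(" + ",".join(render(ch) for ch in children[node]) + "):0"

    return render(root)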
Any suggestions on this process?
P.S. I have posted a similar question,
Creating a newick format from dataframe with 0 and 1
but this one is more detailed.
Your question asks for a solution in Python, which I'm not familiar with. Hopefully, the following procedure in R will be helpful as well.
What your question describes is a matrix representation of a tree. Such a tree can be retrieved from the matrix with a maximum parsimony method using the phangorn package. To manipulate trees in R, the Newick format is useful. Newick differs from the tree representation in your question only in that it ends with a semicolon.
First, prepare a starting tree in phylo format.
library(phangorn)
tree0 <- read.tree(text = "(a,b,c,d,e,f,g,h,i,j,k,l,m);")
Second, convert your data.frame to a phyDat object, where the rows represent samples and the columns represent features. The phyDat object also needs to know which levels are present in the data, which are 0 and 1 in this case. Combining the starting tree with the data, we calculate the maximum parsimony tree.
dat0 = read.table(text = " a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0")
dat1 <- phyDat(data = t(dat0),
               type = "USER",
               levels = c(0, 1))
tree1 <- optim.parsimony(tree = tree0, data = dat1)
plot(tree1)
tree1 is now a cladogram with no branch lengths. An object of class phylo is effectively a list, so the zero branch lengths can be added as an extra element.
tree2 <- tree1
tree2$edge.length <- rep(0, nrow(tree2$edge))
Last, we write the tree into a character vector in newick format and remove the semicolon at the end to match the requirement.
tree3 <- write.tree(tree2)
tree3 <- sub(";", "", tree3)
tree3
# [1] "((e:0,d:0):0,(c:0,(b:0,a:0):0):0,((m:0,(l:0,k:0):0):0,((i:0,h:0):0,j:0,(g:0,f:0):0):0):0)"
Given an array of size N containing only 0s, 1s, and 2s; sort the array in ascending order.
Example 1:
Input:
N = 5
arr[]= {0 2 1 2 0}
Output:
0 0 1 2 2
Explanation:
0s 1s and 2s are segregated
into ascending order.
INPUT CODE:
'''
class Solution:
    def sort012(self, arr, n):
        low = mid = arr[0]
        high = len(arr) - 1
        while mid <= high:
            if arr[mid] == 0:
                arr[mid], arr[low] = arr[low], arr[mid]
                mid += 1
                low += 1
            elif arr[mid] == 1:
                mid += 1
            else:
                arr[mid], arr[high] = arr[high], arr[mid]
                high -= 1
        return arr
        # code here
        # arr.sort()
        # return arr
'''
ERROR TEST CASE
Input:
65754
2 0 2 0 0 1 2 2 2 1 1 0 1 1 1 2 0 1 2 1 0 1 2 0 0 0 2 0 1 0 0 0 1 2 1 1 1 2 1 2 1 2 2 1 1 2 0 2 0 0 1 2 1 2 1 1 2 1 2 0 0 1 0 2 1 1 2 0 2 0 1 2 2 2 2 1 0 1 2 2 0 1 1 1 0 1 2 0 0 2 1 0 0 2 2 1 0 0 0 2 1 0 2 1 0 0 2 0 2 1 2 1 1 1 2 1 1 2 0 1 0 0 2 0 1 2 0 0 2 1 0 0 2 0 2 2 0 2 2 2 0 1 0 2 1 1 0 1 2 1 0 0 2 0 1 0 1 1 2 2 0 1 0 0 0 2 1 0 1 0 2 1 1 1 0 2 2 2 1 0 1 0 1 0 0 0 1 1 0 0 2 0 1 0 1 0 2 2 0 1 0 1 1 2 0 1 2 0 2 2 1 0 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 0 2 0 2 0 1 0 0 0 2 0 1 2 2 1 0 0 2 0 0 .................
Its Correct output is:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .................
And Your Code's output is:
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .................
Why not simply use the built-in sort()? arr.sort() should do the trick. Or is there a particular way you are trying to solve this?
As for the failing test case: the likely bug is the initialization low = mid = arr[0], which sets the two pointers to the first element's value instead of to index 0. When the input happens to start with a 2, as in the error test case, mid starts at index 2 and the first two positions are never examined, which is why the output begins with 2 0. Initializing low = mid = 0 should fix the posted code.
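If the built-in sort is acceptable, the whole method collapses to a few lines (a sketch against the same Solution/sort012 signature as the posted code):
class Solution:
    def sort012(self, arr, n):
        # Timsort handles a 0/1/2 array fine; O(n log n) instead of O(n), but simple.
        arr.sort()
        return arr
If you do want the single-pass, pointer-based (Dutch national flag) approach, here is a working version: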
def ZOT(arr):
    low = count = 0
    high = len(arr) - 1
    while count <= high:
        # if arr[count] == 2: swap arr[count] and arr[high], then high -= 1
        if arr[count] == 2:
            arr[count], arr[high] = arr[high], arr[count]
            high = high - 1
        # if arr[count] == 1: don't swap anything, just increment count
        elif arr[count] == 1:
            count = count + 1
        # if arr[count] == 0: swap arr[count] and arr[low], then count += 1 and low += 1
        elif arr[count] == 0:
            arr[count], arr[low] = arr[low], arr[count]
            count = count + 1
            low = low + 1
        else:
            return -1
    return arr

if __name__ == "__main__":
    arr = [0, 1, 2, 1, 2, 0, 2, 0]
    print(ZOT(arr))
# Output: [0, 0, 0, 1, 1, 2, 2, 2]
The documentation seems to be bare-bones, and the example given in the standard TF tutorial does not highlight a behavior I see. Let's say you have an imbalanced dataset of 1s and 0s (pos and neg), and you want to sample at weights [0.5, 0.5] so that you see the positives more frequently. You would do this:
import numpy as np
import tensorflow as tf
pos_ds = tf.data.Dataset.from_tensor_slices(np.ones(shape=(16, 1)))
neg_ds = tf.data.Dataset.from_tensor_slices(np.zeros(shape=(128, 1)))
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
And if I want to see how the pos and neg are distributed as I go through the dataset:
xs = []
for x in resampled_ds:
    xs.append(int(x.numpy()[0]))
xs = np.array(xs)
print(xs)
np.bincount(xs)
I see this:
[1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1
0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
array([128, 16])
There are 128 negatives and 16 positives. If I use this as my train_ds, it will be equivalent to no sampling at all, and worse, the negatives are no longer uniformly distributed across the steps / epoch. I am guessing that the 0.5 sampling happens at the beginning, and once it "runs out" of 1s, it just keeps sampling the zeros. It clearly doesn't do sampling with replacement for the 1s. I think the 1s and 0s will only be 0.5/0.5 if you stop after all the 1s are sampled.
It looks like this is the behavior, but it isn't the only sensible one. I want to sample the positives multiple times (i.e. sampling with replacement) in one epoch, with approximately equal amounts of pos and neg; is there an option in this API for that? Also, I have data augmentation, so the positives are not actually identical during training.
You can do something like this for the replacement issue:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds.repeat(128 // 16), neg_ds], weights=[0.5, 0.5])
And the result is:
[1 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1
1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 1 1
1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0
0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Out[2]: array([128, 128], dtype=int64)
Actually, I also found that the solution is right there in the TF tutorial imbalanced_data.ipynb (I totally missed this one in my own notebook).
pos_ds = pos_ds.shuffle(BUFFER_SIZE).repeat()
neg_ds = neg_ds.shuffle(BUFFER_SIZE).repeat()
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
The tutorial further suggests a heuristic for setting resampled_steps_per_epoch.
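A rough sketch of that heuristic (neg_count and BATCH_SIZE are assumed names): with 50/50 resampling, about half of each batch is negative, so covering each negative example roughly once takes about 2 * neg_count / BATCH_SIZE steps.
resampled_steps_per_epoch = int(np.ceil(2.0 * neg_count / BATCH_SIZE))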
However, shuffle + repeat is still not equivalent to true sampling with replacement for the minority class. A repeat() followed by a shuffle() may do it. I can update this after trying both ways.
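A minimal sketch of that repeat-then-shuffle variant (same pos_ds, neg_ds, and BUFFER_SIZE as above): repeating first lets shuffle() mix elements across passes, which is closer to drawing the minority class with replacement.
pos_resampled = pos_ds.repeat().shuffle(BUFFER_SIZE)
neg_resampled = neg_ds.repeat().shuffle(BUFFER_SIZE)
resampled_ds = tf.data.experimental.sample_from_datasets(
    [pos_resampled, neg_resampled], weights=[0.5, 0.5])
Both inputs are now infinite, so the combined dataset is infinite as well and still needs a fixed number of steps per epoch.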
I am a beginner to programming in general, and my situation is as follows.
I am doing a computation using software (polymake) that I run interactively in my terminal, and it outputs some numeric data that looks like this:
facet 1 contains vertices:
1 0 0 1 0 0 0 0 0 0 0 0 0 0
1 -8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 1 0 0 0 0 0 0 0
1 8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 0 0 1 0 0 0 0 0 0
1 1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 0 0 0 0 1 0 0 0 0 0
1 -4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 1 0 0 0 0
1 4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 0 1 0 0 0
1 -3272056622340821/9007199254740992 -4252622667048423/36028797018963968 0 0 0 0 0 0 0 0 1 0 0
1 0 -6880887921216781/18014398509481984 0 0 0 0 0 0 0 0 0 1 0
1 1000927696824871/2251799813685248 -6629910960894707/18014398509481984 0 0 0 0 0 0 0 0 0 0 1
1 0 0 2 0 0 0 0 0 0 0 0 0 0
1 -8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 2 0 0 0 0 0 0 0 0
1 0 1 0 0 0 2 0 0 0 0 0 0 0
1 8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 0 0 2 0 0 0 0 0 0
1 1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 0 0 0 0 2 0 0 0 0 0
1 -4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 2 0 0 0 0
1 4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 0 2 0 0 0
1 -3272056622340821/9007199254740992 -4252622667048423/36028797018963968 0 0 0 0 0 0 0 0 2 0 0
1 0 -6880887921216781/18014398509481984 0 0 0 0 0 0 0 0 0 2 0
1 1000927696824871/2251799813685248 -6629910960894707/18014398509481984 0 0 0 0 0 0 0 0 0 0 2
facet 2 contains vertices:
1 0 0 1 0 0 0 0 0 0 0 0 0 0
1 -1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 1 0 0 0 0 0 0 0 0 0
1 -8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 1 0 0 0 0 0 0 0
1 8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 0 0 1 0 0 0 0 0 0
1 1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 0 0 0 0 1 0 0 0 0 0
1 -4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 1 0 0 0 0
1 4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 0 1 0 0 0
1 0 -6880887921216781/18014398509481984 0 0 0 0 0 0 0 0 0 1 0
1 1000927696824871/2251799813685248 -6629910960894707/18014398509481984 0 0 0 0 0 0 0 0 0 0 1
1 0 0 2 0 0 0 0 0 0 0 0 0 0
1 -1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 2 0 0 0 0 0 0 0 0 0
1 -8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 2 0 0 0 0 0 0 0 0
1 0 1 0 0 0 2 0 0 0 0 0 0 0
1 8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 0 0 2 0 0 0 0 0 0
1 1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 0 0 0 0 2 0 0 0 0 0
1 -4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 2 0 0 0 0
1 4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 0 2 0 0 0
1 0 -6880887921216781/18014398509481984 0 0 0 0 0 0 0 0 0 2 0
1 1000927696824871/2251799813685248 -6629910960894707/18014398509481984 0 0 0 0 0 0 0 0 0 0 2
I need to use this data to do computations, which I am doing using Python.
In order for me to run my algorithm on the data, I need to first organize it into numpy arrays as follows:
F_2 = np.array([
[0,0,1,0,0,0,0,0,0,0,0,0,0],
[-1323574716436937/2251799813685248,-7286977229400801/9007199254740992,0,1,0,0,0,0,0,0,0,0,0],
[-8566355578160561/9007199254740992,5566755204060609/18014398509481984,0,0,1,0,0,0,0,0,0,0,0],
[0,1,0,0,0,1,0,0,0,0,0,0,0],
[8566355578160561/9007199254740992,5566755204060609/18014398509481984,0,0,0,0,1,0,0,0,0,0,0],
[1323574716436937/2251799813685248,-7286977229400801/9007199254740992,0,0,0,0,0,1,0,0,0,0,0],
[-4044484486813853/18014398509481984,5566755204060609/18014398509481984,0,0,0,0,0,0,1,0,0,0,0],
[4044484486813853/18014398509481984,5566755204060609/18014398509481984,0,0,0,0,0,0,0,1,0,0,0],
[0,-6880887921216781/18014398509481984,0,0,0,0,0,0,0,0,0,1,0],
[1000927696824871/2251799813685248,-6629910960894707/18014398509481984,0,0,0,0,0,0,0,0,0,0,1],
[0,0,2,0,0,0,0,0,0,0,0,0,0],
[-1323574716436937/2251799813685248,-7286977229400801/9007199254740992,0,2,0,0,0,0,0,0,0,0,0],
[-8566355578160561/9007199254740992,5566755204060609/18014398509481984,0,0,2,0,0,0,0,0,0,0,0],
[0,1,0,0,0,2,0,0,0,0,0,0,0],
[8566355578160561/9007199254740992,5566755204060609/18014398509481984,0,0,0,0,2,0,0,0,0,0,0],
[1323574716436937/2251799813685248,-7286977229400801/9007199254740992,0,0,0,0,0,2,0,0,0,0,0],
[-4044484486813853/18014398509481984,5566755204060609/18014398509481984,0,0,0,0,0,0,2,0,0,0,0],
[4044484486813853/18014398509481984,5566755204060609/18014398509481984,0,0,0,0,0,0,0,2,0,0,0],
[0,-6880887921216781/18014398509481984,0,0,0,0,0,0,0,0,0,2,0],
[1000927696824871/2251799813685248,-6629910960894707/18014398509481984,0,0,0,0,0,0,0,0,0,0,2]
])
This is extremely tedious to do by hand, since I have to place the data manually into a 2D numpy array. This involves placing commas between the numbers, wrapping the sequence of numbers on each line in square brackets to form the rows of the 2D array, and so on.
I am wondering if there is a way I can do this much faster with programming commands (especially since I have to do this many times)?
Thank you very much in advance.
Use pandas:
import pandas as pd
df = pd.read_csv('yourContent', delimiter=r' ')
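One caveat: the entries include exact fractions such as p/q, which pandas will read as strings, not numbers. A possible refinement (file name assumed; the file is assumed to contain only the numeric rows of one facet) is to read everything as strings and convert explicitly, dropping the leading column of 1s as in the hand-built F_2:
import numpy as np
import pandas as pd
from fractions import Fraction

df = pd.read_csv("facet2.txt", sep=r"\s+", header=None, dtype=str)
# Convert each token (integer or fraction) to a float, skipping the first column.
F_2 = np.array([[float(Fraction(x)) for x in row]
                for row in df.iloc[:, 1:].to_numpy()])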
You could copy-paste your data into a text file and then use numpy.genfromtxt(), e.g.:
import numpy as np
arr = np.genfromtxt(filepath)
More info on how to use it is in the linked documentation.
An even more efficient approach would be to collect the output of your script.
One way of doing this in Python is by running the output-producing script via subprocess functionalities (e.g. subprocess.run()).
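A rough sketch of the subprocess route (the polymake command line and script name below are placeholders): capture stdout, keep only the purely numeric lines, and convert the exact fractions to floats, dropping the leading column of 1s as in the hand-built F_2.
import subprocess
from fractions import Fraction
import numpy as np

result = subprocess.run(["polymake", "--script", "my_script.pl"],
                        capture_output=True, text=True, check=True)

rows = []
for line in result.stdout.splitlines():
    tokens = line.split()
    # Keep lines made up entirely of integers or fractions; this skips the
    # "facet N contains vertices:" headers.
    if tokens and all(t.lstrip("-").replace("/", "").isdigit() for t in tokens):
        rows.append([float(Fraction(t)) for t in tokens[1:]])

arr = np.array(rows)
# Note: this lumps all facets into one array; splitting on the header lines
# would give one array per facet.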