tf.data.experimental.sample_from_datasets not sampling as expected - python

The documentation seems bare-bones, and the example in the standard TF tutorial doesn't highlight the behavior I see. Let's say you have an imbalanced dataset of 1s and 0s (pos and neg), and you want to sample with weights [0.5, 0.5], so that you see the positives more frequently. You would do this:
pos_ds = tf.data.Dataset.from_tensor_slices(np.ones(shape=(16, 1)))
neg_ds = tf.data.Dataset.from_tensor_slices(np.zeros(shape=(128, 1)))
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
And if I want to see how the pos and neg are distributed as I go through the dataset:
xs = []
for x in resampled_ds:
    xs.append(int(x.numpy()[0]))
xs = np.array(xs)
print(xs)
np.bincount(xs)
I see this:
[1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1
0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
array([128, 16])
There are 128 negatives and 16 positives. If I use this as my train_ds, it is equivalent to doing no sampling at all, and worse, the negatives are no longer uniformly distributed across the steps of the epoch. I am guessing that the 0.5/0.5 sampling happens at the beginning, and once it "runs out" of 1s it just keeps sampling the 0s. It clearly does not sample the 1s with replacement; the mix would only be 0.5/0.5 if you stopped as soon as all the 1s had been sampled.
It looks like this is the intended behavior, but it isn't the only sensible one. I want to sample the positives multiple times (i.e. sample with replacement) within one epoch, with approximately equal numbers of pos and neg. Is there an option in this API for that? Also, I apply data augmentation, so the repeated positives are not actually identical during training.

You can do something like this for the replacement issue:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds.repeat(128 // 16), neg_ds], weights=[0.5, 0.5])
And the result is:
[1 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 1 1 0 1
1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 1 1
1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0
0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
array([128, 128], dtype=int64)

Actually, I also found that the solution is right there in that TF tutorial, imbalanced_data.ipynb (I totally missed this one in my own notebook).
pos_ds = pos_ds.shuffle(BUFFER_SIZE).repeat()
neg_ds = neg_ds.shuffle(BUFFER_SIZE).repeat()
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
The tutorial further suggests a heuristic for setting resampled_steps_per_epoch.
However, shuffle() followed by repeat() is still not equivalent to true sampling with replacement for the minority class; a repeat() followed by a shuffle() may do it. I will update this after trying both ways.
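For reference, here is a minimal end-to-end sketch of that tutorial-style pipeline, using the toy sizes from above (16 pos, 128 neg); BATCH_SIZE and BUFFER_SIZE are hypothetical values I picked, and the steps-per-epoch heuristic simply sizes one epoch so that each negative is seen roughly once:
import numpy as np
import tensorflow as tf
BATCH_SIZE = 8     # hypothetical batch size
BUFFER_SIZE = 256  # large enough to hold either source dataset
pos_ds = tf.data.Dataset.from_tensor_slices(np.ones(shape=(16, 1)))
neg_ds = tf.data.Dataset.from_tensor_slices(np.zeros(shape=(128, 1)))
# shuffle + repeat each source so the sampler never "runs out" of positives
pos_ds = pos_ds.shuffle(BUFFER_SIZE).repeat()
neg_ds = neg_ds.shuffle(BUFFER_SIZE).repeat()
resampled_ds = tf.data.experimental.sample_from_datasets(
    [pos_ds, neg_ds], weights=[0.5, 0.5]).batch(BATCH_SIZE)
# heuristic: one "epoch" shows each of the 128 negatives about once
resampled_steps_per_epoch = int(np.ceil(2.0 * 128 / BATCH_SIZE))
# model.fit(resampled_ds, steps_per_epoch=resampled_steps_per_epoch, ...)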

Related

Python make newick format using dataframe with 0s and 1s

I have a dataframe like this
a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0
and the original corresponding string
(a:0,b:0,c:0,d:0,e:0,f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0
The algorithm I thought of is like this.
In row mut1, we can see that f,g,h,i,j,k,l,m have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,(f:0,g:0,h:0,i:0,j:0,k:0,l:0,m:0):0):0
In row mut2, we can see that f,g,h,i,j have the same features.
So the string can be modified into
(a:0,b:0,c:0,d:0,e:0,((f:0,g:0,h:0,i:0,j:0):0,k:0,l:0,m:0):0):0
Until mut10, it continues to cluster samples in f,g,h,i,j,k,l,m.
And the output will be
(a:0,b:0,c:0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
(For a row with one "1", just skip the process)
From mut11, it starts to cluster samples a,b,c,d,e,
and similarly, the final output will be
(((a:0,b:0):0,c:0):0,d:0,e:0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0
So the algorithm is
Cluster the samples with the same features.
After clustering, add ":0" behind the closing parenthesis.
Any suggestions on this process?
P.S. I have uploaded a similar question, "Creating a newick format from dataframe with 0 and 1", but this one is more detailed.
Your question asks for a solution in Python, which I'm not familiar with. Hopefully, the following procedure in R will be helpful as well.
What your question describes is a matrix representation of a tree. Such a tree can be retrieved from the matrix with a maximum parsimony method using the phangorn package. To manipulate trees in R, the newick format is useful. Newick differs from the tree representation in your question by ending with a semicolon.
First, prepare a starting tree in phylo format.
library(phangorn)
tree0 <- read.tree(text = "(a,b,c,d,e,f,g,h,i,j,k,l,m);")
Second, convert your data.frame to a phyDat object, where the rows represent samples and columns features. The phyDat object also requires what levels are present in the data, which is 0 and 1 in this case. Combining the starting tree with the data, we calculate the maximum parsimony tree.
dat0 = read.table(text = " a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0")
dat1 <- phyDat(data = t(dat0),
               type = "USER",
               levels = c(0, 1))
tree1 <- optim.parsimony(tree = tree0, data = dat1)
plot(tree1)
The tree now contains a cladogram with no branch lengths. Class phylo is effectively a list, so the zero branch lengths can be added as an extra element.
tree2 <- tree1
tree2$edge.length <- rep(0, nrow(tree2$edge))
Last, we write the tree into a character vector in newick format and remove the semicolon at the end to match the requirement.
tree3 <- write.tree(tree2)
tree3 <- sub(";", "", tree3)
tree3
# [1] "((e:0,d:0):0,(c:0,(b:0,a:0):0):0,((m:0,(l:0,k:0):0):0,((i:0,h:0):0,j:0,(g:0,f:0):0):0):0)"
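For those who want to stay in Python: below is a minimal sketch of the clustering algorithm described in the question itself (not the parsimony method above). The newick() helper is illustrative, not from any library. It treats every mutation row as a clade (the set of samples carrying that mutation), skips single-sample rows as the question suggests, and nests maximal clades recursively. One caveat: the a..e clade from mut11 comes out as an explicit group at the root, which is topologically equivalent to the flattened root in the expected output.
import io
import pandas as pd
# the matrix from the question; in practice, read your own file instead
raw = """a b c d e f g h i j k l m
mut1 0 0 0 0 0 1 1 1 1 1 1 1 1
mut2 0 0 0 0 0 1 1 1 1 1 0 0 0
mut3 0 0 0 0 0 1 1 0 0 0 0 0 0
mut4 0 0 0 0 0 1 0 0 0 0 0 0 0
mut5 0 0 0 0 0 0 0 1 1 0 0 0 0
mut6 0 0 0 0 0 0 0 1 0 0 0 0 0
mut7 0 0 0 0 0 0 0 0 0 1 0 0 0
mut8 0 0 0 0 0 0 0 0 0 0 1 1 1
mut9 0 0 0 0 0 0 0 0 0 0 1 1 0
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
order = {s: i for i, s in enumerate(df.columns)}
# each mutation row defines a clade: the set of samples carrying it;
# rows with a single 1 are skipped, as described in the question
clades = {frozenset(df.columns[df.loc[m] == 1]) for m in df.index}
clades = {c for c in clades if len(c) > 1}
def newick(samples):
    # nest the maximal clades strictly contained in `samples`
    inner = [c for c in clades if c < samples]
    maximal = [c for c in inner if not any(c < d for d in inner)]
    covered = frozenset().union(*maximal)
    # key each subtree by its leftmost member to preserve column order
    subtrees = [(min(order[s] for s in c), newick(c)) for c in maximal]
    subtrees += [(order[s], f"{s}:0") for s in samples - covered]
    return "(" + ",".join(t for _, t in sorted(subtrees)) + "):0"
print(newick(frozenset(df.columns)))
# ((((a:0,b:0):0,c:0):0,d:0,e:0):0,(((f:0,g:0):0,(h:0,i:0):0,j:0):0,((k:0,l:0):0,m:0):0):0):0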

*solved* How can I check if all rows in a numpy array equal 1 when the number of arrays differs?

I'm looking for a way to test (element-wise) which rows across N arrays are equal to 1. I know there are good ways to do this when the number of arrays is known in advance, which I've found by searching around. However, in my case, I will not be able to keep track of the number of arrays in a time-efficient manner. Below is the solution and desired output when comparing two arrays.
I appreciate any help. Thank you!
A = np.array([1,2,3])
B = np.array([1,1,1])
C = np.logical_and(A==1,B==1)
array([ True, False, False])
I could also use np.where(A==1) if I have an array of floats made up of multiple arrays. However, this only gives me the occurrences where a single value is equal to 1.
Example array (note that it won't always be 3 arrays; it can be 5 or 15 as well):
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 1
1 0 0
1 0 0
1 0 0
1 0 1
1 0 1
1 0 1
1 0 0
1 0 1
1 0 0
1 0 1
1 0 1
1 0 1
1 0 1
1 0 1
1 0 0
0 0 0
0 0 0
0 0 0
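A minimal sketch of the standard approach: stack the arrays as columns and reduce across that axis with .all(); this works for any number of arrays (3, 5, or 15):
import numpy as np
A = np.array([1, 2, 3])
B = np.array([1, 1, 1])
C = np.array([1, 0, 1])
arrays = [A, B, C]                  # any number of 1-D arrays
stacked = np.stack(arrays, axis=1)  # shape (n_rows, n_arrays)
mask = (stacked == 1).all(axis=1)   # True where every array has a 1
print(mask)                         # [ True False False]
# equivalent without stacking:
mask2 = np.logical_and.reduce([a == 1 for a in arrays])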

How to put data produced by my terminal into a numpy array

I am a beginner to programming in general, and my situation is as follows.
I am doing a computation using software (polymake) that I'm running interactively in my terminal, and the computation outputs numeric data that looks like this:
facet 1 contains vertices:
1 0 0 1 0 0 0 0 0 0 0 0 0 0
1 -8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 1 0 0 0 0 0 0 0
1 8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 0 0 1 0 0 0 0 0 0
1 1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 0 0 0 0 1 0 0 0 0 0
1 -4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 1 0 0 0 0
1 4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 0 1 0 0 0
1 -3272056622340821/9007199254740992 -4252622667048423/36028797018963968 0 0 0 0 0 0 0 0 1 0 0
1 0 -6880887921216781/18014398509481984 0 0 0 0 0 0 0 0 0 1 0
1 1000927696824871/2251799813685248 -6629910960894707/18014398509481984 0 0 0 0 0 0 0 0 0 0 1
1 0 0 2 0 0 0 0 0 0 0 0 0 0
1 -8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 2 0 0 0 0 0 0 0 0
1 0 1 0 0 0 2 0 0 0 0 0 0 0
1 8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 0 0 2 0 0 0 0 0 0
1 1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 0 0 0 0 2 0 0 0 0 0
1 -4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 2 0 0 0 0
1 4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 0 2 0 0 0
1 -3272056622340821/9007199254740992 -4252622667048423/36028797018963968 0 0 0 0 0 0 0 0 2 0 0
1 0 -6880887921216781/18014398509481984 0 0 0 0 0 0 0 0 0 2 0
1 1000927696824871/2251799813685248 -6629910960894707/18014398509481984 0 0 0 0 0 0 0 0 0 0 2
facet 2 contains vertices:
1 0 0 1 0 0 0 0 0 0 0 0 0 0
1 -1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 1 0 0 0 0 0 0 0 0 0
1 -8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 1 0 0 0 0 0 0 0
1 8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 0 0 1 0 0 0 0 0 0
1 1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 0 0 0 0 1 0 0 0 0 0
1 -4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 1 0 0 0 0
1 4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 0 1 0 0 0
1 0 -6880887921216781/18014398509481984 0 0 0 0 0 0 0 0 0 1 0
1 1000927696824871/2251799813685248 -6629910960894707/18014398509481984 0 0 0 0 0 0 0 0 0 0 1
1 0 0 2 0 0 0 0 0 0 0 0 0 0
1 -1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 2 0 0 0 0 0 0 0 0 0
1 -8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 2 0 0 0 0 0 0 0 0
1 0 1 0 0 0 2 0 0 0 0 0 0 0
1 8566355578160561/9007199254740992 5566755204060609/18014398509481984 0 0 0 0 2 0 0 0 0 0 0
1 1323574716436937/2251799813685248 -7286977229400801/9007199254740992 0 0 0 0 0 2 0 0 0 0 0
1 -4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 2 0 0 0 0
1 4044484486813853/18014398509481984 5566755204060609/18014398509481984 0 0 0 0 0 0 0 2 0 0 0
1 0 -6880887921216781/18014398509481984 0 0 0 0 0 0 0 0 0 2 0
1 1000927696824871/2251799813685248 -6629910960894707/18014398509481984 0 0 0 0 0 0 0 0 0 0 2
I need to use this data to do computations, which I am doing using Python.
In order for me to run my algorithm on the data, I need to first organize it into numpy arrays as follows:
F_2 = np.array([
[0,0,1,0,0,0,0,0,0,0,0,0,0],
[-1323574716436937/2251799813685248,-7286977229400801/9007199254740992,0,1,0,0,0,0,0,0,0,0,0],
[-8566355578160561/9007199254740992,5566755204060609/18014398509481984,0,0,1,0,0,0,0,0,0,0,0],
[0,1,0,0,0,1,0,0,0,0,0,0,0],
[8566355578160561/9007199254740992,5566755204060609/18014398509481984,0,0,0,0,1,0,0,0,0,0,0],
[1323574716436937/2251799813685248,-7286977229400801/9007199254740992,0,0,0,0,0,1,0,0,0,0,0],
[-4044484486813853/18014398509481984,5566755204060609/18014398509481984,0,0,0,0,0,0,1,0,0,0,0],
[4044484486813853/18014398509481984,5566755204060609/18014398509481984,0,0,0,0,0,0,0,1,0,0,0],
[0,-6880887921216781/18014398509481984,0,0,0,0,0,0,0,0,0,1,0],
[1000927696824871/2251799813685248,-6629910960894707/18014398509481984,0,0,0,0,0,0,0,0,0,0,1],
[0,0,2,0,0,0,0,0,0,0,0,0,0],
[-1323574716436937/2251799813685248,-7286977229400801/9007199254740992,0,2,0,0,0,0,0,0,0,0,0],
[-8566355578160561/9007199254740992,5566755204060609/18014398509481984,0,0,2,0,0,0,0,0,0,0,0],
[0,1,0,0,0,2,0,0,0,0,0,0,0],
[8566355578160561/9007199254740992,5566755204060609/18014398509481984,0,0,0,0,2,0,0,0,0,0,0],
[1323574716436937/2251799813685248,-7286977229400801/9007199254740992,0,0,0,0,0,2,0,0,0,0,0],
[-4044484486813853/18014398509481984,5566755204060609/18014398509481984,0,0,0,0,0,0,2,0,0,0,0],
[4044484486813853/18014398509481984,5566755204060609/18014398509481984,0,0,0,0,0,0,0,2,0,0,0],
[0,-6880887921216781/18014398509481984,0,0,0,0,0,0,0,0,0,2,0],
[1000927696824871/2251799813685248,-6629910960894707/18014398509481984,0,0,0,0,0,0,0,0,0,0,2]
])
This is extremely tedious to do by hand, since I have to place the data into a 2D numpy array manually: adding commas between the numbers, wrapping the sequence of numbers on each line in square brackets to form the rows of the 2D array, and so on.
I am wondering if there is a way to do this much faster with programming commands (especially since I have to do this many times)?
Thank you very much in advance.
You can use pandas:
import pandas as pd
df = pd.read_csv('yourContent', sep=' ', header=None)
You could copy-paste your data into a text file and then use numpy.genfromtxt(), e.g.:
import numpy as np
arr = np.genfromtxt(filepath)
More info on how to use it is in the linked documentation.
An even more efficient approach would be to collect the output of your script directly. One way of doing this in Python is to run the output-producing program via the subprocess module (e.g. subprocess.run()).
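One caveat worth noting: the polymake output contains exact rationals like -8566355578160561/9007199254740992, which neither numpy.genfromtxt nor a plain read_csv will convert to floats. Here is a sketch that parses the pasted output directly; the file name is an assumption, and the leading homogenizing 1 of each row is dropped, as in your hand-typed F_2:
from fractions import Fraction
import numpy as np
facets = {}
current = None
with open('polymake_output.txt') as fh:  # hypothetical file name
    for line in fh:
        line = line.strip()
        if not line:
            continue
        if line.startswith('facet'):
            current = int(line.split()[1])  # e.g. "facet 2 contains vertices:"
            facets[current] = []
        else:
            # drop the leading 1, convert each (possibly fractional) token
            facets[current].append(
                [float(Fraction(tok)) for tok in line.split()[1:]])
arrays = {k: np.array(rows) for k, rows in facets.items()}
F_2 = arrays[2]
print(F_2.shape)  # (20, 13) for the facet shown above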

Fast Knapsack Solver For big problems

I want to approximately solve the knapsack problem for big data sets using Python.
Right now, I am using this implementation, which works well for small examples like:
import knapsack
weight = np.random.randint(10, size = 10)
value = np.random.randint(10, size = 10)
capacity = 5
knapsack.knapsack(weight, value).solve(capacity)
but when we scale it up to:
import knapsack
weight = np.random.randint(10, size = 1000)
value = np.random.randint(10, size = 1000)
capacity = 500
knapsack.knapsack(weight, value).solve(capacity)
the program just gets stuck and gives an error. I was wondering if there is an implementation of the knapsack problem where I can say something like "compute for 10 seconds and return the best solution found so far". Is this possible?
Here is a small prototype: a 0-1 integer-programming formulation of the 0-1 knapsack, solved with CoinOR's Cbc through cylp.
This code is not doing everything perfectly:
- It models the 0/1 restriction with constraints, where variable bounds would be more efficient (too lazy to check cylp again for that; I had problems with it in the past).
- There is not much support for Windows. Windows users: go for pulp, which brings the same solver (imho the best free open-source MIP solver), although modelling looks quite different there.
- No tuning was done. Observe that CoinOR's Cgl, which is used by the Cbc solver, supports extra knapsack cuts, but as the logs show, this example is too simple for them to take effect.
- Bounded/unbounded knapsack versions are easily handled by just modifying the bounds.
The example here just solves the one problem defined by the OP using a PRNG seed of 1, where it takes 0.02 seconds, but that's not a scientific test! NP-hard problems are all about easy vs. hard instances (huge variance!), so data to check against is important. One can observe that there is no real integrality gap for this example.
Code
Code
import numpy as np
import scipy.sparse as sp
from cylp.cy import CyClpSimplex
np.random.seed(1)
""" INSTANCE """
weight = np.random.randint(10, size = 1000)
value = np.random.randint(10, size = 1000)
capacity = 500
""" SOLVE """
n = weight.shape[0]
model = CyClpSimplex()
x = model.addVariable('x', n, isInt=True)
model.objective = -value
model += sp.eye(n) * x >= np.zeros(n) # lower bounds; could be done via variable bounds instead
model += sp.eye(n) * x <= np.ones(n)  # upper bounds; same remark
model += np.matrix(weight) * x <= capacity # cylp somewhat outdated in terms of np-usage!
cbcModel = model.getCbcModel() # Clp -> Cbc model / LP -> MIP
cbcModel.logLevel = True
status = cbcModel.solve()
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int) # assumes there is one
print(x_sol)
print(x_sol.dot(weight))
print(x_sol.dot(value))
Output
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Jan 15 2018
command line - ICbcModel -solve -quit (default strategy 1)
Continuous objective value is -1965.33 - 0.00 seconds
Cgl0004I processed model has 1 rows, 542 columns (542 integer (366 of which binary)) and 542 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 1 integers unsatisfied sum - 0.333333
Cbc0038I Pass 1: suminf. 0.25000 (1) obj. -1965 iterations 1
Cbc0038I Solution found of -1965
Cbc0038I Branch and bound needed to clear up 1 general integers
Cbc0038I Full problem 1 rows 542 columns, reduced to 1 rows 128 columns
Cbc0038I Cleaned solution of -1965
Cbc0038I Before mini branch and bound, 540 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.02 seconds)
Cbc0038I After 0.02 seconds - Feasibility pump exiting with objective of -1965 - took 0.01 seconds
Cbc0012I Integer solution of -1965 found by feasibility pump after 0 iterations and 0 nodes (0.02 seconds)
Cbc0038I Full problem 1 rows 542 columns, reduced to 1 rows 2 columns
Cbc0001I Search completed - best objective -1965, took 0 iterations and 0 nodes (0.02 seconds)
Cbc0035I Maximum depth 0, 362 variables fixed on reduced cost
Cuts at root node changed objective from -1965.33 to -1965.33
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: -1965.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.02
Time (Wallclock seconds): 0.02
Total time (CPU seconds): 0.02 (Wallclock seconds): 0.02
[0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0
0 1 0 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0
0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 1
1 0 0 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 1
0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0
1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 1 1 0
0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 1 0 1 0 0
0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1
0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1
0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0
0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1
0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0 0
1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0
0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 1
1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0
0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 1
0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1
0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 0 0 1
1 1 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0
0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 0
0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 1 1
0 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 1 0
0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 1 0 0
0 0 1 1 1 0 0 1 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1
0]
500
1965
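If pulling in a MIP solver is not an option, a classic fallback for big instances is the greedy value-density heuristic: sort items by value/weight and take whatever still fits. This is a sketch of an approximation (the function name is mine), not a replacement for the exact approach above, but it runs in milliseconds even for huge inputs:
import numpy as np
def greedy_knapsack(weight, value, capacity):
    # take items in decreasing value/weight order while they still fit
    ratio = value / np.maximum(weight, 1e-9)  # guard against zero weights
    chosen = np.zeros(len(weight), dtype=int)
    remaining = capacity
    for i in np.argsort(-ratio):
        if weight[i] <= remaining:
            chosen[i] = 1
            remaining -= weight[i]
    return chosen
np.random.seed(1)
weight = np.random.randint(10, size=1000)
value = np.random.randint(10, size=1000)
chosen = greedy_knapsack(weight, value, 500)
print(chosen.dot(weight), chosen.dot(value))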

Performing PCA on a dataframe with Python with sklearn

I have a sample input file that has many rows (one per variant), and the columns represent the components.
A01_01 A01_02 A01_03 A01_04 A01_05 A01_06 A01_07 A01_08 A01_09 A01_10 A01_11 A01_12 A01_13 A01_14 A01_15 A01_16 A01_17 A01_18 A01_19 A01_20 A01_21 A01_22 A01_23 A01_24 A01_25 A01_26 A01_27 A01_28 A01_29 A01_30 A01_31 A01_32 A01_33 A01_34 A01_35 A01_36 A01_37 A01_38 A01_39 A01_40 A01_41 A01_42 A01_43 A01_44 A01_45 A01_46 A01_47 A01_48 A01_49 A01_50 A01_51 A01_52 A01_53 A01_54 A01_55 A01_56 A01_57 A01_58 A01_59 A01_60 A01_61 A01_62 A01_63 A01_64 A01_65 A01_66 A01_67 A01_69 A01_70 A01_71
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
I first import this .txt file as:
#!/usr/bin/env python
from sklearn.decomposition import PCA
inputfile = open('sample_input_file', 'r')
I would like to perform principal component analysis and plot the first two components (meaning the first two columns). After reading about sklearn, I am not sure this is the way to go about it. PCA for two components:
pca = PCA(n_components=2)
pca.fit(inputfile)  # not sure how this reads in the file
Therefore, I need help importing my input file as a dataframe for Python to perform PCA on it
sklearn works with numpy arrays, so you want numpy.loadtxt (skipping the header row):
import numpy
data = numpy.loadtxt('sample_input_file', skiprows=1)
pca = PCA(n_components=2)
pca.fit(data)
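Since the goal was plotting the first two components, a typical follow-up (matplotlib assumed) would use fit_transform to project the data and scatter the result:
import numpy
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
data = numpy.loadtxt('sample_input_file', skiprows=1)
pcs = PCA(n_components=2).fit_transform(data)  # shape (n_rows, 2)
plt.scatter(pcs[:, 0], pcs[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()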
