The title pretty much says it all: I have a DataFrame with 40+ dimensions that I'd like to feed into the UMAP algorithm in order to get a 2-D output.
I would like to know whether it is possible to weight the input columns differently, so that I can study how the UMAP outcome changes.
Thank you for your time.
P.S. I work in Python.
Why not simply apply UMAP to A:
A = X*W
where X is your Nx40 matrix and W=diag(w) is a 40x40 diagonal matrix of weights w=[w1, w2,..., w40]?
Consider using normalized weights wi, i=1,2,...,40, such that sum(w) == 1, so that each weight expresses the relative contribution of its column.
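A minimal sketch of the column weighting suggested above, using random stand-in data. The weight values are hypothetical, and the UMAP call is shown only as a comment because it assumes the umap-learn package is installed. Multiplying by diag(w) is the same as scaling each column, so broadcasting avoids building the 40x40 matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))      # stand-in for the N x 40 data matrix

w = rng.uniform(0.5, 2.0, size=40)  # hypothetical per-column weights
w = w / w.sum()                     # normalize so that sum(w) == 1

A = X * w                           # broadcasting; same result as X @ np.diag(w)

# A can then be passed to UMAP instead of X, e.g. (assumes umap-learn):
#   import umap
#   embedding = umap.UMAP(n_components=2).fit_transform(A)
```

With a Euclidean metric, column j contributes w[j]**2 times its original share to each squared distance, so it is the relative sizes of the weights that decide which points end up as neighbours. If the columns are on very different scales, standardizing them before weighting is probably a good idea.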
I am performing regression analysis on some reasonably large vectors (for now, working with numpy and other scientific tools is ok if I leave the computer working overnight) but they will grow by several factors eventually, and so I was looking to improve performance, moving the implementation to pytorch.
The regression is fairly simple. I have two arrays, predictions and betas, with dimensions (750, 6340) and (750, 4313) respectively. The least squares solution I am looking for is predictions * x = betas, where x would have dimensions (6340, 4313), but I have to account for intercepts in the regression. With numpy I solved this by iterating through the second dimension of predictions, building a two-column array from each column plus a column of ones, and passing that as the first argument to np.linalg.lstsq:
import numpy as np

for candidate in range(predictions.shape[1]):  # each column is a candidate
    prediction = predictions[:, candidate]
    # allow for an intercept by adding a column of ones
    prediction = np.vstack([prediction, np.ones(prediction.shape[0])]).T
    sol = np.linalg.lstsq(prediction, betas, rcond=-1)
Question number 1 would be: is there a way to avoid iterating over each candidate in order to allow the least squares calculation to account for an intercept? That would improve computation time by a lot.
I tried using statsmodels.regression.linear_model.ols, which allows for an intercept by default (you can add a -1 to the formula if you want it removed), but this approach either forces me to iterate over each candidate (using apply was appealing but didn't noticeably improve computation time) or there is something I'm missing. Question 1.5 then: can I use this tool in such a way, or is that all there is to it?
Similarly in pytorch I would do
t_predictions = torch.tensor(predictions, dtype=torch.float)
t_betas_roi = torch.tensor(betas, dtype=torch.float)
t_sol = torch.linalg.lstsq(t_predictions, t_betas_roi)
And it's fast indeed, but I'm missing the intercept here. I reckon if I did this with numpy instead of iterating as I do it would also be much faster but either way, if question 1 has a solution I imagine it could be similarly applied here, right?
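One possible way to avoid the loop, sketched under the assumption that each candidate column is regressed separately with its own intercept (as in the numpy loop above): for a single predictor the least squares slope is cov(p, b) / var(p) and the intercept is mean(b) - slope * mean(p), so all candidate/target pairs can be computed with one matrix product. The function name is made up for this illustration.

```python
import numpy as np

def simple_regressions(predictions, betas):
    """Fit betas[:, j] ~ slope * predictions[:, c] + intercept for every pair (c, j)."""
    p_mean = predictions.mean(axis=0)            # (n_candidates,)
    b_mean = betas.mean(axis=0)                  # (n_targets,)
    p_centered = predictions - p_mean
    b_centered = betas - b_mean
    # slope = cov(p, b) / var(p), for all pairs at once
    slopes = (p_centered.T @ b_centered) / (p_centered ** 2).sum(axis=0)[:, None]
    intercepts = b_mean[None, :] - slopes * p_mean[:, None]
    return slopes, intercepts                    # both (n_candidates, n_targets)
```

Row c of slopes and intercepts should match sol[0][0] and sol[0][1] from the loop for that candidate. The same operations (mean, subtraction, matrix product) exist on torch tensors, so the sketch should translate to pytorch almost line by line. If a single joint fit over all columns with one intercept is wanted instead, appending one column of ones to the whole predictions matrix before calling lstsq is the usual route.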
My problem is this: I have a GMM model with K multivariate Gaussians, and I also have N samples.
I want to create an N*K numpy matrix whose [i,k] cell holds the pdf of the k'th Gaussian evaluated at the i'th sample, i.e. Q[i,k] = N(x_i | mu_k, Sigma_k).
This is what I have now (I'm working with Python):
Q = np.array([scipy.stats.multivariate_normal(mu_t[k], cov_t[k]).pdf(X) for k in range(self.K)]).T
X in the code is a matrix whose rows are my samples.
It works fine on a small, low-dimensional toy dataset, but the dataset I'm working with is 10,000 28*28 pictures, and on it this line runs extremely slowly...
I want to find a solution that doesn't involve loops but only vector/matrix operations (i.e. vectorization). The scipy 'multivariate_normal' function cannot take the parameters of more than one Gaussian, as far as I understand it (but its 'pdf' function can evaluate multiple samples at once).
Does anyone have an idea?
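I am afraid that the main speed killer in your problem is the inversion and determinant calculation for the cov_t matrices. If you precalculate these once, you can unroll the calculation: use broadcasting to get all combinations of x_i - mu_k, and then np.einsum to evaluate the full formula of the multivariate normal density for every pair.
Try
S = X[:, None, :] - mu_t[None, :, :]                     # (N, K, D): all x_i - mu_k
cov_t_inv = np.linalg.inv(cov_t)                         # (K, D, D), computed once
_, cov_t_logdet = np.linalg.slogdet(cov_t)               # (K,), log-determinants
D = X.shape[1]
maha = np.einsum('ikr,krs,iks->ik', S, cov_t_inv, S)     # (x_i - mu_k)^T inv(cov_k) (x_i - mu_k)
log_Q = -0.5 * (D * np.log(2 * np.pi) + cov_t_logdet + maha)
Q = np.exp(log_Q)
Here cov_t is a (K, D, D) array, cov_t_inv holds the inverse covariance matrices and cov_t_logdet the logarithms of their determinants. The density is assembled in log space because for D = 784 the factor (2*pi)^D overflows float64; for the same reason Q itself may underflow to zero, and working with log_Q directly is usually safer.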
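A Cholesky factorization gives both the log-determinant and the quadratic form without forming an explicit inverse. The following is only a sketch: the function name is hypothetical, and it assumes every covariance matrix is symmetric positive definite (for image data a small multiple of the identity may have to be added first). It returns log-densities, which stay finite where the densities themselves would underflow for 784-dimensional samples.

```python
import numpy as np

def gmm_log_pdf(X, means, covs):
    """Return an (N, K) array with log N(x_i | mu_k, Sigma_k)."""
    X = np.asarray(X, dtype=float)          # (N, D) samples as rows
    means = np.asarray(means, dtype=float)  # (K, D)
    covs = np.asarray(covs, dtype=float)    # (K, D, D), positive definite
    D = X.shape[1]
    chol = np.linalg.cholesky(covs)         # (K, D, D) lower-triangular factors L_k
    # log|Sigma_k| = 2 * sum(log(diag(L_k)))
    log_det = 2.0 * np.log(np.diagonal(chol, axis1=1, axis2=2)).sum(axis=1)
    diff = X[None, :, :] - means[:, None, :]              # (K, N, D)
    # solve L_k z = (x_i - mu_k); ||z||^2 is then the Mahalanobis term
    z = np.linalg.solve(chol, diff.transpose(0, 2, 1))    # (K, D, N)
    maha = (z ** 2).sum(axis=1)                           # (K, N)
    return (-0.5 * (D * np.log(2.0 * np.pi) + log_det[:, None] + maha)).T
```

np.exp(gmm_log_pdf(X, mu_t, cov_t)) would give the matrix Q from the question when the values are representable; for EM-style responsibilities it is usually better to keep the logs and normalize with a log-sum-exp.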
I'm trying to plot a 3-feature dataset with a binary classification on a matplotlib plot. This worked with an example dataset provided in a guide (http://www.apnorton.com/blog/2016/12/19/Visualizing-Multidimensional-Data-in-Python/) but when I try to instead insert my own dataset, the LinearDiscriminantAnalysis will only output a one-dimensional series, no matter what number I put in "n_components". Why would this not work with my own code?
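import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

Data = pd.read_csv("DataFrame.csv", sep=";")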
x = Data.iloc[:, [3, 5, 7]]
y = Data.iloc[:, 8]
lda = LDA(n_components=2)
lda_transformed = pd.DataFrame(lda.fit_transform(x, y))
plt.scatter(lda_transformed[y==0][0], lda_transformed[y==0][1], label='Loss', c='red')
plt.scatter(lda_transformed[y==1][0], lda_transformed[y==1][1], label='Win', c='blue')
plt.legend()
plt.show()
Linear discriminant analysis produces at most min(C - 1, d) discriminating components, where C is the number of different class labels and d is the number of features. The n_components argument in the sklearn API is only a means to choose fewer components, e.g. when you know what dimensionality you'd like to reduce down to. You can never use n_components to get more components.
This is discussed in the Wikipedia section on Multiclass LDA. The definition of the between-class scatter is given as
\Sigma_{b} = \frac{1}{C} \sum_{i=1}^{C} (\mu_{i} - \mu)(\mu_{i} - \mu)^{T}
which is the empirical covariance matrix among the population of class means. By definition, such a covariance matrix has rank at most C - 1.
... the variability between features will be contained in the subspace spanned by the eigenvectors corresponding to the C − 1 largest eigenvalues ...
So because LDA uses a decomposition of the class mean covariance matrix, it means the dimensionality reduction it can provide is based on the number of class labels, and not on the sample size nor the feature dimensionality.
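The rank argument can be checked numerically. This sketch builds the covariance of the class means from the formula above on random stand-in data (3 features, as in the question) and compares a binary labelling with a three-class one. The helper name and the data are illustrative only.

```python
import numpy as np

def between_class_scatter(X, y):
    """Covariance of the class means around their average."""
    classes = np.unique(y)
    class_means = np.stack([X[y == c].mean(axis=0) for c in classes])  # (C, n_features)
    centered = class_means - class_means.mean(axis=0)
    return centered.T @ centered / len(classes)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 3 features
y_binary = rng.integers(0, 2, size=200)  # C = 2 class labels
y_three = rng.integers(0, 3, size=200)   # C = 3 class labels

# an explicit tolerance keeps rounding noise from being counted as rank
rank_binary = np.linalg.matrix_rank(between_class_scatter(X, y_binary), tol=1e-10)
rank_three = np.linalg.matrix_rank(between_class_scatter(X, y_three), tol=1e-10)
```

The 3x3 scatter matrix should have rank 1 for the binary labels and rank 2 for the three-class labels, which is why only one discriminant direction is available in the binary case regardless of n_components.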
In the example you linked, it doesn't matter how many features there are. The point is that the example uses 3 simulated cluster centers, so there are 3 class labels. This means linear discriminant analysis can project the data onto either a 1-dimensional or a 2-dimensional discriminating subspace.
But in your data, you start out with only 2 class labels, a binary problem. This means the dimensionality of the linear discriminant model can be at most 1-dimensional, literally a line that forms the decision boundary between the two classes. Dimensionality reduction with LDA in this case would simply be the projection of data points onto a particular normal vector of that separating line.
If you want to specifically reduce down to two dimensions, you can try many of the other algorithms that sklearn provides: t-SNE, ISOMAP, PCA and kernel PCA, random projection, multi-dimensional scaling, among others. Many of these allow you to choose the dimensionality of the projected space, up to the original feature dimensionality, or sometimes you can even project into larger spaces, like with kernel PCA.
In the example you give, dimension reduction by LDA reduces the data from 13 features to 2 features, however in your example it reduces from 3 to 1 (even though you wanted to get 2 features), thus it is not possible to plot in 2D.
If you really want to select 2 features out of 3, you can use feature_selection.SelectKBest to choose 2 best features and there won't be any problems plotting in 2D.
For more information, please read this fantastic answer for PCA:
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
This is probably because the sklearn implementation won't allow it when you only have 2 classes. The problem is described here: https://github.com/scikit-learn/scikit-learn/issues/1967.
I have two N x N co-occurrence matrices (484x484 and 1060x1060) that I have to analyze. The matrices are symmetrical along the diagonal and contain lots of zero values. The non-zero values are integers.
I want to group together the positions that are non-zero. In other words, what I want to do is the algorithm on this link. When order by cluster is selected, the matrix gets re-arranged in rows and columns to group the non-zero values together.
Since I am using Python for this task, I looked into SciPy Sparse Linear Algebra library, but couldn't find what I am looking for.
Any help is much appreciated. Thanks in advance.
If you have a matrix dist with pairwise distances between objects, then you can find the order on which to rearrange the matrix by applying a clustering algorithm on this matrix (http://scikit-learn.org/stable/modules/clustering.html). For example it might be something like:
from sklearn import cluster
import numpy as np
model = cluster.AgglomerativeClustering(n_clusters=20, metric="precomputed", linkage="average").fit(dist)
new_order = np.argsort(model.labels_)
ordered_dist = dist[new_order] # can be your original matrix instead of dist[]
ordered_dist = ordered_dist[:,new_order]
The order is given by the variable model.labels_, which has the number of the cluster to which each sample belongs. A few observations:
You have to find a clustering algorithm that accepts a distance matrix as input. AgglomerativeClustering is such an algorithm (notice the metric="precomputed" option, called affinity in scikit-learn releases before 1.2, which tells it that we are using pre-computed distances; the default linkage="ward" only works with Euclidean distances, hence linkage="average").
What you have seems to be a pairwise similarity matrix, in which case you need to transform it to a distance matrix (e.g. dist=1 - data/data.max())
In the example I assumed 20 clusters, you may have to play with this variable a bit. Alternatively, you might try to find the best one-dimensional representation of your data (using e.g. MDS) to describe the optimal ordering of samples.
Because your data is sparse, treat it as a graph, not a matrix.
Then try the various graph clustering methods. For example cliques are interesting on such data.
Note that not everything may cluster.
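As one simple instance of the graph view, connected components can be used to reorder the matrix. This is a sketch on a toy matrix, assuming scipy is available; the function name is made up, and other graph clustering methods would slot in wherever the labels are computed.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def order_by_component(cooc):
    """Permutation that places items of the same connected component side by side."""
    graph = csr_matrix(cooc)              # non-zero entries are the edges
    n_components, labels = connected_components(graph, directed=False)
    order = np.argsort(labels, kind="stable")
    return order, labels

# toy symmetric co-occurrence matrix: items {0, 2, 4} and {1, 3} co-occur
cooc = np.zeros((5, 5), dtype=int)
for i, j, count in [(0, 2, 3), (2, 4, 1), (1, 3, 2)]:
    cooc[i, j] = cooc[j, i] = count

order, labels = order_by_component(cooc)
reordered = cooc[np.ix_(order, order)]    # permute rows and columns together
```

After the permutation the non-zero values sit in diagonal blocks, one per component. If the whole graph turns out to be a single component, this particular method will not separate anything and a finer graph clustering is needed.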
I am working with data from neuroimaging and because of the large amount of data, I would like to use sparse matrices for my code (scipy.sparse.lil_matrix or csr_matrix).
In particular, I will need to compute the pseudo-inverse of my matrix to solve a least-square problem.
I have found the function scipy.sparse.linalg.lsqr, but it is not very efficient. Is there a method to compute the Moore-Penrose pseudo-inverse (corresponding to pinv for dense matrices)?
The size of my matrix A is about 600'000x2000, and every row of the matrix has between 0 and 4 non-zero values. The size of A is given by voxels x fiber bundles (white matter fiber tracts), and we expect at most 4 tracts to cross in a voxel. In most of the white matter voxels we expect to have at least 1 tract, but I would say that around 20% of the rows could be all zeros.
The vector b should not be sparse; b contains the measurement for each voxel, which is in general not zero.
I would need to minimize the error, but there are also some conditions on the vector x. As I tried the model on smaller matrices, I never needed to constrain the system in order to satisfy these conditions (in general 0
Is that of any help? Is there a way to avoid taking the pseudo-inverse of A?
Thanks
Update 1st June:
thanks again for the help.
I can't really show you anything about my data, because the Python code gives me some problems. However, in order to understand how I could choose a good k, I've tried to create a test function in Matlab.
The code is as follows:
F=zeros(100000,1000);
for k=1:150000
    p=rand(1);
    a=0;
    b=0;
    while a<=0 || b<=0
        a=random('Binomial',100000,p);
        b=random('Binomial',1000,p);
    end
    F(a,b)=rand(1);
end
solution=repmat([0.5,0.5,0.8,0.7,0.9,0.4,0.7,0.7,0.9,0.6],1,100);
size(solution)
solution=solution';
measure=F*solution;
%check=pinvF*measure;
k=250;
F=sparse(F);
[U,S,V]=svds(F,k);
s=svds(F,k);
plot(s)
max(max(U*S*V'-F))
for s=1:k
    if S(s,s)~=0
        S(s,s)=1/S(s,s);
    end
end
inv=V*S'*U';
inv*measure
max(inv*measure-solution)
Do you have any idea what k should be compared to the size of F? I've taken 250 (out of 1000) and the results are not satisfactory (the waiting time is acceptable, but not short).
Also, here I can compare the results with the known solution, but how would one choose k in general?
I also attached the plot of the 250 singular values that I get and their normalized squares. I don't know exactly how to do a better scree plot in Matlab. I'm now proceeding with bigger k to see if the values suddenly become much smaller.
Thanks again,
Jennifer
You could study more on the alternatives offered in scipy.sparse.linalg.
Anyway, please note that a pseudo-inverse of a sparse matrix is most likely to be a (very) dense one, so it's not really a fruitful avenue (in general) to follow, when solving sparse linear systems.
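To illustrate solving the least-squares problem without ever forming a pseudo-inverse, here is a sketch with scipy.sparse.linalg.lsmr on a small random stand-in system. The sizes, sparsity pattern and noise level are invented for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
n_rows, n_cols = 500, 20                      # small stand-in for 600000 x 2000
rows = np.repeat(np.arange(n_rows), 2)        # up to two non-zero entries per row
cols = rng.integers(0, n_cols, size=rows.size)
vals = rng.uniform(0.1, 1.0, size=rows.size)
A = csr_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))

x_true = rng.uniform(0.0, 1.0, size=n_cols)
b = A @ x_true + 0.01 * rng.normal(size=n_rows)   # dense, noisy measurements

# lsmr minimizes ||A x - b|| using only products with A and A.T
x_hat = lsmr(A, b, atol=1e-12, btol=1e-12, maxiter=1000)[0]
```

A stays sparse throughout, so memory use is governed by the number of non-zero entries. If bounds on x are required, scipy.optimize.lsq_linear also accepts a sparse A together with a bounds argument, which may be worth trying.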
You may want to describe your particular problem (dot(A, x) = b + e) in slightly more detail. At least specify:
'typical' size of A
'typical' percentage of nonzero entries in A
least-squares implies that norm(e) is minimized, but please indicate whether your main interest is on x_hat or on b_hat, where e= b- b_hat and b_hat= dot(A, x_hat)
Update: If you have some idea of the rank of A (and it's much smaller than the number of columns), you could try the total least squares method. Here is a simple implementation, where k is the number of first singular values and vectors to use (i.e. the 'effective' rank).
from scipy.sparse import hstack
from scipy.sparse.linalg import svds
def tls(A, b, k= 6):
    """A tls solution of Ax= b, for sparse A."""
    u, s, v= svds(hstack([A, b]), k)
    return v[-1, :-1]/ -v[-1, -1]
Regardless of the answer to my comment, I would think you could accomplish this fairly easily using the Moore-Penrose SVD representation. Find the SVD with scipy.sparse.linalg.svds, replace Sigma by its pseudoinverse, and then multiply V*Sigma_pi*U' to find the pseudoinverse of your original matrix.
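The SVD route can be sketched in Python as follows. Rather than building the (dense) pseudo-inverse, the three factors are applied to b one after the other. The function name and the rel_tol cutoff are assumptions of this sketch, and k must be smaller than both dimensions of A for svds.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def truncated_pinv_solve(A, b, k, rel_tol=1e-10):
    """Apply the rank-k pseudo-inverse of sparse A to b without forming it."""
    u, s, vt = svds(A, k=k)                 # k largest singular triplets
    keep = s > rel_tol * s.max()            # drop numerically zero singular values
    # x = V * diag(1/s) * U^T * b, evaluated right to left
    return vt[keep].T @ ((u[:, keep].T @ b) / s[keep])
```

When k equals the true rank of A this should agree with pinv(A) @ b; with a smaller k it is a regularized approximation, and plotting the singular values returned by svds is one way to judge where to cut.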