I'm working on some cryptocurrency analysis and have encountered the following problem.
Currently I have a matrix of 20 cryptocurrencies and their correlations, which looks like this:
[image: correlation matrix]
Is there any way I could rearrange that matrix into a block-diagonal form, so that cryptocurrencies which are strongly correlated with each other (and not strongly correlated with the rest) end up grouped in blocks, and the matrix would look something like this:
[image: example of the desired output]
I have no idea how to accomplish that.
Related
I have a number of time series of angular data. These values are not vectors (no magnitude), just angles. I need to determine how correlated the various time series are with each other (e.g., I would like to obtain a correlation matrix) over the duration of the data. For example, some are measured very close to each other and I expect they will be highly correlated, but I'm also interested in seeing how correlated the more distant measurements are.
How would I go about adapting this angular data so that I can obtain a correlation matrix? I thought about just vectorizing it (i.e., with unit vectors), but then I'm not sure how to do the correlation analysis with this two-dimensional data, as I've only done it with one-dimensional data previously. Of course, I can't simply analyze the correlation of the angles themselves, due to the nature of angular data (the wrap-around at 0-360).
I'm working in Python, so if anyone has any recommendations on relevant packages I would appreciate it.
I have found a solution in the Astropy Python package. The following function is suitable for circular correlation:
https://docs.astropy.org/en/stable/api/astropy.stats.circcorrcoef.html
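For reference, a minimal sketch of how that function might be used to build a pairwise circular-correlation matrix; the angles array, its shape, and the units are just made-up placeholders:

import numpy as np
from astropy import units as u
from astropy.stats import circcorrcoef

# made-up data: 5 angular time series, 1000 samples each, in degrees
angles = np.random.uniform(0, 360, size=(5, 1000)) * u.deg

n = angles.shape[0]
corr = np.eye(n)
for i in range(n):
    for j in range(i + 1, n):
        # circular analogue of Pearson's r for one pair of series
        corr[i, j] = corr[j, i] = float(circcorrcoef(angles[i], angles[j]))

print(np.round(corr, 3))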
Say I calculated the correlations of prices of 500 stocks, and stored them in a 500x500 correlation matrix, with 1s on the diagonal.
How can I cluster the correlations into smaller correlation matrices (in Python), such that the correlations of the stocks in each matrix are maximized? That is, I would like to cluster the stocks such that, within each cluster, the stock prices are all highly correlated with one another.
There is no upper bound on how many smaller matrices I can cluster into, although preferably their sizes are similar, i.e. it is better to have three 100x100 matrices and one 200x200 matrix than, say, a 10x10 matrix, a 90x90 matrix and a 400x400 matrix (i.e. minimize the standard deviation of the matrix sizes).
Preferably to be done in Python. I've tried to look up SciPy's clustering libraries but have not yet found a solution (I'm new to SciPy and such statistical programming problems).
Any help that points me in the right direction is much appreciated!
The obvious choice here is hierarchical agglomerative clustering.
Beware that most tools (e.g., sklearn) expect a distance matrix. But it can be trivially implemented for a similarity matrix instead. Then you can use correlation. This is textbook stuff.
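One concrete route, as a sketch rather than the only way: convert the correlations to distances with d = 1 - r and feed them to SciPy's hierarchical clustering. The matrix below is a random placeholder and the cluster count is arbitrary:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# placeholder for the real 500x500 correlation matrix
returns = np.random.default_rng(0).normal(size=(1000, 500))
corr = np.corrcoef(returns, rowvar=False)

# similarity -> dissimilarity; d = 1 - r is one common choice
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)

# average-linkage dendrogram on the condensed distance vector
Z = linkage(squareform(dist, checks=False), method='average')

# cut the tree into, say, 4 clusters of stocks
labels = fcluster(Z, t=4, criterion='maxclust')
print(np.bincount(labels)[1:])  # cluster sizes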
I am working with large datasets of protein-protein similarities generated in NCBI BLAST. I have stored the results in large pairwise matrices (25,000 x 25,000) and I am using multidimensional scaling (MDS) to visualize the data. These matrices were too large to work with in RAM, so I stored them on disk in HDF5 format and accessed them with the h5py module.
The sklearn manifold MDS method generated great visualizations for small-scale data in 3D, so that is the one I am currently using. For the calculation, it requires a complete symmetric pairwise dissimilarity matrix. However, with large datasets, a sort of "crust" forms that obscures the clusters.
I think the problem is that I am required to input a complete dissimilarity matrix. Some proteins are not related to each other, but in the pairwise dissimilarity matrix, I am forced to input a default max value of dissimilarity. In the documentation of sklearn MDS, it says that a value of 0 is considered a missing value, but inputting 0 where I want missing values does not seem to work.
Is there any way of inputting an incomplete dissimilarity matrix so unrelated proteins don't have to be inputted? Or is there a better/faster way to visualize the data in a pairwise dissimilarity matrix?
MDS requires a full dissimilarity matrix AFAIK. However, I think it is probably not the best tool for what you plan to achieve. Assuming that your dissimilarity matrix is metric (which need not be the case), it surely can be embedded in 25,000 dimensions, but "crushing" that to 3D will "compress" the data points together too much. That results in the "crust" you'd like to peel away.
I would rather run a hierarchical clustering algorithm on the dissimilarity matrix, then sort the leaves (i.e. the proteins) so that the similar ones are kept together, and then visualize the dissimilarity matrix with rows and columns permuted according to the ordering generated by the clustering. Assuming short distances are colored yellow and long distances are blue (think of the color blind! :-) ), this should result in a matrix with big yellow rectangles along the diagonal where the similar proteins cluster together.
You would have to downsample the image or buy a 25,000 x 25,000 screen :-) but I assume you want to have an "overall" low-resolution view anyway.
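A minimal sketch of the reorder-and-plot idea, assuming D is your precomputed dissimilarity matrix (here a small random placeholder):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist, squareform

# placeholder dissimilarity matrix D (symmetric, zero diagonal)
pts = np.random.default_rng(1).normal(size=(300, 5))
D = squareform(pdist(pts))

# hierarchical clustering, then take the dendrogram's leaf ordering
Z = linkage(squareform(D, checks=False), method='average')
order = leaves_list(Z)

# permute rows and columns by that ordering and plot
plt.imshow(D[np.ix_(order, order)], cmap='viridis')  # colour-blind-friendly map
plt.colorbar(label='dissimilarity')
plt.show()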
There are many algorithms that go under the name of nonlinear dimensionality reduction. You can find a long list of them on Wikipedia; most have been developed in recent years. If PCA doesn't work well for your data, I would try CCA or t-SNE. The latter is especially good at revealing cluster structure.
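If you want to try t-SNE directly on a precomputed dissimilarity matrix, something along these lines should work; the data is a placeholder, recent sklearn versions require init='random' together with metric='precomputed', and the full square matrix may force you to downsample first:

import numpy as np
from sklearn.manifold import TSNE
from scipy.spatial.distance import pdist, squareform

# placeholder dissimilarity matrix; 25,000 points would likely need downsampling
pts = np.random.default_rng(2).normal(size=(500, 10))
D = squareform(pdist(pts))

emb = TSNE(n_components=2, metric='precomputed', init='random',
           perplexity=30).fit_transform(D)
print(emb.shape)  # (500, 2)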
I have a data set consisting of ~200 99x20 arrays of frequencies, with each column summing to unity. I have plotted these as heatmaps (image omitted). Each array is pretty sparse, with only about 1-7 of the 20 values being nonzero at each of the 99 positions.
However, I would like to cluster these samples in terms of how similar their frequency profiles are (minimum euclidean distance or something like that). I have arranged each 99x20 array into a 1980x1 array and aggregated them into a 200x1980 observation array.
Before finding the clusters, I tried whitening the data using scipy.cluster.vq.whiten. whiten scales each column by its standard deviation, but due to the way I've flattened my data arrays, I have some (8) columns whose frequencies are all zero, so the standard deviation is zero. Therefore the whitened array has infinite values and the centroid finding fails (or gives ~200 centroids).
My question is, how should I go about resolving this? So far, I've tried
Don't whiten the data. This causes k-means to give different centroids every time it's run (somewhat expected), despite increasing the iter keyword considerably.
Transposing the arrays before I flatten them. The zero variance columns just shift.
Is it ok to just delete some of these zero variance columns? Would this bias the clustering in any way?
EDIT: I have also tried using my own whiten function, which just does
import numpy as np

# scale each column by its standard deviation, skipping (near-)constant columns
for i in range(arr.shape[1]):
    if np.abs(arr[:, i].std()) < 1e-8:
        continue
    arr[:, i] /= arr[:, i].std()
This seems to work, but I'm not sure if this is biasing the clustering in any way.
Thanks
Removing the columns that are all 0's should not bias the data. If you have N-dimensional data but one dimension is the same number for every point, it is exactly the same as having (N-1)-dimensional data. This notion of effective dimensionality is called rank.
Consider 3-D data, but all of your data points are on the x=0 plane. Can you see how this is exactly the same as 2D data?
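A tiny numerical illustration of that point, with made-up data: pairwise Euclidean distances do not change when a constant column is dropped.

import numpy as np
from scipy.spatial.distance import pdist

x = np.random.default_rng(3).normal(size=(10, 5))
x[:, 2] = 0.0  # one all-zero (constant) column

d_full = pdist(x)                          # distances in 5-D
d_drop = pdist(np.delete(x, 2, axis=1))    # distances with that column removed
print(np.allclose(d_full, d_drop))         # True: nothing changed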
First of all, dropping constant columns is perfectly fine. Obviously they do not contribute information, so no reason to keep them.
However, K-means is not particularly good for sparse vectors. The problem is that most likely the resulting "centroids" will be more similar to each other than to the cluster members.
See, in sparse data, every object is to some extent an outlier. And K-means is quite sensitive to outliers because it tries to minimize the sum of squares.
I suggest that you do the following:
Find a similarity measure that works for your domain. Spend quite a lot of time on this: figure out how to capture similarity for your particular use case.
Once you have that similarity, compute the 200x200 similarity matrix. As your data set is really tiny, you can actually run expensive clustering methods, such as hierarchical clustering, that would not scale to thousands of objects (see the sketch below). If you want, you could also try OPTICS or DBSCAN; DBSCAN in particular becomes more interesting when your data set is much larger. For tiny data sets, hierarchical clustering is fine.
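Here is a rough sketch of that workflow with scikit-learn's AgglomerativeClustering on a precomputed dissimilarity. The cosine similarity used here is only a stand-in for whatever domain-specific measure you settle on, the cluster count is arbitrary, and note that older sklearn versions call the metric parameter affinity:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# placeholder data: 200 observations x 1980 features
X = np.random.default_rng(4).random(size=(200, 1980))

# stand-in similarity: cosine similarity between rows; swap in your own measure
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T                        # 200 x 200 similarity matrix

dist = np.clip(1.0 - sim, 0.0, None)   # dissimilarity for the precomputed metric
np.fill_diagonal(dist, 0.0)

model = AgglomerativeClustering(n_clusters=5, metric='precomputed',
                                linkage='average')
labels = model.fit_predict(dist)
print(np.bincount(labels))             # cluster sizes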
I'm currently trying to classify text. My dataset is too big and, as suggested here, I need to use a sparse matrix. My question now is: what is the right way to add an element to a sparse matrix? Let's say, for example, I have a matrix X which is my input.
X = np.random.randint(2, size=(6, 100))
Now this matrix X looks like an ndarray of an ndarray (or something like that).
If I do
X2 = csr_matrix(X)
I have the sparse matrix, but how can I add another element to the sparse matrix?
For example, how do I turn a dense element like [1,0,0,0,1,1,1,0,...,0,1,0] into a sparse vector and add it to the sparse input matrix?
(btw, I'm very new at python, scipy,numpy,scikit ... everything)
Scikit-learn has great documentation, with tutorials that you really should read before trying to invent things yourself. This one is the first one to read; it explains how to classify text step by step, and this one is a detailed example of text classification using a sparse representation.
Pay extra attention to the parts where they talk about sparse representations, in this section. In general, if you want to use an SVM with a linear kernel and you have a large amount of data, LinearSVC (which is based on Liblinear) is better.
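For context, a bare-bones version of the pipeline those tutorials build; the corpus and labels below are made up, and TfidfVectorizer already produces the sparse matrix for you:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# toy corpus and made-up labels (1 = spam, 0 = not spam)
texts = ["cheap pills buy now", "meeting moved to friday",
         "win a free prize now", "lunch at noon tomorrow"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)      # scipy CSR sparse matrix
clf = LinearSVC().fit(X, labels)

print(clf.predict(vec.transform(["free pills now"])))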
Regarding your question - I'm sure there are many ways to concatenate two sparse matrices (by the way, that is what you should search Google for to find other ways of doing it). Here is one, but you'll have to convert from csr_matrix to coo_matrix, which is another type of sparse matrix: Is there an efficient way of concatenating scipy.sparse matrices?.
EDIT: When concatenating two matrices (or a matrix and an array, which is a 1-dimensional matrix), the general idea is to concatenate X1.data and X2.data and manipulate their indices and indptrs (or row and col in the case of coo_matrix) so that they point to the correct places. Some sparse representations are better for specific operations and more awkward for others, so you should read about csr_matrix and see whether it is the best representation for you. But I really urge you to start from the tutorials I posted above.
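For the specific case in the question, appending one dense row to an existing csr_matrix, here is a sketch using scipy.sparse.vstack, which handles that index bookkeeping for you:

import numpy as np
from scipy.sparse import csr_matrix, vstack

X = np.random.randint(2, size=(6, 100))
X2 = csr_matrix(X)

new_row = np.random.randint(2, size=(1, 100))        # the dense element to add
X2 = vstack([X2, csr_matrix(new_row)], format='csr')

print(X2.shape)  # (7, 100)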