I am really new to python and data science and I could really do with some help, please.
I have a dataframe with 440 observations and 6 describing variables. I am supposed to do a hierarchical clustering of the data, but ONLY with the help of the numpy and pandas packages; I cannot use scipy or sklearn. So far I have been able to create the distance matrix (a 440x440 numpy array). I want only two clusters. As for the linkage method, I want to use Ward linkage, but the centroid method would also be OK. How can I create two clusters out of the distance matrix based on the linkage criterion? I thought of something like "find the smallest distance, put the corresponding column/row values in one cluster, remove them from the distance matrix, redo until the old matrix is empty and I have a new matrix with tuples as row/column indices, and redo that until I have only 2 rows/columns left which include all my original observations..."
I know that's not a good description, but as I said I am really new to this and I am thankful for any advice.
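That merge-until-two-clusters idea is essentially agglomerative clustering. Below is a minimal numpy-only sketch of it, using the Lance-Williams recurrence for Ward's method on squared Euclidean distances; the function name ward_clusters and its arguments are just for illustration, and it assumes dist is your 440x440 symmetric matrix of Euclidean distances:

    import numpy as np

    def ward_clusters(dist, n_clusters=2):
        # dist: symmetric (n, n) matrix of Euclidean distances between observations
        n = dist.shape[0]
        d2 = np.asarray(dist, dtype=float) ** 2   # Ward updates work on squared distances
        np.fill_diagonal(d2, np.inf)              # a cluster is never merged with itself
        members = {i: [i] for i in range(n)}      # cluster id -> list of original row indices
        size = {i: 1 for i in range(n)}
        active = list(range(n))
        while len(active) > n_clusters:
            # find the pair of active clusters with the smallest (Ward) distance
            sub = d2[np.ix_(active, active)]
            a, b = np.unravel_index(np.argmin(sub), sub.shape)
            i, j = active[a], active[b]
            ni, nj = size[i], size[j]
            # Lance-Williams update: distance from every other cluster k to the merged cluster
            for k in active:
                if k == i or k == j:
                    continue
                nk = size[k]
                d2[i, k] = d2[k, i] = ((ni + nk) * d2[i, k] + (nj + nk) * d2[j, k]
                                       - nk * d2[i, j]) / (ni + nj + nk)
            members[i].extend(members[j])         # merge cluster j into cluster i
            size[i] = ni + nj
            active.remove(j)                      # j is no longer an active cluster
        return [members[c] for c in active]

Calling ward_clusters(dist, n_clusters=2) would then return two lists of row indices that together cover all 440 observations. It is an O(n^3) approach, which is fine at this size but would not scale much further.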
I have been struggling with a problem for a few days now:
There are 17 numpy arrays with values and corresponding latitude and longitude coordinates. Each of them contains 360*600 points. These points overlap in some parts. What I want to do in the end is to have a composite of the data on one regular grid.
With the common scipy.interpolate.griddata function I have the problem that in these overlapping regions I often get different values. This results in strange artefacts, which you can see in the first image:
My first idea is to take the max value of the values used in the interpolation.
I have found out that scipy.interpolate.griddata uses triangulation to interpolate but actually I can't find a pipeline that I can adapt.
I hope you can understand that I am not sharing any code, because the dataset is huge and my question is more about finding the best practice or getting some interesting ideas to solve this problem. Thanks in advance for your support.
Maybe first calculate the distance matrix between your regular grid points (x) and the existing irregular ones (y):
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
Then, for each grid point, find the indices of the k smallest distances and take the maximum of the corresponding values on the irregular grid.
Disclaimer: I don't know how it scales, or what your requirements are regarding performance.
Edit: You might be able to pre-eliminate datasets for specific regions, to minimise the effort of calculating all the distance matrices.
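A minimal sketch of that k-nearest idea; the names grid_xy, irr_xy, irr_val, composite_max_of_knn and the value of k are hypothetical, and it assumes the full (M, N) distance matrix fits in memory (with 17 x 360 x 600 source points you would probably have to process the regular grid in chunks, or pre-eliminate datasets per region as suggested above):

    import numpy as np
    from scipy.spatial import distance_matrix

    def composite_max_of_knn(grid_xy, irr_xy, irr_val, k=4):
        # grid_xy: (M, 2) regular grid coordinates, irr_xy: (N, 2) irregular coordinates,
        # irr_val: (N,) data values at the irregular points
        D = distance_matrix(grid_xy, irr_xy)          # (M, N) pairwise distances
        knn = np.argpartition(D, k, axis=1)[:, :k]    # indices of the k nearest irregular points
        return irr_val[knn].max(axis=1)               # max of those k values per grid point

If the full matrix does not fit, scipy.spatial.cKDTree.query would give the same k nearest neighbours per grid point without building the whole distance matrix.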
I have a collection of Points objects, containing latitude and longitude (along with a few other irrelevant properties). I want to form clusters, i.e. collections of points that are close together relative to other points.
Alternatively, I would like an algorithm which, if given a list of clusters containing close-by points and a new point, determines which cluster the new point belongs to (and adds it to a new cluster if it doesn't belong to an existing cluster).
I looked at hierarchical clustering algorithms but those run too slow. The k-means algorithm requires you to know the number of clusters beforehand, which is not really very helpful.
Thanks!
Try density based clustering methods.
DBSCAN is one of the most popular of those.
I am assuming you are using python.
Check out this:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
When you cluster based on GPS lat/lon, you may want to use a different distance calculation method than DBSCAN's default. Use its metric parameter to supply your own distance function or a precomputed distance matrix. For the distance calculation, check out the Haversine formula.
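A rough sketch with scikit-learn, assuming points is an (n, 2) numpy array of [latitude, longitude] in degrees; the eps and min_samples values are placeholders you would have to tune:

    import numpy as np
    from sklearn.cluster import DBSCAN

    coords = np.radians(points)            # the haversine metric expects radians
    earth_radius_km = 6371.0
    eps_km = 1.5                           # maximum neighbour distance, in km
    db = DBSCAN(eps=eps_km / earth_radius_km,
                min_samples=3,
                metric='haversine',
                algorithm='ball_tree').fit(coords)
    labels = db.labels_                    # -1 marks noise; other integers are cluster ids

For the "assign a new point to an existing cluster" part of the question, a simple rule would be to compute the same haversine distance from the new point to the members of each cluster and reuse the eps threshold.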
I am working with large datasets of protein-protein similarities generated in NCBI BLAST. I have stored the results in large pairwise matrices (25,000 x 25,000) and I am using multidimensional scaling (MDS) to visualize the data. These matrices were too large to work with in RAM, so I stored them on disk in HDF5 format and access them with the h5py module.
The sklearn manifold MDS method generated great visualizations for small-scale data in 3D, so that is the one I am currently using. For the calculation, it requires a complete symmetric pairwise dissimilarity matrix. However, with large datasets, a sort of "crust" is formed that obscures the clusters that have formed.
I think the problem is that I am required to input a complete dissimilarity matrix. Some proteins are not related to each other, but in the pairwise dissimilarity matrix, I am forced to input a default max value of dissimilarity. In the documentation of sklearn MDS, it says that a value of 0 is considered a missing value, but inputting 0 where I want missing values does not seem to work.
Is there any way of inputting an incomplete dissimilarity matrix so unrelated proteins don't have to be inputted? Or is there a better/faster way to visualize the data in a pairwise dissimilarity matrix?
MDS requires a full dissimilarity matrix AFAIK. However, I think it is probably not the best tool for what you plan to achieve. Assuming that your dissimilarity matrix is metric (which need not be the case), it surely can be embedded in 25,000 dimensions, but "crushing" that to 3D will "compress" the data points together too much. That results in the "crust" you'd like to peel away.
I would rather run a hierarchical clustering algorithm on the dissimilarity matrix, then sort the leaves (i.e. the proteins) so that the similar ones are kept together, and then visualize the dissimilarity matrix with rows and columns permuted according to the ordering generated by the clustering. Assuming short distances are colored yellow and long distances are blue (think of the color blind! :-) ), this should result in a matrix with big yellow rectangles along the diagonal where the similar proteins cluster together.
You would have to downsample the image or buy a 25,000 x 25,000 screen :-) but I assume you want to have an "overall" low-resolution view anyway.
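A rough sketch of that approach with scipy and matplotlib, assuming D is a symmetric dissimilarity matrix that fits in memory (for 25,000 proteins you would visualise a downsampled block, as noted above):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster import hierarchy
    from scipy.spatial.distance import squareform

    Z = hierarchy.linkage(squareform(D, checks=False), method='average')
    order = hierarchy.leaves_list(Z)                        # leaf order keeps similar proteins adjacent
    plt.imshow(D[np.ix_(order, order)], cmap='cividis_r')   # small distances show up yellow
    plt.colorbar(label='dissimilarity')
    plt.show()

The average linkage here is just one choice; any linkage that groups similar proteins will produce the block structure along the diagonal.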
There are many algorithms under the name nonlinear dimensionality reduction. You can find a long list of them on Wikipedia; most were developed in recent years. If PCA doesn't work well for your data, I would try CCA or t-SNE. The latter is especially good at showing cluster structure.
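If you want to try t-SNE directly on your precomputed dissimilarities, a minimal sketch with a recent scikit-learn (assuming D is the square dissimilarity matrix, possibly subsampled to a size t-SNE can handle):

    from sklearn.manifold import TSNE

    # init='random' is required when distances are precomputed
    emb = TSNE(n_components=2, metric='precomputed', init='random').fit_transform(D)
    # emb is (n, 2); scatter-plot emb[:, 0] against emb[:, 1] to look for cluster structure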
I currently have arrays that look something like this:
[ 5.23324730e-03 1.01221129e-04 5.23324730e-03 ...,]
There are 500 such rows and 64 columns. I would like to compare a row like the one above, to other rows in a similar format. That is, I want to compare the 1st element in one array to the first element in the second array and so on.
The idea is to work out how closely they match... Would anyone have any ideas how I might go about this efficiently? I should note that values may not be identical.... But if I could find values that differ by amounts under a certain threshold, that would be fine.
If anyone is wondering - I'm trying to compare SURF descriptors...
Thanks so much for your help!
You can save it as a numpy matrix and then calculate the cosine similarity of each row. This can be done efficiently using the numpy dot product method.
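A minimal sketch of that, assuming X is your (500, 64) descriptor array:

    import numpy as np

    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / norms                  # L2-normalise each row (each descriptor)
    cos_sim = Xn.dot(Xn.T)          # (500, 500) matrix of pairwise cosine similarities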
It depends on your definition of "closely match". One common way would be to calculate the Euclidean distance.
How can the euclidean distance be calculated with numpy?
or
Distance between numpy arrays, columnwise
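For reference, a brute-force numpy sketch of the Euclidean variant with a threshold; X is assumed to be the (500, 64) descriptor array and the threshold value is just a placeholder:

    import numpy as np

    diff = X[:, None, :] - X[None, :, :]        # (500, 500, 64) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # (500, 500) Euclidean distances
    matches = dist < 0.1                        # pairs of rows that differ by less than the threshold

At 500 x 64 the intermediate array fits comfortably in memory; for much larger inputs you would compute the distances in blocks instead.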
I have a data set consisting of ~200 99x20 arrays of frequencies, with each column summing to unity. I have plotted these using heatmaps. Each array is pretty sparse, with only about 1-7 of the 20 values being nonzero at each of the 99 positions.
However, I would like to cluster these samples in terms of how similar their frequency profiles are (minimum euclidean distance or something like that). I have arranged each 99x20 array into a 1980x1 array and aggregated them into a 200x1980 observation array.
Before finding the clusters, I have tried whitening the data using scipy.cluster.vq.whiten. whiten normalizes each column by its standard deviation, but due to the way I've flattened my data arrays, I have some (8) columns with all zero frequencies, so the standard deviation is zero. Therefore the whitened array has infinite values and the centroid finding fails (or gives ~200 centroids).
My question is, how should I go about resolving this? So far, I've tried
Don't whiten the data. This causes k-means to give different centroids every time it's run (somewhat expected), despite increasing the iter keyword considerably.
Transposing the arrays before I flatten them. The zero variance columns just shift.
Is it ok to just delete some of these zero variance columns? Would this bias the clustering in any way?
EDIT: I have also tried using my own whiten function which just does
    import numpy as np

    for i in range(arr.shape[1]):
        # skip (near-)zero-variance columns to avoid division by zero
        if np.abs(arr[:, i].std()) < 1e-8: continue
        arr[:, i] /= arr[:, i].std()
This seems to work, but I'm not sure if this is biasing the clustering in any way.
Thanks
Removing the columns of all 0's should not bias the data. If you have N-dimensional data but one dimension is all the same number, it is exactly the same as having (N-1)-dimensional data. This property of effective dimensionality is called rank.
Consider 3-D data, but all of your data points are on the x=0 plane. Can you see how this is exactly the same as 2D data?
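A tiny numpy illustration of that point:

    import numpy as np

    # three 3-D points that all lie in the x = 0 plane
    pts = np.array([[0., 1., 2.],
                    [0., 3., 4.],
                    [0., 5., 6.]])
    print(np.linalg.matrix_rank(pts))   # 2: the data is effectively 2-D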
First of all, dropping constant columns is perfectly fine. Obviously they do not contribute information, so no reason to keep them.
However, K-means is not particularly good for sparse vectors. The problem is that most likely the resulting "centroids" will be more similar to each other than to the cluster members.
See, in sparse data, every object is to some extent an outlier. And k-means is quite sensitive to outliers because it tries to minimize the sum of squares.
I suggest that you do the following:
Find a similarity measure that works for your domain. Spend quite a lot of time on this: how do you capture similarity for your particular use case?
Once you have that similarity, compute the 200x200 similarity matrix. As your data set is really tiny, you can actually run expensive clustering methods such as hierarchical clustering, which would not scale to thousands of objects. If you want, you could also try OPTICS clustering or DBSCAN. DBSCAN in particular becomes more interesting when your data set is much larger; for tiny data sets, hierarchical clustering is fine.
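For that last step, a minimal scipy sketch, assuming dist is the 200x200 distance matrix derived from whatever domain-specific similarity you settle on (the cluster count of 5 is just a placeholder):

    from scipy.cluster import hierarchy
    from scipy.spatial.distance import squareform

    Z = hierarchy.linkage(squareform(dist, checks=False), method='average')
    labels = hierarchy.fcluster(Z, t=5, criterion='maxclust')   # cut the tree into at most 5 clusters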