How to identify multiple lines/clusters in a single dataset - python

I'm currently struggling to wrap my head around how regression could be used to find separate linear models within a single dataset. I can perform regression on a single dataset for a single regressor and coefficient with no problem, but what if a known number of distinct lines exists in a single data space?
My first approach was to use hierarchical clustering to identify the points first, but it doesn't seem to capture the variance of individual clusters in Euclidean space as I expected. My second attempt was K-Means, which still relies on Euclidean distance and therefore tends to produce roughly spherical clusters. My last thought was K-medians, but at this point I was wondering what other people might think about this problem.
If this is the right direction, I know I would have to project the points into a better space first (i.e. onto an axis that captures more, or most, of the variance) before applying these methods.
I would appreciate any comments or input in any shape or form.
Thank you,
3-line summary:
Linear regression on a dataset containing multiple lines
Cluster first, then run multiple single linear regressions?
Or have you come across a module for such a thing?
I would truly appreciate any insights.
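One way to attack the "known number of lines" case directly is an EM-style alternation: assign each point to the line that currently fits it best, refit each line on its assigned points, and repeat until the assignments stop changing. A minimal sketch, assuming 1-D x and y arrays and a known line count k (the function name and the pair-based initialization are my own, not from any library):

```python
import numpy as np

def fit_line_mixture(x, y, k=2, n_iter=100, seed=0):
    """EM-style fit of k lines y = a*x + b to a single dataset.
    Assumes x and y are 1-D arrays and the line count k is known."""
    rng = np.random.default_rng(seed)
    coefs = np.zeros((k, 2))
    # init: pass each candidate line through a random pair of points
    for j in range(k):
        i1, i2 = rng.choice(len(x), size=2, replace=False)
        a = (y[i2] - y[i1]) / (x[i2] - x[i1] + 1e-12)
        coefs[j] = [a, y[i1] - a * x[i1]]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # E-step: assign each point to the line with the smallest
        # vertical residual
        resid = np.abs(y[:, None] - (x[:, None] * coefs[:, 0] + coefs[:, 1]))
        new_labels = resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # M-step: least-squares refit of each line on its points
        for j in range(k):
            mask = labels == j
            if mask.sum() >= 2:
                coefs[j] = np.polyfit(x[mask], y[mask], 1)
    return coefs, labels
```

Like K-Means, this is sensitive to initialization, so in practice you would run it with several random restarts and keep the solution with the lowest total residual.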

Related

Clustering method for three-dimensional vectors

I have N three-dimensional vectors
(x,y,z)
I want a simple yet effective approach for clustering these vectors (I do not know a priori the number of clusters, nor can I guess a valid number). I am not familiar with classical machine learning so any advice would be helpful.
The general Sklearn clustering page does a decent job of providing useful background on clustering methods and gives a nice overview of the differences between them. Importantly for your case, the table in section 2.3.1 lists the parameters of each method.
The differences between methods tend to come down to how your knowledge of the dataset matches the assumptions of each model. Some expect you to know the number of clusters in advance (such as K-Means), while others attempt to determine the number of clusters from other input parameters (like DBSCAN).
While focusing on methods which attempt to find the number of clusters themselves seems like it might be preferable, it is also possible to use a method which expects the number of clusters and simply test many different reasonable values to determine which one is optimal. One such example with K-Means is this.
The easiest clustering algorithms are probably K-Means (if your three features are numerical) and K-Medoids (which allows any type of feature).
These algorithms are quite easy to understand. In a few words, by calculating some distance measure between each observation and the cluster centers, they try to assign each observation to the cluster closest to it. The main issue with these algorithms is that you have to specify how many clusters (K) you want, but there are techniques such as the Elbow method or the Silhouette score that allow you to determine numerically which value of K would be a reasonable number of clusters.
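As a concrete illustration of the Silhouette idea with K-Means, the loop below tries a range of K values and keeps the one with the best score (the synthetic 3-D vectors and cluster centers are made up for the example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in for the N three-dimensional vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 3))
               for c in ([0, 0, 0], [3, 3, 3], [0, 3, 0])])

# try a range of K and keep the one with the best silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

The silhouette score rewards clusterings whose points are close to their own cluster and far from the next nearest one, so on well-separated data it peaks at the true number of clusters.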

clustering algorithm with minimum number of points

I am trying to separate a dataset that has 2 clusters that do not overlap in any way, plus a single data point that is far away from both clusters.
When I use kmeans() to get the 2 clusters, it splits one of the "valid" clusters in half and treats the single data point as a separate cluster.
Is there a way to specify a minimum number of points per cluster? I am using MATLAB.
There are several solutions:
Easy: try with 3 clusters;
Easy: remove the single data point (which you can detect as an outlier with any outlier detection technique);
To be tried: use a k-medoids approach instead of k-means; this sometimes helps get rid of outliers;
More complicated but surely works: perform spectral clustering. This helps you get past the main issue of k-means, which is the brutal use of the Euclidean distance.
More explanations on the inadequate behaviour of k-means can be found on the Cross Validated site (see here for instance).
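The question is about MATLAB, but the "remove the single data point first" suggestion is easy to sketch in Python with scikit-learn (synthetic data; the nearest-neighbour cutoff of mean + 3 standard deviations is an arbitrary choice for the example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# two tight clusters plus one far-away point
X = np.vstack([rng.normal([0, 0], 0.2, (40, 2)),
               rng.normal([5, 5], 0.2, (40, 2)),
               [[20.0, 20.0]]])

# flag the isolated point: its nearest-neighbour distance is huge
dist, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
nn_dist = dist[:, 1]                      # distance to the nearest other point
keep = nn_dist < nn_dist.mean() + 3 * nn_dist.std()

# k-means on the cleaned data recovers the two real clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[keep])
```

With the outlier gone, k-means has no incentive to split a valid cluster in half.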

piecewise linear regression python: arbitrary amount of knots

I have experimental data which is piecewise continuous, and each part should fit linearly. However, I would like to fit it without knowing in advance where the knots are (i.e. the points where the slope changes), since it's not easy to determine them manually.
So far I have found the advice to use py-earth for this task, but couldn't understand how to implement it; that is, having just a list of values for X and Y, how can I do such a piecewise linear regression?
Could anybody give me advice on how to do that?
UPD: It turned out the problem was the format of the array: X = numpy.array([X]).T for my arrays solved it, and now py-earth is working. However, it's too "rough", showing one single line across several knots. Can anybody suggest some other solution for the piecewise linear regression?
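If py-earth turns out too rough, one alternative is to fit the knot position directly with scipy's curve_fit, treating the breakpoint as a free parameter. A sketch for a single knot (the synthetic data, piecewise_linear function, and starting guess p0 are my own; more knots mean more parameters and a harder optimization):

```python
import numpy as np
from scipy import optimize

def piecewise_linear(x, x0, y0, k1, k2):
    """Two line segments with slopes k1 and k2 meeting at the knot (x0, y0)."""
    return np.piecewise(x, [x < x0],
                        [lambda x: y0 + k1 * (x - x0),
                         lambda x: y0 + k2 * (x - x0)])

# synthetic data with a slope change at x = 5
x = np.linspace(0, 10, 100)
y = np.where(x < 5, 2 * x, 10 + 0.5 * (x - 5))

# fit knot location, knot height, and both slopes simultaneously
p, _ = optimize.curve_fit(piecewise_linear, x, y, p0=[4.0, 8.0, 1.0, 1.0])
```

After fitting, p holds the estimated [x0, y0, k1, k2], so the knot location comes out of the optimization rather than being chosen by hand.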

method for implementing regression tree on raster data - python

I'm trying to build and implement a regression tree algorithm on some raster data in python, and can't seem to find the best way to do so. I will attempt to explain what I'm trying to do:
My desired output is a raster image, whose values represent lake depth, call it depth.tif. I have a series of raster images, each representing the reflectance values in different Landsat bands, say [B1.tif, B2.tif, ..., B7.tif] that I want to use as my independent variables to predict lake depth.
For my training data, I have a shapefile of ~6000 points of known lake depth. To create a tree, I extracted the corresponding reflectance values for each of those points, then exported that to a table. I then used that table in weka, a machine-learning program, to create a 600-branch regression tree that would predict depth values based on the set of reflectance values. But because the tree is so large, I can't write it in python manually. I ran into the python-weka-wrapper module so I can use weka in python, but have gotten stuck with the whole raster part. Since my data has an extra dimension (if converted to an array, each independent variable is actually a grid of ncolumns x nrows values instead of just a row of values, like in all of the examples), I don't know if it can do what I want. In all the examples for python-weka-wrapper, I can't find one that deals with spatial data, and I think this is what's throwing me off.
To clarify, I want to use the training data (which is a point shapefile/table right now but can- if necessary- be converted into a raster of the same size as the reflectance rasters, with no data in all cells except for the few points I have known depth data at), to build a regression tree that will use the reflectance rasters to predict depth. Then I want to apply that tree to the same set of reflectance rasters, in order to obtain a raster of predicted depth values everywhere.
I realize this is confusing and I may not be doing the best job of explaining it. I am open to other options besides implementing weka in python, such as sklearn, as long as they are open source. My question is: can what I described be done? I'm pretty sure it can, as it's very similar to image classification, except that the target values (depth) are continuous rather than discrete classes, but so far I have failed. If so, what is the most straightforward method, and are there any examples that might help?
Thanks
I have had some experience using Landsat data for the prediction of environmental properties of soil, which seems somewhat related to the problem you describe. Although I developed my own models at the time, I can describe the general process I went through to map the predicted data.
For the training data, I was able to extract the Landsat values (in addition to other properties) for the spatial points where known soil samples were taken. This way, I could use the Landsat data as inputs for predicting the environmental data. A part of this data was also reserved for testing, to confirm that the trained models were not overfitting the training data and predicted the outputs well.
Once this process was complete, it was possible to map the desired area by getting the spatial information at each point of the area (matching the resolution of the desired image). From there, you should be able to feed these Landsat factors into the model and use the output to map the predicted depth. You could likely just use Weka in this case to predict all of the cases, then use another tool to build the map from your estimates.
I believe I whipped up some code long ago to extract each of my required factors in ArcGIS, but it's been a while since I did this. There should be some good tutorials out there that could help you in that direction.
I hope this helps in your particular situation.
It sounds like you are not using any spatial information to build your tree
(such as information on neighboring pixels), just reflectance. So, you can
apply your decision tree to the pixels as if the pixels were all in a
one-dimensional list or array.
A 600-branch tree for a 6000 point training data file seems like it may be
overfit. Consider putting in an option that requires the tree to stop splitting
when there are fewer than N points at a node or something similar. There may
be a pruning factor that can be set as well. You can test different settings
till you find the one that gives you the best statistics from cross-validation or
a held-out test set.
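Both answers boil down to the same recipe: treat each pixel as a row of band values, train on the sampled points, predict for every pixel, and reshape back to a raster. A minimal sklearn sketch (synthetic arrays stand in for the band rasters and the depth points; the variable names and min_samples_leaf value are my own), using a leaf-size limit to keep the tree from growing 600 branches:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-ins for the band rasters B1..B7 (nrows x ncols each)
nbands, nrows, ncols = 7, 50, 60
rng = np.random.default_rng(0)
bands = rng.random((nbands, nrows, ncols))
true_depth = 3 * bands[0] - 2 * bands[3] + 1   # fake depth relationship

# training samples: pixel locations with known depth (the point shapefile)
ii = rng.integers(0, nrows, 500)
jj = rng.integers(0, ncols, 500)
X_train = bands[:, ii, jj].T                   # shape (n_points, nbands)
y_train = true_depth[ii, jj]

# limit leaf size so the tree stops splitting instead of overfitting
tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0)
tree.fit(X_train, y_train)

# predict for every pixel: flatten bands to (npixels, nbands), then reshape
X_all = bands.reshape(nbands, -1).T
depth_pred = tree.predict(X_all).reshape(nrows, ncols)
```

The reshape at the end is the whole trick for the "extra dimension": the spatial layout is ignored during prediction and restored afterwards, and rasterio or GDAL can then write depth_pred back out as a GeoTIFF with the original georeferencing.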

Clustering conceptually similar documents together?

This is more of a conceptual question than an actual implementation and am hoping someone could clarify. My goal is the following: Given a set of documents, I want to cluster them such that documents belonging to the same cluster have the same "concept".
From what I understand, Latent Semantic Analysis lets me find a low-rank approximation of a term-document matrix, i.e. given a matrix X, it will decompose it as a product of three matrices, X = U Σ Vᵀ, one of which is a diagonal matrix Σ.
Now, I would proceed by choosing a low-rank approximation, i.e. keeping only the top-k values from Σ, and then calculating X'. Once I have this matrix, I would apply some clustering algorithm, and the end result would be a set of clusters grouping documents with similar concepts. Is this the right way of applying clustering? I mean, calculating X' and then clustering on top of it, or is there some other method that is usually followed?
Also, in a somewhat related question of mine, I was told that the meaning of a neighbor is lost as the number of dimensions increases. In that case, what is the justification for clustering these high dimensional data points from X'? I am guessing that the requirement to cluster similar documents is a real-world requirement in which case, how does one go about addressing this?
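For reference, the pipeline described above (term-document matrix, rank-k truncation, then clustering) can be sketched with scikit-learn, whose TruncatedSVD on TF-IDF vectors is exactly LSA (the toy documents and the choice of 2 components and 2 clusters are for illustration only):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = [
    "the cat sat on the mat",
    "the cat chased the dog on the mat",
    "stocks fell and the market dropped",
    "the market rallied and stocks rose",
]

# term-document matrix -> rank-k approximation (LSA) -> normalize -> cluster
lsa = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=0),
                    Normalizer(copy=False))
X_lsa = lsa.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsa)
```

Clustering in the reduced k-dimensional space also sidesteps much of the high-dimensionality concern, since k is typically far smaller than the vocabulary size.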
For the first part of your question: no, you do not need to perform any separate clustering step; such a clustering is already available from your singular value decomposition. If this is still unclear, please study your Latent Semantic Analysis link in more detail.
For the second part: please first work out the first part of your question, and then restate this part based on that.
