Does PySpark ML KMeans have a way to get the explained variance? - python

As I was reading through the ML package for PySpark here, it seems the KMeansModel doesn't have a way to compute the explained variance in order to draw an elbow curve and establish the optimal number of clusters.
However, in this example the user seems to have a computeCost() function. Where does that function come from? I'm not having any success with it in my program.
I am using Spark 1.6. Thanks in advance!

I was stuck with the same issue regarding the computeCost() method in PySpark.
Instead of computeCost you can use the Mahalanobis distance or the WSSSE (within-set sum of squared errors) after applying k-means.
You have to write the code to compute that distance yourself; by collecting the result for various values of k you can draw the graph and look for the knee point that marks the optimum number of clusters (a sketch follows below).
Have a look at the use case "Anomaly Detection Using PySpark", which helped me.
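In Spark 1.6 the ml KMeansModel does not expose computeCost(); the call in the linked example most likely comes from the older mllib API, whose KMeansModel has a computeCost(rdd) method, and the ml version only gained an equivalent in later releases. Below is a minimal sketch of computing the WSSSE yourself, assuming a DataFrame df with a "features" vector column; the k range and column names are placeholders, and collecting predictions to the driver is only reasonable for modestly sized data.

# Sketch: manual WSSSE for several values of k, to draw an elbow curve.
import numpy as np
import matplotlib.pyplot as plt
from pyspark.ml.clustering import KMeans

def wssse(model, df):
    # sum of squared distances from each point to its assigned cluster center
    centers = [np.array(c) for c in model.clusterCenters()]
    rows = model.transform(df).select("features", "prediction").collect()
    return sum(float(np.sum((row["features"].toArray() - centers[row["prediction"]]) ** 2))
               for row in rows)

costs = []
ks = list(range(2, 11))
for k in ks:
    model = KMeans(k=k, seed=1).fit(df)
    costs.append(wssse(model, df))

plt.plot(ks, costs, marker="o")
plt.xlabel("k")
plt.ylabel("WSSSE")
plt.show()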

Related

How to identify multiple lines/clusters in a single dataset

I'm currently struggling to wrap my head around how multi-linear regression could be done to find separate sets of linear models in a single data set. I can perform regression on a single data set for a single regressor and coefficient with no problem, but what if there is a known number of lines in a single data space?
My first approach was to use hierarchical clustering to identify the points first, but it doesn't seem to capture individual cluster variance in Euclidean space as expected. My second trial was KMeans, which still relies on Euclidean distance and so creates clusters with radii. My last thought was K-median, but at this point I was wondering what other people might think about this problem.
If this is the right direction, I know I would have to project the points into a better space (i.e. an axis that captures more, or most, of the variance) before I apply these methods.
I would appreciate any comments or input in any shape or form.
Thank you,
3-line summary:
Linear regression on dataset with multiple lines
Clustering first, then multiple single linear regression?
or have you come across a module for this sort of thing?
I would truly appreciate any insights;
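A minimal sketch of the "clustering first, then multiple single linear regressions" idea from the summary above, using sklearn on made-up two-line data. It only illustrates the pipeline; as noted in the question, Euclidean clustering on (x, y) will not always separate overlapping lines cleanly, so a mixture-of-regressions approach or RANSAC may work better in practice.

# Sketch: cluster points first, then fit one linear model per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y1 = 2 * x + 1 + rng.normal(0, 0.3, 200)     # first synthetic line
y2 = -1 * x + 12 + rng.normal(0, 0.3, 200)   # second synthetic line
X = np.column_stack([np.concatenate([x, x]), np.concatenate([y1, y2])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label in np.unique(labels):
    pts = X[labels == label]
    model = LinearRegression().fit(pts[:, [0]], pts[:, 1])
    print(label, model.coef_, model.intercept_)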

Find the appropriate polynomial fit for data in Python

Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to new data; I am just focusing on the data I have. I realize the higher the degree, the better the fit. However, I want something that penalizes complexity, or that looks at where the error elbows. When I say elbowing, I mean something like this (although usually it is not so drastic or obvious):
One idea I had was to use Numpy's polyfit: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html to compute polynomial regression for a range of orders/degrees. Polyfit requires the user to specify the degree of the polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of fit, the lower the error will be, but eventually it plateaus like the image above. Therefore I want to automatically compute the degree at which the error curve elbows: if my error is E and d is my degree, I want to maximize the drop in improvement, (E[d-1] - E[d]) - (E[d] - E[d+1]).
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like Numpy or Scipy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!
To select the "right" fit and prevent over-fitting, you can use the Akiake Information Criterion or the Bayesian Information Criterion. Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.

clustering algorithm with minimum number of points

I am trying to separate a data set that has 2 clusters that do not overlap in any way, plus a single data point that is away from these two clusters.
When I use kmeans() to get the 2 clusters, it splits one of the "valid" clusters in half and considers the single data point as a separate cluster.
Is there a way to specify minimum number of points for this? I am using MATLAB.
There are several solutions:
Easy: try with 3 clusters (see the Python sketch after this list);
Easy: remove the single data point (which you can detect as an outlier with any outlier-detection technique);
To be tried: use a k-medoids approach instead of k-means. This sometimes helps with getting rid of outliers.
More complicated but surely works: perform spectral clustering. This helps you get over the main issue of k-means, which is the brutal use of the Euclidean distance.
More explanations on the inadequate behaviour of k-means can be found on the Cross Validated site (see here, for instance).
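The question is about MATLAB, but here is a Python/sklearn sketch of the first option (ask for 3 clusters, then discard any cluster that falls below a minimum size); the function name and threshold are made up for illustration.

# Sketch: run k-means with an extra cluster, then mask out undersized clusters.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_with_min_size(X, n_clusters=3, min_size=5):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    sizes = np.bincount(labels)
    keep = np.isin(labels, np.where(sizes >= min_size)[0])
    return labels, keep   # `keep` is False for points in undersized clusters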

How to find the optimal number of clusters using k-prototype in python

I am trying to cluster some big data using the k-prototypes algorithm. I am unable to use the k-means algorithm as I have both categorical and numeric data. With the k-prototypes clustering method I have been able to create clusters if I define what k value I want.
How do I find the appropriate number of clusters for this?
Will the popular methods (like the elbow method and the silhouette score), which are usually applied to purely numerical data, work for mixed data?
You can use this code:
# Choosing the optimal k via the elbow method on the k-prototypes cost
import matplotlib.pyplot as plt
from kmodes.kprototypes import KPrototypes

cost = []
for num_clusters in range(1, 8):
    kproto = KPrototypes(n_clusters=num_clusters, init='Cao')
    kproto.fit_predict(Data, categorical=[0,1,2,3,4,5,6,7,8,9])
    cost.append(kproto.cost_)

plt.plot(range(1, 8), cost)
plt.show()
Source: https://github.com/aryancodify/Clustering
Most evaluation methods need a distance matrix.
They will then work with mixed data, as long as you have a distance function that helps solving your problem. But they will not be very scalable.
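For example, here is a sketch of running the silhouette score on a precomputed mixed-data distance matrix. The distance below is a crude Gower-style mix of range-scaled numeric differences and categorical mismatches; numeric_values, categorical_codes, and cluster_labels are placeholders for your own data.

# Sketch: silhouette on mixed data via a precomputed distance matrix.
import numpy as np
from sklearn.metrics import silhouette_score

def mixed_distance_matrix(num, cat):
    # num: (n, p) float array, cat: (n, q) array of category codes
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0
    n = num.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        num_part = np.abs(num - num[i]) / ranges        # scaled numeric differences
        cat_part = (cat != cat[i]).astype(float)        # 0/1 category mismatches
        D[i] = np.hstack([num_part, cat_part]).mean(axis=1)
    return D

# D = mixed_distance_matrix(numeric_values, categorical_codes)
# print(silhouette_score(D, cluster_labels, metric="precomputed"))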
Yes, the elbow method is good enough to get the number of clusters, because it is based on the total sum of squared errors.

efficient filtering near/inside clusters after they are found - python

Essentially, I applied a DBSCAN algorithm (sklearn) with a Euclidean distance on a subset of my original data. I found my clusters and all is fine, except for the fact that I want to keep only the values that are far enough from the points on which I did not run my analysis. I have a new distance to test against and I wanted to understand how to do it WITHOUT numerous nested loops.
In a picture: my found clusters are in blue, whereas the red ones are the points I don't want to be near. The crosses are the points belonging to the clusters that get carved out because they fall within the new distance I specified.
Now, while I could do something of the sort:
for i in red_points:
    for j in blu_points:
        if dist(i, j) < given_dist:
            original_dataframe.remove(j)
I refuse to believe there isn't a vectorized method. Also, I can't afford to do it as above, simply because I'll have huge tables to operate on and I'd like to keep my CPU from evaporating.
Any and all suggestions are welcome.
Of course you can vectorize this, but it will still be O(n*m). Better neighbor search algorithms, e.g. the kd-tree and the ball tree, are not vectorized.
Both are available in sklearn and are used by the DBSCAN module. Please see the sklearn.neighbors package.
If you need exact answers, the fastest implementation should be sklearn's pairwise distance calculator:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
If you can accept an approximate answer, you can do better with the kd-tree's query_radius(): http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html
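A sketch of the KD-tree route for the original filtering problem: drop every blue point that has at least one red point within given_dist. The array names are placeholders, and query_radius with count_only=True avoids materializing the full n*m distance matrix.

# Sketch: keep only blue points with no red point inside the given radius.
import numpy as np
from sklearn.neighbors import KDTree

def far_enough(blue_points, red_points, given_dist):
    tree = KDTree(red_points)
    counts = tree.query_radius(blue_points, r=given_dist, count_only=True)
    return blue_points[counts == 0]

# kept = far_enough(blue_xy, red_xy, given_dist)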
