Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
I have a set of data whose scatter plot looks something like this:
I've marked the correct answer with a red area; it lies almost in the center, between the two branches. (The scatter plot has a 'V' shape.)
I need an algorithm for finding this area and collecting all the data points contained in it (because there are other datasets like this one).
Both x,y data have been uploaded here:
Data
Based on your question so far, it is difficult to know how to evaluate what is correct (i.e. why is this region correct? Is it based on the values/coordinates of the points, or on the point density in the region? Is it based on the position with respect to the larger structure, i.e. the centre of the branches?).
That being said, there are a lot of machine learning algorithms available, e.g. in scikit-learn for Python. Using a supervised learning algorithm, you could train a model on some labelled data, and it could then (try to) find the correct answer for other data.
More of an answer is difficult to provide before you rephrase your question.
If all your data looks like this, one option might be to do a PCA (i.e. dimensionality reduction) on the data to separate the branches into two clusters. You would then get some data points which cannot clearly be identified as belonging to only one branch, and those are the ones you could select (scikit-learn's PCA docs). Note that while it should be reasonably accurate, you would never get a perfect circle using this.
If you only need it for this one dataset, for which you already know the "radius" and centre, you could identify the centre of your circle (ellipse) and its semi-major (and semi-minor) axes a (and b), and then test each point against the ellipse's canonical form.
It might then be simpler to use a square, though.
So it would look something like this (assuming 1-D numpy.ndarrays):
import numpy as np
# select points inside an axis-aligned rectangle
condition = (xarr > xmin) & (xarr < xmax) & (yarr > ymin) & (yarr < ymax)
# depending on what you want: the coordinates, or the values at those coordinates
xsq = xarr[condition]
ysq = yarr[condition]
squaredata = data[condition]
# for an ellipse: x0, y0, a and b can be hard-coded if you only need this one dataset
def in_ellipse(x, y, x0, y0, a, b):
    # canonical form; works element-wise on arrays, so np.vectorize is not needed
    return ((x - x0) / a) ** 2 + ((y - y0) / b) ** 2 <= 1.0
# compute the mask once and reuse it
mask = in_ellipse(xarr, yarr, 1.6, -1125, 0.1, 10)
ellipsedata = data[mask]
x_ellipse = xarr[mask]
y_ellipse = yarr[mask]
The values for x0, y0, a and b were just estimated by looking at the picture.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have an image represented as a uint16 numpy array (orig_arr) with a skewed distribution. I would like to create a new array (noise_arr) of random values that matches the mean and standard deviation of orig_arr.
I believe this will require two main steps:
Measure the mean and distribution of orig_arr
Create a new array of random values using the mean and distribution measured in step 1
I'm pretty much lost on how to do this, but here's a sample image and a bit of code to get you started:
Sample image: https://drive.google.com/open?id=1bevwW-NHshIVRqni5O62QB7bxcxnUier (looks blank but it's not)
import cv2
import numpy as np
# -1 reads the file unchanged, keeping the 16-bit depth
orig_arr = cv2.imread('sample_img.tif', -1)
orig_mean = np.mean(orig_arr)
orig_sd = np.std(orig_arr)
print(orig_mean)
# 18.676384933578962
print(orig_sd)
# 41.67964688299941
I think scipy.stats.skewnorm might do the trick. It lets you characterize skewed normal distributions, and also sample data from skewed normal distributions.
Now... maybe that's a bad assumption for your data... maybe it's not skew-normal, but this is the first thing I'd try.
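# import skewnorm
import numpy as np
from scipy.stats import skewnorm
# find params (flatten the image so the fit sees a single 1-D sample)
a, loc, scale = skewnorm.fit(orig_arr.ravel())
# mimic the original distribution with skewnorm
# keep size and shape the same as orig_arr
samples = skewnorm(a, loc, scale).rvs(orig_arr.size)
# clip before casting so negative samples do not wrap around in uint16
noise_arr = np.clip(samples, 0, 65535).astype('uint16').reshape(orig_arr.shape)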
There's more detail about exploring this kind of data... plotting it... comparing it... over here: How to create uint16 gaussian noise image?
Also... I think using imshow and setting vmin and vmax will let you look at an image or heatmap of your data that is sensitive to the range. That is demonstrated in the link above as well.
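Since the question asks specifically for a matching mean and standard deviation, and a fitted skew-normal is not guaranteed to reproduce those two numbers exactly, here is a rough alternative sketch that swaps in a different technique: moment matching with a gamma distribution. This is not what the answer above does. It assumes the data are non-negative with a positive mean and standard deviation; the function name gamma_noise_like and the demo array are made up for illustration.

```python
import numpy as np

def gamma_noise_like(orig_arr, seed=None):
    """Draw uint16 noise whose mean and std roughly match those of orig_arr.

    Assumption: the data are non-negative with a positive mean and std,
    so a gamma distribution can be moment-matched to them.
    """
    mean = float(np.mean(orig_arr))
    sd = float(np.std(orig_arr))
    # Moment matching for a gamma distribution:
    # mean = shape * scale, variance = shape * scale**2
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    rng = np.random.default_rng(seed)
    samples = rng.gamma(shape, scale, size=orig_arr.shape)
    # Round and clip into the uint16 range before casting
    return np.clip(np.rint(samples), 0, 65535).astype(np.uint16)

# Hypothetical stand-in for the real image: a skewed, non-negative array
demo = np.random.default_rng(0).gamma(0.2, 93.0, size=(500, 500)).astype(np.uint16)
noise_arr = gamma_noise_like(demo, seed=1)
print(demo.mean(), noise_arr.mean())
print(demo.std(), noise_arr.std())
```

Rounding to integers shifts the moments slightly, so expect the match to be close rather than exact. Whether a gamma shape is a reasonable model for your image is a separate question; comparing histograms of the two arrays is a quick check.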
Hope that helps!
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
Edit: This question was written with little knowledge of clustering techniques and, in hindsight, does not meet the standards of Stack Overflow. SO won't let me delete it, saying that others have invested time and energy in it (a valid point) and that if I proceed to delete it I may not be able to ask questions for a while. So I am updating the question to make it relevant in a way that others can learn from. It still doesn't strictly comply with SO guidelines, as I myself would flag it as too broad, but in its current state it is of no value, so adding a little value to it is worth the downvotes.
Updated Conversation topic
The issue was selecting the optimal number of clusters for a clustering algorithm that groups various shapes obtained from contour detection on an image; a deviation in cluster properties was then to be marked as noise or an anomaly. The main point that raised the question at the time was that all datasets were different: the shapes obtained from them differed, and the number of shapes also varied from dataset to dataset. The proper solution is to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise), an implementation of which can be found in scikit-learn and which I was unaware of at the time. That works, and the product is now in testing. I just wanted to come back to this and correct this old mistake.
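For readers landing here, a minimal sketch of what that looks like with scikit-learn's DBSCAN. The feature array is a made-up stand-in for whatever shape descriptors you extract, and the eps and min_samples values are illustrative guesses that need tuning for each dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical 2-D shape features: two dense groups plus one isolated point
group_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
group_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([group_a, group_b, outlier])

# eps and min_samples are illustrative guesses and need tuning per dataset
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN does not take a cluster count; it marks noise points with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_mask = labels == -1
print(n_clusters, int(noise_mask.sum()))
```

The number of clusters falls out of the density parameters instead of being chosen up front, which is what made it fit the varying datasets described above.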
Old Question
Old Title Dynamic selection of k in kmeans clustering
I have to generate a k-means clustering model in which the number of classes is not known in advance. Is there a way to automatically determine the value of k based on the Euclidean distance within the clusters?
How I want it to work: start with a value of k, perform clustering, see if it satisfies a threshold criterion, and increase or decrease k accordingly. The problem is framework independent, so if you have an idea or an implementation in a language other than Python, please share that as well.
I found this while researching the problem: https://www.researchgate.net/publication/267752474_Dynamic_Clustering_of_Data_with_Modified_K-Means_Algorithm.
I couldn't find an implementation of it.
I am looking for similar ideas so that I can select the best one and implement it myself, or for an implementation that can be ported to my code.
Edit
The Ideas I am considering right now are:
The elbow method
X-means clustering
You can use the elbow method. It runs the clustering for various values of k (the number of clusters) and, for each one, calculates the distance of each point from its cluster center. Beyond a certain k there won't be any major improvement; that is the value you can take for k.
For further reading, refer to this link.
You can iterate over the values of k and check the cluster validity using the silhouette score.
You can iterate over k values in any range. Either check the silhouette score for each k, or calculate the difference between the SSE values of consecutive k values. The point where the difference is highest after 0.4 * number of k values will be the elbow point.
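A minimal sketch of the silhouette variant, assuming scikit-learn is available. The helper name best_k_by_silhouette and the toy blobs are made up for illustration; on real data the best score is often much less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values):
    """Return (best_k, scores) where scores maps each k to its silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical toy data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)
best_k, scores = best_k_by_silhouette(X, range(2, 7))
print(best_k)
```

The silhouette score is only defined for k of at least 2, so the scan starts there. For the SSE-based elbow, the fitted model's inertia_ attribute holds the within-cluster sum of squares.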
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Problem: Find all the points which lie inside a sphere with center C and radius R.
For example, see the image below, which illustrates the problem for a simple 2D case. The label (N) and coordinates (x,y) of each point are known. I need to find the labels of all points that lie within the red circle.
A sample input file containing the coordinates of 7.25 M points is attached here: point file.
I tried the following piece of code
import numpy as np
C = [50,50,50]
R = 20
centroid = np.loadtxt('centroid') #chk the file attached
def dist(x,y): return sum([(xi-yi)**2 for xi, yi in zip(x,y)])
elabels=[i+1 for i in range(len(centroid)) if dist(C,centroid[i])<=R**2]
Any suggestions to make it faster?
Thanks,
Prithivi
No, there is no built-in function to do this. However, there are constructs to make the search syntactically concise. There are also geometric packages that include a Point data type you might find useful, as well as supporting distance functions.
Without seeing the set-up code you've chosen, about all I can provide is something like this:
neighbours = [point for point in point_list if dist(C, point) < R]
Another way to approach this construction is to use the built-in filter function on the point list; you'll notice similarities in structure.
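If the points are already in a NumPy array, as in the question, the same test can be done for all points at once, which is usually much faster than a Python-level loop. A minimal sketch, assuming an (n, 3) coordinate array and the question's 1-based labels; the function name labels_in_sphere is made up for illustration.

```python
import numpy as np

def labels_in_sphere(points, center, radius):
    """Return 1-based labels of the points within `radius` of `center`."""
    points = np.asarray(points, dtype=float)
    diff = points - np.asarray(center, dtype=float)
    # Squared distances for all points at once; no square root needed
    dist_sq = np.einsum('ij,ij->i', diff, diff)
    return np.flatnonzero(dist_sq <= radius ** 2) + 1

pts = np.array([[50, 50, 50], [0, 0, 0], [60, 60, 60], [50, 50, 70]])
print(labels_in_sphere(pts, [50, 50, 50], 20))
```

Comparing squared distances against radius ** 2 mirrors the question's own code and avoids a square root per point.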
Response to Comment
Is the set-up as shown in your edited problem, i.e. are the points spaced regularly? In that case, drop the list C entirely and simply compute the neighbors with a couple of parameters. If the points are distributed haphazardly, then you can get some speed-up by building a graph of near neighbors for each point. Then you can use a distance-based graph traversal algorithm to gather the nearby points much faster than by doing a neighborhood search each time.
Simple-minded Insertion
As you read each point, check it against the existing points in your graph, building the neighbourhoods as you go. Most of all, use the triangle inequality as your weapon. For instance, if your current point x is at least 2*m units from point a, then no point in a's neighbourhood can be in x's neighbourhood.
If you wish, you can also maintain a few long-distance links among areas in the graph. This can allow you to eliminate more distant neighbourhoods from your search. In general, if you compute d(a,x) to be q and d(a,b) to be r, then
|q-r| <= d(x,b) <= q+r
If the lower bound |q-r| is already at least 2*m, then x is at least 2*m units from b, and you can similarly eliminate b's entire neighbourhood.
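The pruning test can be written as a small predicate. This is only a sketch of the triangle-inequality check under the reading that m is the neighbourhood radius; the function name can_skip_neighbourhood is made up for illustration.

```python
def can_skip_neighbourhood(q, r, m):
    """True if b's neighbourhood cannot contain neighbours of x.

    q = d(a, x), r = d(a, b), m = neighbourhood radius (assumed meaning).
    The triangle inequality gives d(x, b) >= |q - r|.
    """
    return abs(q - r) >= 2 * m

print(can_skip_neighbourhood(q=10.0, r=1.0, m=2.0))
print(can_skip_neighbourhood(q=3.0, r=2.0, m=2.0))
```

Only the two already-known distances q and r are needed, so b's neighbourhood is ruled out without computing d(x, b) at all.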
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I have a series of line data (2-3 connected points each).
What is the best machine learning algorithm I can use to classify lines by the similarity of their locations? (image below)
Preferably something from a Python library such as scikit-learn.
Edit:
I have tried DBSCAN, but the problem I faced was that if two lines intersect each other, DBSCAN sometimes assigns them to one group even though they run in completely different directions.
Here is a solution I found so far:
GeoPath Clustering Algorithm
The idea here is to cluster geo paths that travel very similarly to each other into groups.
Steps:
1- Cluster lines based on slope
2- Within each cluster from step 1, find the centroid of each line and use the k-means algorithm to cluster them into smaller groups
3- Within each group from step 2, calculate the length of each line and group lines within a defined length threshold
The result will be small groups of lines that have a similar slope, are close to each other and have a similar travel distance.
Here are screenshots of the visualization:
Yellow lines are all the lines, and red lines are clusters of paths that travel together.
I'll throw in an answer since I think the current one is incomplete...and I also think the comment of "simple heuristic" is premature. I think that if you cluster on points, you'll get a different result than what your diagram depicts, as the clusters will be near the end-points and you wouldn't get your nice ellipses.
So, if your data really does behave similarly to how you display it, I would take a stab at turning each set of 2/3 points into a longer list of points that basically trace out the lines. (You will need to experiment with how dense to make them.)
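As a rough illustration of that densifying step, here is a sketch that resamples each polyline at evenly spaced positions along its length using NumPy. The function name densify is made up, and it assumes consecutive vertices are distinct.

```python
import numpy as np

def densify(line, n_points):
    """Resample a polyline (k x 2 array of vertices) to n_points evenly spaced points."""
    line = np.asarray(line, dtype=float)
    seg_len = np.linalg.norm(np.diff(line, axis=0), axis=1)
    # Cumulative arc length at each vertex, starting at 0
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])
    targets = np.linspace(0.0, arc[-1], n_points)
    x = np.interp(targets, arc, line[:, 0])
    y = np.interp(targets, arc, line[:, 1])
    return np.column_stack([x, y])

print(densify([[0, 0], [2, 0], [2, 2]], 5))
```

Stacking the densified points of all lines (and remembering which line each point came from) gives the point cloud to cluster.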
Then run HDBSCAN on the result (see the video at https://www.youtube.com/watch?v=AgPQ76RIi6A ) to get your clusters. I believe "pip install hdbscan" installs it.
Now, when testing a new sample, first decompose it into many (N) points and assign them to clusters with your hdbscan model. I reckon that if you take a majority-voting approach over your N points, you'll get the best overall cluster to which the "line" belongs.
So, while I sort of agree with the "simple heuristic" comment, it's not so simple if you want the whole thing automated. And once you watch the video, you may be convinced that HDBSCAN, because of its density-based algorithm, will suit this problem (if you decide to create many points from each sample).
I'll wrap up by saying that I'm sure there are line-intersection models that have done this before...and that there do exist heuristics and rules that can do the job. Likely, they're computationally more economical too. My answer is just something organic using sklearn as you requested...and I haven't even tested it! It's just how I would proceed if I were in your shoes.
edit
I poked around and there are a couple of line similarity measures you can possibly try: the Frechet and Hausdorff distance measures.
Frechet: http://arxiv.org/pdf/1307.6628.pdf
Hausdorff: distance matrix of curves in python for a python example.
If you generate all pair-wise similarities and then group them according to similarity and/or into N bins, you can then call those bins your "clusters" (not kmeans clusters though!). For each new line, generate all similarities and see which bin it belongs to. I revise my original comment of possibly being computationally less intensive...you're lucky your lines only have 2 or 3 points!
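A minimal sketch of the pairwise step using a vertex-only (discrete) Hausdorff distance in plain NumPy. Treating each line as just its 2 or 3 vertices is a simplifying assumption, and the function names are made up for illustration.

```python
import numpy as np

def hausdorff(a, b):
    """Discrete (vertex-only) symmetric Hausdorff distance between two point sets."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # d[i, j] = Euclidean distance between a[i] and b[j]
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def pairwise_hausdorff(lines):
    """Symmetric matrix of Hausdorff distances between every pair of lines."""
    n = len(lines)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = hausdorff(lines[i], lines[j])
    return out

lines = [[[0, 0], [1, 0]], [[0, 1], [1, 1]], [[0, 5], [1, 5], [2, 5]]]
D = pairwise_hausdorff(lines)
print(D)
```

The resulting matrix can then be binned or thresholded as described above, or handed to a clustering algorithm that accepts precomputed distances.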
The problem you're trying to solve is called clustering. For an overview of clustering algorithms in sklearn, see http://scikit-learn.org/stable/modules/clustering.html#clustering.
Edit 2: KMeans was what sprung to mind when I first saw your post, but based on feedback from the comments it looks like it's not a good fit. You may want to try sklearn's DBSCAN instead.
A potential transformation or extra feature you could add would be to fit a straight line to each set of points, and then use the (slope, intercept) pair. You may also want to use the centroid of each line.
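A small sketch of that feature extraction, using np.polyfit for the straight-line fit. The function name line_features is made up, and it assumes no line is vertical (all x values equal), since a slope/intercept form cannot represent that case.

```python
import numpy as np

def line_features(line):
    """Return (slope, intercept, centroid_x, centroid_y) for one polyline."""
    line = np.asarray(line, dtype=float)
    x, y = line[:, 0], line[:, 1]
    # Least-squares straight line y = slope * x + intercept
    slope, intercept = np.polyfit(x, y, 1)
    cx, cy = line.mean(axis=0)
    return np.array([slope, intercept, cx, cy])

print(line_features([[0, 1], [1, 3], [2, 5]]))
```

Stacking these rows for all lines gives a feature matrix for DBSCAN or another clusterer; the columns are on different scales, so some form of scaling is probably needed first.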
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I'm testing some things in image retrieval and I was thinking about how to sort out bad pictures from a dataset. For example, there are mostly pictures of houses, but in between there is a picture of people and some of cars. In the end I want to keep only the houses.
At the moment my approach looks like this:
computing descriptors (SIFT) of all pictures
clustering all descriptors with k-means
creating histograms of the pictures by computing the euclidean distance between the cluster centers and the descriptors of a picture
clustering the histograms again.
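For reference, the histogram step is commonly done by assigning each descriptor to its nearest cluster center and counting the assignments. A minimal NumPy sketch under that assumption; the function name bow_histogram and the tiny 2-D "descriptors" are made up, and real SIFT descriptors would be 128-dimensional.

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Normalized histogram of nearest-center assignments for one picture."""
    descriptors = np.asarray(descriptors, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Distance from every descriptor to every cluster center
    d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(centers))
    return counts / counts.sum()

centers = [[0, 0], [10, 10]]
descriptors = [[0, 1], [9, 9], [10, 11]]
print(bow_histogram(descriptors, centers))
```

Normalizing by the number of descriptors keeps pictures with many keypoints comparable to pictures with few.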
At this point I have a first sort (which isn't really good). Now my idea is to take all pictures which are clustered to a center with len(center) > 1 and cluster them again and again. The result is that the pictures which are alone in a center will be sorted out. Maybe it's enough to fit the result again to the same k-means without clustering again?!
The result isn't satisfying, so maybe someone has a good idea.
For clustering etc. I'm using the k-means of scikit-learn.
K-means is not very robust to noise; and your "bad pictures" probably can be considered as such. Furthermore, k-means doesn't work too well for sparse data; as the means will not be sparse.
You may want to try other, more modern, clustering algorithms that can handle this situation much better.
I don't have the solution to your problem but here is a sanity check to perform prior to the final clustering, to check that the kind of features you extracted is suitable for your problem:
extract the histogram features for all the pictures in your dataset
compute the pairwise distances of all the pictures in your dataset using the histogram features (you can use sklearn.metrics.pairwise_distances)
np.argsort the raveled distance matrix to find the indices of the 20 closest pairs of distinct pictures according to your features (you have to filter out the zero-valued diagonal elements of the distance matrix), and do the same to extract the 20 farthest pairs of pictures based on your histogram features.
Visualize (for instance with plt.imshow) the pictures of the closest pairs and check that they are all pairs that you would expect to be very similar.
Visualize the pictures of the farthest pairs and check that they are all very dissimilar.
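The pair-ranking part of this check could be sketched as follows in plain NumPy. The function name extreme_pairs and the toy one-dimensional "features" are made up; using the upper triangle of the distance matrix is one way to skip the diagonal and avoid listing each pair twice.

```python
import numpy as np

def extreme_pairs(features, top=20):
    """Return (closest, farthest): lists of (i, j) index pairs with i < j."""
    features = np.asarray(features, dtype=float)
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    # Upper triangle without the diagonal: each distinct pair exactly once
    i, j = np.triu_indices(len(features), k=1)
    order = np.argsort(d[i, j])
    closest = [(int(i[t]), int(j[t])) for t in order[:top]]
    farthest = [(int(i[t]), int(j[t])) for t in order[::-1][:top]]
    return closest, farthest

closest, farthest = extreme_pairs([[0.0], [1.0], [10.0]], top=1)
print(closest, farthest)
```

The returned index pairs can be used to load and display the corresponding pictures side by side.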
If one of those 2 checks fails, then it means that a histogram of bag-of-SIFT-words features is not suitable for your task. Maybe you need to extract other kinds of features (e.g. HoG features) or reorganize the way you extract the clusters of SIFT descriptors, maybe using a pyramidal pooling structure to extract info on the global layout of the pictures at various scales.