I try to write an algorithm for clustering, now I like to create some easy 2D test cases: I like to generate points in [0, 1]x[0, 1] that build clusters.
E.g. something like this:
It would be better if the clusters have different (but random) shapes, e.g. like:
Is there an easy way to do this with python / numpy? Unfortunately the generation must be very efficient. I wrote some code, but the clusters always have the same shape and they are often far away from each other. Probably already a nice algorithm exists?
Thank you
No, there isn't a packaged way to do this. However, the generation algorithms aren't that hard to write. The first one appears to be a Gaussian distribution in each dimension (X and Y), repeating the generation for each of three centroids. Alternately, perhaps it's a uniform direction, with a "decay function" distance.
The second is a pair of sets: choose the radius from a Gaussian with a small variance, while the direction is uniform over the full circle. Do that for mean radius 1 and mean radius 3.
Does that get you moving?
Related
I have two sets of data points; effectively, one is from a preimage and the other from its image, but I do not know the rule between the two. This rule/function is nonlinear.
I've collected many data points of corresponding locations on both images, and I was wondering if anyone knew of a way to find a more complete mapping. That is, does anyone know the best way to find a mapping from R^2 to R^2 with an extensive set of sample points. This mapping is one-to-one and onto.
My goal is to use the data I've found to find a polynomial function that takes in some x,y coordinate from the preimage, and outputs the shifted coordinates.
edit: I have sample points along the domain and their corresponding points in the image, but not for every point in the domain. I want to be able to input any point (only integer values) in the domain and output the shifted point.
I don't think polynomial is easy (or easy to guarantee is a bijection). The obvious thing to do is to
Construct the delaunay triangulation of the known points in the domain.
For each delaunay triangle the mapping is just the linear mapping which interpolates the map on the vertices.
Then, when you have a random point, look up its delaunay triangle, and apply the requisite map.
I believe that all of the above can be done via scipy.spatial.delaunay.
The transformation you're trying to find sounds a lot like what's accomplished in Geographic Information Systems using a technique called rubber-sheeting https://en.wikipedia.org/wiki/Rubbersheeting
Igor Rivin's description of a process using a Delaunay triangulation is pretty much the solution that's used in such systems. Some systems will use a Barycentric coordinate system rather than a linear mapping to try to reduce the appearance of triangle-related artifacts in the transformed image.
What you are describing also sounds a bit like the "morphing" special effect used in video. Maybe a web search on that topic would turn up some leads for you.
I’m trying to compare some OpenFOAM CFD simulations that reproduce the flow through an almost spherical object(the object is a reconstruction, with an irregular shape), searching for the one with the smallest cluster of wall shear stress. So, I want to know if there is a way of running a clustering algorithm on this irregular surface, like K Means, EM or other unsupervised algorithm. In other words, I would like to numerically compare the colormap plotted in slightly different shapes, taking the area and the mean value of the clusters as parameters to do this comparison, for example. Someone has ever handled a similar situation?
I've already tried to project this object to a plane or a sphere, but the distortion was greater than expected, and this became no longer an option.
I'm referencing this question and this documentation in trying to turn a set of points (the purple dots in the image below) into an interpolated grid.
As you can see, the image has missing spots where dots should be. I'd like to figure out where those are.
import numpy as np
from scipy import interpolate
CIRCLES_X = 25 # There should be 25 circles going across
CIRCLES_Y = 10 # There should be 10 circles going down
points = []
values = []
# Points range from 0-800 ish X, 0-300 ish Y
for point in points:
points.append([points.x, points.y])
values.append(1) # Not sure what this should be
grid_x, grid_y = np.mgrid[0:CIRCLES_Y, 0:CIRCLES_X]
grid = interpolate.griddata(points, values, (grid_x, grid_y), method='linear')
print(grid)
Whenever I print out the result of the grid, I get nan for all of my values.
Where am I going wrong? Is my problem even the correct use case for interpolate.grid?
First, your uncertain points are mainly at an edge, so it's actually extrapolation. Second, interpolation methods built into scipy deal with continuous functions defined on the entire plane and approximate it as a polynomial. While yours is discrete (1 or 0), somewhat periodic rather than polynomial and only defined in a discrete "grid" of points.
So you have to invent some algorithm to inter/extrapolate your specific kind of function. Whether you'll be able to reuse an existing one - from scipy or elsewhere - is up to you.
One possible way is to replace it with some function (continuous or not) defined everywhere, then calculate that approximation in the missing points - whether as one step as scipy.interpolate non-class functions do or as two separate steps.
e.g. you can use a 3-D parabola with peaks in your dots and troughs exactly between them. Or just with ones in the dots and 0's in the blanks and hope the resulting approximation in the grid's points is good enough to give a meaningful result (random overswings are likely). Then you can use scipy.interpolate.RegularGridInterpolator for both inter- and extrapolation.
or as a harmonic function - then what you're seeking is Fourier transformation
Another possible way is to go straight for a discrete solution rather than try to shoehorn the continual mathanalysis' methods into your case: design a (probably entirely custom) algorithm that'll try to figure out the "shape" and "dimensions" of your "grids of dots" and then simply fill in the blanks. I'm not sure if it is possible to add it into the scipy.interpolate's harness as a selectable algorithm in addition to the built-in ones.
And last but not the least. You didn't specify whether the "missing" points are points where the value is unknown or are actual part of the data - i.e. are incorrect data. If it's the latter, simple interpolation is not applicable at all as it assumes that all the data are strictly correct. Then it's a related but different problem: you can approximate the data but then have to somehow "throw away irregularities" (higher order of smallness entities after some point).
Suppose we have 1000 random data points in a cube (as shown in the following image). The distribution of points in X and Y directions are uniform but not in Z direction. As we get deeper, the data points are denser. Is there any straightforward way in python to cluster these data points such that:
each cluster has equal size
each cluster consists of local points, i.e., each cluster consists of points being close to each other.
I have already tried K-means clustering from Scipy package but it did not give me a good result and the points of each cluster were very widespread rather than being concentrated.
Try using Scikit-Learn's implementation. They initialize their clusters using a technique known as "K-Means++" which picks the first means probabilistically to get an optimal starting distribution. This creates a higher probability of a good result.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
I have a 2D data and it contains five peaks. Could I fit five 2D Gaussians function to obtain the peaks? In my problem, the peaks do not refer to the clustering problem. Which I think EM would be an appropriate answer for it.
In my case I measure a variable in x-y space and it shows maximum in more than one position. Is still fitting Fourier series or using Expectation-Maximization method an applicable solution to my problem?
In order to make my likelihood, do I need to just add up the five 2D Gaussians distributions with x and y and the height of each peak as variables?
If I understand what you're asking, check out Gaussian Mixture Models and Expectation Maximization. I don't know of any pre-implemented versions of these in Python, although I haven't looked too hard.