I have a set of values D:
[[ 6.83822474 3.54843586]
[ 12.45778114 4.42755159]
[ 10.27710359 9.47337879]
...,
[ 46.55259568 64.73755611]
[ 51.50842754 44.60132979]
Given a Multivariate Gaussian distribution with mean M and covariance V:
What is the equivalent multivariate case of a univariate point being within 2 standard deviations of the mean? i.e. assuming I have a univariate distribution with mean A and std B, I can say a point x_i is within 2 standard deviations of the mean if |x_i - A| < 2B. What would be the equivalent of this in the multivariate case?
How would I compute all the points in D that are within 2 std's (or the equivalent in the multivariate case) from the mean M?
It sounds like the generalization that you want is the Mahalanobis distance. A Mahalanobis distance of 1 from the mean is the generalization of one standard deviation from the mean of a univariate Gaussian.
You can compute the Mahalanobis distance using functions in the module scipy.spatial.distance. (There is almost certainly code for this distance in some form in scikit-learn, and possibly statsmodels, but I haven't checked.)
For computing a single distance, there is scipy.spatial.distance.mahalanobis, and for computing distances among or between collections of points, you can use pdist and cdist, respectively (also from scipy.spatial.distance).
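For a single point, the call looks like this (using, for illustration, the same mean and covariance as in the script below, plus a made-up point):
import numpy as np
from scipy.spatial.distance import mahalanobis

M = np.array([10.0, 7.0])                  # mean
V = np.array([[9.0, -2.0], [-2.0, 2.0]])   # covariance matrix
point = np.array([12.0, 8.0])              # a hypothetical single observation

# mahalanobis() takes the *inverse* of the covariance matrix as its third argument.
d = mahalanobis(point, M, np.linalg.inv(V))
print(d)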
Here's a script that uses cdist. In the plot, the points circled in red are within a Mahalanobis distance of 2 from the mean.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
# Mean
M = [10, 7]
# Covariance matrix
V = np.array([[ 9, -2],
[-2, 2]])
VI = np.linalg.inv(V)
# Generate a sample from the multivariate normal distribution
# with mean M and covariance matrix V.
rng = np.random.default_rng()
x = rng.multivariate_normal(M, V, size=250)
# Compute the Mahalanobis distance of each point in the sample.
mdist = cdist(x, [M], metric='mahalanobis', VI=VI)[:,0]
# Find where the Mahalanobis distance is less than 2.
d2_mask = mdist < 2
x2 = x[d2_mask]
plt.plot(x2[:,0], x2[:,1], 'o',
markeredgecolor='r', markerfacecolor='w', markersize=6, alpha=0.6)
plt.plot(x[:,0], x[:,1], 'k.', markersize=5, alpha=0.5)
plt.grid(alpha=0.3)
plt.axis('equal')
plt.show()
The correct way to define distance for the multivariate case is the Mahalanobis distance, i.e. d(x) = sqrt((x - M)^T V^(-1) (x - M)).
An example of doing this would be:
import numpy as np
vals = np.array([[ 6.83822474, 3.54843586],
[ 12.45778114, 4.42755159],
[ 10.27710359, 9.47337879],
[ 46.55259568, 64.73755611],
[ 51.50842754, 44.60132979]])
# Compute covariance matrix and its inverse
cov = np.cov(vals.T)
cov_inverse = np.linalg.inv(cov)
# Mean center the values
mean = np.mean(vals, axis=0)
centered_vals = vals - mean
# Compute Mahalanobis distance
dist = np.sqrt(np.sum(centered_vals * cov_inverse.dot(centered_vals.T).T, axis=1))
# Find points that are "far away" from the mean
indices = dist > 2
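To pick out the points within that threshold instead (which is what the question asks for), index with the complementary mask:
within = vals[dist <= 2]  # rows of vals within a Mahalanobis distance of 2 of the mean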
I'm currently learning about the Mahalanobis distance and I find it quite difficult. To understand the idea better, I generated two sets of random values (x and y) and a random point; all three have mean = 0 and standard deviation = 1. How can I calculate the Mahalanobis distance between them? Please find my Python code below.
Many thanks for your help!
import random
import numpy as np
from numpy import cov
from scipy.spatial import distance

# generate 20 random values where mean = 0 and standard deviation = 1,
# assign one set to x and one to y
x = [random.normalvariate(0, 1) for i in range(20)]
y = [random.normalvariate(0, 1) for i in range(20)]
r_point = [random.normalvariate(0, 1)]  # that's my random point

sigma = cov(x, y)
print(sigma)
print("random point =", r_point)

# use the covariance to calculate the mahalanobis distance from a random point
Here is an example that shows how to compute the Mahalanobis distance of a point r_point to some data. The Mahalanobis distance takes into account the variance and correlation of the data you are measuring the distance to (using the inverse of its covariance matrix). Here, the Mahalanobis distance and the Euclidean distance should be very close because of the distribution of the data (0-mean and standard-deviation of 1). For other data, they will be different.
import numpy as np
N = 5000
mean = 0.0
stdDev = 1.0
data = np.random.normal(mean, stdDev, (2, N)) # 2D random points
r_point = np.random.randn(2)
cov = np.cov(data)
mahalanobis_dist = np.sqrt(r_point.T @ np.linalg.inv(cov) @ r_point)
print("Mahalanobis distance = ", mahalanobis_dist)
euclidean_dist = np.sqrt(r_point.T @ r_point)
print("Euclidean distance = ", euclidean_dist)
My data is like this:
powerplantname, latitude, longitude, powergenerated
A, -92.3232, 100.99, 50
B, <lat>, <long>, 10
C, <lat>, <long>, 20
D, <lat>, <long>, 40
E, <lat>, <long>, 5
I want to be able to cluster the data into N clusters (say 3). Normally I would use k-means:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2, whiten
coordinates= np.array([
[lat, long],
[lat, long],
...
[lat, long]
])
x, y = kmeans2(whiten(coordinates), 3, iter = 20)
plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
plt.show()
The problem with this is that it does not account for any weighting (in this case, my powergenerated value). Ideally I want my clusters to take the value "powergenerated" into account, keeping the clusters not only spatially close, but also with roughly equal total powergenerated.
Should I be doing this with kmeans (or some other method)? Or is there something else I should be using for this problem that would be better?
Or is there something else I should be using for this problem that would be better?
In order to take into account both the geographical distance among centrals and the generated power, you should define a proper metric. The function below computes the distance between two points on the Earth's surface from their latitudes and longitudes through the haversine formula, and adds the absolute value of the generated-power difference multiplied by a weighting factor. The value of the weight determines the relative influence of distance and power difference in the clustering process.
import numpy as np
def custom_metric(central_1, central_2, weight=1):
    lat1, lng1, pow1 = central_1
    lat2, lng2, pow2 = central_2
    lat1, lat2, lng1, lng2 = np.deg2rad(np.asarray([lat1, lat2, lng1, lng2]))

    # Haversine formula for the great-circle distance in kilometres
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    h = (1 - np.cos(dlat))/2. + np.cos(lat1)*np.cos(lat2)*(1 - np.cos(dlng))/2.
    km = 2*6371*np.arcsin(np.sqrt(h))

    # Absolute difference of generated power, scaled by the weight
    MW = np.abs(pow2 - pow1)

    return km + weight*MW
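For example, calling it on two made-up plants (the values are chosen purely for illustration):
plant_a = (40.4, -3.7, 50)   # hypothetical (lat, lng, power)
plant_b = (48.9, 2.35, 10)
print(custom_metric(plant_a, plant_b, weight=1))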
Should I be doing this with kmeans (or some other method)?
Unfortunately, the current implementations of SciPy's kmeans2 and scikit-learn's KMeans only support Euclidean distance. An alternative is to perform hierarchical clustering through SciPy's clustering package to group the centrals according to the metric just defined.
Demo
Let us first generate mock data, namely feature vectors for 8 centrals with random values:
N = 8
np.random.seed(0)
lat = np.random.uniform(low=-90, high=90, size=N)
lng = np.random.uniform(low=-180, high=180, size=N)
power = np.random.randint(low=5, high=50, size=N)
data = np.vstack([lat, lng, power]).T
The content of variable data yielded by the snippet above looks like this:
array([[ 8.7864, 166.9186, 21. ],
[ 38.7341, -41.9611, 10. ],
[ 18.4974, 105.021 , 20. ],
[ 8.079 , 10.4022, 5. ],
[ -13.7421, 24.496 , 23. ],
[ 26.2609, 153.2148, 40. ],
[ -11.2343, -154.427 , 29. ],
[ 70.5191, -148.6335, 34. ]])
To divide those data into three different groups we have to pass data and custom_metric to the linkage function (check the docs to find out more on parameter method), and then pass the returned linkage matrix to the cut_tree function with n_clusters=3.
from scipy.cluster.hierarchy import linkage, cut_tree
Z = linkage(data, method='average', metric=custom_metric)
y = cut_tree(Z, 3).flatten()
As a result we get the group membership (array y) for each central:
array([0, 1, 0, 2, 2, 0, 0, 1])
The results above depend on the value of weight. If you wish to use a value different to 1 (for example 250) you can change the default value like this:
def custom_metric(central_1, central_2, weight=250):
Alternatively, you could set the parameter metric in the call to linkage to a lambda expression as follows: metric=lambda x, y: custom_metric(x, y, 250).
Finally, to gain a deeper insight into the hierarchical/agglomerative clustering you could plot it as a dendrogram:
from scipy.cluster.hierarchy import dendrogram
dendrogram(Z)
If you are looking for a solution where you form clusters based on the coordinates, with power acting as a weight on those coordinates, you can pass sample_weight=power. This will give you clusters based on the coordinates, with each centroid leaning towards the higher-weight observations in its cluster.
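A minimal sketch of that idea with scikit-learn's KMeans (the coordinates and powers below are made up, not the OP's data):
import numpy as np
from sklearn.cluster import KMeans

coords = np.array([[10.0, 20.0],   # hypothetical [lat, long] pairs
                   [10.5, 20.5],
                   [11.0, 19.5],
                   [40.0, -3.0],
                   [41.0, -2.5]])
power = np.array([50, 10, 20, 40, 5])   # powergenerated, used as sample weights

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(coords, sample_weight=power)
print(labels)
As the next answer points out, this only pulls each centroid towards the heavier points; it does not balance the total power across clusters.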
Summary
There appears to be a lot of confusion between the OP and the answers. A brief summary:
Input:
power plants with lat/lon and generated power [3D-array]
Desired output:
clusters (groups of power plants) with similar cumulative generated power
power plants in a cluster must be geographically close/coherent
Partial solutions
any k-means implementation (only takes care of geographical proximity and coherence, not of the weights)
scikit-learn's weighted k-means (despite the sample_weight parameter, it does not weight the data points towards equal cluster totals; it only moves the cluster centroids towards the cluster's centre of gravity)
the accepted answer doesn't respect output condition no. 2 (geographical coherence)
Solution
The only solution I found is this repo. Confusingly, it is also called "weighted k-means", but unlike scikit-learn's implementation it really does fulfil both criteria above.
To get started clone the repo and run example.py.
For my use case the results are pretty good.
Once you get to the point of adding the cluster numbers back to your original dataframe, unfortunately a small hack is needed but it still works.
More specifically, given a natural number d, how can I generate random vectors in R^d such that each vector x has Euclidean norm <= 1?
Generating random vectors via numpy.random.rand(1, d) is no problem, but the likelihood of such a random vector having norm <= 1 is predictably bad even for moderately large d. For example, even for d = 10 only about 0.2% of such random vectors have appropriately small norm. So that seems like a silly solution.
EDIT: Re: Walter's comment, yes, I'm looking for a uniform distribution over vectors in the unit ball in R^d.
Based on the Wolfram Mathworld article on hypersphere point picking and Nate Eldredge's answer to a similar question on math.stackexchange.com, you can generate such a vector by generating a vector of d independent Gaussian random variables and a random number U uniformly distributed over the closed interval [0, 1], then normalizing the vector to norm U^(1/d).
Based on the answer by user2357112, you need something like this:
import numpy as np
...
inv_d = 1.0 / d
for ...:
    gauss = np.random.normal(size=d)
    length = np.linalg.norm(gauss)
    if length == 0.0:
        x = gauss
    else:
        r = np.random.rand() ** inv_d
        x = np.multiply(gauss, r / length)
        # conceptually: / length followed by * r
    # do something with x
(this is my second Python program, so don't shoot at me...)
The tricks are that
the combination of d independent gaussian variables with same σ is a gaussian distribution in d dimensions, which, remarkably, has spherical symmetry,
the gaussian distribution in d dimensions can be projected onto the unit sphere by dividing by the norm, and
the uniform distribution in a d-dimensional unit sphere has cumulative radial distribution r^d (which is what you need to invert)
This is the Python / NumPy code I am using. Since it does not use loops, it is much faster:
import numpy as np

n_vectors = 1000
d = 2
rnd_vec = np.random.uniform(-1, 1, size=(n_vectors, d))  # the initial random vectors
unif = np.random.uniform(size=n_vectors)  # a second array of random numbers
scale_f = np.expand_dims(np.linalg.norm(rnd_vec, axis=1) / unif, axis=1)  # the scaling factors
rnd_vec = rnd_vec / scale_f  # the random vectors in R^d
The second array of random numbers (unif) is needed as a second scaling factor because otherwise all the vectors would have Euclidean norm equal to one. Note that this makes the radii uniform in [0, 1], which is not the same as a uniform distribution over the ball; for that, the radius should be drawn as unif**(1/d), as described in the first answer.
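For comparison, here is a vectorized sketch of the Gaussian-direction recipe from the first answer, which does give a uniform distribution over the unit ball:
import numpy as np

n_vectors, d = 1000, 2
gauss = np.random.normal(size=(n_vectors, d))                 # isotropic directions
radius = np.random.uniform(size=(n_vectors, 1)) ** (1.0 / d)  # radii with CDF r**d
points = gauss / np.linalg.norm(gauss, axis=1, keepdims=True) * radius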
My Python programming problem is the following:
I want to create an array of measurement results. Each result can be described as a normal distribution for which the mean value is the measurement result itself and the standard deviation is its uncertainty.
Pseudo code could be:
x1 = N(result1, unc1)
x2 = N(result2, unc2)
...
x = array(x1, x2, ..., xN)
Then I would like to calculate the FFT of x:
f = numpy.fft.fft(x)
What I want is that the uncertainty of the measurements contained in x is propagated through the FFT calculation so that f is an array of amplitudes along with their uncertainty like this:
f = (a +/- unc(a), b +/- unc(b), ...)
Can you suggest me a way to do this?
Each Fourier coefficient computed by the discrete Fourier transform
of the array x is a linear combination of the elements of x; see
the formula for X_k on the wikipedia page on the discrete Fourier transform,
which I'll write as
X_k = sum_(n=0)^(n=N-1) [ x_n * exp(-i*2*pi*k*n/N) ]
(That is, X is the discrete Fourier transform of x.)
If x_n is normally distributed with mean mu_n and variance sigma_n**2,
then a little bit of algebra shows that the variance of X_k is the sum
of the variances of x_n
Var(X_k) = sum_(n=0)^(n=N-1) sigma_n**2
In other words, the variance is the same for each Fourier coefficient;
it is the sum of the variances of the measurements in x.
Using your notation, where unc(z) is the standard deviation of z,
unc(X_0) = unc(X_1) = ... = unc(X_(N-1)) = sqrt(unc(x1)**2 + unc(x2)**2 + ...)
(Note that the distribution of the magnitude of X_k is the Rice distribution.)
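In code, that propagated uncertainty is a one-liner (the array unc_x below is a made-up example of per-sample standard deviations):
import numpy as np

unc_x = np.array([0.1, 0.2, 0.05, 0.3])  # hypothetical measurement uncertainties
unc_X = np.sqrt(np.sum(unc_x**2))        # standard deviation of every Fourier coefficient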
Here's a script that demonstrates this result. In this example, the standard
deviation of the x values increases linearly from 0.01 to 0.5.
import numpy as np
from numpy.fft import fft
import matplotlib.pyplot as plt
np.random.seed(12345)
n = 16
# Create 'x', the vector of measured values.
t = np.linspace(0, 1, n)
x = 0.25*t - 0.2*t**2 + 1.25*np.cos(3*np.pi*t) + 0.8*np.cos(7*np.pi*t)
x[:n//3] += 3.0
x[::4] -= 0.25
x[::3] += 0.2
# Compute the Fourier transform of x.
f = fft(x)
num_samples = 5000000
# Suppose the std. dev. of the 'x' measurements increases linearly
# from 0.01 to 0.5:
sigma = np.linspace(0.01, 0.5, n)
# Generate 'num_samples' arrays of the form 'x + noise', where the standard
# deviation of the noise for each coefficient in 'x' is given by 'sigma'.
xn = x + sigma*np.random.randn(num_samples, n)
fn = fft(xn, axis=-1)
print("Sum of input variances: %8.5f" % (sigma**2).sum())
print()
print("Variances of Fourier coefficients:")
np.set_printoptions(precision=5)
print(fn.var(axis=0))
# Plot the Fourier coefficients of the first 800 arrays.
num_plot = min(num_samples, 800)
fnf = fn[:num_plot].ravel()
clr = "#4080FF"
plt.plot(fnf.real, fnf.imag, 'o', color=clr, mec=clr, ms=1, alpha=0.3)
plt.plot(f.real, f.imag, 'kD', ms=4)
plt.grid(True)
plt.axis('equal')
plt.title("Fourier Coefficients")
plt.xlabel(r"$\Re(X_k)$")
plt.ylabel(r"$\Im(X_k)$")
plt.show()
The printed output is
Sum of input variances: 1.40322
Variances of Fourier coefficients:
[ 1.40357 1.40288 1.40331 1.40206 1.40231 1.40302 1.40282 1.40358
1.40376 1.40358 1.40282 1.40302 1.40231 1.40206 1.40331 1.40288]
As expected, the sample variances of the Fourier coefficients are
all (approximately) the same as the sum of the measurement variances.
Here's the plot generated by the script. The black diamonds are the
Fourier coefficients of a single x vector. The blue dots are the
Fourier coefficients of 800 realizations of x + noise. You can see that
the point clouds around each Fourier coefficient are roughly symmetric
and all the same "size" (except, of course, for the real coefficients,
which show up in this plot as horizontal lines on the real axis).
I have two lists (of different lengths) of numbers.
Using Python, I want to calculate histograms with say 10 bins.
Then I want to smooth these two histograms with a standard kernel (Gaussian kernel with mean = 0, sigma = 1).
Then I want to calculate the KL distance between these 2 smoothed histograms.
I found some code for histogram calculation, but I am not sure how to apply a standard kernel for smoothing and then how to calculate the KL distance.
Please help.
For calculating histograms you can use numpy.histogram(), and for Gaussian smoothing scipy.ndimage.gaussian_filter(). Kullback-Leibler divergence code can be found here.
A method to do the required calculation would look something like this:
import numpy as np
from scipy.ndimage import gaussian_filter

def kl(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Note: p and q should be proper probability distributions for this to be a
    # true Kullback-Leibler divergence.
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))

def smoothed_hist_kl_distance(a, b, nbins=10, sigma=1):
    # Histogram counts (pass density=True to np.histogram for normalized histograms)
    ahist, bhist = (np.histogram(a, bins=nbins)[0],
                    np.histogram(b, bins=nbins)[0])

    # Gaussian smoothing of the two histograms
    asmooth, bsmooth = (gaussian_filter(ahist, sigma),
                        gaussian_filter(bhist, sigma))

    return kl(asmooth, bsmooth)
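A quick usage sketch with made-up samples:
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=500)
b = rng.normal(0.5, 1.2, size=800)
print(smoothed_hist_kl_distance(a, b, nbins=10, sigma=1))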