Assume I have two sets of points A and B. A is a matrix of size N-by-D and B is a matrix of size M-by-D. Both sets have the same dimensions D but may be composed of different numbers of samples N,M > 1 (assume that number of samples is sufficiently large).
I am looking for an efficient metric to determine to what degree set A overlaps with set B. This metric should have the following properties:
It should return a large value when set A is contained in set B
It should return an intermediate value when set A partially overlaps with set B
It should return a small value when set A and set B do not overlap
I have thought of some ways to achieve this, but none of them quite make the cut:
Determine the convex hull of B, then calculate what percentage of A lies within this convex hull. This is more or less reliable (assuming B is sufficiently convex), but calculating the convex hull becomes prohibitively expensive for large D.
Estimate the mean and covariance of A and B and calculate a Kullback-Leibler divergence between the two resulting multivariate Gaussian distributions. This is reasonably efficient, but does not distinguish the case when A is completely embedded in B but has significantly lower spread, and when A and B have similar spread but only overlap partially.
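For reference, the convex-hull idea from the first item can be sketched as below. This is only a minimal sketch for small D, assuming `scipy.spatial.Delaunay` is an acceptable way to test hull membership; the helper name `fraction_inside_hull` and the test data are made up for illustration.

```python
import numpy as np
from scipy.spatial import Delaunay

def fraction_inside_hull(A, B):
    """Fraction of the rows of A that lie inside the convex hull of B.

    Hypothetical helper: a point is inside the hull exactly when it falls
    into one of the simplices of the Delaunay triangulation of B.
    """
    triangulation = Delaunay(B)
    inside = triangulation.find_simplex(A) >= 0  # -1 marks points outside
    return float(np.mean(inside))

rng = np.random.default_rng(0)
B = rng.uniform(-1.0, 1.0, size=(500, 3))
A_inside = rng.uniform(-0.2, 0.2, size=(50, 3))   # well inside B
A_outside = A_inside + 10.0                        # far away from B

print(fraction_inside_hull(A_inside, B))
print(fraction_inside_hull(A_outside, B))
print(fraction_inside_hull(np.vstack([A_inside, A_outside]), B))
```

The cost of the triangulation grows very quickly with D, which is the limitation described above.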
Do you have any other ideas on how I could tackle this issue? Below I have provided an example code illustrating the problem with using the Kullback-Leibler divergence:
import numpy as np
import scipy.stats
N = 1000
M = 1000
D = 3
def metric_KLD(A,B):
    # Get the dimension of A and B
    D = A.shape[-1]
    # Estimate mean and cov of A
    A_mean = np.mean(A,axis=0)
    A_cov = np.cov(A.T)
    # Estimate mean and cov of B
    B_mean = np.mean(B,axis=0)
    B_cov = np.cov(B.T)
    # Calculate the KLD between the Gaussian fitted to A and the one fitted to B
    B_cov_inv = np.linalg.inv(B_cov)
    diff = B_mean - A_mean
    KLD = 0.5*(np.log(np.linalg.det(B_cov)/np.linalg.det(A_cov)) - D
               + np.trace(B_cov_inv @ A_cov)
               + diff @ B_cov_inv @ diff)
    return KLD
# Case 1: Both distributions overlap perfectly
A1 = scipy.stats.multivariate_normal.rvs(
mean = np.zeros(D),
cov = np.identity(D)*10**2,
size = N)
B1 = scipy.stats.multivariate_normal.rvs(
mean = np.zeros(D),
cov = np.identity(D)*10**2,
size = M)
print('KLD case 1: '+str(metric_KLD(A1,B1)))
# Case 2: Both distributions overlap partially
A2 = scipy.stats.multivariate_normal.rvs(
mean = np.asarray([0,0,0]),
cov = np.identity(D)*10**2,
size = N)
B2 = scipy.stats.multivariate_normal.rvs(
mean = np.asarray([10,0,0]),
cov = np.identity(D)*10**2,
size = M)
print('KLD case 2: '+str(metric_KLD(A2,B2)))
# Case 3: Both distributions don't overlap at all
A3 = scipy.stats.multivariate_normal.rvs(
mean = np.asarray([0,0,0]),
cov = np.identity(D)*10**2,
size = N)
B3 = scipy.stats.multivariate_normal.rvs(
mean = np.asarray([30,30,30]),
cov = np.identity(D)*10**2,
size = M)
print('KLD case 3: '+str(metric_KLD(A3,B3)))
# Case 4: A is included in B, but A has significantly smaller spread
A4 = scipy.stats.multivariate_normal.rvs(
mean = np.asarray([0,0,0]),
cov = np.identity(D)*1**2,
size = N)
B4 = scipy.stats.multivariate_normal.rvs(
mean = np.asarray([0,0,0]),
cov = np.identity(D)*10**2,
size = M)
# This is problematic: Should be better than case 2 but isn't
print('KLD case 4: '+str(metric_KLD(A4,B4)))
Another idea is to compare the standard deviation within each set to the overall standard deviation in the pooled set. The more the sets overlap, the more similar these numbers will be.
To handle the case of one set with a small spread within the other one, you could just use the larger individual standard deviation for comparison with the pooled one.
This could be done for each dimension, and the final metric would be the average over all dimensions:
def metric_std(A, B):
    # standard deviations
    A_std = np.std(A, axis=0)
    B_std = np.std(B, axis=0)
    pooled_std = np.std(np.vstack([A, B]), axis=0)
    # metric: ratio of higher individual to pooled std
    std_ratio = np.max([A_std, B_std], axis=0) / pooled_std
    return np.mean(std_ratio)
print('case 1: '+str(metric_std(A1, B1)))
print('case 2: '+str(metric_std(A2, B2)))
print('case 3: '+str(metric_std(A3, B3)))
print('case 4: '+str(metric_std(A4, B4)))
case 1: 1.012685295955191
case 2: 0.973728554241719
case 3: 0.5569612281519413
case 4: 1.4071848466046386
Let's say I have a data matrix X with num_samples = 1600, dim_data = 2, from which I can build a 1600*1600 similarity matrix S using the rbf kernel. I can normalize each row of the matrix, by multiplying all entries of the row by (1 / sum(entries of the row)). This procedure gives me a (square) right stochastic matrix, which we expect to have an eigenvalue equal to 1 associated to a constant eigenvector full of 1s.
We can easily check that this is indeed an eigenvector by taking its product with the matrix. However, using scipy.linalg.eig the obtained eigenvector associated to eigenvalue 1 is only piecewise constant.
I have tried scipy.linalg.eig on similarly sized matrices with randomly generated data which I transformed into stochastic matrices and consistently obtained a constant eigenvector associated to eigenvalue 1.
My question is then, what factors may cause numerical instabilities when computing eigenvalues of stochastic matrices using scipy.linalg.eig?
Reproducible example:
def kernel(sigma,X):
    """
    param sigma: kernel bandwidth (enters the kernel as sigma**2)
    param X: (num_samples,data_dim)
    """
    squared_norm = np.expand_dims(np.sum(X**2,axis=1),axis=1) + np.expand_dims(np.sum(X**2,axis=1),axis=0)-2*np.einsum('ni,mi->nm',X,X)
    return np.exp(-0.5*squared_norm/sigma**2)
def normalize(array):
    # divide each row by its sum so that every row sums to 1
    degrees = []
    M = array.shape[0]
    for i in range(M):
        norm = sum(array[i,:])
        degrees.append(norm)
    degrees_matrix = np.diag(np.array(degrees))
    P = np.matmul(np.linalg.inv(degrees_matrix),array)
    return P
#generate the data
points = np.linspace(0,4*np.pi,1600)
Z = np.zeros((1600,2))
Z[0:800,:] = np.array([2.2*np.cos(points[0:800]),2.2*np.sin(points[0:800])]).T
Z[800:,:] = np.array([4*np.cos(points[0:800]),4*np.sin(points[0:800])]).T
X = np.zeros((1600,2))
X[:,0] = np.where(Z[:,1] >= 0, Z[:,0] + .8 + params[1], Z[:,0] - .8 + params[2])
X[:,1] = Z[:,1] + params[0]
#create the stochastic matrix P
P = normalize(kernel(.05,X))
#inspect the eigenvectors
e,v = scipy.linalg.eig(P)
p = np.flip(np.argsort(e))
e = e[p]
v = v[:,p]
#check on synthetic data:
Y = np.random.normal(size=(1600,2))
P = normalize(kernel(Y))
#inspect the eigenvectors
e,v = scipy.linalg.eig(P)
p = np.flip(np.argsort(e))
e = e[p]
v = v[:,p]
Using the code provided by Ahmed AEK, here are some results on the divergence of the obtained eigenvector from the constant eigenvector.
[-1.36116641e-05 -1.36116641e-05 -1.36116641e-05 ... 5.44472888e-06
5.44472888e-06 5.44472888e-06]
norm = 0.9999999999999999
max difference = 0.04986484253966891
max difference / element value -3663.3906291852545
I have observed that a low value of sigma in the construction of the kernel matrix produces a less sharp decay in the (sorted) eigenvalues. In fact, for sigma=0.05, the first 4 eigenvalues produced by scipy.linalg.eig are rounded up to 1. This may be linked to the imprecision in the eigenvectors. When sigma is increased to 0.5, I do obtain a constant eigenvector.
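One possible reading of this observation, offered as an assumption and not confirmed elsewhere in this thread: when sigma is small compared with the gaps in the data, the kernel matrix is numerically block-diagonal, so eigenvalue 1 is repeated once per block and any vector that is constant within each block is a valid eigenvector. The sketch below uses made-up data (two well-separated clusters) and a hypothetical helper `row_stochastic_rbf` to show the effect on a small scale.

```python
import numpy as np

def row_stochastic_rbf(X, sigma):
    """RBF kernel matrix with each row divided by its sum (hypothetical helper)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq_dist / sigma**2)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_per_cluster = 20
cluster_a = rng.normal(loc=0.0, scale=0.2, size=(n_per_cluster, 2))
cluster_b = rng.normal(loc=50.0, scale=0.2, size=(n_per_cluster, 2))
X = np.vstack([cluster_a, cluster_b])

# The clusters are so far apart that the cross-cluster kernel entries underflow to 0
P = row_stochastic_rbf(X, sigma=1.0)
ones = np.ones(len(X))
print("P @ 1 == 1:", np.allclose(P @ ones, ones))

eigvals, eigvecs = np.linalg.eig(P)
near_one = np.isclose(eigvals.real, 1.0)
print("eigenvalues close to 1:", int(near_one.sum()))

# Each returned eigenvector for eigenvalue 1 is constant within a cluster,
# but the two cluster levels do not have to be equal.
for v in eigvecs[:, near_one].real.T:
    spread_a = v[:n_per_cluster].max() - v[:n_per_cluster].min()
    spread_b = v[n_per_cluster:].max() - v[n_per_cluster:].min()
    print("within-cluster spread:", spread_a, spread_b)
```

In that situation the solver is free to return any basis of the eigenspace, which would look like a piecewise constant eigenvector and not like a numerical error.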
First 5 eigenvectors in the sigma=0.05 case
First 5 eigenvectors in the sigma=0.5 case
A 64-bit float gives an expected accuracy of about 14 digits, as shown here, which means that any result will only be accurate to about 14 digits.
Using the code below you can check this result:
Y = np.random.normal(size=(1600,2))
P = normalize(kernel(5,Y))
P = P / np.sum(P,axis=1,keepdims=True) # keepdims so that each row is divided by its own sum
#inspect the eigenvectors
e,v = np.linalg.eig(P)
p = np.flip(np.argsort(e))
a = np.isclose(e,1)
e1 = e[a]
v1 = v[:,a]
v11 = v1[:,0]
print('norm = ',np.sum(v11**2))
print('max difference = ',np.amax(np.abs(np.diff(v11))))
print('max difference / element value =',np.amax(np.abs(np.diff(v11)))/v11[0])
result is:
[0.025+0.j 0.025+0.j 0.025+0.j ... 0.025+0.j 0.025+0.j 0.025+0.j]
norm = (1+0j)
max difference = 1.97758476261356e-16
max difference / element value = (7.91033905045416e-15+0j)
As you can see, the difference is accurate to within 8e-15, which is around 14 digits of precision. The norm will sometimes be 0.99999999999998, which is also within 14 digits of precision.
I actually have two solutions to this problem; both are applied below to a test case. Neither of them is perfect: the first one only takes the two end points into account, and the second one can't be made "arbitrarily smooth": there is a limit to the amount of smoothness one can achieve (the one I am showing).
I am sure there is a better solution, one that goes from the first solution to the other and all the way to no smoothing at all. It may already be implemented somewhere. Maybe solving a minimization problem with an arbitrary number of equidistributed splines?
Thank you very much for your help.
PS: the seed used is a challenging one.
import matplotlib.pyplot as plt
from scipy import interpolate
from scipy.signal import savgol_filter
import numpy as np
import random
def scipy_bspline(cv, n=100, degree=3):
    """ Calculate n samples on a bspline

        cv :      Array of control vertices
        n  :      Number of samples to return
        degree:   Curve degree
    """
    cv = np.asarray(cv)
    count = cv.shape[0]
    degree = np.clip(degree,1,count-1)
    kv = np.clip(np.arange(count+degree+1)-degree,0,count-degree)
    # Return samples (open curve: the parameter runs from 0 to count - degree)
    max_param = count - degree
    spl = interpolate.BSpline(kv, cv, degree)
    return spl(np.linspace(0,max_param,n))
def round_up_to_odd(f):
    return int(np.ceil(f / 2.) * 2 + 1)
def generateRandomSignal(n=1000, seed=None):
    """
    Parameters
    ----------
    n : integer, optional
        Number of points in the signal. The default is 1000.
    Returns
    -------
    sig : numpy array
    """
    # Assumption: the seeding step is missing from this listing; seed here if one is given
    if seed is not None:
        np.random.seed(int(seed))
    print("Seed was:", seed)
    steps = np.random.choice(a=[-1, 0, 1], size=(n-1))
    roughSig = np.concatenate([np.array([0]), steps]).cumsum(0)
    sig = savgol_filter(roughSig, round_up_to_odd(n/10), 6)
    return sig
# Generate a random signal to illustrate my point
n = 1000
t = np.linspace(0, 10, n)
seed = 45136. # Challenging seed
sig = generateRandomSignal(n=1000, seed=seed)
sigInit = np.copy(sig)
# Add noise to the signal
mean = 0
std = sig.max()/3.0
num_samples = n//5 # integer division: these values are used as indices and sizes
idxMin = n//2-100
idxMax = idxMin + num_samples
tCut = t[idxMin+1:idxMax]
noise = np.random.normal(mean, std, size=num_samples-1) + 2*std*np.sin(2.0*np.pi*tCut/0.4)
sig[idxMin+1:idxMax] += noise
# Define filtering range enclosing the noisy area of the signal
idxMin -= 20
idxMax += 20
# Extreme filtering solution
# Spline between first and last points, the points in between have no influence
sigTrim = np.delete(sig, np.arange(idxMin,idxMax))
tTrim = np.delete(t, np.arange(idxMin,idxMax))
f = interpolate.interp1d(tTrim, sigTrim, kind='quadratic')
sigSmooth1 = f(t)
# My attempt. Not bad but not perfect because there is a limit in the maximum
# amount of smoothing we can add (degree=len(tSlice) is the maximum)
# If I could do degree=10*len(tSlice) and converging to the first solution
# I would be done!
sigSlice = sig[idxMin:idxMax]
tSlice = t[idxMin:idxMax]
cv = np.stack((tSlice, sigSlice)).T
p = scipy_bspline(cv, n=len(tSlice), degree=len(tSlice))
tSlice = p.T[0]
sigSliceSmooth = p.T[1]
sigSmooth2 = np.copy(sig)
sigSmooth2[idxMin:idxMax] = sigSliceSmooth
# Plot
plt.plot(t, sig, label="Signal")
plt.plot(t, sigSmooth1, label="Solution 1")
plt.plot(t, sigSmooth2, label="Solution 2")
plt.plot(t[idxMin:idxMax], sigInit[idxMin:idxMax], label="What I'd want (kind of, smoother will be even better actually)")
plt.plot([t[idxMin],t[idxMax]], [sig[idxMin],sig[idxMax]],"o")
Yes, a minimization is a good way to approach this smoothing problem.
Least squares problem
Here is a suggestion for a least squares formulation: let s[0], ..., s[N] denote the N+1 samples of the given signal to smooth, and let L and R be the desired slopes to preserve at the left and right endpoints. Find the smoothed signal u[0], ..., u[N] as the minimizer of
min_u (1/2) sum_n (u[n] - s[n])² + (λ/2) sum_n (u[n+1] - 2 u[n] + u[n-1])²
subject to
s[0] = u[0], s[N] = u[N] (value constraints),
L = u[1] - u[0], R = u[N] - u[N-1] (slope constraints),
where in the minimization objective, the sums are over n = 1, ..., N-1 and λ is a positive parameter controlling the smoothing strength. The first term tries to keep the solution close to the original signal, and the second term penalizes u for bending to encourage a smooth solution.
The slope constraints require that
u[1] = L + u[0] = L + s[0] and u[N-1] = u[N] - R = s[N] - R. So we can consider the minimization as over only the interior samples u[2], ..., u[N-2].
Finding the minimizer
The minimizer satisfies the Euler–Lagrange equations
(u[n] - s[n]) / λ + (u[n+2] - 4 u[n+1] + 6 u[n] - 4 u[n-1] + u[n-2]) = 0
for n = 2, ..., N-2.
An easy way to find an approximate solution is by gradient descent: initialize u = np.copy(s), set u[1] = L + s[0] and u[N-1] = s[N] - R, and do 100 iterations or so of
u[2:-2] -= 0.05 * ((u - s)[2:-2] / λ + np.convolve(u, [1, -4, 6, -4, 1])[4:-4])
But with some more work, it is possible to do better than this by solving the E–L equations directly. For each n, move the known quantities to the right-hand side: s[n] and also the endpoints u[0] = s[0], u[1] = L + s[0], u[N-1] = s[N] - R, u[N] = s[N]. Then you will have a linear system "A u = b", and matrix A has rows like
0, ..., 0, 1, -4, (6 + 1/λ), -4, 1, 0, ..., 0.
Finally, solve the linear system to find the smoothed signal u. You could use numpy.linalg.solve to do this if N is not too large, or if N is large, try an iterative method like conjugate gradients.
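The direct solve described above might look like the following. This is a minimal sketch under the stated formulation, using a dense matrix and numpy.linalg.solve, so it is only meant for moderate N; the function name `smooth_fixed_ends`, the parameter name `lam` (standing in for λ) and the test signal are made up for illustration.

```python
import numpy as np

def smooth_fixed_ends(s, lam, L, R):
    """Solve the E-L equations for u[2..N-2] with end values and slopes fixed."""
    s = np.asarray(s, dtype=float)
    N = len(s) - 1
    if N < 4:
        raise ValueError("need at least 5 samples")
    u = s.copy()
    u[1] = s[0] + L        # left slope constraint
    u[N - 1] = s[N] - R    # right slope constraint
    interior = np.arange(2, N - 1)      # unknown samples n = 2, ..., N-2
    A = np.zeros((len(interior), len(interior)))
    b = s[interior] / lam
    stencil = {-2: 1.0, -1: -4.0, 0: 6.0 + 1.0 / lam, 1: -4.0, 2: 1.0}
    for row, n in enumerate(interior):
        for offset, coeff in stencil.items():
            k = n + offset
            if 2 <= k <= N - 2:
                A[row, k - 2] += coeff      # unknown sample: goes into A
            else:
                b[row] -= coeff * u[k]      # known sample: moved to the right-hand side
    u[interior] = np.linalg.solve(A, b)
    return u

# Made-up test signal: a noisy sine whose end values and end slopes are kept
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
clean = np.sin(t)
noisy = clean + 0.3 * rng.normal(size=t.size)
noisy[0], noisy[-1] = clean[0], clean[-1]
smoothed = smooth_fixed_ends(noisy, lam=50.0,
                             L=clean[1] - clean[0], R=clean[-1] - clean[-2])
```

Larger values of `lam` give a smoother result. For large N, a banded or sparse solver would be a more natural choice than the dense matrix used here.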
You can apply a simple smoothing method and plot the smoothed curves with different smoothness values to see which one works best.
def smoothing(data, smoothness=0.5):
    last = data[0]
    new_data = [data[0]]
    for datum in data[1:]:
        # weighted average of the previous and the current sample
        new_value = smoothness * last + (1 - smoothness) * datum
        new_data.append(new_value)
        last = datum
    return new_data
You can plot this curve for multiple values of smoothness and pick the curve which suits your needs. You can also apply this method to only a range of values in the actual curve by defining a start and an end index.
What would you do if you had n particles on a plane (with positions (x_n,y_n)), with a certain flux flux_n, and you have to pixelate these particles, so you have to go from (x,y) to (pixel_i, pixel_j) space and you have to sum up the flux of the m particles which fall in to every single pixel? Any suggestions? Thank you!
There are several ways to solve your problem.
Assumptions: your positions have been stored in two numpy arrays of shape (N, ), i.e. the position x_n (or y_n) for n in [0, N); let's call them x and y. The flux is stored in a numpy array with the same shape, fluxes.
Create something that looks like a grid:
#get minimums and maximums position
mins = int(x.min()), int(y.min())
maxs = int(x.max()), int(y.max())
#actually you can also add and subtract 1 or more unit
#in order to have a grid larger than the x, y extremes
#something like mins-=epsilon and maxs += epsilon
#create the grid
xx = np.arange(mins[0], maxs[0])
yy = np.arange(mins[1], maxs[1])
Now you can perform a double for loop, taking two consecutive elements of xx and yy each time. To do this, you can simply take:
x1 = xx[:-1] #excluding the last element
x2 = xx[1:] #excluding the first element
#the same for y:
y1 = yy[:-1] #excluding the last element
y2 = yy[1:] #excluding the first element
fluxes_grid = np.zeros((xx.shape[0], yy.shape[0]))
for i, (x1_i, x2_i) in enumerate(zip(x1, x2)):
    for j, (y1_j, y2_j) in enumerate(zip(y1, y2)):
        idx = np.where((x>=x1_i) & (x<x2_i) & (y>=y1_j) & (y<y2_j))[0]
        fluxes_grid[i,j] = np.sum(fluxes[idx])
At the end of this loop you have a grid whose elements are pixels representing the sum of fluxes.
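The same binning can also be done without explicit loops. The following is a minimal sketch using numpy.histogram2d with its weights argument; the helper name `pixelate_flux` and the tiny data set are made up for illustration. Note that histogram2d treats the last bin as closed on the right, which differs slightly from the strict `<` used in the loop above.

```python
import numpy as np

def pixelate_flux(x, y, fluxes, x_edges, y_edges):
    """Sum the flux of all particles falling into each pixel (hypothetical helper)."""
    grid, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges], weights=fluxes)
    return grid

# Four particles on a 3x3 pixel grid with pixel edges at 0, 1, 2, 3
x = np.array([0.5, 0.7, 1.5, 2.2])
y = np.array([0.1, 0.2, 1.9, 0.5])
fluxes = np.array([1.0, 2.0, 3.0, 4.0])
edges = np.arange(0, 4)

flux_grid = pixelate_flux(x, y, fluxes, edges, edges)
print(flux_grid)
```

Here `flux_grid[i, j]` holds the summed flux of the particles whose x falls in pixel column i and whose y falls in pixel row j.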
What happens if you have a lot of points, so many that the loop takes hours?
A faster solution is to use a quantization method, such as K Nearest Neighbors (KNN) on a rigid grid. There are many ways to run a KNN (including existing implementations, e.g. sklearn KNN), but it is very efficient if you can take advantage of a GPU. For example, this is my TensorFlow (version 2.1) implementation. After you have defined a square grid:
_min, _max = min(mins), max(maxs)
xx = np.arange(_min, _max)
yy = np.arange(_min, _max)
You can build the matrix, grid, and your position matrix, X:
grid = np.column_stack([xx, yy])
X = np.column_stack([x, y])
Then you have to define a pairwise Euclidean distance function:
def pairwise_dist(A, B):
    # squared norms of each row in A and B
    na = tf.reduce_sum(tf.square(A), 1)
    nb = tf.reduce_sum(tf.square(B), 1)
    # na as a column vector and nb as a row vector
    na = tf.reshape(na, [-1, 1])
    nb = tf.reshape(nb, [1, -1])
    # return the pairwise Euclidean distance matrix
    D = tf.sqrt(tf.maximum(na - 2*tf.matmul(A, B, False, True) + nb, 0.0))
    return D
#compute the pairwise distances:
D = pairwise_dist(grid, X)
D = D.numpy() #get a numpy matrix from a tf tensor
#D has shape M, N, where M is the number of points in the grid and N the number of positions.
#now take a rank and from this the best K (e.g. 10)
ranks = np.argsort(D, axis=1)[:, :10]
#for each point in the grid you have the nearest ten.
Now you have to take the fluxes corresponding to these 10 positions and sum them.
I have avoided specifying this second method further because I don't know the size of your catalogue, whether you have a GPU, or whether you want to use this kind of optimization.
I can expand this explanation if you are interested.
This is my second attempt at implementing gradient descent in one variable and it always diverges. Any ideas?
This is simple linear regression for minimizing the residual sum of squares in one variable.
def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1
    #some line to predict y values from x values
    #a predicted y-value has value mx + b
    #(starting from m = 1, b = 1 reproduces the first RSS value printed below)
    m = 1.
    b = 1.
    for i in range(0,10):
        #calculate y-value predictions for all x-values
        predicted_yvalues = list()
        for x in xvalues:
            predicted_yvalues.append(m*x + b)
        # predicted_yvalues holds the predicted y-values
        #now calculate the residuals = y-value - predicted y-value for each point
        residuals = list()
        number_of_points = len(yvalues)
        for n in range(0,number_of_points):
            residuals.append(yvalues[n] - predicted_yvalues[n])
        ## calculate the residual sum of squares from the residuals, that is,
        ## square each residual and add them all up. we will try to minimize
        ## the residual sum of squares later.
        residual_sum_of_squares = 0.
        for r in residuals:
            residual_sum_of_squares += r**2
        print("RSS = %s" % residual_sum_of_squares)
        #now make a version of the residuals which is multiplied by the x-values
        residuals_times_xvalues = list()
        for n in range(0,number_of_points):
            residuals_times_xvalues.append(residuals[n] * xvalues[n])
        #now create the sums for the residuals and for the residuals times the x-values
        residuals_sum = sum(residuals)
        residuals_times_xvalues_sum = sum(residuals_times_xvalues)
        # now multiply the sums by a positive scalar and add each to m and b.
        residuals_sum *= 0.1
        residuals_times_xvalues_sum *= 0.1
        b += residuals_sum
        m += residuals_times_xvalues_sum
        #and repeat until convergence.
        #convergence occurs when ||sum vector|| < some tolerance.
        # ||sum vector|| = sqrt( residuals_sum**2 + residuals_times_xvalues_sum**2 )
        #check for convergence
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break
    return (b, m)
RSS = 370433.0
RSS = 300170125.7
RSS = 4.86943013045e+11
RSS = 7.90447409339e+14
RSS = 1.28312217794e+18
RSS = 2.08287421094e+21
RSS = 3.38110045417e+24
RSS = 5.48849288217e+27
RSS = 8.90939341376e+30
RSS = 1.44624932026e+34
(-3.475524066284303e+16, -2.4195981188763203e+17)
The gradients are huge -- hence you are following large vectors for long distances (0.1 times a large number is large). Find unit vectors in the appropriate direction. Something like this (with comprehensions replacing your loops):
def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1
    # same starting line as in the question (m = 1, b = 1)
    m = 1.
    b = 1.
    for i in range(0,10):
        predicted_yvalues = [m*x+b for x in xvalues]
        residuals = [y-y_hat for y,y_hat in zip(yvalues,predicted_yvalues)]
        residual_sum_of_squares = sum(r**2 for r in residuals) #only needed for debugging purposes
        print("RSS = %s" % residual_sum_of_squares)
        residuals_times_xvalues = [r*x for r,x in zip(residuals,xvalues)]
        residuals_sum = sum(residuals)
        residuals_times_xvalues_sum = sum(residuals_times_xvalues)
        # (residuals_sum,residual_times_xvalues_sum) is a vector which points in the negative
        # gradient direction. *Find a unit vector which points in same direction*
        magnitude = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        residuals_sum /= magnitude
        residuals_times_xvalues_sum /= magnitude
        b += residuals_sum * (0.1)
        m += residuals_times_xvalues_sum * (0.1)
        #check for convergence -- this needs work!
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break
    return (b, m)
For example:
>>> gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 368732.1655050716
RSS = 367039.18363896786
RSS = 365354.0543519137
RSS = 363676.7775934381
RSS = 362007.3533123621
RSS = 360345.7814567845
RSS = 358692.061974069
RSS = 357046.1948108295
RSS = 355408.17991291644
(1.1157111313023558, 1.9932828425473605)
which is certainly much more plausible.
It isn't a trivial matter to make a numerically stable gradient-descent algorithm. You might want to consult a decent textbook in numerical analysis.
First, your code is right.
But you should consider the math when you do linear regression.
For example, if the residual is -205.8 and your learning rate is 0.1, you get a huge descent step of -20.58.
That step is so large that you can't get back to the correct m and b. You have to make your step small enough.
There are two ways to make the gradient descent step reasonable:
Use a small learning rate, such as 0.001 or 0.0003.
Divide your step by the total number of input values.
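Both suggestions can be combined as in the sketch below: the gradient is averaged over the data points and multiplied by a small learning rate. This is only an illustration; the function name `gradient_descent_mean`, the learning rate 0.01, the tolerance and the iteration cap are assumptions chosen for this data set, not values from the posts above.

```python
import numpy as np

def gradient_descent_mean(xvalues, yvalues, learning_rate=0.01,
                          tolerance=1e-8, max_iter=50000):
    """Fit y = m*x + b by gradient descent on the mean squared residual."""
    x = np.asarray(xvalues, dtype=float)
    y = np.asarray(yvalues, dtype=float)
    m, b = 1.0, 1.0   # same starting line as in the question
    for _ in range(max_iter):
        residuals = y - (m * x + b)
        # gradient of 0.5 * mean(residuals**2), i.e. divided by the number of points
        grad_b = -residuals.mean()
        grad_m = -(residuals * x).mean()
        if np.hypot(grad_b, grad_m) < tolerance:
            break
        b -= learning_rate * grad_b
        m -= learning_rate * grad_m
    return b, m

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [6, 23, 8, 56, 3, 24, 234, 76, 59, 567]
b_fit, m_fit = gradient_descent_mean(xs, ys)
m_ref, b_ref = np.polyfit(xs, ys, 1)   # closed-form least squares for comparison
print(b_fit, m_fit)
print(b_ref, m_ref)
```

The learning rate still has to be small compared with the scale of the x-values; with much larger x-values an even smaller rate, or rescaled inputs, would be needed.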
I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy) but I get an odd result. For a uniform distribution sampled from np.random.rand(), the Gini coefficient is 0.3, but I would have expected it to be close to 0 (perfect equality). What is going wrong here?
def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = []
    for b in bins:
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    # perfect equality area
    pe_area = np.trapz(bins, x=bins)
    # lorenz area
    lorenz_area = np.trapz(yvals, x=bins)
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val
v = np.random.rand(500)
bins, result, gini_val = G(v)
plt.subplot(2, 1, 1)
plt.plot(bins, result, label="observed")
plt.plot(bins, bins, '--', label="perfect eq.")
plt.xlabel("fraction of population")
plt.ylabel("fraction of wealth")
plt.title("GINI: %.4f" %(gini_val))
plt.subplot(2, 1, 2)
plt.hist(v, bins=20)
For the given set of numbers, the above code calculates the fraction of the total distribution's values that are in each percentile bin.
The result:
Uniform distributions should be near "perfect equality", so the Lorenz curve bending is off.
This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.
You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than the sample v = np.random.rand(500).
In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).
Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.
def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x).  *Don't* pass in huge
    # samples!)
    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad/np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g
(For some more efficient implementations, see More efficient weighted Gini coefficient in Python)
Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):
In [80]: v = np.random.rand(500)
In [81]: gini(v)
Out[81]: 0.32760618249832563
In [82]: v = 1 + np.random.rand(500)
In [83]: gini(v)
Out[83]: 0.11121487509454202
In [84]: v = 10 + np.random.rand(500)
In [85]: gini(v)
Out[85]: 0.01567937753659053
In [86]: v = 100 + np.random.rand(500)
In [87]: gini(v)
Out[87]: 0.0016594595244509495
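These values can be compared with the formula 1/(6*base + 3) stated above. The sketch below does that with a sort-based O(n log n) formula for the same mean-absolute-difference definition, so that larger samples are affordable; the function name `gini_sorted` and the sample size are made up for illustration.

```python
import numpy as np

def gini_sorted(x):
    """Gini coefficient from sorted values; same definition as the O(n**2) version."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    # sum over all pairs |x_i - x_j| equals 2 * sum_i (2*i - n - 1) * x_(i)
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))

rng = np.random.default_rng(0)
for base in (0, 1, 10, 100):
    v = base + rng.random(100_000)
    print(base, gini_sorted(v), 1 / (6 * base + 3))
```

For each base the two printed numbers should agree up to sampling noise.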
A slightly faster implementation (using numpy vectorization and only computing each difference once):
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))
Note: x must be a numpy array.
The Gini coefficient is derived from the area between the line of perfect equality and the Lorenz curve, and is usually calculated to analyze the distribution of income in a population. The linked page provides a simple implementation of this in Python.
A quick note on the original methodology:
When calculating Gini coefficients directly from areas under curves with np.trapz or another integration method, the first value of the Lorenz curve needs to be 0 so that the area between the origin and the second value is accounted for. The following changes to G(v) fix this:
yvals = [0]
for b in bins[1:]:
I also discussed this issue in this answer, where including the origin in those calculations provides an equivalent answer to using the other methods discussed here (which do not need 0 to be appended).
In short, when calculating Gini coefficients directly using integration, start from the origin. If using the other methods discussed here, then it's not needed.
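As an illustration of the integration approach with the origin included, here is a minimal sketch that builds the Lorenz curve from the sorted values and applies the trapezoid rule by hand. The function name `gini_lorenz` is made up, and the trapezoid rule is written out explicitly so the sketch does not depend on a particular NumPy integration helper.

```python
import numpy as np

def gini_lorenz(v):
    """Gini coefficient as 1 - 2 * (area under the Lorenz curve, starting at the origin)."""
    v = np.sort(np.asarray(v, dtype=float))
    # cumulative share of the total, with the origin (0, 0) prepended
    lorenz = np.concatenate([[0.0], np.cumsum(v) / np.sum(v)])
    population = np.linspace(0.0, 1.0, lorenz.size)
    # trapezoid rule written out explicitly
    area = np.sum((lorenz[1:] + lorenz[:-1]) / 2 * np.diff(population))
    return 1 - 2 * area

print(gini_lorenz([1, 1, 1, 1]))
print(gini_lorenz([0, 0, 1]))
print(gini_lorenz(np.random.default_rng(0).random(100_000)))
```

With the origin included, this agrees with the mean-absolute-difference implementations on the same data.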
Note that the Gini index is currently available in skbio.diversity.alpha as gini_index. It might give slightly different results for the examples mentioned above.
You are getting the right answer. The Gini Coefficient of the uniform distribution is not 0 "perfect equality", but (b-a) / (3*(b+a)). In your case, b = 1, and a = 0, so Gini = 1/3.
The only distributions with perfect equality are the Kronecker and the Dirac deltas. Remember that equality means "all the same", not "all equally probable".
There were some issues with the previous implementations. They never gave the gini index = 1 for perfectly sparse data.
def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))
gini_coefficient(np.array([0, 0, 1]))
gives the answer 0.666666. That happens because of the implied "integration scheme" it uses.
Here is another variant that bypasses the issue, although it is computationally heavier:
import numpy as np
from scipy.interpolate import interp1d
def gini(v, n_new = 1000):
    """Compute Gini coefficient of array of values"""
    v_abs = np.sort(np.abs(v))
    cumsum_v = np.cumsum(v_abs)
    n = len(v_abs)
    vals = np.concatenate([[0], cumsum_v/cumsum_v[-1]])
    x = np.linspace(0, 1, n+1)
    f = interp1d(x=x, y=vals, kind='previous')
    xnew = np.linspace(0, 1, n_new+1)
    dx_new = 1/(n_new)
    vals_new = f(xnew)
    return 1 - 2 * np.trapz(y=vals_new, x=xnew, dx=dx_new)
gini(np.array([0, 0, 1]))
It gives 0.999 as output, which is closer to what one wants to have =)