Fast, elegant way to calculate empirical/sample covariogram - python

Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of the covariogram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pairs of points that are separated by h (or less than h), multiply their values and, for each of these points, calculate its mean, which in this case is defined as m(x_i). However, according to the definition of m(x_{i}), if I want to compute m(x1), I need to obtain the average of the values located within distance h from x1. This looks like a very intensive computation.
First of all, am I understanding this correctly? If so, what is a good way to compute this assuming a two dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, that is why I will refrain from posting the code here. Here is another attempt of a very naive implementation:
import numpy as np
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(np.array(coordinates)))  # coordinates is an n x 2 array
z = np.array(z)  # z are the values
cutoff = np.max(distances)/3.0  # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []
for w in np.arange(len(widths)-1):  # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]):
        for j in np.arange(distances.shape[1]):
            if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i, x] <= widths[w+1] and distances[i, x] > widths[w]:
                        m1.append(z[x])
                for y in np.arange(distances.shape[1]):
                    if distances[j, y] <= widths[w+1] and distances[j, y] > widths[w]:
                        m2.append(z[y])
                mean_m1 = np.array(m1).mean()
                mean_m2 = np.array(m2).mean()
                Z.append(z[i]*z[j] - mean_m1*mean_m2)
    Z_mean = np.array(Z).mean()  # calculate covariogram for width w
    Cov.append(Z_mean)  # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.

One could use scipy.cov (an alias of numpy.cov in older SciPy releases; numpy.cov is the portable spelling), but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spatial correlations. I'll do this by first making a map with spatial correlations, and then generating random data points from it, where the data is positioned according to the underlying map and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. I also changed the map so that the left and right sides are shifted relative to each other to create negative correlation at large h.
from numpy import *
import math
import random  # stdlib random; this shadows numpy.random from the star import
import matplotlib.pyplot as plt
S = 1000
N = 900
# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8* abs(outer(sx, sx))
density[:,:S//2] += .2
# make a point cloud motivated by this density
random.seed(10) # so this can be repeated
points = []
while len(points)<N:
    v, ix, iy = random.random(), random.randint(0,S-1), random.randint(0,S-1)
    if True: #v<density[ix,iy]:
        points.append([ix, iy, density[ix,iy]])
locations = array(points).transpose()
print(locations.shape)
plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1,:], locations[0,:], '.k')
plt.xlim((0,S))
plt.ylim((0,S))
plt.show()
# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0,i]-L[0,j])**2+(L[1,i]-L[1,j])**2), L[2,i], L[2,j]]
           for i in range(N) for j in range(N) if i>j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize its production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of h-values in ihvals -- ie, S/1000 was used below).
# now do the real calculations for the covariogram
# sort by h and give clear names
order = argsort(m[:,0]) # h sorting
h = m[order,0]
zh = m[order,1]
zsh = m[order,2]
zz = zh*zsh
hvals = linspace(0,S,1000) # the values of h to use (S should be in the units of distance, here I just used ints)
ihvals = searchsorted(h, hvals)
result = []
# each bin holds the pairs with hvals[k] <= h < hvals[k+1]
for start, stop in zip(ihvals[:-1], ihvals[1:]):
    n = stop-start # number of pairs in this bin
    if n>0:
        mnh = sum(zh[start:stop])/n
        mph = sum(zsh[start:stop])/n
        szz = sum(zz[start:stop])/n
        C = szz-mnh*mph
        result.append([h[stop-1], C]) # largest h inside this bin
result = array(result)
plt.plot(result[:,0], result[:,1])
plt.grid()
plt.show()
This looks reasonable to me, since one can see bumps and troughs at the expected h values, but I haven't done a careful check.
The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculation could be sped up even more by doing partial sums, i.e., from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
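The binning described above can also be written without an explicit Python loop over the bins. The following is only a sketch, not the code from the answer: it assumes the same per-bin definition (mean of the products minus the product of the two means), and the function name `binned_covariogram` and its parameters are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

def binned_covariogram(coords, z, bin_edges):
    """Per-bin mean(z_i * z_j) - mean(z_i) * mean(z_j) over pairs i < j.

    Hypothetical helper: a pair belongs to bin k when
    bin_edges[k] <= distance < bin_edges[k + 1].
    """
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(z, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)
    nbins = len(bin_edges) - 1

    i, j = np.triu_indices(len(z), k=1)   # same pair order as pdist
    h = pdist(coords)                     # condensed pairwise distances

    b = np.digitize(h, bin_edges) - 1     # bin index of every pair
    keep = (b >= 0) & (b < nbins)         # drop pairs outside the edges
    b, zi, zj = b[keep], z[i[keep]], z[j[keep]]

    count = np.bincount(b, minlength=nbins)
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_i = np.bincount(b, weights=zi, minlength=nbins) / count
        mean_j = np.bincount(b, weights=zj, minlength=nbins) / count
        mean_ij = np.bincount(b, weights=zi * zj, minlength=nbins) / count

    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    return centers, mean_ij - mean_i * mean_j, count  # empty bins give nan
```

The returned `count` array makes it easy to mask bins that contain too few pairs to be trusted.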

Related

Rectangular lattice fit to noisy coordinates

I have the following problem. Imagine you have a set of coordinates that are somewhat organized in a regular pattern, such as the one shown below.
What I want to do is to automatically extract coordinates such that they are ordered from left to right and top to bottom. In addition, the total number of coordinates should be as large as possible, but only include coordinates that lie on a nearly rectangular grid (even if the coordinates have a different symmetry, e.g. hexagonal). I always want to extract coordinates that follow a rectangular unit cell structure.
For the example shown above, the largest such orthorhombic set would be 8 x 8 coordinates (let's call these dimensions m x n), as framed by the red rectangle.
The problem is that the given coordinates are noisy and distorted.
My approach was to generate an artificial lattice, and minimizing the difference to the given coordinates, taking into account some rotation, shift and simple distortion of the lattice. However, it turned out to be tricky to define a cost function that covers the complexity of the problem, i.e. minimizing the difference between the given coordinates and the fitted lattice, but also maximizing the grid components m x n.
If anyone has a smart idea how to tackle this problem, maybe also with machine learning algorithms, I would be very thankful.
Here is the code that I have used so far:
A function to generate the artificial lattice with m x n coordinates that are spaced by a and b in the "n" and "m" directions. The angle theta allows for a rotation of the lattice.
def lattice(m, n, a, b, theta):
    coords = []
    for j in range(m):
        for i in range(n):
            coords.append([np.sin(theta)*a*i + np.cos(theta)*b*j, np.cos(theta)*a*i - np.sin(theta)*b*j])
    return np.array(coords)
I used the following function to measure the mean minimal distance between points, which is a good starting point for fitting:
def mean_min_distance(coords):
    from scipy.spatial import distance
    cd = distance.cdist(coords, coords)
    cd_1 = np.where(cd == 0, np.nan, cd)
    return np.mean(np.nanmin(cd_1, axis=1))
The following function provides all possible combinations of m x n that theoretically fit into the lengths of the coordinates, whose arrangement is assumed to be unknown. The ability to limit this to minimal and maximal values is included already:
def get_all_mxn(l, min_m=2, min_n=2, max_m=None, max_n=None):
    poss = []
    if max_m is None:
        max_m = l + 1
    if max_n is None:
        max_n = l + 1
    for i in range(min_m, max_m):
        for j in range(min_n, max_n):
            if i * j <= l:
                poss.append([i, j])
    return np.array(poss)
The definition of the cost function I used (for one particular set of m x n). I first wanted to get a good fit for a certain m x n arrangement.
def cost(x0):
    a, b, theta, shift_a, shift_b, dd1 = x0
    # generate lattice
    l = lattice(m, n, a, b, theta)
    # distort lattice by affine transformation
    distortion_matr = np.array([[1, dd1], [0, 1]])
    l = np.dot(distortion_matr, l.T).T
    # shift lattice
    l = l + np.array((shift_b, shift_a))
    # Some padding to make the lists the same length
    len_diff = coords.shape[0] - l.shape[0]
    l = np.append(l, (1e3, 1e3)*len_diff).reshape((l.shape[0] + len_diff, 2))
    # calculate all distances between all points
    cd = distance.cdist(coords, l)
    # minimum distance between each artificial lattice point and all coords
    cd_min = np.min(cd[:, :coords.shape[0] - len_diff], axis=0)
    # returns the square root of the summed squared minimal distances
    return np.sqrt(np.sum(np.abs(cd_min) ** 2))
I then run the minimization:
md = mean_min_distance(coords)
# initial guess
x0 = np.array((md, md, np.deg2rad(-3.), 3, 1, 0.12))
res = minimize(cost, x0)
However, the results are extremely dependent on the initial parameter x0, and I have not even included a fitting of m and n.

Inverse FFT returns negative values when it should not

I have several points (x, y, z coordinates) in a 3D box with associated masses. I want to draw a histogram of the mass density found in spheres of a given radius R.
I have written code that, provided I did not make any errors (which I think I may have), works in the following way:
1. My "real" data is huge, so I wrote a little code to generate non-overlapping points randomly with arbitrary mass in a box.
2. I compute a 3D histogram (weighted by mass) with a bin size about 10 times smaller than the radius of my spheres.
3. I take the FFT of my histogram, compute the wave-modes (kx, ky and kz) and use them to multiply my histogram in Fourier space by the analytic expression of the 3D top-hat window (sphere filtering) function in Fourier space.
4. I inverse FFT my newly computed grid.
Thus drawing a 1D histogram of the values in each bin would give me what I want.
My issue is the following: given what I do, there should not be any negative values in my inverted FFT grid (step 4), but I get some, with magnitudes much larger than the numerical error.
If I run my code on a small box (300x300x300 cm3 with the points separated by at least 1 cm) I do not get the issue. I do get it for 600x600x600 cm3 though.
If I set all the masses to 0, thus working on an empty grid, I do get back my 0 without any noted issues.
I here give my code in a full block so that it is easily copied.
import numpy as np
import matplotlib.pyplot as plt
import random
from numba import njit
# 1. Generate a bunch of points with masses from 1 to 3 separated by a radius of 1 cm
radius = 1
rangeX = (0, 100)
rangeY = (0, 100)
rangeZ = (0, 100)
rangem = (1,3)
qty = 20000 # or however many points you want
# Generate a set of all points within 1 of the origin, to be used as offsets later
deltas = set()
for x in range(-radius, radius+1):
    for y in range(-radius, radius+1):
        for z in range(-radius, radius+1):
            if x*x + y*y + z*z <= radius*radius:
                deltas.add((x,y,z))
X = []
Y = []
Z = []
M = []
excluded = set()
for i in range(qty):
    x = random.randrange(*rangeX)
    y = random.randrange(*rangeY)
    z = random.randrange(*rangeZ)
    m = random.uniform(*rangem)
    if (x,y,z) in excluded: continue
    X.append(x)
    Y.append(y)
    Z.append(z)
    M.append(m)
    excluded.update((x+dx, y+dy, z+dz) for (dx,dy,dz) in deltas)
print("There is ",len(X)," points in the box")
# Compute the 3D histogram
a = np.vstack((X, Y, Z)).T
b = 200
H, edges = np.histogramdd(a, weights=M, bins = b)
# Compute the FFT of the grid
Fh = np.fft.fftn(H, axes=(-3,-2, -1))
# Compute the different wave-modes
kx = 2*np.pi*np.fft.fftfreq(len(edges[0][:-1]))*len(edges[0][:-1])/(np.amax(X)-np.amin(X))
ky = 2*np.pi*np.fft.fftfreq(len(edges[1][:-1]))*len(edges[1][:-1])/(np.amax(Y)-np.amin(Y))
kz = 2*np.pi*np.fft.fftfreq(len(edges[2][:-1]))*len(edges[2][:-1])/(np.amax(Z)-np.amin(Z))
# I create a matrix containing the values of the filter in each point of the grid in Fourier space
R = 5
Kh = np.empty((len(kx),len(ky),len(kz)))
#njit(parallel=True)
def func_njit(kx, ky, kz, Kh):
    for i in range(len(kx)):
        for j in range(len(ky)):
            for k in range(len(kz)):
                if np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2) != 0:
                    Kh[i][j][k] = (np.sin((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)-(np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R*np.cos((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R))*3/((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)**3
                else:
                    Kh[i][j][k] = 1
    return Kh
Kh = func_njit(kx, ky, kz, Kh)
# I multiply each point of my grid by the associated value of the filter (multiplication in Fourier space = convolution in real space)
Gh = np.multiply(Fh, Kh)
# I take the inverse FFT of my filtered grid. I take the real part to get back floats but there should only be zeros for the imaginary part.
Density = np.real(np.fft.ifftn(Gh,axes=(-3,-2, -1)))
# Here it shows if there are negative values the magnitude of the error
print(np.min(Density))
D = Density.flatten()
N = np.mean(D)
# I then compute the histogram I want
hist, bins = np.histogram(D/N, bins='auto', density=True)
bin_centers = (bins[1:]+bins[:-1])*0.5
plt.plot(bin_centers, hist)
plt.xlabel('rho/rhom')
plt.ylabel('P(rho)')
plt.show()
Do you know why I'm getting these negative values? Do you think there is a simpler way to proceed?
Sorry if this is a very long post, I tried to make it very clear and will edit it with your comments, thanks a lot!
-EDIT-
A follow-up question on the issue can be found here.
The filter you create in the frequency domain is only an approximation to the filter you want to create. The problem is that we are dealing with the DFT here, not the continuous-domain FT (with its infinite frequencies). The Fourier transform of a ball is indeed the function you describe, however this function is infinitely large -- it is not band-limited!
By sampling this function only within a window, you are effectively multiplying it with an ideal low-pass filter (the rectangle of the domain). This low-pass filter, in the spatial domain, has negative values. Therefore, the filter you create also has negative values in the spatial domain.
This is a slice through the origin of the inverse transform of Kh (after I applied fftshift to move the origin to the middle of the image, for better display):
As you can tell here, there is some ringing that leads to negative values.
One way to overcome this ringing is to apply a windowing function in the frequency domain. Another option is to generate a ball in the spatial domain, and compute its Fourier transform. This second option would be the simplest to achieve. Do remember that the kernel in the spatial domain must also have the origin at the top-left pixel to obtain a correct FFT.
A windowing function is typically applied in the spatial domain to avoid issues with the image border when computing the FFT. Here, I propose to apply such a window in the frequency domain to avoid similar issues when computing the IFFT. Note, however, that this will always further reduce the bandwidth of the kernel (the windowing function would work as a low-pass filter after all), and therefore yield a smoother transition of foreground to background in the spatial domain (i.e. the spatial domain kernel will not have as sharp a transition as you might like). The best known windowing functions are Hamming and Hann windows, but there are many others worth trying out.
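The second option (building the ball in the spatial domain and transforming it) could be sketched as follows. This is an illustration under stated assumptions rather than a drop-in replacement for the question's code: the grid is treated as periodic, the radius is given in voxels, and the helper names `wrapped_offsets`, `ball_kernel` and `smooth_with_ball` are invented here.

```python
import numpy as np

def wrapped_offsets(n):
    """Integer offsets 0, 1, ..., -2, -1 so that index 0 is the origin."""
    k = np.arange(n)
    return np.where(k > n // 2, k - n, k)

def ball_kernel(shape, radius):
    """Binary ball centred on index [0, 0, 0] (wrapping around the edges),
    normalised to sum to 1. `radius` is in voxels (an assumption)."""
    dx, dy, dz = np.meshgrid(*[wrapped_offsets(n) for n in shape], indexing="ij")
    ball = (dx**2 + dy**2 + dz**2 <= radius**2).astype(float)
    return ball / ball.sum()

def smooth_with_ball(grid, radius):
    """Circular convolution of `grid` with the ball kernel via the FFT."""
    kernel = ball_kernel(grid.shape, radius)
    out = np.fft.ifftn(np.fft.fftn(grid) * np.fft.fftn(kernel))
    return out.real

rng = np.random.default_rng(0)
H = rng.uniform(0.0, 3.0, size=(32, 32, 32))   # stand-in for the mass histogram
density = smooth_with_ball(H, radius=5)
print(density.min())   # non-negative up to floating-point rounding
```

Because both the grid and the kernel are non-negative in the spatial domain, their convolution cannot go negative beyond rounding error. The price is that the ball is voxelised, so its edge is only as smooth as the grid allows.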
Unsolicited advice:
I simplified your code to compute Kh to the following:
kr = np.sqrt(kx[:,None,None]**2 + ky[None,:,None]**2 + kz[None,None,:]**2)
kr *= R
Kh = (np.sin(kr)-kr*np.cos(kr))*3/(kr)**3
Kh[0,0,0] = 1
I find this easier to read than the nested loops. It should also be significantly faster, and avoid the need for njit. Note that you were computing the same distance (what I call kr here) 5 times. Factoring out such computation is not only faster, but yields more readable code.
Just a guess:
Where do you get the idea that the imaginary part MUST be zero? Have you ever tried to take the absolute values (sqrt(re^2 + im^2)) and forget about the phase instead of just taking the real part? Just something that came to my mind.

Select a point randomly, but without the bias of density

I have this distribution of points (allPoints, which is a list of lists: [[x1,y1],[x2,y2],[x3,y3],[x4,y4],...,[xn,yn]]):
From which I'd like to select points, randomly.
In Python I would do something like:
from random import *
point = choice(allPoints)
Except, I need the random pick to not be biased by the existing density. For instance, here, "choice" would tend to pick a point in the upmost-leftmost part of the plot.
How can I, in Python, get rid of this bias?
I've tried to divide the space in portions of size "div", and then, sample within this portion, but in many cases, no points exist at all and the while loop doesn't find any solution:
def column(matrix, i):
    return [row[i] for row in matrix]
div = 10
min_x,max_x = min(column(allPoints,0)),max(column(allPoints,0))
min_y, max_y = min(column(allPoints,1)),max(column(allPoints,1))
zone_x_min = randint(1,div-1) * (max_x - min_x) / div + min_x
zone_x_max = zone_x_min + (max_x - min_x) / div
zone_y_min = randint(1,div-1) * (max_y - min_y) / div + min_y
zone_y_max = zone_y_min + (max_y - min_y) / div
p = choice(allPoints)
cont = True
while cont == True:
    if (p[0] > zone_x_min and p[0] < zone_x_max) and (p[1] > zone_y_min and p[1] < zone_y_max):
        cont = False
    else:
        p = choice(allPoints)
What would be a correct, inexpensive (if possible) solution to this problem?
If it weren't ridiculous, I think something like this would work for me, in theory:
p = [uniform(min_x,max_x),uniform(min_y,max_y)]
while p not in allPoints:
    p = [uniform(min_x,max_x),uniform(min_y,max_y)]
The question is a little ill-formed, but here's a stab.
The idea is to use a gaussian kernel density estimate, then sample from your data with weights equal to the inverse of the pdf at each point.
This is not statistically justifiable in any real sense.
import numpy as np
from scipy import stats
#random data
x = np.random.normal(size = 200)
y = np.random.normal(size = 200)
#estimate the density
kernel = stats.gaussian_kde(np.vstack([x,y]))
#calculate the inverse of pdf for each point, and normalise to sum to 1
pvector = 1/kernel.pdf(np.vstack([x,y]))/sum(1/kernel.pdf(np.vstack([x,y])))
#get a vector of indices based on your weights
np.random.choice(range(len(x)), size = 10, replace = True, p = pvector)
I believe you want to randomly select a data point from your graph, that is, one of the little black dots.
Compute a centroid, or pick a point like (1.0, 70).
Compute the distance from each point to the centroid and let the probability of choosing that point be proportional to it.
That is, if distance(P,C) is 100 and distance(Q,C) is 1, then let P be 100x more likely to be chosen. All points are eligible to win, but the crowded ones are individually less likely (but make it up with volume).
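A rough sketch of this distance-weighted pick is given below. It assumes the centroid is the plain mean of the points and that not all points coincide (otherwise the weights cannot be normalised); the function name `pick_by_centroid_distance` is invented for illustration.

```python
import numpy as np

def pick_by_centroid_distance(points, size=1, rng=None):
    """Pick points with probability proportional to their distance
    from the centroid (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    dist = np.linalg.norm(pts - centroid, axis=1)
    p = dist / dist.sum()                  # normalise to a probability vector
    idx = rng.choice(len(pts), size=size, p=p)
    return pts[idx]
```

If the axes have very different ranges, as in the plot from the question, it may be worth rescaling them first so that one axis does not dominate the distance.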
If I understand your initial attempt correctly, I believe there is a simple adjustment you can make to make this work.
Randomly generate an x value (0,4.5), and a y value (0,70).
Then loop through allPoints to find the closest dot.
This has the downside of large empty areas all converging to a single point. A way to help (not remove) this problem would be to make your random point have a range. If no dot exists in that range, randomly generate a new dot.
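One way to sketch this idea is with a k-d tree, so the nearest dot does not have to be found by looping over allPoints. The function name `pick_nearest_to_random_location` and the `max_dist` and `max_tries` parameters are assumptions made for this example.

```python
import numpy as np
from scipy.spatial import cKDTree

def pick_nearest_to_random_location(points, max_dist, rng=None, max_tries=1000):
    """Draw a uniform random location in the bounding box and return the
    closest data point, retrying when none lies within `max_dist`."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    tree = cKDTree(pts)
    for _ in range(max_tries):
        target = rng.uniform(lo, hi)
        # query reports an infinite distance when nothing is within the bound
        dist, idx = tree.query(target, distance_upper_bound=max_dist)
        if np.isfinite(dist):
            return pts[idx]
    raise RuntimeError("no point found within max_dist of any random location")
```

A smaller `max_dist` reduces the pull of large empty areas towards a single dot, at the cost of more rejected draws.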
Assuming you want your selected points to be visually spread I can think of at least one "efficient/easy" method.
Choose a random point (with random.choice for example) ;
remove from your initial set any point that is "close"*;
repeat until there is no point left in your set.
*This requires that you know from the beginning how dense you want your sample to be.
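The three steps above might look like this in NumPy. It is a sketch only: "close" is taken to mean "nearer than `min_dist`", and the function name `thin_points` is made up.

```python
import numpy as np

def thin_points(points, min_dist, rng=None):
    """Greedy thinning: repeatedly pick a random remaining point and
    discard every remaining point closer than `min_dist` to it."""
    if min_dist <= 0:
        raise ValueError("min_dist must be positive")
    rng = np.random.default_rng() if rng is None else rng
    remaining = np.asarray(points, dtype=float)
    chosen = []
    while len(remaining) > 0:
        pick = remaining[rng.integers(len(remaining))]
        chosen.append(pick)
        # the pick itself is at distance 0, so it is removed as well
        far = np.linalg.norm(remaining - pick, axis=1) >= min_dist
        remaining = remaining[far]
    return np.array(chosen)
```

Every pair of returned points is at least `min_dist` apart, which is what gives the visually even spread.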

Find all the points that lie within a spherical region

For example, see the image below, which illustrates the problem for a simple 2D case. The label (N) and coordinates (x,y) of each point are known. I need to find all the point labels that lie within the red circle.
My actual problem is in 3D and the points are not uniformly distributed.
A sample input file containing the coordinates of 7.25 M points is attached here: point file.
I tried the following piece of code
import numpy as np
C = [50,50,50]
R = 20
centroid = np.loadtxt('centroid') #chk the file attached
def dist(x,y): return sum([(xi-yi)**2 for xi, yi in zip(x,y)])
elabels=[i+1 for i in range(len(centroid)) if dist(C,centroid[i])<=R**2]
A single search takes ~10 min. Any suggestions to make it faster?
Thanks,
Prithivi
When using numpy, avoid using list comprehensions on arrays.
Your computation can be done using vectorized expressions like this
centre = np.array((50., 50., 50.))
points = np.loadtxt('data')
distances2= np.sum((points-centre)**2, axis=1)
points is an N x 3 array, and so is points-centre.
(points-centre)**2 squares each element of the difference, and np.sum(..., axis=1) then sums the squared differences along axis 1, that is, across columns.
To filter the array of positions, you can use boolean indexing
close = points[distances2<max_dist**2]
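If many spheres have to be searched in the same point set, a k-d tree built once may pay off. The sketch below uses random stand-in data instead of the attached file, and follows the question in using 1-based labels; the variable names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(100_000, 3))   # stand-in for the centroid file

tree = cKDTree(points)                     # build once, reuse for every query
centre = np.array([50.0, 50.0, 50.0])
R = 20.0
idx = tree.query_ball_point(centre, r=R)   # indices with distance <= R
labels = np.sort(np.asarray(idx)) + 1      # 1-based labels, as in the question
print(len(labels))
```

For a single query the vectorized distance computation is usually enough; the tree mainly helps when the search is repeated for many centres.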
You are calling the dist function very heavily. You could try to optimize it at a low level, and use the timeit Python module to check which version is more efficient. On my machine, I tried this one:
def dist(x,y):
    d0 = y[0] - x[0]
    d1 = y[1] - x[1]
    d2 = y[2] - x[2]
    return d0*d0 + d1*d1 + d2*d2
and timeit said it was more than 3 times quicker.
This one was just in the middle:
def dist(x,y):
    s = 0
    for i in range(len(x)):
        d = y[i] - x[i]
        s += d * d
    return s

Sorting x,y,z coordinates into arrays within a defined volume Python/numpy

Hi I am fairly new and I hope you can answer my question or help me find a better method!
Say I have a set of x,y,z coordinates that I want to subdivide into arrays containing the points within a certain volume (dV) of the total volume of the x,y,z space.
I have been trying to sort each x,y,z coordinate by the x value first, then subdividing by some dx into a new dimension of the array, then within each of these subdivided dimensions, sorting the y values and redividing by dy, and then the same along the z axis, giving the sorted and subdivided coordinates
I have attempted to create an array to append the coordinate sets to...
def splitter(array1):
    xSortx = np.zeros([10,1,3])
    for j in range(0,10):
        for i in range(len(array1)):
            if (j * dx) <= array1[i][0] < (j + 1)*dx:
                np.append(xSortx[j],array1[i])
Everything seemed to be working except the append part. I have heard that append in Python can be troublesome, so another method I tried was to create the multidimensional matrix first in order to fill it, but I ran into the problem that I do not know how to create a multidimensional matrix that could have, for example, 1 entry in the second dimension at one index but 5 at the next, e.g. [[[0,0,0]],[[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0]]].
I would really appreciate any tips or advice, let me know if this is not very clear and I will try to explain it more!
I believe this is what you want:
import numpy as np
# define your working volume
Vmin = np.array([1,2,3])
Vmax = np.array([4,5,6])
DV = Vmax-Vmin
# define your subdividing unit
d = 0.5
N = np.ceil(DV / d).astype(int) # number of bins in each dimension
def splitter(array):
    # result[i][j][k] is the list of points in bin (i, j, k)
    result = [[[[] for k in range(N[2])] for j in range(N[1])] for i in range(N[0])]
    for p in array:
        i,j,k = ((p - Vmin) / d).astype(int) # find the bin coordinates
        result[i][j][k].append(p)
    return result
# test the function
test = Vmin + np.random.rand(20,3) * DV # create 20 random points in the working volume
result = splitter(test)
for i in range(N[0]):
    for j in range(N[1]):
        for k in range(N[2]):
            print("points in bin:", Vmin + np.array([i,j,k]) * d)
            for p in result[i][j][k]:
                print(p)
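When most bins are empty, a dictionary keyed by the bin index can be more convenient than a dense nested list. The following is a sketch under the same binning rule (integer part of (p - Vmin) / d); the function name `bin_points` and its arguments are invented for this example.

```python
import numpy as np
from collections import defaultdict

def bin_points(points, vmin, d):
    """Group points by their integer bin index (i, j, k).

    Returns a dict mapping (i, j, k) to an array of the points in that bin;
    empty bins simply do not appear in the dict.
    """
    points = np.asarray(points, dtype=float)
    idx = np.floor((points - np.asarray(vmin)) / d).astype(int)
    bins = defaultdict(list)
    for key, p in zip(map(tuple, idx.tolist()), points):
        bins[key].append(p)
    return {key: np.array(val) for key, val in bins.items()}
```

For example, `bin_points(test, Vmin, d)` would group the 20 random test points from the answer above by bin.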
