Related
Given a 10x10 grid (2d-array) filled randomly with numbers, either 0, 1 or 2. How can I find the Euclidean distance (the l2-norm of the distance vector) between two given points considering periodic boundaries?
Let us consider an arbitrary grid point called centre. Now, I want to find the nearest grid point containing the same value as centre. I need to take periodic boundaries into account, such that the matrix/grid can be seen rather as a torus instead of a flat plane. In that case, say the centre = matrix[0,2], and we find that there is the same number in matrix[9,2], which would be at the southern boundary of the matrix. The Euclidean distance computed with my code would be for this example np.sqrt(0**2 + 9**2) = 9.0. However, because of periodic boundaries, the distance should actually be 1, because matrix[9,2] is the northern neighbour of matrix[0,2]. Hence, if periodic boundary values are implemented correctly, distances of magnitude above 8 should not exist.
So, I would be interested on how to implement in Python a function to compute the Euclidean distance between two arbitrary points on a torus by applying a wrap-around for the boundaries.
import numpy as np
matrix = np.random.randint(0,3,(10,10))
centre = matrix[0,2]
#rewrite the centre to be the number 5 (to exclude itself as shortest distance)
matrix[0,2] = 5
#find the points where entries are same as centre
same = np.where((matrix == centre) == True)
idx_row, idx_col = same
#find distances from centre to all values which are of same value
dist = np.zeros(len(same[0]))
for i in range(0,len(same[0])):
delta_row = same[0][i] - 0 #row coord of centre
delta_col = same[1][i] - 2 #col coord of centre
dist[i] = np.sqrt(delta_row**2 + delta_col**2)
#retrieve the index of the smallest distance
idx = dist.argmin()
print('Centre value: %i. The nearest cell with same value is at (%i,%i)'
% (centre, same[0][idx],same[1][idx]))
For each axis, you can check whether the distance is shorter when you wrap around or when you don't. Consider the row axis, with rows i and j.
When not wrapping around, the difference is abs(i - j).
When wrapping around, the difference is "flipped", as in 10 - abs(i - j). In your example with i == 0 and j == 9 you can check that this correctly produces a distance of 1.
Then simply take whichever is smaller:
delta_row = same[0][i] - 0 #row coord of centre
delta_row = min(delta_row, 10 - delta_row)
And similarly for delta_column.
The final dist[i] calculation needs no changes.
I have a working 'sketch' of how this could work. In short, I calculate the distance 9 times, 1 for the normal distance, and 8 shifts to possibly correct for a closer 'torus' distance.
As n is getting larger, the calculation costs can go sky high as the numbers go up. But, the torus effect, is probably not needed as there is always a point nearby without 'wrap around'.
You can easily test this, because for a grid of size 1, if a point is found of distance 1/2 or closer, you know there is not a closer torus point (right?)
import numpy as np
n=10000
np.random.seed(1)
A = np.random.randint(low=0, high=10, size=(n,n))
I create 10000x10000 points, and store the location of the 1's in ONES.
ONES = np.argwhere(A == 0)
Now I define my torus distance, which is trying which of the 9 mirrors is the closest.
def distance_on_torus( point=[500,500] ):
index_diff = [[1],[1],[0],[0],[0,1],[0,1],[0,1],[0,1]]
coord_diff = [[-1],[1],[-1],[1],[-1,-1],[-1,1],[1,-1],[1,1]]
tree = BallTree( ONES, leaf_size=5*n, metric='euclidean')
dist, indi = tree.query([point],k=1, return_distance=True )
distances = [dist[0]]
for indici_to_shift, coord_direction in zip(index_diff, coord_diff):
MIRROR = ONES.copy()
for i,shift in zip(indici_to_shift,coord_direction):
MIRROR[:,i] = MIRROR[:,i] + (shift * n)
tree = BallTree( MIRROR, leaf_size=5*n, metric='euclidean')
dist, indi = tree.query([point],k=1, return_distance=True )
distances.append(dist[0])
return np.min(distances)
%%time
distance_on_torus([2,3])
It is slow, the above takes 15 minutes.... For n = 1000 less than a second.
A optimisation would be to first consider the none-torus distance, and if the minimum distance is possibly not the smallest, calculate with only the minimum set of extra 'blocks' around. This will greatly increase speed.
I want to generate 750 random number in 2D data x (0,1) and y (0,1)
X1 = np.random.random((750,2))
However, I want to make sure I don't have any value in the circular region such as
I can remove the value, but I want to fix the number of the random number to be 750. What would be the best way to generate such list?
import random
points = []
radius = 1
num_points=1500
index = 0
max_coord = 750
while index < num_points:
x=random.random()*max_coord
y=random.random()*max_coord
if x**2 + y**2 > radius**2:
points = points + [[x,y]]
index = index + 1
print(points)
Change max_coor to obtain bigger numbers. This only gives you 1500 points outside the circle of the given radius=1. Each coordinate varies between 0 and 750
you can just apply pythagorean's theorem. Here is my code to do that
import numpy as np
import matplotlib.pyplot as plt
dist = 0.2
X1 = np.random.random((750,2))
X2 = [x for x in X1 if (x[0]-0.5)**2 + (x[1]-0.5)**2 > dist**2] # filter values that are to close to center
# for plotting
x = [x[0] for x in X2]
y = [y[1] for y in X2]
plt.scatter(x,y)
plt.show()
You can do this with a well-known technique called rejection sampling. Generate candidate values and see if they meet your constraints. If so, accept them, otherwise reject and repeat. With 2-d sampling, the probability of acceptance on a given trial is P{Accept} = Area(acceptance region) / Area(generating region). Repetitions due to rejection are independent, so the number of trials has a geometric distribution with parameter p = P{Accept}
and Expected # trials = 1 / P{Accept}. For example, if the rejection region is a circle with radius 1/2 centered in a unit square, P{Accept} = (4 - Pi)/4 andon average it will take 4/(4 - Pi) (approximately 4.66) attempts per value generated. Obviously this won't work if the radius is >= sqrt(2)/2, and gets expensive as you approach that limit.
Note that the rejection circle does not need to be centered in the square, you just reject any candidate that is less than radius away from the circle's center.
Here's the code:
import numpy as np
def generate_pt(exclusion_radius):
while True:
candidate = np.random.sample(2)
# The following assumes the rejection circle is centered at (0.5, 0.5).
# Adjust by subtracting individual x/y coordinates for the center of
# the circle if you don't want it centered. Acceptance/rejection
# distance is determined by Pythagorean theorem.
if np.sum(np.square(candidate - 0.5)) >= exclusion_radius * exclusion_radius:
return candidate
# Generate 750 points outside the centered circle of radius 0.3.
data = np.array([generate_pt(0.3) for _ in range(750)])
I have several points (x,y,z coordinates) in a 3D box with associated masses. I want to draw an histogram of the mass-density that is found in spheres of a given radius R.
I have written a code that, providing I did not make any errors which I think I may have, works in the following way:
My "real" data is something huge thus I wrote a little code to generate non overlapping points randomly with arbitrary mass in a box.
I compute a 3D histogram (weighted by mass) with a binning about 10 times smaller than the radius of my spheres.
I take the FFT of my histogram, compute the wave-modes (kx, ky and kz) and use them to multiply my histogram in Fourier space by the analytic expression of the 3D top-hat window (sphere filtering) function in Fourier space.
I inverse FFT my newly computed grid.
Thus drawing a 1D-histogram of the values on each bin would give me what I want.
My issue is the following: given what I do there should not be any negative values in my inverted FFT grid (step 4), but I get some, and with values much higher that the numerical error.
If I run my code on a small box (300x300x300 cm3 and the points of separated by at least 1 cm) I do not get the issue. I do get it for 600x600x600 cm3 though.
If I set all the masses to 0, thus working on an empty grid, I do get back my 0 without any noted issues.
I here give my code in a full block so that it is easily copied.
import numpy as np
import matplotlib.pyplot as plt
import random
from numba import njit
# 1. Generate a bunch of points with masses from 1 to 3 separated by a radius of 1 cm
radius = 1
rangeX = (0, 100)
rangeY = (0, 100)
rangeZ = (0, 100)
rangem = (1,3)
qty = 20000 # or however many points you want
# Generate a set of all points within 1 of the origin, to be used as offsets later
deltas = set()
for x in range(-radius, radius+1):
for y in range(-radius, radius+1):
for z in range(-radius, radius+1):
if x*x + y*y + z*z<= radius*radius:
deltas.add((x,y,z))
X = []
Y = []
Z = []
M = []
excluded = set()
for i in range(qty):
x = random.randrange(*rangeX)
y = random.randrange(*rangeY)
z = random.randrange(*rangeZ)
m = random.uniform(*rangem)
if (x,y,z) in excluded: continue
X.append(x)
Y.append(y)
Z.append(z)
M.append(m)
excluded.update((x+dx, y+dy, z+dz) for (dx,dy,dz) in deltas)
print("There is ",len(X)," points in the box")
# Compute the 3D histogram
a = np.vstack((X, Y, Z)).T
b = 200
H, edges = np.histogramdd(a, weights=M, bins = b)
# Compute the FFT of the grid
Fh = np.fft.fftn(H, axes=(-3,-2, -1))
# Compute the different wave-modes
kx = 2*np.pi*np.fft.fftfreq(len(edges[0][:-1]))*len(edges[0][:-1])/(np.amax(X)-np.amin(X))
ky = 2*np.pi*np.fft.fftfreq(len(edges[1][:-1]))*len(edges[1][:-1])/(np.amax(Y)-np.amin(Y))
kz = 2*np.pi*np.fft.fftfreq(len(edges[2][:-1]))*len(edges[2][:-1])/(np.amax(Z)-np.amin(Z))
# I create a matrix containing the values of the filter in each point of the grid in Fourier space
R = 5
Kh = np.empty((len(kx),len(ky),len(kz)))
#njit(parallel=True)
def func_njit(kx, ky, kz, Kh):
for i in range(len(kx)):
for j in range(len(ky)):
for k in range(len(kz)):
if np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2) != 0:
Kh[i][j][k] = (np.sin((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)-(np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R*np.cos((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R))*3/((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)**3
else:
Kh[i][j][k] = 1
return Kh
Kh = func_njit(kx, ky, kz, Kh)
# I multiply each point of my grid by the associated value of the filter (multiplication in Fourier space = convolution in real space)
Gh = np.multiply(Fh, Kh)
# I take the inverse FFT of my filtered grid. I take the real part to get back floats but there should only be zeros for the imaginary part.
Density = np.real(np.fft.ifftn(Gh,axes=(-3,-2, -1)))
# Here it shows if there are negative values the magnitude of the error
print(np.min(Density))
D = Density.flatten()
N = np.mean(D)
# I then compute the histogram I want
hist, bins = np.histogram(D/N, bins='auto', density=True)
bin_centers = (bins[1:]+bins[:-1])*0.5
plt.plot(bin_centers, hist)
plt.xlabel('rho/rhom')
plt.ylabel('P(rho)')
plt.show()
Do you know why I'm getting these negative values? Do you think there is a simpler way to proceed?
Sorry if this is a very long post, I tried to make it very clear and will edit it with your comments, thanks a lot!
-EDIT-
A follow-up question on the issue can be found [here].1
The filter you create in the frequency domain is only an approximation to the filter you want to create. The problem is that we are dealing with the DFT here, not the continuous-domain FT (with its infinite frequencies). The Fourier transform of a ball is indeed the function you describe, however this function is infinitely large -- it is not band-limited!
By sampling this function only within a window, you are effectively multiplying it with an ideal low-pass filter (the rectangle of the domain). This low-pass filter, in the spatial domain, has negative values. Therefore, the filter you create also has negative values in the spatial domain.
This is a slice through the origin of the inverse transform of Kh (after I applied fftshift to move the origin to the middle of the image, for better display):
As you can tell here, there is some ringing that leads to negative values.
One way to overcome this ringing is to apply a windowing function in the frequency domain. Another option is to generate a ball in the spatial domain, and compute its Fourier transform. This second option would be the simplest to achieve. Do remember that the kernel in the spatial domain must also have the origin at the top-left pixel to obtain a correct FFT.
A windowing function is typically applied in the spatial domain to avoid issues with the image border when computing the FFT. Here, I propose to apply such a window in the frequency domain to avoid similar issues when computing the IFFT. Note, however, that this will always further reduce the bandwidth of the kernel (the windowing function would work as a low-pass filter after all), and therefore yield a smoother transition of foreground to background in the spatial domain (i.e. the spatial domain kernel will not have as sharp a transition as you might like). The best known windowing functions are Hamming and Hann windows, but there are many others worth trying out.
Unsolicited advice:
I simplified your code to compute Kh to the following:
kr = np.sqrt(kx[:,None,None]**2 + ky[None,:,None]**2 + kz[None,None,:]**2)
kr *= R
Kh = (np.sin(kr)-kr*np.cos(kr))*3/(kr)**3
Kh[0,0,0] = 1
I find this easier to read than the nested loops. It should also be significantly faster, and avoid the need for njit. Note that you were computing the same distance (what I call kr here) 5 times. Factoring out such computation is not only faster, but yields more readable code.
Just a guess:
Where do you get the idea that the imaginary part MUST be zero? Have you ever tried to take the absolute values (sqrt(re^2 + im^2)) and forget about the phase instead of just taking the real part? Just something that came to my mind.
I have a 3D array where I added the potential values for each coordinate here:
z = y = x = size # size of each side of the cube
potential = np.zeros((z, y, x))
exemple: I set a potential to a point -> potential[3,3,3] = 10 now this coord has a value of 10. And the surrounding points will have a potential based on the mean of the surrounding points.
such as:
for z in range(1,size):
for y in range(1,size):
for x in range(1,size):
if [z,y,x] in polop:
potential[z,y,x] = Positive # If positive polo, keeps potential
elif [z,y,x] in polon:
potential[z,y,x] = Negative # If negative polo, keeps potential
elif z!=size-1 and y!=size-1 and x!=size-1: # Sets the potential to the mean potential of neighbors
potential[z][y][x] = (potential[z][y][x+1] + potential[z][y][x-1] + potential[z][y+1][x] +
potential[z][y-1][x] + potential[z+1][y][x] + potential[z-1][y][x]) / 6
Then I plotted this array and it looks like this:
Now what I want is to have surfaces instead of scattering points. A surface consisting of the points that have more less the same potential (let's say from 2 in 2, as 10->8, 8->6, 6->4, 4->2, 2->0, 0->-2, etc.)
Is it possible?
Plus, this was supposed to be an ellipsoid instead of a cube, the poles were supposed to be spheres instead of cubes and this needs to be updating like each second.
Please help!
Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of covariagram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pair of points that are separated by h (or less than h), multiply its values and for each of these points, calculate its mean, which in this case, are defined as m(x_i). However, according to the definition of m(x_{i}), if I want to compute m(x1), I need to obtain the average of the values located within distance h from x1. This looks like a very intensive computation.
First of all, am I understanding this correctly? If so, what is a good way to compute this assuming a two dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, that is why I will refrain from posting the code here. Here is another attempt of a very naive implementation:
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(np.array(coordinates))) # coordinates is a nx2 array
z = np.array(z) # z are the values
cutoff = np.max(distances)/3.0 # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []
for w in np.arange(len(widths)-1): # for each width
# for each pairwise distance
for i in np.arange(distances.shape[0]):
for j in np.arange(distances.shape[1]):
if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
m1 = []
m2 = []
# when a distance is within a given width, calculate the means of
# the points involved
for x in np.arange(distances.shape[1]):
if distances[i,x] <= widths[w+1] and distances[i, x] > widths[w]:
m1.append(z[x])
for y in np.arange(distances.shape[1]):
if distances[j,y] <= widths[w+1] and distances[j, y] > widths[w]:
m2.append(z[y])
mean_m1 = np.array(m1).mean()
mean_m2 = np.array(m2).mean()
Z.append(z[i]*z[j] - mean_m1*mean_m2)
Z_mean = np.array(Z).mean() # calculate covariogram for width w
Cov.append(Z_mean) # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spacial correlations. I'll do this by first making the spatial correlations, and then using random data points that are generated using this, where the data is positioned according to the underlying map, and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. And, I changed the map so that left and right side were shifted relative to eachother to create negative correlation at large h.
from numpy import *
import random
import matplotlib.pyplot as plt
S = 1000
N = 900
# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8* abs(outer(sx, sx))
density[:,:S//2] += .2
# make a point cloud motivated by this density
random.seed(10) # so this can be repeated
points = []
while len(points)<N:
v, ix, iy = random.random(), random.randint(0,S-1), random.randint(0,S-1)
if True: #v<density[ix,iy]:
points.append([ix, iy, density[ix,iy]])
locations = array(points).transpose()
print locations.shape
plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1,:], locations[0,:], '.k')
plt.xlim((0,S))
plt.ylim((0,S))
plt.show()
# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0,i]-L[0,j])**2+(L[1,i]-L[1,j])**2), L[2,i], L[2,j]]
for i in range(N) for j in range(N) if i>j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize it's production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of h-values in ihvals -- ie, S/1000 was used below).
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:,0]) # h sorting
h = m[i,0]
zh = m[i,1]
zsh = m[i,2]
zz = zh*zsh
hvals = linspace(0,S,1000) # the values of h to use (S should be in the units of distance, here I just used ints)
ihvals = searchsorted(h, hvals)
result = []
for i, ihval in enumerate(ihvals[1:]):
start, stop = ihvals[i-1], ihval
N = stop-start
if N>0:
mnh = sum(zh[start:stop])/N
mph = sum(zsh[start:stop])/N
szz = sum(zz[start:stop])/N
C = szz-mnh*mph
result.append([h[ihval], C])
result = array(result)
plt.plot(result[:,0], result[:,1])
plt.grid()
plt.show()
which looks reasonable to me as one can see bumps or troughs at the expected for the h values, but I haven't done a careful check.
The main speedup here over scipy.cov, is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculate could be sped up even more by doing partial sums, ie, from ihvals[n-1] to ihvals[n] at each timestep n, but I doubt that will be necessary.