Getting the width and area of peaks from Scipy Signal object - python

How do I get peak objects with properties such as position, peak area, peak width, etc. from the SciPy signal function using the CWT find-peaks method:
def CWT(trace):
    x = []
    y = []
    for i in range(len(trace)):
        x.append(trace[i].Position)
        y.append(trace[i].Intensity)
    x = np.asarray(x)
    y = np.asarray(y)
    return signal.find_peaks_cwt(x, y)
This just returns an array?

First, it looks like you are using find_peaks_cwt incorrectly. Its two positional parameters are not the x and y coordinates of data points. The first parameter is the array of y-values. The x-values are not passed at all; they are assumed to be 0, 1, 2, .... The second parameter is a list of peak widths that you are interested in:
1-D array of widths to use for calculating the CWT matrix. In general, this range should cover the expected width of peaks of interest.
There is no reason for the widths parameter to be the same size as the data array. In my example below, the data has 500 values, but the widths I use are 30...99.
Second, this method only finds the position of peaks (the array you get has the indexes of peaks). There is no analysis of their widths and areas. You will either have to look elsewhere (blog post Peak Detection in the Python World lists some alternatives, though none of them return the data you want), or come up with your own method of estimating those things.
My attempt is below. It does the following:
1. Cuts the signal at the midpoints between peaks.
2. For each piece, uses the median of its values as the baseline.
3. Declares the peak to consist of all values that are greater than 0.5*(peak value + baseline), i.e., midway between median and maximum.
4. Finds where the peak begins and where it ends. (The width is just the difference of these.)
5. Declares the area of the peak to be the sum of (y - baseline) over the interval found in step 4.
Complete example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks_cwt

t = np.linspace(0, 4.2, 500)
y = np.sin(t**2) + np.random.normal(0, 0.03, size=t.shape)   # simulated noisy signal

peaks = find_peaks_cwt(y, np.arange(30, 100, 10))
cuts = (peaks[1:] + peaks[:-1])//2          # where to cut the signal
cuts = np.insert(cuts, [0, cuts.size], [0, t.size])
peak_begins = np.zeros_like(peaks)
peak_ends = np.zeros_like(peaks)
areas = np.zeros(peaks.shape)
for i in range(peaks.size):
    peak_value = y[peaks[i]]
    y_cut = y[cuts[i]:cuts[i+1]]            # piece of signal with 1 peak
    baseline = np.median(y_cut)
    large = np.where(y_cut > 0.5*(peak_value + baseline))[0]
    peak_begins[i] = large.min() + cuts[i]
    peak_ends[i] = large.max() + cuts[i]
    areas[i] = np.sum(y[peak_begins[i]:peak_ends[i]] - baseline)
The arrays areas, peak_begins and peak_ends are of interest here. The widths are [84 47 36], indicating the peaks get thinner (recall these are in index units, the width is the number of data points in the peak). I use this data to color the peaks in red:
widths = peak_ends - peak_begins
print(widths, areas)
plt.plot(t, y)
for i in range(peaks.size):
    plt.plot(t[peak_begins[i]:peak_ends[i]], y[peak_begins[i]:peak_ends[i]], 'r')
plt.show()
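If you would rather report the widths and areas in the units of t instead of index units, you can scale by the sample spacing; a small extension of the example above, assuming the uniform spacing used there:
dt = t[1] - t[0]
widths_t = (peak_ends - peak_begins) * dt
areas_t = areas * dt   # Riemann-sum approximation of the area under each peak
print(widths_t, areas_t)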

Related

How to find period of signal (autocorrelation vs fast fourier transform vs power spectral density)?

Suppose one wanted to find the period of a given sinusoidal wave signal. From what I have read online, it appears that the two main approaches employ either Fourier analysis or autocorrelation. I am trying to automate the process using Python, and my use case is to apply this concept to similar signals that come from the time-series of positions (or speeds or accelerations) of simulated bodies orbiting a star.
For simple-examples-sake, consider x = sin(t) for 0 ≤ t ≤ 10 pi.
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
## sample data
t = np.linspace(0, 10 * np.pi, 100)
x = np.sin(t)
fig, ax = plt.subplots()
ax.plot(t, x, color='b', marker='o')
ax.grid(color='k', alpha=0.3, linestyle=':')
plt.show()
plt.close(fig)
Given a sine-wave of the form x = a sin(b(t+c)) + d, the period of the sine-wave is obtained as 2 * pi / b. Since b=1 (or by visual inspection), the period of our sine wave is 2 * pi. I can check the results obtained from other methods against this baseline.
Attempt 1: Autocorrelation
As I understand it (please correct me if I'm wrong), correlation can be used to see if one signal is a time-lagged copy of another signal (similar to how cosine and sine differ by a phase difference). So autocorrelation is testing a signal against itself to measure the times at which the time-lag repeats said signal. Using the example posted here:
result = np.correlate(x, x, mode='full')
Since x and t each consist of 100 elements and result consists of 199 elements, I am not sure why I should arbitrarily select the last 100 elements.
print("\n autocorrelation (shape={}):\n{}\n".format(result.shape, result))
autocorrelation (shape=(199,)):
[ 0.00000000e+00 -3.82130761e-16 -9.73648712e-02 -3.70014208e-01
-8.59889695e-01 -1.56185995e+00 -2.41986054e+00 -3.33109112e+00
-4.15799070e+00 -4.74662427e+00 -4.94918053e+00 -4.64762251e+00
-3.77524157e+00 -2.33298717e+00 -3.97976240e-01 1.87752669e+00
4.27722402e+00 6.54129270e+00 8.39434617e+00 9.57785701e+00
9.88331103e+00 9.18204933e+00 7.44791758e+00 4.76948221e+00
1.34963425e+00 -2.50822289e+00 -6.42666652e+00 -9.99116299e+00
-1.27937834e+01 -1.44791297e+01 -1.47873668e+01 -1.35893098e+01
-1.09091510e+01 -6.93157447e+00 -1.99159756e+00 3.45267493e+00
8.86228186e+00 1.36707567e+01 1.73433176e+01 1.94357232e+01
1.96463736e+01 1.78556800e+01 1.41478477e+01 8.81191526e+00
2.32100171e+00 -4.70897483e+00 -1.15775811e+01 -1.75696560e+01
-2.20296487e+01 -2.44327920e+01 -2.44454330e+01 -2.19677060e+01
-1.71533510e+01 -1.04037163e+01 -2.33560966e+00 6.27458308e+00
1.45655029e+01 2.16769872e+01 2.68391837e+01 2.94553896e+01
2.91697473e+01 2.59122266e+01 1.99154591e+01 1.17007613e+01
2.03381596e+00 -8.14633251e+00 -1.78184255e+01 -2.59814393e+01
-3.17580589e+01 -3.44884934e+01 -3.38046447e+01 -2.96763956e+01
-2.24244433e+01 -1.26974172e+01 -1.41464998e+00 1.03204331e+01
2.13281784e+01 3.04712823e+01 3.67721634e+01 3.95170295e+01
3.83356037e+01 3.32477037e+01 2.46710643e+01 1.33886439e+01
4.77778141e-01 -1.27924775e+01 -2.50860560e+01 -3.51343866e+01
-4.18671622e+01 -4.45258983e+01 -4.27482779e+01 -3.66140001e+01
-2.66465884e+01 -1.37700036e+01 7.76494745e-01 1.55574483e+01
2.90828312e+01 3.99582426e+01 4.70285203e+01 4.95000000e+01
4.70285203e+01 3.99582426e+01 2.90828312e+01 1.55574483e+01
7.76494745e-01 -1.37700036e+01 -2.66465884e+01 -3.66140001e+01
-4.27482779e+01 -4.45258983e+01 -4.18671622e+01 -3.51343866e+01
-2.50860560e+01 -1.27924775e+01 4.77778141e-01 1.33886439e+01
2.46710643e+01 3.32477037e+01 3.83356037e+01 3.95170295e+01
3.67721634e+01 3.04712823e+01 2.13281784e+01 1.03204331e+01
-1.41464998e+00 -1.26974172e+01 -2.24244433e+01 -2.96763956e+01
-3.38046447e+01 -3.44884934e+01 -3.17580589e+01 -2.59814393e+01
-1.78184255e+01 -8.14633251e+00 2.03381596e+00 1.17007613e+01
1.99154591e+01 2.59122266e+01 2.91697473e+01 2.94553896e+01
2.68391837e+01 2.16769872e+01 1.45655029e+01 6.27458308e+00
-2.33560966e+00 -1.04037163e+01 -1.71533510e+01 -2.19677060e+01
-2.44454330e+01 -2.44327920e+01 -2.20296487e+01 -1.75696560e+01
-1.15775811e+01 -4.70897483e+00 2.32100171e+00 8.81191526e+00
1.41478477e+01 1.78556800e+01 1.96463736e+01 1.94357232e+01
1.73433176e+01 1.36707567e+01 8.86228186e+00 3.45267493e+00
-1.99159756e+00 -6.93157447e+00 -1.09091510e+01 -1.35893098e+01
-1.47873668e+01 -1.44791297e+01 -1.27937834e+01 -9.99116299e+00
-6.42666652e+00 -2.50822289e+00 1.34963425e+00 4.76948221e+00
7.44791758e+00 9.18204933e+00 9.88331103e+00 9.57785701e+00
8.39434617e+00 6.54129270e+00 4.27722402e+00 1.87752669e+00
-3.97976240e-01 -2.33298717e+00 -3.77524157e+00 -4.64762251e+00
-4.94918053e+00 -4.74662427e+00 -4.15799070e+00 -3.33109112e+00
-2.41986054e+00 -1.56185995e+00 -8.59889695e-01 -3.70014208e-01
-9.73648712e-02 -3.82130761e-16 0.00000000e+00]
Attempt 2: Fourier
Since I am not sure where to go from the last attempt, I sought a new attempt. To my understanding, Fourier analysis basically shifts a signal from/to the time-domain (x(t) vs t) to/from the frequency domain (x(t) vs f=1/t); the signal in frequency-space should appear as a sinusoidal wave that dampens over time. The period is obtained from the most observed frequency since this is the location of the peak of the distribution of frequencies.
Since my values are all real-valued, applying the Fourier transform should mean my output values are all complex-valued. I wouldn't think this is a problem, except for the fact that scipy has methods for real values. I do not fully understand the differences between all of the different scipy methods. That makes the algorithm proposed in this posted solution hard for me to follow (i.e., how/why is the threshold value picked?).
omega = np.fft.fft(x)
freq = np.fft.fftfreq(x.size, 1)
threshold = 0
idx = np.where(abs(omega)>threshold)[0][-1]
max_f = abs(freq[idx])
print(max_f)
This outputs 0.01, meaning the period is 1/0.01 = 100. This doesn't make sense either.
Attempt 3: Power Spectral Density
According to the scipy docs, I should be able to estimate the power spectral density (psd) of the signal using a periodogram (which, according to Wikipedia, is the Fourier transform of the autocorrelation function). By selecting the dominant frequency fmax at which the signal peaks, the period of the signal can be obtained as 1 / fmax.
freq, pdensity = signal.periodogram(x)
fig, ax = plt.subplots()
ax.plot(freq, pdensity, color='r')
ax.grid(color='k', alpha=0.3, linestyle=':')
plt.show()
plt.close(fig)
The periodogram shown below peaks at 49.076... at a frequency of fmax = 0.05. So, period = 1/fmax = 20. This doesn't make sense to me. I have a feeling it has something to do with the sampling rate, but don't know enough to confirm or progress further.
I realize I am missing some fundamental gaps in understanding how these things work. There are a lot of resources online, but it's hard to find this needle in the haystack. Can someone help me learn more about this?
Let's first look at your signal (I've added endpoint=False to make the division even):
t = np.linspace(0, 10*np.pi, 100, endpoint=False)
x = np.sin(t)
Let's divide out the radians (essentially by taking t /= 2*np.pi) and create the same signal by relating to frequencies:
fs = 20 # Sampling rate of 100/5 = 20 (e.g. Hz)
f = 1 # Signal frequency of 1 (e.g. Hz)
t = np.linspace(0, 5, 5*fs, endpoint=False)
x = np.sin(2*np.pi*f*t)
This makes it more salient that f/fs == 1/20 == 0.05 (i.e. the periodicity of the signal is exactly 20 samples). Frequencies in a digital signal always relate to its sampling rate, as you have already guessed. Note that the actual signal is exactly the same no matter what the values of f and fs are, as long as their ratio is the same:
fs = 1 # Natural units
f = 0.05
t = np.linspace(0, 100, 100*fs, endpoint=False)
x = np.sin(2*np.pi*f*t)
In the following I'll use these natural units (fs = 1). The only difference will be in t and hence the generated frequency axes.
Autocorrelation
Your understanding of what the autocorrelation function does is correct. It detects the correlation of a signal with a time-lagged version of itself. It does this by sliding the signal over itself as seen in the right column here (from Wikipedia):
Note that as both inputs to the correlation function are the same, the resulting signal is necessarily symmetric. That is why the output of np.correlate is usually sliced from the middle:
acf = np.correlate(x, x, 'full')[-len(x):]
Now index 0 corresponds to 0 delay between the two copies of the signal.
Next you'll want to find the index or delay that presents the largest correlation. Due to the shrinking overlap this will by default also be index 0, so the following won't work:
acf.argmax() # Always returns 0
Instead, I recommend finding the largest peak, where a peak is defined to be any index with a larger value than both its direct neighbours:
inflection = np.diff(np.sign(np.diff(acf))) # Find the second-order differences
peaks = (inflection < 0).nonzero()[0] + 1 # Find where they are negative
delay = peaks[acf[peaks].argmax()] # Of those, find the index with the maximum value
Now delay == 20, which tells you that the signal has a frequency of 1/20 of its sampling rate:
signal_freq = fs/delay # Gives 0.05
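If you prefer not to hand-roll the peak search, the same step can be done with scipy.signal.find_peaks (available in SciPy 1.1 and later). This is an alternative to the inflection-based code above, not part of the original answer:
from scipy.signal import find_peaks
peak_indices, _ = find_peaks(acf)                 # indices of all local maxima (lag 0 excluded)
delay = peak_indices[acf[peak_indices].argmax()]  # should again give 20 for this signal
signal_freq = fs/delay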
Fourier transform
You used the following to calculate the FFT:
omega = np.fft.fft(x)
freq = np.fft.fftfreq(x.size, 1)
These functions are designed for complex-valued signals. They will work for real-valued signals, but you'll get a symmetric output, as the negative frequency components will be identical to the positive frequency components. NumPy provides separate functions for real-valued signals:
ft = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), t[1]-t[0]) # Get frequency axis from the time axis
mags = abs(ft) # We don't care about the phase information here
Let's have a look:
plt.plot(freqs, mags)
plt.show()
Note two things: the peak is at frequency 0.05, and the maximum frequency on the axis is 0.5 (the Nyquist frequency, which is exactly half the sampling rate). If we had picked fs = 20, this would be 10.
Now let's find the maximum. The thresholding method you have tried can work, but the target frequency bin is selected blindly and so this method would suffer in the presence of other signals. We could just select the maximum value:
signal_freq = freqs[mags.argmax()] # Gives 0.05
However, this would fail if, e.g., we have a large DC offset (and hence a large component in index 0). In that case we could just select the highest peak again, to make it more robust:
inflection = np.diff(np.sign(np.diff(mags)))
peaks = (inflection < 0).nonzero()[0] + 1
peak = peaks[mags[peaks].argmax()]
signal_freq = freqs[peak] # Gives 0.05
If we had picked fs = 20, this would have given signal_freq == 1.0 due to the different time axis from which the frequency axis was generated.
Periodogram
The method here is essentially the same. The autocorrelation function of x has the same time axis and period as x, so we can use the FFT as above to find the signal frequency:
pdg = np.fft.rfft(acf)
freqs = np.fft.rfftfreq(len(x), t[1]-t[0])
plt.plot(freqs, abs(pdg))
plt.show()
This curve obviously has slightly different characteristics from the direct FFT on x, but the main takeaways are the same: the frequency axis ranges from 0 to 0.5*fs, and we find a peak at the same signal frequency as before: freqs[abs(pdg).argmax()] == 0.05.
Edit:
To measure the actual periodicity of np.sin, we can just use the "angle axis" that we passed to np.sin instead of the time axis when generating the frequency axis:
freqs = np.fft.rfftfreq(len(x), 2*np.pi*f*(t[1]-t[0]))
rad_period = 1/freqs[mags.argmax()] # 6.283185307179586
Though that seems pointless, right? We pass in 2*np.pi and we get 2*np.pi. However, we can do the same with any regular time axis, without presupposing pi at any point:
fs = 10
t = np.arange(1000)/fs
x = np.sin(t)
rad_period = 1/np.fft.rfftfreq(len(x), 1/fs)[abs(np.fft.rfft(x)).argmax()] # 6.25
Naturally, the true value now lies in between two bins. That's where interpolation comes in and the associated need to choose a suitable window function.
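One common refinement, not part of the answer above but a natural next step, is parabolic interpolation of the magnitude spectrum around the peak bin; a sketch assuming the fs = 10 example just above:
mags = abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1/fs)
k = mags.argmax()                                     # peak bin (assumed not at either end)
alpha, beta, gamma = mags[k-1], mags[k], mags[k+1]
delta = 0.5*(alpha - gamma)/(alpha - 2*beta + gamma)  # fractional bin offset, in (-0.5, 0.5)
refined_freq = (k + delta)*fs/len(x)
rad_period = 1/refined_freq                           # should land closer to 2*pi than 6.25
With spectral leakage present, applying a window function (e.g. np.hanning) to x before the FFT tends to make this estimate more reliable.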

Inverse FFT returns negative values when it should not

I have several points (x,y,z coordinates) in a 3D box with associated masses. I want to draw an histogram of the mass-density that is found in spheres of a given radius R.
I have written a code that, provided I did not make any errors (which I think I may have), works in the following way:
1. My "real" data is something huge, so I wrote a little code to generate non-overlapping points randomly, with arbitrary masses, in a box.
2. I compute a 3D histogram (weighted by mass) with a binning about 10 times smaller than the radius of my spheres.
3. I take the FFT of my histogram, compute the wave-modes (kx, ky and kz) and use them to multiply my histogram in Fourier space by the analytic expression of the 3D top-hat window (sphere filtering) function in Fourier space.
4. I inverse FFT my newly computed grid.
Drawing a 1D histogram of the values in each bin would then give me what I want.
My issue is the following: given what I do, there should not be any negative values in my inverted FFT grid (step 4), but I get some, and with values much higher than the numerical error.
If I run my code on a small box (300x300x300 cm3 and the points of separated by at least 1 cm) I do not get the issue. I do get it for 600x600x600 cm3 though.
If I set all the masses to 0, thus working on an empty grid, I do get back my 0 without any noted issues.
I here give my code in a full block so that it is easily copied.
import numpy as np
import matplotlib.pyplot as plt
import random
from numba import njit
# 1. Generate a bunch of points with masses from 1 to 3 separated by a radius of 1 cm
radius = 1
rangeX = (0, 100)
rangeY = (0, 100)
rangeZ = (0, 100)
rangem = (1,3)
qty = 20000 # or however many points you want
# Generate a set of all points within 1 of the origin, to be used as offsets later
deltas = set()
for x in range(-radius, radius+1):
    for y in range(-radius, radius+1):
        for z in range(-radius, radius+1):
            if x*x + y*y + z*z <= radius*radius:
                deltas.add((x,y,z))
X = []
Y = []
Z = []
M = []
excluded = set()
for i in range(qty):
    x = random.randrange(*rangeX)
    y = random.randrange(*rangeY)
    z = random.randrange(*rangeZ)
    m = random.uniform(*rangem)
    if (x,y,z) in excluded: continue
    X.append(x)
    Y.append(y)
    Z.append(z)
    M.append(m)
    excluded.update((x+dx, y+dy, z+dz) for (dx,dy,dz) in deltas)
print("There is ",len(X)," points in the box")
# Compute the 3D histogram
a = np.vstack((X, Y, Z)).T
b = 200
H, edges = np.histogramdd(a, weights=M, bins = b)
# Compute the FFT of the grid
Fh = np.fft.fftn(H, axes=(-3,-2, -1))
# Compute the different wave-modes
kx = 2*np.pi*np.fft.fftfreq(len(edges[0][:-1]))*len(edges[0][:-1])/(np.amax(X)-np.amin(X))
ky = 2*np.pi*np.fft.fftfreq(len(edges[1][:-1]))*len(edges[1][:-1])/(np.amax(Y)-np.amin(Y))
kz = 2*np.pi*np.fft.fftfreq(len(edges[2][:-1]))*len(edges[2][:-1])/(np.amax(Z)-np.amin(Z))
# I create a matrix containing the values of the filter in each point of the grid in Fourier space
R = 5
Kh = np.empty((len(kx),len(ky),len(kz)))
@njit(parallel=True)
def func_njit(kx, ky, kz, Kh):
    for i in range(len(kx)):
        for j in range(len(ky)):
            for k in range(len(kz)):
                if np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2) != 0:
                    Kh[i][j][k] = (np.sin((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)-(np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R*np.cos((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R))*3/((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)**3
                else:
                    Kh[i][j][k] = 1
    return Kh
Kh = func_njit(kx, ky, kz, Kh)
# I multiply each point of my grid by the associated value of the filter (multiplication in Fourier space = convolution in real space)
Gh = np.multiply(Fh, Kh)
# I take the inverse FFT of my filtered grid. I take the real part to get back floats but there should only be zeros for the imaginary part.
Density = np.real(np.fft.ifftn(Gh,axes=(-3,-2, -1)))
# Here it shows if there are negative values the magnitude of the error
print(np.min(Density))
D = Density.flatten()
N = np.mean(D)
# I then compute the histogram I want
hist, bins = np.histogram(D/N, bins='auto', density=True)
bin_centers = (bins[1:]+bins[:-1])*0.5
plt.plot(bin_centers, hist)
plt.xlabel('rho/rhom')
plt.ylabel('P(rho)')
plt.show()
Do you know why I'm getting these negative values? Do you think there is a simpler way to proceed?
Sorry if this is a very long post, I tried to make it very clear and will edit it with your comments, thanks a lot!
-EDIT-
A follow-up question on the issue can be found [here].
The filter you create in the frequency domain is only an approximation to the filter you want to create. The problem is that we are dealing with the DFT here, not the continuous-domain FT (with its infinite frequencies). The Fourier transform of a ball is indeed the function you describe; however, this function has infinite extent in the frequency domain: it is not band-limited!
By sampling this function only within a window, you are effectively multiplying it with an ideal low-pass filter (the rectangle of the domain). This low-pass filter, in the spatial domain, has negative values. Therefore, the filter you create also has negative values in the spatial domain.
This is a slice through the origin of the inverse transform of Kh (after I applied fftshift to move the origin to the middle of the image, for better display):
As you can tell here, there is some ringing that leads to negative values.
One way to overcome this ringing is to apply a windowing function in the frequency domain. Another option is to generate a ball in the spatial domain, and compute its Fourier transform. This second option would be the simplest to achieve. Do remember that the kernel in the spatial domain must also have the origin at the top-left pixel to obtain a correct FFT.
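A minimal sketch of that second option, assuming the cubic grid of b = 200 bins from the question and a sphere radius expressed in bin units (R_bins below is a made-up value; pick it to match your physical R and bin size):
b, R_bins = 200, 10                 # grid size per side and sphere radius, in bins (assumed values)
zz, yy, xx = np.ogrid[:b, :b, :b]
c = b // 2
ball = ((xx-c)**2 + (yy-c)**2 + (zz-c)**2 <= R_bins**2).astype(float)
ball /= ball.sum()                  # normalize so the filter averages over the sphere
Kh_ball = np.fft.fftn(np.fft.ifftshift(ball))   # ifftshift puts the origin at the [0,0,0] voxel
Gh = Fh * Kh_ball                   # then inverse FFT as before
Because this spatial-domain kernel is non-negative everywhere, convolving the (non-negative) mass histogram with it cannot produce negative values beyond floating-point error.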
A windowing function is typically applied in the spatial domain to avoid issues with the image border when computing the FFT. Here, I propose to apply such a window in the frequency domain to avoid similar issues when computing the IFFT. Note, however, that this will always further reduce the bandwidth of the kernel (the windowing function would work as a low-pass filter after all), and therefore yield a smoother transition of foreground to background in the spatial domain (i.e. the spatial domain kernel will not have as sharp a transition as you might like). The best known windowing functions are Hamming and Hann windows, but there are many others worth trying out.
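For the frequency-domain windowing option, a rough sketch (assuming the Kh array from the question, which is stored in FFT layout with the zero-frequency bin at index [0,0,0]) could use a separable Hann window:
wx = np.hanning(Kh.shape[0])
wy = np.hanning(Kh.shape[1])
wz = np.hanning(Kh.shape[2])
window = wx[:, None, None] * wy[None, :, None] * wz[None, None, :]
Kh_windowed = Kh * np.fft.ifftshift(window)   # center the window on the zero-frequency bin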
Unsolicited advice:
I simplified your code to compute Kh to the following:
kr = np.sqrt(kx[:,None,None]**2 + ky[None,:,None]**2 + kz[None,None,:]**2)
kr *= R
Kh = (np.sin(kr)-kr*np.cos(kr))*3/(kr)**3
Kh[0,0,0] = 1
I find this easier to read than the nested loops. It should also be significantly faster, and avoid the need for njit. Note that you were computing the same distance (what I call kr here) 5 times. Factoring out such computation is not only faster, but yields more readable code.
Just a guess:
Where do you get the idea that the imaginary part MUST be zero? Have you ever tried to take the absolute values (sqrt(re^2 + im^2)) and forget about the phase instead of just taking the real part? Just something that came to my mind.
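For reference, that suggestion amounts to replacing the last step of the question's code with (a one-line variation, not a fix for the underlying filter issue):
Density = np.abs(np.fft.ifftn(Gh, axes=(-3, -2, -1)))   # magnitude instead of real part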

Regrid 2D data onto larger 2D grid at given coordinates in Python

I have a square 2D array data that I would like to add to a larger 2D array frame at some given set of non-integer coordinates coords. The idea is that data will be interpolated onto frame with its center at the new coordinates.
Some toy data:
# A gaussian to add to the frame
x, y = np.meshgrid(np.linspace(-1,1,10), np.linspace(-1,1,10))
data = 50*np.exp(-np.sqrt(x**2+y**2)**2)
# The frame to add the gaussian to
frame = np.random.normal(size=(100,50))
# The desired (x,y) location of the gaussian center on the new frame
coords = 23.4, 22.6
Here's the idea. I want to add this:
to this:
to get this:
If the coordinates were integers (indexes), of course I could simply add them like this:
frame[23:33,22:32] += data
But I want to be able to specify non-integer coordinates so that data is regridded and added to frame.
I've looked into PIL.Image methods but my use case is just for 2D data, not images. Is there a way to do this with just scipy? Can this be done with interp2d or a similar function? Any guidance would be greatly appreciated!
Scipy's shift function from scipy.ndimage.interpolation is what you are looking for, as long as the grid spacings of data and frame match. If not, look at the other answer. The shift function can take floating point numbers as input and will do a spline interpolation. First, I put the data into an array as large as frame, then shift it, and then add it. Make sure to reverse the coordinate list, as x is the rightmost dimension in numpy arrays. One of the nice features of shift is that it sets to zero those values that go out of bounds.
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage.interpolation import shift
# A gaussian to add to the frame.
x, y = np.meshgrid(np.linspace(-1,1,10), np.linspace(-1,1,10))
data = 50*np.exp(-np.sqrt(x**2+y**2)**2)
# The frame to add the gaussian to
frame = np.random.normal(size=(100,50))
x_frame = np.arange(50)
y_frame = np.arange(100)
# The desired (x,y) location of the gaussian center on the new frame.
coords = np.array([23.4, 22.6])
# First, create a frame as large as the frame.
data_large = np.zeros(frame.shape)
data_large[:data.shape[0], :data.shape[1]] = data[:,:]
# Subtract half the distance as the bottom left is at 0,0 instead of the center.
# The shift of 4.5 is because data is 10 points wide.
# Reverse the coords array as x is the last coordinate.
coords_shift = -4.5
data_large = shift(data_large, coords[::-1] + coords_shift)
frame += data_large
# Plot the result and add lines to indicate the coordinates
plt.figure()
plt.pcolormesh(x_frame, y_frame, frame, cmap=plt.cm.jet)
plt.axhline(coords[1], color='w')
plt.axvline(coords[0], color='w')
plt.colorbar()
plt.gca().invert_yaxis()
plt.show()
The script gives you the following figure, which has the desired coordinates indicated with white dotted lines.
One possible solution is to use scipy.interpolate.RectBivariateSpline. In the code below, x_0 and y_0 are the coordinates of a feature from data (i.e., the position of the center of the Gaussian in your example) that need to be mapped to the coordinates given by coords. There are a couple of advantages to this approach:
If you need to "place" the same object into multiple locations in the output frame, the spline needs to be computed only once (but evaluated multiple times).
In case you actually need to compute the integrated flux of the model over a pixel, you can use the integral method of scipy.interpolate.RectBivariateSpline (a small sketch of this is shown after the plotting code below).
Resample using spline interpolation:
from scipy.interpolate import RectBivariateSpline

x = np.arange(data.shape[1], dtype=float)
y = np.arange(data.shape[0], dtype=float)
kx = 3; ky = 3  # spline degree
spline = RectBivariateSpline(x, y, data.T, kx=kx, ky=ky, s=0)
# Define coordinates of a feature in the data array.
# This can be the center of the Gaussian:
x_0 = (data.shape[1] - 1.0) / 2.0
y_0 = (data.shape[0] - 1.0) / 2.0
# create output grid, shifted as necessary:
yg, xg = np.indices(frame.shape, dtype=np.float64)
xg += x_0 - coords[0] # see below how to account for pixel scale change
yg += y_0 - coords[1] # see below how to account for pixel scale change
# resample and fill extrapolated points with 0:
resampled_data = spline.ev(xg, yg)
extrapol = (((xg < -0.5) | (xg >= data.shape[1] - 0.5)) |
((yg < -0.5) | (yg >= data.shape[0] - 0.5)))
resampled_data[extrapol] = 0
Now plot the frame and resampled data:
plt.figure(figsize=(14, 14));
plt.imshow(frame+resampled_data, cmap=plt.cm.jet,
origin='upper', interpolation='none', aspect='equal')
plt.show()
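As mentioned in the advantages above, the same spline object can also give the flux integrated over a whole output pixel instead of a point sample. A minimal sketch (the pixel indices are made up for illustration, and the result is only meaningful where the pixel maps back inside the data array):
i, j = 40, 25   # hypothetical output pixel (row, column)
flux = spline.integral(xg[i, j] - 0.5, xg[i, j] + 0.5,
                       yg[i, j] - 0.5, yg[i, j] + 0.5)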
If you also want to allow for scale changes, then replace code for computing xg and yg above with:
coords = 20, 80 # change coords to easily identifiable (in plot) values
zoom_x = 2 # example scale change along X axis
zoom_y = 3 # example scale change along Y axis
yg, xg = np.indices(frame.shape, dtype=np.float64)
xg = (xg - coords[0]) / zoom_x + x_0
yg = (yg - coords[1]) / zoom_y + y_0
Most likely this is what you actually want based on your example. Specifically, the coordinates of pixels in data are "spaced" by 0.222(2) distance units. Therefore it actually seems that for your particular example (whether accidental or intentional), you have a zoom factor of 0.222(2). In that case your data image would shrink to almost 2 pixels in the output frame.
Comparison to @Chiel's answer
In the image below, I compare the results from my method (left), @Chiel's method (center), and the difference (right panel):
Fundamentally, the two methods are quite similar and possibly even use the same algorithm (I did not look at the code for shift but, based on the description, it also uses splines). From the comparison image it is visible that the biggest differences are at the edges and, for reasons unknown to me, shift seems to truncate the shifted image slightly too soon.
I think the biggest difference is that my method allows for pixel scale changes and it also allows re-use of the same interpolator to place the original image at different locations in the output frame. @Chiel's method is somewhat simpler but (what I did not like about it) it requires creation of a larger array (data_large) into which the original image is placed in the corner.
While the other answers have gone into detail, here's my lazy solution:
xc,yc = 23.4, 22.6
x, y = np.meshgrid(np.linspace(-1,1,10)-xc%1, np.linspace(-1,1,10)-yc%1)
data = 50*np.exp(-np.sqrt(x**2+y**2)**2)
frame = np.random.normal(size=(100,50))
frame[23:33,22:32] += data
And it gives the result you wanted. As you mentioned, the coordinates of both are the same, so the origin of data sits somewhere between the indices. Now simply shift it by the amount you want it to be off a grid point (the remainder modulo one) in the second line and you're good to go (you might need to flip the sign, but I think this is correct).

Moving average produces array of different length?

This question has a lot of useful answers on how to get a moving average.
I have tried the two methods of numpy convolution and numpy cumsum and both worked fine on an example dataset, but produced a shorter array on my real data.
The data are spaced by 0.01. The example dataset has a length of 50, the real data tens of thousands. So it must be something about the window size that is causing the problem and I don't quite understand what is going on in the functions.
This is how I define the functions:
def smoothMAcum(depth, temp, scale):   # Moving average by cumsum, scale = window size in m
    dz = np.diff(depth)
    N = int(scale/dz[0])
    cumsum = np.cumsum(np.insert(temp, 0, 0))
    smoothed = (cumsum[N:] - cumsum[:-N]) / N
    return smoothed

def smoothMAconv(depth, temp, scale):  # Moving average by numpy convolution
    dz = np.diff(depth)
    N = int(scale/dz[0])
    smoothed = np.convolve(temp, np.ones((N,))/N, mode='valid')
    return smoothed
Then I implement it:
scale = 5.
smooth = smoothMAconv(dep,data, scale)
but print(len(dep), len(smooth))
returns 81071 80572
and the same happens if I use the other function.
How can I get the smooth array of the same length as the data?
And why did it work on the small dataset? Even if I try different scales (and use the same for the example and for the data), the result in the example has the same length as the original data, but not in the real application.
I considered an effect of nan values, but if I have a nan in the example, it doesn't make a difference.
So where is the problem, if possible to tell without the full dataset?
The second of your approaches is easy to modify to preserve the length, because numpy.convolve supports the parameter mode='same'. (The shortening you see is expected: mode='valid' returns only len(temp) - N + 1 values; with your spacing of 0.01 and scale of 5, N = 500, which accounts for the 499 missing points.)
np.convolve(temp, np.ones((N,))/N, mode='same')
This is made possible by zero-padding the data set temp on both sides, which will inevitably have some effect at the boundaries unless your data happens to be 0 near the boundaries. Example:
import numpy as np
import matplotlib.pyplot as plt

N = 10
x = np.linspace(0, 2, 100)
y = x**2 + np.random.uniform(size=x.shape)
y_smooth = np.convolve(y, np.ones((N,))/N, mode='same')
plt.plot(x, y, 'r.')
plt.plot(x, y_smooth)
plt.show()
The boundary effect of zero-padding is very visible at the right end, where the data points are about 4-5 but are padded by 0.
To reduce this undesired effect, use numpy.pad for more intelligent padding and revert to mode='valid' for the convolution. The pad width must be such that in total N-1 elements are added, where N is the size of the moving window.
y_padded = np.pad(y, (N//2, N-1-N//2), mode='edge')
y_smooth = np.convolve(y_padded, np.ones((N,))/N, mode='valid')
Padding by edge values of an array looks much better.
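Putting the pieces together, a length-preserving variant of your smoothMAconv could look like this (a sketch based on the padding above, not tested on your data):
import numpy as np

def smoothMA_same(depth, temp, scale):
    """Moving average that returns an array of the same length as temp.
    scale is the window size in the units of depth (assumed evenly spaced)."""
    dz = np.diff(depth)
    N = int(scale/dz[0])
    temp_padded = np.pad(temp, (N//2, N-1-N//2), mode='edge')
    return np.convolve(temp_padded, np.ones((N,))/N, mode='valid')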

How can I set a minimum distance constraint for generating points with numpy.random.rand?

I am trying to generate efficient code for generating a number of random position vectors, which I then use to calculate a pair correlation function. I am wondering if there is a straightforward way to set a constraint on the minimum distance allowed between any two points placed in my box.
My code currently is as follows:
def pointRun(number, dr):
    """
    Compute the 3D pair correlation function
    for a random distribution of 'number' particles
    placed into a 1.0x1.0x1.0 box.
    """
    ## Create array of distances over which to calculate.
    r = np.arange(0., 1.0+dr, dr)
    ## Generate list of arrays to define the positions of all points,
    ## and calculate number density.
    a = np.random.rand(number, 3)
    numberDensity = len(a)/1.0**3
    ## Find reference points within desired region to avoid edge effects.
    b = [s for s in a if all(s > 0.4) and all(s < 0.6)]
    ## Compute pairwise correlation for each reference particle
    dist = scipy.spatial.distance.cdist(a, b, 'euclidean')
    allDists = dist[(dist < np.sqrt(3))]
    ## Create histogram to generate radial distribution function, (RDF) or R(r)
    Rr, bins = np.histogram(allDists, bins=r, density=False)
    ## Make empty containers to hold radii and pair density values.
    radii = []
    rhor = []
    ## Normalize RDF values by distance and shell volume to get pair density.
    for i in range(len(Rr)):
        y = (r[i] + r[i+1])/2.
        radii.append(y)
        x = np.average(Rr[i])/(4./3.*np.pi*(r[i+1]**3 - r[i]**3))
        rhor.append(x)
    ## Generate normalized pair density function, by total number density
    gr = np.divide(rhor, numberDensity)
    return radii, gr
I have previously tried using a loop that calculated all distances for each point as it was made and then accepted or rejected. This method was very slow if I use a lot of points.
Here is a scalable O(n) solution using numpy. It works by initially specifying an equidistant grid of points and then perturbing the points by some amount, keeping the distance between any two points at least min_dist.
You'll want to tweak the number of points, box shape and perturbation sensitivity to get the min_dist you want.
Note: If you fix the size of a box and specify a minimum distance between every point, it makes sense that there will be a limit to the number of points you can draw satisfying the minimum distance.
import numpy as np
import matplotlib.pyplot as plt
# specify params
n = 500
shape = np.array([64, 64])
sensitivity = 0.8 # 0 means no movement, 1 means max distance is init_dist
# compute grid shape based on number of points
width_ratio = shape[1] / shape[0]
num_y = np.int32(np.sqrt(n / width_ratio)) + 1
num_x = np.int32(n / num_y) + 1
# create regularly spaced neurons
x = np.linspace(0., shape[1]-1, num_x, dtype=np.float32)
y = np.linspace(0., shape[0]-1, num_y, dtype=np.float32)
coords = np.stack(np.meshgrid(x, y), -1).reshape(-1,2)
# compute spacing
init_dist = np.min((x[1]-x[0], y[1]-y[0]))
min_dist = init_dist * (1 - sensitivity)
assert init_dist >= min_dist
print(min_dist)
# perturb points
max_movement = (init_dist - min_dist)/2
noise = np.random.uniform(
    low=-max_movement,
    high=max_movement,
    size=(len(coords), 2))
coords += noise
# plot
plt.figure(figsize=(10*width_ratio,10))
plt.scatter(coords[:,0], coords[:,1], s=3)
plt.show()
Based on @Samir's answer, I made it a callable function for your convenience :)
import numpy as np
import matplotlib.pyplot as plt

def generate_points_with_min_distance(n, shape, min_dist):
    # compute grid shape based on number of points
    width_ratio = shape[1] / shape[0]
    num_y = np.int32(np.sqrt(n / width_ratio)) + 1
    num_x = np.int32(n / num_y) + 1

    # create regularly spaced neurons
    x = np.linspace(0., shape[1]-1, num_x, dtype=np.float32)
    y = np.linspace(0., shape[0]-1, num_y, dtype=np.float32)
    coords = np.stack(np.meshgrid(x, y), -1).reshape(-1, 2)

    # compute spacing
    init_dist = np.min((x[1]-x[0], y[1]-y[0]))

    # perturb points
    max_movement = (init_dist - min_dist)/2
    noise = np.random.uniform(low=-max_movement,
                              high=max_movement,
                              size=(len(coords), 2))
    coords += noise

    return coords
coords = generate_points_with_min_distance(n=8, shape=(2448,2448), min_dist=256)
# plot
plt.figure(figsize=(10,10))
plt.scatter(coords[:,0], coords[:,1], s=3)
plt.show()
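To sanity-check the output of the function above, you can verify that the smallest pairwise distance really is at least min_dist (a small addition using scipy):
from scipy.spatial.distance import pdist
print(pdist(coords).min())   # should be >= min_dist (256 in the call above)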
As I understand it, you're looking for an algorithm to create many random points in a box such that no two points are closer than some minimum distance. If this is your problem, then you can take advantage of statistical physics and solve it using molecular dynamics software. Moreover, you do need molecular dynamics or Monte Carlo to obtain an exact solution of this problem.
You place N atoms in a rectangular box, create a repulsive interaction of a fixed radius between them (such as a shifted Lennard-Jones interaction), and run the simulation for some time (until you see that the points spread out uniformly throughout the box). By the laws of statistical physics you can show that the positions of the points will be maximally random given the constraint that points cannot be closer than some distance. This would not be true if you use an iterative algorithm, such as placing points one by one and rejecting them if they overlap.
I would estimate a runtime of several seconds for 10000 points, and several minutes for 100k. I use OpenMM for all my molecular dynamics simulations.
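The molecular-dynamics route needs an MD package such as OpenMM, but the underlying relaxation idea can be illustrated with plain numpy/scipy. The following is a rough sketch of that idea (not OpenMM, not a proper MD integrator, and convergence is not guaranteed for dense packings):
import numpy as np
from scipy.spatial import cKDTree

def relax_points(n, box=1.0, min_dist=0.05, step=0.5, max_iter=1000, seed=0):
    """Start from random points and repeatedly push apart any pair closer than min_dist."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, 3)) * box
    for _ in range(max_iter):
        pairs = cKDTree(pts).query_pairs(min_dist)
        if not pairs:
            return pts
        idx = np.array(list(pairs))
        i, j = idx[:, 0], idx[:, 1]
        d = pts[i] - pts[j]
        norm = np.linalg.norm(d, axis=1, keepdims=True)
        push = step * (min_dist - norm) * d / np.maximum(norm, 1e-12)
        np.add.at(pts, i, push / 2)
        np.add.at(pts, j, -push / 2)
        pts = np.clip(pts, 0.0, box)        # keep the points inside the box
    raise RuntimeError("did not converge; lower min_dist or the number of points")

points = relax_points(2000, min_dist=0.05)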
# example of generating 50 points in a square of 4000x4000 and with minimum distance of 400
import numpy as np
import random as rnd

n_points = 50
x, y = np.zeros(n_points), np.zeros(n_points)
x[0], y[0] = np.round(rnd.uniform(0, 4000)), np.round(rnd.uniform(0, 4000))
min_distances = []
i = 1
while i < n_points:
    x_temp, y_temp = np.round(rnd.uniform(0, 4000)), np.round(rnd.uniform(0, 4000))
    distances = []
    for j in range(0, i):
        distances.append(np.sqrt((x_temp-x[j])**2 + (y_temp-y[j])**2))
    min_distance = np.min(distances)
    if min_distance > 400:
        min_distances.append(min_distance)
        x[i] = x_temp
        y[i] = y_temp
        i = i + 1
print(x, y)
