If I have a 2D numpy array composed of points (x, y) that give some value z(x, y) at each point, can I find the standard deviation along the x-axis and along the y-axis? I know that np.std(data) simply finds the standard deviation of the entire dataset, but that's not what I want. Adding axis=0 or axis=1 computes a standard deviation per column or per row, so it returns as many results as there are columns or rows.
If I just want one standard deviation along the y-axis and another along the x-axis, can I find these for a dataset like this? From my understanding, standard deviations along x and y normally make sense when you have points x with values y(x). But I need some sigma_x and sigma_y for a 2D Gaussian fit I'm trying to do. Is this possible?
Here is an oversimplified example, since my actual data is much larger.
import numpy as np
data = np.array([[1, 5, 0, 3], [3, 5, 1, 1], [41, 33, 9, 20], [11, 20, 4, 13]])
print(np.std(data)) #not what I want
>>> 11.78386
print(np.std(data, axis=0)) # this gives one result per column (as many as there are columns), so it's not what I want either
>>> [16.03 11.69 3.5 7.69]
I'm not sure what the output corresponding to what I want would look like, since I'm not even sure it's possible for a 2D array with more than two columns. But I want to know if it's possible to compute a standard deviation along the x-axis and one along the y-axis. I'm not even sure this makes sense for a 2D array... but if it doesn't, I don't know what to use as my sigma_x and sigma_y for a 2D Gaussian fit.
Standard deviation doesn't care whether y = f(x) or (x, y) are coordinates; it just measures how spread out a set of values is. If you have n points (x, y) making up an (n, 2) array, then std(axis=0) is what you want. It produces a (2,)-shaped array whose first element is the x-axis std and whose second is the y-axis std. Whether that is useful depends on what you want, and it ignores any correlation between x and y.
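For illustration, a minimal sketch with made-up points:
import numpy as np
# n random (x, y) points stored as an (n, 2) array
points = np.random.rand(100, 2)
sigma_x, sigma_y = np.std(points, axis=0)
print(sigma_x, sigma_y)  # std of the x coordinates, std of the y coordinates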
I think what you want is to separate the x axis into small intervals and compute the standard deviation of the y coordinates of the points within those intervals.
You could compute std(y_i), where y_i are the y coordinates for points x in the interval (x_min+i*delta_x, x_min+(i+1)*delta_x), choosing a small delta_x, such that enough points (x_j, y_j) lie within the interval.
import numpy as np
x = np.array([0, 0.11, 0.1, 0.01, 0.2, 0.22, 0.23])
y = np.array([1, 2, 3, 2, 2, 2.1, 2.2])
num_intervals = 3
#sort the arrays
sort_inds = np.argsort(x)
x = x[sort_inds]
y = y[sort_inds]
# create intervals
x_range = x.max() - x.min()
x_intervals = np.linspace(np.min(x)+x_range/num_intervals, x.max()-x_range/num_intervals, num_intervals)
print(x_intervals)
>> [0.07666667 0.115 0.15333333]
Next, we split the arrays y and x using these intervals:
# get indices of x where the elements of x_intervals
# should be inserted, in order to maintain the order
# for sufficiently large num_intervals it
# approximates the closest value in x to an element
# in x_intervals
split_indices = np.unique(np.searchsorted(x, x_intervals, side='left'))
ls_of_arrays_x = np.array_split(x, split_indices)
ls_of_arrays_y = np.array_split(y, split_indices)
print(ls_of_arrays_x)
print(ls_of_arrays_y)
>> [array([0. , 0.01]), array([0.1 , 0.11]), array([0.2 , 0.22, 0.23])]
>> [array([1., 2.]), array([3., 2.]), array([2. , 2.1, 2.2])]
Now compute the mean x coordinate of each interval and the corresponding standard deviation of y:
y_stds = np.array([np.std(yi) for yi in ls_of_arrays_y])
x_mean = np.array([np.mean(xi) for xi in ls_of_arrays_x])
print(x_mean)
print(y_stds)
>> [0.005 0.105 0.21666667]
>> [0.5 0.5 0.08164966]
I hope it was what you were looking for.
I have a vector of 2d means.
means = np.array([[0,0], [0, 3], [3,0], [3,3], [0, 5]])
I want to generate random normal numbers using this means vector.
If the means were only along the x axis, I would do something like this:
x_samples = np.asarray(list(map(lambda mean: np.random.normal(mean, 1), x_means)))
Is there a simple way to generate the samples for x and y together?
Thanks
With two mean values (x and y) for each point, I am assuming you want a multivariate normal distribution with these mean values in each axis and a standard deviation of 1 in each axis (as in your 1d example)?
In that case you can use np.random.multivariate_normal.
xy_samples = np.asarray([np.random.multivariate_normal(mean, np.diag([1., 1.])) for mean in means])
Or, similar to your formulation, using map:
xy_samples = np.asarray(list(map(lambda mean: np.random.multivariate_normal(mean, np.diag([1., 1.])), means)))
The np.diag call deals with the fact that you need to supply a covariance matrix, not a scalar variance.
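Since the covariance here is diagonal (independent axes with unit variance), a simpler alternative sketch is to let np.random.normal broadcast over the whole means array:
xy_samples = np.random.normal(loc=means, scale=1.0)  # one (x, y) sample per row of means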
Allow me to separate this to increasing difficulty questions:
1.
I have some 1d curve, given as a (n,) point array.
I would like to have it re-sampled k times, and have the results come from a cubic spline that passes through all points.
This can be done with interp1d
2.
The curve is given at non-same-interval samples as an array of shape (n, 2) where (:, 0) represents the sample time, and (:, 1) represent the sample values.
I want to re-sample the curve at k same-time-intervals.
How can this be done?
I thought I could do t_sampler = interp1d(np.arange(0,k),arr[:, 0]) for the time, then interp1d(t_sampler(np.arange(0,k)), arr[:, 1])
Am I missing something with this?
3.
How can I re-sample the curve at equal distance intervals? (question 2 was equal time intervals)
4.
What if the curve is 3d, given by an array of shape (n, 4), where (:, 0) are the (non-uniform) sampling times and the rest are the sampled locations?
Sorry for the many-questions-in-a-single-question; they seemed too similar to open a new question for each one.
Partial answer; for 1 and 2 I would do this:
from scipy.interpolate import interp1d
import numpy as np
import matplotlib.pyplot as plt
# dummy data
x = np.arange(-100,100,10)
y = x**2 + np.random.normal(0,1, len(x))
# interpolate:
f = interp1d(x,y, kind='cubic')
# resample at k intervals, with k = 100:
k = 100
# generate x axis:
xnew = np.linspace(np.min(x), np.max(x), k)
# call f on xnew to sample y values:
ynew = f(xnew)
plt.scatter(x,y)
plt.plot(xnew, ynew)
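For the (n, 2) layout in question 2 the same approach works; a sketch, assuming arr holds the (non-uniform) sample times in column 0 and the values in column 1:
import numpy as np
from scipy.interpolate import interp1d
# made-up (n, 2) data: non-uniform times in column 0, values in column 1
arr = np.column_stack([np.sort(np.random.rand(20)) * 10, np.random.rand(20)])
f = interp1d(arr[:, 0], arr[:, 1], kind='cubic')
t_new = np.linspace(arr[:, 0].min(), arr[:, 0].max(), 100)  # k = 100 equal time intervals
v_new = f(t_new)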
I have a set of points (x, y) as two vectors x and y, for example:
from pylab import *
x = sorted(random(30))
y = random(30)
plot(x,y, 'o-')
Now I would like to smooth this data with a Gaussian and evaluate it only at certain (regularly spaced) points on the x-axis. Let's say at:
x_eval = linspace(0,1,11)
I got the tip that this method is called a "Gaussian sum filter", but so far I have not found any implementation in numpy/scipy for that, although it seems like a standard problem at first glance.
As the x values are not equally spaced I can't use scipy.ndimage.gaussian_filter1d.
Usually this kind of smoothing is done by going through Fourier space and multiplying by the kernel, but I don't really know if that is possible with irregularly spaced data.
Thanks for any ideas
This will blow up for very large datasets, but the proper calculation you are asking for would be done as follows:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0) # for repeatability
x = np.random.rand(30)
x.sort()
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
sigma = 0.1
# pairwise differences between evaluation points (rows) and data points (columns)
delta_x = x_eval[:, None] - x
# Gaussian weight for each pair
weights = np.exp(-delta_x*delta_x / (2*sigma*sigma)) / (np.sqrt(2*np.pi) * sigma)
# normalize so the weights for each evaluation point sum to 1
weights /= np.sum(weights, axis=1, keepdims=True)
# smoothed value = weighted average of the data
y_eval = np.dot(weights, y)
plt.plot(x, y, 'bo-')
plt.plot(x_eval, y_eval, 'ro-')
plt.show()
I'll preface this answer by saying that this is more of a DSP question than a programming question...
...that being said, there is a simple two-step solution to your problem.
Step 1: Resample the data
So to illustrate this we can create a random data set with unequal sampling:
import numpy as np
x = np.cumsum(np.random.randint(0,100,100))
y = np.random.normal(0,1,size=100)
This gives something like:
We can resample this data using simple linear interpolation:
nx = np.arange(x.max()) # choose new x axis sampling
ny = np.interp(nx,x,y) # generate y values for each x
This converts our data to:
Step 2: Apply filter
At this stage you can use some of the tools available through scipy to apply a Gaussian filter to the data with a given sigma value:
from scipy.ndimage import gaussian_filter1d
fx = gaussian_filter1d(ny, sigma=100)
Plotting this up against the original data we get:
The choice of the sigma value determines the width of the filter; note that gaussian_filter1d takes sigma in units of samples, so with the unit-spaced nx above, sigma=100 corresponds to 100 units along the x-axis.
Based on @Jaime's answer I wrote a function that implements this with some additional documentation and the ability to discard estimates far from the data points.
I think confidence intervals could be obtained on this estimate by bootstrapping, but I haven't done this yet.
def gaussian_sum_smooth(xdata, ydata, xeval, sigma, null_thresh=0.6):
"""Apply gaussian sum filter to data.
xdata, ydata : array
Arrays of x- and y-coordinates of data.
Must be 1d and have the same length.
xeval : array
Array of x-coordinates at which to evaluate the smoothed result
sigma : float
Standard deviation of the Gaussian to apply to each data point
Larger values yield a smoother curve.
null_thresh : float
For evaluation points far from data points, the estimate will be
based on very little data. If the total weight is below this threshold,
return np.nan at this location. Zero means always return an estimate.
The default of 0.6 corresponds to approximately one sigma away
from the nearest datapoint.
"""
# Distance between every combination of xdata and xeval
# each row corresponds to a value in xeval
# each col corresponds to a value in xdata
delta_x = xeval[:, None] - xdata
# Calculate weight of every value in delta_x using Gaussian
# Maximum weight is 1.0 where delta_x is 0
weights = np.exp(-0.5 * ((delta_x / sigma) ** 2))
# Multiply each weight by every data point, and sum over data points
smoothed = np.dot(weights, ydata)
# Nullify the result when the total weight is below threshold
# This happens at evaluation points far from any data
# 1-sigma away from a data point has a weight of ~0.6
nan_mask = weights.sum(1) < null_thresh
smoothed[nan_mask] = np.nan
# Normalize by dividing by the total weight at each evaluation point
# Nullification above avoids divide by zero warning shere
smoothed = smoothed / weights.sum(1)
return smoothed
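For example, applied to data like the question's (a small sketch; sigma=0.1 is an arbitrary choice):
import numpy as np
np.random.seed(0)
x = np.sort(np.random.rand(30))
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
smoothed = gaussian_sum_smooth(x, y, x_eval, sigma=0.1)
print(smoothed)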
I'd like to generate correlated arrays of x and y coordinates, in order to test various matplotlib plotting approaches, but I'm failing somewhere, because I can't get numpy.random.multivariate_normal to give me the samples I want. Ideally, I want my x values between -0.51 and 51.2, and my y values between 0.33 and 51.6 (though I suppose equal ranges would be OK, since I can constrain the plot afterwards), but I'm not sure what mean (0, 0?) and covariance values I should be using to get these samples from the function.
As the name implies, numpy.random.multivariate_normal generates normal distributions; this means there is a non-zero probability of finding points outside any given interval. You can generate correlated uniform distributions, but this is a little more convoluted. Take a look here for two possible methods.
If you want to go with the normal distribution, you can set up the sigmas so that your half-interval corresponds to 3 standard deviations (you can also filter out the bad points if needed). That way ~99% of your points will fall inside your interval. For example:
import numpy as np
from matplotlib.pyplot import scatter
xx = np.array([-0.51, 51.2])
yy = np.array([0.33, 51.6])
means = [xx.mean(), yy.mean()]
stds = [xx.std() / 3, yy.std() / 3]
corr = 0.8 # correlation
covs = [[stds[0]**2 , stds[0]*stds[1]*corr],
[stds[0]*stds[1]*corr, stds[1]**2]]
m = np.random.multivariate_normal(means, covs, 1000).T
scatter(m[0], m[1])
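Following the remark about filtering, a quick sketch of discarding the few samples that still fall outside the target ranges:
# keep only the samples that landed inside the requested x and y ranges
inside = (m[0] >= xx[0]) & (m[0] <= xx[1]) & (m[1] >= yy[0]) & (m[1] <= yy[1])
x_in, y_in = m[0][inside], m[1][inside]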
I am interested in computing the power spectrum of a system of particles (~100,000) in 3D space with Python. What I have found so far is a group of functions in Numpy (fft,fftn,..) which compute the discrete Fourier transform, of which the square of the absolute value is the power spectrum. My question is a matter of how my data are being represented - and truthfully may be fairly simple to answer.
The data structure I have is an array which has a shape of (n, 3), n being the number of particles I have, and each column representing the x, y, or z coordinate of the n particles. The function I believe I should be using is the fftn() function, which takes the discrete Fourier transform of an n-dimensional array - but it says nothing about the format. How should the data be represented as a data structure to be fed into fftn?
Here is what I've tried so far to test the function:
import numpy as np
import random
import matplotlib.pyplot as plt
DATA = np.zeros((100,3))
for i in range(len(DATA)):
    DATA[i, 0] = random.uniform(-1, 1)
    DATA[i, 1] = random.uniform(-1, 1)
    DATA[i, 2] = random.uniform(-1, 1)
FFT = np.fft.fftn(DATA)
PS = abs(FFT)**2
plt.plot(PS)
plt.show()
The array entitled DATA is a mock array; ultimately the real thing will be 100,000 by 3 in shape. The output of the code gives me something like:
As you can see, I think this is giving me three 1D power spectra (1 for each column of my data), but really I'd like a power spectrum as a function of radius.
Does anybody have any advice or alternative methods/packages they know of to compute the power spectrum? (I'd even settle for the two-point autocorrelation function.)
It doesn't quite work the way you are setting it out...
You need a function, lets call it f(x, y, z), that describes the density of mass in space. In your case, you can consider the galaxies as point masses, so you will have a delta function centered at the location of each galaxy. It is for this function that you can calculate the three-dimensional autocorrelation, from which you could calculate the power spectrum.
If you want to use numpy to do that for you, you are first going to have to discretize your function. A possible mock example would be:
import numpy as np
import matplotlib.pyplot as plt
space = np.zeros((100, 100, 100), dtype=np.uint8)
x, y, z = np.random.randint(100, size=(3, 1000))
space[x, y, z] += 1  # note: duplicate (x, y, z) triples are only counted once by this fancy-indexed +=; np.add.at(space, (x, y, z), 1) would accumulate them
space_ps = np.abs(np.fft.fftn(space))
space_ps *= space_ps
space_ac = np.fft.ifftn(space_ps).real.round()
space_ac /= space_ac[0, 0, 0]
And now space_ac holds the three-dimensional autocorrelation function for the data set. This is not quite what you are after, and to get your one-dimensional correlation function you would have to average the values on spherical shells around the origin:
dist = np.minimum(np.arange(100), np.arange(100, 0, -1))
dist *= dist
dist_3d = np.sqrt(dist[:, None, None] + dist[:, None] + dist)
distances, inverse = np.unique(dist_3d, return_inverse=True)
values = np.bincount(inverse.ravel(), weights=space_ac.ravel()) / np.bincount(inverse.ravel())
plt.plot(distances[1:], values[1:])
There is another issue with doing things yourself this way: when you compute the power spectrum as above, mathematically it is as if your three-dimensional array wrapped around the borders, i.e. point [99, y, z] is a neighbour to [0, y, z]. So your autocorrelation could show two very distant galaxies as close neighbours. The simplest way to deal with this is to make your array twice as large along every dimension, pad with extra zeros, and then discard the extra data.
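A minimal sketch of that zero-padding idea, reusing the space array from above:
# embed the 100^3 density grid in a 200^3 grid of zeros so the FFT-based
# autocorrelation cannot wrap around the borders
padded = np.zeros((200, 200, 200), dtype=space.dtype)
padded[:100, :100, :100] = space
padded_ps = np.abs(np.fft.fftn(padded))**2
padded_ac = np.fft.ifftn(padded_ps).real
# only lags up to 99 along each axis are physically meaningful here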
Alternatively you could use scipy.ndimage.correlate with mode='constant' to do all the dirty work for you.