I'd like to generate correlated arrays of x and y coordinates in order to test various matplotlib plotting approaches, but I'm failing somewhere, because I can't get numpy.random.multivariate_normal to give me the samples I want. Ideally, I want my x values between -0.51 and 51.2, and my y values between 0.33 and 51.6 (though I suppose equal ranges would be OK, since I can constrain the plot afterwards), but I'm not sure what mean (0, 0?) and covariance values I should be using to get these samples from the function.
As the name implies, numpy.random.multivariate_normal generates normal distributions; this means there is a non-zero probability of finding points outside any given interval. You can generate correlated uniform distributions, but this is a little more convoluted. Take a look here for two possible methods.
If you want to go with the normal distribution, you can set up the sigmas so that your half-interval corresponds to 3 standard deviations (you can also filter out the bad points if needed; a short sketch of that follows the code below). In this way you will have ~99% of your points inside your interval. For example:
import numpy as np
from matplotlib.pyplot import scatter
xx = np.array([-0.51, 51.2])
yy = np.array([0.33, 51.6])
means = [xx.mean(), yy.mean()]
stds = [xx.std() / 3, yy.std() / 3]  # std of a 2-point array is half the range; /3 makes the interval span +/-3 sigma
corr = 0.8  # correlation
covs = [[stds[0]**2,           stds[0]*stds[1]*corr],
        [stds[0]*stds[1]*corr, stds[1]**2]]
m = np.random.multivariate_normal(means, covs, 1000).T
scatter(m[0], m[1])
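If you do need every sample to lie strictly inside the requested ranges, here is a minimal sketch of the filtering mentioned above, reusing xx, yy and m from the code just shown:
# boolean mask of samples inside both requested intervals
inside = ((m[0] >= xx[0]) & (m[0] <= xx[1]) &
          (m[1] >= yy[0]) & (m[1] <= yy[1]))
scatter(m[0][inside], m[1][inside])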
import numpy as np

P = np.array(
    [
        [0.03607908, 0.03760034, 0.00503184, 0.0205082,  0.01051408,
         0.03776221, 0.00131325, 0.03760817, 0.01770659],
        [0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
         0.04778769, 0.01021053, 0.00324185, 0.02475319],
        [0.03770951, 0.01053285, 0.01227089, 0.0339596,  0.02296711,
         0.02187814, 0.01925662, 0.0196836,  0.01996279],
        [0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
         0.02352593, 0.00300314, 0.00103487, 0.04071951],
        [0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
         0.02032679, 0.02536328, 0.03552956, 0.01107725],
    ]
)
Here's the dataset, where X corresponds to rows and Y corresponds to columns. I'm trying to figure out how to calculate the covariance and the marginal density probability (MDP) for Y (the columns). For the covariance I have the following.
np.cov(P)
array([[ 2.247e-04, 6.999e-05, 2.571e-05, -2.822e-05, 1.061e-04],
[ 6.999e-05, 2.261e-04, 9.535e-07, 8.165e-05, -2.013e-05],
[ 2.571e-05, 9.535e-07, 7.924e-05, 1.357e-05, -8.118e-05],
[-2.822e-05, 8.165e-05, 1.357e-05, 2.039e-04, -1.267e-04],
[ 1.061e-04, -2.013e-05, -8.118e-05, -1.267e-04, 2.372e-04]])
How do I get the MDP? Also, is there a way to use numpy to select just the X and Y values and assign them to variables, where X = P's rows and Y = P's columns?
The data stored in P are a little ambiguous. In statistics, X and Y have a very specific meaning. Usually, each row refers to one observation (i.e. data point) of some statistical object, while each column represents a feature that is measured for each statistical object. In your case, there would be 9 observations with 5 features. This is referred to as a design matrix X, considered exogenous (independent), and it serves as the foundation of most statistical learning algorithms. In supervised learning, there is additionally a vector Y whose length equals the number of rows of X.
Your task at hand is of an unsupervised nature, as there is no response vector Y and you are interested in the distribution of X alone. This opens up additional questions. Indeed, np.cov() computes the empirical covariance matrix, measuring the pairwise covariance between each of these 5 features and resulting in the symmetric 5x5 matrix you showed. Asking for the marginal probability density of each column (i.e. feature), however, refers to the univariate distribution of each feature alone; the covariance between features is irrelevant for this task.
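As an aside, on your second question about pulling out X and Y with NumPy: plain indexing does this. A minimal sketch under the interpretation above (reusing P from your question; the names are only illustrative):
row_i = P[0, :]   # first row of P (9 values)
col_j = P[:, 0]   # first column of P (5 values)
X = P.T           # 9x5 design matrix if the 5 rows of P are taken as the features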
There are several methodologies for obtaining estimates of an unknown distribution given some data. Broadly speaking, they fall into two categories: parametric and non-parametric. I'll explain how each methodology works and how it can be implemented by leveraging NumPy exactly in the way you alluded to.
1. Parametric density estimation
In many cases, one assumes that the data stem from a particular parametric distribution. This distributional assumption is mostly based on convenience rather than prior knowledge. Estimating the unknown parameter values then completely determines the distribution(s). In your case, for example, you could assume that each feature is univariate normally distributed.
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats

# mere numerical grid to plot densities (no statistical significance)
x_ = np.linspace(0.0, 0.055, 1000)

# estimate mean (mu) and standard deviation (sigma) for each feature
mean_vec = np.mean(P, axis=1)
std_vec = np.std(P, axis=1)

# plot the fitted normal density of each feature
for i in range(5):
    plt.plot(x_, stats.norm.pdf(x_, loc=mean_vec[i], scale=std_vec[i]),
             label='Col. {}'.format(i + 1))
plt.suptitle('Marginal Distribution Estimates', fontsize=21, y=1.025)
plt.title('parametric via univariate Normal densities', fontsize=14)
plt.legend(loc='upper right')
plt.show()
2. Nonparametric density estimation
Alternatively, you can use a histogram as a non-parametric estimator of the unknown probability density function of each column/feature. Note, however, that you still have to choose a bandwidth h that determines the width of the bins. Additionally, non-parametric tools require a larger sample size to provide accurate estimates; your sample size of 9 is likely insufficient.
import numpy as np
from matplotlib import pyplot as plt

# endpoints of all bins: implies bandwidth h = 0.00229
bins = np.linspace(0.0, 0.055, 25)
h = np.diff(bins)[0]

# histogram of each feature (the rows of P, matching the parametric code above)
for i in range(5):
    plt.hist(P[i, :], bins, alpha=0.5, label='Col. {}'.format(i + 1))
plt.suptitle('Marginal Distribution Estimates', fontsize=21, y=1.025)
plt.title('nonparametric via histograms (h={})'.format(round(h, 4)), fontsize=14)
plt.legend(loc='upper right')
plt.show()
Say the two series are:
x = [4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4]
y = [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4]
Series x clearly lags y by 12 time periods.
However, using the following code as suggested in Python cross correlation:
import numpy as np
c = np.correlate(x, y, "full")
lag = np.argmax(c) - c.size/2
leads to an incorrect lag of -0.5.
What's wrong here?
If you want to do it the easy way, you should simply use scipy.signal.correlation_lags.
Also, remember to subtract the mean from the inputs.
import numpy as np
from scipy import signal

x = [4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4]
y = [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4]

# cross-correlate the mean-subtracted signals
correlation = signal.correlate(x - np.mean(x), y - np.mean(y), mode="full")
# lag value corresponding to each entry of the correlation
lags = signal.correlation_lags(len(x), len(y), mode="full")
lag = lags[np.argmax(abs(correlation))]
This gives lag = -12, which is the difference between the index of the first 6 in x and in y; if you swap the inputs it gives +12.
Edit
Why subtract the mean
If the signals have a non-zero mean, the terms near the center of the correlation become larger, because there the two signals overlap over more samples. Furthermore, for very large data, subtracting the mean makes the calculations numerically more accurate.
Here I illustrate what would happen if the mean was not subtracted for this example.
import matplotlib.pyplot as plt

plt.plot(abs(correlation))
plt.plot(abs(signal.correlate(x, y, mode="full")))
plt.plot(abs(signal.correlate(np.ones_like(x)*np.mean(x), np.ones_like(y)*np.mean(y), mode="full")))
plt.legend(['subtracting mean', 'keeping the mean', 'constant signal'])
Notice that the maximum on the blue curve (at 10) does not coincide with the maximum of the orange curve.
I was hoping to use singular value decomposition to estimate the standard deviation of ellipsoid-shaped data. I'm not sure if this is the best approach, and I may be overthinking the entire process, so I need some help.
I simulated some data using the following script...
from matplotlib import pyplot as plt
import numpy

def svd_example():
    # simulate some data...
    # x values have standard deviation 3000
    xdata = numpy.random.normal(0, 3000, 5000).reshape(-1, 1)
    # y values have standard deviation 300
    ydata = numpy.random.normal(0, 300, 5000).reshape(-1, 1)
    # apply some rotation
    ydata_rotated = ydata + (xdata * 0.5)
    data = numpy.hstack((xdata, ydata_rotated))
    # get singular values
    left_singular_matrix, singular_values, right_singular_matrix = numpy.linalg.svd(data)
    print('singular values', singular_values)
    # plot data....
    plt.scatter(data[:, 0], data[:, 1], s=5)
    plt.ylim(-15000, 15000)
    plt.show()

svd_example()
I get singular values of...
>>> singular values [ 234001.71228678 18850.45155942]
My data looks like this...
I was under the assumption that the singular values would give me some indication of the spread of the data regardless of its rotation, right? But these values, [234001.71228678 18850.45155942], make no sense to me. My standard deviations were 3000 and 300. Do these singular values represent variance? How do I convert them?
The singular values indeed give some indication of the spread. In fact, they are related to the standard deviations in these directions. However, they are not normalized. If you divide by the square root of the number of samples, you will get values that closely resemble the standard deviations used for creating the data:
singular_values / np.sqrt(5000)
# array([ 3398.61320614, 264.00975837])
Why do you get 3400 and 264 instead of 3000 and 300? That is because ydata + (xdata * 0.5) is not a rotation but a shearing operation. A real rotation would preserve the original standard deviations.
For example, the following code would rotate the data by 40 degrees:
# apply some rotation
s = numpy.sin(40 * numpy.pi / 180)
c = numpy.cos(40 * numpy.pi / 180)
data = numpy.hstack((xdata, ydata)).dot([[c, s], [-s, c]])
With such a rotation you will get normalized singular values that are pretty close to the original standard deviations.
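As a quick sanity check, something like the following could be run on the rotated data (reusing the data array from the rotation snippet, with xdata and ydata simulated as in the question's script; exact numbers will vary between runs):
# normalized singular values of the rotated (not sheared) data
_, singular_values, _ = numpy.linalg.svd(data, full_matrices=False)
print(singular_values / numpy.sqrt(len(data)))
# expect values close to [3000, 300]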
Edit:
On Normalization
I have to admit, normalization is probably not the correct term to apply here. What I meant was not scaling the values to a certain range, but making them independent of the number of samples.
To understand where the division by sqrt(5000) comes from, let's talk about the standard deviation. Let x be a data vector of n samples with zero mean. Then the standard deviation is computed as sqrt(sum(x**2) / n), or equivalently sqrt(sum(x**2)) / sqrt(n). You can think of the singular value decomposition as computing only the sqrt(sum(x**2)) part, so we have to divide by sqrt(n) ourselves.
I'm afraid this is not a very mathematical explanation, but hopefully it conveys the idea.
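A tiny numerical illustration of that last point (a sketch with simulated numbers, not from the original question):
import numpy as np

x = np.random.normal(0, 300, 5000)
x = x - x.mean()   # make the mean exactly zero

# For a single zero-mean column, the only singular value is sqrt(sum(x**2)),
# so dividing it by sqrt(n) recovers the standard deviation.
sv = np.linalg.svd(x.reshape(-1, 1), full_matrices=False)[1][0]
print(sv / np.sqrt(len(x)))   # ~300
print(np.std(x))              # the same value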
I am trying to generate a cluster within the range 0 < x < 10 and 0 < y < 10, with the center at x = 5 and y = 5. I can't find any solutions online. Can anyone help me with this? Below is what I have so far:
import numpy as np
from sklearn.datasets import make_blobs
from pylab import *

centers = [[5, 5]]
X, labels_true = make_blobs(n_samples=100, centers=centers, cluster_std=0.5, random_state=0)
print(X)
Example of Output:
[ 5.07747371 5.18908126]
[ 4.6781908 3.88829842]
[ 5.03325861 5.15123595]
[ 4.44780833 5.02608254]
[ 4.77223375 5.00873958]
[ 5.76638961 5.73467938]
[ 5.08871307 4.79910953]
[ 4.68207696 5.33821665]
[ 5.58938979 4.91003758]
As you can see, the output values have x varying from 4 to 6 and the same for y. I need to be able to generate clusters where I can control this range.
make_blobs generates Gaussian clusters. These do not have a finite value range. Values outside a few standard deviations are unlikely, but not impossible. If you want to guarantee the value range, use a uniform distribution instead.
You can use centers to control the centers, and cluster_std to control the standard deviations. See the documentation of make_blobs for details.
Alternatively, if your application allows it, you can simply throw away values outside the range you request, effectively sampling from a truncated Gaussian. Finally, if it is not a valid option to throw away samples (for whatever reason), you can indeed sample two uniform numbers. And if you insist on getting a Gaussian distribution, you can Box-Muller transform the two uniform numbers into a 2D Gaussian (in the link: compute z1 and z2 from two uniform numbers x1 and x2 between 0 and 1):
http://mathworld.wolfram.com/Box-MullerTransformation.html
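For concreteness, here is a minimal sketch of these options (using the center, range and cluster_std values from the question; everything here is illustrative):
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100

# Option 1: uniform points in the box 0 < x, y < 10 (the range is guaranteed)
uniform_cluster = rng.uniform(low=0.0, high=10.0, size=(n_samples, 2))

# Option 2: Gaussian cluster around (5, 5), rejecting samples outside the box
# (a truncated Gaussian; the loop draws extra points until enough survive)
center, std = np.array([5.0, 5.0]), 0.5
points = []
while len(points) < n_samples:
    candidates = rng.normal(loc=center, scale=std, size=(n_samples, 2))
    inside = np.all((candidates > 0.0) & (candidates < 10.0), axis=1)
    points.extend(candidates[inside].tolist())
truncated_cluster = np.array(points[:n_samples])

# Option 3: Box-Muller transform of two uniforms into a 2D Gaussian
# (note: like make_blobs, this does not bound the range)
x1, x2 = rng.uniform(size=n_samples), rng.uniform(size=n_samples)
z1 = np.sqrt(-2 * np.log(x1)) * np.cos(2 * np.pi * x2)
z2 = np.sqrt(-2 * np.log(x1)) * np.sin(2 * np.pi * x2)
box_muller_cluster = np.column_stack((z1, z2)) * std + center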
I have a set of points (x, y) as two vectors x and y, for example:
from pylab import *
x = sorted(random(30))
y = random(30)
plot(x,y, 'o-')
Now I would like to smooth this data with a Gaussian and evaluate it only at certain (regularly spaced) points on the x-axis, let's say for:
x_eval = linspace(0,1,11)
I got the tip that this method is called a "Gaussian sum filter", but so far I have not found any implementation in numpy/scipy for that, although it seems like a standard problem at first glance.
As the x values are not equally spaced, I can't use scipy.ndimage.gaussian_filter1d.
Usually this kind of smoothing is done by going through Fourier space and multiplying with the kernel, but I don't really know if this will be possible with irregularly spaced data.
Thanks for any ideas
This will blow up for very large datasets, but the proper calculation you are asking for would be done as follows:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # for repeatability
x = np.random.rand(30)
x.sort()
y = np.random.rand(30)

x_eval = np.linspace(0, 1, 11)
sigma = 0.1

# distance of every evaluation point (rows) to every data point (columns)
delta_x = x_eval[:, None] - x
# Gaussian weights, normalized so each row sums to 1
weights = np.exp(-delta_x*delta_x / (2*sigma*sigma)) / (np.sqrt(2*np.pi) * sigma)
weights /= np.sum(weights, axis=1, keepdims=True)
# weighted average of the y values at each evaluation point
y_eval = np.dot(weights, y)

plt.plot(x, y, 'bo-')
plt.plot(x_eval, y_eval, 'ro-')
plt.show()
I'll preface this answer by saying that this is more of a DSP question than a programming question...
...that being said, there is a simple two-step solution to your problem.
Step 1: Resample the data
So to illustrate this we can create a random data set with unequal sampling:
import numpy as np
x = np.cumsum(np.random.randint(0,100,100))
y = np.random.normal(0,1,size=100)
This gives something like:
We can resample this data using simple linear interpolation:
nx = np.arange(x.max())   # choose new x axis sampling
ny = np.interp(nx, x, y)  # generate y values for each x
This converts our data to:
Step 2: Apply filter
At this stage you can use some of the tools available through scipy to apply a Gaussian filter to the data with a given sigma value:
from scipy.ndimage import gaussian_filter1d
fx = gaussian_filter1d(ny, sigma=100)
Plotting this up against the original data we get:
The choice of the sigma value determines the width of the filter.
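Putting the two steps together, a self-contained sketch of the approach above might look like this (the sigma here is measured in units of the resampled grid spacing, i.e. in samples):
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter1d

np.random.seed(0)
x = np.cumsum(np.random.randint(0, 100, 100))   # unequally spaced x values
y = np.random.normal(0, 1, size=100)

# Step 1: resample onto a regular grid by linear interpolation
nx = np.arange(x.max())
ny = np.interp(nx, x, y)

# Step 2: Gaussian filter on the regular grid (sigma in grid samples)
fx = gaussian_filter1d(ny, sigma=100)

plt.plot(x, y, '.', label='original samples')
plt.plot(nx, fx, label='resampled + filtered')
plt.legend()
plt.show()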
Based on @Jaime's answer, I wrote a function that implements this with some additional documentation and the ability to discard estimates far from the data points.
I think confidence intervals could be obtained on this estimate by bootstrapping, but I haven't done this yet.
import numpy as np

def gaussian_sum_smooth(xdata, ydata, xeval, sigma, null_thresh=0.6):
    """Apply a Gaussian sum filter to data.

    xdata, ydata : array
        Arrays of x- and y-coordinates of data.
        Must be 1d and have the same length.
    xeval : array
        Array of x-coordinates at which to evaluate the smoothed result.
    sigma : float
        Standard deviation of the Gaussian to apply to each data point.
        Larger values yield a smoother curve.
    null_thresh : float
        For evaluation points far from data points, the estimate will be
        based on very little data. If the total weight is below this threshold,
        return np.nan at this location. Zero means always return an estimate.
        The default of 0.6 corresponds to approximately one sigma away
        from the nearest datapoint.
    """
    # Distance between every combination of xdata and xeval
    # each row corresponds to a value in xeval
    # each col corresponds to a value in xdata
    delta_x = xeval[:, None] - xdata

    # Calculate weight of every value in delta_x using Gaussian
    # Maximum weight is 1.0 where delta_x is 0
    weights = np.exp(-0.5 * ((delta_x / sigma) ** 2))

    # Multiply each weight by every data point, and sum over data points
    smoothed = np.dot(weights, ydata)

    # Nullify the result when the total weight is below threshold
    # This happens at evaluation points far from any data
    # 1-sigma away from a data point has a weight of ~0.6
    nan_mask = weights.sum(1) < null_thresh
    smoothed[nan_mask] = np.nan

    # Normalize by dividing by the total weight at each evaluation point
    # Nullification above avoids divide by zero warnings here
    smoothed = smoothed / weights.sum(1)

    return smoothed
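For completeness, here is how one might call it on the kind of data from the question (a hypothetical usage sketch, not part of the original answer):
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.sort(np.random.rand(30))
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)

y_smooth = gaussian_sum_smooth(x, y, x_eval, sigma=0.1)

plt.plot(x, y, 'o-', label='data')
plt.plot(x_eval, y_smooth, 'o-', label='smoothed')
plt.legend()
plt.show()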