Derivatives of delta - python

I want to calculate E using this equation, but I am not sure numpy.diff is the right tool: it returns only 4 values for my 5 data points.
from numpy import diff
x = [395.33, 472.12, 560.45, 652.72, 732.55]
y = [0.17, 0.22, 0.28, 0.34, 0.41]
E = diff(y) / diff(x)
print(E)
Output:
[0.00065113 0.00067927 0.00065027 0.00087686]

This is expected: the derivative is computed only on the intermediate segments between successive points, so you get one value fewer than the number of points.
What you expect is unclear; do you want to compute the gradient instead?
import numpy as np
E = np.gradient(y, x)
Output:
array([0.00065113, 0.00066422, 0.00066508, 0.00077175, 0.00087686])
Differences between diff and gradient: np.diff(y) / np.diff(x) gives the slope of each segment (n-1 values), while np.gradient(y, x) estimates the derivative at every point (n values), using central differences at interior points and one-sided differences at the ends.
More complex example: observe how the diff curve is exactly the derivative of each segment (its slope), while the gradient is smoother, since each value depends on the points before and after.
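For a more concrete comparison, here is a minimal sketch with made-up sample data (matplotlib assumed for plotting); the diff values are plotted at the segment midpoints since each one belongs to a segment rather than a point:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 15)
y = np.sin(x) + 0.1 * np.random.randn(15)
# diff: one slope per segment, plotted at the segment midpoints
seg_slopes = np.diff(y) / np.diff(x)
x_mid = (x[:-1] + x[1:]) / 2
# gradient: one derivative estimate per original point
grad = np.gradient(y, x)
plt.plot(x_mid, seg_slopes, 'o-', label='diff (per-segment slope)')
plt.plot(x, grad, '.-', label='gradient')
plt.legend()
plt.show()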

How to find the lag between two time series using cross-correlation

Say the two series are:
x = [4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4]
y = [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4]
Series x clearly lags y by 12 time periods.
However, using the following code as suggested in Python cross correlation:
import numpy as np
c = np.correlate(x, y, "full")
lag = np.argmax(c) - c.size/2
leads to an incorrect lag of -0.5.
What's wrong here?
If you want to do it the easy way, simply use scipy.signal.correlation_lags. Also, remember to subtract the mean from the inputs.
import numpy as np
from scipy import signal
x = [4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4]
y = [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4]
correlation = signal.correlate(x-np.mean(x), y - np.mean(y), mode="full")
lags = signal.correlation_lags(len(x), len(y), mode="full")
lag = lags[np.argmax(abs(correlation))]
This gives lag = -12, which is the difference between the index of the first 6 in x and in y; if you swap the inputs it gives +12.
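As a quick sanity check on that sign convention (a small sketch reusing the same x and y), compare the indices where each series first rises above its baseline of 4:
import numpy as np
x = np.array([4,4,4,4,6,8,10,8,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4])
y = np.array([4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,6,8,10,8,6,4,4])
# first index above the baseline in each series: 4 for x, 16 for y
print(np.argmax(x > 4) - np.argmax(y > 4))  # -12, matching the recovered lag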
Edit
Why subtract the mean
If the signals have non-zero mean, the terms at the center of the correlation become larger, because there you have more overlapping samples contributing to the sum. Furthermore, for very large data, subtracting the mean makes the calculations numerically more accurate.
Here is an illustration of what would happen if the mean were not subtracted in this example:
import matplotlib.pyplot as plt
plt.plot(abs(correlation))
plt.plot(abs(signal.correlate(x, y, mode="full")))
plt.plot(abs(signal.correlate(np.ones_like(x)*np.mean(x), np.ones_like(y)*np.mean(y))))
plt.legend(['subtracting mean', 'keeping the mean', 'constant signal'])
Notice that the maximum of the blue curve (mean subtracted, at index 10) does not coincide with the maximum of the orange curve (mean kept).

Find linear combination of vectors that is the best fit for a target vector

I am trying to find weights across a number of forecasts to give a result that is as close as possible (say, mean squared error) to a known target.
Here is a simplified example showing three different types of forecast across four data points:
target = [1.0, 1.02, 1.01, 1.04] # all approx 1.0
forecasts = [
    [0.9, 0.91, 0.92, 0.91],   # all approx 0.9
    [1.1, 1.11, 1.13, 1.11],   # all approx 1.1
    [1.21, 1.23, 1.21, 1.23],  # all approx 1.2
]
where one forecast is always approximately 0.9, one is always approximately 1.1 and one is always approximately 1.2.
I'd like an automated way of finding weights of approximately [0.5, 0.5, 0.0] for the three forecasts because averaging the first two forecasts and ignoring the third is very close to the target. Ideally the weights would be constrained to be non-negative and sum to 1.
I think I need to use some form of linear programming or quadratic programming to do this. I have installed the Python quadprog library, but I'm not sure how to translate this problem into the form that solvers like this require. Can anyone point me in the right direction?
If I understand you correctly, you want to model an optimization problem and solve it. In the general case (without any constraints), your problem is essentially the regular least-squares problem (which you could solve with scikit-learn, for example).
I recommend the cvxpy library for modeling the optimization problem. It is a convenient way to express a convex optimization problem, and you can choose which solver works in the background.
Expanding on the cvxpy least-squares example by adding the constraints you mentioned:
# Import packages.
import cvxpy as cp
import numpy as np
# Generate data.
m = 20
n = 15
np.random.seed(1)
A = np.random.randn(m, n)
b = np.random.randn(m)
# Define and solve the CVXPY problem.
x = cp.Variable(n)
cost = cp.sum_squares(A @ x - b)
prob = cp.Problem(cp.Minimize(cost), [x>=0, cp.sum(x)==1])
prob.solve()
# Print result.
print("\nThe optimal value is", prob.value)
print("The optimal x is")
print(x.value)
print("The norm of the residual is ", cp.norm(A # x - b, p=2).value)
In this example, A (the matrix) holds all your vectors, x (the variable) holds the weights, and b is the known target.
EDIT:
Example with your data:
forecasts = np.array([
    [0.9, 0.91, 0.92, 0.91],
    [1.1, 1.11, 1.13, 1.11],
    [1.21, 1.23, 1.21, 1.23],
])
target = np.array([1.0, 1.02, 1.01, 1.04])
x = cp.Variable(forecasts.shape[0])
cost = cp.sum_squares(forecasts.T @ x - target)
prob = cp.Problem(cp.Minimize(cost), [x >= 0, cp.sum(x) == 1])
prob.solve()
print("\nThe optimal value is", prob.value)
print("The optimal x is")
print(x.value)
Output:
The optimal value is 0.0005306233766233817
The optimal x is
[ 6.52207792e-01 -1.45736370e-24 3.47792208e-01]
The resulting weights are approximately [0.65, 0, 0.35], which is different from the [0.5, 0.5, 0.0] you mentioned, but that depends on how you define the problem; this is the solution that minimizes the squared error.
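As a quick check (a small sketch assuming the variables from the code above are still in scope), you can compare the fitted combination against the target:
# weighted combination of the forecasts using the optimal weights
fitted = forecasts.T @ x.value
print(fitted)           # roughly [1.008, 1.021, 1.021, 1.021]
print(fitted - target)  # small residuals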
We can see this problem as a least-squares problem, which is indeed equivalent to quadratic programming. If I understand correctly, the weight vector you are looking for is a convex combination, so in least-squares form the problem is:
minimize    || [w0 w1 w2] * forecasts - target ||^2
subject to  w0 >= 0, w1 >= 0, w2 >= 0
            w0 + w1 + w2 == 1
There is a least-squares function you can use out of the box in the qpsolvers package:
import numpy as np
from qpsolvers import solve_ls
target = np.array(target)
forecasts = np.array(forecasts)
w = solve_ls(forecasts.T, target, G=-np.eye(3), h=np.zeros(3), A=np.array([1, 1., 1]), b=np.array([1.]))
You can check in the documentation that the matrices G, h, A and b correspond to the problem above. Using quadprog as the backend solver, I get the following solution on my machine:
In [6]: w
Out[6]: array([6.52207792e-01, 9.94041282e-15, 3.47792208e-01])
In [7]: np.dot(w, forecasts)
Out[7]: array([1.00781558, 1.02129351, 1.02085974, 1.02129351])
This is the same solution as in Roim's answer. (CVXPY is indeed a great way to start!)

statsmodels PCA eigenvalues sum

When I apply statsmodels.multivariate.pca.PCA to some data, I find that the sum of the produced eigenvalues does not equal the total variance of the data. I am using the following code:
import numpy as np
import statsmodels.api as sm
corr_matrix = np.array([
    [1, 0.8, 0.4],
    [0.8, 1, 0.6],
    [0.4, 0.6, 1]])
Z = np.random.multivariate_normal([0, 0, 0], corr_matrix, 1000)
pc = sm.PCA(Z, standardize=False, demean=False, normalize=False)
pc.eigenvals.sum()
and the result (in a given random sample) is 2994.51488403581 while I was expecting this to add up to 3.
What am I missing?
Addendum 1
It seems that when PCA is performed on the data X directly (i.e. using the matrix X^T X), the relationship between the sum of variances and the eigenvalues no longer holds; it is only when PCA is performed on the covariance matrix (i.e. on X^T X / n) that the sum of the eigenvalues equals the sum of the variances, i.e. trace(X^T X / n) = sum(eigenvalues). I wish this were stated more clearly in all the posts one finds on PCA.
The eigenvalues are not the total variance of the data. Each eigenvalue is the variance of the data in a specific direction, defined by the corresponding eigenvector. The total variance of the data is the average squared distance of the points from the mean. The principal components are characteristics of the data and show how it is spread out in space along specific directions. You should not confuse the total variance of the data with an eigenvalue (which gives the variance along the direction of its eigenvector).
Quick answer by reverse engineering (I don't remember the details):
pc = sm.PCA(Z, standardize=False, demean=True, normalize=False)
pc.eigenvals.sum() / 1000
2.7550787264061087
Z.var(0).sum()
2.7550787264061087
In the computation of the variance, the data is demeaned. If we don't demean, then we only get an uncentered quadratic product.
pc = sm.PCA(Z, standardize=False, demean=False, normalize=False)
pc.eigenvals.sum(), pc.eigenvals.sum() / Z.shape[0]
(2756.1915877060546, 2.7561915877060548)
(Z**2).mean(0).sum()
2.7561915877060548
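A small sketch tying this back to the addendum above (the seed here is arbitrary): with demeaning, the sum of the eigenvalues divided by n equals the trace of the biased sample covariance matrix, i.e. the total sample variance.
import numpy as np
import statsmodels.api as sm
corr_matrix = np.array([
    [1.0, 0.8, 0.4],
    [0.8, 1.0, 0.6],
    [0.4, 0.6, 1.0]])
np.random.seed(42)
Z = np.random.multivariate_normal([0, 0, 0], corr_matrix, 1000)
pc = sm.PCA(Z, standardize=False, demean=True, normalize=False)
# trace(X^T X / n) = sum(eigenvalues) / n = total (biased) sample variance
print(pc.eigenvals.sum() / Z.shape[0])
print(np.trace(np.cov(Z, rowvar=False, bias=True)))
print(Z.var(0).sum())  # all three values agree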

Why does scipy.norm.pdf sometimes give PDF > 1? How to correct it?

Given mean and variance of a Gaussian (normal) random variable, I would like to compute its probability density function (PDF).
I referred to this post: Calculate probability in normal distribution given mean, std in Python,
Also the scipy docs: scipy.stats.norm
But when I plot a PDF of a curve, the probability exceeds 1! Refer to this minimum working example:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
x = np.linspace(0.3, 1.75, 1000)
plt.plot(x, stats.norm.pdf(x, 1.075, 0.2))
plt.show()
This is what I get:
How is it even possible to have 200% probability to get the mean, 1.075? Am I misinterpreting anything here? Is there any way to correct this?
It's not a bug, and it's not an incorrect result either. A probability density function's value at a specific point does not give you a probability; it is a measure of how dense the distribution is around that value. For a continuous random variable, the probability at any single point is zero. Instead of p(X = x), we calculate the probability between two points, p(x1 < X < x2), which equals the area under the probability density function over that interval. A probability density function's value can very well be above 1; it can even approach infinity.
It's a density function, not a mass function. If the variance is less than 1/(2*pi), the peak of the Gaussian will exceed 1.0. Staying below 1 is only a requirement for mass functions, not density functions.
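A tiny check of that claim, using the sigma of 0.2 from the question:
import numpy as np
from scipy import stats
sigma = 0.2  # sigma**2 = 0.04, well below 1/(2*pi) ~ 0.159
print(stats.norm.pdf(1.075, loc=1.075, scale=sigma))  # ~1.995: the density exceeds 1 at the mean
print(1 / (sigma * np.sqrt(2 * np.pi)))               # same value from the closed-form peak height
# probabilities remain bounded by 1, e.g. P(|X - mean| < sigma):
print(stats.norm.cdf(1.275, 1.075, sigma) - stats.norm.cdf(0.875, 1.075, sigma))  # ~0.683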
Probability density is the rate of change in cumulative probability. So where cumulative probability is increasing rapidly, density can easily exceed 1. But if we calculate the area under the density function, it will never exceed 1. Such areas are also called probability mass.
Using your example:
from statistics import mean, stdev
import numpy as np
x, dx = np.linspace(0.3, 1.75, 1000, retstep=True)
# Gaussian density built from the sample mean and standard deviation of the grid
mean_1, sigma_1 = mean(x), stdev(x)
f = np.exp(-((x - mean_1) / sigma_1)**2 / 2) / sigma_1 / np.sqrt(2 * np.pi)
# Riemann sum approximating the area under the density over [0.3, 1.75]
print(np.sum(f) * dx)
Outputs 0.916581457225367
Credit to Richard McElreath and his book "Statistical Rethinking".

gaussian sum filter for irregular spaced points

I have a set of points (x, y) given as two vectors x and y, for example:
from pylab import *
x = sorted(random(30))
y = random(30)
plot(x,y, 'o-')
Now I would like to smooth this data with a Gaussian and evaluate it only at certain (regularly spaced) points on the x-axis, let's say at:
x_eval = linspace(0,1,11)
I got the tip that this method is called a "Gaussian sum filter", but so far I have not found any implementation in numpy/scipy for that, although it seems like a standard problem at first glance.
As the x values are not equally spaced, I can't use scipy.ndimage.gaussian_filter1d.
Usually this kind of smoothing is done by going through Fourier space and multiplying with the kernel, but I don't really know whether that is possible with irregularly spaced data.
Thanks for any ideas
This will blow up for very large datasets, but the proper calculation you are asking for would be done as follows:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0) # for repeatability
x = np.random.rand(30)
x.sort()
y = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
sigma = 0.1
# pairwise differences between every evaluation point and every data point
delta_x = x_eval[:, None] - x
# Gaussian weights, normalized so the weights for each evaluation point sum to 1
weights = np.exp(-delta_x*delta_x / (2*sigma*sigma)) / (np.sqrt(2*np.pi) * sigma)
weights /= np.sum(weights, axis=1, keepdims=True)
# weighted average of y at each evaluation point
y_eval = np.dot(weights, y)
plt.plot(x, y, 'bo-')
plt.plot(x_eval, y_eval, 'ro-')
plt.show()
I'll preface this answer by saying that this is more of a DSP question than a programming question...
...that being said, there is a simple two-step solution to your problem.
Step 1: Resample the data
So to illustrate this we can create a random data set with unequal sampling:
import numpy as np
x = np.cumsum(np.random.randint(0,100,100))
y = np.random.normal(0,1,size=100)
This gives something like:
We can resample this data using simple linear interpolation:
nx = np.arange(x.max()) # choose new x axis sampling
ny = np.interp(nx,x,y) # generate y values for each x
This converts our data to:
Step 2: Apply filter
At this stage you can use some of the tools available through scipy to apply a Gaussian filter to the data with a given sigma value:
from scipy.ndimage import gaussian_filter1d
fx = gaussian_filter1d(ny, sigma=100)
Plotting this up against the original data we get:
The choice of the sigma value determines the width of the filter.
Based on @Jaime's answer I wrote a function that implements this with some additional documentation and the ability to discard estimates far from the data points.
I think confidence intervals could be obtained on this estimate by bootstrapping, but I haven't done this yet.
import numpy as np

def gaussian_sum_smooth(xdata, ydata, xeval, sigma, null_thresh=0.6):
    """Apply gaussian sum filter to data.

    xdata, ydata : array
        Arrays of x- and y-coordinates of data.
        Must be 1d and have the same length.
    xeval : array
        Array of x-coordinates at which to evaluate the smoothed result.
    sigma : float
        Standard deviation of the Gaussian to apply to each data point.
        Larger values yield a smoother curve.
    null_thresh : float
        For evaluation points far from data points, the estimate will be
        based on very little data. If the total weight is below this threshold,
        return np.nan at this location. Zero means always return an estimate.
        The default of 0.6 corresponds to approximately one sigma away
        from the nearest datapoint.
    """
    # Distance between every combination of xdata and xeval
    # each row corresponds to a value in xeval
    # each col corresponds to a value in xdata
    delta_x = xeval[:, None] - xdata

    # Calculate weight of every value in delta_x using Gaussian
    # Maximum weight is 1.0 where delta_x is 0
    weights = np.exp(-0.5 * ((delta_x / sigma) ** 2))

    # Multiply each weight by every data point, and sum over data points
    smoothed = np.dot(weights, ydata)

    # Nullify the result when the total weight is below threshold
    # This happens at evaluation points far from any data
    # 1-sigma away from a data point has a weight of ~0.6
    nan_mask = weights.sum(1) < null_thresh
    smoothed[nan_mask] = np.nan

    # Normalize by dividing by the total weight at each evaluation point
    # Nullification above avoids divide by zero warning here
    smoothed = smoothed / weights.sum(1)

    return smoothed
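A quick usage sketch (reusing the random data from the first answer above):
import numpy as np
np.random.seed(0)  # for repeatability
xdata = np.sort(np.random.rand(30))
ydata = np.random.rand(30)
x_eval = np.linspace(0, 1, 11)
y_smooth = gaussian_sum_smooth(xdata, ydata, x_eval, sigma=0.1)
print(y_smooth)  # evaluation points with total weight below null_thresh come back as np.nan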
