So say I'm trying to create a 100-sample dataset that follows a certain line, maybe 2x+2. And I want the values on my X-axis to range from 0-1000. To do this, I use the following.
X = np.random.random((100, 1)) * 1000
Y = (2*X) + 2
data = np.hstack((X, Y))
The hstack gives me the array with corresponding x and y values. That part works. But if I want to inject noise into it in order to scatter the datapoints further away from that 2x+2 line...that's what I can't figure out.
Say for example, I want that Y array to have a standard deviation of 20. How would I inject that noise into the y values?
Maybe I'm missing something, but have you tried adding numpy.random.normal(scale=20,size=100) to Y? You can even write
Y=numpy.random.normal(2*X+2,20)
and do it all at once (and without repeating the array size).
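For instance, a minimal sketch putting both pieces together (reusing the X from your setup; the noise array simply gets added to the noiseless 2x+2 values):
import numpy as np

X = np.random.random((100, 1)) * 1000              # 100 x values in [0, 1000)
noise = np.random.normal(scale=20, size=X.shape)   # zero-mean noise, std dev 20
Y = 2*X + 2 + noise                                # y values scattered around 2x+2
data = np.hstack((X, Y))                           # 100x2 array of (x, y) pairs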
To simulate noise use a normally distributed random number generator like np.random.randn.
Is this what you are trying to do:
X = np.linspace(0, 1000, 100)
Y = (2*X) + 2 + 20*np.random.randn(100)
data = np.hstack((X.reshape(100,1),Y.reshape(100,1)))
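Multiplying the standard normal samples from np.random.randn by 20 gives the added noise a standard deviation of 20.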
I am trying to vectorize my code and have reached a roadblock. I have :
an n x d array of x values [[x1],[...],[xn]] (where each row [x1] has many points [x11, ..., x1d])
an n x d array of y values [[y1],[...],[yn]] (where each row [y1] has many points [y11, ..., y1d])
nx1 array of x' values [[x'1],[...],[x'n]] that I would like to interpolate a y value for based on the corresponding row of x and y
The only thing I can think to use is a list comprehension like [np.interp(x'[i,:], x[i,:], y[i,:]) for i in range(n)]. I'd like a faster vectorized option if one exists. Thanks for the help!
This is hardly an answer, but I guess it may still be useful for someone (if not, feel free to delete this). By the way, I think I misunderstood your question at first: what you have is a collection of n different one-dimensional datasets or functions y(x) that you want to interpolate (correct me otherwise).
As such, it turns out doing this by multidimensional interpolation is a terrible approach.
The idea I had is to add a new dimension to the data, so your datasets are mapped into one single dataset in which this new dimension is what distinguishes between the different xi, where i = 1, 2, ..., n. In other words, you assign a value in this new dimension, say z, to every row of x; this way, the different functions are correctly mapped to this higher-dimensional space.
However, this approach is slower than the np.interp list comprehension, by at least one order of magnitude on my computer. I guess it has to do with two-dimensional interpolation algorithms being at best O(n log n) (this is a guess); in this sense, it seems more efficient to perform multiple small interpolations on different datasets rather than one big interpolation.
Anyways, the approach is shown in the following snippet:
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def vectorized_interpolation(x, y, xq):
    """
    Vectorized option using LinearNDInterpolator
    """
    # Dummy new data points in added dimension
    z = np.arange(x.shape[0])
    # We must repeat every z value for every row of x
    interpolant = LinearNDInterpolator(list(zip(x.ravel(), np.repeat(z, x.shape[1]))), y.ravel())
    return interpolant(xq, z)

def non_vectorized_interpolation(x, y, xq):
    """
    Your non-vectorized solution
    """
    return np.array([np.interp(xq[i], x[i], y[i]) for i in range(x.shape[0])])

if __name__ == "__main__":
    n, d = 100, 500
    x = np.linspace(0, 2*np.pi, n*d).reshape((n, d))
    y = np.sin(x)
    xq = np.linspace(0, 2*np.pi, n)
    yq1 = vectorized_interpolation(x, y, xq)
    yq2 = non_vectorized_interpolation(x, y, xq)
The only advantage of the vectorized solution is that LinearNDInterpolator (and some of the other scipy.interpolate functions) explicitly calculates the interpolant, so you can reuse it if you plan on interpolating the same datasets several times, avoiding repeated calculations. Another thing you could try is multiprocessing if you have several cores on your machine, but that is not vectorization, which is what you asked for. Sorry I can't be of more help.
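For example, a rough sketch of that reuse, building on the arrays from the snippet above (the construction of the interpolant is the expensive step; the calls afterwards are comparatively cheap):
import numpy as np
from scipy.interpolate import LinearNDInterpolator

n, d = 100, 500
x = np.linspace(0, 2*np.pi, n*d).reshape((n, d))
y = np.sin(x)
z = np.arange(n)                 # the added dimension labelling each row

# Build the interpolant once (the expensive triangulation step)...
interpolant = LinearNDInterpolator(list(zip(x.ravel(), np.repeat(z, d))), y.ravel())

# ...then query it as many times as needed at little extra cost
yq_a = interpolant(np.linspace(0, 2*np.pi, n), z)
yq_b = interpolant(np.linspace(0.5, 5.0, n), z)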
So I'm trying to get the second derivative of the following formula using numpy.gradient, differentiating it once by S[:,0] and then by S[:,1].
S = np.random.multivariate_normal(mean, covariance, N)
formula = (S[:,0]**2) * (S[:,1]**2)
But the thing is when I use spacings as the second argument of numpy.gradient
dx = np.diff(S[:,0])
dy = np.diff(S[:,1])
dfdx = np.gradient(formula,dx)
I get the error saying
ValueError: when 1d, distances must match the length of the corresponding dimension
And I get that this is because the spacings vector is one element shorter than the formula array, but I don't know how to fix that.
I've also read somewhere that you can pass the coordinates of the points rather than the spacings as the second argument. But when I tried to check the result of that, by differentiating the formula first by S[:,0] and then by S[:,1], and comparing it with differentiating first by S[:,1] and then by S[:,0] (the two results should be similar), there was a huge difference between them.
Can anybody explain to me what I'm doing wrong here?
When passing the coordinates of the values of your function to NumPy's gradient, you have to be careful to either pass them as a list with as many arrays as your function has dimensions, or to specify along which axis (as an argument of gradient) you want to calculate the gradient.
When you checked both ways of differentiation, I think the problem is that your formula isn't actually two-dimensional, but one-dimensional (even though you use data from two variables, note that your formula array has only one dimension).
Take a look at this little script in which we verify that, indeed, the order of differentiation doesn't alter the result (assuming your function is well-behaved).
import numpy as np
# Dummy arrays and function
x = np.linspace(0,1,50)
y = np.linspace(0,2,50)
f = np.sin(2*np.pi*x[:,None]) * np.cos(2*np.pi*y)
dfdx = np.gradient(f, x, axis = 0)
df2dy = np.gradient(dfdx, y, axis = 1)
dfdy = np.gradient(f, y, axis = 1)
df2dx = np.gradient(dfdy, x, axis = 0)
# Check how many values are essentially different
print(np.sum(~np.isclose(df2dx, df2dy)))
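For the other option mentioned above (passing one coordinate array per dimension instead of using the axis argument), a short hedged check could look like this:
# Passing one coordinate array per dimension returns both first partials at once
dfdx_b, dfdy_b = np.gradient(f, x, y)
print(np.allclose(dfdx, dfdx_b), np.allclose(dfdy, dfdy_b))  # should print True True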
Does this apply to your problem?
Say I have a data set of 100 samples, where each sample is a 4x3 matrix. My question is how I should calculate the standard deviation of this data set. I tried the following code, but I don't know if the result is correct. If it is correct, I want to know how it works. I know the standard deviation equation for 1d data, but I don't know how std is defined for a collection of m x n matrices; the docstring of np.std only explains the 1d case.
import numpy as np

datalist = []
for _ in range(100):
    data = np.random.random((4, 3))
    datalist.append(data)

std = np.std(np.asarray(datalist))
print(std)
It seems you have some unnecessary steps there. To begin with, you can get 100 matrices of 4x3 like this:
x = np.random.rand(100, 4, 3)
Then just call np.std on it:
np.std(x)
0.2827262559096299
That's if you want the standard deviation of all values. If you want it per matrix cell, specify the axis argument:
np.std(x, axis=0)
array([[0.27863211, 0.2670126 , 0.28752064],
[0.28540484, 0.25365294, 0.28905531],
[0.28848584, 0.27695767, 0.26886147],
[0.27138472, 0.3135065 , 0.29361115]])
axis=0 means that it's going to collapse axis 0 (the one with size 100), which returns a 4x3 matrix.
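If it helps to see what np.std is actually doing with your 4x3 matrices, here is a small check (the exact numbers will differ per run):
import numpy as np

x = np.random.rand(100, 4, 3)

# With no axis argument, np.std flattens the array, so it is the usual 1d
# formula applied to all 100*4*3 values at once:
flat = x.ravel()
manual = np.sqrt(np.mean((flat - flat.mean())**2))
print(np.isclose(np.std(x), manual))    # should print True

# With axis=0 the same formula is applied independently to each of the
# 4x3 cells across the 100 samples:
print(np.std(x, axis=0).shape)          # (4, 3)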
Is it possible to do the following without a loop (to improve the speed)? I have looked at sklearn, statsmodels, and pandas; unfortunately I don't think they have any direct solution.
I have:
x = np.array(range(1000)) # ie a standard discrete time series
y = np.append(np.zeros(600),np.random.random(400)) #it has a lot of zeros
y = np.random.permutation(y) #the number of zeros in b/w the non zero is random
z = np.empty(1000) # z will contain predicted values from the reg analysis
rolling_window=20
I wish to obtain z, where z(i) = a(i) + b(i)*x(i) for i within range(1000),
and a(i) and b(i) are obtained by regressing the Ys against the Xs for the indices between (i - rolling_window, i), using only the Ys that are non-zero (hence the need to assign weight = 0 to the Ys that are zero in the regression; preferably using a weighting method rather than dropping the zeros altogether, because I don't wish to loop).
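For reference, this is roughly the loop I am trying to avoid (using np.polyfit with zero weights for the zero Ys is just one assumed way to express it):
import numpy as np

x = np.arange(1000, dtype=float)
y = np.random.permutation(np.append(np.zeros(600), np.random.random(400)))
z = np.full(1000, np.nan)
rolling_window = 20

for i in range(rolling_window, 1000):
    xs = x[i - rolling_window:i]
    ys = y[i - rolling_window:i]
    w = (ys != 0).astype(float)            # weight 0 drops the zero Ys
    if w.sum() >= 2:                       # need at least two non-zero points for a line
        b, a = np.polyfit(xs, ys, 1, w=w)  # slope b(i), intercept a(i)
        z[i] = a + b * x[i]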
many thanks in advance
I am trying to make a 2D 5850x5850 array from two 1D arrays by putting them into this equation for a 2D Gaussian:
psf = 1/(2*np.pi*sigma_x*sigma_y) * np.exp(-(x**2/(2*sigma_x**2) + y**2/(2*sigma_y**2)))
However it gives back a 1D array. What am I doing wrong?
If I understand your question correctly:
All you need to do is alter the shape of your arrays.
E.g.
x.shape=(5850,1) # now it is column array
y.shape=(1,5850) # now it is row array
Then you can proceed as in your original post. The result will be a 5850 by 5850 array. Each row will correspond to a different x and each column to a different y.
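Put together, a hedged sketch of the reshape-plus-broadcast idea (the grid and the sigma values below are just placeholders):
import numpy as np

size = 5850                          # or something smaller while testing
sigma_x, sigma_y = 100.0, 200.0      # placeholder widths, substitute your own
x = np.linspace(-size/2, size/2, size)
y = np.linspace(-size/2, size/2, size)
x.shape = (size, 1)                  # column array
y.shape = (1, size)                  # row array

# Broadcasting the column against the row gives a (size, size) result
psf = 1/(2*np.pi*sigma_x*sigma_y) * np.exp(-(x**2/(2*sigma_x**2) + y**2/(2*sigma_y**2)))
print(psf.shape)                     # (5850, 5850)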
However, I would change a few things in your code to make it look like this:
psf = 1/(2*np.pi*sigma_x*sigma_y) * np.exp(-(x*x/(2*sigma_x*sigma_x) + y*y/(2*sigma_y*sigma_y)))
Squaring values is usually inefficient (unless your compiler translates it to multiplication, but in Python there is no compiler to rely on). Squaring is much slower than multiplication: when you raise a value to a power, your computer needs to be ready for it to be negative or not an integer, and there is no such overhead when you simply multiply values.
Try:
for i in range(1000000):
    z = i**2

for i in range(1000000):
    z = i*i
The former ran in 0.975s on my machine whereas the latter took only 0.267s.
NumPy doesn't know that x and y are meant to be combined so that the expression is evaluated for every x paired with every y. If you can't find a library to create 2d functions/Gaussians more conveniently, try:
z = np.empty((len(x), len(y)))
for idx, yval in enumerate(y):
    z[:, idx] = f(x, yval)
Where f(x, yval) is your 2d function, but wherever you have y, use yval. There has got to be more support for 2d function creation somewhere; maybe try searching for scipy 2d Gaussian functions?
The proper expression to make a 2d Gaussian would be:
x = np.arange(0, size, 1, float)     # size is the side length of the square array
y = x[:, np.newaxis]
x0 = y0 = 0                          # your center
np.exp(-4*np.log(2) * ((x-x0)**2 + (y-y0)**2) / radius**2)   # radius acts as the FWHM here