I am using NumPy to calculate the Y intercept of particle paths passing through an aperture between a big box and a small box. I have over 100,000 particles in the big box and around 1,000 in the small one, and the calculation takes a lot of time. self.YD, self.XD and the other attributes are very large arrays that I'm multiplying.
PS: ind holds the indices of the values that need to be multiplied. I had a nonzero condition before that line in my code.
Any ideas how I could do this calculation in a simpler way?
YD_zero = self.oldYD[ind] - ((self.oldYD[ind]-self.YD[ind]) * self.oldXD[ind])/(self.oldXD[ind]-self.XD[ind])
Thanks!
UPDATE
Would using NumPy's multiply, divide, subtract and so on make it faster?
Or would it help to split the calculation? For example,
doing this first:
YD_zero = self.oldYD[ind] - ((self.oldYD[ind]-self.YD[ind])*self.oldXD[ind])
and then the next line would be:
YD_zero /= (self.oldXD[ind]-self.XD[ind])
Any suggestions?!
UPDATE 2
I have been trying to figure this out for a while now, without much progress. My concern is that the denominator can be zero:
self.oldXL[ind]-self.XL[ind] == 0
and I am getting some weird results.
The other thing is the nonzero function. I have been testing it for a while now. Could anybody tell me whether it is almost the same as find in MATLAB?
Perhaps I have got the wrong end of the stick but in Numpy you can perform vectorised calculations. Remove the enclosing while loop and just run this ...
YD_zero = self.oldYD - ((self.oldYD - self.YD) * self.oldXD) / (self.oldXD - self.XD)
It should be much faster.
Update: Iterative root finding using the Newton-Raphson method ...
unconverged_mask = np.abs(f(y_vals)) > CONVERGENCE_VALUE
while np.any(unconverged_mask):
    # Newton-Raphson step, applied only to the values that have not converged yet
    y_vals[unconverged_mask] = y_vals[unconverged_mask] - f(y_vals[unconverged_mask]) / f_prime(y_vals[unconverged_mask])
    unconverged_mask = np.abs(f(y_vals)) > CONVERGENCE_VALUE
This code is only illustrative, but it shows how you can apply an iterative process using vectorised code to any function f whose derivative f_prime you can find. The unconverged_mask means that the results of the current iteration are only applied to those values that have not yet converged.
Note that in this case there is no need to iterate: Newton-Raphson will give you the correct answer in the first iteration, since we are dealing with straight lines. What you have is an exact solution.
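To make the masked iteration above concrete, here is a minimal runnable sketch. The test problem (solving y**2 - a = 0 for several values of a at once), the starting guess and the MAX_ITERATIONS safety cap are assumptions made for illustration; they are not part of the original problem.

```python
import numpy as np

CONVERGENCE_VALUE = 1e-12
MAX_ITERATIONS = 100  # safety cap, added for this sketch only

# Hypothetical test problem: solve y**2 - a = 0 for every entry of a
a = np.array([2.0, 9.0, 25.0])

def f(y, a):
    return y * y - a

def f_prime(y):
    return 2.0 * y

y_vals = np.ones_like(a)  # initial guess for every root
unconverged_mask = np.abs(f(y_vals, a)) > CONVERGENCE_VALUE
iterations = 0
while np.any(unconverged_mask) and iterations < MAX_ITERATIONS:
    # Newton-Raphson step on the unconverged entries only
    y = y_vals[unconverged_mask]
    y_vals[unconverged_mask] = y - f(y, a[unconverged_mask]) / f_prime(y)
    unconverged_mask = np.abs(f(y_vals, a)) > CONVERGENCE_VALUE
    iterations += 1

print(y_vals)
```

Entries that converge early drop out of the mask, so later iterations only touch the slow ones.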
Second update
Ok, so you aren't using Newton-Raphson. To calculate YD_zero (the y intercept) in one go, you can use,
YD_zero = YD + (X0 - XD) * dYdX
where dYdX is the gradient, which seems to be, in your case,
dYdX = (YD - oldYD) / (XD - oldXD)
I am assuming XD and YD are the current x,y values of the particle, oldXD and oldYD are the previous x,y values of the particle and X0 is the x value of the aperture.
It is still not entirely clear why you have to iterate over all the particles; NumPy can do the calculation for all particles at once.
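On the worry about a zero denominator: one possible way to handle it, sketched below, is to let np.divide skip those entries and leave NaN in their place. The function name y_intercept, its argument names and the choice of NaN as the marker are assumptions for this sketch, not something taken from the original code.

```python
import numpy as np

def y_intercept(old_x, old_y, x, y):
    """Y value where the segment (old_x, old_y) -> (x, y) crosses x = 0.

    Entries with old_x == x (no movement in x) are returned as NaN.
    """
    dx = old_x - x
    numerator = (old_y - y) * old_x
    # Divide only where dx is non-zero; the other entries keep the NaN fill
    ratio = np.divide(numerator, dx, out=np.full_like(dx, np.nan), where=dx != 0)
    return old_y - ratio

old_x = np.array([-1.0, -2.0, 1.0])
old_y = np.array([0.0, 1.0, 5.0])
x = np.array([1.0, 2.0, 1.0])
y = np.array([2.0, 3.0, 6.0])

result = y_intercept(old_x, old_y, x, y)
print(result)
```

The third particle has not moved in x, so it has no well-defined intercept and is flagged rather than producing a division warning.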
Since all the computations are done element-wise, it should be easy to rewrite the expression in Cython. This avoids all the very large temporary arrays that get created when you do oldYD-YD and the like.
Another possibility is numexpr.
I would definitely go for numexpr. I'm not sure numexpr can handle indices, but I bet that the following (or something similar) would work:
import numexpr as ne
yold = self.oldYD[ind]
y = self.YD[ind]
xold = self.oldXD[ind]
x = self.XD[ind]
YD_zero = ne.evaluate("yold - ((yold - y) * xold)/(xold - x)")
I have the following problem. I have a function f defined in python using numpy functions. The function is smooth and integrable on positive reals. I want to construct the double antiderivative of the function (assuming that both the value and the slope of the antiderivative at 0 are 0) so that I can evaluate it on any positive real smaller than 100.
Definition of antiderivative of f at x:
integrate f(s) with s from 0 to x
Definition of double antiderivative of f at x:
integrate (integrate f(t) with t from 0 to s) with s from 0 to x
The actual form of f is not important, so I will use a simple one for convenience. But please note that even though my example has a known closed form, my actual function does not.
import numpy as np
f = lambda x: np.exp(-x)*x
My solution is to construct the antiderivative as an array using naive numerical integration:
N = 10000
delta = 100/N
xs = np.linspace(0,100,N+1)
vs = f(xs)
avs = np.cumsum(vs)*delta
aavs = np.cumsum(avs)*delta
This of course works but it gives me arrays instead of functions. But this is not a big problem as I can interpolate aavs using a spline to get a function and get rid of the arrays.
from scipy.interpolate import UnivariateSpline
aaf = UnivariateSpline(xs, aavs)
The function aaf is approximately the double antiderivative of f.
The problem is that even though it works, there is quite a bit of overhead before I can get my function and precision is expensive.
My other idea was to interpolate f by a spline and take the antiderivative of that; however, this introduces numerical errors that are too big for what I want to use the function for.
Is there any better way to do that? By better I mean faster without sacrificing accuracy.
Edit: What I hope is possible is to use some kind of Fourier transform to avoid integrating twice. I hope that there is some convenient transform of vs that allows to multiply the values component-wise with xs and transform back to get the double antiderivative. I played with this a bit, but I got lost.
Edit: I figured out that using the trapezoidal rule instead of a naive sum increases the accuracy quite a bit. Using Simpson's rule should increase the accuracy further, but it's somewhat fiddly to do with NumPy arrays.
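As a rough sketch of the trapezoidal variant, the helper below (cumulative_trapezoid is a name made up for this example) builds the running trapezoidal integral from a cumulative sum. The closed form used for comparison holds only for the example f(x) = x*exp(-x), whose double antiderivative with zero value and slope at 0 is x - 2 + (2 + x)*exp(-x).

```python
import numpy as np

def cumulative_trapezoid(values, delta):
    """Running trapezoidal integral on a uniform grid, starting from 0."""
    steps = 0.5 * (values[1:] + values[:-1]) * delta
    return np.concatenate(([0.0], np.cumsum(steps)))

f = lambda x: np.exp(-x) * x
N = 10000
xs = np.linspace(0, 100, N + 1)
delta = xs[1] - xs[0]
vs = f(xs)

# Trapezoidal rule applied twice
aavs_trap = cumulative_trapezoid(cumulative_trapezoid(vs, delta), delta)
# Naive double cumulative sum, for comparison
aavs_naive = np.cumsum(np.cumsum(vs) * delta) * delta

# Closed form, valid for this example f only
exact = xs - 2 + (2 + xs) * np.exp(-xs)
err_trap = np.max(np.abs(aavs_trap - exact))
err_naive = np.max(np.abs(aavs_naive - exact))
print(err_trap, err_naive)
```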
Edit: As @user202729 rightfully complains, this seems off. The reason it seems off is because I have skipped some details. I explain here why what I say makes sense, but it does not affect my question.
My actual goal is not to find the double antiderivative of f, but to find a transformation of this. I have skipped that because I think it only confuses the matter.
The function f decays exponentially as x approaches 0 or infinity. I am minimizing the numerical error in the integration by starting the sum from 0 and going up to approximately the peak of f. This ensures that the relative error is approximately constant. Then I start from the opposite direction, from some very big x, and go back to the peak. Then I do the same for the antiderivative values.
Then I transform the aavs by another function which is sensitive to numerical errors. Then I find the region where the errors are big (the values oscillate violently) and drop these values. Finally I approximate what I believe are good values by a spline.
Now if I use spline to approximate f, it introduces an absolute error which is the dominant term in a rather large interval. This gets "integrated" twice and it ends up being a rather large relative error in aavs. Then once I transform aavs, I find that the 'good region' has shrunk considerably.
EDIT: The actual form of f is something I'm still looking into. However, it is going to be a generalisation of the lognormal distribution. Right now I am playing with the following family.
I start by defining a generalization of the normal distribution:
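from scipy import special

def pdf_n(params, center=0.0, slope=8):
    scale, min, diff = params
    # l and r are the exponents of the left and right tails
    if diff > 0:
        r = min
        l = min + diff
    else:
        r = min - diff
        l = min
    def retfun(m):
        x = (m - center)/scale
        # smooth transition from exponent l (x << 0) to exponent r (x >> 0)
        E = special.expit(slope*x)*(r - l) + l
        return np.exp( -np.power(1 + x*x, E)/2 )
    return np.vectorize(retfun)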
It may not be obvious what is happening here, but the result is quite simple. The function decays as exp(-x^(2l)) on the left and as exp(-x^(2r)) on the right. For min=1 and diff=0, this is the normal distribution. Note that this is not normalized. Then I define
g = pdf_n(params)
f = np.vectorize(lambda x:g(np.log(x))/x/area)
where area is the normalization constant.
Note that this is not the actual code I use. I stripped it down to the bare minimum.
You can compute the two np.cumsum (and the divisions) at once more efficiently using Numba. This is significantly faster since there is no need for several temporary arrays to be allocated, filled, read again and freed. Here is a naive implementation:
import numba as nb
import numpy as np

@nb.njit('float64[::1](float64[::1], float64)') # Assume vs is contiguous
def doubleAntiderivative_naive(vs, delta):
    res = np.empty(vs.size, dtype=np.float64)
    sum1, sum2 = 0.0, 0.0
    for i in range(vs.size):
        sum1 += vs[i] * delta
        sum2 += sum1 * delta
        res[i] = sum2
    return res
However, this sum is not very good in terms of numerical stability. A Kahan summation is needed to improve the accuracy (or possibly the alternative Kahan–Babuška–Klein algorithm if you are paranoid about accuracy and performance does not matter so much). Note that NumPy uses a pairwise algorithm, which is quite good but far from perfect in terms of accuracy (it is a good compromise between performance and accuracy).
Moreover, delta can be factored out of the summation (i.e. the result just needs to be multiplied by delta**2 at the end).
Here is an implementation using the more accurate Kahan summation:
@nb.njit('float64[::1](float64[::1], float64)')
def doubleAntiderivative_accurate(vs, delta):
    res = np.empty(vs.size, dtype=np.float64)
    delta2 = delta * delta
    sum1, sum2 = 0.0, 0.0
    c1, c2 = 0.0, 0.0
    for i in range(vs.size):
        # Kahan summation of the antiderivative of vs
        y1 = vs[i] - c1
        t1 = sum1 + y1
        c1 = (t1 - sum1) - y1
        sum1 = t1
        # Kahan summation of the double antiderivative of vs
        y2 = sum1 - c2
        t2 = sum2 + y2
        c2 = (t2 - sum2) - y2
        sum2 = t2
        res[i] = sum2 * delta2
    return res
Here is the performance of the approaches on my machine (with an i5-9600KF processor):
Numpy cumsum: 51.3 us
Naive Numba: 11.6 us
Accurate Numba: 37.2 us
Here is the relative error of the approaches (based on the provided input function):
Numpy cumsum: 1e-13
Naive Numba: 5e-14
Accurate Numba: 2e-16
Perfect precision: 1e-16 (assuming 64-bit numbers are used)
If f can be easily computed using Numba (this is the case here), then vs[i] can be replaced by calls to f (inlined by Numba). This helps to reduce the memory consumption of the computation (N can be huge without saturating your RAM).
As for the interpolation, splines often give good numerical results, but they are quite expensive to compute and, AFAIK, they require the whole array to be computed (each item of the array impacts the whole spline, although some items may have a negligible impact on their own). Depending on your needs, you could consider using Lagrange polynomials. You should be careful when using Lagrange polynomials near the edges. In your case, you can easily solve the numerical divergence issue at the edges by extending the array with the border values (since you know the derivative at each edge of vs is 0). With this method you can apply the interpolation on the fly, which can be good for both performance (typically if the computation is parallelized) and memory usage.
First, I created a version of the code I found more intuitive. Here I multiply cumulative sum values by bin widths. I believe there is a small error in the original version of the code related to the bin width issue.
import numpy as np
f = lambda x: np.exp(-x)*x
N = 1000
xs = np.linspace(0,100,N+1)
domainwidth = ( np.max(xs) - np.min(xs) )
binwidth = domainwidth / N
vs = f(xs)
avs = np.cumsum(vs)*binwidth
aavs = np.cumsum(avs)*binwidth
Next, for visualization here is some very simple plotting code:
import matplotlib
import matplotlib.pyplot as plt
plt.figure()
plt.scatter( xs, vs )
plt.figure()
plt.scatter( xs, avs )
plt.figure()
plt.scatter( xs, aavs )
plt.show()
The first integral matches the known result of the example expression and can be seen on wolfram
Below is a simple function that extracts an element from the double antiderivative. Note that int is a bad rounding function (it truncates). I assume this is what you have implemented already.
def extract_double_antideriv_value(x):
    return aavs[int(x/binwidth)]
singleresult = extract_double_antideriv_value(50.24)
print('singleresult', singleresult)
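If truncating with int is too crude, one alternative is linear interpolation between neighbouring grid points with np.interp. The sketch below rebuilds the same arrays so that it runs on its own; the two lookup function names are made up for this comparison.

```python
import numpy as np

f = lambda x: np.exp(-x) * x
N = 1000
xs = np.linspace(0, 100, N + 1)
binwidth = (xs[-1] - xs[0]) / N
aavs = np.cumsum(np.cumsum(f(xs)) * binwidth) * binwidth

def lookup_truncating(x):
    # Same as the function above: always snaps down to the grid point on the left
    return aavs[int(x / binwidth)]

def lookup_interpolating(x):
    # Linear interpolation between the two neighbouring grid points
    return np.interp(x, xs, aavs)

x_query = 50.24
left, right = aavs[502], aavs[503]  # grid values at x = 50.2 and x = 50.3
value = lookup_interpolating(x_query)
print(lookup_truncating(x_query), value)
```

np.interp also accepts an array of query points, so many lookups can be done in one call.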
Whatever full computation steps are required, we need to know them before we can start optimizing. Do you have a million different functions to integrate? If you only need to query a single double anti-derivative many times, your original solution should be fairly ideal.
Symbolic Approximation:
Have you considered approximations to the original function f that have closed-form integrals? You have a limited domain on which the function lives. Perhaps approximate f with a Taylor series (which can be constructed with a known maximum error) and then integrate exactly? (Consider Padé, Taylor, Fourier, Chebyshev, Lagrange (as suggested by another answer), etc.)
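As one possible instance of this idea, the sketch below fits a Chebyshev series to f on [0, 100] with NumPy's numpy.polynomial.Chebyshev class and then integrates the series exactly, twice. The degree of 128 is an arbitrary choice for the example f(x) = x*exp(-x), and the closed form used as a check applies to that example only; a different f may need a different degree or a different basis.

```python
import numpy as np
from numpy.polynomial import Chebyshev

f = lambda x: np.exp(-x) * x

# Fit a Chebyshev series to f on the domain of interest
cheb_f = Chebyshev.interpolate(f, deg=128, domain=[0, 100])

# Integrate the series twice; lbnd=0 makes each integral vanish at x = 0,
# so both the value and the slope of the result are 0 there
cheb_aaf = cheb_f.integ(m=2, lbnd=0)

# Check against the closed form of this particular example
x_test = np.linspace(0, 100, 2001)
exact = x_test - 2 + (2 + x_test) * np.exp(-x_test)
max_err = np.max(np.abs(cheb_aaf(x_test) - exact))
print(max_err)
```

The result cheb_aaf is a callable, so no separate spline fit is needed to turn arrays back into a function.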
Log Tricks:
Another alternative to dealing with spiky errors, would be to take the log of your original function. Is f always positive? Is the integration error caused because the neighborhood around the max is very small? If so, you can study ln(f) or even ln(ln(f)) instead. It would really help to understand what f looks like more.
Approximation Integration Tricks
There are countless integration tricks in general that yield approximate closed-form solutions to otherwise intractable integrals. A very common one when exponential functions are involved (I think yours is exponential?) is Laplace's method. But which trick to pull out of the bag is highly dependent upon the conditions that f satisfies.
Is it possible to vectorize (or otherwise speedup) an element-wise optimization with NumPy (and SciPy)?
In the most abstract sense, I have a function, y, which is parabolically shaped and could be expressed basically as y=x^2+b*x+z, where x is an array of known values, and I want to find a z that makes the minimum value of y exactly zero (said another way, I want to find a value z that makes my parabola only have one zero). For this, I've chosen to implement a simple bisection-like method. The code for this is below:
import numpy as np
def find_single_root():
    x = np.arange(-5, 6,0.1) # domain
    z = 1 # initial guess
    delta = 1 # initial step size
    tol = 0.001 # tolerance
    while True:
        y = x**2-5*x+z
        minimum = np.nanmin(y)
        # update z
        print(delta)
        print(z)
        if minimum > 0:
            if delta > 0:
                delta = -1*delta/2
            z += delta
        else:
            if delta < 0:
                delta = -1*delta/2
            z += delta
        # check if step is smaller than tolerance
        if np.abs(delta) < tol:
            return z
Now let's say x depends on two parameters, x(v, w), and I want to create a 2D array of z values, where each is optimized. What I have right now is below (note that the new function definition and domain are as follows):
def find_single_root(v, w):
    x = np.arange(-5*v/w, 6*w,0.1) # domain
    ... # rest of the function

vs = np.arange(1,5)
ws = np.arange(1,5)
zs = np.zeros((len(vs),len(ws)))
for i, v in enumerate(vs):
    for j, w in enumerate(ws):
        zs[i][j] = find_single_root(v,w)
Right now I just have these simple nested for loops, but is there a way I can approach this differently or speed it up with NumPy vectorizing?
Vectorization may be applicable when the computations to be performed are precisely known in advance. Like "take two arrays of numbers, and multiply them pairwise".
Vectorization is not applicable when the computations adapt to the given data. Any kind of optimization algorithm is adaptive, because where you look for the minimum depends on what the function returns. If you have a bunch of functions, and need to find the minimum of each, you are going to have to minimize them one at a time, in a loop. If this process is slow, it's because it takes long to minimize a bunch of functions, not because there is a for loop in the program.
Concerning your program, I would try using some of SciPy methods for both minimization and root-finding. Have a function min_of_f(z) which finds the minimum for a given value of parameter z, possibly using minimize_scalar. Then feed min_of_f to a root-finding routine. How long these will take can be controlled by their tolerance parameters (xtol and others).
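A minimal sketch of that structure, assuming the parabola from the question (b = -5) and its original domain of roughly [-5, 6]: min_of_f uses minimize_scalar to find the minimum of y for a given z, and brentq then looks for the z at which that minimum is zero. The bracket [0, 10] for z and the tolerance are guesses that happen to suit this example.

```python
from scipy.optimize import brentq, minimize_scalar

def min_of_f(z, b=-5.0):
    """Minimum over x of y = x**2 + b*x + z, found numerically."""
    result = minimize_scalar(
        lambda x: x**2 + b * x + z,
        bounds=(-5, 6),
        method="bounded",
    )
    return result.fun

# min_of_f is negative at z = 0 and positive at z = 10, so a root lies in between
z_root = brentq(min_of_f, 0.0, 10.0, xtol=1e-6)
print(z_root)
```

For this parabola the answer can be checked by hand: the minimum is z - b**2/4, so the root is at z = 6.25.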
OP edit:
I wanted to give credit for this as a correct answer, but still provide more information.
I ended up using numpy.vectorize to vectorize without restructuring the problem. Although numpy.vectorize is not meant for increasing performance, my specific use case ran a modest factor of two faster. Applying the same approach to the original problem in the question resulted in virtually no speed-up with 100x100 vectors, so YMMV.
Even though I wasn't able to vectorize this problem from a speed aspect for the reasons given in the above answer, being able to use plain vector syntax instead of nested for loops all over my code was useful.
Given this...
I have to explain what this code does, knowing that it performs the vectorized evaluation of F, using broadcasting and element wise operations concepts...
def F(x_pos, alpha):
    D = x_pos.reshape(1,-1) - x_pos.reshape(-1,1)
    return (1./alpha) * (alpha.reshape(1,-1) * R(D)).sum(axis=1)
My explanation is:
In the first line, the function F receives x_pos and alpha as parameters (both NumPy arrays). In the second line, the matrix D is calculated by means of broadcasting: basic operations such as addition on NumPy arrays are performed elementwise, i.e. element by element, but they are also possible with arrays of different sizes if NumPy can transform them into arrays of the same size, and this conversion is called broadcasting. Here an array of order 1xN has an array of order Nx1 subtracted from it, resulting in the matrix D of order NxN containing x_j - x_1, x_j - x_2, etc. as elements. Finally, in the last line the reciprocal of alpha is calculated (which clearly is an array), and each of its elements is multiplied by the sum, taken horizontally (due to axis=1 in the argument), of R evaluated at each cell of the matrix D multiplied by alpha_j.
Questions:
Considering I'm new to Python, is my explanation OK?
Does the code have an error or not? I ask because I don't see the condition "j must be different from 1, 2, ..., n" in each sum being taken into consideration in the code. If it is in fact wrong, how can I fix the code so that it does exactly what is stated in the image?
A few comments, improvements and fixes can be suggested here.
1] The first step could alternatively be done by just introducing a new axis and subtracting (keeping the same order, so that D[i,j] is still x_j - x_i), like so -
D = x_pos - x_pos[:,None]
In my opinion, this is a cleaner option. The performance benefit might be just marginal.
2] In the second line, I think it needs a fix as we need to avoid computations for the diagonal elements of R(D). So, If I got that correctly, the corrected code would be -
vals = R(D)
np.fill_diagonal(vals,0)
out = (1./alpha) * (alpha.reshape(1,-1) * vals).sum(axis=1)
Now, let's make the code a bit more idiomatic/cleaner.
At that line, we could write : (alpha * vals) instead of alpha.reshape(1,-1) * vals. This is because the shapes are already aligned for broadcasting as shown in a schematic diagram below -
alpha : n
vals : n x n
Thus, alpha would be automatically extended to 2D with its elements broadcasted along the first axis for the length of vals and then elementwise multiplications being generated with it. Again, this is meant as a cleaner code.
There's a further performance improvement possible here: (alpha.reshape(1,-1) * vals).sum(axis=1) can be replaced with a matrix multiplication using np.dot, as vals.dot(alpha). (alpha.dot(vals) would sum along the other axis, which gives the same result only when vals is symmetric.) The performance benefit should be noticeable with this step.
So, the second step reduces to -
out = (1./alpha) * vals.dot(alpha)
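A quick way to check any rewrite of this kind is to compare it with plain loops. The sketch below assumes the intended formula is F_i = (1/alpha_i) * sum over j != i of alpha_j * R(x_j - x_i), which is how the code in the question reads (the image itself is not available here). R is a made-up kernel that is deliberately not even, so R(D) is not symmetric and the order of the matrix product matters: summing alpha * vals over axis=1 corresponds to vals.dot(alpha), not alpha.dot(vals).

```python
import numpy as np

def R(d):
    # Hypothetical kernel, deliberately not an even function of d
    return np.exp(-d**2) * (1.0 + 0.5 * d)

def F_loops(x_pos, alpha):
    n = len(x_pos)
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if j != i:  # skip the j == i term
                out[i] += alpha[j] * R(x_pos[j] - x_pos[i])
        out[i] /= alpha[i]
    return out

def F_vectorized(x_pos, alpha):
    D = x_pos - x_pos[:, None]  # D[i, j] = x_j - x_i
    vals = R(D)
    np.fill_diagonal(vals, 0)   # drop the j == i terms
    return vals.dot(alpha) / alpha

rng = np.random.default_rng(0)
x_pos = rng.random(6)
alpha = rng.random(6) + 0.5     # keep alpha away from zero

print(np.allclose(F_loops(x_pos, alpha), F_vectorized(x_pos, alpha)))
```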
I have a function for which I need to do an infinite summation (over all the integers) numerically. The summation doesn't always need to converge, as I can change internal parameters. The function looks like,
m(g, x, q0) = sum(abs(g(x - n*q0))^2 for n in Integers)
m(g, q0) = minimize(m(g, x, q0) for x in [0, q0])
using a Pythonic pseudo-code
Using Scipy integration methods, I was just flooring the n and integrating like for a fixed x,
m(g, x, q0) = integrate.quad(lambda n:
abs(g(x - int(n)*q0))**2,
-inf, +inf)[0]
This works pretty well, but then I have to do an optimization over x, and then do another summation on that, which yields an integral of an optimization of an integral. It takes a really long time.
Do you know of a better way to do the summation that is faster? Hand coding it seemed to go slower.
Currently, I am working with
g(x) = (2/sqrt(3))*pi**(-0.25)*(1 - x**2)*exp(-x**2/2)
but the solution should be general
The paper this comes from is "The Wavelet Transform, Time-Frequency Localization and Signal Analysis" by Daubechies (IEEE 1990)
Thank you
Thanks to all the useful comments, I wrote my own summator that seems to run pretty fast. If anyone has any recommendations to make it better, I will gladly take them.
I will test this on the problem I am working on and once it demonstrates success, I will claim it functional.
from numpy import arange, sum, linalg

def integers(blk_size=100):
    # yields 0..blk_size-1, then -1..-blk_size, then the next block outwards, and so on
    x = arange(0, blk_size)
    while True:
        yield x
        yield -x - 1
        x = x + blk_size  # new array, so blocks already yielded are not modified
#
# For convergent summation
# on not necessarily finite sequences
# processes in blocks which can be any size
# shape that the function can handle
#
def converge_sum(f, x_strm, eps=1e-5, axis=0):
    total = sum(f(next(x_strm)), axis=axis)
    for x_blk in x_strm:
        diff = sum(f(x_blk), axis=axis)
        if abs(linalg.norm(diff)) <= eps:
            # Converged
            return total + diff
        else:
            total += diff
g(x) is almost certainly your bottleneck. A very quick-and-dirty solution would be to vectorize it to operate on an array of integers, then use np.trapz to estimate the integral using the trapezoid rule:
import numpy as np
# appropriate range and step size depends on how accurate you need to be and how
# quickly the sum converges
xmin = -1000000
xmax = 1000000
dx = 1
x = np.arange(xmin, xmax + dx, dx)
gx = (2 / np.sqrt(3)) * np.pi**(-0.25)*(1 - x**2) * np.exp(-x**2 / 2)
sum_gx = np.trapz(gx, x, dx)
Aside from that, you could re-write g(x) using Cython or numba to speed it up.
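For the sum in the question itself, a truncated version can also be evaluated for many x values in one vectorised call by broadcasting x against the integers n. This is only a sketch: n_max = 50 is an assumed truncation that is adequate for the Gaussian-like g given in the question with q0 = 1, and the simple grid search over x stands in for a proper minimisation.

```python
import numpy as np

def g(x):
    return (2 / np.sqrt(3)) * np.pi**(-0.25) * (1 - x**2) * np.exp(-x**2 / 2)

def m_of_x(x, q0, n_max=50):
    """Truncated sum over n = -n_max..n_max of |g(x - n*q0)|**2, for each x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n = np.arange(-n_max, n_max + 1)
    # Broadcasting: rows index x, columns index n
    return np.sum(np.abs(g(x[:, None] - n * q0))**2, axis=1)

q0 = 1.0
x_grid = np.linspace(0.0, q0, 201)
m_vals = m_of_x(x_grid, q0)
m_min = m_vals.min()  # coarse estimate of the minimum over [0, q0]
print(m_min)
```

Doubling n_max and checking that the result does not change is a cheap way to see whether the truncation is adequate for a given g and q0.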
There's a chance Numba improves speed significantly - http://numba.pydata.org
It's slightly painful to install but very easy to use. Have a look at:
https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/
I am relatively new to python and am interested in any ideas to optimize and speed up this function. I have to call it tens~hundreds of thousands of times for a numerical computation I am doing and it takes a major fraction of the code's overall computational time.
I have written this in C, but I am interested to see any tricks to make it run faster in Python specifically.
This code calculates a stereographic projection of a bigD-length vector to a littleD-length vector, per http://en.wikipedia.org/wiki/Stereographic_projection. The variable a is a numpy array of length ~ 96.
import numpy as np
def nsphere(a):
    bigD = len(a)
    littleD = 3
    temp = a
    # normalize before calculating projection
    temp = temp/np.sqrt(np.dot(temp,temp))
    # calculate projection
    for i in range(bigD-littleD + 2,2,-1 ):
        temp = temp[0:-1]/(1.0 - temp[-1])
    return temp
#USAGE:
q = np.random.rand(96)
b = nsphere(q)
print(b)
This should be faster:
def nsphere(a, littleD=3):
    a = a / np.sqrt(np.dot(a, a))
    z = a[littleD:].sum()
    return a[:littleD] / (1. - z)
Please do the math to double check that this is in fact the same as your iterative algorithm.
Obviously the main speedup here comes from the fact that this is an O(n) algorithm that replaces your O(n**2) algorithm for computing the projection. But as for speeding things up in Python specifically, you want to "vectorize your inner loop", meaning you should try to avoid loops and anything else with high Python overhead in the most performance-critical parts of your code, and instead use Python and NumPy builtins, which are highly optimized. Hope that helps.
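One way to do that double check numerically is to run both versions on the same random input and compare. The sketch below restates the two functions under different names (nsphere_iterative and nsphere_closed_form, chosen here to keep them apart) and uses a seeded random vector with mixed signs as test data.

```python
import numpy as np

def nsphere_iterative(a, littleD=3):
    temp = a / np.sqrt(np.dot(a, a))
    # Drop one coordinate per step until littleD coordinates remain
    for _ in range(len(a) - littleD):
        temp = temp[:-1] / (1.0 - temp[-1])
    return temp

def nsphere_closed_form(a, littleD=3):
    a = a / np.sqrt(np.dot(a, a))
    z = a[littleD:].sum()
    return a[:littleD] / (1.0 - z)

rng = np.random.default_rng(0)
q = rng.standard_normal(96)

print(nsphere_iterative(q))
print(nsphere_closed_form(q))
```

The two agree because each iterative step turns a denominator of 1 - (sum of the coordinates dropped so far) into the same expression with one more coordinate included.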