I have come to the conclusion that I simply do not understand Python as well as I thought. After a great many tries in Python, I wrote the same thing in Matlab and it just worked. My conclusion is that the way data structures work in Python is a lot different from what I expect, and I cannot see what that difference is.
For example, Python might give me a structure that looks like [[1], [2], [3]] where Matlab would give [1,2,3]. Looping over it in Python would then yield [1] at each step, while the same loop in Matlab would yield the sequence 1, 2, 3. I remedied this by using np.hstack to get [1,2,3], so that issue is fixed, but I suspect that the rest of my problem is also structure-based: in the Matlab code I get a coupling and the numbers converge, whereas in my Python code they all diverge.
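To illustrate the difference I mean, here is a minimal toy example (my own code, not from the solver):

import numpy as np

col = np.zeros((3, 1))  # like Matlab's zeros(3,1); iterating yields rows of shape (1,)
flat = np.hstack(col)   # shape (3,); iterating yields scalars, like a Matlab vector

for row in col:
    print(row)   # prints [0.] three times
for val in flat:
    print(val)   # prints 0.0 three times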
Is there a good resource on how data structures work in Python, other than the Python docs, ideally something that compares Matlab and Python structures? Or does anyone have an idea of how I should restructure my Python code?
EDIT: The code is an attempt at Euler integration of coupled oscillators, where each oscillator couples to all of its neighbours. It takes in a frequency w, as this is the solution for a coupling constant of K = 0. A loop runs from each oscillator i over each neighbour j. dTheta[i] is the frequency of the current oscillator in the loop; k/N is the coupling strength scaled by the number of neighbours, and theta[j,c] and theta[i,c] are the previous angles of the neighbour and the current oscillator, respectively. A new angle is then assigned based on the integration step of the frequency.
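In equation form, what I am trying to integrate is (my own notation, matching the variable names in the code below):

dTheta_i = w_i + (k/N) * sum_j sin(theta_j - theta_i)
theta_i(c) = theta_i(c-1) + dTheta_i * dt    (Euler forward step)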
In Matlab, I wrote the following:
%% Initialize items
k = 1;                              % coupling factor
N = 20;                             % number of oscillators
tend = 10;
dt = tend/200;
t = 0:dt:tend;
theta = zeros(N,length(t));
theta(:,1) = abs(2*pi*rand(N,1));   % random initial angles
dTheta = zeros(N,1);
w = .1.*ones(N,1);                  % set the frequency of the oscillators
y = zeros(size(theta));
%% Calculations
for c = 2:length(t)
    dTheta = w;
    for i = 1:N
        for j = 1:N
            dTheta(i) = dTheta(i) + ((k/N)*sin(theta(j,c-1)-theta(i,c-1))); % generate delta theta
        end
    end
    theta(:,c) = theta(:,c-1) + (dTheta*dt);  % Euler forward step
    c/length(t);                              % progress fraction (unused)
end
for c = 1:length(t)
    y(:,c) = sin((5*t(c)) + theta(:,c));      % generate the y
end
In Python I have:
import numpy as np
import matplotlib.pyplot as plt

k = 1
N = 10
tend = 20
dt = tend*4
t = np.linspace(0, tend, dt)
theta = np.zeros((N, len(t)))
theta[:, 0] = 2*np.pi*abs(np.random.randint(0, high=N, size=(N)))/10
dTheta = np.hstack(np.zeros((N, 1)))
w = np.hstack(20*np.ones((N, 1)))
y = np.zeros(theta.shape)

for c in range(1, len(t)):
    dTheta = w
    for i in range(N):
        for j in range(N):
            dTheta[i] = dTheta[i] + ((k/N)*np.sin(theta[j, c-1] - theta[i, c-1]))
    theta[:, c] = theta[:, c-1] + (dTheta*dt)  # Euler forward step
    c/len(t)

for c in range(len(t)):
    y[:, c] = np.sin((5*t[c]) + theta[:, c])

plt.figure()
for c in range(N):
    plt.plot(t, y[c, :])
plt.figure()
for c in range(N):
    plt.plot(t, theta[c, :])
plt.show()
So I'm trying to build a coupled map lattice on my computer.
A coupled map lattice (CML) is given by this equation:

x_{n+1}(i) = (1 - eps) * f(x_n(i)) + (eps/2) * [ f(x_n(i-1)) + f(x_n(i+1)) ]

where the function f(x_n) is a logistic map:

f(x) = r * x * (1 - x)

with x values from 0 to 1, and r = 4 for this CML.
Note: 'n' can be thought of as time, and 'i' as space
I have spent a lot of time understanding the iterations, and I came up with the code below; however, I'm not sure it is the correct way to iterate this equation.
Note: I have used 2D numpy arrays, where rows are 'n' and columns are 'i', as is obvious from the code.
So basically I want to develop code to simulate this equation, and here is my take on it.
Don't jump to the code directly, you won't understand what's happening without bothering to look at the equations first.
import numpy as np
import matplotlib.pyplot as plt

'''The 5 definitions below are essentially the same and only vary in their
indexing. They exist only because of the if conditions in the for loop.'''

def logInit(r, x):
    y[n, 0] = r*x[n, 0]*(1 - x[n, 0])
    return y[n, 0]

def logPresent(r, x):
    y[n, i] = r*x[n, i]*(1 - x[n, i])
    return y[n, i]

def logLast(r, x):
    y[n, L-1] = r*x[n, L-1]*(1 - x[n, L-1])
    return y[n, L-1]

def logNext(r, x):
    y[n, i+1] = r*x[n, i+1]*(1 - x[n, i+1])
    return y[n, i+1]

def logPrev(r, x):
    y[n, i-1] = r*x[n, i-1]*(1 - x[n, i-1])
    return y[n, i-1]

# 2d array with 4 rows, 3 columns, used to store the evaluated values of the
# logistic function in y[n, i]
y = np.ones(12).reshape(4, 3)

# array of random numbers between 0 and 1 with 4 rows and 3 columns
np.random.seed(0)
x = np.random.random((4, 3))

L = 3
r = 4
eps = 0.5

for n in range(3):
    for i in range(L):
        if i == 0:
            x[n+1, i] = (1-eps)*logPresent(r, x) + 0.5*eps*(logLast(r, x) + logNext(r, x))
        elif i == L-1:
            x[n+1, i] = (1-eps)*logPresent(r, x) + 0.5*eps*(logPrev(r, x) + logInit(r, x))
        elif i > 0 and i < L-1:
            x[n+1, i] = (1-eps)*logPresent(r, x) + 0.5*eps*(logPrev(r, x) + logNext(r, x))
    print(x)
This does give an output. Here it is:
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.64589411]
[0.43758721 0.891773 0.96366276]
[0.38344152 0.79172504 0.52889492]]
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.92306303]
[0.2449672 0.49731638 0.96366276]
[0.38344152 0.79172504 0.52889492]]
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.92306303]
[0.2449672 0.49731638 0.29789622]
[0.75613708 0.93368134 0.52889492]]
But I'm quite sure this is not what I'm looking for. Could you please figure out a correct way to iterate and loop the CML equation in code, and suggest the changes I have to make? Thank you very much!!
You'll have to think about the iterations and looping needed to simulate this equation. It might be tedious, but that's the only way you can suggest changes to my code.
Your calculations seem fine to me. You could improve the speed by vectorizing along the space dimension and by reusing your intermediate results y. I restructured your program a little, but in essence it does the same thing as before. The results look plausible to me: the image shows the random initial vector in the first row, and as time goes on (top to bottom) the coupling comes into play and little islands and patterns form.
import numpy as np
import matplotlib.pyplot as plt

L = 128  # grid size
N = 128  # time steps
r = 4
eps = 0.5

# Create random values for the initial time step
np.random.seed(0)
x = np.zeros((N+1, L))
x[0, :] = np.random.random(L)

# Create a helper matrix to save and reuse part of the calculations
y = np.zeros((N, L))

# Indices of previous, present and next position for every point on the grid
idx_present = np.arange(L)        # 0, 1, ..., L-2, L-1
idx_next = (idx_present + 1) % L  # 1, 2, ..., L-1, 0
idx_prev = (idx_present - 1) % L  # L-1, 0, ..., L-3, L-2

def log_vector(rr, xx):
    return rr * xx * (1 - xx)

# Loop over the time steps
for n in range(N):
    # Compute y once for the whole time step and reuse it
    # to build the next time step with coupling to the neighbours
    y[n, :] = log_vector(rr=r, xx=x[n, :])
    x[n+1, :] = (1-eps)*y[n, idx_present] + 0.5*eps*(y[n, idx_prev] + y[n, idx_next])

# Plot the results
plt.imshow(x)
plt.show()
I am building an array of shape (1000,100,100,100) from two arrays of shapes (1000,100,100,100) and (100,100,100). For this I use a for-loop over the first axis (0 to 1000). However, my code (below) is still pretty slow, and as a beginner I was wondering whether there is a more efficient way to do it.
n_train = 1000
Nx = 100
Ny = 100
Nt = 100

x = np.linspace(-Nx, Nx, 100)
y = np.linspace(-Ny, Ny, 100)
t = np.linspace(0, Nt-1, 100)

def gw(xx, yy, tt):
    num1 = tt - np.sqrt(xx**2 + yy**2)
    denom = (tt**2 - xx**2 - yy**2)
    if denom < 0:
        denom1 = 0
    else:
        denom1 = np.sqrt(denom)
    kk = np.heaviside(num1, 1)/(2*np.pi*denom1 + 1)
    return kk

# Slow for-loop
for i_train in range(n_train):
    ugreen = np.array([gw(i, j, k) for k in t for j in y for i in x])
    Ugreen = ugreen.reshape(Nt, Ny, Nx)
    prob = randrange(2)
    Utot = UN[i_train, :, :, :] + Ugreen/1.75*prob
    Utot = (Utot - np.min(Utot))/(np.max(Utot) - np.min(Utot))
    Utot_green = 10*Ugreen/1.75*prob
    P[i_train, :, :, :] = Utot
    Pg[i_train, :, :, :] = Utot_green
I think several issues make this difficult. First of all, your code snippet does not compile, which IMHO is problematic in any programming language and on the corresponding forums. And second, you do not explain what you are actually trying to calculate with your code, so I had to translate your code back into mathematics. This brings me to the numpy-specific problem of this question. Long story short: if you had given some kind of mathematical equation in terms of vectors and matrices, it would have been much easier to answer.
randrange, UN, P and Pg are not defined in your code.
In each loop you evaluate Ugreen again, but I don't see it changing within the loop. Move it outside the loop and save time.
gw() can be written vectorized (and it is easy to see now what you are calculating; this should look very similar to the handwritten equation in your notes):
import numpy as np

def gw(xx, yy, tt):
    return (0 <= tt - np.sqrt(xx**2 + yy**2))/(2*np.pi*np.sqrt(np.clip(tt**2 - xx**2 - yy**2, 0, None)) + 1)

Ugreen = gw(x[:, None, None], y[None, :, None], t[None, None, :])
prob = randrange(2)  # I assume this is some scalar
Utot = UN + Ugreen[None, :, :, :]/1.75*prob
Utot = (Utot - np.min(Utot, axis=0)[None, ...])/(np.max(Utot, axis=0)[None, ...] - np.min(Utot, axis=0)[None, ...])
Here is a working example of what I think you are trying to calculate:

import numpy as np

def gw(x, y, t):
    return (0 <= t - np.sqrt(x**2 + y**2))/(2*np.pi*np.sqrt(np.clip(t**2 - x**2 - y**2, 0, None)) + 1)

n_train = 1000
Nx = 100
Ny = 100
Nt = 100

x = np.linspace(-Nx, Nx, 100)
y = np.linspace(-Ny, Ny, 100)
t = np.arange(Nt)

Ugreen = gw(x[:, None, None], y[None, :, None], t[None, None, :])
prob = 2
Utot = np.random.random((n_train, Nt, Ny, Nx))
Utot += Ugreen[None, :, :, :]/1.75*prob
# In-place normalization: divide by the range first; afterwards the range
# along axis 0 is 1, so the subtraction removes the rescaled minimum and the
# result equals (Utot - min)/(max - min) without another full-size temporary.
Utot /= (np.max(Utot, axis=0)[None, ...] - np.min(Utot, axis=0)[None, ...])
Utot -= np.min(Utot, axis=0)[None, ...]/(np.max(Utot, axis=0)[None, ...] - np.min(Utot, axis=0)[None, ...])
I had to rearrange the Utot evaluation in order not to run into memory issues; however, on my machine it runs within a few seconds.
It is certainly possible to optimize this further, and I hope for responses from other users so I can learn new things.
Some general hints on numpy:
IMHO, if you want to see the true practical power of numpy, it is better to forget everything you know about numerical programming in other languages than to stick with the loop-wise thinking you might bring along.
My experience is that numpy is very fast, which is very nice and so on, but it also makes your code extremely short and compact and, best of all for scientific work, extremely close to whatever complicated equation you want to solve. One should start implementing equations as close to the paper as possible and optimize only where and when necessary.
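To make this concrete, here is a toy sketch of my own (unrelated to your specific problem): instead of filling a grid element by element, write the equation itself and let broadcasting build the grid:

import numpy as np

x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 1.0, 4)

# Loop-wise thinking, carried over from other languages:
f_loop = np.empty((5, 4))
for i in range(5):
    for j in range(4):
        f_loop[i, j] = x[i]**2 + y[j]

# Numpy thinking: the equation itself, broadcast over the grid.
f_vec = x[:, None]**2 + y[None, :]

assert np.allclose(f_loop, f_vec)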
In the following code I am trying to implement the following:
- Write a function naturalSpline that implements cubic spline interpolation with natural boundary conditions.
- Use a tridiagonal solver to solve the arising tridiagonal system for the first derivatives.
- The prototype of the function should read yy = naturalSpline(x,y,xx), where (x,y) are the input points and data, and xx are the points where the data should be interpolated.
I figured I would first start with the second bullet point, creating the tridiagonal solver, which is just the Thomas algorithm. I spent some time on this part of the code and have formatted it below. But now I am trying to finish the first and third bullet points, and I am not sure how to use what I have already done to finish those. Looking for some help with this! Thanks in advance.
import numpy as np

def TDMA(a, b, c, d):
    """Thomas algorithm: solve a tridiagonal system with sub-diagonal a,
    diagonal b, super-diagonal c and right-hand side d."""
    n = len(d)
    w = np.zeros(n-1, float)  # modified super-diagonal
    g = np.zeros(n, float)    # modified right-hand side
    p = np.zeros(n, float)    # solution

    w[0] = c[0]/b[0]
    g[0] = d[0]/b[0]
    for i in range(1, n-1):
        w[i] = c[i]/(b[i] - a[i-1]*w[i-1])
    for i in range(1, n):
        g[i] = (d[i] - a[i-1]*g[i-1])/(b[i] - a[i-1]*w[i-1])
    p[n-1] = g[n-1]
    for i in range(n-1, 0, -1):  # back substitution
        p[i-1] = g[i-1] - w[i-1]*p[i]
    return p
A = np.array([[10,2,0,0],[3,10,4,0],[0,1,7,5], [0,0,3,4]],dtype=float)
a = np.array([3.,1,3])
b = np.array([10.,10.,7.,4.])
c = np.array([2.,4.,5.])
d = np.array([3,4,5,6.])
print (TDMA(a, b, c, d))
This gives the correct output; I even tested it against NumPy's dense solver, np.linalg.solve, to make sure it was correct:
[ 0.14877589  0.75612053 -1.00188324  2.25141243]
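(The check is essentially the call below, using the full matrix A defined above; note that np.linalg.solve takes the full matrix and the right-hand side, not the three diagonals.)

print(np.linalg.solve(A, d))  # dense solve; should match TDMA(a, b, c, d)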
For each interval [x_k, x_(k+1)], you can solve the four equations
p_k(x_k) = f(x_k) = y_k
p_k'(x_k) = f'(x_k) = d_k
p_k(x_(k+1)) = f(x_(k+1)) = y_(k+1)
p_k'(x_(k+1)) = f'(x_(k+1)) = d_(k+1)
(without checking your code, I assume that this is what you did).
From this, you can construct a dict
{'polynomials': [ [a_0, ..., d_0], ..., [a_24, ..., d_24] ],
'knots': [x_0, ..., x_24]}
For each x of your 250 points, you check for which k the point x lies in the interval [x_k, x_(k+1)] and evaluate p_k(x).
All of this is straightforward mathematics and Python coding. If something is not clear, you are better off learning more about both fields instead of getting specialized advice on this website.
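For the evaluation step, here is a minimal sketch (my own code, not the OP's; it uses the cubic Hermite form built directly from the knot values y_k and first derivatives d_k from the tridiagonal solve, which is equivalent to expanding the coefficient lists above; the name hermite_eval is mine):

import numpy as np

def hermite_eval(x, y, d, xx):
    # Piecewise cubic with values y and first derivatives d at the knots x,
    # evaluated at the query points xx.
    k = np.clip(np.searchsorted(x, xx) - 1, 0, len(x) - 2)  # interval per point
    h = x[k+1] - x[k]
    s = (xx - x[k]) / h            # local coordinate in [0, 1]
    h00 = (1 + 2*s)*(1 - s)**2     # cubic Hermite basis polynomials
    h10 = s*(1 - s)**2
    h01 = s**2*(3 - 2*s)
    h11 = s**2*(s - 1)
    return h00*y[k] + h10*h*d[k] + h01*y[k+1] + h11*h*d[k+1]

With this, yy = naturalSpline(x, y, xx) reduces to assembling the tridiagonal system for the derivatives d, solving it with TDMA, and returning hermite_eval(x, y, d, xx).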
I am trying to implement Quantitative Mobility Spectrum Analysis (QMSA) in Python 2.7 + NumPy + MatPlotLib. QMSA is a tool for the analysis of magnetic-field-dependent Hall-effect measurements. In a semiconductor, more than one type of carrier (electron or hole), with different concentrations and mobilities, can be present. QMSA can separate the different types of carriers and determine their density, mobility, and sign. There are improved versions of it, such as i-QMSA. However, even building the standard QMSA from scratch is hard work.
The problem is well known and there are a lot of scientific articles on the subject. However, because the copyright of each article is held by its publisher, I cannot share them with you; mostly you can reach them with a university account. There are also some theses about it, such as:
1.) digital.library.txstate.edu/bitstream/handle/10877/4390/CUNNINGHAM-THESIS.pdf?sequence=1
2.) wrap.warwick.ac.uk/56132/1/WRAP_thesis_Kiatgamolchai_2000.pdf
3.) etd.lib.nsysu.edu.tw/ETD-db/ETD-search/getfile?URN=etd-0629104-152644&filename=etd-0629104-152644.pdf (I think it is in Chinese)
4.) fbetezbankasi.gazi.edu.tr/pdf-indir/22233741 (I think it is in Turkish. The equations given in the QMSA manual are listed in the thesis on pages 73-75.)
In the code I am trying to use a successive over-relaxation (SOR) iterative approach, as was originally done. First I prepared a simple script to produce artificial experimental data for the magnetic-field-dependent conductivity tensors sigmaxx(B) and sigmaxy(B). With this experimental input and predefined mobility values, the code runs as follows:
for i in range(0, n, 1):
    bxx[i] = data[i][1]
    bxy[i] = data[i][2]
    for j in range(0, m, 1):
        if data[i][0] == 0:
            data[i][0] = 0.001
        Axx[j, i] = 1/(1 + (mobin[j]**2)*(data[i][0]**2))
        Axy[j, i] = (mobin[j]*data[i][0])/(1 + (mobin[j]**2)*(data[i][0]**2))
Here, bxx, bxy, mobin and data[i][0] are the experimental sigmaxx, the experimental sigmaxy, the predefined mobility values taken from a text file, and the experimental magnetic field points, respectively. We are therefore trying to solve two equations of the form Ax = b with SOR. For the XX part of the problem, A, x and b are named Axx, solxx and bxx; for the XY part they are named Axy, solxy and bxy.
For SOR you need a parameter called omega. I found the optimum omega values with Gauss-Seidel iterations (shown here for the XX part of the conductivity; the same procedure is also done for XY):
print "Iterations for finding omega..."
for it_count in range(1,501):
for i in range(0,n,1):
s1xx = np.dot(Axx[i, :i], solxx_new[:i])
s2xx = np.dot(Axx[i, i + 1:], solxx[i + 1:])
solxx_new[i] = (bxx[i] - s1xx - s2xx) / Axx[i, i]
dx = sqrt(np.dot(solxx_new-solxx,solxx_new-solxx))
for i in range(0,n,1):
solxx[i] = solxx_new[i]
if it_count == k: dx1 = dx
if it_count == k + 1:
dx2 = dx
omegaxx = 2.0/(1.0 + sqrt(abs(1.0 - (dx2/dx1))))
break
print "I think best omega value for these XX calculations is ", omegaxx
This "finding optimum omega" procedure is taken from Kiusalaas, Jaan. Numerical methods in engineering with Python 3. Cambridge university press, 2013, page 83.
After finding omega, this time same iterations are done with SOR:
for it_count in range(ITERATION_LIMIT):
    for i in range(0, n, 1):
        s1xx = np.dot(Axx[i, :i], solxx_new[:i])
        s2xx = np.dot(Axx[i, i + 1:], solxx[i + 1:])
        solxx_new[i] = (1 - omegaxx)*solxx[i-1] + omegaxx*(bxx[i] - s1xx - s2xx) / Axx[i, i]
    if np.allclose(solxx, solxx_new, rtol=1e-9):
        break
    print "Iteration:", it_count
    for i in range(0, n, 1):
        solxx[i] = solxx_new[i]
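For reference, the textbook SOR update I am trying to implement is

x_i_new = (1 - omega)*x_i_old + omega*(b_i - sum_{j<i} A_ij*x_j_new - sum_{j>i} A_ij*x_j_old)/A_ii

i.e. the relaxation term mixes in the old value of the same component i.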
Then I calculated the conductivity spectrum values for each mobility with:
for i in range(0, n, 1):
    if i == 0:
        deltamob = 100
    else:
        deltamob = mobin[i] - mobin[i-1]
    sn[i] = abs((solxx[i] - solxy[i]))/(2*deltamob*1.6e-19)
    sp[i] = abs((solxx[i] + solxy[i]))/(2*deltamob*1.6e-19)
    x[i] = mobin[i]
    B[i] = data[i][0]
Then x vs sn and x vs sp should be the mobility spectra. The only thing I get is a single Gaussian-like peak, and even without any hole carriers the electron and hole spectra are identical. The problem is that solxx and solxy keep getting larger after each iteration. It may be caused by the SOR code being written for Python 3 while I am using Python 2.7.
I can send files if necessary.
Thanks for any responses.
I am trying to figure out whether Python/Numpy is a viable alternative for developing my numerical software, which is already available in C++. In order to get performance in Python/Numpy, one needs to "vectorize" the code. But it turns out that as soon as I move away from very simple examples, I struggle to vectorize it (I am not talking about SIMD instructions but about "efficient Numpy code" without loops). Here is the algorithm I want to implement efficiently in Python/Numpy:
1. Create a numpy array containing 1.0, 1.0 + 1/n, 1.0 + 2/n, ..., 2.0.
2. For every u in the array, compute the root of x^2 - u using Newton's method, stopping when |dx| <= 1.0e-7. Store the result in an array result.
3. Sum all the elements of the result array.
Here is the algorithm in Python I want to speed up
import numpy as np

n = 1000000
data = np.arange(1.0, 2.0, 1.0 / n)

def newton(u):
    x = 2.0
    while True:
        f = x**2 - u
        df_dx = 2 * x
        dx = f / df_dx
        if abs(dx) <= 1.0e-7:
            break
        x -= dx
    return x

result = list(map(newton, data))
print(result[n - 1])
Here is a version of the algorithm in C++11
#include <iostream>
#include <vector>
#include <cmath>

int main(int argc, char const *argv[]) {
    auto n = std::size_t{100000000};
    auto v = std::vector<double>(n + 1);
    for (size_t k = 0; k < v.size(); ++k) {
        v[k] = 1.0 + static_cast<double>(k) / n;
    }
    auto result = std::vector<double>(n + 1);
    for (size_t k = 0; k < v.size(); ++k) {
        auto x = double{2.0};
        while (true) {
            auto f = double{x * x - v[k]};
            auto df_dx = double{2 * x};
            auto dx = double{f / df_dx};
            if (std::abs(dx) <= 1.0e-7) {
                break;
            }
            x -= dx;
        }
        result[k] = x;
    }
    auto somme = double{0.0};
    for (size_t k = 0; k < result.size(); ++k) {
        somme += result[k];
    }
    std::cout << somme << std::endl;
    return 0;
}
It takes 2.9 seconds to run on my machine. Is there a way to write a fast Python/Numpy algorithm that does the same thing? (I am willing to accept something that is less than 5 times slower.) Thanks.
You can do step 1 with numpy efficiently:
1.0 + np.arange(n + 1) / n
however, I think you would need the np.vectorize() method to feed x back into your calculated values, and it's not an efficient function (basically a wrapper for a Python loop). If you can use scipy, there are built-in methods that might do what you want: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.newton.html
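For example (an untested sketch on my part; newer SciPy versions accept an array x0, which vectorizes the Newton iteration internally):

import numpy as np
from scipy import optimize

n = 1000000
u = 1.0 + np.arange(n + 1) / n

# Solve x**2 - u = 0 for every u at once; passing an array x0 makes
# scipy.optimize.newton iterate on the whole array (recent SciPy required).
roots = optimize.newton(lambda x, u: x**2 - u, np.full(n + 1, 2.0),
                        fprime=lambda x, u: 2.0*x, args=(u,))
print(roots.sum())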
EDIT: Having thought a bit more about this, I followed up on @ev-br's point and tried some alternatives. The masking uses too much processing, but abs().max() is pretty fast, so a compromise might be to "divide the problem into blocks", both in the first dimension of the array and in the iteration direction. The following doesn't do too badly (< 20 s) on my pretty low-power laptop - certainly much faster than np.vectorize() or any of the scipy solving systems I could find. (If I set m too big it runs out of something (memory?) and grinds to a complete halt!)
import numpy as np

n = 100000000
m = 5000000
block = 3

u = 1.0 + np.arange(n + 1) / n
x = np.full(u.shape, 2.0)
dx = np.ones(u.shape)

for i in range(0, n, m):
    while np.abs(dx[i:i+m]).max() > 1.0e-7:
        for j in range(block):
            dx[i:i+m] = (x[i:i+m] ** 2 - u[i:i+m]) / (2 * x[i:i+m])
            x[i:i+m] -= dx[i:i+m]
Here's a toy example. Notice that often vectorization means writing your code as if you're manipulating numbers, and letting numpy do its magic:
>>> import numpy as np
>>> a = np.array([1., 2., 3.])
>>> def f(x):
... return x**2 - a, 2.*x # function and derivative
>>>
>>> def newt(f, x0):
... x = np.asarray(x0)
... for _ in range(5): # hardcode the number of iterations (I know)
... v, dv = f(x)
... x -= v / dv
... return x
>>>
>>> newt(f, [1., 1., 1.])
array([ 1. , 1.41421356, 1.73205081])
If this is a performance bottleneck, it is unlikely to be competitive with hand-written C++ code: first of all, you're manipulating Python objects with all the associated overhead; then numpy is likely doing a bunch of array allocations under the hood.
An often viable strategy is to start by writing things in Python/numpy, and then move the bottlenecks into compiled code, e.g. Cython or C++ wrapped with Cython. In this particular case, since you already have the C++ code, just wrapping it with Cython is likely easiest, but YMMV.
I'm not looking to wave small snippets of code around as a solution, but here's something to get you started. I have a strong suspicion that you're having trouble just declaring such an array in Python without spending too much time on it, so I'll mostly help you out there.
As far as the square roots are concerned, please add your example Python code and I'll see what I can help optimize from that point on. In my example the roots and the sum are found with the default numpy functions/methods.
import numpy as np

def summing():
    n = 1000000
    ar = np.arange(0, n)   # [0, 1, 2, ..., n-1]
    ar = ar / float(n)     # [0, 1/n, 2/n, ...]
    ar = ar + np.ones(n)   # [1, 1 + 1/n, 1 + 2/n, ...]
    roots = np.sqrt(ar)    # square roots of all elements at once
    return np.sum(roots)
In short, to get the starting array it's best to use a workaround:

- initialize an array ar with the values [1, 2, 3, ..., n]
- divide ar by n; this gives us the 1/n, 2/n, ... members
- add to that an array of the same dimensions that contains just the number 1.0

This gets us the full array [1., 1.000001, 1.000002, ..., 1.999998, 1.999999] we're after, if I understood you right. Then find the square roots and sum them.
Average of 10 sequential execution times is 0.018786 seconds.
Obviously I'm 6 years late to this party, but this question is a common stumbling block for people trying to use numpy effectively for real scientific work. The basic idea is covered in @ev-br's answer. The OP points out that the solution offered there (even modified to stop iterating when a convergence criterion is met rather than after a fixed number of iterations) takes the same number of passes for each element of u. I want to show how you can avoid that objection using pure numpy code, making explicit the mask suggestion in @ev-br's comment.
However, I also want to point out that in many real-world situations the number of passes for a Newton-like iteration to converge varies so little that the general technique I illustrate here will actually slow numpy code down significantly. If the average number of iterations is within a factor of two or three of the maximum, you should stick with something closer to @ev-br's answer (including his first comment).
The numpy performance numbers you need to understand are these: loops over array indices run 200 to 500 times slower in pure numpy code than in compiled code. On the other hand, if you manage to use numpy's array syntax to avoid all index loops, you can get within about a factor of 5 of compiled speed. (The factor of 5 is partly because of memory management, as @ev-br mentions, but also because optimized compiled code overlaps many different arithmetic operations inside each index loop, while numpy just performs a single arithmetic operation, storing everything back to memory after each operation.) The point is that a factor-of-100 difference means it often pays to do substantial amounts of "extra" work in numpy code: even if you do 10 times the number of floating point operations in vectorized numpy code, it will still run 10 times faster than index-loop code that avoids the "extra" work. (Incidentally, the python map function is implemented as an interpreted index loop - it has nothing to do with numpy array operations.)
from numpy import asfarray, broadcast_arrays, arange

# Begin by defining the function to be inverted by Newton's method.
def f_dfdx(x):
    x = asfarray(x)  # always avoid repeated type conversions
    return x**2, 2.*x

# First, the simplest algorithm to find x such that f(x)=y.
# We must supply a starting guess x0 for x.
def f_inverse0(f_dfdx, y, x0, tol=1.e-7):
    y, x = broadcast_arrays(asfarray(y), asfarray(x0))
    x = x.copy()  # without this may clobber input x0
    for npass in range(20):
        f, dfdx = f_dfdx(x)
        dx = (f - y) / dfdx
        if (abs(dx) <= tol).all():
            break  # iterate all x until all have converged
        x -= dx
    else:
        raise RuntimeError("failed to converge")
    return x

# A frequently slower algorithm that avoids extra iterations.
def f_inverse1(f_dfdx, y, x0, tol=1.e-7):
    y, x = broadcast_arrays(asfarray(y), asfarray(x0))
    shape = x.shape
    y, x = y.ravel(), x.flatten()  # avoid clobbering x0
    unconverged = arange(y.size)
    for npass in range(20):
        f, dfdx = f_dfdx(x[unconverged])
        dx = (f - y[unconverged]) / dfdx
        unc = abs(dx) > tol
        unconverged = unconverged[unc]
        if not unconverged.size:
            break  # iterate all x until all have converged
        x[unconverged] -= dx[unc]
    else:
        raise RuntimeError("failed to converge")
    return x.reshape(shape)
On my machine, the OP's C++ program runs in 2.03 s (1.64+0.38 user+sys). For n=100 million as for the C++ program, f_inverse0 runs in 20.4 s (4.7+15.6 user+sys). As expected, f_inverse1 is slower, 51.3 s (11.5+39.8 user+sys). Again, don't automatically try to minimize total operation count when you are writing numpy code. The high system overhead is probably due to heavy memory management - every vector temporary is 0.8 GB and the memory manager is struggling.
Cutting the array size to n = 1 million elements (8 MB), then multiplying the runtime by 100 brings the system time down by a large factor, f_inverse0 now takes 16.1 s (12.5+3.6), while f_inverse1 takes 22.3 s (16.2+5.1). This factor of 8 to 10 slower than compiled code is not unreasonable to expect for numpy performance.
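For completeness, here is a minimal driver for the functions above, following the OP's three steps (my own sketch; set n to 100 million to reproduce the timings):

from numpy import arange

n = 1000000
u = 1.0 + arange(n + 1) / n      # step 1: the grid 1.0, 1.0 + 1/n, ..., 2.0
x = f_inverse0(f_dfdx, u, 2.0)   # step 2: Newton iteration to |dx| <= 1e-7
print(x.sum())                   # step 3: sum all the roots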