Code running since infinite time - python

Below is my code. It has been running for a very long time (almost a day). I cannot tell whether that is because there are many loops or because there is some unending loop. Here is my code:
mat1 = np.zeros((1024,1024,360),dtype=np.int32)
k = 498
gamma = 0.00774267
R = 0.37
g = np.zeros(1024)
g[0:512] = np.linspace(0,1,512)
g[513:] = np.linspace(1,0,511)
pf = np.zeros((1024,1024,360))
pf1 = np.zeros((1024,1024,360))
for b in range(0,1023) :
    for beta in range(0,359) :
        for a in range(0,1023) :
            pf[a,b,beta] = (R/(((R**2)+(a**2)+(b**2))**0.5))*mat[a,b,beta]
        pf1[:,b,beta] = np.convolve(pf[:,b,beta],g,'same')
for x in range(0,1023) :
    for y in range(0,1023) :
        for z in range(0,359) :
            for beta in range(0,359) :
                a = R*((-x*0.005)*(sin(beta)) + (y*0.005)*(cos(beta)))/(R+(x*0.005)*(cos(beta))+(y*0.005)*(sin(beta)))
                b = z*R/(R+(x*0.005)*(cos(beta))+(y*0.005)*(sin(beta)))
                U = R+(x*0.005)*(cos(beta))+(y*0.005)*(sin(beta))
                l = math.trunc(a)
                m = math.trunc(b)
                if (0<=l<1024 and 0<=m<1024) :
                    mat1[x,y,z] = mat[x,y,z] + (R**2/U**2)**pf1[l,m,beta]
import matplotlib.pyplot as plt
from skimage.transform import iradon
import matplotlib.cm as cm
from PIL import Image
I8 = (((mat1 - mat1.min()) / (mat1.max() - mat1.min())) * 255.9).astype(np.uint8)
img = Image.fromarray(I8)
img.save("M4.png")
im = Image.open("M4.png")
im.show()

Your code will run in finite time.
However, if you sprinkle in a few print statements to see where you are in the various loops, you can see why it will take so long. For instance, after the for y in range(0, 1023): line, add a print(y) line, you'll see it takes about 1 second between each printout, so that part of your code will take about 1023 x 1023 seconds, which is 12 days. You may want to look into modules like multiprocessing to parallelize some of the calculations, but even on a 32 core machine your code will still take around half a day to run.
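The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the one-second-per-y-step figure is the measurement quoted above, and the variable names are made up for illustration.

```python
# Rough cost estimate for the second loop nest (x, y, z, beta).
SECONDS_PER_DAY = 24 * 60 * 60

nx, ny, nz, nbeta = 1023, 1023, 359, 359   # loop lengths from range(0, 1023) etc.

outer_iterations = nx * ny                  # how many times the (z, beta) loops start
inner_iterations = nx * ny * nz * nbeta     # how many times the innermost body runs

# Assumption taken from the measurement above: about 1 second per y step.
seconds_per_y_step = 1.0
estimated_days = outer_iterations * seconds_per_y_step / SECONDS_PER_DAY

print(f"{inner_iterations:.3e} inner iterations")
print(f"about {estimated_days:.1f} days")
```

With more than 10^11 executions of the innermost body, pure-Python loops are the bottleneck regardless of what the body does.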
There are several small optimizations you can make, though I'm not entirely sure how much they will help. For one, you can calculate sin(beta) and cos(beta) once each in the inner loop, rather than 4 times each. You can calculate R**2 once globally, rather than every time inside the inner loop. You can calculate x*0.005 and y*0.005 less often, as well as a and l. You can split up the conditional involving l and m, and move the l conditional up above the z loop, thereby potentially avoiding that z loop sometimes.
Also, it seems weird that you're having beta range from 0 to 359, and then calculating its sin and cos values. Those functions expect arguments in radians, e.g. the sine of a right angle is not sin(90) but rather sin(math.pi/2).
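As a sketch of the two points above (hoisting repeated work and using radians), the sine and cosine of every angle can be tabulated once, and a, b and U can then be computed for all beta values in one array operation. The helper name project and the constant STEP are hypothetical, and the sketch assumes beta is meant to be an angle in degrees.

```python
import numpy as np

R = 0.37
STEP = 0.005                                  # the 0.005 factor used in the question

# One table of sines and cosines for beta = 0..359 degrees, converted to radians.
beta_rad = np.deg2rad(np.arange(360))
sin_b = np.sin(beta_rad)
cos_b = np.cos(beta_rad)

def project(x, y, z):
    """Return a, b and U for every beta at once (arrays of length 360)."""
    xs, ys = x * STEP, y * STEP               # computed once per (x, y)
    U = R + xs * cos_b + ys * sin_b           # shared denominator
    a = R * (-xs * sin_b + ys * cos_b) / U
    b = z * R / U
    return a, b, U

a, b, U = project(x=10, y=20, z=5)
```

This replaces the innermost beta loop with array arithmetic; the x, y and z loops would still need the same treatment to get a large speedup.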

Related

How to make a graph between order of the matrix and the time taken to multiply the two matrices?

import numpy as np
from time import time
import matplotlib.pyplot as plt
np.random.seed(27)
mysetup = "from math import sqrt"
begin=time()
i=int(input("Number of rows in first matrix"))
k=int(input("Number of column in first and rows in second matrix"))
j=int(input("Number of columns in second matrix"))
A = np.random.randint(1,10,size = (i,k))
B = np.random.randint(1,10,size = (k,j))
def multiply_matrix(A,B):
    global C
    if A.shape[1]==B.shape[0]:
        C=np.zeros((A.shape[0],B.shape[1]),dtype=int)
        for row in range(i):
            for col in range(j):
                for elt in range(0,len(B)):
                    C[row,col] += A[row,elt]*B[elt,col]
        return C
    else:
        return "Cannot multiply A and B"
print(f"Matrix A:\n {A}\n")
print(f"Matrix B:\n {B}\n")
D=print(multiply_matrix(A, B))
end=time()
t=print(end-begin)
x=[0,100,10]
y=[100,100,1000]
plt.plot(x,y)
plt.xlabel('Time taken for the program to run')
plt.ylabel('Order of the matrix multiplication')
plt.show()
In the program, I have generated random elements for the matrices to be multiplied. Basically, I am trying to measure the time it takes to multiply two matrices. The values i, j and k are taken as the order of the matrices. Since we cannot multiply matrices where the number of columns of the first is not equal to the number of rows of the second, I have used the same variable k for both.
Initially I considered incrementing the order of the matrix using a for loop, but wasn't able to do so. I want the graph to display the time it took to multiply the matrices on the x axis and the order of the resultant matrix on the y axis.
There is a problem in the logic I applied, but I cannot work out how to solve it, as I am a beginner in programming.
I was expecting the y axis to have a scale ranging from 0 to 100 in steps of 10 and the x axis a scale from 100 to 1000 in steps of 100.
The thousandth entry on the x axis would correspond to the time it took to multiply two matrices with 1000 rows and columns.
Suppose the time it took to compute this was 200 seconds. Then the graph should show the point (1000, 200).
Some problematic points I'd like to address:
1. You start the timer before the user types the input, and that delay varies from run to run. We want to be as precise as possible, so we should only measure how long the multiply_matrix function takes to run.
2. Because you take user input, each run gives you one result, and one result is only a single point, not a full graph. So we need to get rid of the user input and generate our own data.
3. Following on from point #2, a single "one shot" measurement per matrix order is not enough. To test how long it takes to multiply two matrices of order 300 (for example), we should do it N times and take the average to be more precise. We are also generating random numbers, so some randomly generated matrices may be quicker to multiply than others. Averaging over N tests is not 100% accurate, but it does help.
4. You don't need to make C a global variable; it can be a local variable of multiply_matrix, which returns it anyway. The global statement does not help here either: C does not exist at module level until the function has been called.
5. This is not a must, but it can improve your program a little: use time.perf_counter(), which uses the clock with the highest available resolution to measure a short duration (time.perf_counter_ns() additionally avoids the precision loss of the float type). Note that this needs import time rather than from time import time.
6. You need to swap the axes, because we want to see how the time is affected by the order of the matrices, not the opposite! (So our X axis is now the order and the Y axis is the average time it took to multiply them.)
Those fixes translate to this code:
1. Timing multiply_matrix only:
begin = time.perf_counter()
C = multiply_matrix(A, B)
end = time.perf_counter()
2+3. Generating our own data, looping from order 1 to order maximum_order, taking 50 tests for each order:
maximum_order = 50
tests_number_for_each_order = 50
def generate_matrices_to_graph():
    matrix_orders = [] # our X
    multiply_average_time = [] # our Y
    for order in range(1, maximum_order):
        print(order)
        times_for_each_order = []
        for _ in range(tests_number_for_each_order):
            # generating random square matrices of size order.
            A = np.random.randint(1, 10, size=(order, order))
            B = np.random.randint(1, 10, size=(order, order))
            # getting the time it took to compute
            begin = time.perf_counter()
            multiply_matrix(A, B)
            end = time.perf_counter()
            # adding it to the times list
            times_for_each_order.append(end - begin)
        # adding the data about the order and the average time it took to compute
        matrix_orders.append(order)
        multiply_average_time.append(sum(times_for_each_order) / tests_number_for_each_order) # average
    return matrix_orders, multiply_average_time
4. Minor changes to multiply_matrix as we don't need i, j, k from the user:
def multiply_matrix(A, B):
    matrix_order = A.shape[1]
    C = np.zeros((matrix_order, matrix_order), dtype=int)
    for row in range(matrix_order):
        for col in range(matrix_order):
            for elt in range(0, len(B)):
                C[row, col] += A[row, elt] * B[elt, col]
    return C
and finally call generate_matrices_to_graph
# calling the generate_matrices_to_graph function
plt.plot(*generate_matrices_to_graph())
plt.xlabel('Matrix order')
plt.ylabel('Time [in seconds]')
plt.show()
Some outputs:
We can see that when tests_number_for_each_order is small, the graph becomes noisier and loses crispness.
Going from order 1-40 with 1 test for each order:
Going from order 1-40 with 30 tests for each order:
Going from order 1-40 with 80 tests for each order:
I love this kind of question:
import numpy as np
from time import time
import matplotlib.pyplot as plt
np.random.seed(27)
dim = []
times = []
for i in range(1,10001,10):
    A = np.random.randint(1,10,size=(1,i))
    B = np.random.randint(1,10,size=(i,1))
    begin = time()
    C = A @ B  # matrix product; A*B would broadcast element-wise into an (i, i) array
    times.append(time()-begin)
    dim.append(i)
plt.plot(times,dim)
This is a simplified test in which I used one-dimensional matrices: (1,1)(1,1), (1,10)(10,1), (1,20)(20,1) and so on.
But you can add a second iteration to also change the "outer" dimension of the matrices and see how this affects the computation time.
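A minimal sketch of that double iteration could look like the following. The helper time_matmul, the chosen sizes and the number of repeats are all assumptions made for illustration; time.perf_counter() is used for the measurement and each timing is averaged over a few repeats.

```python
import time
import numpy as np

rng = np.random.default_rng(27)

def time_matmul(rows, inner, cols, repeats=5):
    """Average wall-clock time of A @ B for a (rows x inner) by (inner x cols) product."""
    A = rng.integers(1, 10, size=(rows, inner))
    B = rng.integers(1, 10, size=(inner, cols))
    elapsed = []
    for _ in range(repeats):
        begin = time.perf_counter()
        A @ B
        elapsed.append(time.perf_counter() - begin)
    return sum(elapsed) / repeats

outer_sizes = [1, 10, 50]      # rows of A and columns of B
inner_sizes = [10, 100, 200]   # shared dimension

# One averaged timing per (outer, inner) combination.
timings = {(o, i): time_matmul(o, i, o) for o in outer_sizes for i in inner_sizes}
```

The resulting dictionary can then be plotted with one line per outer size, with the inner size on the x axis and the time on the y axis.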

Is there a way to make my 1D random walk code more time efficient here?

So my code plots the average distance from equilibrium of a 1D random walk over 1000 steps. My code works, but takes an inordinate amount of time, I think probably due to the loop inside a loop of the system. Is there a way to make this more efficient or am I stuck with this? Thanks :)
nsteps = 1000
ndim = 1
numpy.seterr(invalid="ignore")
for i in range(100):
    w = walker(numpy.zeros(1))
    ys = w.doSteps(nsteps)
    avgpos = []
    for i in range(0, len(ys)):
        avgpos.append(sum(ys[:i+1])/i+1)
    plt.plot(range(nsteps+1),avgpos)
The ys are the results from doing n steps. I'm sure the inefficiency is from something within the loop rather than a problem in the earlier code
I'd suggest using the built in method for doing cumulative sums. I'd also suggest fixing the warnings from Numpy, I think you need some brackets around sum(...)/i+1. Python, like most languages, would evaluate this as (sum(...)/i)+1 because division binds more tightly than addition.
A minimal working example would thus be:
import numpy as np
import matplotlib.pyplot as plt
nsteps = 1000
for i in range(100):
    ys = np.cumsum(np.random.standard_normal(nsteps))
    avgpos = []
    for i in range(0, len(ys)):
        avgpos.append(sum(ys[:i+1])/(i+1)) # note brackets
    plt.plot(np.array(avgpos))
which takes my laptop ~8 seconds.
I could instead use the Numpy cumsum method like this:
for i in range(100):
    ys = np.cumsum(np.random.standard_normal(nsteps))
    avgpos = np.cumsum(ys) / (np.arange(nsteps)+1)
    plt.plot(avgpos)
which only takes ~0.1 seconds.
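To check that the two versions really compute the same running average, they can be compared on one walk. This is a small sketch; the seed and the names avg_loop and avg_cumsum are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nsteps = 1000
ys = np.cumsum(rng.standard_normal(nsteps))

# Loop version: re-sums the prefix on every iteration (quadratic work).
avg_loop = np.array([ys[:i + 1].sum() / (i + 1) for i in range(nsteps)])

# Vectorised version: one cumulative sum divided by the running count.
avg_cumsum = np.cumsum(ys) / np.arange(1, nsteps + 1)

print(np.allclose(avg_loop, avg_cumsum))
```

The loop does about nsteps**2 / 2 additions in total, while the cumulative sum does nsteps, which is where the speedup comes from.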

Efficient computation of a loop of integrals in Python

I was wondering how to speed up the following code in where I compute a probability function which involves numerical integrals and then I compute some confidence margins.
Some possibilities that I have thought about are Numba or vectorization of the code
EDIT:
I have made minor modifications because there was a mistake. I am looking for some modifications that provide major time improvements (I know that there are some minor changes that would provide some minor time improvements, such as repeated functions, but I am not concerned about them)
The code is:
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 26 17:05:46 2021
@author: Ignacio
"""
import numpy as np
from scipy.integrate import simps
def pdf(V,alfa_points):
    alfa=np.linspace(0,2*np.pi,alfa_points)
    return simps(1/np.sqrt(2*np.pi)/np.sqrt(sigma_R2)*np.exp(-(V*np.cos(alfa)-eR)**2/2/sigma_R2)*1/np.sqrt(2*np.pi)/np.sqrt(sigma_I2)*np.exp(-(V*np.sin(alfa)-eI)**2/2/sigma_I2),alfa)
def find_nearest(array,value):
    array=np.asarray(array)
    idx = (np.abs(array-value)).argmin()
    return array[idx]
N = 20
n=np.linspace(0,N-1,N)
d=1
sigma_An=0.1
sigma_Pn=0.2
An=np.ones(N)
Pn=np.zeros(N)
Vs=np.linspace(0,30,1000)
inc=np.max(Vs)/len(Vs)
th=np.linspace(0,np.pi/2,250)
R=np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I=np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
fmin=np.zeros(len(th))
fmax=np.zeros(len(th))
for tt in range(len(th)):
    eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
    eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
    sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    PDF=np.zeros(len(Vs))
    for vv in range(len(Vs)):
        PDF[vv]=pdf(Vs[vv],100)
    total=simps(PDF,Vs)
    values=np.cumsum(PDF)*inc/total
    xval_05=find_nearest(values,0.05)
    fmin[tt]=Vs[values==xval_05]
    xval_95=find_nearest(values,0.95)
    fmax[tt]=Vs[values==xval_95]
This version's speedup: 31x
A simple profiling (%%prun) reveals that most of the time is spent in simps.
You are in control of the integration done in pdf(): for example, you can use the trapezoidal rule ("trapeze" below) instead of Simpson's rule with negligible numerical difference if you slightly increase the resolution of alpha. In fact, the finer sampling of alpha more than makes up for the difference between simps and trapeze (see the note at the bottom for why). This is by far the biggest speedup. We go one step further by implementing the trapeze method ourselves instead of using scipy, since it is so simple. This alone yields a marginal gain, but it opens the door to a more drastic optimization (see pdf2D below).
Also, the remaining simps(PDF, ...) goes faster when it knows that the dx step is constant, so we can just say so instead of passing the whole Vs array.
You can avoid doing the loop to compute PDF and use np.vectorize(pdf) directly on Vs, or better (as in the code below), do a 2-D version of that calculation.
There are some other minor things (such as using an index directly fmin[tt] = Vs[closest(values, 0.05)] instead of finding the index, returning the value, and then using a boolean mask for where values == xval_05), or taking all the constants (including alpha) outside functions and avoid recalculating every time.
The changes above give us a 5.2x improvement. There are a number of things I don't understand in your code, e.g. why have An (ones) and Pn (zeros)?
But, importantly, another ~6x speedup comes from the observation that, since we are implementing our own trapeze method by using numpy primitives, we can actually do it in 2D in one go for the whole PDF.
The final speed up of the code below is 31x. I believe that a better understanding of "the big picture" of what you want to do would yield additional, perhaps substantial, speed gains.
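The 2-D trapeze idea can be checked in isolation before applying it to the real integrand. The sketch below uses a made-up integrand and made-up names (params, integral_2d, integral_loop); it only shows that summing a 2-D array along the integration axis and subtracting half of the end points gives the same result as applying the trapezoidal rule to each column separately.

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)       # integration variable (rows)
dx = x[1] - x[0]                              # constant spacing
params = np.linspace(0.5, 3.0, 7)             # one integrand per column

# y[i, j] = f(x[i]; params[j]); any smooth integrand would do here.
y = np.exp(-np.outer(np.cos(x), params) ** 2)

# Trapezoidal rule on every column at once: full sum minus half the end points.
integral_2d = (y.sum(axis=0) - (y[0] + y[-1]) / 2) * dx

# Reference: the same rule applied one column at a time.
integral_loop = np.array([
    np.sum((col[1:] + col[:-1]) / 2) * dx for col in y.T
])
```

Because the whole array is reduced in one call, the Python-level loop over the second variable disappears, which is the source of the extra speedup described above.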
Modified code:
import numpy as np
from scipy.integrate import simps
alpha_points = 200 # more points as we'll use trapeze not simps
alpha = np.linspace(0, 2*np.pi, alpha_points)
cosalpha = np.cos(alpha)
sinalpha = np.sin(alpha)
d_alpha = np.mean(np.diff(alpha)) # constant dx
coeff = 1 / np.sqrt(2*np.pi)
Vs=np.linspace(0,30,1000)
d_Vs = np.mean(np.diff(Vs)) # constant dx
inc=np.max(Vs)/len(Vs)
def f2D(Vs, eR, sigma_R2, eI, sigma_I2):
    a = coeff / np.sqrt(sigma_R2)
    b = coeff / np.sqrt(sigma_I2)
    y = a * np.exp(-(np.outer(cosalpha, Vs) - eR)**2 / 2 / sigma_R2) * b * np.exp(-(np.outer(sinalpha, Vs) - eI)**2 / 2 / sigma_I2)
    return y
def pdf2D(Vs, eR, sigma_R2, eI, sigma_I2):
    y = f2D(Vs, eR, sigma_R2, eI, sigma_I2)
    s = y.sum(axis=0) - (y[0] + y[-1]) / 2 # our own impl of trapeze, on 2D y
    return s * d_alpha
def closest(a, val):
    return np.abs(a - val).argmin()
N = 20
n = np.linspace(0,N-1,N)
d = 1
sigma_An = 0.1
sigma_Pn = 0.2
An=np.ones(N)
Pn=np.zeros(N)
th = np.linspace(0,np.pi/2,250)
R = np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I = np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
fmin=np.zeros(len(th))
fmax=np.zeros(len(th))
for tt in range(len(th)):
    eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
    eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
    sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    PDF=pdf2D(Vs, eR, sigma_R2, eI, sigma_I2)
    total = simps(PDF, dx=d_Vs)
    values = np.cumsum(PDF) * inc / total
    fmin[tt] = Vs[closest(values, 0.05)]
    fmax[tt] = Vs[closest(values, 0.95)]
Note: most of the fmin and fmax values pass np.allclose() against the original function, but some of them show a small difference. After some digging, it turns out that the implementation here is the more precise one, because the integrand f() can be quite abrupt, so more alpha points actually help (and more than compensate for the minuscule loss of precision from using trapeze instead of Simpson).
For example, at index tt=244, vv=400:
Considering several methods, the one that provides the largest time improvement is the Numba method. The method proposed by Pierre is very interesting and does not require installing other packages, which is an asset.
However, in the examples that I have computed, the time improvement is not as large as with the Numba example, especially when the number of points in th grows to a few tens of thousands (which is my actual case). I post the Numba code here in case someone is interested:
import numpy as np
from numba import njit
@njit
def margins(val_min,val_max):
    fmin=np.zeros(len(th))
    fmax=np.zeros(len(th))
    for tt in range(len(th)):
        eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
        eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
        sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
        sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
        Vs=np.linspace(0,30,1000)
        inc=np.max(Vs)/len(Vs)
        integration_points=200
        PDF=np.zeros(len(Vs))
        for vv in range(len(Vs)):
            PDF[vv]=np.trapz(1/np.sqrt(2*np.pi)/np.sqrt(sigma_R2)*np.exp(-(Vs[vv]*np.cos(np.linspace(0,2*np.pi,integration_points))-eR)**2/2/sigma_R2)*1/np.sqrt(2*np.pi)/np.sqrt(sigma_I2)*np.exp(-(Vs[vv]*np.sin(np.linspace(0,2*np.pi,integration_points))-eI)**2/2/sigma_I2),np.linspace(0,2*np.pi,integration_points))
        total=np.trapz(PDF,Vs)
        values=np.cumsum(PDF)*inc/total
        idx = (np.abs(values-val_min)).argmin()
        xval_05=values[idx]
        fmin[tt]=Vs[np.where(values==xval_05)[0][0]]
        idx = (np.abs(values-val_max)).argmin()
        xval_95=values[idx]
        fmax[tt]=Vs[np.where(values==xval_95)[0][0]]
    return fmin,fmax
N = 20
n=np.linspace(0,N-1,N)
d=1
sigma_An=1/2**6
sigma_Pn=2*np.pi/2**6
An=np.ones(N)
Pn=np.zeros(N)
th=np.linspace(0,np.pi/2,250)
R=np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I=np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
F=R+1j*I
Fa=np.abs(F)/np.max(np.abs(F))
fmin, fmax = margins(0.05,0.95)

How to make my python integration faster?

Hi, I want to integrate a function from 0 to several different upper limits (around 1000). I have written a piece of code to do this using a for loop and appending each value to an empty array. However, I realise I could make the code faster by doing smaller integrals and then adding the previous integral result to the one just calculated. So I would be doing the same number of integrals, but over a smaller interval, then just adding the previous integral to get the integral from 0 to that upper limit. Here's my code at the moment:
import numpy as np #importing all relevant modules and functions
from scipy.integrate import quad
import pylab as plt
import datetime
t0=datetime.datetime.now() #initial time
num=np.linspace(0,10,num=1000) #setting up array of values for t
Lv=np.array([]) #empty array that values for L(t) are appended to
def L(t): #defining function for L
    return np.cos(2*np.pi*t)
for g in num: #setting up for loop to do integrals for L at the different values for t
    Lval,x=quad(L,0,g) #using the quad function to get the values for L. quad takes the function, where to start the integral from, where to end the integration
    Lv=np.append(Lv,[Lval]) #appending the different values for L at different values for t
What changes do I need to make to do the optimisation technique I've suggested?
Basically, we need to keep track of the previous values of Lval and g. 0 is a good initial value for both, since we want to start by adding 0 to the first integral, and 0 is the start of the interval. You can replace your for loop with this:
last, lastG = 0, 0
for g in num:
    Lval,x = quad(L, lastG, g)
    last, lastG = last + Lval, g
    Lv=np.append(Lv,[last])
In my testing, this was noticeably faster.
As @askewchan points out in the comments, this is even faster:
Lv = []
last, lastG = 0, 0
for g in num:
    Lval,x = quad(L, lastG, g)
    last, lastG = last + Lval, g
    Lv.append(last)
Lv = np.array(Lv)
Using this function:
scipy.integrate.cumtrapz
I was able to reduce the run time to below the resolution of the timer (the measured time is effectively zero).
The function does exactly what you are asking for in a highly efficient manner (recent SciPy releases name it scipy.integrate.cumulative_trapezoid). See docs for more info: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.integrate.cumtrapz.html
The following code reproduces your version first and then mine:
# Module Declarations
import numpy as np
from scipy.integrate import quad
from scipy.integrate import cumtrapz
import time
# Initialise Time Array
num=np.linspace(0,10,num=1000)
# Your Method
t0 = time.time()
Lv=np.array([])
def L(t):
    return np.cos(2*np.pi*t)
for g in num:
    Lval,x=quad(L,0,g)
    Lv=np.append(Lv,[Lval])
t1 = time.time()
print(t1-t0)
# My Method
t2 = time.time()
functionValues = L(num)
Lv_Version2 = cumtrapz(functionValues, num, initial=0)
t3 = time.time()
print(t3-t2)
Which consistently yields:
t1-t0 = O(0.1) seconds
t3-t2 = 0 seconds
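What the cumulative trapezoid computes can be written out with NumPy alone, which also makes it easy to see how accurate it is on this particular integrand. This is a sketch of the idea, not SciPy's implementation; the names segment_areas, cumulative and exact are made up, and the closed form sin(2*pi*t)/(2*pi) is used only as a reference.

```python
import numpy as np

def L(t):
    return np.cos(2 * np.pi * t)

t = np.linspace(0, 10, num=1000)
values = L(t)

# Area of each trapezoid between neighbouring samples, then a running total.
segment_areas = (values[1:] + values[:-1]) / 2 * np.diff(t)
cumulative = np.concatenate(([0.0], np.cumsum(segment_areas)))

# Closed form of the integral of cos(2*pi*t) from 0 to t, for comparison.
exact = np.sin(2 * np.pi * t) / (2 * np.pi)
max_error = np.max(np.abs(cumulative - exact))
print(max_error)
```

The trade-off is accuracy: quad adapts its sampling to the integrand, while the trapezoid sum is only as good as the fixed grid in t, so the grid has to be fine enough for the function being integrated.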

Rewriting a for loop in pure NumPy to decrease execution time

I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100 for me!
However, calculation of the B value is actually nested within a few other loops, because it is evaluated at a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure?
I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would not be possible to report back to the user on the progress of the calculation, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications? Is it possible to circumvent any of these?
import numpy as np
import time
def reshape_vector(v):
    b = np.empty((3,1))
    for i in range(3):
        b[i][0] = v[i]
    return b
def unit_vectors(r):
    return r / np.sqrt((r*r).sum(0))
def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7
    num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
    den = np.sqrt(np.sum(relative*relative, 0))**3
    B = np.sum(num/den, 1)
    return B
N = 20000 # number of dipoles
r_i = np.random.random((3,N)) # positions of dipoles
mom_i = np.random.random((3,N)) # moments of dipoles
a = np.random.random((3,3)) # three basis vectors for this crystal
n = [10,10,10] # points at which to evaluate sum
gamma_mu = 135.5 # a constant
t_start = time.clock()
for i in range(n[0]):
    r_frac_x = np.float(i)/np.float(n[0])
    r_test_x = r_frac_x * a[0]
    for j in range(n[1]):
        r_frac_y = np.float(j)/np.float(n[1])
        r_test_y = r_frac_y * a[1]
        for k in range(n[2]):
            r_frac_z = np.float(k)/np.float(n[2])
            r_test = r_test_x +r_test_y + r_frac_z * a[2]
            r_test_fast = reshape_vector(r_test)
            B = calculate_dipole(r_test_fast, r_i, mom_i)
            omega = gamma_mu*np.sqrt(np.dot(B,B))
            # write r_test, B and omega to a file
    frac_done = np.float(i+1)/(n[0]+1)
    t_elapsed = (time.clock()-t_start)
    t_remain = (1-frac_done)*t_elapsed/frac_done
    print frac_done*100,'% done in',t_elapsed/60.,'minutes...approximately',t_remain/60.,'minutes remaining'
One obvious thing you can do is replace the line
r_test_fast = reshape_vector(r_test)
with
r_test_fast = r_test.reshape((3,1))
Probably won't make any big difference in performance, but in any case it makes sense to use the numpy builtins instead of reinventing the wheel.
Generally speaking, as you probably have noticed by now, the trick with optimizing numpy is to express the algorithm with the help of numpy whole-array operations or at least with slices instead of iterating over each element in python code. What tends to prevent this kind of "vectorization" is so-called loop-carried dependencies, i.e. loops where each iteration is dependent on the result of a previous iteration. Looking briefly at your code, you have no such thing, and it should be possible to vectorize your code just fine.
EDIT: One solution
I haven't verified this is correct, but should give you an idea of how to approach it.
First, take the cartesian() function, which we'll use. Then
def calculate_dipole_vect(mus, r_i, mom_i):
    # Treat each mu sequentially
    Bs = []
    omega = []
    for mu in mus:
        rel = mu - r_i
        r_norm = np.sqrt((rel * rel).sum(1))
        r_unit = rel / r_norm[:, np.newaxis]
        A = 1e-7
        # per-dipole dot product: with the transposed (N, 3) arrays the sum runs over axis 1
        m_dot_r = np.sum(mom_i * r_unit, 1)[:, np.newaxis]
        num = A*(3*m_dot_r*r_unit - mom_i)
        den = r_norm ** 3
        B = np.sum(num / den[:, np.newaxis], 0)
        Bs.append(B)
        omega.append(gamma_mu * np.sqrt(np.dot(B, B)))
    return Bs, omega
# Transpose to get more "natural" ordering with row-major numpy
r_i = r_i.T
mom_i = mom_i.T
t_start = time.clock()
r_frac = cartesian((np.arange(n[0]) / float(n[0]),
np.arange(n[1]) / float(n[1]),
np.arange(n[2]) / float(n[2])))
r_test = np.dot(r_frac, a)
B, omega = calculate_dipole_vect(r_test, r_i, mom_i)
print 'Total time for vectorized: %f s' % (time.clock() - t_start)
Well, in my testing, this is in fact slightly slower than the loop-based approach I started from. The thing is, in the original version in the question, it was already vectorized with whole-array operations over arrays of shape (20000, 3), so any further vectorization doesn't really bring much further benefit. In fact, it may worsen the performance, as above, maybe due to big temporary arrays.
If you profile your code, you'll see that 99% of the running time is in calculate_dipole so reducing the time for this looping really won't give a noticeable reduction in execution time. You still need to focus on calculate_dipole if you want to make this faster. I tried my Cython code for calculate_dipole on this and got a reduction by about a factor of 2 in the overall time. There might be other ways to improve the Cython code too.
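The cartesian() helper used in the vectorized attempt is not shown in this thread. A minimal stand-in, assuming it is meant to return every combination of its input arrays as rows, can be built from np.meshgrid. The sketch below is hypothetical in that sense; it also checks that the grid positions computed in one matrix product match the ones produced by the triple loop from the question (on a small grid, with Python 3 division).

```python
import numpy as np

def cartesian(arrays):
    """Hypothetical stand-in: every combination of the inputs, one per row."""
    grids = np.meshgrid(*arrays, indexing="ij")
    return np.stack([g.ravel() for g in grids], axis=1)

rng = np.random.default_rng(0)
a = rng.random((3, 3))                        # three basis vectors, as in the question
n = [4, 3, 2]                                 # small grid to keep the check quick

r_frac = cartesian([np.arange(k) / k for k in n])   # fractional coordinates, shape (24, 3)
r_test = r_frac @ a                                 # all grid positions at once

# Reference: the triple loop from the question, collected in the same order.
looped = np.array([
    (i / n[0]) * a[0] + (j / n[1]) * a[1] + (k / n[2]) * a[2]
    for i in range(n[0]) for j in range(n[1]) for k in range(n[2])
])
```

This only removes the bookkeeping of the three outer loops. As noted above, almost all of the run time is inside calculate_dipole, so it should not be expected to change the total time much.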
