I'm trying to improve the time efficiency of part of my script, but I have run out of ideas. I ran the following script in both Matlab and Python, and the Matlab implementation is four times faster than the Python one. Any idea how to improve it?
Python:
import time
import numpy as np
def ComputeGradient(X, y, theta, alpha):
    m = len(y)
    factor = alpha / m
    h = np.dot(X, theta)
    theta = [theta[i] - factor * sum((h-y) * X[:,i]) for i in [0,1]]
    #Also tried this but with worse performances
    #diff = np.tile((h-y)[:, np.newaxis],2)
    #theta = theta - factor * sum(diff * X)
    return theta
if __name__ == '__main__':
    data = np.loadtxt("data_LinReg.txt", delimiter=',')
    theta = [0, 0]
    alpha = 0.01
    X = data[:,0]
    y = data[:,1]
    X = np.column_stack((np.ones(len(y)), X))
    start_time = time.time()
    for i in range(0, 1500, 1):
        theta = ComputeGradient(X, y, theta, alpha)
    stop_time = time.time()
    print("--- %s seconds ---" % (stop_time - start_time))
--> 0.048s
Matlab:
data = load('data_LinReg.txt');
X = data(:, 1); y = data(:, 2);
m = length(y);
X = [ones(m, 1), data(:,1)]; % Add a column of ones to x
theta = zeros(2, 1);
iterations = 1500;
alpha = 0.01;
tic
for i = 1:1500
theta = gradientDescent(X, y, theta, alpha);
end
toc
function theta = gradientDescent(X, y, theta, alpha)
m = length(y); % number of training examples
h = X * theta;
t1 = theta(1) - alpha * sum(X(:,1).*(h-y)) / m;
t2 = theta(2) - alpha * sum(X(:,2).*(h-y)) / m;
theta = [t1; t2];
end
--> 0.01s
[EDIT] : solution avenue
One possible avenue is to use NumPy's vectorized functions instead of Python's built-in functions. In the code above, replacing sum with np.sum brings the timing closer to Matlab (0.019s instead of 0.048s).
Furthermore, I separately tested the functions on vectors (np.dot, np.sum, and the * product), and all of them seem to be faster (much faster in some cases) than their Matlab equivalents. So I wonder why the full script is still slower in Python.
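As a sketch of the vectorization idea above, the two per-column sums can be collapsed into a single matrix product, X.T @ (X @ theta - y), with theta kept as a NumPy array instead of a Python list. The name compute_gradient_vec and the synthetic data (a stand-in for data_LinReg.txt) are assumptions made for illustration, and no timing is claimed here since that depends on the machine.

```python
import numpy as np

def compute_gradient_loop(X, y, theta, alpha):
    """Formulation from the question: one built-in sum() per column of X."""
    factor = alpha / len(y)
    h = np.dot(X, theta)
    return [theta[i] - factor * sum((h - y) * X[:, i]) for i in range(X.shape[1])]

def compute_gradient_vec(X, y, theta, alpha):
    """Same update, with both column sums done by one matrix product."""
    factor = alpha / len(y)
    return theta - factor * (X.T @ (X @ theta - y))

# Synthetic stand-in for data_LinReg.txt (assumption: 100 rows, one feature).
rng = np.random.default_rng(0)
x = rng.uniform(5.0, 10.0, size=100)
y = 1.2 * x - 4.0 + rng.normal(0.0, 1.0, size=100)
X = np.column_stack((np.ones_like(x), x))

theta_loop = [0.0, 0.0]
theta_vec = np.zeros(2)
for _ in range(1500):
    theta_loop = compute_gradient_loop(X, y, theta_loop, 0.01)
    theta_vec = compute_gradient_vec(X, y, theta_vec, 0.01)

print(theta_loop, theta_vec)  # the two results should agree to rounding error
```

Wrapping either function in timeit on the real data would show whether the single matrix product helps in this particular setup.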
This solution presents an optimized MATLAB implementation that does the following -
Function inlining of the gradient-descent implementation.
Pre-computation of certain values that are repeatedly used inside the loop.
Code -
data = load('data_LinReg.txt');
iterations = 1500;
alpha = 0.01;
m = size(data,1);
M = alpha/m; %// scaling factor
%// Pre-compute certain values that are repeatedly used inside the loop
sum_a = M*sum(data(:,1));
sum_p = M*sum(data(:,2));
sum_ap = M*sum(data(:,1).*data(:,2));
sum_sqa = M*sum(data(:,1).^2);
one_minus_alpha = 1 - alpha;
one_minus_sum_sqa = 1 - sum_sqa;
%// Start processing
t1n0 = 0;
t2n0 = 0;
for i = 1:iterations
temp = t1n0*one_minus_alpha - t2n0*sum_a + sum_p;
t2n0 = t2n0*one_minus_sum_sqa - t1n0*sum_a + sum_ap;
t1n0 = temp;
end
theta = [t1n0;t2n0];
Quick tests show that this presents an appreciable speedup over the MATLAB code posted in the question.
Now, I am not too familiar with python, but I would assume that this MATLAB code could be easily ported to python.
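A possible Python port of the MATLAB code above, as a sketch only: the function names and the synthetic data are assumptions, and the reference implementation is included purely to check that the scalar recurrence gives the same theta as the matrix form.

```python
import numpy as np

def gradient_descent_precomputed(a, p, alpha, iterations):
    """Scalar recurrence: every sum over the data is computed once, up front."""
    M = alpha / len(a)                      # scaling factor
    sum_a = M * np.sum(a)
    sum_p = M * np.sum(p)
    sum_ap = M * np.sum(a * p)
    sum_sqa = M * np.sum(a ** 2)
    t1, t2 = 0.0, 0.0
    for _ in range(iterations):
        # Tuple assignment evaluates both right-hand sides with the old t1, t2.
        t1, t2 = (t1 * (1 - alpha) - t2 * sum_a + sum_p,
                  t2 * (1 - sum_sqa) - t1 * sum_a + sum_ap)
    return np.array([t1, t2])

def gradient_descent_direct(a, p, alpha, iterations):
    """Reference: the usual update theta - alpha/m * X'(X*theta - y)."""
    X = np.column_stack((np.ones_like(a), a))
    theta = np.zeros(2)
    for _ in range(iterations):
        theta = theta - (alpha / len(p)) * (X.T @ (X @ theta - p))
    return theta

# Synthetic stand-in for data_LinReg.txt (assumption).
rng = np.random.default_rng(1)
a = rng.uniform(5.0, 10.0, size=100)
p = 1.2 * a - 4.0 + rng.normal(0.0, 1.0, size=100)

theta_fast = gradient_descent_precomputed(a, p, 0.01, 1500)
theta_ref = gradient_descent_direct(a, p, 0.01, 1500)
print(theta_fast, theta_ref)
```

Inside the loop only a handful of scalar multiplications remain, so the cost per iteration no longer depends on the number of data rows.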
I don't know how much of a difference it will make, but you can simplify your function with something like:
s = alpha / size(X,1);
gradientDescent = @(theta)( theta - s * X' * (X*theta - y) );
Since you need theta_{i} in order to find theta_{i+1}, I don't see any way to avoid the loop.
I am in the process of converting some code from Python into Matlab. I have code working that produces the same results, but I am wondering if there may be a way to vectorize some of the for loops in the Matlab code, as it takes a long time to run. x is an Nxd matrix, diff is an NxNxd tensor, kxy is an NxN matrix, gradK is an NxNx2 tensor, and sumkxy, dxkxy, and obj are all Nxd matrices.
Here is the original Python Code:
diff = x[:, None, :] - x[None, :, :] # D_{ij, s}
kxy = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2)) / np.power(np.pi * 2.0 * h * h, d / 2) # -1 last dimension K_{ij}
gradK = -diff * kxy[:, :, None] / h ** 2 # N * N * 2
sumkxy = np.sum(kxy, axis=1)
dxkxy = np.sum(gradK, axis=1) # N * 2 sum_{i} d_i K_{ij, s}
obj = np.sum(gradK / sumkxy[None, :, None], axis=1) # N * 2
and here is my initial Matlab Code with all the for loops:
diff = zeros([n,n,d]);
for i = 1:n
for j = 1:n
for k = 1:d
diff(i,j,k) = x(i,k) - x(j,k);
end
end
end
kxy = exp(-sum(diff.^2, 3)/(2*h^2))/((2*pi*h^2)^(d/2));
sumkxy = sum(kxy,2);
gradK = zeros([n,n,d]);
for i = 1:n
for j = 1:n
for k = 1:d
gradK(i,j,k) = -diff(i,j,k)*kxy(i, j)/h^2;
end
end
end
dxkxy = squeeze(sum(gradK,2));
a = zeros([n,n,d]);
for i =1:n
for j = 1:n
for k = 1:d
a(i,j,k) = gradK(i,j,k)/sumkxy(i);
end
end
end
obj = squeeze(sum(a, 2));
I know that a faster way to calculate the kxy term is to use the following code:
XY = x*x';
x2= sum(x.^2, 2);
X2e = repmat(x2, 1, n);
H = (X2e + X2e' - 2*XY); % calculate pairwise distance
Kxy = exp(-H/(2*h^2))/((2*pi*h*h)^(d/2));
But then I struggle to find a way to calculate gradK efficiently without diff. Any help or suggestions would be greatly appreciated!
If your goal is to compute obj, you don't even need to compute gradK and a:
sx = sum(x.^2, 2);
H = sx - 2*x*x.' + sx.';
kxy = exp(-H/(2*h^2))/((2*pi*h^2)^(d/2));
kh = kxy / h^2;
sumkxy = sum(kxy, 2);
khs = kh ./ sumkxy;
obj = khs * x - sum(khs, 2) .* x;
gradK and dif can be computed this way:
dif = reshape(x, n, 1, d) - reshape(x, 1, n, d);
gradK = -dif .* (kxy / h^2);
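A NumPy sketch for checking the matrix form against the explicit NxNxd tensors on small random data. It follows the Matlab code above, which normalizes row i by sumkxy(i); the sizes n, d and the bandwidth h are arbitrary values chosen for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 6, 2, 0.7                          # small, arbitrary test sizes
x = rng.normal(size=(n, d))

# Reference: explicit N x N x d tensors, as in the Matlab loops.
diff = x[:, None, :] - x[None, :, :]
kxy = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (d / 2)
gradK = -diff * kxy[:, :, None] / h ** 2
sumkxy = kxy.sum(axis=1)
obj_ref = np.sum(gradK / sumkxy[:, None, None], axis=1)   # row i divided by sumkxy[i]

# Matrix form: only N x N intermediates are created.
sx = np.sum(x ** 2, axis=1)
H = sx[:, None] - 2 * x @ x.T + sx[None, :]               # pairwise squared distances
kxy2 = np.exp(-H / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (d / 2)
khs = (kxy2 / h ** 2) / kxy2.sum(axis=1, keepdims=True)
obj = khs @ x - khs.sum(axis=1, keepdims=True) * x

print(np.allclose(obj, obj_ref))
```

The identity behind the last line is that gradK(i,j,:) equals (x_j - x_i) * kxy(i,j) / h^2, so summing over j splits into a matrix product minus a row sum times x_i.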
I like to try approaching problems like these by breaking it down into "subcomponents" with some bogus data that will execute quickly and that you can use to test the code functionality. The first subcomponent you might start with is your first nested loop calculating diff:
n = 100;
d = 50;
x = round(100*rand(n,d));
tic
diff = zeros([n,n,d]);
for i = 1:n
for j = 1:n
for k = 1:d
diff(i,j,k) = x(i,k) - x(j,k);
end
end
end
toc
First, consider the innermost loop on its own:
...
for k = 1:d
diff(i,j,k) = x(i,k) - x(j,k);
end
...
Looking at this loop (at least for me!) simplifies things greatly. To vectorize just this "subcomponent" we could write something like:
diff(i,j,:) = x(i,:) - x(j,:);
Now that the low-hanging fruit is out of the way, let's consider the next layer of the loop. Does the same trick as before work?
diff(i,:,:) = x(i,:) - x; % where x(:,:) can just be written as x.
If you aren't sure, you can check this by running both the nested loop version and the one above with the same (emphasis on same) bogus data and checking if they are equal using isequal(). To cut to the chase, it should come out the same and now your original loop is down to:
tic
diff = zeros([n,n,d]);
for i = 1:n
diff(i,:,:) = x(i,:) - x;
end
toc
For this final bit, you can exploit MATLAB's array reshaping/permuting functions. Look up the documentation for reshape() or permute() for more details. In brief, if you permute one copy of x from Nxd to Nx1xd and another to 1xNxd, subtracting one from the other expands both along their singleton dimensions and performs the subtraction elementwise, giving an NxNxd result (this relies on implicit expansion, available since MATLAB R2016b). So for example:
diff = permute(x,[1,3,2]) - permute(x,[3,1,2]); % this is Nx1xd - 1xNxd
should effectively compute the tensor difference you were looking for in the first loop!
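The same step-by-step check can be sketched in NumPy, where the counterpart of the permute trick is indexing with None. The sizes are arbitrary, and integer-valued bogus data is used so that the three versions can be compared exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
x = np.round(100 * rng.random((n, d)))      # bogus integer-valued data

# Step 0: the original triple loop.
diff_loop = np.zeros((n, n, d))
for i in range(n):
    for j in range(n):
        for k in range(d):
            diff_loop[i, j, k] = x[i, k] - x[j, k]

# Step 1: the two inner loops replaced by one row-wise broadcast.
diff_rows = np.zeros((n, n, d))
for i in range(n):
    diff_rows[i, :, :] = x[i, :] - x

# Step 2: no loops, (n, 1, d) minus (1, n, d) broadcasts to (n, n, d).
diff_bcast = x[:, None, :] - x[None, :, :]

print(np.array_equal(diff_loop, diff_rows), np.array_equal(diff_loop, diff_bcast))
```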
I can expand this answer to show how the other loops might be worked out if you want, but give the other ones a try first with this same logic. Hopefully, you can keep diff and then calculate kxy much faster. Without knowing how big your original matrices are, I can't say how much speedup you should expect though.
Update:
I should add: to make sure that your multiplication, division and transpose operations are elementwise, put a '.' before each operator. E.g.
gradK(i,j,:) = -diff(i,j,:).*kxy(i, j)/h^2;
For more information, look up elementwise operations in Matlab.
I've been using numpy.linalg.solve(A,B) to solve a linear equation. In my case: A is about 10,000x10,000 and B is around 10,000x5. If I initialize A and B randomly using:
A = numpy.random.rand(10000,10000)
B = numpy.random.rand(10000,5)
Then the computation time is <3 seconds. However, in my program that needs to solve this equation, the computation time is consistently about 14 seconds. This code runs inside a loop, so a speedup of almost 5 times is a big deal. Shouldn't the solve time of linalg.solve() be roughly constant for arrays of constant size?
Both implementations are using float64, and there is definitely enough RAM (128 GB). I tried updating the BLAS libraries on the computer (Ubuntu 16.04, numpy installed with conda), and this improved the solve time for the randomly generated data, but not for the data in my program.
If anyone is looking for more specifics about the code, this is from line 27 of the .py file found at [1]. This program is doing registration of point clouds.
Any thoughts or help would be greatly appreciated.
[1]https://github.com/siavashk/pycpd/blob/master/pycpd/deformable_registration.py
Edit:
To try and make this more reproducible, I've generated some code to try and get me to the suspect np.linalg.solve() line:
import numpy as np
import time
def gaussian_kernel(Y, beta):
    diff = Y[None,:,:] - Y[:,None,:]
    diff = diff**2
    diff = np.sum(diff, axis=2)
    return np.exp(-diff / (2 * beta**2))
def initialize_sigma2(X, Y):
    diff = X[None,:,:] - Y[:,None,:]
    err = diff**2
    return np.sum(err) / (X.shape[0] * Y.shape[0] * X.shape[1])
alpha = 0.1
beta = 3
X = np.random.rand(10000,5) * 100
Y = X + X*0.1
N, D = X.shape
M, _ = Y.shape
G = gaussian_kernel(Y, beta)
sigma2 = initialize_sigma2(X,Y)
TY = Y + np.random.rand(10000,5)
P = np.sum((X[None,:,:] - TY[:,None,:])**2, axis=2)
P /= np.sum(P,axis=0)
P1 = np.sum(P, axis=1)
Np = np.sum(P1)
A = np.dot(np.diag(P1), G) + alpha * sigma2 * np.eye(M)
B = np.dot(P, X) - np.dot(np.diag(P1), Y)
%time W = np.linalg.solve(A,B)
However, this code does not reproduce the time lag. Everything here is the same as in the actual script except the creation of the X and Y arrays. These should be two point clouds that are roughly close to one another in 3D space, which is why I have created one based on the other.
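Since the synthetic arrays do not reproduce the lag, one way to narrow it down is to compare properties of the real A against the synthetic one. The sketch below is a hypothetical diagnostic helper (describe_matrix and time_solve are names invented here), and whether any of these properties explains the 14-second case is an assumption, not something established in this post. One hypothesis worth testing is that a Gaussian kernel over widely separated points underflows to subnormal values, which many CPUs process more slowly.

```python
import time
import numpy as np

def describe_matrix(A):
    """Collect properties that can differ between two equally sized arrays."""
    tiny = np.finfo(A.dtype).tiny            # smallest positive normal float
    absA = np.abs(A)
    return {
        "dtype": str(A.dtype),
        "c_contiguous": bool(A.flags["C_CONTIGUOUS"]),
        "all_finite": bool(np.isfinite(A).all()),
        "n_subnormal": int(np.count_nonzero((absA > 0) & (absA < tiny))),
    }

def time_solve(A, B):
    """Return the solution and the wall-clock time of one np.linalg.solve call."""
    start = time.perf_counter()
    W = np.linalg.solve(A, B)
    return W, time.perf_counter() - start

# Small illustrative matrices; in practice A and B would come from the real script.
rng = np.random.default_rng(0)
A_random = rng.random((200, 200)) + 200 * np.eye(200)
A_kernel = A_random.copy()
A_kernel[0, 1] = 1e-310                      # a subnormal float64 value
B = rng.random((200, 5))

print(describe_matrix(A_random))
print(describe_matrix(A_kernel))
W, seconds = time_solve(A_random, B)
print(seconds)
```

Running describe_matrix on the A built from the real point clouds and on the A built from random data would show whether the two differ in anything other than their values.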
I'm hoping to get some help parallelising my Python code. I've been struggling with it for a while and run into errors whichever way I try. Currently the code takes about 2-3 hours to complete. The code is given below:
import numpy as np
from scipy.constants import Boltzmann as kb, elementary_charge as e
import multiprocessing
from functools import partial
Tc = 9.2
x = []
g= []
def Delta(T):
    '''
    Delta(T) takes a temperature as an input and calculates a
    temperature dependent variable based on Tc which is defined as a
    global parameter
    '''
    d0 = (np.pi/1.78)*kb*Tc
    D0 = d0*(np.sqrt(1-(T**2/Tc**2)))
    return D0
def element_in_sum(T, n, phi):
    D = Delta(T)
    matsubara_frequency = (np.pi * kb * T) * (2*n + 1)
    factor_d = np.sqrt((D**2 * np.cos(phi/2)**2) + matsubara_frequency**2)
    element = ((2 * D * np.cos(phi/2))/ factor_d) * np.arctan((D * np.sin(phi/2))/factor_d)
    return element
def sum_elements(T, M, phi):
    '''
    sum_elements(T,M,phi) is the most computationally heavy part
    of the calculations, the larger the M value the more accurate the
    results are.
    T: temperature
    M: number of steps for matrix calculation the larger the more accurate the calculation
    phi: The phase of the system can be between 0- pi
    '''
    X = list(np.arange(0,M,1))
    Y = [element_in_sum(T, n, phi) for n in X]
    return sum(Y)
def KO_1(M, T, phi):
    Iko1Rn = (2 * np.pi * kb * T /e) * sum_elements(T, M, phi)
    return Iko1Rn
def main():
    for j in range(1, 92):
        T = 0.1*j
        for i in range(1, 314):
            phi = 0.01*i
            pool = multiprocessing.Pool()
            result = pool.apply_async(KO_1,args=(26000, T, phi,))
            g.append(result)
        pool.close()
        pool.join()
        A = max(g);
        x.append(A)
        del g[:]
My approach was to send the KO_1 function to a multiprocessing pool, but I get either a pickling error or a "too many files open" error. Any help is greatly appreciated, and if multiprocessing is the wrong approach I would love any guidance.
I haven't tested your code, but you can do several things to improve it.
First of all, don't create arrays unnecessarily. sum_elements creates three array-like objects when it could use a single generator: np.arange creates a numpy array, the list function then creates a list object, and the list comprehension creates yet another list. The function does far more work than it needs to.
The correct way to implement it (in python3) would be:
def sum_elements(T, M, phi):
    return sum(element_in_sum(T, n, phi) for n in range(0, M, 1))
If you use python2, replace range with xrange.
This tip will probably help you in any python script you'll write.
Also, try to utilize multiprocessing better. It seems what you need to do is to create a multiprocessing.Pool object once, and use the pool.map function.
The main function should look like this:
def job(args):
    i, j = args
    T = 0.1*j
    phi = 0.01*i
    return KO_1(26000, T, phi)
def main():
    pool = multiprocessing.Pool(processes=4) # You can change this number
    x = [max(pool.imap(job, ((i, j) for i in range(1, 314)))) for j in range(1, 92)]
    return x
Notice that I used a tuple in order to pass multiple arguments to job.
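A small self-contained sketch of the same pattern: one pool created once, one tuple per task, and the maximum over phi taken per temperature. cheap_stand_in is a hypothetical placeholder for KO_1 so that the example finishes quickly, and the grid sizes are deliberately tiny. All tasks are submitted in a single pool.map call and the flat result list is then cut into rows.

```python
import multiprocessing

def cheap_stand_in(M, T, phi):
    """Hypothetical placeholder for KO_1, cheap enough to run instantly."""
    return sum(1.0 / ((n + 1) ** 2 + T + phi) for n in range(M))

def job(args):
    # Pool.map passes a single argument, so (i, j) travels as one tuple.
    i, j = args
    return cheap_stand_in(50, 0.1 * j, 0.01 * i)

def max_per_temperature(pool, n_T, n_phi):
    """Return one maximum over phi for every temperature index j."""
    grid = [(i, j) for j in range(1, n_T) for i in range(1, n_phi)]
    flat = pool.map(job, grid)               # results come back in grid order
    row = n_phi - 1                          # number of phi values per temperature
    return [max(flat[k:k + row]) for k in range(0, len(flat), row)]

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    with multiprocessing.Pool(processes=2) as pool:
        x = max_per_temperature(pool, n_T=5, n_phi=8)
    print(x)
```

Because job is a module-level function and the pool is created only once, this avoids both the pickling of local objects and the accumulation of open pools described in the question.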
This is not an answer to the question, but if I may, I would propose how to speed up the code using simple numpy array operations. Have a look at the following code:
import numpy as np
from scipy.constants import Boltzmann as kb, elementary_charge as e
import time
Tc = 9.2
RAM = 4*1024**2 # memory budget in bytes (4 MiB as written; 4 GiB would be 4*1024**3)
def Delta(T):
    '''
    Delta(T) takes a temperature as an input and calculates a
    temperature dependent variable based on Tc which is defined as a
    global parameter
    '''
    d0 = (np.pi/1.78)*kb*Tc
    D0 = d0*(np.sqrt(1-(T**2/Tc**2)))
    return D0
def element_in_sum(T, n, phi):
    D = Delta(T)
    matsubara_frequency = (np.pi * kb * T) * (2*n + 1)
    factor_d = np.sqrt((D**2 * np.cos(phi/2)**2) + matsubara_frequency**2)
    element = ((2 * D * np.cos(phi/2))/ factor_d) * np.arctan((D * np.sin(phi/2))/factor_d)
    return element
def KO_1(M, T, phi):
    X = np.arange(M)[:,np.newaxis,np.newaxis]
    sizeX = int((float(RAM) / sum(T.shape))/sum(phi.shape)/8) #8byte
    i0 = 0
    Iko1Rn = 0. * T * phi
    while i0 < M: # the final slice may be shorter than sizeX
        print("X = %i" % i0)
        indices = slice(i0, i0+sizeX)
        Iko1Rn += (2 * np.pi * kb * T /e) * element_in_sum(T, X[indices], phi).sum(0)
        i0 += sizeX
    return Iko1Rn
def main():
    T = np.arange(0.1,9.2,0.1)[:,np.newaxis]
    phi = np.linspace(0,np.pi, 361)
    M = 26000
    result = KO_1(M, T, phi)
    return result, result.max()
T0 = time.time()
r, rmax = main()
print(time.time() - T0)
It runs in a bit more than 20 seconds on my PC. One has to be careful not to use too much memory, which is why there is still a loop, with a somewhat complicated construction, that processes only a slice of X at a time. If enough memory is available, the loop is not necessary.
One should also note that this is just the first step of speeding up. Much improvement could still be achieved using e.g. just-in-time compilation or Cython.
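The chunking idea can be sketched in isolation as follows. chunked_sum and the summand term are hypothetical names and formulas chosen only to show the pattern: the sum over n is accumulated in slices so that the largest temporary array has chunk x len(T) x len(phi) elements rather than M x len(T) x len(phi).

```python
import numpy as np

def chunked_sum(f, M, chunk):
    """Sum f(n) over n = 0..M-1 along axis 0, at most `chunk` values of n at a time."""
    total = 0.0
    for start in range(0, M, chunk):
        # min() makes the last slice shorter instead of skipping it.
        n = np.arange(start, min(start + chunk, M))[:, None, None]
        total = total + f(n).sum(axis=0)
    return total

T = np.arange(0.1, 1.0, 0.1)[:, None]        # column vector of temperatures
phi = np.linspace(0.0, np.pi, 7)             # row vector of phases

def term(n):
    # Hypothetical summand that broadcasts to shape (len(n), len(T), len(phi)).
    return np.cos(phi / 2) / ((2 * n + 1) ** 2 + T ** 2)

full = term(np.arange(1000)[:, None, None]).sum(axis=0)   # everything at once
pieces = chunked_sum(term, 1000, chunk=64)                # 64 values of n at a time

print(np.allclose(full, pieces))
```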
I am trying to implement a finite difference approximation to solve the Heat Equation, u_t = k * u_{xx}, in Python using NumPy.
Here is a copy of the code I am running:
## This program is to implement a Finite Difference method approximation
## to solve the Heat Equation, u_t = k * u_xx,
## in 1D w/out sources & on a finite interval 0 < x < L. The PDE
## is subject to B.C: u(0,t) = u(L,t) = 0,
## and the I.C: u(x,0) = f(x).
import numpy as np
import matplotlib.pyplot as plt
# parameters
L = 1 # length of the rod
T = 10 # terminal time
N = 10
M = 100
s = 0.25
# uniform mesh
x_init = 0
x_end = L
dx = float(x_end - x_init) / N
x = np.arange(x_init, x_end, dx)
x[0] = x_init
# time discretization
t_init = 0
t_end = T
dt = float(t_end - t_init) / M
t = np.arange(t_init, t_end, dt)
t[0] = t_init
# Boundary Conditions
for m in xrange(0, M):
    t[m] = m * dt
# Initial Conditions
for j in xrange(0, N):
    x[j] = j * dx
# definition of solution u(x,t) to u_t = k * u_xx
u = np.zeros((N, M+1)) # array to store values of the solution
# Finite Difference Scheme:
u[:,0] = x**2 #initial condition
for m in xrange(0, M):
    for j in xrange(1, N-1):
        if j == 1:
            u[j-1,m] = 0 # Boundary condition
        elif j == N-1:
            u[j+1,m] = 0
        else:
            u[j,m+1] = u[j,m] + s * ( u[j+1,m] -
                                      2 * u[j,m] + u[j-1,m] )
print u, #t, x
plt.plot(u, t)
#plt.show()
I think my code is working properly and it is producing an output. I want to plot the output of the solution u versus t (my time vector). If I can plot the graph then I am able to check if my numerical approximation agrees with the expected phenomena for the Heat Equation. However, I am getting the error that "x and y must have same first dimension". How can I correct this issue?
An additional question: am I better off attempting to make an animation with matplotlib.animation instead of using matplotlib.pyplot?
Thanks so much for any and all help! It is very greatly appreciated!
Okay so I had a "brain dump" and tried plotting u vs. t sort of forgetting that u, being the solution to the Heat Equation (u_t = k * u_{xx}), is defined as u(x,t) so it has values for time. I made the following correction to my code:
print u #t, x
plt.plot(u)
plt.show()
And now my program is finally displaying an image. Here it is:
[image: plot of u]
It is absolutely beautiful, isn't it?
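For reference, here is a sketch of the same explicit scheme written with array slices, which also makes the array shapes relevant to plotting explicit. It is not the code above: it assumes a grid of N + 1 points that includes both ends of the rod, and it uses sin(pi*x/L) as the initial condition (an assumption, chosen because it already satisfies the boundary conditions).

```python
import numpy as np

L, N, M, s = 1.0, 10, 100, 0.25            # s = k*dt/dx**2; s <= 0.5 keeps the scheme stable
x = np.linspace(0.0, L, N + 1)             # N + 1 grid points, including both ends
u = np.zeros((N + 1, M + 1))               # u[j, m] approximates u(x_j, t_m)
u[:, 0] = np.sin(np.pi * x / L)            # assumed initial condition f(x)
u[0, 0] = u[-1, 0] = 0.0                   # boundary conditions u(0,t) = u(L,t) = 0

for m in range(M):
    # Update every interior point at once; the boundary rows stay at zero.
    u[1:-1, m + 1] = u[1:-1, m] + s * (u[2:, m] - 2 * u[1:-1, m] + u[:-2, m])

print(u.shape, x.shape)                    # each column u[:, m] has the same length as x
```

With these shapes, plt.plot(x, u[:, m]) draws the temperature profile at time step m. In the code from the question, plt.plot(u, t) fails because u has N rows while t has M entries, whereas plt.plot(u) draws one curve per time step against the grid index.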
This code is taking more than half an hour for a data set of 200000 floats.
import numpy as np
try:
    import progressbar
    pbar = progressbar.ProgressBar(widgets=[progressbar.Percentage(),
        progressbar.Counter('%5d'), progressbar.Bar(), progressbar.ETA()])
except:
    pbar = list
block_length = np.loadtxt('bb.txt.gz') # get data file from http://filebin.ca/29LbYfKnsKqJ/bb.txt.gz (2MB, 200000 float numbers)
N = len(block_length) - 1
# arrays to store the best configuration
best = np.zeros(N, dtype=float)
last = np.zeros(N, dtype=int)
log = np.log
# Start with first data cell; add one cell at each iteration
for R in pbar(range(N)):
    # Compute fit_vec : fitness of putative last block (end at R)
    #fit_vec = fitfunc.fitness(
    T_k = block_length[:R + 1] - block_length[R + 1]
    #N_k = np.cumsum(x[:R + 1][::-1])[::-1]
    N_k = np.arange(R + 1, 0, -1)
    fit_vec = N_k * (log(N_k) - log(T_k))
    prior = 4 - log(73.53 * 0.05 * ((R+1) ** -0.478))
    A_R = fit_vec - prior #fitfunc.prior(R + 1, N)
    A_R[1:] += best[:R]
    i_max = np.argmax(A_R)
    last[R] = i_max
    best[R] = A_R[i_max]
# Now find changepoints by iteratively peeling off the last block
change_points = np.zeros(N, dtype=int)
i_cp = N
ind = N
while True:
    i_cp -= 1
    change_points[i_cp] = ind
    if ind == 0:
        break
    ind = last[ind - 1]
change_points = change_points[i_cp:]
print edges[change_points] # show result
The first loop is very slow because the arrays have length R at iteration R, i.e. they keep growing, which leads to N^2 complexity.
Is there any way to optimize this code further, e.g. through pre-computation? I am also happy with solutions using other programming languages.
I can replicate A_R (up to the fit-prior step) as an upper triangular NxN matrix with:
def trilog(n):
    nn = n[:-1,None]-n[None,1:]
    nn[np.tril_indices_from(nn,-1)]=1
    return nn
T_k = trilog(block_length)
N_k = trilog(-np.arange(N+1))
fit_vec = N_k * (np.log(N_k) - np.log(T_k))
R = np.arange(N)+1
prior = 4 - log(73.53 * 0.05 * (R ** -0.478))
A_R = fit_vec - prior
A_R = np.triu(A_R,0)
print(A_R)
I haven't worked through the logic of calculation and applying best.
I've only done this with small arrays. For your full problem, the corresponding matrix is too large for my memory.
B=np.ones((200000,200000),float)
So just from memory considerations you might be stuck with the for R in range(N) iteration.
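The arithmetic behind that memory remark, as a short sketch. The 16 GB budget is an assumed figure used only to show how the largest feasible matrix size follows from it.

```python
n = 200000                      # number of data points in the question
bytes_per_float64 = 8
full_matrix_bytes = n * n * bytes_per_float64
print(full_matrix_bytes / 1e9)  # size of one n x n float64 matrix in GB (decimal)

# Largest n whose full matrix fits in a given budget, e.g. 16 GB (assumed figure).
budget_bytes = 16e9
n_max = int((budget_bytes / bytes_per_float64) ** 0.5)
print(n_max)
```

A single 200000 x 200000 float64 matrix therefore needs about 320 GB, and the approach above builds several matrices of that size (T_k, N_k, fit_vec, A_R).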