Parallelize loops using OpenCL in Python

I have a given dataset in the matrix y and I want to train different SOMs with it. The SOM is one-dimensional (a line), and its number of neurons varies. I train a SOM of size N=2 at first, and N=NMax at last, giving a total of NMax-2+1 SOMs. For each SOM, I want to store the weights once the training is over before moving on to the next SOM.
The whole point of using PyOpenCL here is that each of the outer-loop iterations is independent of the others: for each value of N, the script doesn't care about what happens when N takes other values. One could get the same result by running the script NMax-2+1 times, changing the value of N manually.
With this in mind, I was hoping to perform each of these independent iterations at the same time on the GPU, so that the time spent drops significantly. The run time won't shrink all the way to 1/(NMax-2+1) of the serial time, though, because each iteration is more expensive than the previous ones: for larger values of N, more calculations are made.
Is there a way to 'translate' this code to run on the GPU? I've never used OpenCL before, so let me know if this is too broad or silly, so I can ask a more specific question. The code is self-contained, so feel free to try it out. The four constants declared at the beginning can be changed to whatever you like (given that NMax > 1 and all the others are strictly positive).
import numpy as np
import time

m = 3               # Dimension of datapoints
num_points = 2000   # Number of datapoints
iterMax = 150       # Maximum number of iterations
NMax = 3            # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points, m)  # Generate always the same dataset
sigma_0 = 5  # Initial value of width of the neighborhood function
eta_0 = 1    # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y, axis=0), NMax - 1))  # Clusters for each N

t_begin = time.perf_counter()  # Start time (time.clock() was removed in Python 3.8)
for N in range(NMax - 1):  # Number of neurons for this iteration
    w[N] = np.random.uniform(0, 1, (N + 2, np.size(y, axis=1))) - 0.5  # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y, axis=0)), :]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0 / (iterMax + 1)) * iterCount
        s2 = 2 * sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0 / (iterMax + 1)) * iterCount
        for selectedInput in mixInputs:  # Pick up one pattern
            # Search winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis=-1)
            ii = np.argmin(aux)  # Neuron 'ii' is the winner
            jjs = abs(ii - list(range(N + 2)))
            dists = np.min(np.vstack([jjs, abs(jjs - (N + 2))]), axis=0)
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2) / s2).T[:, np.newaxis] * (selectedInput - w[N])
        print(N + 2, iterCount)
        iterCount += 1
    # Assign each datapoint to its nearest neuron
    for kk in range(np.size(y, axis=0)):
        aux = np.sum((y[kk, :] - w[N])**2, axis=-1)
        ii = np.argmin(aux)  # Neuron 'ii' is the winner
        wClusters[kk, N] = ii + 1
t_end = time.perf_counter()  # End time
#%%
print(t_end - t_begin)

I'm trying to give a somewhat complete answer.
First of all:
Can this code be adapted to be run on the GPU using (py)OpenCL?
Most probably yes.
Can this be done automatically?
No (afaik).
Most of the questions I get about OpenCL are along the lines of: "Is it worth porting this piece of code to OpenCL for a speed gain?" You state that each iteration of your outer loop is independent of the results of the other runs, which makes the code basically parallelizable. In a straightforward implementation, each OpenCL work item would execute the same code with slightly different input parameters. Ignoring the overhead of data transfer between host and device, the running time of this approach would equal the running time of the slowest iteration. Depending on the iterations in your outer loop, this could be a massive speed gain. As long as the numbers stay relatively small, you could instead try the multiprocessing module in Python to parallelize these iterations on the CPU rather than the GPU.
Porting to the GPU usually only makes sense when a huge number of processes are to be run in parallel (roughly 1000 or more). So in your case, if you really want an enormous speed boost, see whether you can parallelize all the calculations inside the loop. For example, you have 150 iterations and 2000 data points: if you could somehow parallelize over those 2000 data points, that could offer a much bigger speed gain, which would justify the work of porting the whole code to OpenCL.
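To give a flavor of what that could look like, here is a minimal PyOpenCL sketch of just the winner search over all data points, not a full port of the SOM training. The kernel name find_winners and the host-side plumbing are illustrative, and it assumes y and w[N] from the question's script:
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void find_winners(__global const float *y,   // num_points x m, row-major
                           __global const float *w,   // n_neurons x m, row-major
                           __global int *winner,      // num_points
                           const int m, const int n_neurons)
{
    int gid = get_global_id(0);   // one work item per datapoint
    float best = INFINITY;
    int best_n = 0;
    for (int n = 0; n < n_neurons; n++) {
        float d = 0.0f;
        for (int k = 0; k < m; k++) {
            float diff = y[gid*m + k] - w[n*m + k];
            d += diff * diff;
        }
        if (d < best) { best = d; best_n = n; }
    }
    winner[gid] = best_n;
}
"""
prg = cl.Program(ctx, kernel_src).build()

y32 = y.astype(np.float32)       # dataset from the question's script
w32 = w[N].astype(np.float32)    # weights of the current SOM
winners = np.empty(y32.shape[0], dtype=np.int32)

mf = cl.mem_flags
y_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y32)
w_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=w32)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, winners.nbytes)

prg.find_winners(queue, (y32.shape[0],), None, y_buf, w_buf, out_buf,
                 np.int32(y32.shape[1]), np.int32(w32.shape[0]))
cl.enqueue_copy(queue, winners, out_buf)  # winners[kk] == nearest neuron of y[kk]
The sequential weight updates would still happen on the host in this scheme; only the embarrassingly parallel distance computations move to the device.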
TL;DR:
Try parallelizing on the CPU first. If you find that you need to run more than a few hundred processes at the same time, move to the GPU.
Update: Simple code for parallelizing on CPU using multiprocessing (without callback)
import numpy as np
import time
import multiprocessing as mp

m = 3               # Dimension of datapoints
num_points = 2000   # Number of datapoints
iterMax = 150       # Maximum number of iterations
NMax = 10           # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points, m)  # Generate always the same dataset
sigma_0 = 5  # Initial value of width of the neighborhood function
eta_0 = 1    # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y, axis=0), NMax - 1))  # Clusters for each N

def neuron_run(N):
    w[N] = np.random.uniform(0, 1, (N + 2, np.size(y, axis=1))) - 0.5  # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y, axis=0)), :]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0 / (iterMax + 1)) * iterCount
        s2 = 2 * sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0 / (iterMax + 1)) * iterCount
        for selectedInput in mixInputs:  # Pick up one pattern
            # Search winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis=-1)
            ii = np.argmin(aux)  # Neuron 'ii' is the winner
            jjs = abs(ii - list(range(N + 2)))
            dists = np.min(np.vstack([jjs, abs(jjs - (N + 2))]), axis=0)
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2) / s2).T[:, np.newaxis] * (selectedInput - w[N])
        print(N + 2, iterCount)
        iterCount += 1
    # Assign each datapoint to its nearest neuron
    for kk in range(np.size(y, axis=0)):
        aux = np.sum((y[kk, :] - w[N])**2, axis=-1)
        ii = np.argmin(aux)  # Neuron 'ii' is the winner
        wClusters[kk, N] = ii + 1

t_begin = time.perf_counter()  # Start time (time.clock() was removed in Python 3.8)
#%%
def apply_async():
    pool = mp.Pool(processes=NMax)
    for N in range(NMax - 1):
        pool.apply_async(neuron_run, args=(N,))
    pool.close()
    pool.join()
    print("Multiprocessing done!")

if __name__ == '__main__':
    apply_async()

t_end = time.perf_counter()  # End time
print(t_end - t_begin)
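One caveat with apply_async above: each worker process gets its own copy of the globals, so the writes to w and wClusters happen in the children and are not visible in the parent after the pool is joined. A minimal sketch of one way to get the results back, assuming neuron_run is changed to end with return w[N], wClusters[:, N]:
# Sketch: collect the per-N results in the parent process.
if __name__ == '__main__':
    with mp.Pool(processes=NMax) as pool:
        results = pool.map(neuron_run, range(NMax - 1))
    for N, (weights, clusters) in enumerate(results):
        w[N] = weights
        wClusters[:, N] = clusters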

Related

How do I optimize an array heavy code in Python?

Sorry, this is probably a very noob question, but I'm converting some code I've been modeling with from MATLAB to Python, both to help me learn Python and to see whether it would run any faster. In MATLAB, this code takes about 1 second to run, but in Python it takes about 1 minute. Is there some way to speed it up, or is this not a good application for Python?
import numpy as np
import matplotlib.pyplot as plt

# (Material and geometry constants such as L, M, k, c, rho, d, w, S,
#  Ti, T2, Issmax, rhom, cm, lm, wm are defined earlier and omitted here.)
N = 7e5    # number of time steps
dt = 1e-6  # Time step, in seconds
tf = dt*N  # Final time, seconds
trange = np.linspace(0, tf, int(N+1))  # time range
dx = L/M   # spatial step size in thermoelectric, meters
# Define dimensionless Fourier number in thermoelectric
Fo = dt*(k/c)/(dx**2)
# temperature profile in thermoelectric as a function of space and time
T = np.zeros((M+1, 2))
# Allocate initial condition
T[:, 0] = Ti
# Set boundary condition at x=L
T[M, :] = T2
# temperature v time profile of cold side of thermoelectric
coldTemp = np.zeros(len(trange))
# initial cold-side temp
coldTemp[0] = Ti
# setting current to optimum DC value
I = Issmax
# iterate over timesteps
for p in range(int(N)):
    # Use central difference forward time method to find temperature within
    # thermoelectric material.
    for n in range(M-1):
        # calculate temp. change at next time step
        T[n+1, 1] = T[n+1, 0] + Fo*(T[n+2, 0] - 2*T[n+1, 0] + T[n, 0]) + dt*((I)**2*rho/(c*d**2*w**2))
    # Apply energy balance to the metal (assumed isothermal) and use the
    # fact that the metal temp is equal to the thermoelectric temp
    T[0, 1] = T[0, 0] + dt*((I)**2*rhom/(cm*lm**2*wm**2)) - (k*dt/(cm*dx*lm))*(T[0, 0]-T[1, 0]) - (dt*(I)*S*T[0, 0]/(cm*d*w*lm))
    # Saving cold-side temp
    coldTemp[p+1] = T[0, 1]
    # Setting current temperature profile to be the calculated one
    T[:, 0] = T[:, 1]
# Plotting cold-side temp vs time
plt.plot(trange, coldTemp)
The suggestions in the comments above are good, but before anything else, you are violating perhaps the #1 rule of making loops faster: don't do things inside a loop that can be done outside of it. You are re-computing the same values billions of times, and they are "expensive", involving division and exponentiation. Consider something like this... (Check the math, I wasn't too careful, and perhaps there is more you can do.)
...
# calculate the constants...
c1 = dt*((I)**2*rho/(c*d**2*w**2))
c2 = dt*((I)**2*rhom/(cm*lm**2*wm**2))
c3 = (k*dt/(cm*dx*lm))
c4 = (dt*(I)*S/(cm*d*w*lm))
# iterate over timesteps
for p in range(int(N)):
    # Use central difference forward time method to find temperature within
    # thermoelectric material.
    for n in range(M-1):
        # calculate temp. change at next time step
        T[n+1, 1] = T[n+1, 0] + Fo*(T[n+2, 0] - 2*T[n+1, 0] + T[n, 0]) + c1
    # Apply energy balance to the metal (assumed isothermal) and use the
    # fact that the metal temp is equal to the thermoelectric temp
    T[0, 1] = T[0, 0] + c2 - c3*(T[0, 0] - T[1, 0]) - c4*T[0, 0]
    # Saving cold-side temp
    coldTemp[p+1] = T[0, 1]
    # Setting current temperature profile to be the calculated one
    T[:, 0] = T[:, 1]
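A further step, in the spirit of the comments: the inner spatial loop can be replaced by NumPy slicing, removing the Python-level loop over n entirely. A sketch, assuming the same arrays and constants as above (check the indices against your boundary conditions):
# Vectorized inner update: replaces the `for n in range(M-1)` loop.
# T[1:M,1] depends on the three stencil neighbors T[0:M-1,0], T[1:M,0], T[2:M+1,0].
for p in range(int(N)):
    T[1:M, 1] = T[1:M, 0] + Fo*(T[2:M+1, 0] - 2*T[1:M, 0] + T[0:M-1, 0]) + c1
    T[0, 1] = T[0, 0] + c2 - c3*(T[0, 0] - T[1, 0]) - c4*T[0, 0]
    coldTemp[p+1] = T[0, 1]
    T[:, 0] = T[:, 1]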

Taking numbers from a loop and applying them to a np.zeros array?

I'm trying to write code for plotting a one-dimensional random walk. In this instance, I'm trying to graph how the distance travelled changes with the number of steps taken per walk, which increases in increments of 10, from 10 steps to 1000. Because I wasn't able to make the dimensions work normally, I tried making a np.zeros array for each value of the distance. At the moment I have this:
import numpy as np
import math
import random

minstep = 10
step_max = 1000
increment = 10
nsteps = np.arange(minstep, step_max + increment, increment)
for nsteps in range(step_max):
    nwalks = 10
    distance = np.zeros(nwalks)
    for j in range(nwalks):
        startpos = 0.0
        pos = startpos
        for i in range(nstep):
            x = 0.5 - random.random()
            step = math.copysign(1.0, x)
            stepsize = -1 * math.log(1 - random.random())
            pos = pos + (step * stepsize)
        distance[j] = (math.fabs(pos - startpos))
    nsteps = nsteps + increment
distance_total = np.zeros(99)
nsteps_total = np.arange(minstep, step_max, increment)
distance_total[nwalks] = distance
So, I'm trying to take each value of distance and put it in the corresponding spot in the distance_total array, but I'm not sure how to assign each value of distance to its slot in the distance_total array.
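For what it's worth, here is a minimal sketch of the indexing pattern the question is after, assuming the goal is one (mean) distance per step count. The variable names differ slightly from the question's code so that the step-count array and the loop counter stay separate:
import numpy as np
import math
import random

minstep, step_max, increment = 10, 1000, 10
nwalks = 10
step_counts = np.arange(minstep, step_max + increment, increment)  # 10, 20, ..., 1000
distance_total = np.zeros(len(step_counts))

for idx, nstep in enumerate(step_counts):
    distance = np.zeros(nwalks)
    for j in range(nwalks):
        pos = 0.0
        for i in range(nstep):
            step = math.copysign(1.0, 0.5 - random.random())
            stepsize = -math.log(1 - random.random())
            pos += step * stepsize
        distance[j] = math.fabs(pos)
    distance_total[idx] = distance.mean()  # one averaged distance per step count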

How can I optimize this code in python? For solving stochastic differential equations

I am developing a code that uses a method called Platen to solve stochastic differential equations. Then I must solve that stochastic differential equation many times (on the order of 10,000 times) to average all the results. My code is:
import numpy as np
import random
import numba

@numba.jit(nopython=True)
def integrador2(y, t, h):  # this is the integrator of the function that solves the SDE
    m = 6.6551079E-26  # parameters
    gamma = 0.05
    T = 5E-3
    k_b = 1.3806488E-23
    b = np.sqrt(2*m*gamma*T*k_b)
    c = np.sqrt(h)
    for i in range(len(t)):
        dW = c*random.gauss(0, 1)
        A = np.array([y[i, -1]/m, -gamma*y[i, -1]])  # this is the Platen method that is applied at
        B_dW = np.array([0, b*dW])                   # each time step
        z = y[i] + A*h + B_dW
        Az = np.array([z[-1]/m, -gamma*z[-1]])
        y[i+1] = y[i] + 1/2*(Az + A)*h + B_dW
    return y

def media(args):  # args is a tuple with the parameters
    y = args[0]
    t = args[1]
    k = args[2]
    x = 0
    p = 0
    for n in range(k):  # k = number of trajectories
        y = integrador2(y, t, h)
        x = (1./(n+1))*(n*x + y[:, 0])  # I do the average like this so as not to have to save
        p = (1./(n+1))*(n*p + y[:, 1])  # all the solutions in memory
    return x, p
The variables y, t and h are:
y0 = np.array([initial position, initial moment]) #initial conditions
t = np.linspace(initial time, final time, number of time intervals) #time array
y = np.zeros((len(t)+1,len(y0))) #array of positions and moments
y[0,:]=np.array(y0) #I keep the initial condition
h = (final time-initial time)/(number of time intervals) #time increment
I need to be able to run the program with 10**7 time intervals and solve it 10**4 times (k = 10**4).
I feel that I have already reached a dead end: I already accelerate the function that computes the result with Numba, and (although I do not show it here) I parallelize the media function across the four cores my computer has. Even after doing all of this, my program takes an hour and a half to execute for 10**6 time intervals and k = 10**4. I have not had the courage to run it for 10**7 time intervals, because my intuition tells me it would take more than 10 hours.
I would really appreciate it if someone could advise me on how to make some parts of the code faster.
Finally, I apologize if I have not expressed myself completely correctly in any part of the question, I am a physicist, not a computer scientist and my English is far from perfect.
I can save about 75% of compute time by simplifying the math in the loop: substituting the predictor z into the corrector shows that the two-stage Platen step has constant coefficients, so they can be computed once outside the loop:
def integrador2(y, t, h):  # this is the integrator of the function that solves the SDE
    m = 6.6551079E-26  # parameters
    gamma = 0.05
    T = 5E-3
    k_b = 1.3806488E-23
    b = np.sqrt(2*m*gamma*T*k_b)
    c = np.sqrt(h)
    h = h * 1.
    coeff0 = h/m - gamma*h**2/(2.*m)
    coeff1 = (1. - gamma*h + gamma**2*h**2/2.)
    coeffd = c*b*(1. - gamma*h/2.)
    for i in range(len(t)):
        dW = np.random.normal()
        # Method 2
        y[i+1] = np.array([y[i][0] + y[i][1]*coeff0, y[i][1]*coeff1 + dW*coeffd])
    return y
Here's a method using filters with scipy, which I don't think is compatible with Numba, but is slightly faster than the solution above:
from scipy import signal

# @numba.jit(nopython=True)   # lfilter is not supported in Numba's nopython mode
def integrador2(y, t, h):  # this is the integrator of the function that solves the SDE
    m = 6.6551079E-26  # parameters
    gamma = 0.05
    T = 5E-3
    k_b = 1.3806488E-23
    b = np.sqrt(2*m*gamma*T*k_b)
    c = np.sqrt(h)
    h = h * 1.
    coeff0a = 1.
    coeff0b = h/m - gamma*h**2/(2.*m)
    coeff1 = (1. - gamma*h + gamma**2*h**2/2.)
    coeffd = c*b*(1. - gamma*h/2.)
    noise = np.zeros(y.shape[0])
    noise[1:] = np.random.normal(0., coeffd*1., y.shape[0]-1)
    noise[0] = y[0, 1]
    a = [1, -coeff1]
    b = [1]
    y[1:, 1] = signal.lfilter(b, a, noise)[1:]
    a = [1, -coeff0a]
    b = [coeff0b]
    y[1:, 0] = signal.lfilter(b, a, y[:, 1])[1:]
    return y
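The filter trick works because the momentum update above is a first-order linear recurrence, p[i] = coeff1*p[i-1] + noise[i], which lfilter with b = [1] and a = [1, -coeff1] evaluates in compiled code. A quick sanity check of that equivalence, with a made-up coefficient:
import numpy as np
from scipy import signal

coeff1 = 0.9
noise = np.random.normal(size=100)
p_ref = np.zeros(100)
p_ref[0] = noise[0]
for i in range(1, 100):
    p_ref[i] = coeff1*p_ref[i-1] + noise[i]  # explicit recurrence
assert np.allclose(signal.lfilter([1], [1, -coeff1], noise), p_ref)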

Is there a way to increase the line length for an equation in Gekko after receiving "APM model error: string > 15000 characters"?

I'm using Gekko for an optimization problem with constraints that require summations over array variables. Because these arrays are long, I keep getting the error: APM model error: string > 15000 characters
The summation needs to run over three indices: i in range(1, years), n in range(1, i), and j in range(1, receptors). As the model compiles, the number of variables included in each summation increases. I want to leave the code as a summation, with the following line:
m.Equation(emissions[:,3] == sum(sum(sum(f[n,j]*-r[j,2]*unit *(.001*(i-n)**2 + 0.062*(i-n)) for i in range(years)) for n in range(i))for j in range(rec)))
However, this constraint exceeds the 15,000-character limit for a single line.
I have previously solved the problem using for loops and intermediates to build all of these variables outside of the "constraint" environment. That gives the right answer, but takes a long time to compile the model (upwards of 4 hours for model building, and less than 3 minutes to solve it). The code looked like this:
for i in range(years):
    emissions[i, 0] = s[i, 1]
    emissions[i, 1] = s[i, 3]
    emissions[i, 2] = s[i, 5]
    emissions[i, 3] = 0
    emissions[i, 4] = 0
    emissions[i, 5] = 0
    for n in range(i):
        for j in range(rec):
            # update + binary * flux * conversion * growth
            emissions[i, 3] = m.Intermediate(emissions[i, 3] + f[n, j] * -rankedcopy[j, 2] * unit * (.001*(i-n)**2 + 0.062*(i-n)))
            emissions[i, 4] = m.Intermediate(emissions[i, 4] + f[n, j] * -rankedcopy[j, 3] * unit * (.001*(i-n)**2 + 0.062*(i-n)))
            emissions[i, 5] = m.Intermediate(emissions[i, 5] + f[n, j] * -rankedcopy[j, 4] * unit * (.001*(i-n)**2 + 0.062*(i-n)))
I'm hoping that avoiding the for loops will improve efficiency enough to let me expand the model, but I'm unsure of a way to increase the APM model string limit.
I am also open to other suggestions of how to embed intermediates into the summation.
Try using the m.sum() function, a built-in GEKKO object. If you use the Python sum function, it creates one large summation equation that needs to be interpreted at run time and may exceed the equation size limit. m.sum() creates the summation in byte-code instead.
m.Equation(emissions[:, 3] == \
           m.sum(m.sum(m.sum(f[n, j]*-r[j, 2]*unit*(.001*(i-n)**2 + 0.062*(i-n)) \
                             for i in range(years)) for n in range(i)) for j in range(rec)))
Here is a simple example that shows the difference in performance.
from gekko import GEKKO
import numpy as np
import time
n = 5000
v = np.linspace(0,n-1,n)
# summation method 1 - Python sum
m = GEKKO()
t = time.time()
s = sum(v)
y = m.Var()
m.Equation(y==s)
m.solve(disp=False)
print(y.value[0])
print('Elapsed time: ' + str(time.time()-t))
m.cleanup()
# summation method 2 - Intermediates
m = GEKKO()
t = time.time()
s = 0
for i in range(n):
    s = m.Intermediate(s + v[i])
y = m.Var()
m.Equation(y==s)
m.solve(disp=False)
print(y.value[0])
print('Elapsed time: ' + str(time.time()-t))
m.cleanup()
# summation method 3 - Gekko sum
m = GEKKO()
t = time.time()
s = m.sum(v)
y = m.Var()
m.Equation(y==s)
m.solve(disp=False)
print(y.value[0])
print('Elapsed time: ' + str(time.time()-t))
m.cleanup()
Results
12497500.0
Elapsed time: 0.17874956130981445
12497500.0
Elapsed time: 5.171698570251465
12497500.0
Elapsed time: 0.1246955394744873
The 15,000-character limit for a single equation is a hard limit. We thought about making it adjustable with m.options.MAX_MEMORY, but very large equations can lead to very dense matrix factorizations for the solver. It is often better to break up the equation or use other methods to reduce its size.
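As a hedged illustration of "breaking up the equation" (a generic pattern, not the question's exact model): accumulate a long sum in chunks of m.Intermediate objects so that no single equation string grows past the limit, then sum the partials.
from gekko import GEKKO

m = GEKKO()
n = 2000
p = [m.Param(value=i) for i in range(n)]  # stand-ins for model variables
chunk = 200
# each Intermediate holds one bounded-size partial sum
partials = [m.Intermediate(m.sum(p[k:k+chunk]))
            for k in range(0, n, chunk)]
y = m.Var()
m.Equation(y == m.sum(partials))
m.solve(disp=False)
print(y.value[0])  # expected: 1999000.0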

Cartpole - Simple backprop with 1 hidden layer?

I'm trying to solve the CartPole-v1 problem from OpenAI by using backprop on a one-hidden-layer neural network, updating the model at every time step using state-action values (Q(s,a)). I'm unable to get the average reward to go up beyond about 42 steps per episode. Could anyone help? Is my approach even correct? That is, is it even possible for the agent to learn the optimal solution if I update the Q-values at every time step, instead of doing batch updates at the end of every episode? Theoretically it seems like it should be possible.
Details: After playing around and experimenting with activation functions and stochastic policies, and finally settling on a deterministic policy with a linear activation function and the parameters mentioned below, I'm able to get my agent to consistently converge (in about 100-300 episodes) to an average reward of about 42 steps. But it doesn't go beyond 45. Adjusting the parameters below (epsilon, discount_rate, and learning rate) does not have a large impact on this.
I've tried looking for a similar solution online but none of them seem to fit the approach that I'm following. Almost all of the solutions involve learning at the end of each episode (by storing SARS' data).
Increasing the number of hidden layers doesn't help either. I also think it is unlikely that the algorithm will converge to a better value in the future, as I've run it for 10000+ episodes and the average reward is still around 40.
First, the hyperparameters:
epsilon = 0.5
lr = 0.05
discount_rate=0.9
# number of features in environment observations
num_inputs = 4
hidden_layer_nodes = 6
num_outputs = 2
The q function:
def calculateNNOutputs(observation, m1, m2):
    scaled_observation = scaleFeatures(observation)
    hidden_layer = np.dot(scaled_observation, m1)  # 1x4 X 4x6 -> 1x6
    outputs = np.dot(hidden_layer, m2)  # 1x6 X 6x2
    return np.asmatrix(outputs)  # 1x2
Action selection (policy):
def selectAction(observation):
    # explore
    global epsilon
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 1)
    # exploit
    outputs = calculateNNOutputs(observation)
    print(outputs)
    if (outputs[0, 0] > outputs[0, 1]):
        return 0
    else:
        return 1
Backprop:
def backProp(prev_obs, m1, m2, experimental_values):
    global lr
    scaled_observation = np.asmatrix(scaleFeatures(prev_obs))
    hidden_layer = np.asmatrix(np.dot(scaled_observation, m1))  # 1x4 X 4x6 = 1x6
    outputs = np.asmatrix(np.dot(hidden_layer, m2))  # 1x6 X 6x2 = 1x2
    delta_out = np.asmatrix((outputs - experimental_values))  # 1x2
    delta_2 = np.transpose(np.dot(m2, np.transpose(delta_out)))  # 6x2 X 2x1 = 6x1_T = 1x6
    GRADIENT_2 = (np.transpose(hidden_layer))*delta_out  # 6x1 X 1x2 = 6x2 - same shape as m2
    GRADIENT_1 = np.multiply(np.transpose(scaled_observation), delta_2)  # 4x6 - same shape as m1
    m1 = m1 - lr*GRADIENT_1
    m2 = m2 - lr*GRADIENT_2
    return m1, m2
Q-learning:
def updateWeights(prev_obs, action, obs, reward, done):
    global weights_1, weights_2
    calculated_value = calculateNNOutputs(prev_obs)
    if done:
        experimental_value = -1
    else:
        actionValues = calculateNNOutputs(obs)  # 1x2
        experimental_value = reward + discount_rate*(np.amax(actionValues, axis=1)[0, 0])
    if action == 0:
        weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[experimental_value, calculated_value[0, 1]]]))
    else:
        weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[calculated_value[0, 0], experimental_value]]))
EDIT: the main loop -
record = 0
total = 0
for i_episode in range(num_episodes):
    if (i_episode % 10 == 0):
        print("W1 = ", weights_1)
        print("W2 = ", weights_2)
    observation = env.reset()
    epsilon = max(epsilon*0.9, 0.01)
    lr = max(lr*0.9, 0.01)
    print("Average steps = ", total/(i_episode+1))
    print("Record = ", record)
    for t in range(1000):
        action_taken = selectAction(observation)
        print(action_taken)
        previous_observation = observation
        observation, reward, done, info = env.step(action_taken)  # take the selected action
        updateWeights(previous_observation, action_taken, observation, reward, done)  # perform backprop to update the action value
        if done:
            total = total + t
            if t > record:
                record = t
            print("Episode {} finished after {} timesteps".format(i_episode, t+1))
            break
Do I need to make any changes in approach, implementation, or parameter tuning?
