I'm trying to simulate coin tosses and profits and plot the graph in matplotlib:
from random import choice
import matplotlib.pyplot as plt
import time

start_time = time.time()

num_of_graphs = 2000
tries = 2000
coins = [150, -100]
last_loss = 0

for a in range(num_of_graphs):
    profit = 0
    line = []
    for i in range(tries):
        profit = profit + choice(coins)
        if (profit < 0 and last_loss < i):
            last_loss = i
        line.append(profit)
    plt.plot(line)

plt.show()
print("--- %s seconds ---" % (time.time() - start_time))
print("No losses after " + str(last_loss) + " iterations")
The end result is
--- 9.30498194695 seconds ---
No losses after 310 iterations
Why is it taking so long to run this script? If I change num_of_graphs to 10000, the script never finishes.
How would you optimize this?
Your measurement of execution time is too coarse. The following, which uses numpy, lets you time the simulation separately from the plotting:
import matplotlib.pyplot as plt
import numpy as np
import time

def run_sims(num_sims, num_flips):
    start = time.time()
    sims = [np.random.choice(coins, num_flips).cumsum() for _ in range(num_sims)]
    end = time.time()
    print(f"sim time = {end-start}")
    return sims

def plot_sims(sims):
    start = time.time()
    for line in sims:
        plt.plot(line)
    end = time.time()
    print(f"plotting time = {end-start}")
    plt.show()

if __name__ == '__main__':
    start_time = time.time()
    num_sims = 2000
    num_flips = 2000
    coins = np.array([150, -100])
    plot_sims(run_sims(num_sims, num_flips))
result:
sim time = 0.13962197303771973
plotting time = 6.621474981307983
As you can see, the sim time is greatly reduced (it was on the order of 7 seconds on my 2011 laptop); the plotting time is matplotlib-dependent.
As Simon Gibbons explained in an answer to a previous post, matplotlib gets slower as the script progresses because it is redrawing all of the lines that you have previously plotted, even the ones that have scrolled off the screen.
matplotlib isn't optimized for speed; it's optimized for graphics. Here are links to a few libraries that were developed with speed in mind:
http://www.pyqtgraph.org/
http://code.google.com/p/guiqwt/
http://code.enthought.com/projects/chaco/
You can refer to the matplotlib cookbook for more about performance.
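If you want to stay with matplotlib, one standard way to cut the drawing overhead (a sketch built on the run_sims output above, not something taken verbatim from the cookbook) is to add all of the lines as a single LineCollection artist instead of calling plt.plot once per simulation:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

def plot_sims_fast(sims):
    # Build one (N, 2) array of (x, y) vertices per simulation.
    segments = [np.column_stack([np.arange(len(line)), line]) for line in sims]
    fig, ax = plt.subplots()
    # A single collection is drawn as one artist, which is usually much
    # cheaper than thousands of individual Line2D objects.
    ax.add_collection(LineCollection(segments, linewidths=0.5))
    ax.autoscale()  # collections do not update the view limits automatically
    plt.show()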
To better optimize your code, I would always try to replace loops with vectorized numpy operations or, depending on the specific needs, other libraries that use numpy under the hood.
In this case, you could calculate and plot your profits this way:
import matplotlib.pyplot as plt
import time
import numpy as np
start_time = time.time()
num_of_graphs = 2000
tries = 2000
coins = [150, -100]
# Create a 2-D array with random choices
# rows for tries, columns for individual runs (graphs).
coin_tosses = np.random.choice(coins, (tries, num_of_graphs))
# Calculate a 2-D array of profits by summing
# cumulatively over rows (tries).
profits = coin_tosses.cumsum(axis=0)
# Plot everything in one shot.
plt.plot(profits)
plt.show()
print("--- %s seconds ---" % (time.time() - start_time))
In my configuration, this code took approx. 6.3 seconds (6.2 plotting) to run, while your code took almost 15 seconds.
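If the "No losses after ..." figure from the original script is still wanted, it can be recovered from the vectorized profits array without reintroducing a Python loop. A minimal sketch, assuming the profits array computed above (last_loss mirrors the variable in the question):

import numpy as np

# Assumes `profits` from the snippet above, with shape (tries, num_of_graphs).
# Rows where at least one run is in the red (profit < 0).
loss_rows = np.flatnonzero((profits < 0).any(axis=1))

# The largest such row index is the last iteration at which any run showed a loss.
last_loss = int(loss_rows.max()) if loss_rows.size else 0
print("No losses after " + str(last_loss) + " iterations")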
Related
The following Python code measures the speedup obtained when increasing the number of processes. The task run in each process is just repeatedly multiplying a random matrix; the matrix size is also varied, and the corresponding elapsed time is measured.
Note that the processes do not share any objects and are completely independent, so I expected the performance curve over the number of processes to be roughly the same for every matrix size. However, when plotting the results (see below), I found that this expectation is false. Specifically, when the matrix size becomes large (80, 160), the performance hardly improves even though the number of processes is increased. Note: the figure's legend indicates the matrix sizes.
Could you explain why performance does not improve when the matrix size is large?
For your information, here is the spec of my CPU:
https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x
Product Family: AMD Ryzen™ Processors
Product Line: AMD Ryzen™ 9 Desktop Processors
# of CPU Cores: 12
# of Threads: 24
Max. Boost Clock: Up to 4.6GHz
Base Clock: 3.8GHz
L1 Cache: 768KB
L2 Cache: 6MB
L3 Cache: 64MB
main script
import numpy as np
import pickle
from dataclasses import dataclass
import time
import multiprocessing
import os
import subprocess

def split_number(n_total, n_split):
    return [n_total // n_split + (1 if x < n_total % n_split else 0) for x in range(n_split)]

def task(args):
    n_iter, idx, matrix_size = args
    #cores = "{},{}".format(2 * idx, 2 * idx+1)
    #os.system("taskset -p -c {} {}".format(cores, os.getpid()))
    for _ in range(n_iter):
        A = np.random.randn(matrix_size, matrix_size)
        for _ in range(100):
            A = A.dot(A)

def measure_time(n_process: int, matrix_size: int) -> float:
    n_total = 100
    assigne_list = split_number(n_total, n_process)
    pool = multiprocessing.Pool(n_process)
    ts = time.time()
    pool.map(task, zip(assigne_list, range(n_process), [matrix_size] * n_process))
    elapsed = time.time() - ts
    return elapsed

if __name__ == "__main__":
    n_experiment_sample = 5
    n_logical = os.cpu_count()
    n_physical = int(0.5 * n_logical)
    result = {}
    for mat_size in [5, 10, 20, 40, 80, 160]:
        subresult = {}
        result[mat_size] = subresult
        for n_process in range(1, n_physical + 1):
            elapsed = np.mean([measure_time(n_process, mat_size) for _ in range(n_experiment_sample)])
            subresult[n_process] = elapsed
            print("{}, {}, {}".format(mat_size, n_process, elapsed))
    with open("result.pkl", "wb") as f:
        pickle.dump(result, f)
plot script
import numpy as np
import matplotlib.pyplot as plt
import pickle

with open("result.pkl", "rb") as f:
    result = pickle.load(f)

fig, ax = plt.subplots()
for matrix_size in result.keys():
    subresult = result[matrix_size]
    n_process_list = list(subresult.keys())
    elapsed_time_list = np.array(list(subresult.values()))
    speedups = elapsed_time_list[0] / elapsed_time_list
    ax.plot(n_process_list, speedups, label=matrix_size)
ax.set_xlabel("number of process")
ax.set_ylabel("speed up compared to single process")
ax.legend(loc="upper left", borderaxespad=0, fontsize=10, framealpha=1.0)
plt.show()
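One hypothesis worth ruling out (an assumption on my part, not something established in the post): the BLAS library behind np.dot may itself spawn multiple threads inside every worker, so that for large matrices the cores are already saturated before multiprocessing adds anything. A minimal sketch of pinning each worker's BLAS to a single thread via environment variables and re-running the workload:

import os

# Force single-threaded BLAS in every worker; these variables must be set
# before numpy is imported (and hence before the workers import it).
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"

import multiprocessing
import numpy as np

def worker(matrix_size):
    # Same workload as task() in the question, without the iteration bookkeeping.
    A = np.random.randn(matrix_size, matrix_size)
    for _ in range(100):
        A = A.dot(A)

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, [160] * 100)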
Background: I'm trying to create a simple bootstrap function for sampling means with replacement. I want to parallelize the function since I will eventually be deploying this on data with millions of data points and will want much larger sample sizes. I've run other examples, such as the Mandelbrot example. In the code below you'll see that I have a CPU version of the code, which runs fine as well.
I've read several resources to get this up and running:
Random Numbers with CUDA
Writing Kernels in CUDA
The issue: This is my first foray into CUDA programming and I believe I have everything set up correctly. I'm getting one error that I cannot seem to figure out:
TypingError: cannot determine Numba type of <class 'object'>
I believe the LOC in question is:
bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu)
Attempts to resolve the issue: I won't go into full detail, but here are my attempts so far:
Thought it might have something to do with cuda.to_device(). I changed it around and I also called cuda.to_device_array_like(). I've used to_device() for all parameters, and for just a few. I've seen code samples where it's used for every parameter and sometimes not. So I'm not sure what should be done.
I've removed the random number generator for GPUs (create_xoroshiro128p_states) and just used a static value to test.
Explicitly assigning integers with int() (and not). Not sure why I tried this. I read that Numba only supports a limited set of data types, so I made sure they were ints:
Numba Supported Datatypes
A few other things I don't recall...
Apologies for the messy code; I'm a bit at my wits' end with this.
Below is the full code:
import numpy as np
from numpy import random
from numpy.random import randn
import pandas as pd
from timeit import default_timer as timer
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *

def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
    for i in range(boot_samp):
        rand_idx = random.randint(n_samp-1, size=(50))  # get random array of indices 0-49, with replacement
        out_mean[i] = dt_arry[rand_idx].mean()

@cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
    thread_id = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(thread_id, dt_arry.shape[0], stride):
        for k in range(0, n_samp-1, 1):
            rand_idx_arry[k] = int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)
        out_mean[thread_id] = dt_arry[rand_idx_arry].mean()

mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)

dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean

out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)

##################
# RUN ON CPU
##################
start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)

##################
# RUN ON GPU
##################
threads_per_block = 64
blocks_per_grid = 24

# create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1)

start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu_device.copy_to_host()
dt = timer() - start
print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)
You seem to have at least 4 issues:
In your kernel code, rand_idx_arry is undefined.
You can't do .mean() in cuda device code
Your kernel launch config parameters are reversed.
Your kernel had an incorrect range for the grid-stride loop. dt_arry.shape[0] is 50, so you were only populating the first 50 locations in your GPU output array. Just like your host code, the range for this grid-stride loop should be the size of the output array (which is boot_samp).
There may be other issues as well, but when I refactor your code like this to address those issues, it seems to run without error:
$ cat t65.py
#import matplotlib.pyplot as plt
import numpy as np
from numpy import random
from numpy.random import randn
from timeit import default_timer as timer
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *

def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
    for i in range(boot_samp):
        rand_idx = random.randint(n_samp-1, size=(50))  # get random array of indices 0-49, with replacement
        out_mean[i] = dt_arry[rand_idx].mean()

@cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
    thread_id = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(thread_id, out_mean.shape[0], stride):
        my_sum = 0.0
        for k in range(0, n_samp-1, 1):
            my_sum += dt_arry[int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)]
        out_mean[thread_id] = my_sum/(n_samp-1)

mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)

dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean
#plt.plot(dt_arry)
#figureData = plt.figure(1)
#plt.title('Plot ' + str(n_samp) + ' samples')
#plt.plot(dt_arry)
#figureData.show()

out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)

##################
# RUN ON CPU
##################
start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)

#figureMeanCpu = plt.figure(2)
#plt.title('Plot '+ str(boot_samp) + ' bootstrap means - CPU')
#plt.plot(out_mean_cpu)
#figureData.show()

##################
# RUN ON GPU
##################
threads_per_block = 64
blocks_per_grid = 24

#create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1)

start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[blocks_per_grid, threads_per_block](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu = out_mean_gpu_device.copy_to_host()
dt = timer() - start
print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)
$ python t65.py
CPU Bootstrap mean of 1000 mean samples: 10.148048544038735
Bootstrap CPU in 0.037496 s
GPU Bootstrap mean of 1000 mean samples: 10.145088765532936
Bootstrap GPU in 0.416822 s
$
Notes:
I've commented out a bunch of stuff that doesn't seem to be relevant. You might want to do something like that in the future when posting code (remove stuff that is not relevant to your question.)
I've fixed some things about your final GPU printout at the end, also.
I haven't studied your code carefully. I'm not suggesting anything is defect free. I'm just pointing out some issues and providing a guide for how they might be addressed. I can see the results don't match between CPU and GPU, but since I don't know what you're doing, and also because the random number generators don't match between CPU and GPU code, it's not obvious to me that things should match.
I have a large array of values packed in a 4D numpy array (thousands of values in x,y,z for thousands of times). For each of these values I need a 'color vector' (RGBA) from a matplotlib.cm.ScalarMappable object.
I've discovered that looping through such an array becomes rather slow, and I'm wondering if there's a way of significantly speeding it up by taking a different approach. For example, can an entire numpy array (greater than 2D) be passed to a ScalarMappable in order for this operation to take place in a more numpythonic or vectorized way?
My example code for a 3D case:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import timeit

def get_colors_1(data, x, y, z):
    colors = np.zeros((x, y, z, 4), dtype=np.float16)
    for i in range(x):
        for j in range(y):
            for k in range(z):
                colors[i, j, k, :] = m.to_rgba(data[i, j, k])
    return colors

def get_colors_2(data, x, y, z):
    colors = np.array([[[m.to_rgba(data[i, j, k]) for k in range(z)] for j in range(y)] for i in range(x)], dtype=np.float16)
    return colors

def get_colors_3(data, x, y, z):
    colors = np.zeros((x, y, z, 4), dtype=np.float16)
    for i in range(x):
        colors[i, :, :, :] = m.to_rgba(data[i, :, :])
    return colors

x, y, z = 30, 20, 10
data = np.random.rand(x, y, z)

cmap = matplotlib.cm.get_cmap('jet')
norm = matplotlib.colors.PowerNorm(vmin=0.0, vmax=1.0, gamma=2.5)
m = matplotlib.cm.ScalarMappable(norm=norm, cmap=cmap)

start_time = timeit.default_timer()
colors = get_colors_1(data, x, y, z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: ' + str(elapsed))

start_time = timeit.default_timer()
colors = get_colors_2(data, x, y, z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: ' + str(elapsed))

start_time = timeit.default_timer()
colors = get_colors_3(data, x, y, z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: ' + str(elapsed))
The third method (passing 2D arrays one at a time) shows a big performance bump, but I'm wondering if this can be pushed significantly further.
time elapsed: 0.5877857000014046
time elapsed: 0.5911024999986694
time elapsed: 0.004590500000631437
Looking at help(ScalarMappable.to_rgba):
def to_rgba(self, x, alpha=None, bytes=False, norm=True):

    In the normal case, x is a 1D or 2D sequence of scalars, and
    the corresponding ndarray of rgba values will be returned,
    based on the norm and colormap set for this ScalarMappable.

    There is one special case, for handling images that are already
    rgb or rgba, such as might have been read from an image file.
    If x is an ndarray with 3 dimensions ...
Is

sm = ScalarMappable(norm=None, cmap=cmap)
flat = data.reshape(-1)  # a view
rgba = sm.to_rgba(flat, bytes=True, norm=False).reshape(data.shape + (4,))

much faster?
(Source: ScalarMappable.to_rgba.)
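For reference, here is a quick way to check this on the question's setup, a sketch that reuses the data array and the ScalarMappable m defined there and keeps the PowerNorm (so it is a variant of, not exactly, the norm=False version above):

import timeit

# Assumes `data` and `m` from the question's script are in scope.
start_time = timeit.default_timer()
# Flatten to 1-D, colormap everything in one call, then restore the shape.
rgba = m.to_rgba(data.reshape(-1)).reshape(data.shape + (4,))
elapsed = timeit.default_timer() - start_time
print('time elapsed: ' + str(elapsed))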
While doing some experiments to parallelise three nested for loops with numba, I realised that a naive approach does not actually improve performance.
The following code produce the following times (in seconds):
0.154625177383 # no numba
0.420143127441 # numba first time (lazy initialisation)
0.196285963058 # numba second time
0.200047016144 # numba third time
0.199403047562 # numba fourth time
Any idea what I am doing wrong?
import numpy as np
from numba import jit, prange
import time

def run_1():
    dims = [100, 100, 100]
    a = np.zeros(dims)
    for x in range(100):
        for y in range(100):
            for z in range(100):
                a[x, y, z] = 1
    return a

@jit
def run_2():
    dims = [100, 100, 100]
    a = np.zeros(dims)
    for x in prange(100):
        for y in prange(100):
            for z in prange(100):
                a[x, y, z] = 1
    return a

if __name__ == '__main__':
    t = time.time()
    run_1()
    elapsed1 = time.time() - t
    print elapsed1

    t = time.time()
    run_2()
    elapsed2 = time.time() - t
    print elapsed2

    t = time.time()
    run_2()
    elapsed3 = time.time() - t
    print elapsed3

    t = time.time()
    run_2()
    elapsed3 = time.time() - t
    print elapsed3

    t = time.time()
    run_2()
    elapsed3 = time.time() - t
    print elapsed3
I wonder if there's any code to JIT in these loops: there's no non-trivial Python code to compile, only thin wrappers over C code (yes, range is C code). Possibly the JIT only adds overhead trying to profile and generate (unsuccessfully) more efficient code.
If you want a speed-up, think about parallelization using scipy, or maybe direct access to NumPy arrays from Cython.
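Worth noting (this is an addition, not part of the answer above): numba's prange only actually parallelizes a loop when the function is compiled with parallel=True; under a plain @jit it behaves like an ordinary range. A minimal sketch of that variant, with the timing taken after the first call so compilation is excluded:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def run_2_parallel():
    a = np.zeros((100, 100, 100))
    # Only the outer loop uses prange; numba distributes its iterations
    # across threads, while the inner loops stay ordinary range loops.
    for x in prange(100):
        for y in range(100):
            for z in range(100):
                a[x, y, z] = 1
    return a

run_2_parallel()  # first call triggers compilation; time only subsequent calls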
I have compared the different methods for convolving/correlating two signals using numpy/scipy. It turns out that there are huge differences in speed. I compared the following methods:
correlate from the numpy package (np.correlate in plot)
correlate from the scipy.signal package (sps.correlate in plot)
fftconvolve from scipy.signal (sps.fftconvolve in plot)
Now I of course understand that there is a considerable difference between fftconvolve and the other two functions. What I do not understand is why sps.correlate is so much slower than np.correlate. Does anybody know why scipy uses an implementation that is so much slower?
For completeness, here is the code that produces the plot:
import time
import numpy as np
import scipy.signal as sps
from matplotlib import pyplot as plt

if __name__ == '__main__':
    a = 10**(np.arange(10)/2)
    print(a)
    results = {}
    results['np.correlate'] = np.zeros(len(a))
    results['sps.correlate'] = np.zeros(len(a))
    results['sps.fftconvolve'] = np.zeros(len(a))
    ii = 0
    for length in a:
        sig = np.random.rand(length)

        t0 = time.clock()
        for jj in range(3):
            np.correlate(sig, sig, 'full')
        t1 = time.clock()
        elapsed = (t1-t0)/3
        results['np.correlate'][ii] = elapsed

        t0 = time.clock()
        for jj in range(3):
            sps.correlate(sig, sig, 'full')
        t1 = time.clock()
        elapsed = (t1-t0)/3
        results['sps.correlate'][ii] = elapsed

        t0 = time.clock()
        for jj in range(3):
            sps.fftconvolve(sig, sig, 'full')
        t1 = time.clock()
        elapsed = (t1-t0)/3
        results['sps.fftconvolve'][ii] = elapsed

        ii += 1

    ax = plt.figure()
    plt.loglog(a, results['np.correlate'], label='np.correlate')
    plt.loglog(a, results['sps.correlate'], label='sps.correlate')
    plt.loglog(a, results['sps.fftconvolve'], label='sps.fftconvolve')
    plt.xlabel('Signal length')
    plt.ylabel('Elapsed time in seconds')
    plt.legend()
    plt.grid()
    plt.show()
According to the documentation, numpy.correlate was designed for 1D arrays, while scipy.signal.correlate can accept N-dimensional arrays.
The scipy implementation, being more general and therefore more complex, does indeed seem to incur additional computational overhead. You can compare the C code of the numpy and scipy implementations.
Another difference could be, for instance, that the numpy implementation gets better vectorized by the compiler on modern processors.
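As a practical aside (this assumes a newer scipy than the one used for the plot): scipy.signal.correlate accepts a method argument, and scipy.signal.choose_conv_method picks between the direct and FFT implementations automatically, which largely removes the gap measured above for long signals:

import numpy as np
import scipy.signal as sps

sig = np.random.rand(100000)

# Ask scipy which implementation it would choose for these inputs.
print(sps.choose_conv_method(sig, sig, mode='full'))  # typically 'fft' for long signals

# Force the FFT-based path explicitly; method='auto' is the default in recent versions.
result = sps.correlate(sig, sig, mode='full', method='fft')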