Python Numba CUDA copy_to_host slow

I have recently started looking into using cuda for optimising searches over numeric arrays. I have a simplified piece of code below which demonstrates the issue.
import numpy as np
import time
from numba import cuda

@cuda.jit
def count_array4(device_array, pivot_point, device_output_array):
    for i in range(len(device_array)):
        if (pivot_point - 0.05) < device_array[i] < (pivot_point + 0.05):
            device_output_array[i] = True
        else:
            device_output_array[i] = False

width = 512
height = 512
size = width * height
print(f'Number of records {size}')

array_of_random = np.random.rand(size)
device_array = cuda.to_device(array_of_random)

start = time.perf_counter()
device_output_array = cuda.device_array(size)
print(f'Copy Host to Device: {time.perf_counter() - start}')

for x in range(10):
    start = time.perf_counter()
    count_array4[512, 512](device_array, .5, device_output_array)
    print(f'Run: {x} Time: {time.perf_counter() - start}')

start = time.perf_counter()
output_array = device_output_array.copy_to_host()
print(f'Copy Device to Host: {time.perf_counter() - start}')

print(np.sum(output_array))
This gives me the expected speedup in processing; however, the time it takes to return the data to the host seems extremely high.
Number of records 262144
Copy Host to Device: 0.00031610000000004135
Run: 0 Time: 0.0958601
Run: 1 Time: 0.0001626999999999601
Run: 2 Time: 0.00012100000000003774
Run: 3 Time: 0.00011590000000005762
Run: 4 Time: 0.00011419999999995323
Run: 5 Time: 0.0001126999999999656
Run: 6 Time: 0.00011289999999997136
Run: 7 Time: 0.0001122999999999541
Run: 8 Time: 0.00011490000000002887
Run: 9 Time: 0.00011200000000000099
Copy Device to Host: 13.0583358
26110.0
I'm fairly sure that I'm missing something basic here, or a technique that I don't know the correct term to search for. If anyone can point me in the right direction I'd be very grateful.

Kernel launches are asynchronous and the driver can queue multiple launches. As a result, you are measuring only kernel launch overhead within the loop, and then the data transfer, which is a blocking call, captures all the kernel execution time. You can change this behaviour by modifying your code like this:
for x in range(10):
    start = time.perf_counter()
    count_array4[512, 512](device_array, .5, device_output_array)
    cuda.synchronize()
    print(f'Run: {x} Time: {time.perf_counter() - start}')
The synchronize call ensures each kernel launch is completed and the device is idle before another kernel launches. The effect should be that each reported kernel run time will increase, and the indicated transfer time will decrease.
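For completeness, here is a minimal sketch (reusing the variable names from the question, and assuming a warm-up launch has already triggered JIT compilation) that also synchronizes before starting the copy timer, so the device-to-host transfer is measured in isolation:

# Warm-up launch so JIT compilation is not included in the timings.
count_array4[512, 512](device_array, .5, device_output_array)
cuda.synchronize()

# Time the kernel alone.
start = time.perf_counter()
count_array4[512, 512](device_array, .5, device_output_array)
cuda.synchronize()
print(f'Kernel: {time.perf_counter() - start}')

# Time the device-to-host copy alone; the device is already idle here,
# so copy_to_host() no longer absorbs pending kernel execution time.
start = time.perf_counter()
output_array = device_output_array.copy_to_host()
print(f'Copy Device to Host: {time.perf_counter() - start}')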

Related

emcee issue MPI while sharing an executable file

So I want to run a Python script that uses the emcee library with MPI. I ran into some issues running it on the server, so first, to verify that MPI works well with the emcee library and the nodes of the server, I created this MWE:
import sys
import time
import emcee
import numpy as np
from schwimmbad import MPIPool

def log_prob(theta):
    t = time.time() + np.random.uniform(0.005, 0.008)
    while True:
        if time.time() >= t:
            break
    return -0.5*np.sum(theta**2)

with MPIPool() as pool:
    if not pool.is_master():
        pool.wait()
        sys.exit(0)
    np.random.seed(42)
    initial = np.random.randn(32, 5)
    nwalkers, ndim = initial.shape
    nsteps = 100
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
    start = time.time()
    sampler.run_mcmc(initial, nsteps)
    end = time.time()
    print(end - start)
When running with
srun --mpi=pmi2 -n 40 python3 MWE.py
and
srun --mpi=pmi2 -n 2 python3 MWE.py
I got 2.2 and 20 seconds of execution respectively, which hints that it works correctly.
Now my code is sort of too long to write but here is the structure:
import os
os.environ["OMP_NUM_THREADS"] = "1"  # I have also tried without it
import emcee as emcee
import numpy as np

# a function that calls a C++ executable via os.system
def AG_plotter(angls, n, efre, efrb, gg, ee, ec):
    os.system('LD_LIBRARY_PATH=/home/..cpp/lib; export LD_LIBRARY_PATH; ./a.out DATA_FILE_TO_PRINT'+str(nu)+str(angls)+gg+ee+n+efre+efrb+ec+'.dat')
    data = np.loadtxt('DATA_FILE_TO_PRINT'+str(nu[i])+str(angls[j])+gg+ee+n+efre+efrb+ec+'.dat')
    return data

def func(theta, gg, ee):
    ang, n, efre, efrb, ec = theta
    data_to_check = AG_plotter(ang, n, efre, efrb, gg, ee, ec)
    .
    # things with data_to_check #
    .
    return -res  # assume prior(theta) = 0 everywhere

# I pass some paths, so I want to run MPIPool for each path
for i in range(len(pathss)):
    # everything here is according to the previous examples
    with MPIPool() as pool:
        if not pool.is_master():
            pool.wait()
            sys.exit(0)
        gg = 'g'+pathss[i]+'.txt'
        ee = 'e'+pathss[i]+'.txt'
        ndim, nwalkers = len(initial), 100
        sampler = emcee.EnsembleSampler(nwalkers, ndim, func, args=[gg, ee])
        print("Running first burn-in...")
        p0 = initial + 1e-5 * np.random.randn(nwalkers, ndim)
        p0, lp, _ = sampler.run_mcmc(p0, 15)
This code should print a file named DATA_FILE_TO_PRINT...dat after each call. But what I saw is that, no matter the number of cores, it always prints one file and takes ages to move on.
I expected each 'cycle' of the function to print n files, where n is the number of cores I used.
Anything obvious I'm doing wrong?

Cuda Python Error: TypingError: cannot determine Numba type of <class 'object'>

Background: I'm trying to create a simple bootstrap function for sampling means with replacement. I want to parallelize the function since I will eventually be deploying this on data with millions of data points and will want much larger sample sizes. I've run other examples, such as the Mandelbrot example. In the code below you'll see that I have a CPU version of the code, which runs fine as well.
I've read several resources to get this up and running:
Random Numbers with CUDA
Writing Kernels in CUDA
The issue: This is my first foray into CUDA programming and I believe I have everything set up correctly. I'm getting one error that I cannot seem to figure out:
TypingError: cannot determine Numba type of <class 'object'>
I believe the LOC in question is:
bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu)
Attempts to resolve the issue: I won't go into full detail, but here are the following attempts
Thought it might have something to do with cuda.to_device(). I changed it around and I also called cuda.to_device_array_like(). I've used to_device() for all parameters, and for just a few. I've seen code samples where it's used for every parameter and sometimes not. So I'm not sure what should be done.
I've removed the random number generator for GPUs (create_xoroshiro128p_states) and just used a static value to test.
Explicitly assigning integers with int() (and not). Not sure why I tried this. I read that Numba only supports a limited set of data types, so I made sure that they were ints.
Numba Supported Datatypes
Few other things I don't recall...
Apologies for the messy code. I'm a bit at my wits' end on this.
Below is the full code:
import numpy as np
from numpy import random
from numpy.random import randn
import pandas as pd
from timeit import default_timer as timer
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *

def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
    for i in range(boot_samp):
        rand_idx = random.randint(n_samp-1, size=(50))  # get random array of indices 0-49, with replacement
        out_mean[i] = dt_arry[rand_idx].mean()

@cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
    thread_id = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(thread_id, dt_arry.shape[0], stride):
        for k in range(0, n_samp-1, 1):
            rand_idx_arry[k] = int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)
        out_mean[thread_id] = dt_arry[rand_idx_arry].mean()

mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)

dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean

out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)

##################
# RUN ON CPU
##################
start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)

##################
# RUN ON GPU
##################
threads_per_block = 64
blocks_per_grid = 24

# create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1)

start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[threads_per_block, blocks_per_grid](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu_device.copy_to_host()
dt = timer() - start
print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)
You seem to have at least 4 issues:
1. In your kernel code, rand_idx_arry is undefined.
2. You can't do .mean() in CUDA device code.
3. Your kernel launch config parameters are reversed.
4. Your kernel had an incorrect range for the grid-stride loop. dt_arry.shape[0] is 50, so you were only populating the first 50 locations in your GPU output array. Just like your host code, the range for this grid-stride loop should be the size of the output array (which is boot_samp).
There may be other issues as well, but when I refactor your code like this to address those issues, it seems to run without error:
$ cat t65.py
#import matplotlib.pyplot as plt
import numpy as np
from numpy import random
from numpy.random import randn
from timeit import default_timer as timer
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
from numba import *

def bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean):
    for i in range(boot_samp):
        rand_idx = random.randint(n_samp-1, size=(50))  # get random array of indices 0-49, with replacement
        out_mean[i] = dt_arry[rand_idx].mean()

@cuda.jit
def bootstrap_rand_gpu(rng_states, dt_arry, n_samp, out_mean):
    thread_id = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(thread_id, out_mean.shape[0], stride):
        my_sum = 0.0
        for k in range(0, n_samp-1, 1):
            my_sum += dt_arry[int(xoroshiro128p_uniform_float32(rng_states, thread_id) * 49)]
        out_mean[thread_id] = my_sum/(n_samp-1)

mean = 10
rand_fluc = 3
n_samp = int(50)
boot_samp = int(1000)

dt_arry = (random.rand(n_samp)-.5)*rand_fluc + mean
#plt.plot(dt_arry)
#figureData = plt.figure(1)
#plt.title('Plot ' + str(n_samp) + ' samples')
#plt.plot(dt_arry)
#figureData.show()

out_mean_cpu = np.empty(boot_samp)
out_mean_gpu = np.empty(boot_samp)

##################
# RUN ON CPU
##################
start = timer()
bootstrap_rand_cpu(dt_arry, n_samp, boot_samp, out_mean_cpu)
dt = timer() - start
print("CPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_cpu.mean()))
print("Bootstrap CPU in %f s" % dt)

#figureMeanCpu = plt.figure(2)
#plt.title('Plot '+ str(boot_samp) + ' bootstrap means - CPU')
#plt.plot(out_mean_cpu)
#figureData.show()

##################
# RUN ON GPU
##################
threads_per_block = 64
blocks_per_grid = 24

# create random state for each state in the array
rng_states = create_xoroshiro128p_states(threads_per_block * blocks_per_grid, seed=1)

start = timer()
dt_arry_device = cuda.to_device(dt_arry)
out_mean_gpu_device = cuda.to_device(out_mean_gpu)
bootstrap_rand_gpu[blocks_per_grid, threads_per_block](rng_states, dt_arry_device, n_samp, out_mean_gpu_device)
out_mean_gpu = out_mean_gpu_device.copy_to_host()
dt = timer() - start
print("GPU Bootstrap mean of " + str(boot_samp) + " mean samples: " + str(out_mean_gpu.mean()))
print("Bootstrap GPU in %f s" % dt)
$ python t65.py
CPU Bootstrap mean of 1000 mean samples: 10.148048544038735
Bootstrap CPU in 0.037496 s
GPU Bootstrap mean of 1000 mean samples: 10.145088765532936
Bootstrap GPU in 0.416822 s
$
Notes:
I've commented out a bunch of stuff that doesn't seem to be relevant. You might want to do something like that in the future when posting code (remove stuff that is not relevant to your question.)
I've fixed some things about your final GPU printout at the end, also.
I haven't studied your code carefully. I'm not suggesting anything is defect free. I'm just pointing out some issues and providing a guide for how they might be addressed. I can see the results don't match between CPU and GPU, but since I don't know what you're doing, and also because the random number generators don't match between CPU and GPU code, it's not obvious to me that things should match.

Why is this Python script with Matplotlib so slow?

I'm trying to simulate coin tosses and profits and plot the graph in matplotlib:
from random import choice
import matplotlib.pyplot as plt
import time

start_time = time.time()

num_of_graphs = 2000
tries = 2000
coins = [150, -100]
last_loss = 0

for a in range(num_of_graphs):
    profit = 0
    line = []
    for i in range(tries):
        profit = profit + choice(coins)
        if (profit < 0 and last_loss < i):
            last_loss = i
        line.append(profit)
    plt.plot(line)

plt.show()

print("--- %s seconds ---" % (time.time() - start_time))
print("No losses after " + str(last_loss) + " iterations")
The end result is
--- 9.30498194695 seconds ---
No losses after 310 iterations
Why is it taking so long to run this script? If I change num_of_graphs to 10000, the script never finishes.
How would you optimize this?
Your measure of execution time is too rough. The following allows you to measure the time needed for the simulation separately from the time needed for plotting; it also uses numpy.
import matplotlib.pyplot as plt
import numpy as np
import time

def run_sims(num_sims, num_flips):
    start = time.time()
    sims = [np.random.choice(coins, num_flips).cumsum() for _ in range(num_sims)]
    end = time.time()
    print(f"sim time = {end-start}")
    return sims

def plot_sims(sims):
    start = time.time()
    for line in sims:
        plt.plot(line)
    end = time.time()
    print(f"plotting time = {end-start}")
    plt.show()

if __name__ == '__main__':
    start_time = time.time()
    num_sims = 2000
    num_flips = 2000
    coins = np.array([150, -100])
    plot_sims(run_sims(num_sims, num_flips))
result:
sim time = 0.13962197303771973
plotting time = 6.621474981307983
As you can see, the sim time is greatly reduced (it was on the order of 7 seconds on my 2011 laptop); the plotting time is matplotlib-dependent.
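If the plotting time itself matters, one common trick (not part of the answer above; a sketch assuming every simulation has the same number of flips, with a hypothetical helper name plot_sims_fast) is to pack all lines into a single LineCollection instead of calling plt.plot once per simulation:

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection

def plot_sims_fast(sims):
    # One LineCollection creates a single artist for all simulations,
    # instead of one Line2D object per call to plt.plot.
    fig, ax = plt.subplots()
    x = np.arange(len(sims[0]))
    segments = [np.column_stack([x, y]) for y in sims]
    ax.add_collection(LineCollection(segments, linewidths=0.5))
    ax.autoscale()  # add_collection does not rescale the view by itself
    plt.show()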
matplotlib is getting slower as the script progresses because it is redrawing all of the lines that you have previously plotted - even the ones that have scrolled off the screen.
This is from a previous answer by Simon Gibbons.
matplotlib isn't optimized for speed but rather for graphics quality. Here are links to a few libraries that were developed for speed:
http://www.pyqtgraph.org/
http://code.google.com/p/guiqwt/
http://code.enthought.com/projects/chaco/
You can refer to the matplotlib cookbook for more about performance.
In order to better optimize your code, I would always try to replace loops by vectorization using numpy or, depending on my specific needs, other libraries that use numpy under the hood.
In this case, you could calculate and plot your profits this way:
import matplotlib.pyplot as plt
import time
import numpy as np

start_time = time.time()

num_of_graphs = 2000
tries = 2000
coins = [150, -100]

# Create a 2-D array with random choices:
# rows for tries, columns for individual runs (graphs).
coin_tosses = np.random.choice(coins, (tries, num_of_graphs))

# Calculate a 2-D array of profits by summing
# cumulatively over rows (tries).
profits = coin_tosses.cumsum(axis=0)

# Plot everything in one shot.
plt.plot(profits)
plt.show()

print("--- %s seconds ---" % (time.time() - start_time))
In my configuration, this code took approx. 6.3 seconds (6.2 plotting) to run, while your code took almost 15 seconds.

Why increasing number of workers (more than number of cores) still decrease execution time?

I was always sure that there is no point in having more threads/processes than CPU cores (from a performance perspective). However, my Python sample shows me a different result.
import concurrent.futures
import random
import time

def doSomething(task_num):
    print("executing...", task_num)
    time.sleep(1)  # simulate heavy operation that takes ~ 1 second
    return random.randint(1, 10) * random.randint(1, 500)  # real operation, used random to avoid caches and so on...

def main():
    # This part is not taken in consideration because I don't want to
    # measure the worker creation time
    executor = concurrent.futures.ProcessPoolExecutor(max_workers=60)

    start_time = time.time()
    for i in range(1, 100):  # execute 100 tasks
        executor.map(doSomething, [i, ])
    executor.shutdown(wait=True)
    print("--- %s seconds ---" % (time.time() - start_time))

if __name__ == '__main__':
    main()
Program results:
1 WORKER --- 100.28233647346497 seconds ---
2 WORKERS --- 50.26122164726257 seconds ---
3 WORKERS --- 33.32741022109985 seconds ---
4 WORKERS --- 25.399883031845093 seconds ---
5 WORKERS --- 20.434186220169067 seconds ---
10 WORKERS--- 10.903695344924927 seconds ---
50 WORKERS--- 6.363946914672852 seconds ---
60 WORKERS--- 4.819359302520752 seconds ---
How can this work faster with just 4 logical processors?
Here are my computer specifications (tested on Windows 8 and Ubuntu 14):
CPU Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
Sockets: 1
Cores: 2
Logical processors: 4
The reason is that sleep() uses only a negligible amount of CPU. In this case, it is a poor simulation of actual work performed by a thread.
All sleep() really does is suspend the thread until the timer expires. While the thread is suspended, it doesn't use any CPU cycles.
I extended your example with a more intensive computation (e.g. matrix inversion). You will see what you expected: the computation time decreases up to the number of cores and increases afterwards (because of the cost of context switching).
import concurrent.futures
import random
import time
import numpy as np
import matplotlib.pyplot as plt

def doSomething(task_num):
    print("executing...", task_num)
    for i in range(100000):
        A = np.random.normal(0, 1, (1000, 1000))
        B = np.linalg.inv(A)
    return random.randint(1, 10) * random.randint(1, 500)  # real operation, used random to avoid caches and so on...

def measureTime(nWorkers: int):
    executor = concurrent.futures.ProcessPoolExecutor(max_workers=nWorkers)
    start_time = time.time()
    for i in range(1, 40):  # execute the tasks
        executor.map(doSomething, [i, ])
    executor.shutdown(wait=True)
    return (time.time() - start_time)

def main():
    # This part is not taken in consideration because I don't want to
    # measure the worker creation time
    maxWorkers = 20
    dT = np.zeros(maxWorkers)
    for i in range(maxWorkers):
        dT[i] = measureTime(i+1)
        print("--- %s seconds ---" % dT[i])
    plt.plot(np.linspace(1, maxWorkers, maxWorkers), dT)
    plt.show()

if __name__ == '__main__':
    main()

Why does my parallel performance top out?

I've been playing around with Python a lot lately, and in comparing numerous parallelization packages, I noticed that the performance increase from serial to parallel seems to top out at 6 processes instead of 8--the number of cores my MacBook Pro (OS X 10.8.2) has.
The attached plot compares the timing of different tasks as a function of the number of processes (parallel or sequential). This example uses the Python built-in 'multiprocessing' package. 'Memory' vs. 'Processor' refers to memory-intensive (just allocating large arrays) vs. computationally intensive (many operations) functions.
What is the cause of the top-out below 8 processes?
(The 'Time's are averaged over 100 function calls for each number of processes)
import multiprocessing as mp
import time
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt

iters = 100
mem_num = 1000
pro_num = 20000
max_procs = 10

line_width = 2.0
legend_size = 10
fig_name = 'timing.pdf'

def UseMemory(num):
    test1 = np.zeros([num,num])
    test2 = np.arange(num*num)
    test3 = np.array(test2).reshape([num, num])
    test4 = np.empty(num, dtype=object)
    return

def UseProcessor(num):
    test1 = np.arange(num)
    test1 = np.cos(test1)
    test1 = np.sqrt(np.fabs(test1))
    test2 = np.zeros(num)
    for i in range(num): test2[i] = test1[i]
    return np.std(test2)

def MemJob(its):
    for ii in range(its): UseMemory(mem_num)

def ProJob(its):
    for ii in range(iters): UseProcessor(pro_num)

if __name__ == "__main__":

    print '\nParTest\n'

    proc_range = np.arange(1, max_procs+1, step=1)

    test_times = np.zeros([len(proc_range),2,2])   # test_times[num_procs][0-ser,1-par][0-mem,1-pro]
    tot_times  = np.zeros([len(proc_range),2  ])   # tot_times[num_procs][0-ser,1-par]

    print ' Testing %2d numbers of processors between [%d,%d]' % (len(proc_range), 1, max_procs)
    print ' Iterations %d, Memory Length %d, Processor Length %d' % (iters, mem_num, pro_num)

    for it in range(len(proc_range)):
        procs = proc_range[it]
        job_arg = procs*[iters]
        print '\n - %2d, Processes = %3d' % (it, procs)

        # --- Test Serial ---
        print ' - - Serial'
        # Test Memory
        all_start = time.time()
        start = time.time()
        map(MemJob, [procs*iters])
        ser_mem_time = time.time() - start

        # Test Processor
        start = time.time()
        map(ProJob, job_arg)
        ser_pro_time = time.time() - start

        ser_time = time.time() - all_start

        # --- Test Parallel : multiprocessing ---
        print ' - - Parallel: multiprocessing'
        pool = mp.Pool(processes=procs)

        # Test Memory
        all_start = time.time()
        start = time.time()
        pool.map(MemJob, job_arg)
        par_mem_time = time.time() - start

        # Test Processor
        start = time.time()
        pool.map(ProJob, job_arg)
        par_pro_time = time.time() - start

        par_time = time.time() - all_start

        print ' - - Collecting'
        ser_mem_time /= procs
        ser_pro_time /= procs
        par_mem_time /= procs
        par_pro_time /= procs
        ser_time     /= procs
        par_time     /= procs

        test_times[it][0] = [ ser_mem_time, ser_pro_time ]
        test_times[it][1] = [ par_mem_time, par_pro_time ]
        tot_times[it]     = [ ser_time    , par_time     ]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.set_xlabel('Number of Processes')
    ax.set_ylabel('Time [s]')
    ax.xaxis.grid(True)
    ax.yaxis.grid(True)
    lines = []
    names = []

    l1, = ax.plot(proc_range, test_times[:,0,0], linewidth=line_width)
    lines.append(l1)
    names.append('Serial Memory')
    l1, = ax.plot(proc_range, test_times[:,0,1], linewidth=line_width)
    lines.append(l1)
    names.append('Serial Processor')
    l1, = ax.plot(proc_range, tot_times[:,0], linewidth=line_width)
    lines.append(l1)
    names.append('Serial')
    l1, = ax.plot(proc_range, test_times[:,1,0], linewidth=line_width)
    lines.append(l1)
    names.append('Parallel Memory')
    l1, = ax.plot(proc_range, test_times[:,1,1], linewidth=line_width)
    lines.append(l1)
    names.append('Parallel Processor')
    l1, = ax.plot(proc_range, tot_times[:,1], linewidth=line_width)
    lines.append(l1)
    names.append('Parallel')

    plt.legend(lines, names, ncol=2, prop={'size':legend_size}, fancybox=True, shadow=True, bbox_to_anchor=(1.10, 1.10))
    fig.savefig(fig_name, dpi=fig.get_dpi())
    print ' - Saved to ', fig_name
    plt.show(block=True)
From the discussion above I think you have the information you need, but I'm adding an answer to collect facts in case it benefits others (plus I wanted to work it through myself). (Due credit to @bamboon who mentioned some of this first.)
First, your MacBook has a CPU with four physical cores, but the design of the chip is such that each core's hardware has the ability to run two threads. This is called "simultaneous multithreading" (SMT) and in this case is embodied by Intel's hyperthreading feature. So taken all together you have 8 "virtual cores" (4 + 4 = 8).
Note that the OS treats all the virtual cores the same, i.e. it does not distinguish between the two SMT threads offered by a physical core, and that's why sysctl returns 8 when you query it. Python will do the same thing:
>>> import multiprocessing
>>> multiprocessing.cpu_count()
8
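If you want to distinguish physical from logical cores in Python, here is a small sketch; it assumes the third-party psutil package is installed, which is not part of the question's code:

import psutil

print(psutil.cpu_count(logical=True))   # logical (SMT/"virtual") cores, e.g. 8
print(psutil.cpu_count(logical=False))  # physical cores, e.g. 4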
Second, the speedup limit you're encountering is a well-known phenomenon in which parallel performance saturates and does not improve with the addition of more processors working on the problem. This effect is described by Amdahl's Law, a quantitative statement about how much speedup to expect from multiple processors depending on how much code can be parallelized and how much runs serially.
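As a concrete illustration (a sketch added here, not part of the original answer), Amdahl's Law says that if a fraction p of the work can be parallelized, the best possible speedup on n processors is 1 / ((1 - p) + p / n):

def amdahl_speedup(p, n):
    """Upper bound on speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the code parallelized, 8 workers give at most ~5.9x,
# and the limit as n grows is 1 / (1 - p) = 20x.
for n in (2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.95, n), 2))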
Typically a number of factors limit relative speedup, including details of the OS and even the computer's architecture (e.g. how SMT works in a hardware core), so that even if you parallelize as much of your code as you can, your performance will not scale indefinitely. Understanding where the serial bottleneck is can require very detailed analysis of your program and its running environment.
You can find a good example with discussion in this question.
I hope this helps.
