I have built a backtester in Python which does a complete run in 70 ms. On a single thread (using a plain for loop) it shows the expected performance (around 70 ms per iteration):
for q in queue:
    print(backtest(alldata1h, alldata, q['strats'], q['filts']))
My problem is the following: whenever I try to run this function using multiprocessing, the performance is much worse (~ 800 ms per backtest).
I have tried doing this using an array of Process and Queue objects:
def do_workload(q, wk, a1h, ad):
    for w in wk:
        c = backtest(a1h, ad, w['strats'], w['filts'])
        q.put({"Strategy": w['sname'], "Filter": w['fname'], "c": c})
    q.put('DONE')

for i in range(thread_nr):
    thread_pool.append({"Process": "", "Queue": "", "workload": workloads[i], "workindex": 0, "finished": False})
    thread_pool[i]['Queue'] = Queue()
    thread_pool[i]['Process'] = Process(target=do_workload, args=(thread_pool[i]['Queue'], workloads[i], alldata1h, alldata))
    thread_pool[i]['Process'].start()
print("Total workload: {} backtests".format(len(queue)))
while queue_index < len(queue):
for t in range(len(thread_pool)):
time.sleep(0.1)
if thread_pool[t]['finished'] == False:
while not thread_pool[t]['Queue'].empty():
res = thread_pool[t]['Queue'].get()
if res == "DONE":
thread_pool[t]['finished'] = True
else:
final_results = final_results.append(res, ignore_index=True)
queue_index += 1
print("Read from threads: {}/{}".format(queue_index, len(queue)))
time.sleep(10)
print("DONE")
and I have also tried this using a Pool object:
print("Total workload: {} backtests".format(len(queue)))
from functools import partial
target = partial(do_workload, a1h=alldata1h, ad=alldata)
pool = Pool(processes=thread_nr)
print("Starting pool...")
print(len(pool.map(target, workloads, len(workloads[0]))))
My processor has 64 cores and 128 threads, so I gave it a high thread_nr (around 100-120), but performance is still terrible.
My question is the following: is there a way to improve Python multiprocessing enough to achieve 70 ms per backtest (per process)? Or should I rewrite the whole project (backtester and process manager) in C++ to achieve the best performance, using all available threads / the whole CPU?
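A common cause of slowdowns like this is that the large alldata1h and alldata objects get pickled and shipped to the workers with every single task. A minimal sketch of one way to rule that out, assuming the datasets are read-only during the run (init_worker, run_one and the module globals are illustrative names, not from the original code):

from multiprocessing import Pool

_a1h = None
_ad = None

def init_worker(a1h, ad):
    # Runs once per worker: stash the big datasets in module globals so
    # they are transferred to each worker a single time, not per task.
    global _a1h, _ad
    _a1h, _ad = a1h, ad

def run_one(w):
    # Each task now only carries the small work description `w`.
    return backtest(_a1h, _ad, w['strats'], w['filts'])

if __name__ == '__main__':
    with Pool(processes=thread_nr, initializer=init_worker,
              initargs=(alldata1h, alldata)) as pool:
        results = pool.map(run_one, queue, chunksize=32)

If the per-backtest time stays near 70 ms with this layout, the overhead was in argument pickling rather than in the backtester itself.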
Related
I am trying to improve the speed of some code with multiprocessing, and I noticed that the speed does not increase as expected. I know there is overhead for spawning child processes and for transferring data between the parent and child processes. However, even after I minimized that overhead, the performance with multiprocessing is still not what I expected. So I wrote a simple test:
import multiprocessing
import numpy as np
import time

def test_function():
    start_time = time.time()
    n = 1000
    x = np.random.rand(n, n)
    p = np.random.rand(n, n)
    y = 0
    for i in range(n):
        for j in range(n):
            y += np.power(x[i][j], p[i][j])
    print("= Running time:", time.time() - start_time)
    return

def main():
    procs = [1, 2, 3, 4, 5, 6]
    for proc in procs:
        print("Number of process:", proc)
        pool = multiprocessing.Pool(processes=proc)
        para = [(),] * proc
        pool.starmap(test_function, para)
        pool.close()
        pool.join()

if __name__ == '__main__':
    main()
You can see that the test function only has two loops and some mathematical computation. There is no data transfer between the main process and the child processes, and the time is measured inside the child process, so no transfer overhead is included. Here is the output:
Number of process: 1
= Running time: 4.253360033035278
Number of process: 2
= Running time: 4.404280185699463
= Running time: 4.411274671554565
Number of process: 3
= Running time: 4.580170154571533
= Running time: 4.59316349029541
= Running time: 4.610152959823608
Number of process: 4
= Running time: 4.908967733383179
= Running time: 4.926954030990601
= Running time: 4.997913122177124
= Running time: 5.09885048866272
Number of process: 5
= Running time: 5.406658172607422
= Running time: 5.441636562347412
= Running time: 5.4576287269592285
= Running time: 5.473618030548096
= Running time: 5.621527671813965
Number of process: 6
= Running time: 6.195171594619751
= Running time: 6.225149869918823
= Running time: 6.256133079528809
= Running time: 6.290108919143677
= Running time: 6.339082717895508
= Running time: 6.3710620403289795
The code is executed under Windows 10 on an i7 CPU with 4 cores and 8 logical processors. Obviously the running time of each process increases as the number of processes increases. Is this caused by the operating system, by limitations of the CPU itself, or by other hardware?
Update: here is the output in a Linux environment. Interestingly, with 5 processes two of the timings jump sharply, and with 6 processes four of them do. It seems to be related to the logical processors: do the physical cores have to switch/share resources between the logical processors?
Number of process: 1
= Running time: 4.039047479629517
Number of process: 2
= Running time: 4.150756597518921
= Running time: 4.159530878067017
Number of process: 3
= Running time: 4.228744745254517
= Running time: 4.261997938156128
= Running time: 4.324823379516602
Number of process: 4
= Running time: 4.342475891113281
= Running time: 4.347326755523682
= Running time: 4.350982427597046
= Running time: 4.370999574661255
Number of process: 5
= Running time: 4.369337797164917
= Running time: 4.391499757766724
= Running time: 4.43767237663269
= Running time: 6.300408124923706
= Running time: 6.31215763092041
Number of process: 6
= Running time: 4.366948366165161
= Running time: 4.38712739944458
= Running time: 6.366809844970703
= Running time: 6.370593786239624
= Running time: 6.422687530517578
= Running time: 6.433435916900635
Short answer: the slowdowns at low process counts are likely caused by the OS, but you haven't provided the data needed for a real analysis.
Long answer: that would entail an introduction to operating systems.
In your post, you claim
the time is calculated inside the child process, so no overhead will be included.
This is false. You calculate elapsed time, a.k.a. "wall-clock time". Any OS overhead is included in that time: garbage collection, context switching, etc.
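One way to see the difference, as a small sketch: time.process_time() counts only the CPU time actually consumed by the current process, so comparing it against wall-clock time exposes how much was lost to scheduling and other processes:

import time

wall_start = time.time()
cpu_start = time.process_time()

total = sum(i * i for i in range(10_000_000))  # stand-in workload

wall = time.time() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall-clock: {wall:.3f}s  CPU time: {cpu:.3f}s  lost to the OS: {wall - cpu:.3f}s")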
To properly profile your application, you need to profile the whole system, not merely one program. What else is running on your system while these processes execute? Since this is Windows, it's virtually guaranteed that your four cores have things to do other than the Python RTE (run-time environment). To understand what happens in your multi-process application, run a dynamic profiler and watch which processes are active while the Python processes run. Graph the activity by process or job; I expect you'll see several system services working as well.
For a simpler, less accurate metric of process activity, look up how to extract from Windows the CPU consumption for each process.
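As a sketch of that simpler metric, assuming the third-party psutil package is acceptable (it is cross-platform and not part of the standard library):

import time
import psutil

# The first cpu_percent() call primes the per-process counters; after a
# sampling interval, the second call reports usage over that interval.
procs = list(psutil.process_iter(['name']))
for p in procs:
    try:
        p.cpu_percent(interval=None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

time.sleep(1.0)

usage = []
for p in procs:
    try:
        usage.append((p.cpu_percent(interval=None), p.info['name']))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

# Show the ten busiest processes over the sampling interval
for pct, name in sorted(usage, reverse=True)[:10]:
    print(f"{pct:6.1f}%  {name}")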
Have you figured out the problem? I had the same issue with multiprocessing. I found that if you add a certain delay (not too small) between starting the different processes, the time consumption of each process drops to the same value as for a single process. However, we then gain nothing from multiprocessing because of the delay. It's really confusing.
import multiprocessing as mp
import numpy as np
import time

def test_function():
    start_time = time.time()
    n = 1000
    x = np.random.rand(n, n)
    p = np.random.rand(n, n)
    y = 0
    for i in range(n):
        for j in range(n):
            y += np.power(x[i][j], p[i][j])
    print("= Running time:", time.time() - start_time)
    return

def main():
    N = 6
    procs = []
    for ii in range(N):
        procs.append(mp.Process(target=test_function))
    for p in procs:
        p.start()
        time.sleep(2)  # stagger the process starts
    for p in procs:
        p.join()

if __name__ == '__main__':
    main()
I think we should not focus on the execution time of each single process. Instead, it is more meaningful to look at the total time consumption for a given job. Please check out the following code:
import multiprocessing as mp
import numpy as np
import time

def test_function(x, p, N_loop):
    start_time = time.time()
    y = 0
    for i in range(N_loop):
        y += np.power(x, p)
    print("= Running time:", time.time() - start_time)
    return

def main():
    N_total = 6000              # total loops
    N_core = 6                  # number of processes
    Ni = int(N_total / N_core)  # loops for each process
    # data
    n = 200
    x = np.random.rand(n, n)
    p = np.random.rand(n, n)
    procs = []
    for ii in range(N_core):
        procs.append(mp.Process(target=test_function, args=(x, p, Ni)))
    st = time.time()
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    print(f'total time: {time.time()-st}')

if __name__ == '__main__':
    main()
The above code computes the summation of pow(x, p) 6000 times in total. The total time t6 for N_core=6 is significantly less than t1 for N_core=1, although t6 > (t1 / 6). So six processes do not divide the time by six. The reason is likely that the cores share common resources (the memory subsystem, caches, and the frequency/turbo budget), and with hyper-threading two logical processors share one physical core, so per-process times grow as more processes run, even though total throughput improves.
I am currently generating a nested dictionary that saves some arrays by using a nested for loop. Unfortunately, it takes quite some time; I realized that the server I am working on has a few cores available, so I was wondering if Python's multiprocessing library could be helpful to speed up the creation of the dictionary.
The nested for loop looks something like this (the actual computation is heavier and more complex):
import numpy as np

data_dict = {}
for s in range(1, 5):
    data_dict[s] = {}
    for d in range(1, 5):
        if s * d > 4:
            data_dict[s][d] = np.zeros((s, d))
        else:
            data_dict[s][d] = np.ones((s, d))
So this is what I tried:
from multiprocessing import Pool
import numpy as np

data_dict = {}

def process():
    #sci=fits.open('{}.fits'.format(name))
    for s in range(1, 5):
        data_dict[s] = {}
        for d in range(1, 5):
            if s * d > 4:
                data_dict[s][d] = np.zeros((s, d))
            else:
                data_dict[s][d] = np.ones((s, d))

if __name__ == '__main__':
    pool = Pool()  # Create a multiprocessing Pool
    pool.map(process)
But pool.map (the last line) requires an iterable, and I'm not sure what to pass there.
In my opinion, the real problem is what kind of processing is needed to compute the entries of the dictionary, and how many entries there are.
The kind of processing is essential to understand whether multiprocessing can significantly speed up the creation of the dictionary. If your computation is I/O bound, you should use multithreading, while if it's CPU bound you should use multiprocessing. You can find more about this here. A quick sketch of that distinction follows below.
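The two task bodies in this sketch are illustrative stand-ins, not from the question:

import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def io_bound(_):
    time.sleep(0.1)  # stands in for waiting on network or disk

def cpu_bound(_):
    return sum(i * i for i in range(10**6))  # stands in for real computation

if __name__ == '__main__':
    # Threads help while the GIL is released (blocking I/O)...
    with ThreadPoolExecutor() as ex:
        list(ex.map(io_bound, range(8)))
    # ...while processes sidestep the GIL for CPU-bound work.
    with ProcessPoolExecutor() as ex:
        list(ex.map(cpu_bound, range(8)))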
Assuming that the value of each entry can be computed independently and that this computation is CPU bound, let's benchmark the difference between single process and multiprocess implementation (based on multiprocessing library).
The following code is used to test the two approaches in some scenarios, varying the complexity of the computation needed for each entry and the number of entries (for the multiprocess implementation, 7 processes were used).
import timeit
import numpy as np

def some_fun(s, d, n=1):
    """A function with an adaptable complexity"""
    a = s * np.ones(np.random.randint(1, 10, (2,))) / (d + 1)
    for _ in range(n):
        a += np.random.random(a.shape)
    return a

# Code to create dictionary with only one process
setup_simple = "from __main__ import some_fun, n_first_level, n_second_level, complexity"
code_simple = """
data_dict = {}
for s in range(n_first_level):
    data_dict[s] = {}
    for d in range(n_second_level):
        data_dict[s][d] = some_fun(s, d, n=complexity)
"""

# Code to create a dictionary with multiprocessing: we are going to use all the available cores except 1
setup_mp = """import numpy as np
import multiprocessing as mp
import itertools
from functools import partial
from __main__ import some_fun, n_first_level, n_second_level, complexity

n_processes = mp.cpu_count() - 1
# Uncomment if you want to know how many concurrent processes you are going to use
# print(f'{n_processes} concurrent processes')
"""
code_mp = """
with mp.Pool(processes=n_processes) as pool:
    dict_values = pool.starmap(partial(some_fun, n=complexity),
                               itertools.product(range(n_first_level), range(n_second_level)))
data_dict = {
    k: dict(zip(range(n_second_level), dict_values[k * n_second_level: (k + 1) * n_second_level]))
    for k in range(n_first_level)
}
"""

# Time the code with different settings
print('Execution time on 10 repetitions: mean [std]')
for label, complexity, n_first_level, n_second_level in (
        ("TRIVIAL FUNCTION", 0, 10, 10),
        ("TRIVIAL FUNCTION", 0, 500, 500),
        ("SIMPLE FUNCTION", 5, 500, 500),
        ("COMPLEX FUNCTION", 50, 100, 100),
        ("HEAVY FUNCTION", 1000, 10, 10),
):
    print(f'\n{label}, {n_first_level * n_second_level} dictionary entries')
    for l, t in (
            ('Single process', timeit.repeat(stmt=code_simple, setup=setup_simple, number=1, repeat=10)),
            ('Multiprocess', timeit.repeat(stmt=code_mp, setup=setup_mp, number=1, repeat=10)),
    ):
        print(f'\t{l}: {np.mean(t):.3e} [{np.std(t):.3e}] seconds')
These are the results:
Execution time on 10 repetitions: mean [std]
TRIVIAL FUNCTION, 100 dictionary entries
Single process: 7.752e-04 [7.494e-05] seconds
Multiprocess: 1.163e-01 [2.024e-03] seconds
TRIVIAL FUNCTION, 250000 dictionary entries
Single process: 7.077e+00 [7.098e-01] seconds
Multiprocess: 1.383e+00 [7.752e-02] seconds
SIMPLE FUNCTION, 250000 dictionary entries
Single process: 1.405e+01 [1.422e+00] seconds
Multiprocess: 2.858e+00 [5.742e-01] seconds
COMPLEX FUNCTION, 10000 dictionary entries
Single process: 1.557e+00 [4.330e-02] seconds
Multiprocess: 5.383e-01 [5.330e-02] seconds
HEAVY FUNCTION, 100 dictionary entries
Single process: 3.181e-01 [5.026e-03] seconds
Multiprocess: 1.171e-01 [2.494e-03] seconds
As you can see, assuming a CPU-bound computation, the multiprocess approach achieves better results in most scenarios. Only if the computation for each entry is very light and/or the number of entries is very small should the single-process approach be preferred.
On the other hand, the improvement provided by multiprocessing comes at a cost: for example, if the computation for each entry uses a significant amount of memory, you could incur an out-of-memory error, meaning you would have to rework your code and make it more complex to find the right balance between memory occupation and execution time. If you look around, there are a lot of questions about memory issues caused by non-optimal use of multiprocessing. In other words, your code will be less easy to read and maintain.
To sum up, you should judge whether the improvement in execution time is worth it, even if it is possible.
I want to make a Dask Delayed flow which includes CPU and GPU tasks. GPU tasks can only run on GPU workers, and a GPU worker only has one GPU and can only handle one GPU task at a time.
Unfortunately, I see no way to specify worker resources in the Delayed API.
Here is common code:
client = Client(resources={'GPU': 1})
#delayed
def fcpu(x, y):
sleep(1)
return x + y
#delayed
def fgpu(x, y):
sleep(1)
return x + y
Here is the flow written in pure Delayed. This code will not behave properly because it doesn't know about the GPU resource.
# STEP ONE: two parallel CPU tasks
a = fcpu(1, 1)
b = fcpu(10, 10)
# STEP TWO: two GPU tasks
c = fgpu(a, b) # Requires 1 GPU
d = fgpu(a, b) # Requires 1 GPU
# STEP THREE: final CPU task
e = fcpu(c, d)
%time e.compute() # 3 seconds
This is the best solution I could come up with. It combines Delayed syntax with Client.compute() futures. It seems to behave correctly, but it is very ugly.
# STEP ONE: two parallel CPU tasks
a = fcpu(1, 1)
b = fcpu(10, 10)
a_future, b_future = client.compute([a, b])  # We DON'T want a resource limit
# STEP TWO: two GPU tasks - only resources to run one at a time
c = fgpu(a_future, b_future)
d = fgpu(a_future, b_future)
c_future, d_future = client.compute([c, d], resources={'GPU': 1})
# STEP THREE: final CPU task
e = fcpu(c_future, d_future)
res = e.compute()
Is there a better way to do this?
Maybe an approach similar to what is described in https://jobqueue.dask.org/en/latest/examples.html would work; it covers the case of processing on one GPU machine or on a machine with an SSD.
def step_1_w_single_GPU(data):
    return "Step 1 done for: %s" % data

def step_2_w_local_IO(data):
    return "Step 2 done for: %s" % data

stage_1 = [delayed(step_1_w_single_GPU)(i) for i in range(10)]
stage_2 = [delayed(step_2_w_local_IO)(s2) for s2 in stage_1]

result_stage_2 = client.compute(stage_2,
                                resources={tuple(stage_1): {'GPU': 1},
                                           tuple(stage_2): {'ssdGB': 100}})
This is possible with annotations, see the example in docs:
import dask
import dask.dataframe as dd

x = dd.read_csv(...)
with dask.annotate(resources={'GPU': 1}):
    y = x.map_partitions(func1)
z = y.map_partitions(func2)
z.compute(optimize_graph=False)
As noted in the docs, such annotations can be lost during graph optimization, hence the kwarg optimize_graph=False.
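Applied to the delayed flow from the question, a minimal sketch might look like this (untested; it assumes the same fcpu/fgpu definitions and the client created with resources={'GPU': 1}):

import dask

# CPU tasks as before
a = fcpu(1, 1)
b = fcpu(10, 10)

# Annotate only the GPU tasks so the scheduler reserves the GPU resource
with dask.annotate(resources={'GPU': 1}):
    c = fgpu(a, b)
    d = fgpu(a, b)

e = fcpu(c, d)
result = e.compute(optimize_graph=False)  # keep the annotations intact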
1. I have a function var. I want to know the best possible way to run the loop within this function quickly via multiprocessing/parallel processing, utilizing all the processors, cores, threads, and RAM the system has.
import numpy as np
from pysheds.grid import Grid

xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306

a = r'/home/test/image1.tif'
b = r'/home/test/image2.tif'

def var(interest):
    variable_avg = []
    for (x, y) in zip(xs, ys):
        grid = Grid.from_raster(interest, data_name='map')
        grid.catchment(data='map', x=x, y=y, out_name='catch')
        variable = grid.view('catch', nodata=np.nan)
        variable = np.array(variable)
        variablemean = variable.mean()
        variable_avg.append(variablemean)
    return variable_avg
2. It would be great if I could also run the function var itself in parallel for its different parameters, e.g. var(a) and var(b) at the same time, since that would take much less time than parallelizing the loop alone.
Ignore 2 if it does not make sense.
TLDR:
You can use the multiprocessing library to run your var function in parallel. However, as written you likely don't make enough calls to var for multiprocessing to have a performance benefit, because of its overhead. If all you need to do is run those two calls, running in serial is likely the fastest you'll get. However, if you need to make a lot of calls, multiprocessing can help you out.
We'll need to use a process pool to run this in parallel; threads won't work here because Python's global interpreter lock prevents true parallelism. The drawback of process pools is that processes are heavyweight to spin up. In the example of just running two calls to var, the time to create the pool overwhelms the time spent running var itself.
To illustrate this, let's use a process pool and asyncio to run calls to var in parallel, and compare it to just running things sequentially. Note that to run this example I used an image from the Pysheds library, https://github.com/mdbartos/pysheds/tree/master/data - if your image is much larger, the below may not hold true.
import functools
import time
from concurrent.futures.process import ProcessPoolExecutor
import asyncio

a = 'diem.tif'
xs = 10, 20, 30, 40, 50
ys = 10, 20, 30, 40, 50

async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    with ProcessPoolExecutor() as pool:
        task_one = loop.run_in_executor(pool, functools.partial(var, a))
        task_two = loop.run_in_executor(pool, functools.partial(var, a))
        results = await asyncio.gather(task_one, task_two)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')

    serial_start = time.time()
    result_one = var(a)
    result_two = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')

if __name__ == "__main__":
    asyncio.run(main())
Running the above on my machine (a 2.4 GHz 8-Core Intel Core i9) I get the following output:
Process pool took 1.7581260204315186
Running in serial took 0.32335805892944336
In this example, a process pool is over five times slower! This is due to the overhead of creating and managing multiple processes. That said, if you need to call var more than just a few times, a process pool may make more sense. Let's adapt this to run var 100 times and compare the results:
async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(loop.run_in_executor(pool, functools.partial(var, a)))
        results = await asyncio.gather(*tasks)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')

    serial_start = time.time()
    for _ in range(100):
        result = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')
Running 100 times, I get the following output:
Process pool took 3.442288875579834
Running in serial took 13.769982099533081
In this case, running in a process pool is about 4x faster. You may also wish to try running each iteration of your loop concurrently. You can do this by creating a function that processes one x,y coordinate at a time and then run each point you want to examine in a process pool:
def process_poi(interest, x, y):
    grid = Grid.from_raster(interest, data_name='map')
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = np.array(variable)
    return variable.mean()

async def var_loop_async(interest, pool, loop):
    tasks = []
    for (x, y) in zip(xs, ys):
        function_call = functools.partial(process_poi, interest, x, y)
        tasks.append(loop.run_in_executor(pool, function_call))
    return await asyncio.gather(*tasks)
async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(var_loop_async(a, pool, loop))
        results = await asyncio.gather(*tasks)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')
In this case I get Process pool took 3.2950568199157715 - so not really any faster than our first version with one process per call of var. This is likely because the limiting factor at this point is the number of cores available on the CPU; splitting our work into smaller increments does not add much value.
That said, if you have 1000 x and y coordinates you wish to examine across two images, this last approach may yield a performance gain.
I think this is a reasonable and straightforward way of speeding up your code by parallelizing only the main loop. You can saturate your cores with this, so there is no need to also parallelize over the interest variable. I can't test the code, so I assume your function is correct; I have just moved the loop body into a new function and parallelized it inside var().
from multiprocessing import Pool
import numpy as np
from pysheds.grid import Grid

def var(interest, xs, ys):
    grid = Grid.from_raster(interest, data_name='map')
    with Pool(4) as p:  # uses 4 cores, adjust this as you need
        variable_avg = p.starmap(loop, [(x, y, grid) for x, y in zip(xs, ys)])
    return variable_avg

def loop(x, y, grid):
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = np.array(variable)
    return variable.mean()
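A usage sketch, reusing the inputs from the question (note that the grid object has to be picklable for starmap to ship it to the workers; if it is not, construct it inside loop() instead):

if __name__ == '__main__':
    averages_a = var(a, xs, ys)
    averages_b = var(b, xs, ys)
    print(averages_a, averages_b)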
We currently add a very large number of jobs to a multiprocessing pool and execute them. However, we are facing a memory problem due to the high number of optimization parameters and a very big dataset. The dataset is 110 MB before calculations (it grows bigger), and we currently have over 3 million possible parameter combinations. This consumes all available RAM (we have 128 GB) and the processes stall at some point.
Is it possible to add the jobs to the pool part by part, so that completed jobs are freed from RAM? Another option would be to check the pool constantly and, if there is nothing left in it, add, say, 1000 new jobs.
Is this possible?
This is how we add jobs to the pool and execute them:
def loop(self, parameters, parameter_index, pool, execute_method):
    parameter = list(parameters.values())[parameter_index]
    value = parameter["start_value"]
    temp_value = Decimal(str(value))
    while temp_value <= Decimal(str(parameter["end_value"])):
        parameter["current_loop_value"] = value
        value += parameter["step_size"]
        temp_value = temp_value + Decimal(str(parameter["step_size"]))
        if parameter_index + 1 < len(parameters):
            self.loop(parameters, parameter_index + 1, pool, execute_method)
            continue
        self.thread_count += 1
        current_parameter = copy.deepcopy(parameters)
        current_parameter["index"] = self.thread_count
        execute_method(pool, current_parameter)
        if self.thread_count % self.thread_count_for_memory_check == 0:
            while True:
                if psutil.virtual_memory().percent < self.memory_leak_threshold:
                    break
                time.sleep(self.memory_leak_sleep_time)
                print("sleeping for {} seconds.... ".format(self.memory_leak_sleep_time), psutil.virtual_memory())

def execute_in_pool(self, pool, parameters):
    pool.apply_async(self.execute_strategy, args=(self.dataset_dict, parameters, self.thread_count),
                     callback=self.thread_callback)
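One hedged sketch of the "part by part" idea (all names below are illustrative stand-ins, not from the original code): generate the parameter combinations lazily and submit them to the pool one bounded batch at a time, so only one batch of jobs and its results live in RAM at any moment:

import itertools
from multiprocessing import Pool

def parameter_combinations():
    # Lazily yield one parameter dict at a time instead of building
    # all ~3 million combinations up front.
    for index, combo in enumerate(itertools.product(range(100), range(100), range(300))):
        yield {"index": index, "values": combo}

def execute_strategy(parameters):
    # Stand-in for the real backtest; return something small so finished
    # jobs do not keep large intermediates alive.
    return parameters["index"], sum(parameters["values"])

def chunks(iterable, size):
    # Slice an iterator into lists of at most `size` items.
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        for batch in chunks(parameter_combinations(), 1000):
            # Only one batch of jobs (and its results) is in flight at a time.
            for index, result in pool.map(execute_strategy, batch):
                pass  # aggregate or persist results incrementally here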