When I use multiprocessing.Pool to do research on a quantitative strategy, a problem shows up in htop on the Linux server: the processes parameter doesn't seem to take effect and all CPUs end up occupied.
For example, I set multiprocessing.Pool(processes=8) and use pool.apply_async to run my backtest strategy over a parameter list of length 6000. Everything looks right at the beginning of the backtest, but after some time htop shows that all 24 CPUs are occupied by this program, as if processes=8 had no effect.
The pseudocode looks like this:
import multiprocessing

def strategy(paras):
    # some trading logic
    return indicator

if __name__ == '__main__':
    # paras_list holds 6000 different parameter sets
    # a_list stores the async results of strategy
    paras_list = [[..], [..], ...[..]]
    a_list = []
    pool = multiprocessing.Pool(processes=8)  # create 8 worker processes
    for i in range(len(paras_list)):
        a_list.append(pool.apply_async(func=strategy, args=(paras_list[i],)))
    pool.close()
    pool.join()
(htop screenshot from the Linux server)
Can anyone help me with this problem?
Thanks a lot!!
Related
I am attempting to run a series of simulation sets in parallel (set A, set B, ...). The simulation sets have subset variations (subset A1, subset A2, ...), and each subset contains 500+ simulations. Currently I am working with 18 sets, and each set has only 1 subset. Realistically, I would like neither number to matter, so that any number of set/subset combinations could be processed. This is all happening on a Linux machine running Anaconda Python 3.6.3 with multiprocessing.cpu_count() = 72.
There seems to be some sort of resource issue occurring. I have not been able to pin down exactly when it begins, but it is usually after a couple of sets have fully completed. In the past I have run this script with 3-4 sets and never noticed a resource issue. What happens is that memory usage increases drastically, and the number of processors in use grows beyond the 65 I specify. The simulations' CPU usage also drops to less than 20% after the issue starts, and over time the machine locks up entirely, requiring a forced restart to do anything.
My simplified code:
import multiprocessing as mp
import subprocess

def getCMDs(ARGS):
    # do stuff
    return subsetCMDs

def minions(cmd):
    subprocess.run(cmd, shell=True)
    return

def runParallel(sim_sets):
    with mp.Pool(processes=65) as pool:
        for i in sim_sets:
            subsetCMDs = getCMDs(i)
            for runCMDs in subsetCMDs:
                term_flag = False
                try:
                    pool.map(minions, runCMDs)
                except KeyboardInterrupt:
                    pool.terminate()
                    pool.join()
                    term_flag = True
                    break
            if term_flag:
                break
    return

if __name__ == '__main__':
    runParallel(['set A', 'set B', 'set C'])
I have tried to find solutions, but I can't seem to find anything similar to my issue as far as I understand it. I have read about maxtasksperchild, but I am still unsure whether it would be the correct solution, as I believe this may be a too-many-processes issue rather than a too-much-memory issue.
My thoughts could be entirely wrong, but I figured I would share them to help with discussion. Thank you for the help!
Edit: First, I should provide a little more information about what happens when the issue occurs. I usually leave htop open after I start the run, and for the first ~6 hours I see nothing out of the ordinary:
All 72 processors at various usage amounts
~13/252 GB of Memory used
0/128 GB of Swap used
Tasks running: ~67
When I come to check on the script the next day, I see:
All 72 processors at 100%
252/252 GB of Memory used
128/128 GB of Swap used
Tasks running: 1000+
I am also not sure at what point during the night the issue begins.
After heavy research, I have tried updating my script to use asyncio subprocesses shown below, but I am still getting the same results as with multiprocessing.
Asyncio script:
import asyncio

def getCMDs(ARGS):
    # do stuff
    return subsetCMDs

async def execCMD(sem, cmd):
    async with sem:
        proc = await asyncio.create_subprocess_shell(cmd)
        await proc.wait()
    return

async def execAsyncBatch(nCPU, execCMDs):
    sem = asyncio.Semaphore(nCPU)
    all_futures = []
    for cmd in execCMDs:
        coro = execCMD(sem, cmd)
        future = asyncio.ensure_future(coro)
        all_futures.append(future)
    await asyncio.gather(*all_futures)
    return

def runParallel(sim_sets, nCPU):
    for i in sim_sets:
        subsetCMDs = getCMDs(i)
        for runCMDs in subsetCMDs:
            term_flag = False
            try:
                loop = asyncio.get_event_loop()
                job_future = asyncio.ensure_future(execAsyncBatch(nCPU, runCMDs))
                loop.run_until_complete(job_future)
            except KeyboardInterrupt:
                term_flag = True
                break
        if term_flag:
            break
    return

if __name__ == '__main__':
    runParallel(['set A', 'set B', 'set C'], 65)
I plan on running each set individually with the multiprocessing script (as I understand multiprocessing much more than asyncio) to see if somewhere along the lines I have a bad set, but I figured I would provide an update in the meantime. Thanks again!
I have to run multiple simulations of the same model with varying parameters (or random number generator seeds). Previously I worked on a server with many cores, where I used the Python multiprocessing library with apply_async. This was very handy, as I could decide the maximum number of cores to occupy and the simulations would just go into a queue.
Now I have moved to a place with an HPC cluster that uses PBS. From trial and error and various answers, it seems that multiprocessing works only inside one node. Is there a way to make it work across many nodes, or any other library that offers the same functionality and is just as easy to use in a few lines?
To give you an idea of my code:
import functions_library as L
import multiprocessing as mp
import numpy as np
import pandas as pd

if __name__ == "__main__":
    N = 100
    proc = 50
    pool = mp.Pool(processes=proc)
    seed = 342
    np.random.seed(seed)
    seeds = np.random.randint(low=1, high=100000, size=N)
    resul = []
    for SEED in seeds:
        SEED = int(SEED)
        resul.append(pool.apply_async(L.some_function, args=(some_args,)))
        print(SEED)
    results = [p.get() for p in resul]
    database = pd.DataFrame(results)
    database.to_csv("prova.csv")
EDIT
As I understand it, mpi4py might be helpful, since it interacts naturally with PBS. Is that correct? How can I adapt my code to mpi4py?
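For reference, here is a minimal sketch of one possible adaptation using mpi4py.futures, assuming mpi4py is available on the cluster; functions_library / L.some_function mirror the snippet above, and the exact arguments you pass are up to you:
# launch with e.g.:  mpiexec -n 51 python -m mpi4py.futures script.py
import numpy as np
import pandas as pd
from mpi4py.futures import MPIPoolExecutor

import functions_library as L

def run_one(seed):
    # each MPI worker runs one simulation; adapt the arguments to your function
    return L.some_function(int(seed))

if __name__ == "__main__":
    np.random.seed(342)
    seeds = np.random.randint(low=1, high=100000, size=100)
    with MPIPoolExecutor() as executor:   # workers are the extra MPI ranks
        results = list(executor.map(run_one, seeds))
    pd.DataFrame(results).to_csv("prova.csv")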
I have found that the schwimmbad package is quite handy for running code written for multiprocessing on an MPI cluster with minimal changes.
I hope it helps!
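For illustration, a minimal sketch of the usual schwimmbad pattern, assuming the worker takes a single task argument (the worker body and the mpiexec invocation are placeholders):
import sys
from schwimmbad import MPIPool

def worker(seed):
    # one simulation per task; replace with your own L.some_function call
    return seed ** 2

if __name__ == "__main__":
    pool = MPIPool()
    # worker ranks block here and simply execute tasks sent by the master rank
    if not pool.is_master():
        pool.wait()
        sys.exit(0)
    results = pool.map(worker, range(100))
    pool.close()
    print(results)
    # run with e.g.:  mpiexec -n 8 python script.py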
I am using the multiprocessing module in Python to train neural networks with keras in parallel, using a Pool(processes = 4) object with imap. This steadily uses more and more memory after every "cycle", i.e. every 4 processes, until it finally crashes.
I used the memory_profiler module to track my memory usage over time while training 12 networks. Here is the profile with vanilla imap: (memory plot omitted)
If I put maxtasksperchild=1 in Pool: (memory plot omitted)
If I use imap(chunksize=3): (memory plot omitted)
In the latter case, where everything works out fine, I'm only sending a single batch to every process in the pool, so it seems the problem is that the processes carry information about previous batches. If so, can I force the pool not to do that?
Even though the chunks solution seems to work, I'd rather not use it, because:
I'd like to track progress using the tqdm module, and in the chunked case it will only update after every chunk, which effectively means it won't track anything at all, as all the chunks finish at the same time (in this example).
Currently all networks take exactly the same time to train, but I'd like to allow them to have different training times, in which case the chunked solution could give one process all the long training times.
Here's a code snippet in the vanilla case. In the other two cases I just changed the maxtasksperchild parameter in Pool, and the chunksize parameter in imap:
from multiprocessing import Pool
from tqdm import tqdm

def train_network(network):
    (...)
    return score

pool = Pool(processes=4)
scores = pool.imap(train_network, networks)
scores = tqdm(scores, total=networks.size)
for (network, score) in zip(networks, scores):
    network.score = score
pool.close()
pool.join()
Unfortunately, the multiprocessing module in Python comes at a great expense: data is mostly not shared between processes and needs to be replicated. This will change starting with Python 3.8:
https://docs.python.org/3.8/library/multiprocessing.shared_memory.html
Although the official release of Python 3.8 is on 21 October 2019, you can already download it from GitHub.
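To give an idea of what the new API looks like, here is a minimal sketch, assuming Python 3.8+ and a NumPy array as the shared payload (both roles are shown in one script for brevity; normally a second process would attach by name):
import numpy as np
from multiprocessing import shared_memory

# creating side: allocate a shared block and copy an array into it once
data = np.arange(10, dtype=np.float64)
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
shared[:] = data[:]

# attaching side: another process opens the same block by name, no copy made
shm2 = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm2.buf)
print(view.sum())   # reads the creator's data directly

# cleanup: close in every process, unlink once when nobody needs it anymore
shm2.close()
shm.close()
shm.unlink()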
I came up with a solution that seems to work. I ditched the pool and made my own simple queueing system. Aside from not increasing (it does increase ever so slightly, but I think that's me storing some dictionaries as a log), it even consumes less memory than the chunks solution above.
I have no idea why that is the case. Perhaps the Pool objects just take up a lot of memory? Anyway, here's my code:
import multiprocessing as mp
from tqdm import tqdm

def train_network(network):
    (...)
    return score

# Define queues to organise the parallelising
todo = mp.Queue(maxsize=networks.size + 4)
done = mp.Queue(maxsize=networks.size)

# Populate the todo queue
for idx in range(networks.size):
    todo.put(idx)

# Add -1's which will be an effective way of checking
# if all todo's are finished
for _ in range(4):
    todo.put(-1)

def worker(todo, done):
    ''' Network scoring worker. '''
    from queue import Empty
    while True:
        try:
            # Fetch the next todo
            idx = todo.get(timeout=1)
        except Empty:
            # The queue is never empty, so the silly worker has to go
            # back and try again
            continue
        # If we have reached a -1 then stop
        if idx == -1:
            break
        else:
            # Score the network and store it in the done queue
            score = train_network(networks[idx])
            done.put((idx, score))

# Construct our four processes
processes = [mp.Process(target=worker, args=(todo, done)) for _ in range(4)]

# Daemonise the processes, which closes them when
# they finish, and start them
for p in processes:
    p.daemon = True
    p.start()

# Set up the iterable with all the scores, and set
# up a progress bar
idx_scores = (done.get() for _ in networks)
pbar = tqdm(idx_scores, total=networks.size)

# Compute all the scores in parallel
for (idx, score) in pbar:
    networks[idx].score = score

# Join up the processes and close the progress bar
for p in processes:
    p.join()
pbar.close()
I am mining data from a website through data scraping in Python, using the requests package to send the parameters.
Here is the code snippet in Python:
import requests

def get_url_data(param):
    post_data = get_post_data(param)
    headers = {}
    headers["Content-Type"] = "text/xml; charset=UTF-8"
    headers["Content-Length"] = str(len(post_data))  # header values must be strings
    headers["Connection"] = 'Keep-Alive'
    headers["Cache-Control"] = 'no-cache'
    page = requests.post(url, data=post_data, headers=headers, timeout=10)
    data = parse_page(page.content)
    return data

for param in paramList:
    data = get_url_data(param)
The variable paramList is a list of more than 1000 elements, and the endpoint url stays the same. I was wondering if there is a better and faster way to do this?
Thanks
As there is a significant amount of network I/O involved, threading should improve the overall performance significantly.
You can try using a ThreadPool; test and tweak the number of threads to find the one that suits the situation best and gives the highest overall performance.
from multiprocessing.pool import ThreadPool

# Remove the 'for param in paramList' iteration
def get_url_data(param):
    # Rest of code here
    ...

if __name__ == '__main__':
    pool = ThreadPool(15)
    pool.map(get_url_data, paramList)  # Will split the load between the threads nicely
    pool.close()
I need to make 1000 POST requests to the same domain, I was wondering if
there is a better and faster way to do this?
It depends. If it's a static asset or a servlet whose behaviour you know, and the same parameters return the same response each time, you can implement an LRU cache or some other caching mechanism; if not, 1K POST requests to some servlet doesn't matter much, even if they share the same domain.
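For the caching idea, a minimal sketch using functools.lru_cache, assuming the parameters are hashable and that identical parameters really do return identical responses (url and get_post_data are taken from the question's snippet; the cache size is arbitrary):
import requests
from functools import lru_cache

@lru_cache(maxsize=1024)              # keep up to 1024 distinct responses in memory
def get_url_data_cached(param):
    # param must be hashable (a string or tuple, not a dict or list)
    page = requests.post(url, data=get_post_data(param), timeout=10)
    return page.content

# repeated calls with an already-seen param return the cached response
# instead of hitting the network again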
There is an answer that uses the ThreadPool interface from multiprocessing, which actually runs the main process with 15 threads. Does it run on a 15-core machine? A core can only run one thread at a time (leaving hyper-threading aside; does it run on 8 hyper-threaded cores?).
The ThreadPool interface lives inside a library with a misleading name, multiprocessing, while Python also has a threading module; this is confusing as f#ck. Let's benchmark some lower-level code:
import psutil
from multiprocessing.pool import ThreadPool
from time import sleep

def get_url_data(param):
    print(param)  # just for convenience
    sleep(1)      # let's assume it will take one second each time

if __name__ == '__main__':
    paramList = [i for i in range(100)]  # 100 urls
    pool = ThreadPool(psutil.cpu_count())  # each core can run one thread (hyper.. not now)
    pool.map(get_url_data, paramList)  # splitting the jobs
    pool.close()
The code above will use the main process with 4 threads in my case, because my laptop has 4 CPUs. Benchmark result:
$ time python3_5_2 test.py
real 0m28.127s
user 0m0.080s
sys 0m0.012s
Let's try spawning processes with multiprocessing instead:
import psutil
import multiprocessing
from time import sleep
import numpy

def get_url_data(urls):
    for url in urls:
        print(url)
        sleep(1)  # let's assume it will take one second each time

if __name__ == "__main__":
    jobs = []
    # Split URLs into chunks as number of CPUs
    chunks = numpy.array_split(range(100), psutil.cpu_count())
    # Pass each chunk into a process
    for url_chunk in chunks:
        jobs.append(multiprocessing.Process(target=get_url_data, args=(url_chunk, )))
    # Start the processes
    for j in jobs:
        j.start()
    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
Benchmark result: about 3 seconds less.
$ time python3_5_2 test2.py
real 0m25.208s
user 0m0.260s
sys 0m0.216s
If you execute ps -aux | grep "test.py" you will see 5 processes, because one is the main process which manages the others.
There are some drawbacks:
You did not explain in depth what your code is doing, but if you are doing work that needs to be synchronized, you need to know that multiprocessing is NOT thread safe.
Spawning extra processes introduces I/O overhead, as data has to be shuffled around between the processes.
Assuming the data is restricted to each process, it is possible to gain significant speedup, but be aware of Amdahl's Law (see the small sketch below).
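As a rough reminder of what Amdahl's Law predicts, here is a tiny illustrative helper; the parallel fraction p is something you have to estimate for your own workload:
def amdahl_speedup(p, n):
    """Theoretical speedup with n workers when a fraction p of the work parallelises."""
    return 1.0 / ((1.0 - p) + p / n)

# e.g. if 90% of the job parallelises, 4 processes give at most ~3.1x,
# and even infinitely many processes cap out at 10x
print(amdahl_speedup(0.9, 4))   # ~3.08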
If you reveal what your code does afterwards (save it to a file? a database? stdout?), it will be easier to give a better answer/direction. A few ideas come to mind: immutable infrastructure with Bash or Java to handle synchronization, or, if it is a memory-bound issue, an object pool to process the JSON responses. It might even be a job for fault-tolerant Elixir.
Here's the program:
#!/usr/bin/python
import multiprocessing

def dummy_func(r):
    pass

def worker():
    pass

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    for index in range(0, 100000):
        pool.apply_async(worker, callback=dummy_func)

    # clean up
    pool.close()
    pool.join()
I found that memory usage (both VIRT and RES) kept growing until close()/join(). Is there any solution to get rid of this? I tried maxtasksperchild with 2.7, but it didn't help either.
I have a more complicated program that calls apply_async() ~6M times, and at the ~1.5M point I already had 6G+ RES; to rule out all other factors, I simplified the program to the version above.
EDIT:
Turned out this version works better, thanks for everyone's input:
#!/usr/bin/python
import multiprocessing

ready_list = []

def dummy_func(index):
    global ready_list
    ready_list.append(index)

def worker(index):
    return index

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    result = {}
    for index in range(0, 1000000):
        result[index] = (pool.apply_async(worker, (index,), callback=dummy_func))
        for ready in ready_list:
            result[ready].wait()
            del result[ready]
        ready_list = []

    # clean up
    pool.close()
    pool.join()
I didn't put any lock there, as I believe the main process is single-threaded (the callback is more or less an event-driven thing, per the docs I read).
I changed v1's index range to 1,000,000, the same as v2, and ran some tests. It's weird to me that v2 is even ~10% faster than v1 (33s vs 37s); maybe v1 was doing too much internal list maintenance. v2 is definitely the winner on memory usage: it never went over 300M (VIRT) and 50M (RES), while v1 used to hit 370M/120M, with 330M/85M at best. All numbers are from just 3-4 runs, for reference only.
I had memory issues recently, since I was calling the multiprocessing function multiple times, so it kept spawning processes and leaving them in memory.
Here's the solution I'm using now:
def myParallelProcess(ahugearray):
    from multiprocessing import Pool
    from contextlib import closing
    with closing(Pool(15)) as p:
        res = p.imap_unordered(simple_matching, ahugearray, 100)
        return res
Simply create the pool within your loop and close it at the end of the loop with pool.close().
Use map_async instead of apply_async to avoid excessive memory usage.
For your first example, change the following two lines:
for index in range(0, 100000):
    pool.apply_async(worker, callback=dummy_func)
to
pool.map_async(worker, range(100000), callback=dummy_func)
It will finish in the blink of an eye, before you can even see its memory usage in top. Change the list to a bigger one to see the difference. But note that map_async will first convert the iterable you pass it into a list to calculate its length if it doesn't have a __len__ method. If you have an iterator with a huge number of elements, you can use itertools.islice to process them in smaller chunks, as sketched below.
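A minimal sketch of that islice idea, assuming worker and dummy_func as in the example above (the chunk size of 10000 and the fake huge iterator are arbitrary):
import itertools
import multiprocessing

def worker(index):
    return index

def dummy_func(results):
    pass  # map_async's callback receives the whole list of results for the chunk

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=16)
    huge_iterator = iter(range(1000000))   # stands in for any huge or lazy iterable
    while True:
        chunk = list(itertools.islice(huge_iterator, 10000))
        if not chunk:
            break
        # submit one bounded chunk at a time and wait for it, so the task queue
        # (and its memory) never holds more than one chunk's worth of work
        pool.map_async(worker, chunk, callback=dummy_func).wait()
    pool.close()
    pool.join()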
I had a memory problem in a real-life program with much more data, and I finally found that the culprit was apply_async.
P.S. With respect to memory usage, your two examples have no obvious difference.
I have a very large 3D point-cloud data set that I'm processing. I tried using the multiprocessing module to speed up the processing, but I started getting out-of-memory errors. After some research and testing I determined that I was filling the queue of tasks much faster than the subprocesses could empty it. I'm sure that by chunking, or by using map_async or something similar, I could have adjusted the load, but I didn't want to make major changes to the surrounding logic.
The dumb solution I hit on is to check the length of pool._cache intermittently, and if the cache is too large, wait for the queue to empty.
In my mainloop I already had a counter and a status ticker:
# Update status
count += 1
if count % 10000 == 0:
    sys.stdout.write('.')
    if len(pool._cache) > 1e6:
        print("waiting for cache to clear...")
        last.wait()  # where 'last' holds the most recently returned ApplyResult
So every 10k insertions into the pool I check whether there are more than 1 million operations queued (about 1 GB of memory used in the main process). When the queue is full, I just wait for the most recently inserted job to finish.
Now my program can run for hours without running out of memory. The main process just pauses occasionally while the workers continue processing the data.
By the way, the _cache member is documented in the multiprocessing module's pool example:
#
# Check there are no outstanding tasks
#
assert not pool._cache, 'cache = %r' % pool._cache
You can limit the number of tasks per child process:
multiprocessing.Pool(maxtasksperchild=1)
maxtasksperchild is the number of tasks a worker process can complete before it exits and is replaced with a fresh worker process, to allow unused resources to be freed. The default is None, which means worker processes live as long as the pool (see the Pool documentation).
I think this is similar to a question I posted, but I'm not sure you have the same delay. My problem was that I was producing results from the multiprocessing pool faster than I was consuming them, so they built up in memory. To avoid that, I used a semaphore to throttle the inputs into the pool so they didn't get too far ahead of the outputs I was consuming.
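One way that throttling can look in practice: a minimal sketch, assuming a generic worker function and a consumer loop in the main process (the limit of 64 in-flight results and the worker body are placeholders):
import multiprocessing
import threading

def worker(index):
    return index * index

def throttled(iterable, sem):
    # the pool's task-feeding thread pulls items from this generator; it blocks
    # whenever too many results are still waiting to be consumed
    for item in iterable:
        sem.acquire()
        yield item

if __name__ == '__main__':
    sem = threading.BoundedSemaphore(64)   # at most 64 results outstanding
    pool = multiprocessing.Pool(processes=4)
    for result in pool.imap(worker, throttled(range(1000000), sem)):
        # ... consume the result here ...
        sem.release()                      # consuming a result frees one slot for the feeder
    pool.close()
    pool.join()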