How do I avoid "out of memory" exception when a lot of sub processes are launched using multiprocessing.Pool?
First of all, my program loads 5GB file to a object. Next, parallel processing runs, where each process read that 5GB object.
Because my machine has more than 30 cores, I want to use full of my cores. However, when launching 30 sub processes, out of memory exception occurs.
Probably, each process has the copy of the large instance (5GB). The total memory is 5GB * 30 core = 150GB. That's why out of memory error occurs.
I believe there is a workaround to avoid this memory error because each process just read that object. If each process share memory of the huge object, only 5GB memory is enough for my multi processing.
Please let me know a workaround of this memory error.
import cPickle
from multiprocessing import Pool
from multiprocessing import Process
import multiprocessing
from functools import partial
with open("huge_data_5GB.pickle", "rb") as f
huge_instance = cPickle(f)
def run_process(i, huge_instance):
return huge_instance.get_element(i)
partial_process = partial(run_process, huge_instance=huge_instance)
p = Pool(30) # my machine has more than 30 cores
result = p.map(partial_process, range(10000))
Related
I am running a python component in Databricks environment which creates a set of JSON messages and each JSON message is encoded with Avro schema. The encoding was taking longer time (8 minutes for encoding 10K messages which have complex JSON structure) and hence I tried to use multiprocessing with pool map function. The process seems to work fine for the first execution, however for subsequent runs, the performance is degrading and eventually failing with oom error. I am making sure that at the end of execution pool.close() and pool.join() are issued but not sure if it's really freeing up the memory. When I look at Databricks Ganglia UI, it shows that Swap memory and CPU utilization is increasing for each run. I also tried to reduce the no of pools (driver node has 8 cores, so tried with 6 and 4 pools) and also maxtasksperchild=1 but still doesn't help. I am wondering if I'm doing anything wrong. Following is the code which I'm using now. Wondering what is cuasing the issue here. Any pointers / suggestions are appreciated.
from multiprocessing import Pool
import multiprocessing
import json
from avro.io import *
import avro.schema
from avro_json_serializer import AvroJsonSerializer, AvroJsonDeserializer
import pyspark.sql.functions as F
def create_json_avro_encoding(row):
row_dict = row.asDict(True)
json_data = json.loads(avro_serializer.to_json(row_dict))
#print(f"JSON created { multiprocessing.current_process().name }")
return json_data
avro_schema = avro.schema.SchemaFromJSONData(avro_schema_dict, avro.schema.Names())
avro_serializer = AvroJsonSerializer(avro_schema)
records = df.collect()
pool_cnt = int(multiprocessing.cpu_count()*0.5)
print(f"No of records: {len(records)}")
print(f"starting timestamp {datetime.now().isoformat(sep=' ')}")
with Pool(pool_cnt, maxtasksperchild=1) as pool:
json_data_ret = pool.map(create_json_avro_encoding, records)
pool.close()
pool.join()
You shouldn't close the pool before joining. In fact, you shouldn't close the pool at all when using it in a with block, it will close automatically when exiting the with block.
I have a Flask Application that is using multithreading to collect data via thousands of HTTP Requests.
When deploying the Application without multithreading it works as expected, however when I use multithreading there is ~200MB of ram not freed after each run, which leads to a MemoryError.
First I have used queue and multithreading and then repaced it with concurrent.futures.ThreadPoolExecutor. I have also tried to delete some variables and use garbage collect, but the MemoryError still persits.
This code is not leaking memory:
data = []
for subprocess in process:
result = process_valuechains(subprocess, publishedRevision)
data.extend(result)
This code is leaking memory:
import gc
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat
subprocesses = []
for subprocess in process:
subprocesses.append(subprocess)
data = []
with ThreadPoolExecutor() as pool:
for res in pool.map(process_valuechains, subprocesses, repeat(publishedRevision)):
data.extend(res)
del res
gc.collect()
A simplified version of process_valuechains looks like following
def process_valuechains(subprocess, publishedRevision):
data = []
new_data_1= request_data_1(subprocess)
data.extend(new_data_1)
new_data_2= request_data_2(subprocess)
data.extend(new_data_2)
return data
Unfortunately, even after researching a lot I have no idea what exactly is causing the leak and how to fix it.
I have written an application with flask and uses celery for a long running task. While load testing I noticed that the celery tasks are not releasing memory even after completing the task. So I googled and found this group discussion..
https://groups.google.com/forum/#!topic/celery-users/jVc3I3kPtlw
In that discussion it says, thats how python works.
Also the article at https://hbfs.wordpress.com/2013/01/08/python-memory-management-part-ii/ says
"But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS on the heap (that allocates other objects than small objects) only on Windows, if you run on Linux, you can only see the total memory used by your program increase."
And I use Linux. So I wrote the below script to verify it.
import gc
def memory_usage_psutil():
# return the memory usage in MB
import resource
print 'Memory usage: %s (MB)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000.0)
def fileopen(fname):
memory_usage_psutil()# 10 MB
f = open(fname)
memory_usage_psutil()# 10 MB
content = f.read()
memory_usage_psutil()# 14 MB
def fun(fname):
memory_usage_psutil() # 10 MB
fileopen(fname)
gc.collect()
memory_usage_psutil() # 14 MB
import sys
from time import sleep
if __name__ == '__main__':
fun(sys.argv[1])
for _ in range(60):
gc.collect()
memory_usage_psutil()#14 MB ...
sleep(1)
The input was a 4MB file. Even after returning from the 'fileopen' function the 4MB memory was not released. I checked htop output while the loop was running, the resident memory stays at 14MB. So unless the process is stopped the memory stays with it.
So if the celery worker is not killed after its task is finished it is going to keep the memory for itself. I know I can use max_tasks_per_child config value to kill the process and spawn a new one. Is there any other way to return the memory to OS from a python process?.
I think your measurement method and interpretation is a bit off. You are using ru_maxrss of resource.getrusage, which is the "high watermark" of the process. See this discussion for details on what that means. In short, it is the peak RAM usage of your process, but not necessarily current. Parts of the process could be swapped out etc.
It also can mean that the process has freed that 4MiB, but the OS has not reclaimed the memory, because it's faster for the process to allocate new 4MiB if it has the memory mapped already. To make it even more complicated programs can and do use "free lists", lists of blocks of memory that are not in active use, but are not freed. This is also a common trick to make future allocations faster.
I wrote a short script to demonstrate the difference between virtual memory usage and max RSS:
import numpy as np
import psutil
import resource
def print_mem():
print("----------")
print("ru_maxrss: {:.2f}MiB".format(
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024))
print("virtual_memory.used: {:.2f}MiB".format(
psutil.virtual_memory().used / 1024 ** 2))
print_mem()
print("allocating large array (80e6,)...")
a = np.random.random(int(80e6))
print_mem()
print("del a")
del a
print_mem()
print("read testdata.bin (~400MiB)")
with open('testdata.bin', 'rb') as f:
data = f.read()
print_mem()
print("del data")
del data
print_mem()
The results are:
----------
ru_maxrss: 22.89MiB
virtual_memory.used: 8125.66MiB
allocating large array (80e6,)...
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8731.85MiB
del a
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8121.66MiB
read testdata.bin (~400MiB)
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8513.11MiB
del data
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8123.22MiB
It is clear how the ru_maxrss remembers the maximum RSS, but the current usage has dropped in the end.
Note on psutil.virtual_memory().used:
used: memory used, calculated differently depending on the platform and designed for informational purposes only.
I am trying to understand how to share read-only objects with multiprocessing. Sharing bigset when it is a global variable works fine:
from multiprocessing import Pool
bigset = set(xrange(pow(10, 7)))
def worker(x):
return x in bigset
def main():
pool = Pool(5)
print all(pool.imap(worker, xrange(pow(10, 6))))
pool.close()
pool.join()
if __name__ == '__main__':
main()
htop shows that the parent process uses 100% CPU and 0.8% memory, while the workload is distributed evenly among the five children processes: each is using 10% CPU and 0.8% memory. It's all good.
But the numbers start going crazy if I move bigset inside main:
from multiprocessing import Pool
from functools import partial
def worker(x, l):
return x in l
def main():
bigset = set(xrange(pow(10, 7)))
_worker = partial(worker, l=bigset)
pool = Pool(5)
print all(pool.imap(_worker, xrange(pow(10, 6))))
pool.close()
pool.join()
if __name__ == '__main__':
main()
Now htop shows 2 or 3 processes jumping up and down between 50% and 80% CPU, while the remaining processes use less than 10% CPU. And while the parent process still uses 0.8% memory, now all children use 1.9% memory.
What's happening?
When you pass bigset as an argument, it is pickled by the parent process and unplickled by the children processes.[1][2]
Pickling and unpickling a large set requires a lot of time. This explains why you are seeing few processes doing their job: the parent process has to pickle a lot of big objects, and the children have to wait for it. The parent process is a bottleneck.
Pickling parameters implies that parameters have to be sent to processes. Sending data from a process to another requires system calls, which is why you are not seeing 100% CPU usage by the user space code. Part of the CPU time is spent on kernel space.[3]
Pickling objects and sending them to subprocesses also implies that: 1. you need memory for the pickle buffer; 2. each subprocess gets a copy of bigset. This is why you are seeing an increase in memory usage.
Instead, when bigset is a global variable, it is not sent anywhere (unless you are using a start method other than fork). It is just inherited as-is by subprocesses, using the usual copy-on-write rules of fork().
Footnotes:
In case you don't know what "pickling" means: pickle is one of the standard Python protocols to transform arbitrary Python objects to and from a byte sequence.
imap() & co. use queues behind the scenes, and queues work by pickling/unpickling objects.
I tried running your code (with all(pool.imap(_worker, xrange(100))) in order to make the process faster) and I got: 2 minutes user time, 13 seconds system time. That's almost 10% system time.
there is a function in my code that should read the file .each file is about 8M,however the reading speed is too low,and to improve that i use the multiprocessing.sadly,it seems it got blocked.i wanna know is there any methods to help solve this and improve the reading speed?
my code is as follows:
import multiprocessing as mp
import json
import os
def gainOneFile(filename):
file_from = open(filename)
json_str = file_from.read()
temp = json.loads(json_str)
print "load:",filename," len ",len(temp)
file_from.close()
return temp
def gainSortedArr(path):
arr = []
pool = mp.Pool(4)
for i in xrange(1,40):
abs_from_filename = os.path.join(path, "outputDict"+str(i))
result = pool.apply_async(gainOneFile,(abs_from_filename,))
arr.append(result.get())
pool.close()
pool.join()
arr = sorted(arr,key = lambda dic:len(dic))
return arr
and the call function:
whole_arr = gainSortedArr("sortKeyOut/")
You have a few problems. First, you're not parallelizing. You do:
result = pool.apply_async(gainOneFile,(abs_from_filename,))
arr.append(result.get())
over and over, dispatching a task, then immediately calling .get() which waits for it to complete before you dispatch any additional tasks; you never actually have more than one worker running at once. Store all the results without calling .get(), then call .get() later. Or just use Pool.map or related methods and save yourself some hassle from manual individual result management, e.g. (using imap_unordered to minimize overhead since you're just sorting anyway):
# Make generator of paths to load
paths = (os.path.join(path, "outputDict"+str(i)) for i in xrange(1, 40))
# Load them all in parallel, and sort the results by length (lambda is redundant)
arr = sorted(pool.imap_unordered(gainOneFile, paths), key=len)
Second, multiprocessing has to pickle and unpickle all arguments and return values sent between the main process and the workers, and it's all sent over pipes that incur system call overhead to boot. Since your file system isn't likely to gain substantial speed from parallelizing the reads, it's likely to be a net loss, not a gain.
You might be able to get a bit of a boost by switching to a thread based pool; change the import to import multiprocessing.dummy as mp and you'll get a version of Pool implemented in terms of threads; they don't work around the CPython GIL, but since this code is almost certainly I/O bound, that hardly matters, and it removes the pickling and unpickling as well as the IPC involved in worker communications.
Lastly, if you're using Python 3.3 or higher on a UNIX like system, you may be able to get the OS to help you out by having it pull files into the system cache more aggressively. If you can open the file, then use os.posix_fadvise on the file descriptor (.fileno() on file objects) with either WILLNEED or SEQUENTIAL it might improve read performance when you read from the file at some later point by aggressively prefetching file data before you request it.