I have a Flask application that uses multithreading to collect data via thousands of HTTP requests.
When I deploy the application without multithreading it works as expected, but with multithreading roughly 200 MB of RAM is not freed after each run, which eventually leads to a MemoryError.
First I used queue and multithreading and then replaced it with concurrent.futures.ThreadPoolExecutor. I have also tried deleting some variables and forcing garbage collection, but the MemoryError still persists.
This code is not leaking memory:
data = []
for subprocess in process:
    result = process_valuechains(subprocess, publishedRevision)
    data.extend(result)
This code is leaking memory:
import gc
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

subprocesses = []
for subprocess in process:
    subprocesses.append(subprocess)

data = []
with ThreadPoolExecutor() as pool:
    for res in pool.map(process_valuechains, subprocesses, repeat(publishedRevision)):
        data.extend(res)
        del res
        gc.collect()
A simplified version of process_valuechains looks like the following:
def process_valuechains(subprocess, publishedRevision):
    data = []
    new_data_1 = request_data_1(subprocess)
    data.extend(new_data_1)
    new_data_2 = request_data_2(subprocess)
    data.extend(new_data_2)
    return data
Unfortunately, even after a lot of research, I have no idea what exactly is causing the leak or how to fix it.
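One generic way to narrow this down (a diagnostic sketch, not a fix; it assumes the collection pass above can be run in isolation) is to compare tracemalloc snapshots taken before and after a run and see which source lines keep growing:

import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
# ... run one collection pass here, e.g. the ThreadPoolExecutor loop above ...
after = tracemalloc.take_snapshot()

# print the ten source lines whose allocations grew the most between the snapshots
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)

Note that tracemalloc only tracks Python-level allocations; memory held at the C level by the HTTP stack or the allocator itself will not show up here.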
Related
I am running a Python component in a Databricks environment which creates a set of JSON messages, and each JSON message is encoded with an Avro schema. The encoding was taking a long time (8 minutes to encode 10K messages with a complex JSON structure), so I tried to use multiprocessing with the pool map function.
The process seems to work fine for the first execution; however, for subsequent runs the performance degrades and it eventually fails with an OOM error. I make sure that at the end of execution pool.close() and pool.join() are issued, but I am not sure whether they really free up the memory. When I look at the Databricks Ganglia UI, it shows that swap memory and CPU utilization increase with each run. I also tried reducing the number of pools (the driver node has 8 cores, so I tried with 6 and 4 pools) and setting maxtasksperchild=1, but that still doesn't help.
I am wondering if I'm doing anything wrong. Following is the code I'm using now. What is causing the issue here? Any pointers / suggestions are appreciated.
from multiprocessing import Pool
import multiprocessing
import json
from datetime import datetime

from avro.io import *
import avro.schema
from avro_json_serializer import AvroJsonSerializer, AvroJsonDeserializer
import pyspark.sql.functions as F

def create_json_avro_encoding(row):
    row_dict = row.asDict(True)
    json_data = json.loads(avro_serializer.to_json(row_dict))
    # print(f"JSON created {multiprocessing.current_process().name}")
    return json_data

avro_schema = avro.schema.SchemaFromJSONData(avro_schema_dict, avro.schema.Names())
avro_serializer = AvroJsonSerializer(avro_schema)

records = df.collect()
pool_cnt = int(multiprocessing.cpu_count() * 0.5)
print(f"No of records: {len(records)}")
print(f"starting timestamp {datetime.now().isoformat(sep=' ')}")

with Pool(pool_cnt, maxtasksperchild=1) as pool:
    json_data_ret = pool.map(create_json_avro_encoding, records)
    pool.close()
    pool.join()
You shouldn't close the pool before joining it. In fact, you don't need to close the pool at all when using it in a with block; it is shut down automatically when the with block exits.
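In other words, a minimal version of the fixed block (reusing pool_cnt, create_json_avro_encoding, and records from the question) would be:

with Pool(pool_cnt, maxtasksperchild=1) as pool:
    # no explicit close()/join(): the with block shuts the pool down on exit
    json_data_ret = pool.map(create_json_avro_encoding, records)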
How do I avoid an "out of memory" exception when a lot of subprocesses are launched using multiprocessing.Pool?
First of all, my program loads a 5 GB file into an object. Next, parallel processing runs, where each process reads that 5 GB object.
Because my machine has more than 30 cores, I want to use all of them. However, when launching 30 subprocesses, an out-of-memory exception occurs.
Probably each process holds its own copy of the large instance (5 GB). The total memory is 5 GB * 30 cores = 150 GB. That's why the out-of-memory error occurs.
I believe there is a workaround for this memory error, because each process only reads that object. If the processes shared the memory of the huge object, 5 GB would be enough for my multiprocessing.
Please let me know a workaround for this memory error.
import cPickle
from multiprocessing import Pool
from multiprocessing import Process
import multiprocessing
from functools import partial

with open("huge_data_5GB.pickle", "rb") as f:
    huge_instance = cPickle.load(f)

def run_process(i, huge_instance):
    return huge_instance.get_element(i)

partial_process = partial(run_process, huge_instance=huge_instance)

p = Pool(30)  # my machine has more than 30 cores
result = p.map(partial_process, range(10000))
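For reference, a common workaround on fork-based platforms (Linux) is to keep the big object in a module-level variable and pass only the index to the workers, so the forked children inherit the object via copy-on-write instead of receiving a pickled copy per task. A sketch reusing the names from the question:

import cPickle
from multiprocessing import Pool

with open("huge_data_5GB.pickle", "rb") as f:
    huge_instance = cPickle.load(f)  # loaded once, in the parent

def run_process(i):
    # the forked workers see the parent's huge_instance; nothing is pickled per task
    return huge_instance.get_element(i)

p = Pool(30)
result = p.map(run_process, range(10000))

Note that CPython's reference counting still writes to the shared pages, so some of them can be copied over time; the sharing is not perfect, but it avoids making 30 full copies up front.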
I am trying to use the Python multiprocessing library in order to parallelize a task I am working on:
import multiprocessing as MP

def myFunction((x, y, z)):
    # ...create a sqlite3 database specific to x, y, z
    # ...write to the database (one DB per process)

y = 'somestring'
z = <large read-only global dictionary to be shared>

jobs = []
for x in X:
    jobs.append((x, y, z,))

pool = MP.Pool(processes=16)
pool.map(myFunction, jobs)
pool.close()
pool.join()
Sixteen processes are started, as seen in htop; however, no errors are returned, no files are written, and no CPU is used.
Could it be that there is an error in myFunction that is not reported to STDOUT and blocks execution?
Perhaps it is relevant that the Python script is called from a bash script running in the background.
The lesson learned here was to follow the strategy suggested in one of the comments and use multiprocessing.dummy until everything works.
At least in my case, errors were not visible otherwise and the processes were still running as if nothing had happened.
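Concretely, the debugging swap is just a change of import (a sketch, reusing myFunction and jobs from the question above), so the workers run as threads inside the main process while you test:

# drop-in thread-based pool for debugging: myFunction runs in the main process,
# so exceptions and print output are immediately visible
import multiprocessing.dummy as MP

pool = MP.Pool(processes=16)
pool.map(myFunction, jobs)
pool.close()
pool.join()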
There is a function in my code that should read files. Each file is about 8 MB; however, the reading speed is too low, and to improve it I use multiprocessing. Sadly, it seems to get blocked. I want to know whether there are any methods to solve this and improve the reading speed.
My code is as follows:
import multiprocessing as mp
import json
import os

def gainOneFile(filename):
    file_from = open(filename)
    json_str = file_from.read()
    temp = json.loads(json_str)
    print "load:", filename, " len ", len(temp)
    file_from.close()
    return temp

def gainSortedArr(path):
    arr = []
    pool = mp.Pool(4)
    for i in xrange(1, 40):
        abs_from_filename = os.path.join(path, "outputDict" + str(i))
        result = pool.apply_async(gainOneFile, (abs_from_filename,))
        arr.append(result.get())
    pool.close()
    pool.join()
    arr = sorted(arr, key=lambda dic: len(dic))
    return arr
And the calling code:
whole_arr = gainSortedArr("sortKeyOut/")
You have a few problems. First, you're not parallelizing. You do:
result = pool.apply_async(gainOneFile,(abs_from_filename,))
arr.append(result.get())
over and over, dispatching a task, then immediately calling .get() which waits for it to complete before you dispatch any additional tasks; you never actually have more than one worker running at once. Store all the results without calling .get(), then call .get() later. Or just use Pool.map or related methods and save yourself some hassle from manual individual result management, e.g. (using imap_unordered to minimize overhead since you're just sorting anyway):
# Make generator of paths to load
paths = (os.path.join(path, "outputDict"+str(i)) for i in xrange(1, 40))
# Load them all in parallel, and sort the results by length (lambda is redundant)
arr = sorted(pool.imap_unordered(gainOneFile, paths), key=len)
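For completeness, the first option described above (dispatch every task, then call .get() afterwards) might look roughly like this, reusing gainOneFile and the filename pattern from the question:

def gainSortedArr(path):
    pool = mp.Pool(4)
    # dispatch all tasks before waiting on any of them...
    async_results = [pool.apply_async(gainOneFile, (os.path.join(path, "outputDict" + str(i)),))
                     for i in xrange(1, 40)]
    # ...then collect the results once everything is in flight
    arr = [res.get() for res in async_results]
    pool.close()
    pool.join()
    return sorted(arr, key=len)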
Second, multiprocessing has to pickle and unpickle all arguments and return values sent between the main process and the workers, and it's all sent over pipes that incur system call overhead to boot. Since your file system isn't likely to gain substantial speed from parallelizing the reads, it's likely to be a net loss, not a gain.
You might be able to get a bit of a boost by switching to a thread based pool; change the import to import multiprocessing.dummy as mp and you'll get a version of Pool implemented in terms of threads; they don't work around the CPython GIL, but since this code is almost certainly I/O bound, that hardly matters, and it removes the pickling and unpickling as well as the IPC involved in worker communications.
Lastly, if you're using Python 3.3 or higher on a UNIX-like system, you may be able to get the OS to help you out by having it pull files into the system cache more aggressively. If you can open the file, you can call os.posix_fadvise on its file descriptor (.fileno() on file objects) with either WILLNEED or SEQUENTIAL; this may improve read performance when you read from the file later, by aggressively prefetching file data before you request it.
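A sketch of that last suggestion (Python 3.3+ on a POSIX system; the filename here is just an example following the question's naming):

import os

with open("sortKeyOut/outputDict1", "rb") as f:
    # advise the kernel that the whole file will be read soon and sequentially,
    # so it can prefetch the data into the page cache before the actual read
    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_WILLNEED)
    data = f.read()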
I have a python program that's been running for a while, and because of an unanticipated event, I'm now unsure that it will complete within a reasonable amount of time. The data it's collected so far, however, is valuable, and I would like to recover it if possible.
Here is the relevant code:
from multiprocessing.dummy import Pool as ThreadPool

def pull_details(url):
    # accesses a given URL
    # returns some data which gets appended to the results list

pool = ThreadPool(25)
results = pool.map(pull_details, urls)
pool.close()
pool.join()
So I either need to access the data that is currently in results, or somehow change the source code (or manually alter the program's control flow) to break out of the loop so that execution continues to the later part of the program where the data is exported (I'm not sure whether the second approach is possible).
It seems as though the first option is also quite tricky, but luckily the IDE (Spyder) I'm using indicates the value of what I assume is the location of the list in the machine's memory (0xB73EDECCL).
Is it possible to create a C program (or another python program) to access this location in memory and read what's there?
Can't you use some sort of mechanism to exchange data between the two processes, like queues or pipes?
Something like the below:
from functools import partial
from multiprocessing import Queue
from multiprocessing.dummy import Pool as ThreadPool

def pull_details(url, q=None):
    # ... access the given URL ...
    q.put([my useful data])  # each result lands in the queue as soon as it is produced

q = Queue()
pool = ThreadPool(25)
pool.map_async(partial(pull_details, q=q), urls)

# collect whatever has been produced so far; one item per URL when everything finishes
results = []
for _ in range(len(urls)):
    results.append(q.get())

pool.close()
pool.join()