How to clear objects from the object store in ray? - python

I am trying out the promising multiprocessing package ray, and I have a problem I can't seem to solve. My program runs fine the first time, but on a second run this exception is raised on the ray.put() line:
ObjectStoreFullError: Failed to put object ffffffffffffffffffffffffffffffffffffffff010000000c000000 in object store because it is full. Object size is 2151680255 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
What do I want to do:
In my actual code (which I'm planning to write) I need to process many big_data_objects sequentially. I want to hold one big_data_object in memory at a time and do several heavy (independent) computations on it, and I want to execute these computations in parallel. When these are done, I have to replace this big_data_object in the object store with a new one and start the computations (in parallel) again.
Using my test script I simulate this by starting the script again without ray.shutdown(). If I shut down ray using ray.shutdown(), the object store is cleared, but then reinitializing takes a long time and I cannot process multiple big_data_objects sequentially as I want to.
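Roughly, the loop I have in mind looks like this (pseudocode only; load_next_big_object and feature_functions are placeholders, not my actual code):

for big_data_object in load_next_big_object():                # hypothetical loader, yields one big object at a time
    big_data_object_ref = ray.put(big_data_object)            # hold exactly one big object in the object store
    result_refs = [f.remote(big_data_object_ref) for f in feature_functions]  # independent heavy computations
    results = ray.get(result_refs)                            # wait for all of them
    # ... store the results, drop the references, and move on to the next big object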
What sources of information have I studied:
I studied the document Ray Design Patterns, in particular the section 'Antipattern: Closure capture of large / unserializable object' and what the proper pattern(s) should look like. I also studied the getting started guide, which led to the following test script.
A minimum example to reproduce the problem:
I created a test script to reproduce the problem:
#%% Imports
import ray
import time
import psutil
import numpy as np

#%% Testing ray
# Start Ray
num_cpus = psutil.cpu_count(logical=False)
if not ray.is_initialized():
    ray.init(num_cpus=num_cpus, include_dashboard=False)

# Define function to do work in parallel
@ray.remote
def my_function(x):  # Later I will have multiple (different) my_functions to extract different features from my big_data_object
    time.sleep(1)
    data_item = ray.get(big_data_object_ref)
    return data_item[0,0] + x

# Define large data
big_data_object = np.random.rand(16400,16400)  # Define an object of approx 2 GB. Works on my machine (16 GB RAM)
# big_data_object = np.random.rand(11600,11600)  # Define an object of approx 1 GB.
# big_data_object = np.random.rand(8100,8100)    # Define an object of approx 500 MB.
# big_data_object = np.random.rand(5000,5000)    # Define an object of approx 190 MB.
big_data_object_ref = ray.put(big_data_object)

# Start 4 tasks in parallel.
result_refs = []
# for item in data:
for item in range(4):
    result_refs.append(my_function.remote(item))

# Wait for the tasks to complete and retrieve the results.
# With at least 4 cores, this will take 1 second.
results = ray.get(result_refs)
print("Results: {}".format(results))

#%% Clean-up object store data - Still there is a (huge) memory leak in the object store.
for index in range(4):
    del result_refs[0]
del big_data_object_ref
Where do I think it's going wrong:
I think I delete all the references to the object store at the end of the script. As a result, the objects should be cleared from the object store (as described here). Apparently something is wrong, because the big_data_object remains in the object store. The results are deleted from the object store just fine, however.
Some debug information:
I inspected the object store using the ray memory command; this is what I get:
(c:\python\cenv38rl) PS C:\WINDOWS\system32> ray memory
---------------------------------------------------------------------------------------------------------------------
Object ID Reference Type Object Size Reference Creation Site
=====================================================================================================================
; worker pid=20952
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=29368
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=17388
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=24208
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=27684
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=6860
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; driver pid=28684
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\worker.py:put_object:277 | c:\python\cenv38rl\lib\site-packages\ray\worker.py:put:1489 | c:\python\cenv38rl\lib\site-packages\ray\_private\client_mode_hook.py:wrapper:47 | C:\Users\Stefan\Documents\Python examples\Multiprocess_Ray3_SO.py:<module>:42
---------------------------------------------------------------------------------------------------------------------
--- Aggregate object store stats across all nodes ---
Plasma memory usage 2052 MiB, 1 objects, 77.41% full
Some of the things I have tried:
If I replace my_function with:
@ray.remote
def my_function(x):  # Later I will have multiple different my_functions to extract separate features from my big_data_objects
    time.sleep(1)
    # data_item = ray.get(big_data_object_ref)
    # return data_item[0,0]+x
    return 5
then the script successfully clears the object store, but my_function can no longer use the big_data_object, which I need it to.
My question is: How to fix my code so that the big_data_object is removed from the object store at the end of my script without shutting down ray?
Note: I installed ray using pip install ray, which gave me version ray==1.2.0, which I am using now. I use ray on Windows and develop in Spyder v4.2.5 in a conda (actually miniconda) environment, in case it is relevant.
EDIT:
I have also tested on an Ubuntu machine with 8 GB RAM. For this I used a big_data_object of approx. 1 GB.
I can confirm the issue also occurs on this machine.
The ray memory output:
(SO_ray) stefan#stefan-HP-ZBook-15:~/Documents/Ray_test_scripts$ ray memory
---------------------------------------------------------------------------------------------------------------------
Object ID Reference Type Object Size Reference Creation Site
=====================================================================================================================
; worker pid=18593
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; worker pid=18591
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; worker pid=18590
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; driver pid=17712
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 (put object) | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:wrapper:47 | /home/stefan/Documents/Ray_test_scripts/Multiprocess_Ray3_SO.py:<module>:43 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/spyder_kernels/customize/spydercustomize.py:exec_code:453
; worker pid=18592
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
---------------------------------------------------------------------------------------------------------------------
--- Aggregate object store stats across all nodes ---
Plasma memory usage 1026 MiB, 1 objects, 99.69% full
I have to run the program in Spyder so that after execution of the program I can inspect the object store's memory using ray memory. If I run the program in PyCharm, for example, ray is automatically terminated when the script completes, so I cannot check whether my script clears the object store as intended.

The problem is that your remote function captures big_data_object_ref, and the reference from there is never removed. Note that when you do this type of thing:
# Define function to do work in parallel
@ray.remote
def my_function(x):  # Later I will have multiple (different) my_functions to extract different features from my big_data_object
    time.sleep(1)
    data_item = ray.get(big_data_object_ref)
    return data_item[0,0] + x

# Define large data
big_data_object = np.random.rand(16400,16400)
big_data_object_ref = ray.put(big_data_object)
big_data_object_ref is serialized into the remote function definition, so there is a permanent reference to the object until you remove that serialized function definition (which lives in the ray internals).
Instead use this type of pattern:
#%% Imports
import ray
import time
import psutil
import numpy as np

#%% Testing ray
# Start Ray
num_cpus = psutil.cpu_count(logical=False)
if not ray.is_initialized():
    ray.init(num_cpus=num_cpus, include_dashboard=False)

# Define function to do work in parallel
@ray.remote
def my_function(big_data_object, x):
    time.sleep(1)
    return big_data_object[0,0] + x

# Define large data
# big_data_object = np.random.rand(16400,16400)  # Define an object of approx 2 GB. Works on my machine (16 GB RAM)
# big_data_object = np.random.rand(11600,11600)  # Define an object of approx 1 GB.
big_data_object = np.random.rand(8100,8100)      # Define an object of approx 500 MB.
# big_data_object = np.random.rand(5000,5000)    # Define an object of approx 190 MB.
big_data_object_ref = ray.put(big_data_object)
print("ref in a driver ", big_data_object_ref)

# Start 4 tasks in parallel.
result_refs = []
# for item in data:
for item in range(4):
    result_refs.append(my_function.remote(big_data_object_ref, item))

# Wait for the tasks to complete and retrieve the results.
# With at least 4 cores, this will take 1 second.
results = ray.get(result_refs)
print("Results: {}".format(results))
print(result_refs)

#%% Clean-up object store data
# for index in range(4):
#     del result_refs[0]
del result_refs
del big_data_object_ref

# Keep the script alive so the object store can be inspected with 'ray memory'.
time.sleep(1000)
The difference is that we now pass big_data_object_ref as an argument to the remote function instead of capturing it in the remote function's closure.
Note: When an object reference is passed to a remote function, it is automatically dereferenced, so there is no need to call ray.get() in the remote function. If you'd like to call ray.get() explicitly inside a remote function, pass the object reference inside a list or dictionary as the argument. In that case, you get something like:
# Remote function
@ray.remote
def my_function(big_data_object_ref_list, x):
    time.sleep(1)
    big_data_object = ray.get(big_data_object_ref_list[0])
    return big_data_object[0,0] + x

# Calling the remote function
my_function.remote([big_data_object_ref], item)
Note 2: You use Spyder, which uses an IPython console. There are currently some known issues between ray and IPython consoles. Just make sure you delete the references inside your script, not with commands entered directly into the IPython console (otherwise the references will be removed but the objects will not be removed from the object store). To inspect the object store using the ray memory command while your script is running, you can add some code at the end of your script like:
#%% Testing ray
# ... my ray testing code
#%% Clean-up object store data
print("Wait 10 sec BEFORE deletion")
time.sleep(10) # Now quickly use the 'ray memory' command to inspect the contents of the object store.
del result_refs
del big_data_object_ref
print("Wait 10 sec AFTER deletion")
time.sleep(10) # Now again use the 'ray memory' command to inspect the contents of the object store.
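If you extend this pattern to the sequential workflow from the question (one big object at a time, several parallel computations, then the next object), a rough sketch could look like the following; the object sizes and the three-iteration loop are just illustrative, and it assumes ray has already been initialized as in the script above:

import time
import ray
import numpy as np

@ray.remote
def my_function(big_data_object, x):
    time.sleep(1)
    return big_data_object[0, 0] + x

for i in range(3):                                    # process several large objects one after another
    big_data_object = np.random.rand(8100, 8100)      # stand-in for loading the next big object
    big_data_object_ref = ray.put(big_data_object)
    del big_data_object                               # keep only the object-store copy in scope

    result_refs = [my_function.remote(big_data_object_ref, item) for item in range(4)]
    results = ray.get(result_refs)
    print("Object {}: {}".format(i, results))

    # Drop all references so this object can be evicted before the next one is put.
    del result_refs
    del big_data_object_ref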

Related

multiprocessing.Process doesn't use all processes with large data

I have a very large Python codebase. The essence of it is a function that takes a row of a DataFrame, applies some formulas, and saves the resulting object with joblib to my files. (I'm going to show a function that captures the essence of the script.)
import multiprocessing as multi
import joblib
# My_object, path_save and some_parameter1/2 come from the real script.

def somefunct(DataFrame_row, some_parameter1, some_parameter2, sema):
    python_object = My_object(DataFrame_row['Column1'], DataFrame_row['Column2'])
    python_object.some_complicate_method(some_parameter1, some_parameter2)
    # for example calculate an integral of My_object data
    # takes 50-60 seconds approx per row
    joblib.dump(python_object, path_save)
    # Before the function that saves the object to disk, I tried a function that
    # saves the object in the DataFrame
    sema.release()

def apply_all_data_frame(df, n_processes):
    sema = multi.Semaphore(n_processes)
    procesos_list = []
    for index, row in df.iterrows():
        sema.acquire()
        p = multi.Process(target=somefunct,
                          args=(row, some_parameter1, some_parameter2, sema))
        procesos_list.append(p)
        p.start()
    for proceso in procesos_list:
        proceso.join()
So, the DataFrame contains 5000 rows, and it may contain more in the future. I tested the script with data of 100 rows on a computer with 16 cores and 32 logical processors. I chose 30 processes, and with 100 rows it used all 30 processes (100% CPU) and finished quickly. But when I try again with all the data, the computer only uses 3 or 4 processes (11%), each using 2.0 GB of RAM, and it takes too long.
My first attempt was with Pool and Pool.map, but in that case the problem is the same: it fills the RAM and breaks everything, despite using fewer processes (16, I think).
I've mentioned in the script comments that my first program saved the object back into the DataFrame, but when I saw the RAM fill to 100% I decided to save the objects to disk instead. In that case I also tried Pool, and everything froze, because it created Python processes doing 0% work on the CPU.
I also tried the function without the Semaphore.
I apologize for my English and for the explanation; this is my first question online.
Screenshot of how the processes behave on the computer

Python multiprocessing gradually increases memory until it runs out

I have a python program with multiple modules. They go like this:
Job class that is the entry point and manages the overall flow of the program
Task class that is the base class for the tasks to be run on given data. Many SubTask classes, created specifically for different types of calculations on different columns of data, are derived from the Task class. Think of 10 columns in the data, each having its own Task to do some processing; e.g. a 'price' column can be used by a CurrencyConverterTask to return local currency values, and so on (see the sketch after this list).
Many other modules, like a connector for getting data, a utils module, etc., which I don't think are relevant for this question.
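For context, here is a minimal sketch of what such a Task hierarchy might look like; the method bodies and the exchange rate are purely illustrative, only the class names come from the description above:

class Task:
    """Base class: each subclass processes one column of a data block."""
    def do_something(self, block):
        raise NotImplementedError

class CurrencyConverterTask(Task):
    """Example subtask: convert the 'price' column to local currency."""
    def __init__(self, rate=0.85):           # hypothetical fixed exchange rate
        self.rate = rate

    def do_something(self, block):
        for row in block:
            row['price'] = row['price'] * self.rate
        return block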
The general flow of program: get data from the db continuously -> process the data -> write back the updated data to the db.
I decided to do it with multiprocessing because the tasks are relatively simple. Most of them do some basic arithmetic or logic operations, and running everything in one process takes a long time; in particular, getting data from a large db and processing it in sequence is very slow.
So the multiprocessing (mp) code looks something like this (I cannot expose the entire file, so I'm writing a simplified version; the parts not included are not relevant here. I've tested by commenting them out, so this is an accurate representation of the actual code):
import multiprocessing as mp
# Connector, Timer, utils and taskmodule come from the other modules mentioned above.

class Job():
    def __init__(self):
        self.block_size = 100  # process 100 rows at a time
        self.some_query = "SELECT * IF A > B"  # some query to filter data from db

    def data_getter(self):
        # continuously get data from the db and put it into a queue in blocks
        cursor = Connector.get_data(self.some_query)
        block = []
        for item in cursor:
            block.append(item)
            if len(block) == self.block_size:
                self.data_queue.put(block)
                block = []
        self.data_queue.put(None)  # this will indicate to the worker processes when to stop

    def monitor(self):
        # continuously monitor the system stats
        timer = Timer()
        while True:
            if timer.time_taken >= 60:  # log some stats every 60 seconds
                print(utils.system_stats())
                timer.reset()

    def task_runner(self):
        while True:
            # get data from the queue
            # if there's no data, break out of the loop
            data = self.data_queue.get()
            if data is None:
                break
            # run the tasks one by one
            for task in self.tasks:
                task.do_something(data)

    def run(self):
        # queue to put data in for processing
        self.data_queue = mp.Queue()
        # start a process for reading data from the db
        dg = mp.Process(target=self.data_getter)
        dg.start()
        # start a process for monitoring system stats
        mon = mp.Process(target=self.monitor)
        mon.start()
        # get a list of tasks to run
        self.tasks = [t for t in taskmodule.get_subtasks()]
        workers = []
        # start 4 processes to do the actual processing
        for _ in range(4):
            worker = mp.Process(target=self.task_runner)
            worker.start()
            workers.append(worker)
        for w in workers:
            w.join()
        mon.terminate()  # terminate the monitor process
        dg.terminate()   # end the data-getting process

if __name__ == "__main__":
    job = Job()
    job.run()
The whole program is run like: python3 runjob.py
Expected behaviour: a continuous stream of data goes into the data_queue, and each worker process gets the data and processes it until there's no more data from the cursor, at which point the workers finish and the entire program finishes.
This is working as expected, but what is not expected is that the system memory usage keeps creeping up continuously until the system crashes. The data I'm getting here is not copied anywhere (at least not intentionally). I expect the memory usage to be steady throughout the program. The length of the data_queue rarely exceeds 1 or 2, since the processes are fast enough to fetch the data when it is available, so it's not the queue holding too much data.
My guess is that all the processes initiated here are long-running ones and that has something to do with this. I can print the PIDs, and if I follow them in the top command, the data_getter and monitor processes don't exceed 2% of memory usage. The 4 worker processes also don't use a lot of memory, and neither does the main process the whole thing runs in. Yet there is an unaccounted-for process that takes up 20%+ of the RAM, and it bugs me so much that I can't figure out what it is.
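For what it's worth, one way I could attribute memory to individual processes (instead of eyeballing top) would be a small psutil helper like the sketch below; it is only a diagnostic idea, not part of the actual program:

import os
import psutil

def log_process_memory():
    """Print the resident set size (RSS) of this process and all of its children, in MiB."""
    parent = psutil.Process(os.getpid())
    print("main  pid={}: {:.1f} MiB".format(parent.pid, parent.memory_info().rss / 1024 ** 2))
    for child in parent.children(recursive=True):
        print("child pid={}: {:.1f} MiB".format(child.pid, child.memory_info().rss / 1024 ** 2))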

Distributed TensorFlow - Not running some workers

I'm trying to get a very simple example of distributed TensorFlow working. However, I'm having a bug that appears non-deterministically between runs. On some runs, it works perfectly. Outputting something along the lines of:
Worker 2 | step 0
Worker 0 | step 0
Worker 1 | step 0
Worker 3 | step 0
Worker 2 | step 1
Worker 0 | step 1
Worker 1 | step 1
Worker 3 | step 1
...
However, every once in a while, one or more of the workers fails to run, resulting in output like this:
Worker 0 | step 0
Worker 3 | step 0
Worker 0 | step 1
Worker 3 | step 1
Worker 0 | step 2
Worker 3 | step 2
...
If I run the loop indefinitely, it seems that the missing workers always start up at some point, but only minutes later, which isn't practical.
I've found that two things make the issue go away (but make the program useless): 1. not declaring any tf Variables inside the with tf.device(tf.train.replica_device_setter()) scope (if I declare even one variable, e.g. nasty_var below, the issue starts cropping up), and 2. setting the is_chief param in tf.train.MonitoredTrainingSession() to True for all workers. This makes the bug go away even when variables are declared, but it seems wrong to make all of the workers the chief. The way I'm currently setting it below, is_chief=(task_index == 0), is taken directly from a TensorFlow tutorial.
Here's the simplest code I can get to replicate the issue. (You may have to run it multiple times to see the bug, but it almost always shows up within 5 runs.)
from multiprocessing import Process
import tensorflow as tf
from time import sleep
from numpy.random import random_sample

cluster = tf.train.ClusterSpec({'ps': ['localhost:2222'],
                                'worker': ['localhost:2223',
                                           'localhost:2224',
                                           'localhost:2225',
                                           'localhost:2226']})

def create_worker(task_index):
    server = tf.train.Server(cluster, job_name='worker', task_index=task_index)
    with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        nasty_var = tf.Variable(0)  # This line causes the problem. No issue when this is commented out.
    with tf.train.MonitoredTrainingSession(master=server.target, is_chief=(task_index == 0)):
        for step in xrange(10000):
            sleep(random_sample())  # Simulate some work being done.
            print 'Worker %d | step %d' % (task_index, step)

def create_ps(task_index):
    param_server = tf.train.Server(cluster, job_name='ps',
                                   task_index=task_index)
    param_server.join()

# Launch workers and ps in separate processes.
processes = []
for i in xrange(len(cluster.as_dict()['worker'])):
    print 'Forking worker process ', i
    p = Process(target=create_worker, args=[i])
    p.start()
    processes.append(p)
for i in xrange(len(cluster.as_dict()['ps'])):
    print 'Forking ps process ', i
    p = Process(target=create_ps, args=[i])
    p.start()
    processes.append(p)
for p in processes:
    p.join()
I'm guessing the cause here is the implicit coordination protocol in how a tf.train.MonitoredTrainingSession starts, which is implemented as follows:
If this session is the chief:
    Run the variable initializer op.
Else (if this session is not the chief):
    Run an op to check whether the variables have been initialized.
    While any of the variables has not yet been initialized:
        Wait 30 seconds.
        Try creating a new session, and check whether the variables have been initialized.
(I discuss the rationale behind this protocol in a video about Distributed TensorFlow.)
When every session is the chief, or there are no variables to initialize, the tf.train.MonitoredTrainingSession will always start immediately. However, once there is a single variable, and you only have a single chief, you will see that the non-chief workers have to wait for the chief to act.
The reason for using this protocol is that it is robust to various processes failing, and the delay—while very noticeable when running everything on a single process—is short compared to the expected running time of a typical distributed training job.
Looking at the implementation again, it does seem that this 30-second timeout should be configurable (as the recovery_wait_secs argument to tf.train.SessionManager()), but there is currently no way to set this timeout when you create a tf.train.MonitoredTrainingSession, because it uses a hardcoded set of arguments for creating a session manager.
This seems like an oversight in the API, so please feel free to open a feature request on the GitHub issues page!
As mrry said, the problem exists because:
Non-chief relies on chief to initialize the model.
If it isn't initialized, then it waits for 30 secs.
Performance-wise there is little difference between waiting for the chief and kicking in at the next 30-second check. However, I was recently doing research that required strictly synchronized updates, and this problem needed to be taken care of.
The key here is to use a barrier, depending on your distributed setting. Assume you are using thread-1 to run ps and threads 2-5 to run workers; then you only need to:
Instead of using a MonitoredTrainingSession, use a tf.train.Supervisor, which lets you set recovery_wait_secs (default 30 s). Change it to 1 s to reduce your wait time.
sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir=...,
                         init_op=...,
                         ...
                         recovery_wait_secs=1)  # 1 second instead of the default 30
sess = sv.prepare_or_wait_for_session(server.target,
                                      config=sess_config)
Use a barrier. Assume you are using threads:
In main:
barrier = threading.Barrier(parties=num_workers)
for i in range(num_workers):
    threads.append(threading.Thread(target=run_model, args=("worker", i, barrier, )))
threads.append(threading.Thread(target=run_model, args=("ps", 0, barrier, )))
In the actual training function:
_ = sess.run([train_op], feed_dict=train_feed)
barrier.wait()
Then just proceed happily. The barrier makes sure that all models reach this step, so there are definitely no race conditions.

Python Garbage Collection: Memory no longer needed not released to OS?

I have written an application with Flask that uses Celery for a long-running task. While load testing, I noticed that the celery tasks are not releasing memory even after completing the task. So I googled and found this group discussion:
https://groups.google.com/forum/#!topic/celery-users/jVc3I3kPtlw
In that discussion it says that that is just how Python works.
Also the article at https://hbfs.wordpress.com/2013/01/08/python-memory-management-part-ii/ says
"But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS on the heap (that allocates other objects than small objects) only on Windows, if you run on Linux, you can only see the total memory used by your program increase."
And I use Linux. So I wrote the script below to verify it.
import gc

def memory_usage_psutil():
    # return the memory usage in MB
    import resource
    print 'Memory usage: %s (MB)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000.0)

def fileopen(fname):
    memory_usage_psutil()  # 10 MB
    f = open(fname)
    memory_usage_psutil()  # 10 MB
    content = f.read()
    memory_usage_psutil()  # 14 MB

def fun(fname):
    memory_usage_psutil()  # 10 MB
    fileopen(fname)
    gc.collect()
    memory_usage_psutil()  # 14 MB

import sys
from time import sleep

if __name__ == '__main__':
    fun(sys.argv[1])
    for _ in range(60):
        gc.collect()
        memory_usage_psutil()  # 14 MB ...
        sleep(1)
The input was a 4 MB file. Even after returning from the fileopen function, the 4 MB of memory was not released. I checked the htop output while the loop was running; the resident memory stayed at 14 MB. So unless the process is stopped, the memory stays with it.
So if the celery worker is not killed after its task is finished, it is going to keep the memory for itself. I know I can use the max_tasks_per_child config value to kill the process and spawn a new one. Is there any other way to return the memory to the OS from a Python process?
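For reference, as far as I understand that setting is configured roughly like this (the exact name depends on the Celery version; in Celery 4+ it is worker_max_tasks_per_child, older versions use CELERYD_MAX_TASKS_PER_CHILD, and the broker URL below is just illustrative):

from celery import Celery

app = Celery('myapp', broker='redis://localhost:6379/0')  # illustrative broker URL

# Recycle each worker child process after it has executed 100 tasks,
# returning whatever memory it accumulated back to the OS.
app.conf.worker_max_tasks_per_child = 100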
I think your measurement method and interpretation are a bit off. You are using ru_maxrss from resource.getrusage, which is the "high watermark" of the process. See this discussion for details on what that means. In short, it is the peak RAM usage of your process, but not necessarily the current usage. Parts of the process could be swapped out, etc.
It also can mean that the process has freed that 4MiB, but the OS has not reclaimed the memory, because it's faster for the process to allocate new 4MiB if it has the memory mapped already. To make it even more complicated programs can and do use "free lists", lists of blocks of memory that are not in active use, but are not freed. This is also a common trick to make future allocations faster.
I wrote a short script to demonstrate the difference between virtual memory usage and max RSS:
import numpy as np
import psutil
import resource

def print_mem():
    print("----------")
    print("ru_maxrss: {:.2f}MiB".format(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024))
    print("virtual_memory.used: {:.2f}MiB".format(
        psutil.virtual_memory().used / 1024 ** 2))

print_mem()
print("allocating large array (80e6,)...")
a = np.random.random(int(80e6))
print_mem()
print("del a")
del a
print_mem()
print("read testdata.bin (~400MiB)")
with open('testdata.bin', 'rb') as f:
    data = f.read()
print_mem()
print("del data")
del data
print_mem()
The results are:
----------
ru_maxrss: 22.89MiB
virtual_memory.used: 8125.66MiB
allocating large array (80e6,)...
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8731.85MiB
del a
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8121.66MiB
read testdata.bin (~400MiB)
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8513.11MiB
del data
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8123.22MiB
It is clear how ru_maxrss remembers the maximum RSS, while the current usage has dropped by the end.
Note on psutil.virtual_memory().used:
used: memory used, calculated differently depending on the platform and designed for informational purposes only.
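If what you want is the current footprint of a single process rather than either of the numbers above, psutil's per-process RSS is a more direct measure; a minimal sketch (assuming psutil is installed):

import resource
import psutil

proc = psutil.Process()  # the current process
print("current RSS: {:.2f} MiB".format(proc.memory_info().rss / 1024 ** 2))
print("peak RSS (ru_maxrss): {:.2f} MiB".format(
    resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024))  # ru_maxrss is in KiB on Linux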

Memory leak in adding list values

I'm new to Python and have a big memory issue. My script runs 24/7, and each day it allocates about 1 GB more of my memory. I could narrow it down to this function:
Code:
#!/usr/bin/env python
# coding: utf8
import gc
from pympler import muppy
from pympler import summary
from pympler import tracker

v_list = [{
    'url_base' : 'http://www.immoscout24.de',
    'url_before_page' : '/Suche/S-T/P-',
    'url_after_page' : '/Wohnung-Kauf/Hamburg/Hamburg/-/-/50,00-/EURO--500000,00?pagerReporting=true',}]

# returns url
def get_url(v, page_num):
    return v['url_base'] + v['url_before_page'] + str(page_num) + v['url_after_page']

while True:
    gc.enable()
    for v_idx, v in enumerate(v_list):
        # mem test output
        all_objects = muppy.get_objects()
        sum1 = summary.summarize(all_objects)
        summary.print_(sum1)
        # magic happens here
        url = get_url(v, 1)
        # mem test output
        all_objects = muppy.get_objects()
        sum1 = summary.summarize(all_objects)
        summary.print_(sum1)
        # collects unlinked objects
        gc.collect()
Output:
======================== | =========== | ============
list | 26154 | 10.90 MB
str | 31202 | 1.90 MB
dict | 507 | 785.88 KB
Especially the list entry is getting bigger each cycle, by around 600 KB, and I have no idea why. In my opinion I do not store anything here, and the url variable should be overwritten each time, so basically there should not be any memory consumption at all.
What am I missing here? :-)
This "memory leak" is 100% caused by your testing for memory leaks. The all_objects list ends up maintaining a list of almost every object you ever created—even the ones you don't need anymore, which would have been cleaned up if they weren't in all_objects, but they are.
As a quick test:
If I run this code as-is, I get the list value growing by about 600KB/cycle, just as you say in your question, at least up to 20MB, where I killed it.
If I add del all_objects right after the sum1 = line, however, I get the list value bouncing back and forth between 100KB and 650KB.
If you think about why this is happening, it's pretty obvious in retrospect. At the point when you call muppy.get_objects() (except the first time), the previous value of all_objects is still alive. So, it's one of the objects that gets returned. That means that, even when you assign the return value to all_objects, you're not freeing the old value, you're just dropping its refcount from 2 to 1. Which keeps alive not just the old value itself, but every element within it—which, by definition, is everything that was alive last time through the loop.
If you can find a memory-exploring library that gives you weakrefs instead of normal references, that might help. Otherwise, make sure to do a del all_objects at some point before calling muppy.get_objects again. (Right after the only place you use it, the sum1 = line, seems like the most obvious place.)
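Applied to the loop from the question, the measurement block would then look roughly like this (only the del line is new):

# mem test output
all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
del all_objects  # drop the reference so the next muppy.get_objects() call doesn't keep everything alive
summary.print_(sum1)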
