Python threading memory error / bug / race condition

I have an app where the following happens:
the app starts a thread to generate "work"
this thread then starts a thread pool with 5 workers to generate "work" and put it onto a FIFO queue
the app also starts a thread pool of 20 workers to get work from the FIFO queue and execute it on a thread in the pool
When running just one piece of "work" through the system, it works great. When running multiple, it starts failing.
I logged the id() of the objects retrieved from the queue, and it seems that the same memory address is being re-used repeatedly rather than each object being stored at a new address. I suspect there is a data race where multiple threads access what I believe should be different objects but which actually live at the same memory address, thereby overwriting each other's attributes.
See the following snippet from the log:
[2023-02-16 14:33:02,695] INFO | App started with main PID: 26600
[2023-02-16 14:33:02,695] DEBUG | Max workers: 20
[2023-02-16 14:33:02,695] DEBUG | Max queue size: 60
[2023-02-16 14:33:02,695] INFO | Creating a work queue with size: 60
[2023-02-16 14:33:02,695] INFO | Starting the work generator thread
[2023-02-16 14:33:02,696] INFO | Creating a work consumer thread pool with max workers: 20
[2023-02-16 14:33:02,697] INFO | Found automation 'automation_d'
[2023-02-16 14:33:02,697] DEBUG | Submitting automation file to the work generator thread pool for execution
>>>>>>>>>>>>>>>>>>>id()==140299908643808
[2023-02-16 14:33:03,181] DEBUG | Putting 'T2149393' on to the queue for automation 'automation_d'
[2023-02-16 14:33:03,181] DEBUG | Putting 'T2149388' on to the queue for automation 'automation_d'
[2023-02-16 14:33:03,181] DEBUG | Putting 'T2149389' on to the queue for automation 'automation_d'
[2023-02-16 14:33:03,198] DEBUG | Retrieved a work item from the queue
[2023-02-16 14:33:03,198] DEBUG | Submitting work to the work consumer thread pool for execution
[2023-02-16 14:33:03,199] DEBUG | ==========================================================================================
>>>>>>>>>>>>>>>>>>>id()==140299908643808
[2023-02-16 14:33:03,199] DEBUG | <automation.TAutomation object at 0x7f9a1e377be0>
[2023-02-16 14:33:03,199] DEBUG | Task(num="T2149393", req="R2396580", who="", grp="AG1", desc="REQ - T"
[2023-02-16 14:33:03,199] DEBUG | ==========================================================================================
[2023-02-16 14:33:03,199] INFO | Running automation_d against T2149393 with internal automation id 18aa2e51-c94d-4d83-a033-44e30cca9dd3 in thread 140299891414784
[2023-02-16 14:33:03,199] INFO | Assigning T2149393 to API user
[2023-02-16 14:33:03,199] DEBUG | Retrieved a work item from the queue
[2023-02-16 14:33:03,201] DEBUG | Submitting work to the work consumer thread pool for execution
[2023-02-16 14:33:03,202] DEBUG | ==========================================================================================
>>>>>>>>>>>>>>>>>>>id()==140299908643808
[2023-02-16 14:33:03,202] DEBUG | <automation.TAutomation object at 0x7f9a1e377be0>
[2023-02-16 14:33:03,202] DEBUG | Task(num="T2149388", req="R2396575", who="", grp="AG1", desc="REQ - T"
[2023-02-16 14:33:03,202] DEBUG | ==========================================================================================
[2023-02-16 14:33:03,202] INFO | Running automation_d against T2149388 with internal automation id 18aa2e51-c94d-4d83-a033-44e30cca9dd3 in thread 140299883022080
[2023-02-16 14:33:03,202] DEBUG | Retrieved a work item from the queue
[2023-02-16 14:33:03,202] INFO | Assigning T2149388 to API user
[2023-02-16 14:33:03,203] DEBUG | Submitting work to the work consumer thread pool for execution
[2023-02-16 14:33:03,204] DEBUG | ==========================================================================================
>>>>>>>>>>>>>>>>>>>id()==140299908643808
[2023-02-16 14:33:03,204] DEBUG | <automation.TAutomation object at 0x7f9a1e377be0>
[2023-02-16 14:33:03,204] DEBUG | Task(num="T2149389", req="R2396576", who="", grp="AG1", desc="REQ - T"
[2023-02-16 14:33:03,205] DEBUG | ==========================================================================================
[2023-02-16 14:33:03,205] INFO | Running automation_d against T2149389 with internal automation id 18aa2e51-c94d-4d83-a033-44e30cca9dd3 in thread 140299670124288
As can be seen above, the id() is the same for all executions. The memory address of the object is also the same each time, as is the internal automation id, which is an attribute on the object. This means that when I put the item on the queue and it gets consumed and passed to another thread for execution, every thread holds a reference to the same object, which causes the execution to fail in weird ways.
The code sample below is not intended to be a reproducible way to generate the error or the above log; it's intended as a visualisation and an example of how the app is currently structured. There is far too much code and custom logic to share here.
Rough, high-level code here:
import json
import os
import sys
import time
from concurrent.futures import (CancelledError, Future, ThreadPoolExecutor,
TimeoutError)
from dataclasses import dataclass
from logging import Logger
from pathlib import Path, PurePath
from queue import Empty, Full, Queue
from threading import Event, Thread
from types import FrameType
from typing import Any, Dict, List, Optional
import requests
import urllib3
@dataclass()
class WorkItem:
    automation_object: Automation
    target: AutomationTarget
    config: AutomationConfig


def generate_work(work_queue, app_config, automation_file, automation_name):
    automation_config_raw = load_automation_file(automation_file)
    validate_automation_file(automation_config=automation_config_raw)
    automation_config = build_automation_config(
        automation_name=automation_name,
        automation_config_raw=automation_config_raw,
        log_dir=app_config.log_dir
    )
    automation_object = build_automation(automation_config=automation_config)
    records = automation_object.get_records()
    for record in records:
        work_item = WorkItem(
            automation_object=automation_object,
            target=record,
            config=automation_config
        )
        work_queue.put(item=work_item, block=False)


def work_generator(stop_app_event, app_config, app_logger, work_queue):
    work_generator_thread_pool = ThreadPoolExecutor(max_workers=5)
    while True:
        automation_files = get_automation_files(app_config.automations_dir)
        for automation_file in automation_files:
            automation_name = PurePath(automation_file).stem
            work_generator_thread_pool.submit(generate_work, work_queue, app_config, automation_file, automation_name)


def main():
    work_generator_thread = Thread(target=work_generator, args=(stop_app_event, app_config, app_logger, work_queue))
    work_generator_thread.start()
    work_consumer_thread_pool = ThreadPoolExecutor(max_workers=max_workers)
    while True:
        work_item = work_queue.get()
        work_consumer_thread_pool.submit(work_item.automation_object.execute, work_item.target)


if __name__ == "__main__":
    main()
So, at a high level we have 1 thread generating work using a thread pool, and another thread consuming + executing work from the queue.
Why is Python re-using the same piece of memory repeatedly and how can I force it to use a new piece of memory when creating these objects?

Why is Python re-using the same piece of memory repeatedly and how can I force it to use a new piece of memory when creating these objects?
CPython uses an arena allocator, which reuses memory for objects once they are no longer reachable. The fact that both objects had the same id means one of two things: either the first object was deleted because it was no longer reachable anywhere and the second object simply reused that memory location, or you used the same object in both places and never created a new object or a copy of it.
If these objects really held different data, then the memory location is just being reused after the first object became unreachable. Python will never reuse a memory location that is still reachable, because the garbage collector and the allocator are thread-safe (protected by the GIL).
As for why your code doesn't work, it's likely because whatever "tasks" you are running cannot run concurrently, as they share some hidden state that's not present in the code above.
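Note that in the question's own generate_work, a single automation_object is built per automation file and that one instance is placed into every WorkItem for that file, so all consumer threads end up sharing it. The sketch below is purely illustrative (DummyAutomation and the copy.deepcopy approach are assumptions, not the asker's code); it shows the difference between enqueuing one shared object for every record and giving each work item its own copy:

import copy
from queue import Queue

class DummyAutomation:
    # stand-in for the question's TAutomation: one piece of per-task mutable state
    def __init__(self):
        self.current_task = None

work_queue = Queue()
shared = DummyAutomation()

# Putting the SAME object on the queue for every record: one id() for everything.
for task in ("T1", "T2", "T3"):
    work_queue.put((shared, task))
same = [work_queue.get()[0] for _ in range(3)]
print(len({id(obj) for obj in same}))      # 1 -> every consumer mutates shared state

# Giving each work item its own copy: distinct objects, no shared mutable state.
for task in ("T1", "T2", "T3"):
    work_queue.put((copy.deepcopy(shared), task))
separate = [work_queue.get()[0] for _ in range(3)]
print(len({id(obj) for obj in separate}))  # 3 -> each consumer gets its own object

In the asker's case that would mean either copying (or rebuilding) the automation object per WorkItem, or making execute() depend only on its arguments rather than on attributes mutated per task.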

Related

How to clear objects from the object store in ray?

I am trying out the promising multiprocessing package ray. I have a problem I can't seem to solve. My program runs fine the first time, but on a second run this exception is raised on the ray.put() line:
ObjectStoreFullError: Failed to put object ffffffffffffffffffffffffffffffffffffffff010000000c000000 in object store because it is full. Object size is 2151680255 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
What do I want to do:
In the actual code I'm planning to write, I need to process many big_data_objects sequentially. I want to hold one big_data_object in memory at a time and run several heavy (independent) computations on it, executing these computations in parallel. When they are done, I have to replace the big_data_object in the object store with a new one and start the computations (in parallel) again.
Using my test script, I simulate this by starting the script again without ray.shutdown(). If I shut ray down using ray.shutdown(), the object store is cleared, but reinitializing then takes a long time and I cannot process multiple big_data_objects sequentially as I want to.
What sources of information have I studied:
I studied the Ray Design Patterns document, in particular the section 'Antipattern: Closure capture of large / unserializable object' and what the proper pattern(s) should look like. I also studied the getting started guide, which led to the following test script.
A minimum example to reproduce the problem:
I created a test script to test this. It is this:
#%% Imports
import ray
import time
import psutil
import numpy as np

#%% Testing ray
# Start Ray
num_cpus = psutil.cpu_count(logical=False)
if not ray.is_initialized():
    ray.init(num_cpus=num_cpus, include_dashboard=False)

# Define function to do work in parallel
@ray.remote
def my_function(x):  # Later I will have multiple (different) my_functions to extract different features from my big_data_object
    time.sleep(1)
    data_item = ray.get(big_data_object_ref)
    return data_item[0,0]+x

# Define large data
big_data_object = np.random.rand(16400,16400)  # Define an object of approx 2 GB. Works on my machine (16 GB RAM)
# big_data_object = np.random.rand(11600,11600) # Define an object of approx 1 GB.
# big_data_object = np.random.rand(8100,8100)   # Define an object of approx 500 MB.
# big_data_object = np.random.rand(5000,5000)   # Define an object of approx 190 MB.
big_data_object_ref = ray.put(big_data_object)

# Start 4 tasks in parallel.
result_refs = []
# for item in data:
for item in range(4):
    result_refs.append(my_function.remote(item))

# Wait for the tasks to complete and retrieve the results.
# With at least 4 cores, this will take 1 second.
results = ray.get(result_refs)
print("Results: {}".format(results))

#%% Clean-up object store data - Still there is a (huge) memory leak in the object store.
for index in range(4):
    del result_refs[0]
del big_data_object_ref
Where do I think it's going wrong:
I think I delete all the references to the object store at the end of the script, so the objects should be cleared from the object store (as described here). Apparently something is wrong, because the big_data_object remains in the object store. The results are deleted from the object store just fine, however.
Some debug information:
I inspected the object store using the ray memory command; this is what I get:
(c:\python\cenv38rl) PS C:\WINDOWS\system32> ray memory
---------------------------------------------------------------------------------------------------------------------
Object ID Reference Type Object Size Reference Creation Site
=====================================================================================================================
; worker pid=20952
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=29368
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=17388
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=24208
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=27684
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; worker pid=6860
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\serialization.py:object_ref_deserializer:45 | c:\python\cenv38rl\lib\site-packages\ray\function_manager.py:fetch_and_register_remote_function:180 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_process_key:140 | c:\python\cenv38rl\lib\site-packages\ray\import_thread.py:_run:87
; driver pid=28684
ffffffffffffffffffffffffffffffffffffffff010000000b000000 LOCAL_REFERENCE 2151680261 c:\python\cenv38rl\lib\site-packages\ray\worker.py:put_object:277 | c:\python\cenv38rl\lib\site-packages\ray\worker.py:put:1489 | c:\python\cenv38rl\lib\site-packages\ray\_private\client_mode_hook.py:wrapper:47 | C:\Users\Stefan\Documents\Python examples\Multiprocess_Ray3_SO.py:<module>:42
---------------------------------------------------------------------------------------------------------------------
--- Aggregate object store stats across all nodes ---
Plasma memory usage 2052 MiB, 1 objects, 77.41% full
Some of the things I have tried:
If I replace my_function with:

@ray.remote
def my_function(x):  # Later I will have multiple different my_functions to extract separate features from my big_data_objects
    time.sleep(1)
    # data_item = ray.get(big_data_object_ref)
    # return data_item[0,0]+x
    return 5
then the script successfully clears the object store, but my_function cannot use the big_data_object, which I need it to.
My question is: How to fix my code so that the big_data_object is removed from the object store at the end of my script without shutting down ray?
Note: I installed ray using pip install ray, which gave me version ray==1.2.0, which I am using now. I use ray on Windows and develop in Spyder v4.2.5 in a conda (actually miniconda) environment, in case it is relevant.
EDIT:
I have also tested on an Ubuntu machine with 8 GB RAM, using a big_data_object of 1 GB.
I can confirm the issue also occurs on this machine.
The ray memory output:
(SO_ray) stefan#stefan-HP-ZBook-15:~/Documents/Ray_test_scripts$ ray memory
---------------------------------------------------------------------------------------------------------------------
Object ID Reference Type Object Size Reference Creation Site
=====================================================================================================================
; worker pid=18593
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; worker pid=18591
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; worker pid=18590
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
; driver pid=17712
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 (put object) | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:wrapper:47 | /home/stefan/Documents/Ray_test_scripts/Multiprocess_Ray3_SO.py:<module>:43 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/spyder_kernels/customize/spydercustomize.py:exec_code:453
; worker pid=18592
ffffffffffffffffffffffffffffffffffffffff0100000001000000 LOCAL_REFERENCE 1076480259 /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/function_manager.py:fetch_and_register_remote_function:180 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_process_key:140 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/site-packages/ray/import_thread.py:_run:87 | /home/stefan/miniconda3/envs/SO_ray/lib/python3.8/threading.py:run:870
---------------------------------------------------------------------------------------------------------------------
--- Aggregate object store stats across all nodes ---
Plasma memory usage 1026 MiB, 1 objects, 99.69% full
I have to run the program in Spyder so that after execution I can inspect the object store's memory using ray memory. If I run the program in PyCharm, for example, ray is automatically terminated when the script completes, so I cannot check whether my script clears the object store as intended.
The problem is that your remote function captures big_data_object_ref, and that reference is never removed. Note that when you do this type of thing:
# Define function to do work in parallel
@ray.remote
def my_function(x):  # Later I will have multiple (different) my_functions to extract different features from my big_data_object
    time.sleep(1)
    data_item = ray.get(big_data_object_ref)
    return data_item[0,0]+x

# Define large data
big_data_object = np.random.rand(16400,16400)
big_data_object_ref = ray.put(big_data_object)
big_data_object_ref is serialized to the remote function definition. Thus there's a permanent pointer until you remove this serialized function definition (which is in the ray internals).
Instead use this type of pattern:
#%% Imports
import ray
import time
import psutil
import numpy as np

#%% Testing ray
# Start Ray
num_cpus = psutil.cpu_count(logical=False)
if not ray.is_initialized():
    ray.init(num_cpus=num_cpus, include_dashboard=False)

# Define function to do work in parallel
@ray.remote
def my_function(big_data_object, x):
    time.sleep(1)
    return big_data_object[0,0]+x

# Define large data
#big_data_object = np.random.rand(16400,16400) # Define an object of approx 2 GB. Works on my machine (16 GB RAM)
# big_data_object = np.random.rand(11600,11600) # Define an object of approx 1 GB.
big_data_object = np.random.rand(8100,8100)  # Define an object of approx 500 MB.
# big_data_object = np.random.rand(5000,5000) # Define an object of approx 190 MB.
big_data_object_ref = ray.put(big_data_object)
print("ref in a driver ", big_data_object_ref)

# Start 4 tasks in parallel.
result_refs = []
# for item in data:
for item in range(4):
    result_refs.append(my_function.remote(big_data_object_ref, item))

# Wait for the tasks to complete and retrieve the results.
# With at least 4 cores, this will take 1 second.
results = ray.get(result_refs)
print("Results: {}".format(results))
print(result_refs)

#%% Clean-up object store data - Still there is a (huge) memory leak in the object store.
#for index in range(4):
#    del result_refs[0]
del result_refs
del big_data_object_ref

import time
time.sleep(1000)
The difference is that we now pass big_data_object_ref as an argument to the remote function instead of capturing it in the function's closure.
Note: When an object reference is passed to a remote function as a top-level argument, it is automatically dereferenced, so there is no need to call ray.get() in the remote function. If you'd like to explicitly call ray.get() inside a remote function, pass the object reference inside a list or dictionary as the argument. In that case, you get something like:
# Remote function
@ray.remote
def my_function(big_data_object_ref_list, x):
    time.sleep(1)
    big_data_object = ray.get(big_data_object_ref_list[0])
    return big_data_object[0,0]+x

# Calling the remote function
my_function.remote([big_data_object_ref], item)
Note 2: You use Spyder, which uses an IPython console. There are some known issues right now between ray and the IPython console. Just make sure you delete the references inside your script rather than via commands entered directly into the IPython console (otherwise the references will be removed but the items will not be removed from the object store). To inspect the object store with the ray memory command while your script is running, you can put some code like this at the end of your script:
#%% Testing ray
# ... my ray testing code
#%% Clean-up object store data
print("Wait 10 sec BEFORE deletion")
time.sleep(10) # Now quickly use the 'ray memory' command to inspect the contents of the object store.
del result_refs
del big_data_object_ref
print("Wait 10 sec AFTER deletion")
time.sleep(10) # Now again use the 'ray memory' command to inspect the contents of the object store.

Is there a way to know that a pathos/multiprocessing worker is finished?

I'd like to know when workers finish so that I can free up resources as the last action of each worker. Alternatively, I could free up these resources on the main process, but I need to free them up after each worker one by one (in contrast to freeing them up once after all of the workers finish).
I'm running my workers as below, tracking progress and PIDs used:
from pathos.multiprocessing import ProcessingPool

pool = ProcessingPool(num_workers)
pool.restart(force=True)

# Loading PIDs of workers with my get_pid() function:
pids = pool.map(get_pid, xrange(num_workers))

try:
    results = pool.amap(
        exec_func,
        exec_args,
    )
    counter = 0
    while not results.ready():
        sleep(2)
        if counter % 60 == 0:
            log.info('Waiting for children running in pool.amap() with PIDs: {}'.format(pids))
        counter += 1
    results = results.get()

    # Attempting to close pool...
    pool.close()
    # The purpose of join() is to ensure that a child process has completed
    # before the main process does anything.
    # Attempting to join pool...
    pool.join()
except:
    # Try to terminate the pool in case some worker PIDs still run:
    cls.hard_kill_pool(pids, pool)
    raise
Because of load balancing, it is hard to know which job will be the last on a worker. Is there any way to know that some workers are already inactive?
I'm using pathos version 0.2.0.
I'm the pathos author. If you need to free up resources after each worker in a Pool is done running, I'd suggest you not use a Pool. A Pool is meant to allocate resources and keep using them until all jobs are done. What I'd suggest instead is a for loop that spawns a Process and then ensures that the spawned Process is joined when you are done with it. If you need to do this within pathos, the Process class lives at the horribly named pathos.helpers.mp.Process (or, much more directly, at multiprocess.Process from the multiprocess package).
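A minimal sketch of that suggestion follows; exec_func, exec_args and free_worker_resources are placeholders standing in for the asker's actual work and per-worker cleanup, and pathos.helpers.mp.Process is the multiprocess.Process class the answer refers to:

from pathos.helpers import mp   # mp.Process is multiprocess.Process under the hood

def exec_func(arg):
    print('working on {}'.format(arg))

def free_worker_resources(pid):
    # hypothetical per-worker cleanup hook
    print('cleaning up after worker {}'.format(pid))

if __name__ == '__main__':
    exec_args = range(4)
    # One Process per job instead of a Pool.
    procs = [mp.Process(target=exec_func, args=(arg,)) for arg in exec_args]
    for p in procs:
        p.start()
    # join() returns once that particular worker has exited, so its resources
    # can be freed one by one instead of waiting for the whole pool to finish.
    for p in procs:
        p.join()
        free_worker_resources(p.pid)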

Celery task reprocessing itself in an infinite loop

I'm running into an odd situation where celery would reprocess a task that's been completed. The overall design looks like this:
Celery Beat: pulls files periodically; if a file was pulled, it creates a new entry in the DB and delegates processing of that file to another celery task in a single-worker queue (that way only one file gets processed at a time)
Celery Task: processes the file; once it's done, it's done. No retries, no loops.
@app.task(name='periodic_pull_file')
def periodic_pull_file():
    for f in get_files_from_some_dir(...):
        ingested_file = IngestedFile(filename=filename)
        ingested_file.document.save(filename, File(f))
        ingested_file.save()
        process_import(ingested_file.id)
        # deletes the file from the dir source
        os.remove(....somepath)

def process_import(ingested_file_id):
    ingested_file = IngestedFile.objects.get(id=ingested_file_id)
    if 'foo' in ingested_file.filename.lower():
        f = process_foo
    else:
        f = process_real_stuff
    f.apply_async(args=[ingested_file_id], queue='import')

@app.task(name='process_real_stuff')
def process_real_stuff(file_id):
    # dostuff
process_foo and process_real_stuff are just functions that loop over the file once, and once they're done they're done. I can actually keep track of the percentage of progress, and the interesting thing I noticed was that the same file kept getting processed over and over again (note that these are large files and processing is slow; it takes hours to process one). I started wondering if it was just creating duplicate tasks in the queue, so I checked my redis queue when I had 13 pending files to import:
-bash-4.1$ redis-cli -p 6380 llen import
(integer) 13
And aha, 13. I checked the content of each queued task to see if it was just repeating ingested_file_ids, using:
redis-cli -p 6380 lrange import 0 -1
And they're all unique tasks with unique ingested_file_ids. Am I overlooking something? Is there any reason why it would finish a task and then loop over the same task over and over again? This only started happening recently, with no code changes; before, things used to be pretty snappy and seamless. I know it's also not a "failed" process that somehow magically retries itself, because the task isn't moving down in the queue, i.e. the worker receives the same task in the same order again and again, so it never gets to touch the other files it should have processed.
Note, this is my worker:
python manage.py celery worker -A myapp -l info -c 1 -Q import
Use this to purge the pending tasks from that queue:
celery -Q your_queue_name purge
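If it's easier to do this from Python than from the command line, Celery also exposes a purge call on the app's control interface; a small sketch, where myapp.celery_app is an assumed import path for the same app object used in the question's decorators:

from myapp.celery_app import app   # hypothetical import path for the Celery app instance

# Discards the waiting messages from the task queues this app is configured with
# and returns the number of messages that were purged.
purged = app.control.purge()
print('Purged {} pending tasks'.format(purged))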

Python multiprocessing: more processes than requested

Why do I see so many python processes running (in htop on RHEL 6) for the same script when I only use 1 core?
For each task, I init a worker class that manages the processing. It does init other classes, but not any subprocesses:
tasks = multiprocessing.JoinableQueue()
results = multiprocessing.Queue()
num_consumers = 1
consumers = [Consumer(tasks, results) for i in xrange(num_consumers)]
for i, consumer in enumerate(consumers):
    logger.debug('Starting consumer %s (%i/%i)' % (consumer.name, i + 1, num_consumers))
    consumer.start()
Note, atop shows the expected number of processes (in this case 2: 1 for the parent and 1 for the child). The %MEM often adds up to well over 100% so I gather I'm misunderstanding how multiprocessing or htop works.
I believe you're seeing helper threads spun up by the multiprocessing module within your app's main PID. These are in addition to the threads/processes you've spun up explicitly. Note that htop lists userland threads as separate entries by default (you can toggle this with the H key), whereas atop shows one entry per process, which is why the two tools disagree.
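One way to confirm this from inside the program rather than through htop is to list the live threads in the parent process. The snippet below is a small standalone sketch, not the asker's code; multiprocessing queues lazily start a background feeder thread once something is put on them, and each such thread shows up as its own line in htop's default thread-aware view:

import multiprocessing
import threading

if __name__ == '__main__':
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()

    # The first put() on each queue starts its QueueFeederThread helper.
    tasks.put('some work')
    results.put('some result')

    # Expect the MainThread plus one feeder thread per queue.
    for t in threading.enumerate():
        print(t.name)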

python Redis Connections

I am using a Redis server with Python.
My application is multithreaded (I use 20-32 threads per process) and I also run the app on different machines.
I have noticed that sometimes Redis CPU usage is at 100% and the Redis server becomes unresponsive/slow.
I would like each application to use one connection pool with 4 connections in total.
So, for example, if I run my app on at most 20 machines, there should be 20*4 = 80 connections to the Redis server.
import redis
from threading import Thread

POOL = redis.ConnectionPool(max_connections=4, host='192.168.1.1', db=1, port=6379)
R_SERVER = redis.Redis(connection_pool=POOL)

class Worker(Thread):
    def __init__(self):
        Thread.__init__(self)  # initialise the Thread base class before starting
        self.start()

    def run(self):
        while True:
            key = R_SERVER.randomkey()
            if not key: break
            value = R_SERVER.get(key)

    def _do_something(self, value):
        # do something with value
        pass

if __name__ == '__main__':
    num_threads = 20
    workers = [Worker() for _ in range(num_threads)]
    for w in workers:
        w.join()
The above code should run the 20 threads that get a connection from the connection pool of max size 4 when a command is executed.
When is the connection released?
According to this code (https://github.com/andymccurdy/redis-py/blob/master/redis/client.py):
#### COMMAND EXECUTION AND PROTOCOL PARSING ####
def execute_command(self, *args, **options):
    "Execute a command and return a parsed response"
    pool = self.connection_pool
    command_name = args[0]
    connection = pool.get_connection(command_name, **options)
    try:
        connection.send_command(*args)
        return self.parse_response(connection, command_name, **options)
    except ConnectionError:
        connection.disconnect()
        connection.send_command(*args)
        return self.parse_response(connection, command_name, **options)
    finally:
        pool.release(connection)
After the execution of each command, the connection is released and goes back to the pool.
Can someone verify that I have understood the idea correctly and that the above example code will work as described?
Because when I look at the Redis connections, there are always more than 4.
EDIT: I just noticed in the code that the function has a return statement before the finally. What is the purpose of finally then?
As Matthew Scragg mentioned, the finally clause is executed when the try block finishes, regardless of how it exits. In this particular case it serves to release the connection back to the pool when finished with it instead of leaving it hanging open.
As to the unresponsiveness, look at what your server is doing. What is the memory limit of your Redis instance? How often are you saving to disk? Are you running on a Xen-based VM such as an AWS instance? Are you running replication, and if so, how many slaves are there, and are they in a good state or are they frequently calling for a full resync of data? Are any of your commands "save"?
You can answer some of these questions by using the command line interface. For example, redis-cli info persistence will tell you about the process of saving to disk, and redis-cli info memory will tell you about your memory consumption.
When obtaining the persistence information, you want to look specifically at rdb_last_bgsave_status and rdb_last_bgsave_time_sec. These will tell you whether the last save was successful and how long it took. The longer it takes, the higher the chance you are running into resource issues and the higher the chance you will encounter slowdowns, which can appear as unresponsiveness.
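The same figures are also available from Python through redis-py's info() call, which accepts a section name; a small sketch (host, port and db are placeholders matching the question's pool):

import redis

r = redis.Redis(host='192.168.1.1', port=6379, db=1)

persistence = r.info('persistence')   # same data as `redis-cli info persistence`
memory = r.info('memory')             # same data as `redis-cli info memory`

print(persistence.get('rdb_last_bgsave_status'))    # 'ok' if the last background save succeeded
print(persistence.get('rdb_last_bgsave_time_sec'))  # how long the last background save took
print(memory.get('used_memory_human'))              # current memory consumption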
The finally block will always run even though there is a return statement before it. You may have a look at redis-py/connection.py: pool.release(connection) only puts the connection back into the pool of available connections, so the connection is still alive.
About the Redis server CPU usage: your app keeps sending requests with no breaks or sleeps, so it just uses more and more CPU (but not memory), and CPU usage has no relation to the number of open files.
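To illustrate the point both answers make about finally (and the question's EDIT), here is a tiny self-contained example, unrelated to redis-py itself:

def get_value():
    try:
        return 'from try'              # the return value is decided here...
    finally:
        print('finally still runs')    # ...but this block still executes before the function exits

print(get_value())
# prints:
# finally still runs
# from try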
