Threading problem with threadpool and objects - python

I have a problem when using objects and threads.
Below follows a simplified example of the code.
I am using a threadpool to loop over a list of jobs.
class File(object):
    def __init__(self, name, streams=[]):
        self.name = name
        self.streams = streams

    def appendStream(self, stream):
        self.streams.append(stream)
class Job(object):
    def __init__(self, file):
        self.file = file

def main():
    ...
    jobs = []
    for f in input_files:
        f_obj = File(f)
        jobs.append(Job(f_obj))

    with ThreadPool(processes=2, initializer=init, initargs=(log, p_lock)) as pool:
        pool.map(func=process_job, iterable=jobs, chunksize=1)
    ...
The function (process_job) used by the thread pool resides in the same .py file.
def process_job(job):
    ...
    get_info(job.file)
    ...
This function in turn uses a function (get_info) from a self-defined package.
That function creates an argument list and then calls subprocess.check_output().
The subprocess returns a JSON structure, which is looped over to update the contents of the input object.
def get_info(file):
    ...
    args = ["ffprobe", ..., "-i", file.name]
    try:
        output = subprocess.check_output(args)
    except Exception as e:
        print(e)

    data = output.decode('utf8')
    json_data = json.loads(data)
    for item in json_data:
        file.appendStream(item["stream"])
    ...
The problem is that when running this code, the threads spawned by the pool update each other's file objects.
For example, when running this with 5 input files, the 5th job.file.streams will contain 5 streams, i.e. its own plus the 4 streams that belong to the other files.
Why is this happening and how can I solve it?
Best regards!

As @torek spotted, it seems to be a case of the "Mutable Default Argument":
“Least Astonishment” and the Mutable Default Argument
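For reference, a minimal sketch of the usual fix (not from the original post), assuming the rest of the code stays the same: default streams to None and create a fresh list inside __init__, so every File instance gets its own list instead of sharing the single list bound to the default argument.

class File(object):
    def __init__(self, name, streams=None):
        self.name = name
        # A new list per instance, instead of the one list that a
        # mutable default argument would share across all instances.
        self.streams = streams if streams is not None else []

    def appendStream(self, stream):
        self.streams.append(stream)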

Related

Python Multiprocessing with shared data source and multiple class instances

My program needs to spawn multiple instances of a class, each processing data that is coming from a streaming data source.
For example:
parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42

class DoStuff:
    def __init__(self, parameter):
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data  # Some CPU intensive task
        print(output)

streaming_api = FakeStreamingApi()

# Here's how this would work with no multiprocessing
instance_1 = DoStuff(parameters[0])
instance_1.run()
Once the instances are running, they don't need to interact with each other; they just have to get the data as it comes in (and print error messages, etc.).
I am totally at a loss how to make this work with multiprocessing, since I first have to create a new instance of the class DoStuff, and then have it run.
This is definitely not the way to do it:
# Let's try multiprocessing
import multiprocessing

for parameter in parameters:
    processes = [multiprocessing.Process(target=DoStuff, args=(parameter))]
    # Hmm, this doesn't work...
We could try defining a function to spawn classes, but that seems ugly:
import multiprocessing

def spawn_classes(parameter):
    instance = DoStuff(parameter)
    instance.run()

for parameter in parameters:
    processes = [multiprocessing.Process(target=spawn_classes, args=(parameter,))]
    # Can't tell if it works -- no output on screen?
Plus, I don't want to have 3 different copies of the API interface class running, I want that data to be shared between all the processes... and as far as I can tell, multiprocessing creates copies of everything for each new process.
Ideas?
Edit:
I think I may have got it... is there anything wrong with this?
import multiprocessing

parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42

class Worker(multiprocessing.Process):
    def __init__(self, parameter):
        super(Worker, self).__init__()
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data  # Some CPU intensive task
        print(output)

streaming_api = FakeStreamingApi()

if __name__ == '__main__':
    jobs = []
    for parameter in parameters:
        p = Worker(parameter)
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
I came to the conclusion that it would be necessary to use multiprocessing.Queue objects to solve this. The data source (the streaming API) needs to pass copies of the data to all the different processes so they can consume it.
There's another way to solve this using multiprocessing.Manager to create a shared dict, but I didn't explore it further, as it looks fairly inefficient and cannot propagate changes to inner values (e.g. if you have a dict of lists, changes to the inner lists will not propagate).
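For illustration, a minimal sketch of the queue-based approach described above (the worker function and the hard-coded items are hypothetical, not from the original question): the parent pushes a copy of each incoming item onto one multiprocessing.Queue per worker, and each worker consumes from its own queue.

import multiprocessing

def worker(parameter, queue):
    # Each worker consumes items from its own queue.
    while True:
        data = queue.get()
        if data is None:              # sentinel: no more data
            break
        print(parameter ** 2 + data)

if __name__ == '__main__':
    parameters = [1, 2, 3]
    queues = [multiprocessing.Queue() for _ in parameters]
    procs = [multiprocessing.Process(target=worker, args=(p, q))
             for p, q in zip(parameters, queues)]
    for proc in procs:
        proc.start()

    for item in [42, 43, 44]:         # stand-in for the streaming API
        for q in queues:
            q.put(item)               # every worker gets its own copy of the data

    for q in queues:
        q.put(None)                   # tell each worker to stop
    for proc in procs:
        proc.join()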

Python Multiprocessing.Pool in a class

I am trying to change my code to a more object-oriented format. In doing so I am lost on how to 'visualize' what is happening with multiprocessing and how to solve it. On the one hand, the class should track changes to local variables across functions; on the other, I believe multiprocessing creates a copy of the code to which the original instance would not have access. I need to figure out a way to manipulate classes, within a class, using multiprocessing, and have the parent class retain all manipulated values in the nested classes.
A simple version (OLD CODE):
def runMultProc():
    ...
    dictReports = {}
    listReports = ['reportName1.txt', 'reportName2.txt']
    tasks = []
    pool = multiprocessing.Pool()
    for report in listReports:
        if report not in dictReports:
            dictReports[today][report] = {}
            tasks.append(pool.apply_async(worker, args=([report, dictReports[today][report]])))
        else:
            continue
    for task in tasks:
        report, currentReportDict = task.get()
        dictReports[report] = currentReportDict

def worker(report, currentReportDict):
    <Manipulate_reports_dict>
    return report, currentReportDict
NEW CODE:
class Transfer():
    def __init__(self):
        self.masterReportDictionary[<todays_date>] = [reportObj1, reportObj2]

    def processReports(self):
        self.pool = multiprocessing.Pool()
        self.pool.map(processWorker, self.masterReportDictionary[<todays_date>])
        self.pool.close()
        self.pool.join()

    def processWorker(self, report):
        # **process and manipulate report, currently no return**
        report.name = 'foo'
        report.path = '/path/to/report'

class Report():
    def __init__(self):
        self.name = ''
        self.path = ''
        self.errors = {}
        self.runTime = ''
        self.timeProcessed = ''
        self.hashes = {}
        self.attempts = 0
I don't think this code does what I need it to do, which is to process the list of reports in parallel AND, as processWorker manipulates each report object, store those results. As I am fairly new to this I was hoping someone could help.
The big difference between the two is that the first one builds a dictionary and returns it. The second model shouldn't really be returning anything; I just need the classes to finish being processed, and they should then have the relevant information within them.
Thanks!
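A minimal sketch of one common pattern for this situation (hypothetical, not from the original post): because pool workers operate on pickled copies of the objects, mutations made inside the worker are lost unless the worker returns the modified object and the parent reassigns it.

import multiprocessing

class Report:
    def __init__(self, name):
        self.name = name
        self.path = ''

def process_worker(report):
    # Runs in a child process on a copy of the report;
    # return the copy so the parent can keep the changes.
    report.path = '/path/to/' + report.name
    return report

if __name__ == '__main__':
    reports = [Report('reportName1.txt'), Report('reportName2.txt')]
    with multiprocessing.Pool() as pool:
        # Replace the originals with the processed copies.
        reports = pool.map(process_worker, reports)
    print([r.path for r in reports])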

The usual answer not helping with TypeError [duplicate]

This question already has an answer here:
'int' object is not callable in python
(1 answer)
Closed 5 years ago.
I'm getting a TypeError: 'list' object is not callable, but I can't see where the usual answer of "use list[] instead of list()" applies. I don't seem to be using either notation, just calling functions on the list. I'm quite stuck and could use some help here.
import scheduler

def main():
    sched = scheduler.Scheduler()
    sched.line_list("/home/scabandari/Desktop/threads.txt")  # Error is caused by this line
    # sched.create_processList()

if __name__ == "__main__":
    main()
scheduler.py:
import process

class Scheduler:
    def __init__(self):
        self.line_list = []
        self.process_list = []
        pass

    # populates line_list[] from "/location/.txt" file, each line reps a process object
    def line_list(self, file):
        f = open(file)
        getlines = f.readlines()
        for line in getlines:
            self.line_list.append(line)
        self.line_list.pop(0)

    # populates process_list[] from line_list[]
    def process_list(self):
        for line in self.line_list:
            temp_arr = line.split()
            self.process_list.append(process.Process(temp_arr[0], temp_arr[1],
                                                     temp_arr[2], temp_arr[3]))
        for proc in self.process_list:
            proc.print_process()
Note that when you define your class Scheduler as follows
class Scheduler:
    def __init__(self):
        self.line_list = []
        self.process_list = []

    def line_list(self, file):
        f = open(file)
        getlines = f.readlines()
        for line in getlines:
            self.line_list.append(line)
        self.line_list.pop(0)
    ...
and then instantiate it, i.e.
inst = Scheduler()
the definition of your method line_list is overridden by the execution of __init__ at instantiation, which rebinds the name line_list to a list.
Which means that, as melpomene mentions in comments,
you need to decide whether you want line_list to be a method or a list. It can't be both.
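A quick check makes the shadowing visible (a hypothetical snippet, not part of the original answer): the instance attribute set in __init__ wins over the method defined on the class.

inst = Scheduler()
print(type(inst.line_list))       # <class 'list'> -- the attribute assigned in __init__
print(type(Scheduler.line_list))  # <class 'function'> -- the method defined on the class
inst.line_list("threads.txt")     # TypeError: 'list' object is not callable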
Thus, what you may want to do is rename your method, ideally giving it a name representative of what it actually does, e.g.
...
    def populate_line_list(self, file):
        '''
        Method which populates line_list from a "/location/.txt" file,
        each line reps a process object
        '''
        f = open(file)
        getlines = f.readlines()
        for line in getlines:
            self.line_list.append(line)
        self.line_list.pop(0)
        f.close()  # do not forget to close your file, or use a with-statement
...
Finally you will be able to do
import scheduler

def main():
    sched = scheduler.Scheduler()
    sched.populate_line_list("/home/scabandari/Desktop/threads.txt")

if __name__ == "__main__":
    main()

Python Running Multiple Locks across Multiple Threads

So the situation is that I have multiple methods, which might be run in threads simultaneously, but each needs its own lock so that it is not run again in another thread until it has finished. They are established by initialising a class with some data-processing options:
class InfrequentDataDaemon(object): pass
class FrequentDataDaemon(object): pass

def addMethod(name):
    def wrapper(f):
        setattr(name, f.__name__, staticmethod(f))
        return f
    return wrapper

class DataProcessors(object):
    lock = threading.Lock()

    def __init__(self, options):
        self.common_settings = options['common_settings']
        self.data_processing_configurations = options['data_processing_configurations']  # Configs for each processing method
        self.data_processing_types = options['data_processing_types']
        self.Data_Processing_Functions = {}

        # I __init__ each processing method as a separate function so that it can be locked
        for type in options['data_processing_types']:
            def bindFunction1(name):
                def func1(self, data=None, lock=None):
                    config = self.data_processing_configurations[data['type']]  # I get the right config for the datatype
                    with lock:
                        FetchDataBaseStuff(data['type'])
                        # I don't want this to be run more than once at a time per DataProcessing Type
                        # But it's fine if multiple DoSomethings run at once, as long as each DataType is different!
                        DoSomething(data, config)
                        WriteToDataBase(data['type'])
                func1.__name__ = "Processing_for_{}".format(type)
                self.Data_Processing_Functions[func1.__name__] = func1  # Add this function to the dictionary object
            bindFunction1(type)

        # Then I add some methods to a daemon that are going to check if our DataProcessors need to be called
        def fast_process_types(data):
            if not example_condition is True: return
            if not data['type'] in self.data_processing_types: return  # Check that we are doing something with this type of data
            threading.Thread(target=self.Data_Processing_Functions["Processing_for_{}".format(data['type'])],
                             args=(self, data, lock)).start()

        def slow_process_types(data):
            if not some_other_condition is True: return
            if not data['type'] in self.data_processing_types: return  # Check that we are doing something with this type of data
            threading.Thread(target=self.Data_Processing_Functions["Processing_for_{}".format(data['type'])],
                             args=(self, data, lock)).start()

        addMethod(InfrequentDataDaemon)(slow_process_types)
        addMethod(FrequentDataDaemon)(fast_process_types)
The idea is to lock each method in DataProcessors.Data_Processing_Functions, so that each method is only accessed by one thread at a time (and the rest of the threads for the same method are queued). How does locking need to be set up to achieve this effect?
I'm not sure I completely follow what you're trying to do here, but could you just create a separate threading.Lock object for each type?
class DataProcessors(object):
    def __init__(self, options):
        self.common_settings = options['common_settings']
        self.data_processing_configurations = options['data_processing_configurations']  # Configs for each processing method
        self.data_processing_types = options['data_processing_types']
        self.Data_Processing_Functions = {}
        self.locks = {}

        # I __init__ each processing method as a separate function so that it can be locked
        for type in options['data_processing_types']:
            self.locks[type] = threading.Lock()

            def bindFunction1(name):
                def func1(self, data=None):
                    config = self.data_processing_configurations[data['type']]  # I get the right config for the datatype
                    with self.locks[data['type']]:
                        FetchDataBaseStuff(data['type'])
                        DoSomething(data, config)
                        WriteToDataBase(data['type'])
                func1.__name__ = "Processing_for_{}".format(type)
                self.Data_Processing_Functions[func1.__name__] = func1  # Add this function to the dictionary object
            bindFunction1(type)

        # Then I add some methods to a daemon that are going to check if our DataProcessors need to be called
        def fast_process_types(data):
            if not example_condition is True: return
            if not data['type'] in self.data_processing_types: return  # Check that we are doing something with this type of data
            threading.Thread(target=self.Data_Processing_Functions["Processing_for_{}".format(data['type'])],
                             args=(self, data)).start()

        def slow_process_types(data):
            if not some_other_condition is True: return
            if not data['type'] in self.data_processing_types: return  # Check that we are doing something with this type of data
            threading.Thread(target=self.Data_Processing_Functions["Processing_for_{}".format(data['type'])],
                             args=(self, data)).start()

        addMethod(InfrequentDataDaemon)(slow_process_types)
        addMethod(FrequentDataDaemon)(fast_process_types)
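Stripped of the surrounding machinery, a minimal sketch of the underlying pattern (hypothetical names, not from the original answer): keep one threading.Lock per data type in a dict and acquire the matching lock inside each worker, so calls for the same type run one at a time while different types proceed in parallel.

import threading
import time

types = ['audio', 'video']
locks = {t: threading.Lock() for t in types}

def process(data_type, item):
    # Only one thread at a time may process a given type;
    # threads working on other types are unaffected.
    with locks[data_type]:
        time.sleep(0.1)  # stand-in for the real database/processing work
        print(data_type, item)

threads = [threading.Thread(target=process, args=(t, i))
           for i in range(3) for t in types]
for th in threads:
    th.start()
for th in threads:
    th.join()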

replace class method with simple function

As an example here, I want to make a function to temporarily redirect stdout to a log file.
The tricky thing is that the code has to keep the file handle and the saved stdout for restoration after the redirect, so I wrote it as a class to keep these two variables.
Here below is the full code:
import datetime as dt
import os
import sys

class STDOUT2file:
    def __init__(self, prefix='report#'):
        now = dt.date.today()
        repname = repnameprefix = prefix + now.strftime("%Y%m%d") + '.txt'
        count = 0
        while os.path.isfile(repname):
            count += 1
            repname = repnameprefix + (".%02d" % (count))
        self.sav = sys.stdout
        f = open(repname, 'w')
        sys.stdout = f
        self.fname = repname
        self.fhr = f

    def off(self, msg=False):
        sys.stdout = self.sav
        self.fhr.close()
        if msg:
            print('output to:' + self.fname)
        return
Here is the code to apply it:
outbuf = STDOUT2file()
# code that prints to stdout
outbuf.off(msg=True)
I want to make it cleaner. I read about closures, but a closure returns a function on the first call, which is a kind of assignment similar to a class. I want it to be like:
STDOUT2file('on')
STDOUT2file('off', msg=True)
Note: redirecting stdout is just an example I encountered. What I am wondering is whether there is any way, other than a class, to build simple on/off functionality like this, which involves storing and retrieving state variables that are better kept invisible to the outside.
Try using a context manager instead. This idiom is common enough that it was included in the PEP that introduced context managers (slightly modified here):
from contextlib import contextmanager

@contextmanager
def redirect_stdout(new_stdout):
    import sys
    save_stdout = sys.stdout
    sys.stdout = new_stdout
    try:
        yield
    finally:
        sys.stdout = save_stdout
Or, if you like, the class-based version with __enter__ and __exit__:
class redirect_stdout:
    """Context manager for temporarily redirecting stdout to another file

    docstring truncated
    """
    def __init__(self, new_target):
        self.new_target = new_target

    def __enter__(self):
        self.old_target = sys.stdout
        sys.stdout = self.new_target
        return self.new_target

    def __exit__(self, exctype, excinst, exctb):
        sys.stdout = self.old_target
Raymond Hettinger actually committed this to contextlib; it will be included in Python 3.4 as contextlib.redirect_stdout().
Basic usage:
with open('somelogfile', 'a') as f:
    with redirect_stdout(f):
        print(something)
Yes, you can save state information in a function. Just name the variable functionname.something and it will be saved. For example:
def stdout2file(status, prefix='pre', msg=False):
    import datetime as dt
    import os
    import sys

    if not hasattr(stdout2file, 'sav'):
        stdout2file.sav = None

    if status == 'on':
        if stdout2file.sav:
            print('You have already triggered this once. Ignoring this request.')
        else:
            now = dt.date.today()
            repname = repnameprefix = prefix + now.strftime("%Y%m%d") + '.txt'
            count = 0
            while os.path.isfile(repname):
                count += 1
                repname = repnameprefix + (".%02d" % (count))
            stdout2file.sav = sys.stdout
            f = open(repname, 'w')
            sys.stdout = f
            stdout2file.fhr = f
    elif status == 'off':
        if not stdout2file.sav:
            print('Redirect is "off" already. Ignoring this request.')
        else:
            sys.stdout = stdout2file.sav
            stdout2file.fhr.close()
            if msg:
                print('output to:' + stdout2file.fhr.name)
            stdout2file.sav = None
    else:
        print('Unrecognized request')
It is also possible to keep status information in mutable keyword parameters like so:
def stdout_toggle(prefix='pre', msg=False, _s=[None, None]):
    import datetime as dt
    import os
    import sys

    if _s[0] is None:
        now = dt.date.today()
        repname = repnameprefix = prefix + now.strftime("%Y%m%d") + '.txt'
        count = 0
        while os.path.isfile(repname):
            count += 1
            repname = repnameprefix + (".%02d" % (count))
        f = open(repname, 'w')
        _s[:] = [sys.stdout, f]
        sys.stdout = f
    else:
        sys.stdout = _s[0]
        _s[1].close()
        if msg:
            print('output to:' + _s[1].name)
        _s[:] = [None, None]
The user can call the above without any arguments and it will toggle between the redirect between on and off. The function remembers the current status through the keyword parameter _s which is a mutable list.
Although some consider the fact that mutable keyword parameters are preserved between function calls to be a language flaw, it is consistent with Python's philosophy. It works because the default values for keyword parameters are assigned when the function is first defined, that is, when the def statement is executed, not when the function is called. Consequently, _s=[None, None] is assigned once at definition time and is free to vary thereafter.
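For illustration, a brief usage sketch of the toggle version above (the printed text is arbitrary; the report file name comes from the prefix and the date):

stdout_toggle()           # first call: redirects stdout to a new report file
print('this line goes to the report file')
stdout_toggle(msg=True)   # second call: restores stdout and reports the file name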
