How to handle output with Luigi - python

I'm trying to grasp how Luigi works, and I get the general idea, but the actual implementation is a bit harder ;) This is what I have:
import luigi

class MyTask(luigi.Task):
    x = luigi.IntParameter()

    def requires(self):
        return OtherTask(self.x)

    def run(self):
        print(self.x)

class OtherTask(luigi.Task):
    x = luigi.IntParameter()

    def run(self):
        y = self.x + 1
        print(y)
And this fails with RuntimeError: Unfulfilled dependency at run time: OtherTask_3_5862334ee2. I've figured out that I need to produce output using def output(self): to work around this issue/feature. But I can't see how to produce reasonable output without writing to a file, say:
def output(self):
    return luigi.LocalTarget('words.txt')

def run(self):
    words = [
        'apple',
        'banana',
        'grapefruit'
    ]
    with self.output().open('w') as f:
        for word in words:
            f.write('{word}\n'.format(word=word))
I've tried reading the documentation, but I can't understand the concept behind output() at all. What if I only need to output to the screen? What if I need to pass an object on to another task? Thanks!

What if I need to output an object to another task?
Luigi tasks can run in different processes. Therefore you usually do have to write to disk, a database, a pickle file, or some other external mechanism that allows data to be exchanged between the processes (and whose existence can be verified) if you want to pass an object that is the result of one task on to another.
Instead of writing the output() method, which requires a target, you can also override the complete() method, where you can put any custom logic that determines when the task should be considered complete.
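For illustration, here is a minimal sketch of what that usually looks like (the file name pattern and the JSON serialization are just assumptions for this example): OtherTask writes its result to a target in output(), and MyTask reads it back through self.input().

import json
import luigi

class OtherTask(luigi.Task):
    x = luigi.IntParameter()

    def output(self):
        # The target is how Luigi decides whether this task has already run.
        return luigi.LocalTarget('other_task_{}.json'.format(self.x))

    def run(self):
        y = self.x + 1
        with self.output().open('w') as f:
            json.dump({'y': y}, f)

class MyTask(luigi.Task):
    x = luigi.IntParameter()

    def requires(self):
        return OtherTask(self.x)

    def run(self):
        # self.input() is the output() target of the required task.
        with self.input().open('r') as f:
            y = json.load(f)['y']
        print(y)

MyTask has no output() of its own here, so Luigi will consider it incomplete and re-run it every time; in a real pipeline you would normally give it a target as well.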

Related

Updating the same instance variables from different processes

Here is a simple scenario:
class Test:
    def __init__(self):
        self.foo = []

    def append(self, x):
        self.foo.append(x)

    def get(self):
        return self.foo

def process_append_queue(append_queue, bar):
    while True:
        x = append_queue.get()
        if x is None:
            break
        bar.append(x)
    print("worker done")

def main():
    import multiprocessing as mp
    bar = Test()
    append_queue = mp.Queue(10)
    append_queue_process = mp.Process(target=process_append_queue, args=(append_queue, bar))
    append_queue_process.start()
    for i in range(100):
        append_queue.put(i)
    append_queue.put(None)
    append_queue_process.join()
    print(str(bar.get()))

if __name__ == "__main__":
    main()
When I call bar.get() at the end of the main() function, why does it still return an empty list? How can I make it so that the child process works with the same instance of Test, not a new one?
All answers appreciated!
In general, processes have distinct address spaces, so that mutations of an object in one process have no effect on any object in any other process. Interprocess communication is needed to tell a process about changes made in another process.
That can be done explicitly (using things like multiprocessing.Queue), or implicitly if you use a facility implemented by multiprocessing for this purpose. For example, a great deal of work is done under the covers to make changes to a multiprocessing.Queue visible across processes.
The easiest way in your specific example is to replace your __init__ function like so:
def __init__(self):
    import multiprocessing as mp
    self.foo = mp.Manager().list()
It so happens that an mp.Manager instance supports a list() method that creates a process-aware list object (really a proxy for a list object, which forwards list operations to an under-the-covers server process that maintains a single copy of "the real" list - the list object isn't really shared across processes, because that's impossible - but the proxies make it appear to be shared).
So if you make that change, your code will display the results you expect - and there is no simpler way.
Note that multiprocessing works better the less IPC (interprocess communication) you need, and that's true pretty much regardless of application or programming language.
Objects are copied between processes by pickling them and passing the resulting string over a pipe. There is no way to achieve true "shared memory" for pure Python objects between processes. To achieve precisely this type of synchronization, take a look at the multiprocessing.Manager documentation (https://docs.python.org/2/library/multiprocessing.html#managers), which provides examples of synchronized versions of common Python container types. These are "proxied" containers where operations on the proxy send all arguments across the process boundary, pickled, and are then executed in the manager's server process.
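To make that concrete, here is a minimal self-contained sketch (the worker function and names are mine, not from the question) showing a Manager-backed list that both the parent and the child process see:

import multiprocessing as mp

def worker(shared):
    # Appends made here are forwarded to the manager's server process,
    # so the parent sees them as well.
    for i in range(5):
        shared.append(i)

if __name__ == '__main__':
    manager = mp.Manager()
    shared = manager.list()
    p = mp.Process(target=worker, args=(shared,))
    p.start()
    p.join()
    print(list(shared))  # [0, 1, 2, 3, 4]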

Structuring Python Code for Data Analysis

I wrote code for a data analysis project, but it's becoming unwieldy and I'd like to find a better way of structuring it so I can share it with others.
For the sake of brevity, I have something like the following:
def process_raw_text(txt_file):
    # do stuff
    return token_text

def tag_text(token_text):
    # do stuff
    return tagged

def bio_tag(tagged):
    # do stuff
    return bio_tagged

def restructure(bio_tagged):
    # do stuff
    return restructured

print(restructured)
Basically I'd like the program to run through all of the functions sequentially and print the output.
In looking into ways to structure this, I read up on classes like the following:
class Calculator():
    def add(x, y):
        return x + y

    def subtract(x, y):
        return x - y
This seems useful when structuring a project to allow individual functions to be called separately, such as the add function with Calculator.add(x,y), but I'm not sure it's what I want.
Is there something I should be looking into for a sequential run of functions (that are meant to structure the data flow and provide readability)? Ideally, I'd like all functions to be within "something" I could call once, that would in turn run everything within it.
Chain together the output from each function as the input to the next:
def main():
    print(restructure(bio_tag(tag_text(process_raw_text(txt_file)))))

if __name__ == '__main__':
    main()
@SvenMarnach makes a nice suggestion. A more general solution is to realise that this idea of repeatedly using the output as the input for the next function in a sequence is exactly what the reduce function does. We want to start with some input txt_file:
from functools import reduce  # a builtin on Python 2, in functools on Python 3

def main():
    pipeline = [process_raw_text, tag_text, bio_tag, restructure]
    # feed the result of each stage into the next one
    print(reduce(lambda data, step: step(data), pipeline, txt_file))
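If reduce feels too clever, the same chaining can be written as a plain loop, which many readers find easier to follow (a sketch reusing the function names from the question; 'input.txt' is just a placeholder path):

def run_pipeline(txt_file):
    data = txt_file
    # apply each stage in order, passing the result along
    for step in (process_raw_text, tag_text, bio_tag, restructure):
        data = step(data)
    return data

if __name__ == '__main__':
    print(run_pipeline('input.txt'))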
There's nothing preventing you from creating a class (or set of classes) that represents the processing you want to manage, with an implementation that calls the functions you need in sequence.
class DataAnalyzer():
    # ...
    def your_method(self, **kwargs):
        # call sequentially, or use the 'magic' proposed by others,
        # but internally to your class and not visible to clients
        pass
The functions themselves could remain private within the module, since they seem to be implementation details.
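As a minimal sketch of that idea (the method name analyze is invented for the example), the module-level functions stay hidden behind a single call:

class DataAnalyzer(object):
    def analyze(self, txt_file):
        # run the stages in order and hand back only the final result
        token_text = process_raw_text(txt_file)
        tagged = tag_text(token_text)
        bio_tagged = bio_tag(tagged)
        return restructure(bio_tagged)

analyzer = DataAnalyzer()
print(analyzer.analyze('input.txt'))  # placeholder input path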
You can implement a simple dynamic pipeline using just modules and functions.
my_module.py
# identifiers cannot start with a digit, so put the step number after a prefix
def step_01_process_raw_text(txt_file):
    # do stuff
    return token_text

def step_02_tag_text(token_text):
    # do stuff
    return tagged
my_runner.py
import re

import my_module

if __name__ == '__main__':
    # collect the numbered step functions in sorted (i.e. numeric) order
    funcs = sorted(x for x in my_module.__dict__ if re.match(r'step_\d+_', x))
    data = initial_data  # whatever the first stage expects, e.g. a file path
    for f in funcs:
        data = my_module.__dict__[f](data)

Python multiprocessing seems near impossible to do within classes/using any class instances. What is its intended use?

I have an algorithm that I am trying to parallelize, because of very long run times in serial. However, the function that needs to be parallelized is inside a class. multiprocessing.Pool seems to be the best and fastest way to do this, but there is a problem. Its target function cannot be a bound method of an object instance. Meaning this: you declare a Pool in the following way:
import multiprocessing as mp
cpus = mp.cpu_count()
poolCount = cpus*2
pool = mp.Pool(processes = poolCount, maxtasksperchild = 2)
And then actually use it like so:
pool.map(self.TargetFunction, args)
But this throws an error, because object instances cannot be pickled, which is what Pool does to pass information to all of its child processes. But I have to use self.TargetFunction.
So I had an idea: I would create a new Python file named parallel and simply write a couple of functions without putting them in a class, and call those functions from within my original class (whose function I want to parallelize).
So I tried this:
import multiprocessing as mp

def MatrixHelper(args):
    WM = args[0][0]
    print(WM.CreateMatrixMp(*args))
    return WM.CreateMatrixMp(*args)

def Start(sigmaI, sigmaX, numPixels, WM):
    cpus = mp.cpu_count()
    poolCount = cpus * 2
    args = [(WM, sigmaI, sigmaX, i) for i in range(numPixels)]
    print('Number of cpu\'s to process WM:%d' % cpus)
    pool = mp.Pool(processes=poolCount, maxtasksperchild=2)
    tempData = pool.map(MatrixHelper, args)
    return tempData
These functions are not part of a class, so using MatrixHelper in Pool's map function works fine. But I realized while doing this that it was no way out. The function in need of parallelization (CreateMatrixMp) expects an object to be passed to it (it is declared as def CreateMatrixMp(self, sigmaI, sigmaX, i)).
Since it is not being called from within its class, it doesn't get a self passed to it. To solve this, I passed the Start function the calling object itself. As in, I say parallel.Start(sigmaI, sigmaX, self.numPixels, self). The object self then becomes WM so that I will finally be able to call the desired function as WM.CreateMatrixMp().
I'm sure that is a very sloppy way of coding, but I just wanted to see if it would work. But nope, a pickling error again: the map function cannot handle any object instances at all.
So my question is, why is it designed this way? It seems useless, completely dysfunctional in any program that uses classes at all.
I tried using Process rather than Pool, but this requires the array that I am ultimately writing to to be shared, which requires the processes to wait for each other. If I don't want it to be shared, then I have each process write its own smaller array and do one big write at the end. But both of these result in slower run times than when I was doing this serially! Python's built-in multiprocessing seems absolutely useless!
Can someone please give me some guidance as to how to actually save time with multiprocessing, in the context of my target function being inside a class? I have read in posts here to use pathos.multiprocessing instead, but I am on Windows, and am working on this project with multiple people who all have different setups. Having everyone try to install it would be inconvenient.
I was having a similar issue with trying to use multiprocessing within a class. I was able to solve it with a relatively easy workaround I found online. Basically, you use a function outside of your class that unwraps/unpacks the method inside your class that you're trying to parallelize. Here are the two websites I found that explain how to do it.
Website 1 (joblib example)
Website 2 (multiprocessing module example)
For both, the idea is to do something like this:
from multiprocessing import Pool
import time

def unwrap_self_f(arg, **kwarg):
    return C.f(*arg, **kwarg)

class C:
    def f(self, name):
        print('hello %s,' % name)
        time.sleep(5)
        print('nice to meet you.')

    def run(self):
        pool = Pool(processes=2)
        names = ('frank', 'justin', 'osi', 'thomas')
        pool.map(unwrap_self_f, zip([self]*len(names), names))

if __name__ == '__main__':
    c = C()
    c.run()
The essence of how multiprocessing works is that it spawns sub-processes that receive parameters to run a certain function. In order to pass these arguments, it needs them to be, well, passable: not things that are exclusive to the main process, such as sockets, file descriptors and other low-level, OS-related resources.
This translates into "they need to be picklable or serializable".
On the same topic, parallel processing works best when you (can) have self-contained divisions of a problem. I can tell you want to share some sort of input/stream/database source, but this will probably create a bottleneck that you'll have to tackle at some point (at least from the "Python script" side, rather than the "OS/database" side). Fortunately, you get to tackle it early now.
You can re-code your classes to spawn/create these non-picklable resources when needed rather than at start:
def targetFunction(self, range_params):
    if not self.ready():
        self._init_source()
    # rest of the code
You kinda tackled the problem the other way around (initialized an object based on params). And yes, parallel processing comes with a cost.
You can see the multiprocessing programming guidelines for an even more thorough insight on this matter.
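As a hedged sketch of that lazy-initialization idea (the class, the SQLite file, and the table layout here are all invented for illustration), each worker builds its own non-picklable resource the first time it needs it, so only plain strings and numbers ever cross the process boundary:

import sqlite3

class Worker(object):
    def __init__(self, db_path):
        # Only the path (a plain string) travels to the child processes.
        self.db_path = db_path
        self._conn = None  # the connection itself is created lazily, per process

    def _ensure_connection(self):
        if self._conn is None:
            self._conn = sqlite3.connect(self.db_path)

    def target_function(self, row_id):
        self._ensure_connection()
        cur = self._conn.execute('SELECT value FROM data WHERE id = ?', (row_id,))
        return cur.fetchone()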
This is an old post, but it is still one of the top results when you search for the topic. Some good info for this question can be found in this Stack Overflow question: python subclassing multiprocessing.Process
I tried some workarounds for calling pool.starmap from inside a class on another function in the class. Making it a staticmethod or having a function on the outside call it didn't work and gave the same error. A class instance just can't be pickled, so we need to create the instance after we start the multiprocessing.
What I ended up doing that worked for me was to separate my class into two classes. Basically, the function you are calling the multiprocessing on needs to be called right after you instantiate a new object for the class it belongs to.
Something like this:
from multiprocessing import Pool

class B:
    ...
    def process_feature(self, idx, feature):
        # do stuff in the new process
        pass
    ...

def multiprocess_feature(process_args):
    b_instance = B()
    return b_instance.process_feature(*process_args)

class A:
    ...
    def process_stuff(self):
        ...
        with Pool(processes=num_processes, maxtasksperchild=10) as pool:
            results = pool.starmap(
                multiprocess_feature,
                [
                    (idx, feature)
                    for idx, feature in enumerate(features)
                ],
                chunksize=100,
            )
        ...
    ...
...

Creating a variable that can be compared across processes

I have code like the following,
import multiprocessing

class _Process(multiprocessing.Process):

    STOP = multiprocessing.Manager().Event()

    def __init__(self, queue, process_fn):
        self._q = queue
        self._p = process_fn
        super().__init__()

    def run(self):
        while True:
            dat = self._q.get()
            if dat is not _Process.STOP:
                self._p(dat, self._q)
                self._q.task_done()
            else:
                self._q.task_done()
                break
but I cannot compare STOP successfully. This isn't that surprising when I'm using is, since, I believe, is compares object ids, and from the docs: "... This is the address of the object in memory." So, since I'm using multiple processes, the memory address will be different. (I can't compare it with == either, though, and I'm not sure why that is.)
This happens with any object I create with Manager(), but if I use a "true" singleton (True or False or None) it does work. Although that's not an appropriate solution since any of those values may be valid in the queue.
So how can I create a variable, like the singletons, that can be compared across processes?
(N.B. I have tried using a dedicated class too, but get errors about it not being able to be pickled.)
Update: The answer does seem to be to use a class, but I was getting the pickling problems because I was only trying with an inner class. Moving it to module scope fixed the error and it works fine. - Thanks @Schnouki!
Here's an example (and pointless) usage of the code, that shows the error ...
def f(data, queue):
    print(data)

q = multiprocessing.JoinableQueue()
for i in range(4):
    p = _Process(q, f)
    p.daemon = True
    p.start()
    q.put(i)
q.join()
for i in range(4):
    q.put(_Process.STOP)
q.join()
This is a weird way to use an Event object... If you can't use None or a boolean, I suggest you use a dedicated class and test the type of what you get from the queue:
class StopProcessing(object):
    pass

# ...
q.put(StopProcessing())
# ...
while True:
    dat = self._q.get()
    if type(dat) is StopProcessing:
        # ...
Or, of course, you could just keep using the multiprocessing.Event and test for its type. However this would probably be quite misleading for someone else reading your code; using a dedicated type seems much cleaner and Pythonic to me.
EDIT: Ok, so apparently this doesn't work because the new class is not picklable. So here's another idea: what if you directly put the type inside your queue, like this:
class StopProcessing(object):
    pass

# ...
q.put(StopProcessing)
# ...
while True:
    dat = self._q.get()
    if dat is StopProcessing:
        # ...
According to the pickle doc, "classes that are defined at the top level of a module" can be pickled.
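Here is a minimal self-contained sketch of that sentinel pattern (the worker function and the queue setup are mine, not from the question); because the class is defined at module level, it pickles by reference and still compares identical with is in the child process:

import multiprocessing

class StopProcessing(object):
    # Sentinel: the class object itself goes on the queue, never an instance.
    pass

def worker(q):
    while True:
        dat = q.get()
        if dat is StopProcessing:
            q.task_done()
            break
        print(dat)
        q.task_done()

if __name__ == '__main__':
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.daemon = True
    p.start()
    for i in range(4):
        q.put(i)
    q.put(StopProcessing)
    q.join()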

What is offered by coroutines in python that improve a naive consumer/producer setup?

I've read a little about coroutines, in particular with python, and something is not entirely obvious to me.
I have implemented a producer/consumer model, a basic version of which is as follows:
#!/usr/bin/env python

class MyConsumer(object):
    def __init__(self, name):
        self.__name = name

    def __call__(self, data):
        return self.observer(data)

    def observer(self, data):
        print(self.__name + ': ' + str(data))

class MyProducer(object):
    def __init__(self):
        self.__observers = []
        self.__counter = 0

    def add_observer(self, observer):
        self.__observers.append(observer)

    def run(self):
        while self.__counter < 10:
            for each_observer in self.__observers:
                each_observer(self.__counter)
            self.__counter += 1

def main():
    consumer_one = MyConsumer('consumer one')
    consumer_two = MyConsumer('consumer two')
    producer = MyProducer()
    producer.add_observer(consumer_one)
    producer.add_observer(consumer_two)
    # run
    producer.run()

if __name__ == "__main__":
    main()
Obviously, MyConsumer could have routines for producing as well, and so a data pipeline can be built easily. As I have implemented this in practice, a base class is defined that implements the logic of the consumer/producer model, and a single processing function is implemented that is overridden in child classes. This makes it very simple to produce data pipelines with easily defined, isolated processing elements.
This seems to me to be typical of the kinds of applications that are presented for coroutines, for example in the oft-quoted tutorial: http://www.dabeaz.com/coroutines/index.html. Unfortunately, it is not apparent to me what the advantages of coroutines are over the implementation above. I can see that in languages in which callable objects are more difficult to handle there is something to be gained, but in the case of Python, this doesn't seem to be an issue.
Can anybody shed some light on this for me? Thanks.
edit: Apologies, the producer in the above code counts from 0 to 9 and notifies the consumers, which then print out their name followed by the count value.
When using the coroutines approach, both the consumer and the producer code can be simpler sometimes. In your approach, at least one of them must be written as a finite-state-machine (assuming some state is involved).
With the coroutines approach they are essentially independent processes.
An example would help:
Take the example you provided but now assume the consumer prints only every 2nd input. Your approach requires adding an object member indicating whether the received input is an odd or even sample.
def observer(self, data):
    self.odd_sample = not self.odd_sample
    if self.odd_sample:
        print(str(data))
When using a coroutine, one would just loop over the input, dropping every second input. The 'state' is implicitly maintained by the current position in the code:
while True:
    y = producer()
    print(y)
    y = producer()
    # ignore this value
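For comparison, a hedged sketch of the same consumer written as a generator-based coroutine in the style of the linked tutorial (the name every_second_printer and the priming call are mine); the odd/even state is carried simply by where execution is paused in the loop:

def every_second_printer():
    # generator-based coroutine: receives values via (yield)
    while True:
        data = (yield)   # first value: print it
        print(data)
        _ = (yield)      # second value: silently drop it

consumer = every_second_printer()
next(consumer)  # prime the coroutine so it is waiting at the first yield
for i in range(10):
    consumer.send(i)
# prints 0, 2, 4, 6, 8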
