How to queue 3 dependent functions using threads and queues - python

I have 3 functions which I need to run. Each function generates output that the next function depends on, so I can only start the second one once the first one finishes, and the third only once the second finishes. But once the second one is running, I can start the first one again to generate the next batch of data, and I can run the first and second while the third is running. So I want them running at the same time, just never ahead of their input. How can I implement that using threading in Python? I understand the basics behind threading, but I don't know how to create the queue for that purpose.
This is an example of what I need to do:
# This is what will usually happen without threading. How can I implement the
# same thing but with threading? Keep in mind that foo(), foo2() and foo3()
# each take some amount of time, and foo2() may finish before foo() has finished
# generating the next batch, so I can't run foo2() until I have the data from foo()
# foo generates the data for foo2
data = foo()
# foo2 generates data for foo3
data2 = foo2(data)
# foo3 does something with data2 and the data is no longer used
foo3(data2)

After going through the comments on your question I did a little more searching. I understand that what you are looking for is some kind of pipeline pattern, and it seems Python does have something for it. The answer to the question in the link below also talks about what @Peterwood said about making use of queues.
How to design an async pipeline pattern in python
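
For illustration, here is a minimal sketch of such a queue-based pipeline, assuming foo(), foo2() and foo3() are defined as in the question; the stage functions, the None sentinel and the queue sizes are my own choices, not a canonical implementation.

import threading
import queue

SENTINEL = None  # marks the end of the data stream

def stage1(out_q, n_batches):
    # keep generating batches; put() blocks if the next stage hasn't caught up
    for _ in range(n_batches):
        out_q.put(foo())
    out_q.put(SENTINEL)

def stage2(in_q, out_q):
    while True:
        data = in_q.get()        # blocks until stage1 has produced data
        if data is SENTINEL:
            out_q.put(SENTINEL)  # pass the end marker downstream
            break
        out_q.put(foo2(data))

def stage3(in_q):
    while True:
        data2 = in_q.get()       # blocks until stage2 has produced data
        if data2 is SENTINEL:
            break
        foo3(data2)

# maxsize=1 keeps each stage at most one batch ahead of the next
q12 = queue.Queue(maxsize=1)
q23 = queue.Queue(maxsize=1)

threads = [threading.Thread(target=stage1, args=(q12, 10)),
           threading.Thread(target=stage2, args=(q12, q23)),
           threading.Thread(target=stage3, args=(q23,))]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each stage blocks on get() until the previous stage has produced something, which enforces the ordering, while all three stages can still work on different batches at the same time.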

Multiprocessing pool for function with no arguments/iterable?

I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 GB of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs in an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty, it picks a random file (task) and passes the work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up the calculations I have to create many identical single-CPU instances, and this means there is a large cost overhead for creating many 80 GB disks and transferring the data to them each time, even though any particular calculation only "reads" one small portion of the data. I want to make everything more efficient and cost effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way, they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here

def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                pass  # set up and run analysis module, etc.
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing

if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am using for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs were 1, then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, an empty string and empty brackets (a tuple?) and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread Multiprocess Pools with different functions and this one python code with multiprocessing only spawns one process each time seem to be close to what I am doing, but they still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process() objects.
If you want to use Pool() to run a function 2 times then pass a list with 2 arguments (even if you don't need the arguments), because that is the information Pool() uses to run it 2 times.
But if you run the function 2 times against the same folder then it may run the same task 2 times; if you run it 5 times then it may run the same task 5 times. I don't know if that is what you need.
As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.
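
To illustrate the Process() suggestion: the following sketch starts one workermaster() loop per CPU, assuming workermaster() is defined as in the question (the worker count is just an illustrative choice).

import multiprocessing

if __name__ == '__main__':
    num_workers = multiprocessing.cpu_count()
    workers = [multiprocessing.Process(target=workermaster)
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # each loop is infinite, so this blocks until the processes are killed

Because every process runs its own independent infinite loop, no return value and no iterable are needed: Process(target=...) simply calls the function once in a new process.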

Python - Running Function in Separate Thread, Then Accessing It

So I have a Python program running with one very expensive function that gets executed at times on demand, but its result is not needed straight away (it can be delayed by a few cycles).
def heavy_function(arguments):
    return calc_obtained_from_arguments

def main():
    a = None
    if some_condition:
        a = heavy_function(x)
    else:
        do_something_with(a)
The thing is that whenever I calculate heavy_function, the rest of the program hangs. However, I need it to keep running with an empty a value, or better, to know that a is being processed separately and thus should not be accessed. How can I move heavy_function to a separate process, keep calling the main function the whole time until heavy_function is done executing, and then read the obtained a value and use it in the main function?
You could use a simple queue.
Put your heavy_function inside a separate process that idles as long as there is no input in the input queue. Use Queue.get(block=True) to do so. Put the result of the computation in another queue.
Run your normal process with the empty a-value and check emptiness of the output queue from time to time. Maybe use while Queue.empty(): here.
If an item becomes available because your heavy_function has finished, switch to a calculation with the value a from your output queue.
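
Here is a minimal sketch of that two-queue approach using multiprocessing, reusing heavy_function(), some_condition, x and do_something_with() from the question; the polling structure of the main loop is my own assumption.

import multiprocessing
import queue  # only needed for the queue.Empty exception

def worker(in_q, out_q):
    while True:
        arguments = in_q.get(block=True)  # idle until work arrives
        out_q.put(heavy_function(arguments))

if __name__ == '__main__':
    in_q = multiprocessing.Queue()
    out_q = multiprocessing.Queue()
    multiprocessing.Process(target=worker, args=(in_q, out_q),
                            daemon=True).start()

    a = None
    while True:
        if some_condition and a is None:
            in_q.put(x)                # request the expensive computation
        try:
            a = out_q.get_nowait()     # non-blocking check of the output queue
        except queue.Empty:
            pass                       # result not ready yet; keep cycling
        do_something_with(a)           # the main loop never hangs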

Python - Can't run code during while loop

I am pretty new to Python, and while using a module to print out packets being received, I can't execute any other code while the while loop that reads the packets is running. Here is a basic example. Any help would be appreciated.
def foo():
    while True:
        print("bar")

foo()
print("foobar")
I want it to print foobar once after the while loop has started. Is this possible?
Typically in Python (and most other languages), you start with just one thread of execution.
A while True: ... is an infinite loop – unless code inside the loop breaks out, or something external interrupts it, it never ends. So your code never reaches the print('foobar') call.
You could put a special case inside the while loop, for the first pass through, that reports what you want. Or you could look into using multiple threads of execution – an advanced topic.
The program executes sequentially, so the print will never happen because of the infinite loop. You must use a thread to circumvent this, allowing you to execute code simultaneously, like so:
threading.Thread(target=foo).start()  # run foo() in a new thread instead of calling it in the main thread
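
For completeness, here is a minimal runnable version of that suggestion; the time.sleep() calls are my additions so the interleaved output is readable.

import threading
import time

def foo():
    while True:
        print("bar")
        time.sleep(1)  # slow the loop down so both outputs are visible

threading.Thread(target=foo, daemon=True).start()  # daemon: dies with the main thread
print("foobar")  # printed once while foo() keeps looping in the background
time.sleep(5)    # keep the main thread alive long enough to watch the output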

Multiprocessing, pooling and randomness

I am experiencing a strange thing: I wrote a program to simulate economies. Instead of running the simulations one by one on one CPU core, I want to use multiprocessing to make things faster. So I run my code (fine), and I want to get some stats from the simulations I am doing. Then comes a surprise: all the simulations run at the same time yield the very same result! Is there some strange relationship between Pool() and random.seed()?
To be much clearer, here is what the code can be summarized as:
from multiprocessing import Pool

class Economy(object):
    def __init__(self, i):
        self.run_number = i
        self.Statistics = Statistics()
        self.process()

def run_and_return(i):
    eco = Economy(i)
    return eco

collection = []

def get_result(x):
    collection.append(x)

if __name__ == '__main__':
    pool = Pool(processes=4)
    for i in range(NRUN):
        pool.apply_async(run_and_return, (i,), callback=get_result)
    pool.close()
    pool.join()
process() is the method that goes through every step of the simulation, over i steps. Basically, I simulate NRUN economies, from which I get the Statistics that I put in the list collection.
Now the strange thing is that the output of this is exactly the same for the first 4 runs: during the same "wave" of simulations, I get the very same output. Once I get to the second wave, then I get a different output for the next 4 simulations!
All these simulations run well if I use the same program with processes=1: I get different results when I only work on one core, taking simulations one by one... I have tried a few things, but can't get my head around this, hence my post...
Thank you very much for taking the time to read this long post; do not hesitate to ask for more details!
All the best,
If you are on Linux then each pool process is made by forking the parent process. This means the process is literally duplicated - this includes the seed any random object may be using.
The random module selects the seed for its default functions on import, meaning the seed has already been selected before you create the Pool.
To get around this you must use an initialiser for each pool process that sets the random seed to something unique.
A decent way to seed random would be to use the process id and the current time. The process id is bound to be unique on a single run of your program. Whilst using the time will ensure uniqueness over multiple runs in case the same process id is produced. Passing process id and time through as a string will mean that the digest of the string is also used to seed the random number generator -- meaning two similar strings will produce substantially different seeds. Alternatively, you could use the uuid module to generate seeds.
import os
import random
import time
from multiprocessing import Pool

def proc_init():
    random.seed(str(os.getpid()) + str(time.time()))

pool = Pool(num_procs, initializer=proc_init)
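
As a variant of the same idea, the uuid module mentioned above can supply the unique seed; this is a sketch, not part of the original answer's code.

import random
import uuid
from multiprocessing import Pool

def proc_init():
    # uuid4() is drawn from os.urandom, so it differs in every pool
    # process, even though they are forked from the same parent
    random.seed(uuid.uuid4().int)

pool = Pool(4, initializer=proc_init)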

Is this usage of Python threading safe/good?

I've got an application which gets some results from some URLs and then has to take a decision based on the results (i.e. pick the best result and display it to the user). Since I want to check several URLs, this is the first time multithreading is really needed.
So with the help of some examples I cooked up the following testcode:
import threading
import urllib2

threadsList = []
theResultList = []

def get_url(url):
    result = urllib2.urlopen(url).read()
    theResultList.append(result[0:10])

theUrls = ['http://google.com', 'http://yahoo.com']

for u in theUrls:
    t = threading.Thread(target=get_url, args=(u,))
    threadsList.append(t)
    t.start()
    t.join()

print theResultList
This seems to work, but I'm really unsure here because I have virtually no experience with multithreading. I always hear these terms like "thread safe" and "race condition".
Of course I read about these things, but since this is my first time using something like this, my question is: is it ok to do it like this? Are there any negative or unexpected effects which I overlook? Are there ways to improve this?
All tips are welcome!
You have to worry about race conditions when you have multiple threads modifying the same object. In your case you have this exact condition - all threads are modifying theResultList.
However, Python's lists are themselves thread safe: appends to a list from multiple threads will not corrupt the list structure. You still have to take care to protect concurrent modifications to individual list elements, however. For example:
# not thread safe code! - all threads modify the same element
def get_url(url):
    result = urllib2.urlopen(url).read()
    # in this example, theResultList is a list of integers
    theResultList[0] += 1
In your case, you aren't doing something like this, so your code is fine.
Side note:
The reason incrementing an integer isn't thread safe is that it's actually two operations - one operation to read the value, and one operation to increment it. A thread can be interrupted between these two steps (by another thread that also wants to increment the same variable) - this means that when the thread finally does the increment in the second step, it could be incrementing an out-of-date value.
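
To make that concrete, here is a small self-contained sketch (my own example, not from the original answer) showing how a threading.Lock turns the read-and-increment into a single atomic step.

import threading

counter = [0]                  # shared mutable state
counter_lock = threading.Lock()

def increment():
    for _ in range(100000):
        with counter_lock:     # no other thread can interleave here
            counter[0] += 1    # read + write now happen as one step

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # reliably 400000 with the lock; without it, updates can be lost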
