https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Array
What I’m trying to do
Create an array in MainProcess and send it through inheritance to any subsequent child processes. The child processes will change the array. The parent process will look out for the changes and act accordingly.
The problem
The parent process does not "see" any changes made by the child processes. However, the child processes do "see" the changes, i.e. if child 1 adds an item then child 2 will see that item, etc.
This is true for sARRAY, iARRAY, and iVALUE.
BUT
While the parent process seems to be oblivious to the array values, it does take notice of the changes made to iVALUE.
I don’t understand what I’m doing wrong.
UPDATE 2 https://stackoverflow.com/a/6455704/1267259
The main source of confusion is that multiprocessing uses separate processes and not threads. This means that any changes to object state
made by the children aren't automatically visible to the parent.
To clarify. What I want to do is possible, right?
https://stackoverflow.com/a/26554759/1267259
I mean that's the purpose with multiprocessing Array and Value, to communicate between children and parent processes? And iVALUE works so...
I’ve found this Shared Array not shared correctly in python multiprocessing
But I don’t understand the answer "Assigning to values that have meaning in all processes seems to help:"
UPDATE 1
Found
Python : multiprocessing and Array of c_char_p
> "the assignment to arr[i] points arr[i] to a memory address that was
only meaningful to the subprocess making the assignment. The other
subprocesses retrieve garbage when looking at that address."
As I understand it this doesn't apply to this problem. The assignment
by one subprocess to the array does make sense to the other
subprocesses in this case. But why doesn't it make sense for the main
process?
And I am aware of "managers" but it feels like Array should suffice for this use case. I've read the manual but obviously I don't seem to get it.
UPDATE 3 Indeed, this works
manage = multiprocessing.Manager()
mlist = manage.list(range(3))
So...
What am I doing wrong?
import multiprocessing
import ctypes
from time import sleep
class MainProcess:
# keep track of process
iVALUE = multiprocessing.Value('i',-1) # this works
# keep track of items
sARRAY = multiprocessing.Array(ctypes.c_wchar_p, 1024) # this works between child processes
iARRAY = multiprocessing.Array(ctypes.c_int, 3) # this works between child processes
pLOCK = multiprocessing.Lock()
def __init__(self):
# create an index for each process
self.sARRAY.value = [None] * 3
self.iARRAY.value = [None] * 3
def InitProcess(self):
# list of items to process
items = []
item = (i for i in items)
with(multiprocessing.Pool(3)) as pool:
# main loop: keep looking for updated values
while True:
try:
pool.apply_async(self.worker, (next(item),), callback=eat_finished_cake)
except StopIteration:
pass
print(self.sARRAY) # yields [None][None][None]
print(self.iARRAY) # yields [None][None][None]
print(self.iVALUE) # yields 1-3
pool.close()
pool.join()
def worker(self,item):
with self.pLOCK:
self.iVALUE.value += 1
self.sARRAY.value[self.iVALUE.value] = item # value: 'item 1'
self.iARRAY.value[self.iVALUE.value] = 2
# on next child process run
print(self.iVALUE.value) # prints 1
print(self.sARRAY.value) # prints ['item 1'][None][None]
print(self.iARRAY.value) # prints [2][None][None]
sleep(0.5)
...
with self.pLOCK:
self.iVALUE.value -= 1
UPDATE 4
Changing
pool.apply_async(self.worker, (next(item),))
To
x = pool.apply_async(self.worker, (next(item),))
print(x.get())
Or
x = pool.apply(self.worker, (next(item),))
print(x)
And in self.worker(), returning self.iARRAY.value or self.sARRAY.value does return a variable that has the updated value. This is not what I want to achieve though; this doesn't even require the use of ARRAY to achieve...
So I need to clarify. In self.worker() I'm doing important heavy lifting that can take a long time, and I need to send information back to the main process, e.g. the progress, before the finished return value is sent to the callback.
I don't expect the finished worker result to be returned to the main method; that is to be handled by the callback function. I see now that omitting the callback in the code example could've given a different impression, sorry.
UPDATE 5
Re: Use numpy array in shared memory for multiprocessing
I've seen that answer and tried a variation of it using initializer() with a global var and passing the array through initargs, with no luck. I don't understand the use of numpy and of "closing()", but that code doesn't seem to access "arr" inside main() although shared_arr is used, and only after p.join().
As far as I can see the array is declared, then turned into a numpy array and inherited through init(x). My code should have the same behavior as that code so far.
One major difference seems to be how the array is accessed
I've only succeeded setting and getting array value using the attribute value, when I tried
self.iARRAY[0] = 1 # instead of iARRAY.value = [None] * 3
self.iARRAY[1] = 1
self.iARRAY[2] = 1
print(self.iARRAY) # prints <SynchronizedArray wrapper for <multiprocessing.sharedctypes.c_int_Array_3 object at 0x7f9cfa8538c8>>
And I can't find a method to access and check the values (the attribute "value" gives an unknown method error)
Another major difference from that code is the prevention of data copying using the get_obj().
Isn't this a numpy issue?
assert np.allclose(((-1)**M)*tonumpyarray(shared_arr), arr_orig)
Not sure how to make use of that.
def worker(self,item):
with self.pLOCK:
self.iVALUE.value += 1
self.sARRAY.value[self.iVALUE.value] = item # value: 'item 1'
with self.iARRAY.get_lock():
arr = self.iARRAY.get_obj()
arr[self.iVALUE.value] = 2 # and now ???
sleep(0.5)
...
with self.pLOCK:
self.iVALUE.value -= 1
UPDATE 6
I've tried using multiprocessing.Process() instead of Pool() but the result is the same.
correct way to declare the global variable (in this case class attribute)
iARRAY = multiprocessing.Array(ctypes.c_int, range(3))
correct way to set value
self.iARRAY[n] = x
correct way to get value
self.iARRAY[n]
Not sure why the examples I've seen used Array(ctypes.c_int, 3) and iARRAY.value[n], but in this case that was wrong.
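To tie the updates together, here is a minimal sketch (my own condensed version, not the original code) of the pattern that works, assuming a fork start method such as on Linux so that the class attributes are inherited:
import multiprocessing
import ctypes
from time import sleep
class MainProcess:
    # class attributes are inherited by the forked pool workers
    iVALUE = multiprocessing.Value('i', -1)
    iARRAY = multiprocessing.Array(ctypes.c_int, range(3))
    pLOCK = multiprocessing.Lock()
    def InitProcess(self):
        with multiprocessing.Pool(3) as pool:
            results = [pool.apply_async(self.worker, (n,)) for n in range(3)]
            for r in results:
                r.get()                  # wait for the workers before reading
            print(self.iARRAY[:])        # e.g. [2, 2, 2] -- now visible in the parent
            print(self.iVALUE.value)
    def worker(self, item):
        with self.pLOCK:
            self.iVALUE.value += 1
            slot = self.iVALUE.value
        self.iARRAY[slot] = 2            # index the Array directly, no .value
        sleep(0.5)
if __name__ == '__main__':
    MainProcess().InitProcess()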
This is your problem:
while True:
try:
pool.apply_async(self.worker, (next(item),))
except StopIteration:
pass
print(self.sARRAY) # yields [None][None][None]
print(self.iARRAY) # yields [None][None][None]
print(self.iVALUE) # yields 1-3
The function pool.apply_async() hands the task to a pool worker and returns immediately. You don't appear to be waiting for the workers to finish before reading the shared values. For that, you might consider using a barrier.
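One hedged way to do that waiting, keeping apply_async from the question, is to hold on to the AsyncResult objects and close/join the pool before reading the shared values, for example:
results = []
for it in items:                         # items as in the question
    results.append(pool.apply_async(self.worker, (it,)))
pool.close()                             # no more tasks will be submitted
pool.join()                              # block until every worker has finished
print(self.iARRAY[:])                    # only now is the shared array worth inspecting
print(self.iVALUE.value)
If the point is to watch progress while the workers are still running (as the updates suggest), the parent can instead poll self.iARRAY[:] in a loop with a short sleep until every AsyncResult reports ready().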
Related
I'm a noob with multiprocessing, and I'm trying to speed up an old algorithm of mine. It works perfectly fine without multiprocessing, but the moment I try to implement it, the program stops working: it stands by until I abort the script.
Another issue is that it doesn't populate the dataframe: again, normally it works, but with multiprocessing it returns only NaN.
func works well.
stockUniverse = list(map(lambda s: s.strip(), Stocks)) #Stocks = list
def func(i):
df.at[i, 'A'] = 1
df.at[i, 'B'] = 2
df.at[i, 'C'] = 3
print(i, 'downloaded')
return True
if __name__ == "__main__":
print('Start')
pool = mp.Pool(mp.cpu_count())
pool.imap(func, stockUniverse)
print(df)
the result is:
Index 19 NaN NaN NaN
index 20 NaN NaN NaN
And then it stops there until I hit Ctrl+C.
Thanks
The map function blocks until all the submitted tasks have completed and returns a list of the return values from the worker function. But the imap function returns immediately with an iterator that must be iterated to return the return values one by one as each becomes available. Your original code did not iterate that iterator but instead immediately printed out what it expected to be the updated df. But you would not have given the tasks enough time to start and complete for df to have been modified. In theory, if you had inserted a call to time.sleep before the print statement for a sufficiently long time, then the tasks would have started and completed before you printed out df. But clearly iterating the iterator is the most efficient way of being sure all tasks have completed, and the only way of getting return values back.
But, as I mentioned in my comment, you have a much bigger problem. The tasks you submitted are executed by worker function func being called by processes in the process pool that you created, which are each executing in their own address space. You did not tag your question with the platform on which you are running (whenever you tag a question with multiprocessing, you are supposed to also tag the question with the platform), but I might infer that you are running under a platform that uses the spawn method to create new processes, such as Windows, and that is why you have the if __name__ == "__main__": block controlling code that creates new processes (i.e. the processing pool). When spawn is used to create new processes, a new, empty address space is created, a new Python interpreter is launched and the source is re-executed from the top (without the if __name__ == "__main__": block controlling code that creates new processes, you would get into an infinite, recursive loop creating new processes). But this means that any definition of df at global scope made outside the if __name__ == "__main__": block (which you must have omitted if you are running under Windows) will create a new, separate instance for each process in the pool as each process is created.
If you are instead running under Linux, where fork is used to create new processes, the story is a bit different. The new processes will inherit the original address space from the main process and all declared variables, but copy on write is used. That means that once a subprocess attempts to modify any variable in this inherited storage, a copy of the page is made and the process will now be working on its own copy. So again, nothing can be shared for updating purposes.
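A tiny self-contained demonstration of that point (the names are mine): a module-level counter bumped in the workers stays untouched in the parent, whether the processes were spawned or forked.
import multiprocessing as mp
counter = 0  # module-level state; every process ends up with its own copy
def bump(_):
    global counter
    counter += 1
    return counter          # each worker reports its own private count
if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(list(pool.map(bump, range(4))))  # e.g. [1, 1, 2, 2]
    print(counter)          # still 0 in the parent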
You should therefore modify your program to have your worker function return values back to the main process, which will do the necessary updating:
import multiprocessing as mp
import pandas as pd
def func(stock):
return (stock, (('A', 1), ('B', 1), ('C', 1)))
if __name__ == "__main__":
stockUniverse = ['abc', 'def', 'ghi', 'klm']
d = {col: pd.Series(index=stockUniverse, dtype='int32') for col in ['A', 'B', 'C']}
df = pd.DataFrame(d)
pool_size = min(mp.cpu_count(), len(stockUniverse))
pool = mp.Pool(pool_size)
for result in pool.imap_unordered(func, stockUniverse):
stock, col_values = result # unpack
for col_value in col_values:
col, value = col_value # unpack
df.at[stock, col] = value
print(df)
Prints:
A B C
abc 1 1 1
def 1 1 1
ghi 1 1 1
klm 1 1 1
Note that I have used imap_unordered instead of imap. The former method is allowed to return the results in arbitrary order (i.e. as they become available) and is generally more efficient. Since the return value contains all the information required for setting the correct row of df, we no longer require any specific ordering.
But:
If your worker function is doing largely nothing but downloading from a website and very little CPU-intensive processing, then you could (should) be using a thread pool by making the simple substitution of:
from multiprocessing.pool import ThreadPool
...
MAX_THREADS_TO_USE = 100 # or maybe even larger!!!
pool_size = min(MAX_THREADS_TO_USE, len(stockUniverse))
pool = ThreadPool(pool_size)
And since all threads share the same address space, you could use your original worker function, func as is!
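As a hedged sketch of that substitution (the stock symbols and values are placeholders), the in-place update now works because every thread sees the one shared df; pandas gives no hard thread-safety guarantees, but scalar .at writes to distinct, pre-existing rows are generally fine:
import pandas as pd
from multiprocessing.pool import ThreadPool
def func(stock):
    # runs in a thread of the main process, so it mutates the shared df
    df.at[stock, 'A'] = 1
    df.at[stock, 'B'] = 2
    df.at[stock, 'C'] = 3
    return True
if __name__ == "__main__":
    stockUniverse = ['abc', 'def', 'ghi', 'klm']
    df = pd.DataFrame(index=stockUniverse, columns=['A', 'B', 'C'], dtype='float64')
    MAX_THREADS_TO_USE = 100
    pool_size = min(MAX_THREADS_TO_USE, len(stockUniverse))
    with ThreadPool(pool_size) as pool:
        list(pool.imap_unordered(func, stockUniverse))  # iterate so every task finishes
    print(df)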
First, let me show you the current setup I have:
import multiprocessing.pool
from contextlib import closing
import os
def big_function(param):
process(another_module.global_variable[param])
def dispatcher():
# sharing a read-only global variable, taking benefit from Unix
# which follows a copy-on-write policy
# https://stackoverflow.com/questions/19366259/
another_module.global_variable = huge_list
# send indices
params = range(len(another_module.global_variable))
with closing(multiprocessing.pool.Pool(processes=os.cpu_count())) as p:
multiprocessing_result = list(p.imap_unordered(big_function, params))
return multiprocessing_result
Here I use a shared variable, updated before creating the process pool, which contains huge data; that indeed gained me a speedup, so it seems not to be pickled now. Also, this variable belongs to the scope of an imported module (if that's important).
When I tried to create a setup like this:
another_module.global_variable = []
p = multiprocessing.pool.Pool(processes=os.cpu_count())
def dispatcher():
# sharing a read-only global variable, taking benefit from Unix
# which follows a copy-on-write policy
# https://stackoverflow.com/questions/19366259/
another_module.global_variable = huge_list
# send indices
params = range(len(another_module.global_variable))
multiprocessing_result = list(p.imap_unordered(big_function, params))
return multiprocessing_result
p "remembered" that global shared list was empty and refused to use new data when was called from inside the dispatcher.
Now here is the problem: processing ~600 data objects on 8 cores with the first setup above, my parallel computation takes 8 sec, while single-threaded it takes 12 sec.
This is what I think: as long as multiprocessing pickles data, and I need to re-create processes each time, I need to pickle the function big_function(), so I lose time on that. The situation with the data was partially solved using a global variable (but I still need to recreate the pool on each update of it).
What can I do with instances of big_function() (which depends on many other functions from other modules, numpy, etc.)? Can I create os.cpu_count() copies of it once and for all, and somehow feed new data into them and receive results, reusing the workers?
Just to go over the 'remembering' issue:
another_module.global_variable = []
p = multiprocessing.pool.Pool(processes=os.cpu_count())
def dispatcher():
another_module.global_variable = huge_list
params = range(len(another_module.global_variable))
multiprocessing_result = list(p.imap_unordered(big_function, params))
return multiprocessing_result
What seems to be the problem is the point at which you are creating the Pool instance.
Why is that?
It's because when you create an instance of Pool, it sets up the number of workers (by default equal to the number of CPU cores) and they are all started (forked) at that time. That means the workers have a copy of the parent's global state (another_module.global_variable among everything else), and, with the copy-on-write policy, when you later update the value of another_module.global_variable you change it in the parent's process. The workers still have a reference to the old value. That is why you have a problem with it.
Here are a couple of links that can give you more explanation on this: this and this.
Here is a small snippet where you can switch the lines where the global variable value is changed and where the process is started, and check what is printed in the child process.
from __future__ import print_function
import multiprocessing as mp
glob = dict()
glob[0] = [1, 2, 3]
def printer(a):
print(globals())
print(a, glob[0])
if __name__ == '__main__':
p = mp.Process(target=printer, args=(1,))
p.start()
glob[0] = 'test'
p.join()
This is Python 2.7 code, but it works on Python 3.6 too.
What would be the solution for this issue?
Well, go back to the first solution: update the value of the imported module's variable and then create the pool of processes.
Now for the real issue: the lack of speedup.
Here is the interesting part from the documentation on how functions are pickled:
Note that functions (built-in and user-defined) are pickled by “fully
qualified” name reference, not by value. This means that only the
function name is pickled, along with the name of the module the
function is defined in. Neither the function’s code, nor any of its
function attributes are pickled. Thus the defining module must be
importable in the unpickling environment, and the module must contain
the named object, otherwise an exception will be raised.
This means that pickling your function should not be a time-wasting process, or at least not by itself. What causes the lack of speedup is that for the ~600 data objects in the list that you pass to the imap_unordered call, you pass each one of them to a worker process. Once again, the underlying implementation of multiprocessing.Pool may be the cause of this issue.
If you go deeper into the multiprocessing.Pool implementation, you will see that two threads using a Queue handle the communication between the parent and all the child (worker) processes. Because of this, and because all the processes constantly require arguments for the function and constantly return responses, you end up with a very busy parent process. That is why a lot of time is spent doing the 'dispatching' work of passing data to and from the worker processes.
What to do about this?
Try to increase the number of data objects that are processed in a worker process at any one time. In your example, you pass one data object after another, and you can be sure that each worker process is processing exactly one data object at a time. Why not increase the number of data objects you pass to a worker process? That way you can make each process busier, processing 10, 20 or even more data objects. From what I can see, imap_unordered has a chunksize argument. It's set to 1 by default. Try increasing it. Something like this:
import multiprocessing.pool
from contextlib import closing
import os
def big_function(param):
    # chunksize only batches how tasks are shipped to the workers;
    # each call of big_function still receives a single index
    return process(another_module.global_variable[param])
def dispatcher():
# sharing a read-only global variable, taking benefit from Unix
# which follows a copy-on-write policy
# https://stackoverflow.com/questions/19366259/
another_module.global_variable = huge_list
# send indices
params = range(len(another_module.global_variable))
with closing(multiprocessing.pool.Pool(processes=os.cpu_count())) as p:
multiprocessing_result = list(p.imap_unordered(big_function, params, chunksize=10))
return multiprocessing_result
Couple of advices:
I see that you create params as a list of indexes that you use to pick a particular data object in big_function. Instead, you could create tuples that represent the first and last index of a slice and pass those to big_function. This is another way of increasing the chunk of work, and an alternative to the chunksize approach above (see the sketch after this list).
Unless you explicitly want Pool(processes=os.cpu_count()), you can omit it; by default it uses the number of CPU cores.
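A rough sketch of that (first, last)-range idea, reusing the hypothetical process, another_module and huge_list from the question:
import multiprocessing.pool
from contextlib import closing
import os
def make_ranges(n_items, n_chunks):
    # split 0..n_items into roughly equal (first, last) half-open ranges
    step = (n_items + n_chunks - 1) // n_chunks
    return [(i, min(i + step, n_items)) for i in range(0, n_items, step)]
def big_function(index_range):
    first, last = index_range
    return [process(another_module.global_variable[i]) for i in range(first, last)]
def dispatcher():
    another_module.global_variable = huge_list
    params = make_ranges(len(another_module.global_variable), os.cpu_count() * 4)
    with closing(multiprocessing.pool.Pool()) as p:
        chunked = list(p.imap_unordered(big_function, params))
    # flatten the per-range result lists back into one flat list
    return [r for chunk in chunked for r in chunk]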
Sorry for the length of the answer or any typos that might have sneaked in.
I would like to find a mechanism to easily report the progress of a Python thread. For example, if my thread had a counter, I would like to know the value of the counter once in a while, but, importantly, I only need to know the latest value, not every value that's ever gone by.
What I imagine to be the simplest solution is a single-value Queue, where every time I put a new value on it in the thread, it replaces the old value with the new one. Then when I do a get in the main program, it would only return the latest value.
Because I don't know how to do the above, what I do instead is put every counter value in a queue, and when I get, I read all the values until there are no more and just keep the last. But this seems far from ideal, in that I'm filling the queue with thousands of values that I don't care about.
Here's an example of what I do now:
from threading import Thread
from Queue import Queue, Empty
from time import sleep
N = 1000
def fast(q):
count = 0
while count<N:
sleep(.02)
count += 1
q.put(count)
def slow(q):
while 1:
sleep(5) # sleep for a long time
# read last item in queue
val = None
while 1: # read all elements of queue, only saving last
try:
val = q.get(block=False)
except Empty:
break
print val # the last element read from the queue
if val==N:
break
if __name__=="__main__":
q = Queue()
fast_thread = Thread(target=fast, args=(q,))
fast_thread.start()
slow(q)
fast_thread.join()
My question is, is there a better approach?
Just use a global variable and a threading.Lock to protect it during assignments:
import threading
from time import sleep
N = 1000
value = 0
def fast(lock):
global value
count = 0
while count<N:
sleep(.02)
count += 1
with lock:
value = count
def slow():
while 1:
sleep(5) # sleep for a long time
print value # read current value
if value == N:
break
if __name__=="__main__":
lock = threading.Lock()
fast_thread = threading.Thread(target=fast, args=(lock,))
fast_thread.start()
slow()
fast_thread.join()
yields (something like)
249
498
747
997
1000
As Don Question points out, if there is only one thread modifying value, then
actually no lock is needed in the fast function. And as dano points out, if you want to
ensure that the value printed in slow is the same value used in the
if-statement, then a lock is needed in the slow function.
For more on when locks are needed, see Thread Synchronization Mechanisms in Python.
Just use a deque with a maximum length of 1. It will just keep your latest value.
So, instead of:
q = Queue()
use:
from collections import deque
q = deque(maxlen=1)
To read from the deque, there's no get method, so you'll have to do something like:
val = None
try:
val = q[0]
except IndexError:
pass
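Adapted to the code in the question, a minimal sketch might look like this (the worker appends to the deque and the reader just peeks at index 0):
from collections import deque
from threading import Thread
from time import sleep
N = 1000
q = deque(maxlen=1)          # only ever holds the most recent value
def fast(q):
    count = 0
    while count < N:
        sleep(.02)
        count += 1
        q.append(count)      # silently evicts the previous value
def slow(q):
    while 1:
        sleep(5)
        try:
            val = q[0]       # peek at the latest value
        except IndexError:
            val = None
        print(val)
        if val == N:
            break
if __name__ == "__main__":
    fast_thread = Thread(target=fast, args=(q,))
    fast_thread.start()
    slow(q)
    fast_thread.join()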
In your special case, you may be over-complicating the issue. If your variable is just some kind of progress indicator of a single thread, and only that thread actually changes the variable, then it's completely safe to use a shared object to communicate the progress, as long as all other threads only read.
I guess we've all read too many (rightful) warnings about race conditions and other pitfalls of shared state in concurrent programming, so we tend to overthink and add more precaution than is sometimes needed.
You could basically share a pre-constructed dict:
thread_progress = dict.fromkeys(list_of_threads, progress_start_value)
or manually:
thread_progress = {thread: progress_value, ...}
without further precaution, as long as no thread changes the dict keys.
This way you can track the progress of multiple threads over one dict. The only condition is not to change the dict once the threading has started. That means the dict must contain all threads BEFORE the first child thread starts; otherwise you must use a Lock before writing to the dict. By "changing the dict" I mean all operations regarding the keys. You may change the associated values of a key, because that's at the next level of indirection.
Update:
The underlying problem is the shared state, which is already a problem in linear programs, but a nightmare in concurrent ones.
For example: imagine a global (shared) variable sv and two functions G(ood) and B(ad) in a linear program. Both functions calculate a result depending on sv, but B unintentionally changes sv. Now you are wondering why the heck G doesn't do what it should, despite not finding any error in G, even after you tested it in isolation and it was perfectly fine.
Now imagine the same scenario in a concurrent program, with two threads A and B. Both threads increment the shared state/variable sv by one.
without locking (current value of sv in parenthesis):
sv = 0
A reads sv (0)
B reads sv (0)
A inc sv (0)
B inc sv (0)
A writes sv (1)
B writes sv (1)
sv == 1 # should be 2!
Finding the source of the problem is a pure nightmare, because it can also succeed sometimes! More often than not, A would actually finish before B even starts to read sv, so the problem appears non-deterministic or erratic and is even harder to find. In contrast to my linear example, both threads are "good", but they nevertheless don't behave as intended.
with locking:
sv = 0
l = lock (for access on sv)
A tries to acquire lock for sv -> success (0)
B tries to acquire lock for sv -> failure, blocked by A (0)
A reads sv (0)
B blocked (0)
A inc sv (0)
B blocked (0)
A writes sv (1)
B blocked (1)
A releases lock on sv (1)
B tries to acquire lock for sv -> success (1)
...
sv == 2
I hope my little example explained the underlying problem of accessing a shared state, and why making write operations (including the read part of the read-modify-write) atomic through locking is necessary.
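For the record, here is a small runnable sketch of the same interleaving (with current CPython you may need many iterations or more threads before the unlocked version actually loses updates, so treat the numbers as illustrative):
import threading
sv = 0
lock = threading.Lock()
def inc(n, use_lock):
    global sv
    for _ in range(n):
        if use_lock:
            with lock:
                sv += 1      # read-modify-write protected by the lock
        else:
            sv += 1          # read-modify-write can interleave between threads
def run(use_lock):
    global sv
    sv = 0
    threads = [threading.Thread(target=inc, args=(500000, use_lock)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sv
print(run(use_lock=False))   # often less than 2000000 (lost updates)
print(run(use_lock=True))    # always 2000000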
Regarding my advice of a pre-initialized dict: this is a mere precaution, for two reasons:
If you iterate over the threads in a for-loop, the loop may raise an exception if a thread adds or removes an entry to/from the dict while you are still in the loop, because it is then unclear what the next key should be.
Thread A reads the dict and gets interrupted by thread B, which adds an entry and finishes. Thread A resumes, but doesn't have the changes thread B made, and writes the pre-B dict back together with its own changes. Thread B's changes are lost.
BTW, my proposed solution wouldn't work as written, because of the immutability of the primitive types. But this can easily be fixed by making them mutable, e.g. by encapsulating them in a list or a special Progress object, or even simpler: give the thread function access to the thread_progress dict.
Explanation by example:
t = Thread()
progress = 0 # progress points to the object `0`
dict[t] = progress # dict[t] points now to object `0`
progress = 1 # progress points to object `1`
dict[t] # dict[t] still points to object `0`
better:
t = Thread()
t.progress = 0
dict[thread_id] = t
t.progress = 1
dict[thread_id].progress == 1
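A runnable variant of that last idea, storing the progress directly on the thread object (the names and sleep times are mine):
import threading
import time
def work(n):
    t = threading.current_thread()
    for i in range(n):
        time.sleep(0.01)
        t.progress = i + 1               # mutate an attribute on the thread object
threads = {}
for k in range(3):
    t = threading.Thread(target=work, args=(100,))
    t.progress = 0
    threads[k] = t                       # dict fully populated before any thread starts
for t in threads.values():
    t.start()
while any(t.is_alive() for t in threads.values()):
    time.sleep(0.5)
    print({k: t.progress for k, t in threads.items()})
for t in threads.values():
    t.join()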
I am using the multiprocessing module in python. Here is a sample of the code I am using:
import multiprocessing as mp
def function(fun_var1, fun_var2):
b = fun_var1 + fun_var2
# and more computationally intensive stuff happens here
return b
# my program freezes after the return command
class Worker(mp.Process):
def __init__(self, queue_obj, func_var1, func_var2):
mp.Process.__init__(self)
self.queue_obj = queue_obj
self.func_var1 = func_var1
self.func_var2 = func_var2
def run(self):
self.var = function( self.func_var1, self.func_var2 )
self.queue_obj.put(self.var)
if __name__ == '__main__':
mp.freeze_support()
queue_list = []
processes = []
result = []
for i in range(2):
queue_list.append(mp.Queue())
processes.append(Worker(queue_list[i], var1, var2))
processes[i].start()
for i in range(2):
processes[i].join()
result.append(queue_list[i].get())
During runtime of the program two instances of the Worker class are generated, which work simultaneously. One instance finishes after about 2 minutes and the other takes about 7 minutes. The first instance returns its results fine. However, the second instance freezes the program when the function() that is called within the run() method returns its value. No error is thrown, the program just does not continue to execute. The console also indicates that it is busy but does not display the >>> prompt.
I am completely clueless why this behavior occurs. The same code works fine for slightly different inputs in the two Worker instances. The only difference I can make out is that the workloads are more equal when it executes correctly. Could the time difference cause trouble? Does anyone have experience with this kind of behavior? Also note that if I run a serial setup of the program in which function() is just called twice by the main program, the code executes flawlessly.
Could there be some timeout involved in the Worker instance that makes it impossible for function() to return its value? The return value of function() is actually a list that is fairly small: it contains about 100 float values.
Any suggestions are welcomed!
This is a bit of an educated guess without actually seeing what's going on in your worker, but is it possible that your child process has put items on the Queue that haven't been consumed? The documentation has a warning about this:
Warning
As mentioned above, if a child process has put items on a queue (and
it has not used JoinableQueue.cancel_join_thread), then that process
will not terminate until all buffered items have been flushed to the
pipe.
This means that if you try joining that process you may get a deadlock
unless you are sure that all items which have been put on the queue
have been consumed. Similarly, if the child process is non-daemonic
then the parent process may hang on exit when it tries to join all its
non-daemonic children.
Note that a queue created using a manager does not have this issue.
See Programming guidelines.
It might be worth trying to create your Queue objects using mp.Manager().Queue() and seeing if the issue goes away.
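A hedged sketch of that suggestion against the main block from the question (var1 and var2 stay the placeholders they already are); alternatively, simply draining each queue before calling join() has the same effect:
if __name__ == '__main__':
    mp.freeze_support()
    manager = mp.Manager()
    queue_list = []
    processes = []
    result = []
    for i in range(2):
        queue_list.append(manager.Queue())     # managed queue avoids the flush deadlock
        processes.append(Worker(queue_list[i], var1, var2))
        processes[i].start()
    for i in range(2):
        processes[i].join()
        result.append(queue_list[i].get())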
I was wondering if it would be possible to start a new thread and update its argument when this argument gets a new value in the main part of the program, so something like this:
i = 0
def foo(i):
print i
time.sleep(5)
thread.start_new_thread(foo,(i,))
while True:
i = i+1
Thanks a lot for any help!
An argument is just a value, like anything else. Passing the value just makes a new reference to the same value, and if you mutate that value, every reference will see it.
The fact that both the global variable and the function parameter have the same name isn't relevant here, and is a little confusing, so I'm going to rename one of them. Also, your foo function only does that print once (possibly before you even increment the value), then sleeps for 5 seconds, then finishes. You probably wanted a loop there; otherwise, you can't actually tell whether things are working or not.
So, here's an example:
i = []
def foo(j):
while True:
print j
time.sleep(5)
thread.start_new_thread(foo,(i,))
while True:
i.append(1)
So, why doesn't your code work? Well, i = i+1 isn't mutating the value 0, it's assigning a new value, 0 + 1, to i. The foo function still has a reference to the old value, 0, which is unchanged.
Since integers are immutable, you can't directly solve this problem. But you can indirectly solve it very easily: replace the integer with some kind of wrapper that is mutable.
For example, you can write an IntegerHolder class with set and get methods; when you call i.set(i.get() + 1), and the other reference does i.get(), it will see the new value.
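A minimal sketch of such a holder (this class is hypothetical, not from any library; the internal lock only makes the individual get and set calls safe, not the whole read-increment-write sequence):
import threading
class IntegerHolder(object):
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()
    def get(self):
        with self._lock:
            return self._value
    def set(self, value):
        with self._lock:
            self._value = value
i = IntegerHolder(0)
i.set(i.get() + 1)      # every reference to this holder now sees 1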
Or you can just use a list as a holder. Lists are mutable, and hold zero or more elements. When you do i[0] = i[0] + 1, that replaces i[0] with a new integer value, but i is still the same list value, and that's what the other reference is pointing at. So:
i = [0]
def foo(j):
    while True:
        print j[0]
        time.sleep(5)
thread.start_new_thread(foo,(i,))
while True:
i[0] = i[0]+1
This may seem a little hacky, but it's actually a pretty common Python idiom.
Meanwhile, the fact that foo is running in another thread creates another problem.
In theory, threads run simultaneously, and there's no ordering of any data accesses between them. Your main thread could be running on core 0, and working on a copy of i that's in core 0's cache, while your foo thread is running on core 1, and working on a different copy of i that's in core 1's cache, and there is nothing in your code to force the caches to get synchronized.
In practice, you will often get away with this, especially in CPython. But to actually know when you can get away with it, you have to learn how the Global Interpreter Lock works, and how the interpreter handles variables, and (in some cases) even how your platform's cache coherency and your C implementation's memory model and so on work. So, you shouldn't rely on it. The right thing to do is to use some kind of synchronization mechanism to guard access to i.
As a side note, you should also almost never use thread instead of threading, so I'm going to switch that as well.
import threading
import time
i = []
lock = threading.Lock()
def foo(j):
while True:
with lock:
print j[0]
time.sleep(5)
t = threading.Thread(target=foo, args=(i,))
t.start()
while True:
with lock:
i[0] = i[0]+1
One last thing: If you create a thread, you need to join it later, or you can't quit cleanly. But your foo thread never exits, so if you try to join it, you'll just block forever.
For simple cases like this, there's a simple solution. Before calling t.start(), do t.daemon = True. This means when your main thread quits, the background thread will be automatically killed at some arbitrary point. That's obviously a bad thing if it's, say, writing to a file or a database. But in your case, it's not doing anything persistent or dangerous.
For more realistic cases, you generally want to create some way to signal between the two threads. Often you've already got something for the thread to wait on—a Queue, a file object or collection of them (via select), etc. If not, just create a flag variable protected by a lock (or condition or whatever is appropriate).
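For example, one hedged way to do that signalling is a threading.Event, which is essentially the flag-plus-lock idea packaged up (the per-value lock from the previous snippet is omitted here for brevity):
import threading
import time
stop = threading.Event()     # flag the main thread sets when it wants the worker to exit
def foo(j):
    while not stop.is_set():
        print(j[0])
        time.sleep(5)
i = [0]
t = threading.Thread(target=foo, args=(i,))
t.start()
try:
    for _ in range(100):
        i[0] = i[0] + 1
        time.sleep(0.1)
finally:
    stop.set()               # tell the worker to stop looping
    t.join()                 # join now returns instead of blocking forever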
Try globals.
i = 0
def foo():
print i
time.sleep(5)
thread.start_new_thread(foo,())
while True:
i = i+1
You could also pass a dict holding the variables you need.
args = {'i' : 0}
def foo(args):
print args['i']
time.sleep(5)
thread.start_new_thread(foo,(args,))
while True:
args['i'] = args['i'] + 1
You might also want to use a thread lock.
import thread
lock = thread.allocate_lock()
args = {'i' : 0}
def foo(args):
with lock:
print args['i']
time.sleep(5)
thread.start_new_thread(foo,(args,))
while True:
with lock:
args['i'] = args['i'] + 1
Hope this helped.