Python: update argument in thread - python

I was wondering if it would be possible to start a new thread and update its argument when this argument gets a new value in the main of the program, so something like this:
i = 0
def foo(i):
    print i
    time.sleep(5)
thread.start_new_thread(foo,(i,))
while True:
    i = i+1
Thanks a lot for any help!

An argument is just a value, like anything else. Passing the value just makes a new reference to the same value, and if you mutate that value, every reference will see it.
The fact that both the global variable and the function parameter have the same name isn't relevant here, and is a little confusing, so I'm going to rename one of them. Also, your foo function only does that print once (possibly before you even increment the value), then sleeps for 5 seconds, then finishes. You probably wanted a loop there; otherwise, you can't actually tell whether things are working or not.
So, here's an example:
i = []
def foo(j):
    while True:
        print j
        time.sleep(5)
thread.start_new_thread(foo,(i,))
while True:
    i.append(1)
So, why doesn't your code work? Well, i = i+1 isn't mutating the value 0, it's assigning a new value, 0 + 1, to i. The foo function still has a reference to the old value, 0, which is unchanged.
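The difference between rebinding a name and mutating a value can be seen without any threads at all. The following is a minimal Python 3 sketch; the names j, items and alias are made up for illustration.

```python
# Rebinding: `i = i + 1` points the name i at a new int object.
i = 0
j = i              # j refers to the same int object as i
i = i + 1          # i now refers to a different object; j is untouched
rebinding_result = (i, j)

# Mutating: append changes the one list object that both names refer to.
items = []
alias = items      # alias refers to the same list object as items
items.append(1)    # the shared object changes, so alias sees it too
mutation_result = (items, alias, alias is items)

print(rebinding_result)
print(mutation_result)
```

The function parameter in foo plays the role of j and alias here: it only sees changes made to the object it refers to, not later assignments to the global name.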
Since integers are immutable, you can't directly solve this problem. But you can indirectly solve it very easily: replace the integer with some kind of wrapper that is mutable.
For example, you can write an IntegerHolder class with set and get methods; when you i.set(i.get() + 1), and the other reference does i.get(), it will see the new value.
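A minimal Python 3 sketch of such a holder might look like the following. The internal lock is an addition of this sketch rather than part of the description above, and the reader function and seen list are hypothetical names used only to show another thread observing the update.

```python
import threading

class IntegerHolder:
    """Mutable wrapper around an int; the lock is this sketch's own addition."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def set(self, value):
        with self._lock:
            self._value = value

def reader(holder, seen):
    # The thread holds a reference to the same holder object as the caller.
    seen.append(holder.get())

i = IntegerHolder(0)
i.set(i.get() + 1)          # replaces the wrapped value, not the holder itself
seen = []
t = threading.Thread(target=reader, args=(i, seen))
t.start()
t.join()
print(seen)
```

Note that i.set(i.get() + 1) is still two separate steps, so with several writers an increment method that does both under one lock acquisition would be safer.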
Or you can just use a list as a holder. Lists are mutable, and hold zero or more elements. When you do i[0] = i[0] + 1, that replaces i[0] with a new integer value, but i is still the same list value, and that's what the other reference is pointing at. So:
i = [0]
def foo(j):
    print j[0]
    time.sleep(5)
thread.start_new_thread(foo,(i,))
while True:
    i[0] = i[0]+1
This may seem a little hacky, but it's actually a pretty common Python idiom.
Meanwhile, the fact that foo is running in another thread creates another problem.
In theory, threads run simultaneously, and there's no ordering of any data accesses between them. Your main thread could be running on core 0, and working on a copy of i that's in core 0's cache, while your foo thread is running on core 1, and working on a different copy of i that's in core 1's cache, and there is nothing in your code to force the caches to get synchronized.
In practice, you will often get away with this, especially in CPython. But to actually know when you can get away with it, you have to learn how the Global Interpreter Lock works, and how the interpreter handles variables, and (in some cases) even how your platform's cache coherency and your C implementation's memory model and so on work. So, you shouldn't rely on it. The right thing to do is to use some kind of synchronization mechanism to guard access to i.
As a side note, you should also almost never use thread instead of threading, so I'm going to switch that as well.
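import threading
import time
i = [0]  # start with one element so that i[0] exists
lock = threading.Lock()
def foo(j):
    while True:
        with lock:
            print j[0]
        time.sleep(5)  # sleep outside the lock so the main thread can keep updating
t = threading.Thread(target=foo, args=(i,))
t.start()
while True:
    with lock:
        i[0] = i[0]+1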
One last thing: If you create a thread, you need to join it later, or you can't quit cleanly. But your foo thread never exits, so if you try to join it, you'll just block forever.
For simple cases like this, there's a simple solution. Before calling t.start(), do t.daemon = True. This means when your main thread quits, the background thread will be automatically killed at some arbitrary point. That's obviously a bad thing if it's, say, writing to a file or a database. But in your case, it's not doing anything persistent or dangerous.
For more realistic cases, you generally want to create some way to signal between the two threads. Often you've already got something for the thread to wait on—a Queue, a file object or collection of them (via select), etc. If not, just create a flag variable protected by a lock (or condition or whatever is appropriate).
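As one possible way to signal between the two threads, here is a Python 3 sketch that uses threading.Event as the stop flag instead of a hand-rolled flag plus lock. The name stop_requested and the seen list are hypothetical, and the main loop is bounded so that the example terminates.

```python
import threading

stop_requested = threading.Event()   # hypothetical name for the stop flag
lock = threading.Lock()
i = [0]
seen = []

def foo(j):
    # Keep sampling the shared value until the main thread asks us to stop.
    while not stop_requested.is_set():
        with lock:
            seen.append(j[0])
        # Sleep, but wake up immediately if the flag gets set.
        stop_requested.wait(0.01)

t = threading.Thread(target=foo, args=(i,))
t.start()
for _ in range(1000):
    with lock:
        i[0] = i[0] + 1
stop_requested.set()   # signal the worker to finish
t.join()               # returns promptly because foo exits its loop
print(i[0], t.is_alive())
```

Because foo leaves its loop on its own, the thread can be joined normally and does not need to be a daemon.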

Try globals.
i = 0
def foo():
    print i
    time.sleep(5)
thread.start_new_thread(foo,())
while True:
    i = i+1
You could also pass a hash holding the variables you need.
args = {'i' : 0}
def foo(args):
    print args['i']
    time.sleep(5)
thread.start_new_thread(foo,(args,))
while True:
    args['i'] = args['i'] + 1
You might also want to use a thread lock.
import thread
import time
lock = thread.allocate_lock()
args = {'i' : 0}
def foo(args):
    with lock:
        print args['i']
    time.sleep(5)
thread.start_new_thread(foo,(args,))
while True:
    with lock:
        args['i'] = args['i'] + 1
Hope this helps.

Related

Does the "print" function run in a new subprocess or something similar?

I'm using praw to get new submissions from reddit:
for submission in submissions:
    print('Submission being evaluated:', submission.id)
    p = Process(target = evaluate, args = (submission.id, lock))
    p.start()
When using this code I sometimes get ids that link to older submissions.
So I changed my script to check if the submissions are new:
for submission in submissions:
    if ((time.time()-submission.created) < 15): #if submission is new
        lock.acquire()
        print('Submission being evaluated:', submission.id)
        lock.release()
        p = Process(target = evaluate, args = (submission.id, lock))
        p.start()
    else:
        lock.acquire()
        print("Submission "+submission.id+" was older than 15 seconds")
        lock.release()
But for an extended period of time the else part didn't get executed even though I got a fair amount of old submission ids with the previous script.
So my question is, when I run print(submission.id) is it running in the background when the subprocess is created, maybe causing a problem and changing the value of submission.id or is it just some coincidence that with the second script I got no old submissions?
Thanks in advance!
To answer the question in the title, no.
However, sys.stdout, the stream print writes to, is usually line buffered and is shared between threads and subprocesses unless you explicitly arrange otherwise. The buffering shouldn't matter in this case, since print writes a newline character unless told not to.
Without knowing more about the code around this, it's hard to say more. (Who knows, maybe you have a background thread somewhere in there that sneakily changes submission.id?)
EDIT:
The new information in the original post, namely that
print('Submission being evaluated:', submission.id)
is being printed, not
print(submission.id)
is critical.
Each argument of a print() call is printed atomically, but if two processes or threads are print()ing simultaneously, let's say print('a', 'b'), it's entirely possible that you get a a b b instead of a b a b.
Here is a function I like to use for safely printing to the console. That is the proper use of Lock(): use it around very simple operations. I actually use it in a class, so I don't pass the lock object around as in the example below, but the principle is the same.
Also, the answer is likely yes, though I can't be certain. Are you also using a lock every time you read and write submission.id? Generally, if you have an object shared by multiple processes, it's best to do this, and also best to use the Value class from the multiprocessing library, since Value objects are designed to be safely shared between processes. Below is a trivial but clear example without processes (that's your job!).
from multiprocessing import Process, Lock  # ...
myLock = Lock()
myID = ""
def SetID(lock):
    # assign to the module-level myID; assigning to a parameter would only change a local name
    global myID
    with lock:
        myID = "set with lock"
def SafePrint(msg, lock):
    with lock:
        print(msg)
SetID(myLock)
SafePrint(myID, myLock)

Live detect of variable change

Is it possible to set up a loop with live variable changes? I'm using threading, and the variables can change very often in between lines.
I'm looking for something like this:
length = len(some_list)
while length == len(some_list):
    if check_something(some_list):
        # The variable could change right here for
        # example, and the next line would still be called.
        do_something(some_list)
So far I've had no luck, is this something that's possible in python?
EDIT: More what I'm looking for is something so that the loop restarts if some_list changes.
If it's just a single changing list, you can make a local copy.
def my_worker():
    my_list = some_list[:]
    if check_something(my_list):
        do_something(my_list)
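To illustrate why the copy helps, here is a small Python 3 sketch. The list_lock and the snapshot helper are assumptions of this sketch (the writers would have to take the same lock); comparing a fresh snapshot against the old one is one simple way to notice that the list changed and restart the work.

```python
import threading

some_list = [1, 2, 3]
list_lock = threading.Lock()   # hypothetical lock that writers would also use

def snapshot():
    # Copy under the lock so the copy is never taken mid-update.
    with list_lock:
        return some_list[:]

my_list = snapshot()           # private copy; later changes don't affect it

with list_lock:                # simulate another thread changing the list
    some_list.append(4)

changed = snapshot() != my_list   # True means "start over with a new copy"
print(my_list, some_list, changed)
```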
UPDATE
A queue may work for you. The code that modifies the list needs to post to the queue, so it's not automatic. There is also the risk that the background thread falls behind and processes stale data, or crashes everything if the queue exhausts memory.
import threading
import queue
import time
def worker(work_q):
    while True:
        some_list = work_q.get()
        if some_list is None:
            print('exiting')
            return
        print(some_list)
some_list = []  # the list being modified by the main thread
work_q = queue.Queue()
work_thread = threading.Thread(target=worker, args=(work_q,))
work_thread.start()
for i in range(10):
    some_list.append(i)
    work_q.put(some_list[:])  # post a copy so the worker sees a stable snapshot
    time.sleep(.2)
work_q.put(None)  # sentinel telling the worker to exit
work_thread.join()

Python3 sharing an array between parent/child processes

https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Array
What I’m trying to do
Create an array in MainProcess and send it through inheritance to any subsequent child processes. The child processes will change the array. The parent process will look out for the changes and act accordingly.
The problem
The parent process does not "see" any changes done by the child processes. However the child processes do "see" the changes. Ie if child 1 adds an item then child 2 will see that item etc
This is true for sARRAY and iARRAY, and iVALUE.
BUT
While the parent process seems to be oblivious to the array values it does take notice of the changes done to the iVALUE.
I don’t understand what I’m doing wrong.
UPDATE 2 https://stackoverflow.com/a/6455704/1267259
The main source of confusion is that multiprocessing uses separate processes and not threads. This means that any changes to object state
made by the children aren't automatically visible to the parent.
To clarify. What I want to do is possible, right?
https://stackoverflow.com/a/26554759/1267259
I mean that's the purpose with multiprocessing Array and Value, to communicate between children and parent processes? And iVALUE works so...
I’ve found this Shared Array not shared correctly in python multiprocessing
But I don’t understand the answer "Assigning to values that have meaning in all processes seems to help:"
UPDATE 1
Found
Python : multiprocessing and Array of c_char_p
> "the assignment to arr[i] points arr[i] to a memory address that was
only meaningful to the subprocess making the assignment. The other
subprocesses retrieve garbage when looking at that address."
As I understand it this doesn't apply to this problem. The assignment
by one subprocess to the array does make sense to the other
subprocesses in this case. But why doesn't it make sense for the main
process?
And I am aware of "managers" but it feels like Array should suffice for this use case. I've read the manual but obviously I don't seem to get it.
UPDATE 3 Indeed, this works
manage = multiprocessing.Manager()
manage = list(range(3))
So...
What am I doing wrong?
import multiprocessing
import ctypes
class MainProcess:
    # keep track of process
    iVALUE = multiprocessing.Value('i',-1) # this works
    # keep track of items
    sARRAY = multiprocessing.Array(ctypes.c_wchar_p, 1024) # this works between child processes
    iARRAY = multiprocessing.Array(ctypes.c_int, 3) # this works between child processes
    pLOCK = multiprocessing.Lock()
    def __init__(self):
        # create an index for each process
        self.sARRAY.value = [None] * 3
        self.iARRAY.value = [None] * 3
    def InitProcess(self):
        # list of items to process
        items = []
        item = (i for i in items)
        with(multiprocessing.Pool(3)) as pool:
            # main loop: keep looking for updated values
            while True:
                try:
                    pool.apply_async(self.worker, (next(item),), callback=eat_finished_cake)
                except StopIteration:
                    pass
                print(self.sARRAY) # yields [None][None][None]
                print(self.iARRAY) # yields [None][None][None]
                print(self.iVALUE) # yields 1-3
            pool.close()
            pool.join()
    def worker(self,item):
        with self.pLOCK:
            self.iVALUE.value += 1
        self.sARRAY.value[self.iVALUE.value] = item # value: 'item 1'
        self.iARRAY.value[self.iVALUE.value] = 2
        # on next child process run
        print(self.iVALUE.value) # prints 1
        print(self.sARRAY.value) # prints ['item 1'][None][None]
        print(self.iARRAY.value) # prints [2][None][None]
        sleep(0.5)
        ...
        with self.pLOCK:
            self.iVALUE.value -= 1
UPDATE 4
Changing
pool.apply_async(self.worker, (next(item),))
To
x = pool.apply_async(self.worker, (next(item),))
print(x.get())
Or
x = pool.apply(self.worker, (next(item),))
print(x)
And in self.worker() returning self.iARRAY.value or self.sARRAY.value does return a variable that has the updated value. This is not what I want to achieve though; this doesn't even require the use of Array to achieve...
So I need to clarify. In self.worker() I'm doing important heavy lifting that can take a long time, and I need to send information back to the main process, e.g. the progress, before the return value is ready to be sent to the callback.
I don't expect the finished result to be returned to the main method; that is to be handled by the callback function. I see now that omitting the callback in the code example could have given a different impression, sorry.
UPDATE 5
Re: Use numpy array in shared memory for multiprocessing
I've seen that answer and tried a variation of it using initializer() with a global var and passed the array through initargs, with no luck. I don't understand the use of NumPy together with "closing()", but that code doesn't seem to access "arr" inside main(), although shared_arr is used, but only after p.join().
As far as I can see the array is declared, then turned into a NumPy array and inherited through init(x). My code should have the same behavior as that code so far.
One major difference seems to be how the array is accessed
I've only succeeded setting and getting array value using the attribute value, when I tried
self.iARRAY[0] = 1 # instead of iARRAY.value = [None] * 3
self.iARRAY[1] = 1
self.iARRAY[2] = 1
print(self.iARRAY) # prints <SynchronizedArray wrapper for <multiprocessing.sharedctypes.c_int_Array_3 object at 0x7f9cfa8538c8>>
And I can't find a method to access and check the values (the attribute "value" gives an unknown method error)
Another major difference from that code is the prevention of data copying using the get_obj().
Isn't this a NumPy issue?
assert np.allclose(((-1)**M)*tonumpyarray(shared_arr), arr_orig)
Not sure how to make use of that.
def worker(self,item):
    with self.pLOCK:
        self.iVALUE.value += 1
    self.sARRAY.value[self.iVALUE.value] = item # value: 'item 1'
    with self.iARRAY.get_lock():
        arr = self.iARRAY.get_obj()
        arr[self.iVALUE.value] = 2 # and now ???
    sleep(0.5)
    ...
    with self.pLOCK:
        self.iVALUE.value -= 1
UPDATE 6
I've tried using multiprocessing.Process() instead of Pool() but the result is the same.
correct way to declare the global variable (in this case class attribute)
iARRAY = multiprocessing.Array(ctypes.c_int, range(3))
correct way to set value
self.iARRAY[n] = x
correct way to get value
self.iARRAY[n]
Not sure why the examples I've seen had used Array(ctypes.c_int, 3) and iARRAY.value[n] but in this case that was wrong
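The access pattern described in UPDATE 6 can be tried out in a single process. The sketch below (Python 3) deliberately starts no child processes, so it only shows how to declare, set, read and inspect a multiprocessing.Array, not the sharing itself; the variable names first, values and final are made up for illustration.

```python
import ctypes
import multiprocessing

# Declare the shared array with an initializer instead of a bare size.
iARRAY = multiprocessing.Array(ctypes.c_int, range(3))

iARRAY[1] = 5                 # set one element by index
first = iARRAY[0]             # get one element by index
values = iARRAY[:]            # slicing returns a plain list for inspection

# For a read-modify-write, hold the array's own lock around both steps.
with iARRAY.get_lock():
    iARRAY[2] += 10

final = iARRAY[:]
print(first, values, final)
```

Printing the Array object itself only shows the SynchronizedArray wrapper, which is why the slice is used to look at the contents.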
This is your problem:
while True:
    try:
        pool.apply_async(self.worker, (next(item),))
    except StopIteration:
        pass
    print(self.sARRAY) # yields [None][None][None]
    print(self.iARRAY) # yields [None][None][None]
    print(self.iVALUE) # yields 1-3
The function pool.apply_async() starts the subprocess running and returns immediately. You don't appear to be waiting for the workers to finish. For that, you might consider using a barrier.

An easily refreshable Queue for Python Threading

I would like to find a mechanism to easily report the progress of a Python thread. For example, if my thread had a counter, I would like to know the value of the counter once in awhile, but, importantly, I only need to know the latest value, not every value that's ever gone by.
What I imagine to be the simplest solution is a single value Queue, where every time I put a new value on in the thread, it replaces the old value with the new one. Then when I do a get in the main program, it would only return the latest value.
Because I don't know how to do the above, what I do instead is put every counter value in a queue, and when I get, I read all the values until there are no more and just keep the last. But this seems far from ideal, in that I'm filling the queue with thousands of values that I don't care about.
Here's an example of what I do now:
from threading import Thread
from Queue import Queue, Empty
from time import sleep
N = 1000
def fast(q):
    count = 0
    while count<N:
        sleep(.02)
        count += 1
        q.put(count)
def slow(q):
    while 1:
        sleep(5) # sleep for a long time
        # read last item in queue
        val = None
        while 1: # read all elements of queue, only saving last
            try:
                val = q.get(block=False)
            except Empty:
                break
        print val # the last element read from the queue
        if val==N:
            break
if __name__=="__main__":
    q = Queue()
    fast_thread = Thread(target=fast, args=(q,))
    fast_thread.start()
    slow(q)
    fast_thread.join()
My question is, is there a better approach?
Just use a global variable and a threading.Lock to protect it during assignments:
import threading
from time import sleep
N = 1000
value = 0
def fast(lock):
    global value
    count = 0
    while count<N:
        sleep(.02)
        count += 1
        with lock:
            value = count
def slow():
    while 1:
        sleep(5) # sleep for a long time
        print value # read current value
        if value == N:
            break
if __name__=="__main__":
    lock = threading.Lock()
    fast_thread = threading.Thread(target=fast, args=(lock,))
    fast_thread.start()
    slow()
    fast_thread.join()
yields (something like)
249
498
747
997
1000
As Don Question points out, if there is only one thread modifying value, then
actually no lock is needed in the fast function. And as dano points out, if you want to
ensure that the value printed in slow is the same value used in the
if-statement, then a lock is needed in the slow function.
For more on when locks are needed, see Thread Synchronization Mechanisms in Python.
Just use a deque with a maximum length of 1. It will just keep your latest value.
So, instead of:
q = Queue()
use:
from collections import deque
q = deque(maxlen=1)
To read from the deque, there's no get method, so you'll have to do something like:
val = None
try:
    val = q[0]
except IndexError:
    pass
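Put together, the deque idea might look like the following Python 3 sketch. The helper read_latest and the variable names latest and final are hypothetical, and the sleeps from the question are left out so that the example finishes immediately.

```python
import threading
from collections import deque

N = 1000
latest = deque(maxlen=1)      # appending evicts the previous value automatically

def fast(q):
    for count in range(1, N + 1):
        q.append(count)       # only the newest count is ever kept

def read_latest(q, default=None):
    # Hypothetical helper: return the newest value, or default if none yet.
    try:
        return q[-1]
    except IndexError:
        return default

nothing_yet = read_latest(deque(maxlen=1))   # nothing has been appended here

fast_thread = threading.Thread(target=fast, args=(latest,))
fast_thread.start()
fast_thread.join()
final = read_latest(latest)
print(nothing_yet, final, len(latest))
```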
In your special case, you may be over-complicating the issue. If your variable is just some kind of progress indicator for a single thread, and only this thread actually changes the variable, then it's completely safe to use a shared object to communicate the progress, as long as all other threads only read it.
I guess we all read too many (rightful) warnings about race conditions and other pitfalls of shared state in concurrent programming, so we tend to overthink and add more precautions than are sometimes needed.
You could basically share a pre-constructed dict:
thread_progress = dict.fromkeys(list_of_threads, progress_start_value)
or manually:
thread_progress = {thread: progress_value, ...}
without further precaution as long as no thread changes the dict-keys.
This way you can track the progress of multiple threads in one dict. The only condition is to not change the dict once the threading has started. This means the dict must contain all threads BEFORE the first child thread starts; otherwise you must use a Lock before writing to the dict. By "changing the dict" I mean all operations regarding the keys. You may change the value associated with a key, because that's at the next level of indirection.
Update:
The underlying problem is the shared state, which is already a problem in linear programs but a nightmare in concurrent ones.
For example: Imagine a global (shared) variable sv and two functions G(ood) and B(ad) in a linear program. Both functions calculate a result depending on sv, but B unintentionally changes sv. Now you are wondering why the heck G doesn't do what it should, despite not finding any error in G, even after you tested it in isolation and it was perfectly fine.
Now imagine the same scenario in a concurrent program, with two threads A and B. Both threads increment the shared state/variable sv by one.
without locking (current value of sv in parenthesis):
sv = 0
A reads sv (0)
B reads sv (0)
A inc sv (0)
B inc sv (0)
A writes sv (1)
B writes sv (1)
sv == 1 # should be 2!
Finding the source of the problem is a pure nightmare, because it could also succeed sometimes! More often than not, A would actually finish before B even starts to read sv, so your program just seems to behave non-deterministically or erratically, and the bug is even harder to find. In contrast to my linear example, both threads are "good", but nevertheless do not behave as intended.
with locking:
sv = 0
l = lock (for access on sv)
A tries to acquire lock for sv -> success (0)
B tries to acquire lock for sv -> failure, blocked by A (0)
A reads sv (0)
B blocked (0)
A inc sv (0)
B blocked (0)
A writes sv (1)
B blocked (1)
A releases lock on sv (1)
B tries to acquire lock for sv -> success (1)
...
sv == 2
I hope my little example explained the underlying problem of accessing a shared state and
why making write operations (including the read operation) atomic through locking is necessary.
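The locked trace above corresponds to code along these lines. This is a Python 3 sketch in which the name sv_lock, the thread count and the iteration count are arbitrary choices for illustration.

```python
import threading

sv = 0
sv_lock = threading.Lock()    # guards every read-modify-write of sv

def increment(times):
    global sv
    for _ in range(times):
        with sv_lock:         # read, add and write back as one protected step
            sv += 1

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sv)
```

With the lock in place the final value is always the sum of all increments; without it, updates can be lost in the way the first trace shows.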
Regarding my advice of a pre-initialized dict: this is a mere precaution, for two reasons:
1. If you iterate over the threads in a for-loop, the loop may raise an exception if a thread adds or removes an entry to/from the dict while the loop is still running, because it is now unclear what the next key should be.
2. Thread A reads the dict and gets interrupted by Thread B, which adds an entry and finishes. Thread A resumes, but doesn't have the dict Thread B changed, and writes the pre-B state together with its own changes back. Thread B's changes are lost.
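The first of these two reasons can be reproduced even in a single thread. The Python 3 sketch below changes the dict from inside the loop to stand in for another thread doing so; the keys are made up, and iterating over list(progress) is one common way around the error, not something prescribed above.

```python
progress = {"a": 0, "b": 0}
error = None
try:
    for key in progress:
        progress[key + "_new"] = 0    # adding a key while iterating
except RuntimeError as exc:
    error = str(exc)
print(error)

# Iterating over a snapshot of the keys avoids the error.
progress = {"a": 0, "b": 0}
for key in list(progress):
    progress[key + "_new"] = 0
print(sorted(progress))
```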
BTW my proposed solution wouldn't work as written, because of the immutability of the primitive types. But this could be easily fixed by making them mutable, e.g. by encapsulating them in a list or a special Progress object, or even simpler: give the thread function access to the thread_progress dict.
Explanation by example:
t = Thread()
progress = 0 # progress points to the object `0`
thread_progress = {}
thread_progress[t] = progress # thread_progress[t] now points to object `0`
progress = 1 # progress points to object `1`
thread_progress[t] # thread_progress[t] still points to object `0`
better:
t = Thread()
t.progress = 0
thread_progress[thread_id] = t
t.progress = 1
thread_progress[thread_id].progress == 1
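The simpler variant, giving the thread function access to the thread_progress dict, could be sketched like this in Python 3. The worker function, the thread names and the step count are hypothetical; the keys are all created before any thread starts, and each thread only replaces the value under its own key.

```python
import threading

def worker(name, steps, progress):
    for step in range(1, steps + 1):
        # Only the value for an existing key is replaced; no keys are added.
        progress[name] = step

names = ["t1", "t2", "t3"]
thread_progress = dict.fromkeys(names, 0)   # every key exists up front

threads = [
    threading.Thread(target=worker, args=(name, 5, thread_progress))
    for name in names
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(thread_progress)
```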

Garbage-collect a lock once no threads are asking for it

I have a function that must never be called with the same value simultaneously from two threads. To enforce this, I have a defaultdict that spawns new threading.Locks for a given key. Thus, my code looks similar to this:
from collections import defaultdict
import threading
lock_dict = defaultdict(threading.Lock)
def f(x):
    with lock_dict[x]:
        print "Locked for value x"
The problem is that I cannot figure out how to safely delete the lock from the defaultdict once it's no longer needed. Without doing this, my program has a memory leak that becomes noticeable when f is called with many different values of x.
I cannot simply del lock_dict[x] at the end of f, because in the scenario that another thread is waiting for the lock, then the second thread will lock a lock that's no longer associated with lock_dict[x], and thus two threads could end up simultaneously calling f with the same value of x.
I'd use a different approach:
fcond = threading.Condition()
fargs = set()
def f(x):
    with fcond:
        while x in fargs:
            fcond.wait()
        fargs.add(x) # this thread has exclusive rights to use `x`
    # do useful stuff with x
    # any other thread trying to call f(x) will
    # block in the .wait() above
    with fcond:
        fargs.remove(x) # we're done with x
        fcond.notify_all() # let blocked threads (if any) proceed
Conditions have a learning curve, but once it's climbed they make it much easier to write correct thread-safe, race-free code.
Thread safety of the original code
@JimMischel asked in a comment whether the original's use of defaultdict was subject to races. Good question!
The answer is - alas - "you'll have to stare at your specific Python's implementation".
Assuming the CPython implementation: if any of the code invoked by defaultdict to supply a default invokes Python code, or C code that releases the GIL (global interpreter lock), then 2 (or more) threads could "simultaneously" invoke with lock_dict[x] with the same x not already in the dict, and:
1. Thread 1 sees that x isn't in the dict, gets a lock, then loses its timeslice (before setting x in the dict).
2. Thread 2 sees that x isn't in the dict, and also gets a lock.
3. One of those threads' locks ends up in the dict, but both threads execute f(x).
Staring at the source for 3.4.0a4+ (the current development head), defaultdict and threading.Lock are both implemented by C code that doesn't release the GIL. I don't recall whether earlier versions did or didn't, at various times, implement all or parts of defaultdict or threading.Lock in Python.
My suggested alternative code is full of stuff implemented in Python (all threading.Condition methods), but is race-free by design - even if you're using an old version of Python with sets also implemented in Python (the set is only accessed under the protection of the condition variable's lock).
One lock per argument
Without conditions, this seems to be much harder. In the original approach, I believe you need to keep a count of threads wanting to use x, and you need a lock to protect those counts and to protect the dictionary. The best code I've come up with for that is so long-winded that it seems sanest to put it in a context manager. To use, create an argument locker per function that needs it:
farglocker = ArgLocker() # for function `f()`
and then the body of f() can be coded simply:
def f(x):
    with farglocker(x):
        # only one thread at a time can run with argument `x`
Of course the condition approach could also be wrapped in a context manager. Here's the code:
import threading
class ArgLocker:
    def __init__(self):
        self.xs = dict() # maps x to (lock, count) pair
        self.lock = threading.Lock()
    def __call__(self, x):
        return AllMine(self.xs, self.lock, x)
class AllMine:
    def __init__(self, xs, lock, x):
        self.xs = xs
        self.lock = lock
        self.x = x
    def __enter__(self):
        x = self.x
        with self.lock:
            xlock = self.xs.get(x)
            if xlock is None:
                xlock = threading.Lock()
                xlock.acquire()
                count = 0
            else:
                xlock, count = xlock
            self.xs[x] = xlock, count + 1
        if count: # x was already known - wait for it
            xlock.acquire()
        assert xlock.locked() # call the method; the bare attribute is always truthy
    def __exit__(self, *args):
        x = self.x
        with self.lock:
            xlock, count = self.xs[x]
            assert xlock.locked()
            assert count > 0
            count -= 1
            if count:
                self.xs[x] = xlock, count
            else:
                del self.xs[x]
            xlock.release()
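For comparison, the condition-based approach can be wrapped in a context manager as well. The following Python 3 sketch is one way to do it; the class name CondArgLocker, the method name holding, and the small concurrency check around it are hypothetical and only there to exercise the idea.

```python
import threading
import time
from contextlib import contextmanager

class CondArgLocker:
    def __init__(self):
        self._cond = threading.Condition()
        self._busy = set()            # arguments currently in use

    @contextmanager
    def holding(self, x):
        with self._cond:
            while x in self._busy:    # someone else is using x: wait
                self._cond.wait()
            self._busy.add(x)
        try:
            yield
        finally:
            with self._cond:
                self._busy.remove(x)
                self._cond.notify_all()   # wake waiters so they can re-check

locker = CondArgLocker()
state_lock = threading.Lock()
active = 0
max_active = 0

def f(x):
    global active, max_active
    with locker.holding(x):
        with state_lock:
            active += 1
            max_active = max(max_active, active)
        time.sleep(0.01)              # stand-in for useful work with x
        with state_lock:
            active -= 1

threads = [threading.Thread(target=f, args=("same",)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(max_active)
```

Since the set only holds arguments that are in use right now, nothing is left behind once all callers are done, which addresses the memory concern from the question.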
So which way is better? Using conditions ;-) That way is "almost obviously correct", but the lock-per-argument (LPA) approach is a bit of a head-scratcher. The LPA approach does have the advantage that when a thread is done with x, the only threads allowed to proceed are those wanting to use the same x; using conditions, the .notify_all() wakes all threads blocked waiting on any argument. But unless there's very heavy contention among threads trying to use the same arguments, this isn't going to matter much: using conditions, the threads woken up that aren't waiting on x stay awake only long enough to see that x in fargs is true, and then immediately block (.wait()) again.
