Thread locking failing in dead-simple example


This is the simplest toy example. I know about concurrent.futures and higher level code; I'm picking the toy example because I'm teaching it (as part of same material with the high-level stuff).
It steps on counter from different threads, and I get... well, here it is even weirder. Usually I get a counter less than I should (e.g. 5M), generally much less like 20k. But as I decrease the number of loops, at some number like 1000 it is consistently right. Then at some intermediate number, I get almost right, occasionally correct, but once in a while slightly larger than the product of nthread x nloop. I am running it repeatedly in a Jupyter cell, but the first line really should reset counter to zero, not keep any old total.
import threading
from threading import Thread

lock = threading.Lock()
counter, nthread, nloop = 0, 100, 50_000

def increment(n, lock):
    global counter
    for _ in range(n):
        lock.acquire()
        counter += 1
        lock.release()

for _ in range(nthread):
    t = Thread(target=increment, args=(nloop, lock))
    t.start()

print(f"{nloop:,} loops X {nthread:,} threads -> counter is {counter:,}")
If I add .join() the behavior changes, but is still not correct. For example, in the version that doesn't try to lock:
counter, nthread, nloop = 0, 100, 50_000

def increment(n):
    global counter
    for _ in range(n):
        counter += 1

for _ in range(nthread):
    t = Thread(target=increment, args=(nloop,))
    t.start()
    t.join()

print(f"{nloop:,} loops X {nthread:,} threads -> counter is {counter:,}")
# --> 50,000 loops X 100 threads -> counter is 5,022,510
The exact overcount varies, but I see something like that repeatedly.
I don't really want to .join() in the lock example, because I want to illustrate the idea of a background job. But I can wait for the aliveness of the thread (thank you Frank Yellin!), and that fixes the lock case. The overcount still troubles me though.

You're not waiting until all your threads are done before looking at counter. That's also why you're getting your result so quickly.
threads = []
for _ in range(nthread):
    t = threading.Thread(target=increment, args=(nloop, lock))
    t.start()
    threads.append(t)

for thread in threads:
    thread.join()

print(f"{nloop:,} loops X {nthread:,} threads -> counter is {counter:,}")
prints out the expected result.
50,000 loops X 100 threads -> counter is 5,000,000
Updated. I highly recommend using ThreadPoolExecutor() instead, which takes care of tracking the threads for you.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    for _ in range(nthread):
        executor.submit(increment, nloop, lock)

print(...)
will give you the answer you want, and takes care of waiting for the threads.
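As a fuller sketch of the same idea, the version below wraps the locked counter in a function so it can be rerun cleanly, for instance from a notebook cell. The function name `run_counter` and the smaller default counts are assumptions made for illustration; they are not part of the original question.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_counter(nthread=8, nloop=10_000):
    """Increment a shared counter from several threads and return the total."""
    lock = threading.Lock()
    counter = 0

    def increment(n):
        nonlocal counter
        for _ in range(n):
            with lock:  # acquire and release around the read-modify-write
                counter += 1

    # Leaving the with-block waits for every submitted task to finish.
    with ThreadPoolExecutor(max_workers=nthread) as executor:
        for _ in range(nthread):
            executor.submit(increment, nloop)
    return counter


total = run_counter()
print(f"counter is {total:,}")
```

Because `counter` lives inside `run_counter`, threads left over from an earlier call cannot add to a later call's total.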

Related

random.random() generates same number in multiprocessing

I'm working on an optimization problem; a simplified version of my code is posted below (the original code is too complicated to post, and I hope the simplified version reproduces its behavior as closely as possible).
My purpose:
I want to use the function foo inside the function optimization, but foo can take a very long time in some hard cases. So I use multiprocessing to set a time limit on each call (proc.join(iter_time); the method comes from an answer to this question: How to limit execution time of a function call?).
My problem:
In the while loop, the value generated for extra is the same every time.
The length of the list lst is always 1, which means every iteration of the while loop starts from an empty list.
My guess: each time I create a process, the random seed may start counting from the beginning, and each time the process is terminated, some garbage collection mechanism may clean up the memory the process used, so the list is cleared.
My question:
Does anyone know the real reason for these problems?
If not using multiprocessing, is there any other way to achieve my purpose while still generating different random numbers? By the way, I have tried func_timeout, but it has other problems that I cannot handle...
import multiprocessing
import random
import time

random.seed(123)
lst = []  # a global list for logging data

def foo(epoch):
    ...
    extra = random.random()
    lst.append(epoch + extra)
    ...

def optimization(loop_time, iter_time):
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = multiprocessing.Process(target=foo, args=(epoch,))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # if the process is not terminated within time limit
            print("Time out!")
            proc.terminate()

if __name__ == '__main__':
    optimization(300, 2)

You need to use shared memory if you want to share variables across processes, because child processes do not share their memory space with the parent. The simplest way to do this here is to use a managed list and to delete the line that sets the random seed. That seed is what causes the same number to be generated: every child process starts from the same seeded generator state, so each one draws the same first value. To get different random numbers, either don't set a seed or pass a different seed to each process:
import time, random
from multiprocessing import Manager, Process

def foo(epoch, lst):
    extra = random.random()
    lst.append(epoch + extra)

def optimization(loop_time, iter_time, lst):
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = Process(target=foo, args=(epoch, lst))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # if the process is not terminated within time limit
            print("Time out!")
            proc.terminate()
    print(lst)

if __name__ == '__main__':
    manager = Manager()
    lst = manager.list()
    optimization(10, 2, lst)
Output
[0.2035898948744943, 0.07617925389396074, 0.6416754412198231, 0.6712193790613651, 0.419777147554235, 0.732982735576982, 0.7137712131028766, 0.22875414425414997, 0.3181113880578589, 0.5613367673646847, 0.8699685474084119, 0.9005359611195111, 0.23695341111251134, 0.05994288664062197, 0.2306562314450149, 0.15575356275408125, 0.07435292814989103, 0.8542361251850187, 0.13139055891993145, 0.5015152768477814, 0.19864873743952582, 0.2313646288041601, 0.28992667535697736, 0.6265055915510219, 0.7265797043535446, 0.9202923318284002, 0.6321511834038631, 0.6728367262605407, 0.6586979597202935, 0.1309226720786667, 0.563889613032526, 0.389358766191921, 0.37260564565714316, 0.24684684162272597, 0.5982042933298861, 0.896663326233504, 0.7884030244369596, 0.6202229004466849, 0.4417549843477827, 0.37304274232635715, 0.5442716244427301, 0.9915536257041505, 0.46278512685707873, 0.4868394190894778, 0.2133187095154937]
Keep in mind that using managers will affect the performance of your code. Alternatively, you could use multiprocessing.Array, which is faster than a manager but less flexible in what data it can store, or a Queue.
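The other option mentioned above, passing a different seed to each process, could look roughly like the sketch below. The `seed` parameter, the `base_seed + epoch` scheme, and the fixed iteration count `n_iters` are assumptions chosen for illustration; the point is only that each child builds its own `random.Random` from a distinct seed, so runs stay reproducible without every child drawing the same value.

```python
import random
from multiprocessing import Manager, Process


def foo(epoch, seed, lst):
    # A private generator per call; the module-level random state is untouched.
    rng = random.Random(seed)
    lst.append(epoch + rng.random())


def optimization(n_iters, iter_time, lst, base_seed=123):
    for epoch in range(n_iters):
        # Hypothetical seeding scheme: one distinct seed per iteration.
        proc = Process(target=foo, args=(epoch, base_seed + epoch, lst))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # still running after the time limit
            print("Time out!")
            proc.terminate()
            proc.join()


if __name__ == "__main__":
    with Manager() as manager:
        lst = manager.list()
        optimization(5, 2, lst)
        print(list(lst))
```

Calling `foo` twice with the same seed gives the same value, and with different seeds gives different values, which is the behavior the seeding advice relies on.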

Does new implementation of GIL in Python handled race condition issue?

I've read an article about multithreading in Python in which synchronization is used to solve a race condition. I ran the example code below to reproduce the race condition:
import threading

# global variable x
x = 0

def increment():
    """
    function to increment global variable x
    """
    global x
    x += 1

def thread_task():
    """
    task for thread
    calls increment function 100000 times.
    """
    for _ in range(100000):
        increment()

def main_task():
    global x
    # setting global variable x as 0
    x = 0
    # creating threads
    t1 = threading.Thread(target=thread_task)
    t2 = threading.Thread(target=thread_task)
    # start threads
    t1.start()
    t2.start()
    # wait until threads finish their job
    t1.join()
    t2.join()

if __name__ == "__main__":
    for i in range(10):
        main_task()
        print("Iteration {0}: x = {1}".format(i,x))
It returns the same result as the article when I use Python 2.7.15, but not when I use Python 3.6.9 (every iteration returns the same result, 200000).
Did the new implementation of the GIL (since Python 3.2) solve the race condition issue? If it did, why do Lock and Mutex still exist in Python > 3.2? If it didn't, why is there no conflict when multiple threads modify a shared resource as in the example above?
I have been struggling with these questions lately while trying to understand how Python really works under the hood.

The change you are referring to was to replace check interval with switch interval. This meant that rather than switching threads every 100 byte codes it would do so every 5 milliseconds.
Ref: https://pymotw.com/3/sys/threads.html https://mail.python.org/pipermail/python-dev/2009-October/093321.html
So if your code ran fast enough, it would never experience a thread switch and it might appear to you that the operations were atomic when they are in fact not. The race condition did not appear as there was no actual interweaving of threads. x += 1 is actually four byte codes:
>>> dis.dis(sync.increment)
 11           0 LOAD_GLOBAL              0 (x)
              3 LOAD_CONST               1 (1)
              6 INPLACE_ADD
              7 STORE_GLOBAL             0 (x)
             10 LOAD_CONST               2 (None)
             13 RETURN_VALUE
A thread switch in the interpreter can occur between any two bytecodes.
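To check this on your own interpreter, a small sketch like the following lists the instructions for the same function. The exact opcode names vary between Python versions (for example, INPLACE_ADD was replaced by BINARY_OP in 3.11), but the load and the store of the global remain separate instructions.

```python
import dis

x = 0


def increment():
    global x
    x += 1


# Collect the opcode names instead of printing the full disassembly.
opnames = [ins.opname for ins in dis.get_instructions(increment)]
print(opnames)
```

A thread switch landing between the LOAD_GLOBAL and the STORE_GLOBAL is what lets another thread's update be overwritten.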
Consider that in 2.7 the script below always prints 200000, because the check interval is set so high that each thread completes in its entirety before the next runs. The same can be constructed with the switch interval.
import sys
import threading

print(sys.getcheckinterval())
sys.setcheckinterval(1000000)

# global variable x
x = 0

def increment():
    """
    function to increment global variable x
    """
    global x
    x += 1

def thread_task():
    """
    task for thread
    calls increment function 100000 times.
    """
    for _ in range(100000):
        increment()

def main_task():
    global x
    # setting global variable x as 0
    x = 0
    # creating threads
    t1 = threading.Thread(target=thread_task)
    t2 = threading.Thread(target=thread_task)
    # start threads
    t1.start()
    t2.start()
    # wait until threads finish their job
    t1.join()
    t2.join()

if __name__ == "__main__":
    for i in range(10):
        main_task()
        print("Iteration {0}: x = {1}".format(i,x))
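On Python 3 the equivalent knob is sys.setswitchinterval (sys.setcheckinterval was removed in Python 3.9). The sketch below is one way to experiment with it; the helper name `count_with_interval` and the interval values are assumptions for illustration, and the totals it prints depend on the interpreter version and machine, so no particular output is claimed.

```python
import sys
import threading


def count_with_interval(interval, n=100_000, nthreads=2):
    """Run unsynchronized increments under a temporary switch interval."""
    old = sys.getswitchinterval()
    sys.setswitchinterval(interval)
    try:
        x = 0

        def task():
            nonlocal x
            for _ in range(n):
                x += 1  # deliberately not protected by a lock

        threads = [threading.Thread(target=task) for _ in range(nthreads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x
    finally:
        sys.setswitchinterval(old)  # always restore the previous setting


print("default switch interval:", sys.getswitchinterval())
print("very short interval:", count_with_interval(1e-5))
print("very long interval: ", count_with_interval(10.0))
```

A long interval makes forced switches during the loop unlikely, which hides the race without removing it; a lock is still needed for correctness.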

The GIL protects individual byte code instructions. In contrast, a race condition is an incorrect ordering of instructions, which means multiple byte code instructions. As a result, the GIL cannot protect against race conditions outside of the Python VM itself.
However, by their very nature race conditions do not always trigger. Certain GIL strategies are more or less likely to trigger certain race conditions. A thread shorter than the GIL window is never interrupted, and one longer than the GIL window is always interrupted.
Your increment function has 6 byte code instructions, as has the inner loop calling it. Of these, 4 instructions must finish at once, meaning there are 3 possible switching points that corrupt the result. Your entire thread_task function takes about 0.015s to 0.020s (on my system).
With the old GIL switching every 100 instructions, the loop is guaranteed to be interrupted every 8.3 calls (100 instructions divided by the 12 instructions per call), or roughly 12,000 times over the 100,000 calls. With the new GIL switching every 5ms, the loop is interrupted only 3 times.

How to use multiprocessing to parallelize two calls to the same function, with different arguments, in a for loop?

In a for loop, I am calling a function twice but with different argument sets (argSet1, argSet2) that change on each iteration of the for loop. I want to parallelize this operation since one set of arguments makes the called function run fast, and the other set makes it run slowly. Note that I do not want to have two for loops for this operation. I also have another requirement: each of these calls will execute some parallel operations itself, so I do not want the function with either argSet1 or argSet2 to be running more than once at a time, because of my limited computational resources. Making sure that the function is running with both argument sets at once will help me utilize the CPU cores as much as possible. Here's how I do it normally, without parallelization:
def myFunc(arg1, arg2):
    if arg1:
        print ('do something that does not take too long')
    else:
        print ('do something that takes long')

for i in range(10):
    argSet1 = arg1Storage[i]
    argSet2 = arg2Storage[i]
    myFunc(*argSet1)
    myFunc(*argSet2)
This will definitely not take the advantage of the computational resources that I have. Here's my try to parallelize the operations:
from multiprocessing import Process

def myFunc(arg1, arg2):
    if arg1:
        print ('do something that does not take too long')
    else:
        print ('do something that takes long')

for i in range(10):
    argSet1 = arg1Storage[i]
    argSet2 = arg2Storage[i]
    p1 = Process(target=myFunc, args=argSet1)
    p1.start()
    p2 = Process(target=myFunc, args=argSet2)
    p2.start()
However, this way each function with its respective arguments will be called 10 times and things become extremely slow. Given my limited knowledge of multiprocessing, I tried to improve things a bit more by adding p1.join() and p2.join() to the end of the for loop but this still causes slow down as p1 is done much faster and things wait until p2 is done. I also thought about using multiprocessing.Value to do some communication with the functions but then I have to add a while loop inside the function for each of the function calls which slows down everything again. I wonder if someone can offer a practical solution?

Since I built this answer in patches, scroll down for the best solution to this problem
You need to specify exactly how you want things to run. As far as I can tell, you want at most two processes running, but also at least two. Also, you do not want the heavy call to hold up the fast ones. One simple, non-optimal way to run this is:
from multiprocessing import Process

def func(counter,somearg):
    j = 0
    for i in range(counter): j+=i
    print(somearg)

def loop(counter,arglist):
    for i in range(10):
        func(counter,arglist[i])

heavy = Process(target=loop,args=[1000000,['heavy'+str(i) for i in range(10)]])
light = Process(target=loop,args=[500000,['light'+str(i) for i in range(10)]])
heavy.start()
light.start()
heavy.join()
light.join()
The output here is (for one example run):
light0
heavy0
light1
light2
heavy1
light3
light4
heavy2
light5
light6
heavy3
light7
light8
heavy4
light9
heavy5
heavy6
heavy7
heavy8
heavy9
You can see the last part is sub-optimal, since you have a sequence of heavy runs, which means there is one process running instead of two.
There is an easy way to optimize this if you can estimate how much longer the heavy process runs. If it's twice as slow, as here, just run 7 iterations of heavy first, join the light process, and then have a second process run the remaining 3.
Another way is to run the heavy calls in pairs, so at first you have 3 processes until the fast process ends, and then you continue with 2.
The main point is separating the heavy and light calls into different processes entirely, so that while the fast calls complete quickly one after the other, the slow ones keep working. Once the fast calls end, it's up to you how elaborate you want the rest to be, but I think for now estimating how to break up the heavy calls is good enough. This is it for my example:
from multiprocessing import Process

def func(counter,somearg):
    j = 0
    for i in range(counter): j+=i
    print(somearg)

def loop(counter,amount,arglist):
    for i in range(amount):
        func(counter,arglist[i])

heavy1 = Process(target=loop,args=[1000000,7,['heavy1'+str(i) for i in range(7)]])
light = Process(target=loop,args=[500000,10,['light'+str(i) for i in range(10)]])
heavy2 = Process(target=loop,args=[1000000,3,['heavy2'+str(i) for i in range(7,10)]])
heavy1.start()
light.start()
light.join()
heavy2.start()
heavy1.join()
heavy2.join()
with output:
light0
heavy10
light1
light2
heavy11
light3
light4
heavy12
light5
light6
heavy13
light7
light8
heavy14
light9
heavy15
heavy27
heavy16
heavy28
heavy29
Much better utilization. You can of course make this more advanced by sharing a queue for the slow process runs, so when the fast are done they can join as workers on the slow queue, but for only two different calls this may be overkill (though not much harder using the queue). The best solution:
from multiprocessing import Queue,Process
import queue

def func(index,counter,somearg):
    j = 0
    for i in range(counter): j+=i
    print("Worker",index,':',somearg)

def worker(index):
    try:
        while True:
            func,args = q.get(block=False)
            func(index,*args)
    except queue.Empty: pass

q = Queue()
for i in range(10):
    q.put((func,(500000,'light'+str(i))))
    q.put((func,(1000000,'heavy'+str(i))))

nworkers = 2
workers = []
for i in range(nworkers):
    workers.append(Process(target=worker,args=(i,)))
    workers[-1].start()
q.close()
for worker in workers:
    worker.join()
This is the best and most scalable solution for what you want. Output:
Worker 0 : light0
Worker 0 : light1
Worker 1 : heavy0
Worker 1 : light2
Worker 0 : heavy1
Worker 0 : light3
Worker 1 : heavy2
Worker 1 : light4
Worker 0 : heavy3
Worker 0 : light5
Worker 1 : heavy4
Worker 1 : light6
Worker 0 : heavy5
Worker 0 : light7
Worker 1 : heavy6
Worker 1 : light8
Worker 0 : heavy7
Worker 0 : light9
Worker 1 : heavy8
Worker 0 : heavy9

You might want to use a multiprocessing.Pool of processes and map your myFunc into it, like so:
from multiprocessing import Pool
import time

def myFunc(arg1, arg2):
    if arg1:
        print ('do something that does not take too long')
        time.sleep(0.01)
    else:
        print ('do something that takes long')
        time.sleep(1)

def wrap(args):
    return myFunc(*args)

if __name__ == "__main__":
    p = Pool()
    argStorage = [(True, False), (False, True)] * 12
    p.map(wrap, argStorage)
I added a wrap function, since the function passed to p.map must accept a single argument. You could just as well adapt myFunc to accept a tuple, if that's possible in your case.
My sample argStorage consists of 24 items, of which 12 take 1 second to process and 12 are done in 10ms. In total, this script runs in 3-4 seconds (I have 4 cores).
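If the wrapper is unwanted, Pool.starmap unpacks each argument tuple itself. The sketch below assumes a variant of the function named `my_func` that returns a label instead of printing, and uses shorter sleeps and fewer items than the sample above, purely to keep the example quick.

```python
import time
from multiprocessing import Pool


def my_func(arg1, arg2):
    if arg1:
        time.sleep(0.01)  # stands in for the fast case
        return "fast"
    time.sleep(0.1)  # stands in for the slow case
    return "slow"


def main():
    arg_storage = [(True, False), (False, True)] * 4
    with Pool() as pool:
        # starmap calls my_func(*item) for every tuple in arg_storage
        results = pool.starmap(my_func, arg_storage)
    print(results)


if __name__ == "__main__":
    main()
```

Results come back in the order of the input list, regardless of which worker finished first.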

One possible implementation could be as follows:
import concurrent.futures
import math

list_of_args = [arg1, arg2]

def my_func(arg):
    ...
    print ('do something that takes long')

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for arg, result in zip(list_of_args, executor.map(my_func, list_of_args)):
            print('my_func({0}) => {1}'.format(arg, result))
executor.map works like the built-in map: it calls the provided function once for each item in an iterable, passing that item as the argument.

Set the nr of executions per second using Python's multiprocessing

I wrote a script in Python 3.6 initially using a for loop which called an API, then putting all results into a pandas dataframe and writing them to a SQL database. (approximately 9,000 calls are made to that API every time the script runs).
Realising the calls inside the for loop were processed one-by-one, I decided to use the multiprocessing module to speed things up.
Therefore, I created a module level function called parallel_requests and now I call that instead of having the for loop:
list_of_lists = multiprocessing.Pool(processes=4).starmap(parallel_requests, zip(....))
Side note: I use starmap instead of map only because my parallel_requests function takes multiple arguments which I need to zip.
The good: this approach works and is much faster.
The bad: this approach works but is too fast. By using 4 processes (I tried that because I have 4 cores), parallel_requests is getting executed too fast. More than 15 calls per second are made to the API, and I'm getting blocked by the API itself.
In fact, it only works if I use 1 or 2 processes, otherwise it's too damn fast.
Essentially what I want is to keep using 4 processes, but also to limit the execution of my parallel_requests function to only 15 times per second overall.
Is there any parameter of multiprocessing.Pool that would help with this, or it's more complicated than that?

For this case I'd use a leaky bucket. You can have one process that fills a queue at the prescribed rate, with a maximum size that indicates how many requests you can "bank" if you don't make them at the maximum rate; each worker process then just needs to get from the queue before doing its work.
import time

def make_api_request(this, that, rate_queue):
    rate_queue.get()
    print("DEBUG: doing some work at {}".format(time.time()))
    return this * that

def throttler(rate_queue, interval):
    try:
        while True:
            if not rate_queue.full():  # avoid blocking
                rate_queue.put(0)
            time.sleep(interval)
    except BrokenPipeError:
        # main process is done
        return

if __name__ == '__main__':
    from multiprocessing import Pool, Manager, Process
    from itertools import repeat
    rq = Manager().Queue(maxsize=15)  # conservative; no banking
    pool = Pool(4)
    Process(target=throttler, args=(rq, 1/15.)).start()
    pool.starmap(make_api_request, zip(range(100), range(100, 200), repeat(rq)))

I'll look at the ideas posted here, but in the meantime I've just used a simple approach of opening and closing a Pool of 4 processes for every 15 requests and appending all the results in a list_of_lists.
Admittedly, not the best approach, since it takes time/resources to open/close a Pool, but it was the most handy solution for now.
import multiprocessing
from time import sleep

# define a generator for use below
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

list_of_lists = []
for current_chunk in chunks(all_data, 15):  # 15 is the API's limit of requests per second
    pool = multiprocessing.Pool(processes=4)
    res = pool.starmap(parallel_requests, zip(current_chunk, [to_symbol]*len(current_chunk), [query]*len(current_chunk), [start]*len(current_chunk), [stop]*len(current_chunk)) )
    sleep(1)  # Sleep for 1 second after every 15 API requests
    list_of_lists.extend(res)
    pool.close()

flatten_list = [item for sublist in list_of_lists for item in sublist]  # use this to construct a `pandas` dataframe
PS: This solution is really not at all that fast due to the multiple opening/closing of pools. Thanks Nathan Vērzemnieks for suggesting to open just one pool, it's much faster, plus your processor won't look like it's running a stress test.
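The single-pool suggestion could be sketched roughly as follows: the pool is created once and reused for every chunk. `fake_request` is a hypothetical stand-in for parallel_requests, and `run_throttled` with its `per_second` and `pause` parameters is an assumed helper, not code from the original post.

```python
import time
from multiprocessing import Pool


def chunks(items, n):
    """Yield successive n-sized chunks from items."""
    for i in range(0, len(items), n):
        yield items[i:i + n]


def fake_request(symbol, query):
    """Hypothetical stand-in for parallel_requests; returns a list of rows."""
    return [(symbol, query)]


def run_throttled(pool, func, arg_tuples, per_second=15, pause=1.0):
    """Send one chunk of calls through an existing pool, then pause."""
    results = []
    for chunk in chunks(arg_tuples, per_second):
        results.extend(pool.starmap(func, chunk))
        time.sleep(pause)  # conservative: wait a full pause after each chunk
    return results


if __name__ == "__main__":
    args = [("SYM{}".format(i), "price") for i in range(30)]
    with Pool(processes=4) as pool:  # opened once, reused for every chunk
        list_of_lists = run_throttled(pool, fake_request, args)
    flat = [item for sublist in list_of_lists for item in sublist]
    print(len(flat))
```

Sleeping after each chunk keeps the rate below the limit at the cost of some idle time; the leaky-bucket answer above avoids that idle time.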

One way to do this is to use a Queue, which can share the timestamps of API calls with the other processes.
Below is an example of how this could work. It takes the oldest entry in the queue, and if that entry is less than one second old, it sleeps for the remainder of that second.
from multiprocessing import Pool, Manager, queues
from random import randint
import time

MAX_CONNECTIONS = 10
PROCESS_COUNT = 4

def api_request(a, b):
    time.sleep(randint(1, 9) * 0.03)  # simulate request
    return a, b, time.time()

def parallel_requests(a, b, the_queue):
    try:
        oldest = the_queue.get()
        time_difference = time.time() - oldest
    except queues.Empty:
        time_difference = float("-inf")
    if 0 < time_difference < 1:
        time.sleep(1-time_difference)
    else:
        time_difference = 0
    print("Current time: ", time.time(), "...after sleeping:", time_difference)
    the_queue.put(time.time())
    return api_request(a, b)

if __name__ == "__main__":
    m = Manager()
    q = m.Queue(maxsize=MAX_CONNECTIONS)
    for _ in range(0, MAX_CONNECTIONS):  # Fill the queue with zeroes
        q.put(0)
    p = Pool(PROCESS_COUNT)
    # Create example data
    data_length = 100
    data1 = range(0, data_length)  # Just some dummy-data
    data2 = range(100, data_length+100)  # Just some dummy-data
    queue_iterable = [q] * (data_length+1)  # required for starmap -function
    list_of_lists = p.starmap(parallel_requests, zip(data1, data2, queue_iterable))
    print(list_of_lists)

Optimizing Multiprocessing in Python (Follow Up: using Queues)

This is a followup question to this. User Will suggested using a queue, I tried to implement that solution below. The solution works just fine with j=1000, however, it hangs as I try to scale to larger numbers. I am stuck here and cannot determine why it hangs. Any suggestions would be appreciated. Also, the code is starting to get ugly as I keep messing with it, I apologize for all the nested functions.
def run4(j):
    """
    a multicore approach using queues
    """
    from multiprocessing import Process, Queue, cpu_count
    import os

    def bazinga(uncrunched_queue, crunched_queue):
        """
        Pulls the next item off queue, generates its collatz
        length and
        """
        num = uncrunched_queue.get()
        while num != 'STOP':  #Signal that there are no more numbers
            length = len(generateChain(num, []) )
            crunched_queue.put([num , length])
            num = uncrunched_queue.get()

    def consumer(crunched_queue):
        """
        A process to pull data off the queue and evaluate it
        """
        maxChain = 0
        biggest = 0
        while not crunched_queue.empty():
            a, b = crunched_queue.get()
            if b > maxChain:
                biggest = a
                maxChain = b
        print('%d has a chain of length %d' % (biggest, maxChain))

    uncrunched_queue = Queue()
    crunched_queue = Queue()
    numProcs = cpu_count()
    for i in range(1, j):  #Load up the queue with our numbers
        uncrunched_queue.put(i)
    for i in range(numProcs):  #put sufficient stops at the end of the queue
        uncrunched_queue.put('STOP')
    ps = []
    for i in range(numProcs):
        p = Process(target=bazinga, args=(uncrunched_queue, crunched_queue))
        p.start()
        ps.append(p)
    p = Process(target=consumer, args=(crunched_queue, ))
    p.start()
    ps.append(p)
    for p in ps: p.join()

You're putting 'STOP' poison pills into your uncrunched_queue (as you should), and having your producers shut down accordingly; on the other hand your consumer only checks for emptiness of the crunched queue:
while not crunched_queue.empty():
(this working at all depends on a race condition, btw, which is not good)
When you start throwing non-trivial work units at your bazinga producers, they take longer. If all of them take long enough, your crunched_queue dries up, and your consumer dies. I think you may be misidentifying what's happening - your program doesn't "hang", it just stops outputting stuff because your consumer is dead.
You need to implement a smarter methodology for shutting down your consumer. Either look for n poison pills, where n is the number of producers (who accordingly each toss one in the crunched_queue when they shut down), or use something like a Semaphore that counts up for each live producer and down when one shuts down.
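The first option, counting n poison pills, could be sketched as below. `collatz_length` is a hypothetical stand-in for the generateChain helper that the question does not show, and running the consumer in the parent process is a design choice made for this sketch rather than something the question requires.

```python
from multiprocessing import Process, Queue, cpu_count

STOP = "STOP"


def collatz_length(n):
    """Number of terms in the Collatz chain starting at n (stand-in helper)."""
    length = 1
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        length += 1
    return length


def producer(uncrunched_queue, crunched_queue):
    num = uncrunched_queue.get()
    while num != STOP:
        crunched_queue.put((num, collatz_length(num)))
        num = uncrunched_queue.get()
    crunched_queue.put(STOP)  # tell the consumer this producer is finished


def consumer(crunched_queue, n_producers):
    biggest, max_chain, stops_seen = 0, 0, 0
    while stops_seen < n_producers:
        item = crunched_queue.get()  # blocks instead of polling empty()
        if item == STOP:
            stops_seen += 1
            continue
        num, length = item
        if length > max_chain:
            biggest, max_chain = num, length
    print("%d has a chain of length %d" % (biggest, max_chain))
    return biggest, max_chain


def run(j):
    uncrunched_queue, crunched_queue = Queue(), Queue()
    n_procs = cpu_count()
    for i in range(1, j):
        uncrunched_queue.put(i)
    for _ in range(n_procs):
        uncrunched_queue.put(STOP)  # one pill per producer
    workers = [Process(target=producer, args=(uncrunched_queue, crunched_queue))
               for _ in range(n_procs)]
    for p in workers:
        p.start()
    result = consumer(crunched_queue, n_procs)  # drain before joining
    for p in workers:
        p.join()
    return result


if __name__ == "__main__":
    run(10_000)
```

The consumer now stops only after it has seen one pill from every producer, so a temporarily empty queue no longer ends it early.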
