Does the new implementation of the GIL in Python handle the race condition issue? - python

I've read an article about multithreading in Python where they try to use synchronization to solve a race condition. I ran the example code below to reproduce the race condition:
import threading

# global variable x
x = 0

def increment():
    """
    function to increment global variable x
    """
    global x
    x += 1

def thread_task():
    """
    task for thread
    calls increment function 100000 times.
    """
    for _ in range(100000):
        increment()

def main_task():
    global x
    # setting global variable x as 0
    x = 0
    # creating threads
    t1 = threading.Thread(target=thread_task)
    t2 = threading.Thread(target=thread_task)
    # start threads
    t1.start()
    t2.start()
    # wait until threads finish their job
    t1.join()
    t2.join()

if __name__ == "__main__":
    for i in range(10):
        main_task()
        print("Iteration {0}: x = {1}".format(i, x))
It returns the same result as the article when I'm using Python 2.7.15, but it does not when I'm using Python 3.6.9 (every iteration prints the same result, x = 200000).
I wonder: does the new implementation of the GIL (since Python 3.2) handle the race condition issue? If it does, why do Lock and Mutex still exist in Python > 3.2? If it doesn't, why is there no conflict when running multiple threads that modify a shared resource, as in the example above?
I've been struggling with these questions for days while trying to understand how Python really works under the hood.

The change you are referring to replaced the check interval with a switch interval. This means that rather than switching threads every 100 bytecode instructions, the interpreter does so every 5 milliseconds.
Ref: https://pymotw.com/3/sys/threads.html https://mail.python.org/pipermail/python-dev/2009-October/093321.html
So if your code runs fast enough, it never experiences a thread switch, and it can appear that the operations are atomic when in fact they are not. The race condition did not appear because there was no actual interleaving of threads. x += 1 is actually four bytecode instructions:
>>> dis.dis(sync.increment)
 11           0 LOAD_GLOBAL              0 (x)
              3 LOAD_CONST               1 (1)
              6 INPLACE_ADD
              7 STORE_GLOBAL             0 (x)
             10 LOAD_CONST               2 (None)
             13 RETURN_VALUE
A thread switch in the interpreter can occur between any two bytecodes.
Consider that in 2.7 this always prints 200000, because the check interval is set so high that each thread completes in its entirety before the next runs. The same can be constructed with the switch interval in Python 3; a sketch follows the code below.
import sys
import threading

print(sys.getcheckinterval())
sys.setcheckinterval(1000000)

# global variable x
x = 0

def increment():
    """
    function to increment global variable x
    """
    global x
    x += 1

def thread_task():
    """
    task for thread
    calls increment function 100000 times.
    """
    for _ in range(100000):
        increment()

def main_task():
    global x
    # setting global variable x as 0
    x = 0
    # creating threads
    t1 = threading.Thread(target=thread_task)
    t2 = threading.Thread(target=thread_task)
    # start threads
    t1.start()
    t2.start()
    # wait until threads finish their job
    t1.join()
    t2.join()

if __name__ == "__main__":
    for i in range(10):
        main_task()
        print("Iteration {0}: x = {1}".format(i, x))

The GIL protects individual bytecode instructions. In contrast, a race condition is an incorrect ordering of instructions, which spans multiple bytecode instructions. As a result, the GIL cannot protect against race conditions outside of the Python VM itself.
However, by their very nature race conditions do not always trigger. Certain GIL strategies are more or less likely to trigger certain race conditions. A thread shorter than the GIL window is never interrupted, and one longer than the GIL window is always interrupted.
Your increment function has 6 bytecode instructions, as does the inner loop calling it. Of these, 4 instructions must finish as a unit, meaning there are 3 possible switching points that can corrupt the result. Your entire thread_task function takes about 0.015 s to 0.020 s (on my system).
With the old GIL switching every 100 instructions, the loop is guaranteed to be interrupted every 8.3 calls, or roughly 1200 times. With the new GIL switching every 5ms, the loop is interrupted only 3 times.
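This is also why Lock still matters in Python 3: the GIL only makes each individual bytecode atomic, not the whole load/add/store sequence of x += 1. A minimal sketch of the locked version of the example (my own addition, not part of the original answers):
import threading

x = 0
x_lock = threading.Lock()

def increment():
    # Holding the lock makes the LOAD/INPLACE_ADD/STORE sequence run as a unit.
    global x
    with x_lock:
        x += 1

def thread_task():
    for _ in range(100000):
        increment()

t1 = threading.Thread(target=thread_task)
t2 = threading.Thread(target=thread_task)
t1.start()
t2.start()
t1.join()
t2.join()
print(x)  # 200000 every time, regardless of the switch interval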

Related

Factorial calculation with threading takes too long to finish, How to fix it?

I am trying to calculate whether a given number is prime or not with the formula:
(n-1)! mod n =? (n-1)
I must calculate the factorial with different threads working simultaneously, check whether they have all finished, and if so join them. That way I calculate the factorial across threads and can then take the modulo. However, even though my code works fine with small prime numbers, it takes too long to execute when the number is big. I searched my code and couldn't really find an alternative that would bring the execution time down. Here is my code:
import threading
import time

# GLOBAL VARIABLE
result = 1

# worker class in order to multiply on threads
class Worker:
    # initiating the worker class
    def __init__(self):
        super().__init__()
        self.jobs = []

    # the function that does the actual multiplying
    def multiplier(self, beg, end):
        global result
        for i in range(beg, end + 1):
            result *= i
            #print("\tresult updated with *{}:".format(i), result)
        #print("Calculating from {} to {}".format(beg, end), " : ", result)

    # appending threads to the object
    def append_job(self, job):
        self.jobs.append(job)

    # function that is to see the threads
    def see_jobs(self):
        return self.jobs

    # initiating the threads
    def initiate(self):
        for j in self.jobs:
            j.start()

    # finalizing and joining the threads
    def finalize(self):
        for j in self.jobs:
            j.join()

    # controlling the threads by blocking them until all threads are asleep
    def work(self):
        while True:
            if 0 == len([t for t in self.jobs if t.is_alive()]):
                self.finalize()
                break

# this is the function to split the factorial into several threads
def splitUp(n, t):
    # defining the remainder and the whole
    remainder, whole = (n - 1) % t, (n - 1) // t
    # deciding to tuple count
    tuple_count = whole if remainder == 0 else whole + 1
    # empty result list
    result = []
    # iterating
    beginning = 1
    end = (n - 1) // t
    for i in range(1, tuple_count + 1):
        if i == tuple_count:
            result.append((beginning, n - 1))  # if we are at the end, just append all to end
        else:
            result.append((beginning, end * i))
        beginning = end * i + 1
    return result

if __name__ == "__main__":
    threads = 64
    number = 743
    splitted = splitUp(number, threads)
    worker = Worker()
    #print(worker.see_jobs())
    s = time.time()
    # creating the threads
    for arg in splitted:
        thread = threading.Thread(target=worker.multiplier(arg[0], arg[1]))
        worker.append_job(thread)
    worker.initiate()
    worker.work()
    e = time.time()
    print("result found with {} threads in {} secs\n".format(threads, e - s))
    if result % number == number - 1:
        print("PRIME")
    else:
        print("NOT PRIME")

"""
-------------------- REPORT ------------------------
result found with 2 threads in 6.162530899047852 secs
result found with 4 threads in 0.29897499084472656 secs
result found with 16 threads in 0.009003162384033203 secs
result found with 32 threads in 0.0060007572174072266 secs
result found with 64 threads in 0.0029952526092529297 secs
note that: these results may differ from machine to machine
-------------------------------------------------------
"""
Thanks in advance.
First and foremost, you have a critical error in your code that you haven't reported or tried to trace:
======================== RESTART: ========================
result found with 2 threads in 5.800899267196655 secs
NOT PRIME
>>>
======================== RESTART: ========================
result found with 64 threads in 0.002002716064453125 secs
PRIME
>>>
As the old saying goes, "if the program doesn't work, it doesn't matter how fast it is".
The only test case you've given is 743; if you want help to diagnose the logic error, please determine the minimal test and parallelism that causes the error, and post a separate question.
I suspect that it's your multiplier function: you're working with an ill-advised global variable in parallel processing, and your multiply operation is not thread-safe.
In assembly terms, you have an unguarded region:
LOAD result
MUL i
STORE result
If this is interleaved with the same work from another thread, the result is virtually guaranteed to be wrong. You have to make this a critical region.
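A minimal sketch of one way to guard it with a threading.Lock (my own illustration; the result_lock name is an assumption, not code from the question):
import threading

result = 1
result_lock = threading.Lock()

def multiplier(beg, end):
    global result
    # Compute the chunk's product locally, without touching shared state.
    partial = 1
    for i in range(beg, end + 1):
        partial *= i
    # Only the read-modify-write of the shared result is the critical region.
    with result_lock:
        result *= partial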
Once you fix that, you still have your speed problem. Factorial is the poster-child for recursion acceleration. I see two obvious accelerants:
Instead of that horridly slow multiplication loop, use functools.reduce to blast through your multiplication series.
If you're going to loop the program with a series of inputs, then short-cut most of the calculations with memoization. The example on the linked page benefits greatly from multiple-recursion; since factorial is linear, you'd need repeated application to take advantage of the technique.
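To illustrate the first point, here is a rough sketch that computes each chunk's product with functools.reduce instead of repeatedly updating a shared global (the simple chunking below is my own stand-in for the question's splitUp, and the whole snippet is an illustration of the idea rather than the answerer's code):
from functools import reduce
from operator import mul

def chunk_product(beg, end):
    # Product of beg * (beg+1) * ... * end, with no shared state involved.
    return reduce(mul, range(beg, end + 1), 1)

number = 743
step = 64
# Split 1..number-1 into ranges of at most `step` consecutive values.
bounds = [(b, min(b + step - 1, number - 1)) for b in range(1, number, step)]
partials = [chunk_product(b, e) for b, e in bounds]
total = reduce(mul, partials, 1)  # combine the per-chunk products
print("PRIME" if total % number == number - 1 else "NOT PRIME")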

Thread locking failing in dead-simple example

This is the simplest toy example. I know about concurrent.futures and higher-level code; I'm picking the toy example because I'm teaching it (as part of the same material as the high-level stuff).
It steps on counter from different threads, and I get... well, here it is even weirder. Usually I get a counter less than it should be (e.g. less than 5M), often much less, like 20k. But as I decrease the number of loops, at some number like 1000 it is consistently right. Then at some intermediate number I get almost the right value, occasionally correct, but once in a while slightly larger than the product of nthread x nloop. I am running it repeatedly in a Jupyter cell, but the first line really should reset counter to zero, not keep any old total.
import threading
from threading import Thread

lock = threading.Lock()
counter, nthread, nloop = 0, 100, 50_000

def increment(n, lock):
    global counter
    for _ in range(n):
        lock.acquire()
        counter += 1
        lock.release()

for _ in range(nthread):
    t = Thread(target=increment, args=(nloop, lock))
    t.start()

print(f"{nloop:,} loops X {nthread:,} threads -> counter is {counter:,}")
If I add .join() the behavior changes, but is still not correct. For example, in the version that doesn't try to lock:
counter, nthread, nloop = 0, 100, 50_000

def increment(n):
    global counter
    for _ in range(n):
        counter += 1

for _ in range(nthread):
    t = Thread(target=increment, args=(nloop,))
    t.start()
t.join()

print(f"{nloop:,} loops X {nthread:,} threads -> counter is {counter:,}")
# --> 50,000 loops X 100 threads -> counter is 5,022,510
The exact overcount varies, but I see something like that repeatedly.
I don't really want to .join() in the lock example, because I want to illustrate the idea of a background job. But I can wait for the aliveness of the thread (thank you Frank Yellin!), and that fixes the lock case. The overcount still troubles me though.
You're not waiting until all your threads are done before looking at counter. That's also why you're getting your result so quickly.
threads = []
for _ in range(nthread):
    t = threading.Thread(target=increment, args=(nloop, lock))
    t.start()
    threads.append(t)

for thread in threads:
    thread.join()

print(f"{nloop:,} loops X {nthread:,} threads -> counter is {counter:,}")
prints out the expected result.
50,000 loops X 100 threads -> counter is 5,000,000
Updated. I highly recommend using ThreadPoolExecutor() instead, which takes care of tracking the threads for you.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    for _ in range(nthread):
        executor.submit(increment, nloop, lock)

print(...)
will give you the answer you want, and takes care of waiting for the threads.
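For completeness, a self-contained sketch of that version, assembled from the snippets above (the names mirror the question; treat it as an illustration rather than the answerer's exact code):
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()
counter, nthread, nloop = 0, 100, 50_000

def increment(n, lock):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

# Leaving the with-block implicitly waits for every submitted task to finish.
with ThreadPoolExecutor() as executor:
    for _ in range(nthread):
        executor.submit(increment, nloop, lock)

print(f"{nloop:,} loops X {nthread:,} threads -> counter is {counter:,}")
# --> 50,000 loops X 100 threads -> counter is 5,000,000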

How do I increase variable value when multithreading in python

I am trying to make a web scraper faster with multithreading. I want the value to increase on every execution, but sometimes the value skips or repeats itself.
import threading

num = 0

def scan():
    while True:
        global num
        num += 1
        print(num)
        open('logs.txt', 'a').write(str(f'{num}\n'))

for x in range(500):
    threading.Thread(target=scan).start()
Result:
2
2
5
5
7
8
10
10
12
13
13
13
16
17
19
19
22
23
24
25
26
28
29
29
31
32
33
34
Expected result:
1
2
3
4
5
6
7
8
9
10
Since the variable num is a shared resource, you need to put a lock on it. This is done as follows:
num_lock = threading.Lock()
Every time you want to update the shared variable, your thread first needs to acquire the lock. Once the lock is acquired, only that thread can update the value of num; no other thread can do so until the current thread releases the lock.
Ensure that you use a with statement or a try-finally block while doing this, to guarantee that the lock is released even if the current thread fails to update the shared variable.
Something like this:
num_lock.acquire()
try:
    num += 1
finally:
    num_lock.release()
using with:
with num_lock:
    num += 1
Seems like a race condition. You could use a lock so that only one thread can get a particular number. It would also make sense to use a lock for writing to the output file.
Here is an example with locks. The order in which the output is written is of course not guaranteed, but every item should be written exactly once. In this example I added a limit of 10000 so that you can more easily check in a test that everything eventually gets written; otherwise, at whatever point you interrupt it, it is harder to tell whether a number was skipped or was just waiting on a lock to write its output.
my_num is not shared, so once you have claimed it inside the with num_lock section, you are free to release that lock (which protects the shared num) and keep using my_num outside the with block while other threads acquire the lock to claim their own values. This minimises the time the lock is held.
import threading

num = 0
num_lock = threading.Lock()
file_lock = threading.Lock()

def scan():
    global num_lock, file_lock, num
    while num < 10000:
        with num_lock:
            num += 1
            my_num = num
        # do whatever you want here using my_num
        # but do not touch num
        with file_lock:
            open('logs.txt', 'a').write(str(f'{my_num}\n'))

threads = [threading.Thread(target=scan) for _ in range(500)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
An important callout in addition to threading.Lock:
Use join to make the parent thread wait for forked threads to complete.
Without this, threads would still race.
Suppose I'm using the num after threads complete:
import threading

lock, num = threading.Lock(), 0

def operation():
    global num
    print("Operation has started")
    with lock:
        num += 1

threads = [threading.Thread(target=operation) for x in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(num)
Without join, the output is inconsistent (occasionally 9 is printed, usually 10):
Operation has started
Operation has started
Operation has started
Operation has started
Operation has startedOperation has started
Operation has started
Operation has started
Operation has started
Operation has started9
With join, it's consistent:
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
10

How to use multiprocessing to parallelize two calls to the same function, with different arguments, in a for loop?

In a for loop, I am calling a function twice, with different argument sets (argSet1, argSet2) that change on each iteration of the loop. I want to parallelize this operation, since one set of arguments makes the function run fast and the other makes it run slowly. Note that I do not want two for loops for this operation. I also have another requirement: each of these calls performs some parallel work of its own, so because of my limited computational resources I do not want the function running more than once with either argSet1 or argSet2 at any time. Making sure the function is running with both argument sets at once will help me utilize the CPU cores as much as possible. Here's how I do it normally, without parallelization:
def myFunc(arg1, arg2):
    if arg1:
        print('do something that does not take too long')
    else:
        print('do something that takes long')

for i in range(10):
    argSet1 = arg1Storage[i]
    argSet2 = arg2Storage[i]
    myFunc(*argSet1)
    myFunc(*argSet2)
This definitely does not take advantage of the computational resources that I have. Here's my attempt to parallelize the operations:
from multiprocessing import Process

def myFunc(arg1, arg2):
    if arg1:
        print('do something that does not take too long')
    else:
        print('do something that takes long')

for i in range(10):
    argSet1 = arg1Storage[i]
    argSet2 = arg2Storage[i]
    p1 = Process(target=myFunc, args=argSet1)
    p1.start()
    p2 = Process(target=myFunc, args=argSet2)
    p2.start()
However, this way each function with its respective arguments is called 10 times and things become extremely slow. Given my limited knowledge of multiprocessing, I tried to improve things a bit by adding p1.join() and p2.join() at the end of the for loop, but this still causes a slowdown, since p1 finishes much faster and everything waits until p2 is done. I also thought about using multiprocessing.Value to do some communication with the functions, but then I would have to add a while loop inside the function for each of the calls, which slows everything down again. I wonder if someone can offer a practical solution?
Since I built this answer in patches, scroll down for the best solution to this problem.
You need to specify exactly how you want things to run. As far as I can tell, you want at most two processes running at once, but also at least two. Also, you do not want the heavy call to hold up the fast ones. One simple, non-optimal way to run is:
from multiprocessing import Process

def func(counter, somearg):
    j = 0
    for i in range(counter): j += i
    print(somearg)

def loop(counter, arglist):
    for i in range(10):
        func(counter, arglist[i])

heavy = Process(target=loop, args=[1000000, ['heavy' + str(i) for i in range(10)]])
light = Process(target=loop, args=[500000, ['light' + str(i) for i in range(10)]])
heavy.start()
light.start()
heavy.join()
light.join()
The output here is (for one example run):
light0
heavy0
light1
light2
heavy1
light3
light4
heavy2
light5
light6
heavy3
light7
light8
heavy4
light9
heavy5
heavy6
heavy7
heavy8
heavy9
You can see the last part is sub-optimal, since you have a sequence of heavy runs - which means there is one process instead of two.
An easy way to optimize this is available if you can estimate how much longer the heavy process runs. If it's twice as slow, as here, just run 7 iterations of heavy first, join the light process, and have it run the additional 3.
Another way is to split the heavy work into a pair of processes, so at first you have 3 processes until the fast one ends, and then continue with 2.
The main point is separating the heavy and light calls into different processes entirely, so that while the fast calls complete one after the other quickly, you can work on your slow stuff. Once the fast ones end, it's up to you how elaborate you want to get, but I think for now estimating how to break up the heavy calls is good enough. This is it for my example:
from multiprocessing import Process

def func(counter, somearg):
    j = 0
    for i in range(counter): j += i
    print(somearg)

def loop(counter, amount, arglist):
    for i in range(amount):
        func(counter, arglist[i])

heavy1 = Process(target=loop, args=[1000000, 7, ['heavy1' + str(i) for i in range(7)]])
light = Process(target=loop, args=[500000, 10, ['light' + str(i) for i in range(10)]])
heavy2 = Process(target=loop, args=[1000000, 3, ['heavy2' + str(i) for i in range(7, 10)]])
heavy1.start()
light.start()
light.join()
heavy2.start()
heavy1.join()
heavy2.join()
with output:
light0
heavy10
light1
light2
heavy11
light3
light4
heavy12
light5
light6
heavy13
light7
light8
heavy14
light9
heavy15
heavy27
heavy16
heavy28
heavy29
Much better utilization. You can of course make this more advanced by sharing a queue for the slow process runs, so that when the fast ones are done they can join as workers on the slow queue; for only two different calls this may be overkill (though not much harder using the queue). The best solution:
from multiprocessing import Queue, Process
import queue

def func(index, counter, somearg):
    j = 0
    for i in range(counter): j += i
    print("Worker", index, ':', somearg)

def worker(index):
    try:
        while True:
            func, args = q.get(block=False)
            func(index, *args)
    except queue.Empty: pass

q = Queue()
for i in range(10):
    q.put((func, (500000, 'light' + str(i))))
    q.put((func, (1000000, 'heavy' + str(i))))

nworkers = 2
workers = []
for i in range(nworkers):
    workers.append(Process(target=worker, args=(i,)))
    workers[-1].start()
q.close()
for worker in workers:
    worker.join()
This is the best and most scalable solution for what you want. Output:
Worker 0 : light0
Worker 0 : light1
Worker 1 : heavy0
Worker 1 : light2
Worker 0 : heavy1
Worker 0 : light3
Worker 1 : heavy2
Worker 1 : light4
Worker 0 : heavy3
Worker 0 : light5
Worker 1 : heavy4
Worker 1 : light6
Worker 0 : heavy5
Worker 0 : light7
Worker 1 : heavy6
Worker 1 : light8
Worker 0 : heavy7
Worker 0 : light9
Worker 1 : heavy8
Worker 0 : heavy9
You might want to use a multiprocessing.Pool of processes and map your myFunc into it, like so:
from multiprocessing import Pool
import time

def myFunc(arg1, arg2):
    if arg1:
        print('do something that does not take too long')
        time.sleep(0.01)
    else:
        print('do something that takes long')
        time.sleep(1)

def wrap(args):
    return myFunc(*args)

if __name__ == "__main__":
    p = Pool()
    argStorage = [(True, False), (False, True)] * 12
    p.map(wrap, argStorage)
I added a wrap function, since the function passed to p.map must accept a single argument. You could just as well adapt myFunc to accept a tuple, if that's possible in your case.
My sample argStorage consists of 24 items, where 12 of them take 1 sec to process and 12 are done in 10 ms. In total, this script runs in 3-4 seconds (I have 4 cores).
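As a side note, if adapting myFunc is not convenient, Pool.starmap (available since Python 3.3) unpacks each argument tuple for you, which removes the need for the wrap helper. A small sketch of that variant, reusing the myFunc defined above (my own suggestion, not part of the original answer):
from multiprocessing import Pool

if __name__ == "__main__":
    argStorage = [(True, False), (False, True)] * 12
    with Pool() as p:
        # starmap calls myFunc(*args) for each tuple in argStorage.
        p.starmap(myFunc, argStorage)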
One possible implementation could be as follows:
import concurrent.futures
import math

list_of_args = [arg1, arg2]

def my_func(arg):
    ....
    print('do something that takes long')

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for arg, result in zip(list_of_args, executor.map(my_func, list_of_args)):
            print('my_func({0}) => {1}'.format(arg, result))
executor.map is like the built-in map function: it makes multiple calls to the provided function, passing each item of the iterable to that function.

How can I prevent values from overlapping in a Python multiprocessing?

I'm trying Python multiprocessing, and I want to use Lock to avoid overlapping variable 'es_id' values.
According to the theory and the examples, while a process holds the lock, 'es_id' shouldn't overlap because no other process can access it, but the results show that es_id often overlaps.
How can I keep the id values from overlapping?
Part of my code is:
def saveDB(imgName, imgType, imgStar, imgPull, imgTag, lock):  #lock=Lock() in main
    imgName=NameFormat(imgName)  #name/subname > name:subname
    i=0
    while i < len(imgName):
        lock.acquire()  #since global es_id
        global es_id
        print "getIMG.pt:save information about %s"%(imgName[i])
        cmd="curl -XPUT http://localhost:9200/kimhk/imgName/"+str(es_id)+" -d '{" +\
            '"image_name":"'+imgName[i]+'", '+\
            '"image_type":"'+imgType[i]+'", '+\
            '"image_star":"'+imgStar[i]+'", '+\
            '"image_pull":"'+imgPull[i]+'", '+\
            '"image_Tag":"'+",".join(imgTag[i])+'"'+\
            "}'"
        try:
            subprocess.call(cmd,shell=True)
        except subprocess.CalledProcessError as e:
            print e.output
        i+=1
        es_id+=1
        lock.release()

...

#main
if __name__ == "__main__":
    lock = Lock()
    exPg, proc_num=option()
    procs=[]
    pages=[ [] for i in range(proc_num)]
    i=1

    #Use Multiprocessing to get HTML data quickly
    if proc_num >= exPg:  #if page is less than proc_num, don't need to distribute the page to the process.
        while i<=exPg:
            page=i
            proc=Process(target=getExplore, args=(page,lock,))
            procs.append(proc)
            proc.start()
            i+=1
    else:
        while i<=exPg:  #distribute the page to the process
            page=i
            index=(i-1)%proc_num  #if proc_num=4 -> 0 1 2 3
            pages[index].append(page)
            i+=1
        i=0
        while i<proc_num:
            proc=Process(target=getExplore, args=(pages[i],lock,))#
            procs.append(proc)
            proc.start()
            i+=1

    for proc in procs:
        proc.join()
Execution result screen (screenshot not included):
The result shown is the output of subprocess.call(cmd, shell=True). I use XPUT to add data to Elasticsearch, and es_id is the id of the data. I want these ids to increase sequentially without overlapping (because if two entries share an id, one overwrites the other).
I know XPOST doesn't need the lock code because it generates an ID automatically, but I need to access all the data sequentially in the future (like reading a file one line at a time).
If you know how to access all the data sequentially after using XPOST, can you tell me?
It looks like you are trying to access a global variable with a lock, but global variables are different instances between processes. What you need to use is a shared memory value. Here's a working example. It has been tested on Python 2.7 and 3.6:
from __future__ import print_function
import multiprocessing as mp

def process(counter):
    # Increment the counter 3 times.
    # Hold the counter's lock for read/modify/write operations.
    # Keep holding it so the value doesn't change before printing,
    # and keep prints from multiple processes from trying to write
    # to a line at the same time.
    for _ in range(3):
        with counter.get_lock():
            counter.value += 1
            print(mp.current_process().name, counter.value)

def main():
    counter = mp.Value('i')  # shared integer
    processes = [mp.Process(target=process, args=(counter,)) for i in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

if __name__ == '__main__':
    main()
Output:
Process-2 1
Process-2 2
Process-1 3
Process-3 4
Process-2 5
Process-1 6
Process-3 7
Process-1 8
Process-3 9
You've only given part of your code, so I can only see a potential problem. It doesn't do any good to lock-protect one access to es_id. You must lock-protect them all, anywhere they occur in the program. Perhaps it is best to create an access function for this purpose, like:
def increment_es_id():
    global es_id
    lock.acquire()
    es_id += 1
    lock.release()
This can be called safely from any thread.
In your code, it's good practice to keep the acquire/release calls as close together as you can. Here you only need to protect one variable, so you can move the acquire/release pair to just before and after the es_id += 1 statement.
Even better is to use the lock in a context manager (although in this simple case it won't make any difference):
def increment_es_id2():
    global es_id
    with lock:
        es_id += 1
