I am trying to build a web scraper with multithreading to make it faster. I want the value to increase on every execution, but sometimes the value skips or repeats itself.
import threading

num = 0

def scan():
    while True:
        global num
        num += 1
        print(num)
        open('logs.txt', 'a').write(str(f'{num}\n'))

for x in range(500):
    threading.Thread(target=scan).start()
Result:
2
2
5
5
7
8
10
10
12
13
13
13
16
17
19
19
22
23
24
25
26
28
29
29
31
32
33
34
Expected result:
1
2
3
4
5
6
7
8
9
10
Since the variable num is a shared resource, you need to put a lock on it. This is done as follows:
num_lock = threading.Lock()
Every time you want to update the shared variable, your thread must first acquire the lock. While it holds the lock, only that thread can update the value of num; no other thread can do so until the lock is released.
Make sure you use a with statement or a try-finally block here, to guarantee that the lock is released even if the current thread fails while updating the shared variable.
Something like this:
num_lock.acquire()
try:
    num += 1
finally:
    num_lock.release()
Or, using a with statement:
with num_lock:
    num += 1
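Putting that together with your scan function, a minimal sketch could look like this (I kept the print and the file write inside the locked block so the logged values also stay in order; that makes the critical section larger, which is a trade-off, and the loop still runs forever like your original):

import threading

num = 0
num_lock = threading.Lock()

def scan():
    global num
    while True:
        with num_lock:                       # only one thread may run this block at a time
            num += 1
            print(num)
            with open('logs.txt', 'a') as f:
                f.write(f'{num}\n')

for x in range(500):
    threading.Thread(target=scan).start()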
This looks like a race condition. You could use a lock so that only one thread can claim a particular number. It would also make sense to use a lock for writing to the output file.
Here is an example with a lock. The order in which the output is written is of course not guaranteed, but every item should be written exactly once. In this example I added a limit of 10000 so that you can more easily check in the test code that everything is eventually written; otherwise, at whatever point you interrupt it, it is harder to tell whether a number was skipped or was just waiting on a lock to write its output.
my_num is not shared, so after you have claimed it inside the with num_lock block you are free to release that lock (which protects the shared num) and keep using my_num outside the with, while other threads acquire the lock to claim their own values. This minimises the time the lock is held.
import threading

num = 0
num_lock = threading.Lock()
file_lock = threading.Lock()

def scan():
    global num_lock, file_lock, num
    while num < 10000:
        with num_lock:
            num += 1
            my_num = num
        # do whatever you want here using my_num
        # but do not touch num
        with file_lock:
            open('logs.txt', 'a').write(str(f'{my_num}\n'))

threads = [threading.Thread(target=scan) for _ in range(500)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
An important callout in addition to threading.Lock:
Use join to make the parent thread wait for the spawned threads to complete.
Without it, the main thread can still race with the workers when it reads the shared value.
Suppose I'm using num after the threads complete:
import threading

lock, num = threading.Lock(), 0

def operation():
    global num
    print("Operation has started")
    with lock:
        num += 1

threads = [threading.Thread(target=operation) for x in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(num)
Without join, the result is inconsistent (sometimes 9 is printed instead of 10):
Operation has started
Operation has started
Operation has started
Operation has started
Operation has startedOperation has started
Operation has started
Operation has started
Operation has started
Operation has started9
With join, it's consistent:
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
10
I have a function that I run on a number of data frames. I'd like to be able to organize this a bit better and add in a timeout statement. I'm a bit new to this...
Organize data frames:
d = {}
dfs = [1, 2, 3]
for name in dfs:
    d[name] = some_function()
Then I want to organize these into a clean loop that checks how long each df takes to run. So it would run df_1 and df_2 because they take 5 seconds, but would skip df_3 and print that it took x number of seconds.
def timeout():
    # check how long df_1, df_2, df_3 take and, if any takes longer than 30 seconds, print the df name
You could use a Timer (from the threading module), but your loop must cooperate and stop when the time has expired. This could also be done by checking the elapsed time at each iteration, but I believe the Timer approach incurs less overhead.
Assuming you are using an iterator for the loop, you can define a generic timeout generator to force it to stop after a given number of seconds:
from threading import Timer

def timeout(maxTime, iterator):
    stop = False
    def timedOut():              # function called when the timer expires
        nonlocal stop
        stop = True
    t = Timer(maxTime, timedOut)
    t.start()
    for r in iterator:
        if stop:
            break                # you could print the content of r here if needed
        yield r
    t.cancel()
Example usage and output:
for i in timeout(3, range(1000)):
    for _ in range(10000000): pass
    print(i)
1
2
3
4
5
6
7
8 # it stopped here after 3 seconds with 992 iterations to go
In your example, this could be:
d = {}
dfs = [1, 2, 3]
for name in timeout(5, dfs):
    d[name] = some_function()
Note that this will stop the loop on dfs when the total processing time exceeds 5 seconds, but it cannot interrupt what is going on inside some_function() if that call alone exceeds the total time.
If you need a timeout that is not tied to a specific loop, you can create an instance of a Timer that you store in a global variable or in a singleton class and check its state at appropriate places in your code:
t = Timer(25, lambda: None)      # the timer does nothing when it expires
t.start()                        # but it will no longer be alive after 25 seconds
...
# at appropriate places in your code
if not t.is_alive(): return      # or break or continue ...
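For completeness, here is a minimal self-contained sketch of that pattern (the 2-second budget and the fake work loop are only placeholders for illustration):

from threading import Timer
import time

t = Timer(2, lambda: None)        # no-op callback; we only care whether the timer is still alive
t.start()

def process_items(items):
    for item in items:
        if not t.is_alive():      # the 2-second budget has expired
            print('timed out before item', item)
            return
        time.sleep(0.5)           # stand-in for real work
        print('processed item', item)

process_items(range(10))
t.cancel()                        # harmless if the timer has already fired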
I've read an article about multithreading in Python where they use synchronization to solve a race condition. I ran the example code below to reproduce the race condition:
import threading

# global variable x
x = 0

def increment():
    """
    function to increment global variable x
    """
    global x
    x += 1

def thread_task():
    """
    task for thread
    calls increment function 100000 times.
    """
    for _ in range(100000):
        increment()

def main_task():
    global x
    # setting global variable x as 0
    x = 0
    # creating threads
    t1 = threading.Thread(target=thread_task)
    t2 = threading.Thread(target=thread_task)
    # start threads
    t1.start()
    t2.start()
    # wait until threads finish their job
    t1.join()
    t2.join()

if __name__ == "__main__":
    for i in range(10):
        main_task()
        print("Iteration {0}: x = {1}".format(i, x))
It reproduces the same result as the article when I'm using Python 2.7.15, but not with Python 3.6.9 (every iteration returns the same result, x = 200000).
I wonder: did the new implementation of the GIL (since Python 3.2) handle the race condition? If it did, why do Lock and Mutex still exist in Python > 3.2? If it didn't, why is there no conflict when running multiple threads that modify a shared resource, as in the example above?
I've been struggling with these questions while trying to understand more about how Python really works under the hood.
The change you are referring to replaced the check interval with a switch interval. This means that rather than switching threads every 100 bytecodes, the interpreter switches every 5 milliseconds.
Refs: https://pymotw.com/3/sys/threads.html and https://mail.python.org/pipermail/python-dev/2009-October/093321.html
So if your code runs fast enough, it never experiences a thread switch, and it may appear to you that the operations are atomic when they are in fact not. The race condition did not appear because there was no actual interleaving of threads. x += 1 is actually four bytecodes:
>>> dis.dis(sync.increment)
11 0 LOAD_GLOBAL 0 (x)
3 LOAD_CONST 1 (1)
6 INPLACE_ADD
7 STORE_GLOBAL 0 (x)
10 LOAD_CONST 2 (None)
13 RETURN_VALUE
A thread switch in the interpreter can occur between any two bytecodes.
Consider that in 2.7 the following always prints 200000, because the check interval is set so high that each thread completes in its entirety before the next runs. The same can be constructed with the switch interval in Python 3.
import sys
import threading

print(sys.getcheckinterval())
sys.setcheckinterval(1000000)

# global variable x
x = 0

def increment():
    """
    function to increment global variable x
    """
    global x
    x += 1

def thread_task():
    """
    task for thread
    calls increment function 100000 times.
    """
    for _ in range(100000):
        increment()

def main_task():
    global x
    # setting global variable x as 0
    x = 0
    # creating threads
    t1 = threading.Thread(target=thread_task)
    t2 = threading.Thread(target=thread_task)
    # start threads
    t1.start()
    t2.start()
    # wait until threads finish their job
    t1.join()
    t2.join()

if __name__ == "__main__":
    for i in range(10):
        main_task()
        print("Iteration {0}: x = {1}".format(i, x))
The GIL protects individual bytecode instructions. In contrast, a race condition is an incorrect ordering of instructions, which spans multiple bytecode instructions. As a result, the GIL cannot protect against race conditions outside of the Python VM itself.
However, by their very nature race conditions do not always trigger. Certain GIL strategies are more or less likely to trigger certain race conditions. A thread shorter than the GIL window is never interrupted, and one longer than the GIL window is always interrupted.
Your increment function has 6 bytecode instructions, as does the inner loop calling it. Of these, 4 instructions must complete without interruption, meaning there are 3 possible switching points that corrupt the result. Your entire thread_task function takes about 0.015s to 0.020s (on my system).
With the old GIL switching every 100 instructions, the loop is guaranteed to be interrupted every 8.3 calls, or roughly 12,000 times in total. With the new GIL switching every 5 ms, the loop is interrupted only about 3 times.
I'm trying Python multiprocessing, and I want to use Lock to avoid overlapping variable 'es_id' values.
According to theory and examples, when a process has acquired the lock, 'es_id' can't overlap because no other process can access it, but the results show that es_id often overlaps.
How can I keep the id values from overlapping?
Part of my code is:
def saveDB(imgName, imgType, imgStar, imgPull, imgTag, lock):  # lock=Lock() in main
    imgName = NameFormat(imgName)  # name/subname > name:subname
    i = 0
    while i < len(imgName):
        lock.acquire()  # since global es_id
        global es_id
        print "getIMG.pt:save information about %s" % (imgName[i])
        cmd = "curl -XPUT http://localhost:9200/kimhk/imgName/" + str(es_id) + " -d '{" + \
              '"image_name":"' + imgName[i] + '", ' + \
              '"image_type":"' + imgType[i] + '", ' + \
              '"image_star":"' + imgStar[i] + '", ' + \
              '"image_pull":"' + imgPull[i] + '", ' + \
              '"image_Tag":"' + ",".join(imgTag[i]) + '"' + \
              "}'"
        try:
            subprocess.call(cmd, shell=True)
        except subprocess.CalledProcessError as e:
            print e.output
        i += 1
        es_id += 1
        lock.release()
...
#main
if __name__ == "__main__":
    lock = Lock()
    exPg, proc_num = option()
    procs = []
    pages = [[] for i in range(proc_num)]
    i = 1
    # Use multiprocessing to get HTML data quickly
    if proc_num >= exPg:  # if the page count is less than proc_num, no need to distribute the pages across processes
        while i <= exPg:
            page = i
            proc = Process(target=getExplore, args=(page, lock,))
            procs.append(proc)
            proc.start()
            i += 1
    else:
        while i <= exPg:  # distribute the pages to the processes
            page = i
            index = (i - 1) % proc_num  # if proc_num=4 -> 0 1 2 3
            pages[index].append(page)
            i += 1
        i = 0
        while i < proc_num:
            proc = Process(target=getExplore, args=(pages[i], lock,))
            procs.append(proc)
            proc.start()
            i += 1
    for proc in procs:
        proc.join()
Execution result (screenshot not included):
The result is the output of subprocess.call(cmd, shell=True). I use XPUT to add data to Elasticsearch, and es_id is the id of the data. I want these ids to increase sequentially without overlapping (because overlapping ids overwrite each other's data).
I know XPOST doesn't need the lock code because it automatically generates an id, but I need to access all the data sequentially in the future (like reading a file line by line).
If you know how to access all the data sequentially after using XPOST, can you tell me?
It looks like you are trying to access a global variable with a lock, but global variables are different instances between processes. What you need to use is a shared memory value. Here's a working example. It has been tested on Python 2.7 and 3.6:
from __future__ import print_function
import multiprocessing as mp

def process(counter):
    # Increment the counter 3 times.
    # Hold the counter's lock for read/modify/write operations.
    # Keep holding it so the value doesn't change before printing,
    # and keep prints from multiple processes from trying to write
    # to a line at the same time.
    for _ in range(3):
        with counter.get_lock():
            counter.value += 1
            print(mp.current_process().name, counter.value)

def main():
    counter = mp.Value('i')  # shared integer
    processes = [mp.Process(target=process, args=(counter,)) for i in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

if __name__ == '__main__':
    main()
Output:
Process-2 1
Process-2 2
Process-1 3
Process-3 4
Process-2 5
Process-1 6
Process-3 7
Process-1 8
Process-3 9
You've only given part of your code, so I can only see a potential problem. It doesn't do any good to lock-protect one access to es_id. You must lock-protect them all, anywhere they occur in the program. Perhaps it is best to create an access function for this purpose, like:
def increment_es_id():
    global es_id
    lock.acquire()
    es_id += 1
    lock.release()
This can be called safely from any thread.
In your code, it's good practice to move the acquire/release calls as close together as you can make them. Here you only need to protect one variable, so you can move the acquire/release pair to just before and after the es_id += 1 statement.
Even better is to use the lock as a context manager (although in this simple case it won't make any difference):
def increment_es_id2():
    global es_id
    with lock:
        es_id += 1
I would like to understand how to use Python threading and queue. My goal is to have 40 threads always alive. This is my code:
for iteration in iterations:  # main iteration
    dance = 1
    if threads >= len(MAX_VALUE_ITERATION):
        threads = len(MAX_VALUE_ITERATION) - 1  # adjust the number of threads because in this iteration I have just x subvalues
    else:
        threads = threads_saved  # recover the settings or the passed argument
    while dance <= 5:  # iterate from 1 to 5
        request = 0
        for lol in MAX_LOL:  # lol iterate
            for thread_n in range(threads):  # MAX threads
                t = threading.Thread(target=do_something)
                t.setDaemon(True)
                t.start()
                request += 1
            main_thread = threading.currentThread()
            for t in threading.enumerate():
                if t is main_thread:
                    continue
                if request < len(MAX_LOL) - 1 and settings.errors_count <= MAX_ERR_COUNT:
                    t.join()
        dance += 1
The code you see here was cleaned up because the original was too long for you to debug, so I tried to simplify it a little.
As you can see there are many iterations: I start from a db query and fetch the result into the list (iterations).
Next I adjust the maximum number of threads allowed.
Then I iterate again from 1 to 5 (it's an argument passed to the small thread).
Then, inside the value fetched from the query iteration, there is a JSON that contains another list I need to iterate over as well...
And then finally I start the threads with start and join...
The script opens x threads and then, when they (all or almost all) finish, it opens more threads... but my goal is to keep a maximum of X threads alive forever; I mean, once one thread finishes another has to spawn, and so on, so the maximum number of threads is always in use.
I hope you can help me.
Thanks
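For reference, the usual pattern for keeping a fixed number of threads busy is a pool of long-lived workers pulling jobs from a queue.Queue, rather than spawning a new thread per task. A rough sketch of that pattern (the worker count, the job values, and the do_something placeholder are illustrative assumptions, not the real code from the question):

import threading
import queue

NUM_WORKERS = 40          # keep this many threads alive at all times
jobs = queue.Queue()

def do_something(job):
    pass                  # placeholder for the real per-job work

def worker():
    while True:
        job = jobs.get()  # blocks until a job is available
        if job is None:   # sentinel value: time to shut down
            jobs.task_done()
            break
        try:
            do_something(job)
        finally:
            jobs.task_done()

# start the fixed pool once; the threads stay alive between jobs
workers = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

# feed it as many jobs as you like; at most NUM_WORKERS run at once
for job in range(1000):
    jobs.put(job)

jobs.join()               # wait for all queued jobs to be processed
for _ in workers:         # tell every worker to exit
    jobs.put(None)
for w in workers:
    w.join()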
I have seen 2 different answers on the thread safety of Python function attributes. Assuming a single process with possibly multiple threads, and given that functions are global, is there a definite problem with using a function attribute as static storage? No answers based on programming style preferences, please.
There's nothing really wrong with what you describe. A "static" global function object to hold your variables:
from threading import Thread

def static():
    pass
static.x = 0

def thread():
    static.x += 1
    print(static.x)

for nothing in range(10):
    Thread(target=thread).start()
Output:
1
2
3
4
5
6
7
8
9
10
That "works" because each thread executes and finishes quickly, and within the same amount of time. But let's say you have threads running for arbitrary lengths of time:
from threading import Thread
from time import sleep
from random import random

def static():
    pass
static.x = 0

def thread():
    static.x += 1
    x = static.x
    sleep(random())
    print(x)

for nothing in range(10):
    Thread(target=thread).start()
Output:
2
3
8
1
5
7
10
4
6
9
The behavior becomes undefined.
Amendment:
As tdelaney pointed out,
"[The] first example only works by luck and would fail randomly on
repeated runs..."
The first example is meant to be an illustration of how multithreading can appear to be functioning properly. It is by no means thread-safe code.
Amendment II: You may want to take a look at this question.
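If you do want to update a function attribute from several threads, the fix is the same as for any other shared global: guard the read-modify-write with a lock. A minimal sketch of that idea (not from the original answers; the lock stored on the function object is just one possible place to keep it):

from threading import Thread, Lock

def static():
    pass
static.x = 0
static.lock = Lock()      # the lock can live on the function object too

def thread():
    with static.lock:     # make the read-modify-write atomic
        static.x += 1
        x = static.x
    print(x)

for nothing in range(10):
    Thread(target=thread).start()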