I would like to understand how to use Python threading and queue. My goal is to have 40 threads always alive. This is my code:
for iteration in iterations:  # main iteration
    dance = 1
    if threads >= len(MAX_VALUE_ITERATION):
        threads = len(MAX_VALUE_ITERATION) - 1  # adjust the number of threads because this iteration only has x subvalues
    else:
        threads = threads_saved  # recover the settings or the passed argument
    while dance <= 5:  # iterate from 1 to 5
        request = 0
        for lol in MAX_LOL:  # lol iterate
            for thread_n in range(threads):  # MAX threads
                t = threading.Thread(target=do_something)
                t.setDaemon(True)
                t.start()
                request += 1
            main_thread = threading.currentThread()
            for t in threading.enumerate():
                if t is main_thread:
                    continue
                if request < len(MAX_LOL) - 1 and settings.errors_count <= MAX_ERR_COUNT:
                    t.join()
        dance += 1
The code you see here has been cleaned up because the full version would have been too long for you to debug, so I tried to simplify it a little.
As you can see there are many iterations: I start from a DB query and fetch the result into a list (iterations),
then I adjust the maximum number of threads allowed,
then I iterate again from 1 to 5 (it's an argument passed to the small thread),
then inside the value fetched from the query iteration there is a JSON that contains another list I need to iterate over again,
and finally I start the threads with start and join.
The script opens x threads and then, when they (all or almost all) finish, it opens other threads... but my goal is to keep at most X threads alive forever; I mean, once one thread finishes another has to spawn, and so on, until the max number of threads is reached.
I hope you can help me.
Thanks
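The pattern described above (a fixed number of threads that stay alive and pick up new work as soon as they finish an item) is usually built with a queue.Queue that the threads consume from, rather than by re-creating threads for every round. Below is a minimal sketch of that worker-pool idea, assuming a placeholder work() callable in place of do_something() and a flat range of items standing in for the values built from the nested iterations:

# Minimal worker-pool sketch: a fixed number of long-lived threads consume
# items from a queue. work() and the item range are placeholders.
import threading
import queue

NUM_WORKERS = 40
task_queue = queue.Queue()

def work(item):
    # placeholder for the real per-item job (do_something)
    pass

def worker():
    while True:
        item = task_queue.get()
        if item is None:              # sentinel: no more work, shut down
            task_queue.task_done()
            break
        try:
            work(item)
        finally:
            task_queue.task_done()

# start the pool once; the threads stay alive and keep pulling tasks
workers = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

# enqueue work from the nested iterations instead of spawning new threads
for item in range(1000):              # stand-in for the values fetched from the DB query
    task_queue.put(item)

task_queue.join()                     # wait until every queued item is processed

for _ in workers:                     # tell each worker to exit
    task_queue.put(None)
for w in workers:
    w.join()

The key difference from the code above is that the threads are started once and reused; work is distributed by putting items on the queue instead of spawning a new thread per item.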
Related
I am trying to make a web scraper with multithreading to make it faster. I want the value to increase on every execution, but sometimes the value skips or repeats.
import threading

num = 0

def scan():
    while True:
        global num
        num += 1
        print(num)
        open('logs.txt', 'a').write(str(f'{num}\n'))

for x in range(500):
    threading.Thread(target=scan).start()
Result:
2
2
5
5
7
8
10
10
12
13
13
13
16
17
19
19
22
23
24
25
26
28
29
29
31
32
33
34
Expected result:
1
2
3
4
5
6
7
8
9
10
Since the variable num is a shared resource, you need to put a lock on it. This is done as follows:
num_lock = threading.Lock()
Every time you want to update the shared variable, your thread first needs to acquire the lock. Once the lock is acquired, only that thread will have access to update the value of num, and no other thread will be able to do so while the current thread holds the lock.
Make sure you use a with statement or a try-finally block while doing this, to guarantee that the lock will be released even if the current thread fails to update the shared variable.
Something like this:
num_lock.acquire()
try:
    num += 1
finally:
    num_lock.release()
Or, using a with statement:
with num_lock:
    num += 1
Seems like a race condition. You could use a lock so that only one thread can get a particular number. It would also make sense to use a lock for writing to the output file.
Here is an example with lock. You do not guarantee the order in which the output is written of course, but every item should be written exactly once. In this example I added a limit of 10000 so that you can more easily check that everything is written eventually in the test code, because otherwise at whatever point you interrupt it, it is harder to verify whether a number got skipped or it was just waiting for a lock to write the output.
my_num is not shared, so after you have claimed it inside the with num_lock section, you are free to release that lock (which protects the shared num) and continue using my_num outside of the with block while other threads acquire the lock to claim their own values. This minimises the time the lock is held.
import threading

num = 0
num_lock = threading.Lock()
file_lock = threading.Lock()

def scan():
    global num_lock, file_lock, num
    while num < 10000:
        with num_lock:
            num += 1
            my_num = num
        # do whatever you want here using my_num
        # but do not touch num
        with file_lock:
            open('logs.txt', 'a').write(str(f'{my_num}\n'))

threads = [threading.Thread(target=scan) for _ in range(500)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
An important callout in addition to threading.Lock:
Use join to make the parent thread wait for forked threads to complete.
Without this, threads would still race.
Suppose I'm using num after the threads complete:
import threading

lock, num = threading.Lock(), 0

def operation():
    global num
    print("Operation has started")
    with lock:
        num += 1

threads = [threading.Thread(target=operation) for x in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(num)
Without join, the result is inconsistent (sometimes 9 gets printed, sometimes 10):
Operation has started
Operation has started
Operation has started
Operation has started
Operation has startedOperation has started
Operation has started
Operation has started
Operation has started
Operation has started9
With join, it's consistent:
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
Operation has started
10
I am a beginner in Python, so I would very much appreciate it if you could help me with clear and easy explanations.
In my Python script, I have a function that creates several threads to do an I/O-bound task (what it really does is make several Azure requests concurrently using the Azure Python SDK), and I also have a list of time differences like [1 second, 3 seconds, 10 seconds, 5 seconds, ..., 7 seconds] so that I execute the function again after each time difference.
Let's say I want to execute the function and execute it again after 5 seconds. The first execution can take much more than 5 seconds to finish, as it has to wait for the requests it makes to complete. So, I want to run each execution of the function in a different process so that different executions do not block each other (even if they wouldn't block each other without separate processes, I just didn't want threads from different executions to be mixed).
My code is like:
import multiprocessing as mp
from time import sleep

def function(num_threads):
    """
    This functions makes num_threads number of threads to make num_threads number of requests
    """

# Time to wait in seconds between each execution of the function
times = [1, 10, 7, 3, 13, 19]
# List of number of requests to make for each execution of the function
num_threads_list = [1, 2, 3, 4, 5, 6]

processes = []
for i in range(len(times)):
    p = mp.Process(target=function, args=[num_threads_list[i]])
    p.start()
    processes.append(p)
    sleep(times[i])

for process in processes:
    process.join()
Questions I have:
The length of the list "times" is very big in my real script (about 1000). Considering the time differences in the list "times", I guess there are at most 5 executions of the function running concurrently using processes. I wonder whether each process terminates when it is done executing the function, so that there are actually at most 5 processes running, or whether they remain, so that there will be 1000 processes, which sounds very weird given the number of CPU cores of my computer.
Please tell me if you think there is a better way to do what I explained above.
Thank you!
The main problem I distill from your question is having a large number of processes running simultaneously.
You can prevent that by maintaining a list of processes with a maximum length. Something like this:
import multiprocessing as mp
from time import sleep
from random import randint

def function(num_threads):
    """
    This function makes num_threads threads to make num_threads requests
    """
    sleep(randint(3, 7))

# Time to wait in seconds between each execution of the function
times = [1, 10, 7, 3, 13, 19]
# List of number of requests to make for each execution of the function
num_threads_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
process_data_list = []
max_processes = 4

# =======================================================================================
def main():
    times_index = 0
    while times_index < len(times):
        # cleanup stopped processes -------------------------------
        cleanup_done = False
        while not cleanup_done:
            cleanup_done = True
            # search for stopped processes
            for i, process_data in enumerate(process_data_list):
                if not process_data[1].is_alive():
                    print(f'process {process_data[0]} finished')
                    # remove from processes
                    p = process_data_list.pop(i)
                    del p
                    # start a new search
                    cleanup_done = False
                    break
        # try to start a new process ---------------------------------
        if len(process_data_list) < max_processes:
            process = mp.Process(target=function, args=[num_threads_list[times_index]])
            process.start()
            process_data_list.append([times_index, process])
            print(f'process {times_index} started')
            times_index += 1
        else:
            sleep(0.1)
    # wait for all processes to finish --------------------------------
    while process_data_list:
        for i, process_data in enumerate(process_data_list):
            if not process_data[1].is_alive():
                print(f'process {process_data[0]} finished')
                # remove from processes
                p = process_data_list.pop(i)
                del p
                # start a new search
                break
    print('ALL DONE !!!!!!')

# =======================================================================================
if __name__ == '__main__':
    main()
It runs at most max_processes at once, as you can see in the result.
process 0 started
process 1 started
process 2 started
process 3 started
process 3 finished
process 4 started
process 1 finished
process 5 started
process 0 finished
process 2 finished
process 5 finished
process 4 finished
ALL DONE !!!!!!
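As an alternative to the hand-rolled process list above, the standard library's concurrent.futures.ProcessPoolExecutor keeps at most max_workers processes alive and reuses them for queued calls. A minimal sketch (not the answer's code, just a different way to get the same cap), assuming the same function(num_threads) signature from the question:

# Sketch of the same throttling with a standard-library process pool.
# function() is a stand-in for the real request-making function.
from concurrent.futures import ProcessPoolExecutor

def function(num_threads):
    """Stand-in for the real function that makes num_threads requests."""
    pass

num_threads_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

if __name__ == '__main__':
    # at most 4 worker processes exist at any time; they are reused for all calls
    with ProcessPoolExecutor(max_workers=4) as executor:
        executor.map(function, num_threads_list)
    print('ALL DONE !!!!!!')

With a pool, the number of live worker processes never exceeds max_workers, because the same processes are reused for all queued calls.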
You could also use a timer to do the job, as in the following code.
I deliberately gave process n°2 a 15-second guideline so you can see that it effectively ends last once its time has elapsed.
This code sample has two main functions.
The first one, your_process_here(), as its name says, is a placeholder waiting for your own code.
The second one is a manager which organizes the slicing of the processes so as not to overload the system.
Parameters
max_process : total number of processes being executed by the script
simultp : maximum number of simultaneous processes
timegl : time guideline which defines the waiting time for each process, counted from the parent's start time. The waiting time is therefore at least the time defined in the guideline.
In other words, once its guideline time has elapsed, a process starts as soon as possible, taking into account the maximum number of simultaneous processes allowed.
In this example
max_process = 6
simultp = 3
timegl = [1, 15, 1, 0.22, 6, 0.5] (just for illustration; it would be more logical to have an increasing series there)
Result in the shell
simultaneously launched processes : 3
process n°2 is active and will wait 14.99 seconds more before treatment function starts
process n°1 is active and will wait 0.98 seconds more before treatment function starts
process n°3 is active and will wait 0.98 seconds more before treatment function starts
---- process n°1 ended ----
---- process n°3 ended ----
simultaneously launched processes : 3
process n°5 is active and will wait 2.88 seconds more before treatment function starts
process n°4 is active and will start now
---- process n°4 ended ----
---- process n°5 ended ----
simultaneously launched processes : 2
process n°6 is active and will start now
---- process n°6 ended ----
---- process n°2 ended ----
Code
import multiprocessing as mp
from threading import Timer
import time
def your_process_here(starttime, pnum, timegl):
    # Delay since the parent thread started
    delay_since_pstart = time.time() - starttime
    # Time to sleep in order to follow the time guideline as closely as possible
    diff = timegl[pnum-1] - delay_since_pstart
    if diff > 0:  # if time elapsed since parent start < guideline time
        print('process n°{0} is active and will wait {1} seconds more before treatment function starts'
              .format(pnum, round(diff, 2)))
        time.sleep(diff)  # wait for X more seconds
    else:
        print('process n°{0} is active and will start now'.format(pnum))
    ########################################################
    ## PUT THE CODE AFTER SLEEP() TO START CODE WITH A DELAY
    ## if pnum == 1:
    ##     function1()
    ## elif pnum == 2:
    ##     function2()
    ## ...
    print('---- process n°{0} ended ----'.format(pnum))

def process_manager(max_process, simultp, timegl, starttime=0, pnum=1, launchp=[]):
    # While the number of simultaneous current processes is less than simultp and
    # the historical number of processes is less than max_process
    while len(mp.active_children()) < simultp and len(launchp) < max_process:
        # Increment the process number
        pnum = len(launchp) + 1
        # Start a new process
        mp.Process(target=your_process_here, args=(starttime, pnum, timegl)).start()
        # History of all launched unique processes
        launchp = list(set(launchp + mp.active_children()))
        # ...
    ####### THESE 2 FOLLOWING LINES ARE TO DELETE IN OPERATIONAL CODE ############
    print('simultaneously launched processes : ', len(mp.active_children()))
    time.sleep(3)  # optional: a 3-second break before the next slice of processes is treated
    ##############################################################################
    if pnum < max_process:
        delay_repeat = 0.1  # 100 ms
        # If all the processes have not been launched, renew the operation
        Timer(delay_repeat, process_manager, (max_process, simultp, timegl, starttime, pnum, launchp)).start()

if __name__ == '__main__':
    max_process = 6  # maximum number of processes
    simultp = 3      # maximum number of simultaneous processes, to save resources
    timegl = [1, 15, 1, 0.22, 6, 0.5]  # Time guideline
    starttime = time.time()
    process_manager(max_process, simultp, timegl, starttime)
For the code segment below, I would like to limit the number of running threads to 20. My attempt at doing this seems flawed, because once the counter hits 20 it would just not create new threads, but those values of "a" would not trigger the do_something() function (which must account for every "a" in the array). Any help is greatly appreciated.
count = 0
for i in range(len(array_of_letters)):
    if i == "a":
        if count < 20:
            count=+1
            t = threading.Thread(target=do_something, args = (q,u))
            print "new thread started : %s"%(str(threading.current_thread().ident))
            t.start()
            count=-1
concurrent.futures has a ThreadPoolExecutor class, which allows submitting many tasks and specify the maximum number of working threads:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=20) as executor:
    for letter in array_of_letters:
        executor.submit(do_something, letter)
Check more examples in the package docs.
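If you also need the return values of do_something (or want exceptions raised inside the workers to surface), you can keep the futures and collect them as they complete. A small sketch with placeholder names (do_something and array_of_letters stand in for the question's real objects):

# Collect results as tasks finish; do_something and array_of_letters are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed

def do_something(letter):
    # placeholder for the real work
    return letter.upper()

array_of_letters = ['a', 'b', 'c', 'a']

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(do_something, letter) for letter in array_of_letters]
    for future in as_completed(futures):
        result = future.result()   # re-raises any exception from the worker thread
        print(result)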
I want to run a serial program on multiple cores at the same time and I need to do that multiple time (in a loop).
I use subprocess.Popen to distribute the jobs across the processors by limiting the number of jobs to the number of available processors. I add the jobs to a list, then I check with poll() whether a job is done; if so, I remove it from the list, and I continue the submission until the total number of jobs is completed.
I have been looking on the web, found a couple of interesting scripts to do that, and came up with my adapted version:
nextProc = 0
processes = []

while (len(processes) < limitProc):      # Here I assume that limitProc < ncores
    input = filelist[nextProc]+'.in'     # filelist: list of input files
    output = filelist[nextProc]+'.out'   # list of output files
    cwd = pathlist[nextProc]             # list of paths
    processes.append(subprocess.Popen(['myProgram','-i',input,'-screen',output],cwd=cwd,bufsize=-1))
    nextProc += 1
    time.sleep(wait)

while (len(processes) > 0):              # Loop until all processes are done
    time.sleep(wait)
    for i in xrange(len(processes)-1, -1, -1):  # Remove processes that are done (traverse backward)
        if processes[i].poll() is not None:
            del processes[i]
    time.sleep(wait)
    while (len(processes) < limitProc) and (nextProc < maxProcesses):  # Submit new processes
        output = filelist[nextProc]+'.out'
        input = filelist[nextProc]+'.in'
        cwd = pathlist[nextProc]
        processes.append(subprocess.Popen(['myProgram','-i',input,'-screen',output],cwd=cwd,bufsize=-1))
        nextProc += 1
        time.sleep(wait)

print 'Jobs Done'
I run this script in a loop and the problem is that the execution time increases from one step to another. Here is the graph: http://i62.tinypic.com/2lk8f41.png
myProgram's execution time is constant.
I'd be so glad if someone could explain to me what is causing this leak.
Thanks a lot,
Begbi
This is a follow-up question to this one. User Will suggested using a queue, and I tried to implement that solution below. The solution works just fine with j=1000; however, it hangs as I try to scale to larger numbers. I am stuck here and cannot determine why it hangs. Any suggestions would be appreciated. Also, the code is starting to get ugly as I keep messing with it, so I apologize for all the nested functions.
def run4(j):
    """
    a multicore approach using queues
    """
    from multiprocessing import Process, Queue, cpu_count
    import os

    def bazinga(uncrunched_queue, crunched_queue):
        """
        Pulls the next item off the queue, generates its collatz
        length and
        """
        num = uncrunched_queue.get()
        while num != 'STOP':  # Signal that there are no more numbers
            length = len(generateChain(num, []))
            crunched_queue.put([num, length])
            num = uncrunched_queue.get()

    def consumer(crunched_queue):
        """
        A process to pull data off the queue and evaluate it
        """
        maxChain = 0
        biggest = 0
        while not crunched_queue.empty():
            a, b = crunched_queue.get()
            if b > maxChain:
                biggest = a
                maxChain = b
        print('%d has a chain of length %d' % (biggest, maxChain))

    uncrunched_queue = Queue()
    crunched_queue = Queue()
    numProcs = cpu_count()

    for i in range(1, j):  # Load up the queue with our numbers
        uncrunched_queue.put(i)
    for i in range(numProcs):  # put sufficient stops at the end of the queue
        uncrunched_queue.put('STOP')

    ps = []
    for i in range(numProcs):
        p = Process(target=bazinga, args=(uncrunched_queue, crunched_queue))
        p.start()
        ps.append(p)

    p = Process(target=consumer, args=(crunched_queue, ))
    p.start()
    ps.append(p)

    for p in ps: p.join()
You're putting 'STOP' poison pills into your uncrunched_queue (as you should), and having your producers shut down accordingly; on the other hand your consumer only checks for emptiness of the crunched queue:
while not crunched_queue.empty():
(this working at all depends on a race condition, btw, which is not good)
When you start throwing non-trivial work units at your bazinga producers, they take longer. If all of them take long enough, your crunched_queue dries up, and your consumer dies. I think you may be misidentifying what's happening - your program doesn't "hang", it just stops outputting stuff because your consumer is dead.
You need to implement a smarter methodology for shutting down your consumer. Either look for n poison pills, where n is the number of producers (who accordingly each toss one in the crunched_queue when they shut down), or use something like a Semaphore that counts up for each live producer and down when one shuts down.
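A minimal sketch of the first option (counting poison pills), reusing the names from the question. generateChain is assumed to be the same helper as before, and the consumer now needs the number of producers passed in as an extra argument:

# Each producer puts one 'STOP' on crunched_queue when it shuts down; the
# consumer blocks on get() and only exits after seeing one 'STOP' per producer.
def bazinga(uncrunched_queue, crunched_queue):
    num = uncrunched_queue.get()
    while num != 'STOP':
        length = len(generateChain(num, []))
        crunched_queue.put([num, length])
        num = uncrunched_queue.get()
    crunched_queue.put('STOP')            # tell the consumer this producer is done

def consumer(crunched_queue, numProcs):
    maxChain = 0
    biggest = 0
    stops_seen = 0
    while stops_seen < numProcs:          # exit only after every producer has finished
        item = crunched_queue.get()       # blocks instead of racing on empty()
        if item == 'STOP':
            stops_seen += 1
            continue
        a, b = item
        if b > maxChain:
            biggest = a
            maxChain = b
    print('%d has a chain of length %d' % (biggest, maxChain))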