Python multithreading producing funky results

I'm fairly new to multithreading in Python and have run into what I suspect is a concurrency problem. When I run the code below, it prints the "normal" digits 1,2,3,4,5,6,7,8,9 for the first nine numbers. However, when it moves on to the next batch of numbers (the ones each thread should print after it "sleeps" for 2 seconds) it spits out:
- different numbers each time
- often very large numbers
- sometimes no numbers at all
I'm guessing this is a concurrency issue where, by the time each original thread got around to printing its second number after the sleep, the i variable had been tampered with somewhere, but can someone please explain step by step what exactly is happening, and why I see the missing-numbers/large-numbers phenomenon?
import threading
import time

def foo(text):
    print(text)
    time.sleep(2)
    print(text)

for i in range(1, 10):
    allTreads = []
    current_thread = threading.Thread(target=foo, args=(i,))
    allTreads.append(current_thread)
    current_thread.start()

Well, your problem is called a race condition. Sometimes one thread prints its number before another thread's implicit '\n' has been written, so two digits end up fused on one line (which looks like a large number) while the other line comes out empty. That's why you see this kind of behaviour.
Also, what's the purpose of the allTreads list there? It is re-created on every iteration, so it only ever holds the current thread and is thrown away at the end of that iteration.
In order to avoid race conditions, you need some kind of synchronization between threads. Consider threading.Lock(), so that no more than one thread at a time prints the given text:
import threading
import time

lock = threading.Lock()

def foo(text):
    with lock:
        print(text)
    time.sleep(2)
    with lock:
        print(text)

for i in range(1, 10):
    allTreads = []
    current_thread = threading.Thread(target=foo, args=(i,))
    allTreads.append(current_thread)
    current_thread.start()
The threading documentation in Python is quite good. I recommend reading these two links:
Python Threading Documentation
Real Python Threading
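Building on the remark about allTreads, a minimal sketch that keeps one list for the whole loop and joins every thread at the end (names follow PEP 8, and the sleep is shortened to 0.2 s just to keep the example quick):

```python
import threading
import time

def foo(text):
    print(text)
    time.sleep(0.2)  # shortened from 2 s for the example
    print(text)

# One list for the whole loop, so every thread can be joined later.
all_threads = []
for i in range(1, 10):
    t = threading.Thread(target=foo, args=(i,))
    all_threads.append(t)
    t.start()

# Wait for every worker before the main thread exits.
for t in all_threads:
    t.join()
```

Joining at the end also guarantees the program doesn't reach its final statements while workers are still printing.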

Related

Python - Why doesn't multithreading increase the speed of my code?

I tried improving my code by running this with and without using two threads:
from threading import Lock
from threading import Thread
import time

start_time = time.clock()
arr_lock = Lock()
arr = range(5000)

def do_print():
    # Disable arr access to other threads; they will have to wait if they need to read
    a = 0
    while True:
        arr_lock.acquire()
        if len(arr) > 0:
            item = arr.pop(0)
            print item
            arr_lock.release()
            b = 0
            for a in range(30000):
                b = b + 1
        else:
            arr_lock.release()
            break

thread1 = Thread(target=do_print)
thread1.start()
thread1.join()
print time.clock() - start_time, "seconds"
When running 2 threads my code's run time increased. Does anyone know why this happened, or perhaps know a different way to increase the performance of my code?
The primary reason you aren't seeing any performance improvement with multiple threads is that your program only allows one thread to do anything useful at a time; the other thread is always blocked.
Two things:
- Remove the print statement that's invoked inside the lock. print statements drastically impact performance and timing. Also, the I/O channel to stdout is essentially single-threaded, so you've built another implicit lock into your code. So let's just remove the print statement.
- Use a proper sleep technique instead of "spin locking" by counting up from 0 to 30000. That's just going to burn a core needlessly.
Try this as your main loop
while True:
    arr_lock.acquire()
    if len(arr) > 0:
        item = arr.pop(0)
        arr_lock.release()
        time.sleep(0)
    else:
        arr_lock.release()
        break
This should run slightly better... I would even advocate getting the sleep statement out altogether so you can just let each thread have a full quantum.
However, because each thread is either doing "nothing" (sleeping or blocked on acquire) or just doing a single pop call on the array while in the lock, the majority of the time spent is going to be in the acquire/release calls instead of actually operating on the array. Hence, multiple threads aren't going to make your program run faster.
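As an aside, if the goal is simply to share a work list between threads safely, the standard library's queue.Queue does the locking internally, which removes the manual acquire/release bookkeeping. A minimal sketch (this won't make CPU-bound work faster either, for the same GIL reasons):

```python
import queue
import threading

work = queue.Queue()          # thread-safe: no explicit Lock needed
for i in range(5000):
    work.put(i)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            item = work.get_nowait()   # Queue handles its own locking
        except queue.Empty:
            break                      # no work left: exit the thread
        with results_lock:
            results.append(item)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every item is consumed exactly once, because Queue.get_nowait() is atomic with respect to other consumers.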

''.join() in ThreadPoolExecutor eats memory

Try code like this:
import gc
import random
from concurrent.futures import ThreadPoolExecutor

zen = "Special cases aren't special enough to break the rules. "

def abc(length: int):
    msg = ''.join(random.sample(zen, length))
    print(msg)
    del msg

if __name__ == '__main__':
    pool = ThreadPoolExecutor(max_workers=8)
    while True:
        for x in range(256):
            pool.submit(abc, random.randint(2, 6))
        print('===================================================')
        gc.collect()
The code takes about 8 MB if it runs without the ThreadPoolExecutor, or about 30 MB when using str() instead of ''.join(). But this code keeps eating RAM without limit. I thought it was caused by random.sample or something else, but it turned out that ''.join() inside the ThreadPoolExecutor causes the problem.
It confuses me, as no modules mutually import each other (they only share zen), and neither the del nor gc helps :(
PS: please note that the infinite loop itself is not the problem. When you run something like:
while True:
    print(1234567)
the memory usage stays under a certain line (the code above should take no more than about 1 MB). The code at the top doesn't have a growing list or dict, and the variable is deleted at the end of the function, so I would expect it to be cleaned up when a thread finishes, which obviously doesn't happen.
PPS: to put it another way, the symptom is that whatever goes through ''.join() is not recycled. If we change abc this way:
tmp = random.sample(zen, length)
msg = ''.join(tmp)
print(msg[:8])
del msg, tmp
GC works effectively, and the usage stays at about 26 MB.
So is there something I missed when using ''.join(), or does the Python language have a bug there?
When you run the code without threads, each statement executes to completion before the next one starts, which means gc.collect() is only called after the inner loop has ended and its strings are already unreachable.
But when you execute the code with threads, you keep submitting new tasks before the earlier ones have finished. The submissions pile up in the executor's internal work queue, which has no size limit, faster than the 8 workers can drain it, so pending tasks accumulate without bound and memory grows with them.
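If the diagnosis of accumulating pending tasks is right, one common workaround is to throttle submissions with a semaphore so the executor's queue stays bounded. A sketch (the limit of 64 and the use of futures to collect results are my additions, not part of the original code):

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor

zen = "Special cases aren't special enough to break the rules. "

# At most 64 tasks may be queued or running at once (arbitrary limit).
limit = threading.BoundedSemaphore(64)

def abc(length):
    try:
        return ''.join(random.sample(zen, length))
    finally:
        limit.release()        # free a slot when the task finishes

futures = []
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(256):
        limit.acquire()        # blocks while 64 tasks are pending
        futures.append(pool.submit(abc, random.randint(2, 6)))
```

The producer blocks on limit.acquire() whenever 64 tasks are in flight, so memory use stays roughly constant no matter how long the outer loop runs.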

Python: multithreading in infinite loop

I have a code which is basically running an infinite loop, and in each iteration of the loop I run some instructions. Some of these instructions have to run in "parallel", which I do by using multiprocessing. Here is an example of my code structure:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def buy_fruit(fruit, number):
    print('I bought '+str(number)+' times the following fruit: '+fruit)
    return 'ok'

def func1(parameter1, parameter2):
    myParameters = (parameter1, parameter2)
    pool = ThreadPool(2)
    data = pool.starmap(func2, zip(myParameters))
    return 'ok'

def func2(parameter1):
    print(parameter1)
    return 'ok'

while True:
    myFruits = ('apple', 'pear', 'orange')
    myQuantities = (5, 10, 2)
    pool = ThreadPool(2)
    data = pool.starmap(buy_fruit, zip(myFruits, myQuantities))
    func1('hello', 'hola')
I agree it's a bit messy, because I create pools both in the main loop and inside functions.
So everything works well until the loop has been running for a few minutes, and then I get an error:
"RuntimeError: can't start new thread"
I saw online that this is due to the fact that I have opened too many threads.
What is the simplest way to close all my Threads by the end of each loop iteration, so I can restart "fresh" at the start of the new loop iteration?
Thank you in advance for your time and help!
Best,
Julia
PS: The example code is just an example, my real function opens many threads in each loop and each function takes a few seconds to execute.
You are creating a new ThreadPool object inside the endless loop without ever terminating its threads, which is the likely cause of your problem. Have you tried creating the pool outside of the endless loop?
pool = ThreadPool(2)
while True:
    myFruits = ('apple', 'pear', 'orange')
    myQuantities = (5, 10, 2)
    data = pool.starmap(buy_fruit, zip(myFruits, myQuantities))
Alternatively, and to answer your question directly: if your use case for some reason requires creating a new ThreadPool object in each loop iteration, use a context manager (the with statement) to make sure all threads are closed upon leaving it.
while True:
    myFruits = ('apple', 'pear', 'orange')
    myQuantities = (5, 10, 2)
    with ThreadPool(2) as pool:
        data = pool.starmap(buy_fruit, zip(myFruits, myQuantities))
Note, however, the noticeable performance difference compared to the code above: creating and terminating threads is expensive, which is why the first example will run much faster and is probably what you'll want to use.
Regarding your edit involving "nested ThreadPools": I would suggest maintaining a single ThreadPool instance and passing a reference to it into your nested functions as required.
def func1(pool, parameter1, parameter2):
    ...

...
pool = ThreadPool(2)
while True:
    myFruits = ('apple', 'pear', 'orange')
    myQuantities = (5, 10, 2)
    data = pool.starmap(buy_fruit, zip(myFruits, myQuantities))
    func1(pool, 'hello', 'hola')

Return whichever expression returns first

I have two different functions f, and g that compute the same result with different algorithms. Sometimes one or the other takes a long time while the other terminates quickly. I want to create a new function that runs each simultaneously and then returns the result from the first that finishes.
I want to create that function with a higher order function
h = firstresult(f, g)
What is the best way to accomplish this in Python?
I suspect that the solution involves threading. I'd like to avoid discussion of the GIL.
I would simply use a Queue for this. Start the threads, and the first one that has a result ready writes it to the queue.
Code
from threading import Thread
from time import sleep
from Queue import Queue

def firstresult(*functions):
    queue = Queue()
    threads = []
    for f in functions:
        # Bind f as a default argument; otherwise every thread would
        # see the loop variable's final value when it actually runs.
        def thread_main(f=f):
            queue.put(f())
        thread = Thread(target=thread_main)
        threads.append(thread)
        thread.start()
    result = queue.get()
    return result

def slow():
    sleep(1)
    return 42

def fast():
    return 0

if __name__ == '__main__':
    print firstresult(slow, fast)
Live demo
http://ideone.com/jzzZX2
Notes
Stopping the threads is an entirely different topic. For that you would need to add some state variable that the threads check at regular intervals. To keep this example short I skipped that part and simply assumed all workers get time to finish their work, even though their results are never read.
Skipping the discussion about the GIL, as requested by the questioner. ;-)
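To illustrate the "state variable" idea just mentioned, here is a sketch using threading.Event as the flag: the slow worker polls it in small steps and gives up once a winner exists (the slow/fast functions and the 42/0 values are placeholders of my own):

```python
import threading
import time
from queue import Queue

stop = threading.Event()

def slow(result_queue):
    # Poll the stop flag in small steps instead of one long sleep.
    for _ in range(100):
        if stop.is_set():
            return          # a winner exists: exit without a result
        time.sleep(0.01)
    result_queue.put(42)

def fast(result_queue):
    result_queue.put(0)

q = Queue()
threads = [threading.Thread(target=slow, args=(q,)),
           threading.Thread(target=fast, args=(q,))]
for t in threads:
    t.start()
result = q.get()    # first result wins
stop.set()          # tell the slow worker to give up
for t in threads:
    t.join()
```

The join at the end returns quickly because the loser notices the flag within one polling step rather than sleeping out its full duration.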
Now - unlike my suggestion on the other answer, this piece of code does exactly what you are requesting:
from multiprocessing import Process, Queue
import random
import time

def firstresult(func1, func2):
    queue = Queue()
    proc1 = Process(target=func1, args=(queue,))
    proc2 = Process(target=func2, args=(queue,))
    proc1.start(); proc2.start()
    result = queue.get()
    proc1.terminate(); proc2.terminate()
    return result

def algo1(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 1")

def algo2(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 2")

print firstresult(algo1, algo2)
Run each function in a new worker thread; the two workers send their result back to the main thread through a one-item queue or something similar. When the main thread receives the result from the winner, it kills (do Python threads support kill yet? lol) both workers to avoid wasting time (one function may take hours while the other only takes a second).
Replace the word thread with process if you want.
You will need to run each function in another process (with multiprocessing) or in a different thread.
If both are CPU-bound, multithreading won't help much - exactly because of the GIL -
so multiprocessing is the way.
If the return value is a pickleable (serializable) object, I have a decorator I created that simply runs the function in the background, in another process:
https://bitbucket.org/jsbueno/lelo/src
It is not exactly what you want, as both calls are non-blocking and start executing right away. The trick with this decorator is that it only blocks (and waits for the function to complete) when you try to use the return value.
But on the other hand, it is just a decorator that does all the work.
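On Python 3, the same firstresult idea can be sketched with the standard library's concurrent.futures and wait(..., return_when=FIRST_COMPLETED). Note that, as in the Queue answer above, the losing function still runs to completion, because the executor's shutdown waits for it (the slow/fast functions are placeholders):

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def firstresult(*functions):
    """Return the result of whichever function finishes first."""
    with ThreadPoolExecutor(max_workers=len(functions)) as pool:
        futures = [pool.submit(f) for f in functions]
        # Block until at least one future has finished.
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()

def slow():
    time.sleep(1)
    return 42

def fast():
    return 0

print(firstresult(slow, fast))
```

If a function can raise, .result() re-raises that exception in the caller, which may or may not be the behaviour you want for "first result wins".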

Threading parameters

Again a question from me... having some issues again. Hope to find someone who's a lot smarter and knows this. :D
I'm having an issue with threading: when opening URLs over a range of (1, 1000), I would expect to see all the different thread numbers. But when I run the code I get a lot of duplicate values (probably because the crawls go that fast). Anyway, this is my code; I try to record which thread handled each request, but I get duplicates.
import threading
import urllib2
import time
import collections

results2 = []

def crawl():
    var_Number = thread.getName().split("-")[1]
    try:
        data = urllib2.urlopen("http://www.waarmaarraar.nl").read()
        results2.append(var_Number)
    except:
        crawl()

threads = []
for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.start()
    threads.append(thread)

# wait until all threads are finished
print "Waiting..."
for thread in threads:
    thread.join()
print "Complete."

# print results (all numbers, should be 1..1000)
results2.sort()
print results2

# print doubles (should be [])
print [x for x, y in collections.Counter(results2).items() if y > 1]
However, if I add time.sleep(0.1) directly under the xrange line, those doubles do not occur, although it slows my program down a lot. Does anyone know a better way to fix this?
There is a recursive call to crawl() in the exception handler, so the same thread runs the function several times if there is an error. Thus results2 may contain the same var_Number several times. If you add time.sleep(.1) (a pause), your script consumes fewer resources (e.g., open file descriptors, running threads) and the request to the remote server is more likely to succeed.
Also, default thread names may repeat: if a thread has exited, another thread may get the same name, e.g., if the implementation uses the .ident attribute to generate the name.
Notes:
- use PEP 8 naming conventions; you could use the pep8, pyflakes, or epylint command-line tools to check your code automatically
- you don't need 1000 threads to fetch 1000 URLs (see my comment on your previous question)
- it is not nice to send requests to the same site without a pause
According to the documentation on Thread.getName(), this is correct behavior.
If you want a unique name for each of your threads, you have to set it via the name attribute.
Based on what you expect in the end, replacing

for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.start()
    threads.append(thread)

with

for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.name = n
    thread.start()
    threads.append(thread)

and var_Number = thread.getName().split("-")[1] with var_Number = thread.name should help you.
EDIT
After some testing, a user-set name can also be reused by another thread, so the only reliable way to pass n is through the args or kwargs of threading.Thread().
This behavior makes sense: if a thread needs some piece of data, pass it in properly; don't try to smuggle it through a field where it doesn't belong.
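To make the args suggestion concrete, a sketch of crawl taking its number as a parameter instead of parsing the thread name (the URL fetching is omitted so the example stays self-contained):

```python
import threading

results = []
results_lock = threading.Lock()

def crawl(number):
    # The number arrives as an argument, so it cannot be affected
    # by thread-name reuse. (A real version would fetch the URL here.)
    with results_lock:
        results.append(number)

threads = []
for n in range(1, 1000):
    t = threading.Thread(target=crawl, args=(n,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

results.sort()
# after sorting, every number from 1 to 999 appears exactly once
```

Since each thread owns its own number, no sleep or name bookkeeping is needed to avoid duplicates.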
