Python Threading: Only two active threads, how do I get more?

So I have the following program:
https://github.com/eWizardII/homobabel/blob/master/Experimental/demo_async_falcon.py
However, when it's run, only two threads are ever active. How can I make more threads run at once? I have tried things like urlv2 = birdofprey(ip2) where ip2 = str(host+1), but that just ends up sending the same thing to two threads. Any help would be appreciated.
Thanks,

Active count=2 means that you have one of your own threads (birdofprey) plus the main thread. This is because you use a lock, so the second birdofprey thread waits for the first, and so on. I didn't dig deeper into the algorithm, but it seems that you don't need to lock the birdofprey threads, since they don't share any data (I could be wrong). If they do share data, you should make access to the shared data exclusive, rather than lock the whole body of run.
Update upon comment
remove the locks (if there is no shared data; storage_i is not shared data);
in the first loop, create the threads, start them, and append them to a list;
in a second loop over that list of threads, call join and collect the information you need (see the sketch below).
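A minimal sketch of that pattern; fetch and hosts are illustrative placeholders, not names from the original code:

import threading

def fetch(host):
    # stand-in for the per-host work a birdofprey thread does
    print("scanning %s" % host)

hosts = ["192.168.0.%d" % i for i in range(1, 6)]  # hypothetical targets
threads = []
for host in hosts:
    t = threading.Thread(target=fetch, args=(host,))
    t.start()             # all threads run concurrently...
    threads.append(t)
for t in threads:
    t.join()              # ...and are joined only after every one has started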

Line 75, urlv.join(), blocks until the thread finishes. So you actually create one thread, wait until it's done, and then start the next. The other thread is the main thread.

I think the problem is that you need to pull urlv.join() out of the for loop. Right now, because of the join, you're waiting for your new thread to complete before starting the next one.
But for general readability, maintainability, etc., you might want to consider using Python's Queue class to set up a work queue and have a pool of worker threads pulling from it.
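A sketch of that Queue-based worker pool, with illustrative names (worker, the example URLs) and a sentinel value for shutdown:

import threading
from queue import Queue  # on Python 2: from Queue import Queue

def worker(q):
    while True:
        url = q.get()
        if url is None:                   # sentinel: no more work
            break
        print("processing %s" % url)      # placeholder for the real request
        q.task_done()

q = Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()
for url in ["http://example.com/%d" % i for i in range(20)]:
    q.put(url)
q.join()            # wait until task_done has been called for every URL
for _ in threads:
    q.put(None)     # one sentinel per worker shuts the pool down
for t in threads:
    t.join()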

Related

Clean up a thread without .join() and without blocking the main thread

I am in a situation where I have two endpoints I can ask for a value, and one may be faster than the other. The calls to the endpoints are blocking. I want to wait for one to complete and take that result without waiting for the other to complete.
My solution was to issue the requests in separate threads and have those threads set a flag to true when they complete. In the main thread, I continuously check the flags (I know it is a busy wait, but that is not my primary concern right now) and when one completes it takes that value and returns it as the result.
The issue I have is that I never clean up the other thread. I can't find any way to do it without using .join(), which would just block and defeat the purpose of this whole thing. So, how can I clean up that other, slower thread that is blocking without joining it from the main thread?
What you want is to make your threads daemons, so that when you get the result and your main thread finishes, the other running thread will be forced to finish too. You do that by setting the daemon keyword argument to True:
tr = threading.Thread(daemon=True)
From the threading docs:
The significance of this flag is that the entire Python program exits when only daemon threads are left.
Although:
Daemon threads are abruptly stopped at shutdown. Their resources (such as open files, database transactions, etc.) may not be released properly. If you want your threads to stop gracefully, make them non-daemonic and use a suitable signalling mechanism such as an Event.
I don't have any particular experience with Events so can't elaborate on that. Feel free to click the link and read on.
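For reference, here is a minimal sketch of that Event-based signalling, assuming the worker can check a flag between chunks of work (a thread stuck in a blocking socket call still can't be interrupted this way):

import threading
import time

stop = threading.Event()

def worker():
    while not stop.is_set():   # check the flag between units of work
        time.sleep(0.1)        # placeholder for one chunk of real work
    print("worker exiting cleanly")

t = threading.Thread(target=worker)
t.start()
# ... main thread takes the result from the faster endpoint ...
stop.set()                     # ask the slower thread to stop
t.join()                       # join now returns promptly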
One bad and dirty solution is to implement a method on the thread that closes the socket it is blocking on. You then have to catch the resulting exception in the main thread.

Lock threads in Python for a task

I have a program that uses threads to start another thread once a certain threshold is reached. Right now the second thread is being started multiple times. I implemented a lock but I don't think I did it right.
for i in range(max_threads):
    t1 = Thread(target=grab_queue)
    t1.start()
in grab_queue, I have:
...
rows.append(resultJson)
if len(rows.value()) >= 250:
    with Lock():
        row_thread = Thread(target=insert_rows, kwargs={'rows': rows.value()})
        row_thread.start()
        rows.reset()
Which starts another thread to process the list of rows. I would like to make sure that as soon as one thread hits the if condition, the other threads won't run it, so that extra threads to process the list of rows aren't started.
Your lock is covering the wrong portion of the code. You have a race condition between the check for the size of rows, and the portion of the code where you reset the rows. Given that the lock is taken only after the size check, two threads could easily both decide that the array has grown too large, and only then would the lock kick in to serialize the resetting of the array. "Serialize" in this case means that the task would still be performed twice, once by each thread, but it would happen in succession rather than in parallel.
The correct code could look like this:
rows.append(resultJson)
with grow_lock:
    if len(rows.value()) >= 250:
        row_thread = Thread(target=insert_rows, kwargs={'rows': rows.value()})
        row_thread.start()
        rows.reset()
There is another issue with the code as shown in the question: if Lock() refers to threading.Lock, it is creating and locking a new lock on each invocation, and in each thread! A lock protects a resource shared among threads, and to perform that function, the lock must itself be shared. To fix the problem, instantiate the lock once and pass it to the thread's target function.
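A minimal sketch of that fix, with a counter standing in for the shared rows object:

import threading

grow_lock = threading.Lock()   # created once, shared by every thread
counter = [0]                  # stand-in for the shared state

def grab_queue(lock):
    for _ in range(1000):
        with lock:             # every thread contends on the *same* lock object
            counter[0] += 1    # critical section: touch shared state

threads = [threading.Thread(target=grab_queue, args=(grow_lock,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # 4000; with a fresh Lock() per call, nothing is protected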
Taking a step back, your code implements a custom thread pool. Getting that right and covering all the corner cases takes a lot of work, testing, and debugging. There are production-tested modules specialized for that purpose, such as the multiprocessing module shipped with Python (which supports both process and thread pools), and it is a good idea to get acquainted with them before reimplementing their functionality. See, for example, this article for an accessible introduction to multiprocessing-based thread pools.
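For instance, a sketch using the thread-pool flavor of multiprocessing (the insert_rows body and the batches are placeholders):

from multiprocessing.pool import ThreadPool

def insert_rows(rows):
    print("inserting %d rows" % len(rows))  # placeholder for the real insert

batches = [[i] * 250 for i in range(8)]     # hypothetical batches of rows

with ThreadPool(processes=4) as pool:       # 4 worker threads, reused
    pool.map(insert_rows, batches)          # blocks until every batch is done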

What is the main difference of thread.join vs queue.join?

In Python, what is the difference between using thread.join vs queue.join? I feel that both could do the same job in some scenarios, especially if there is a one-to-one correspondence between threads spawned and items picked from the queue for a job. Is it something like: if you are going to use threading with a queue, it is best to depend on queue.join, and if you are just doing something in parallel with no queue data structure, but something like a list, you could use thread.join? Of course, in the thread.join scenario you need to mention all the threads spawned.
Also, just as an aside, is a queue something you would normally use for consuming input? I think in the scenario of chaining inputs for another job it makes sense to use one as an output as well, but in general is a queue for processing input? Can someone clarify?
Queue.join will wait for the queue to be empty (actually, until Queue.task_done has been called for each item after processing). Thread.join will block until the thread terminates. The behavior of using one or the other might be similar if all the threads take items from the queue, perform a task, and return when there's nothing left. However, you can still have threads which don't use a queue at all, in which case Queue.join would be useless.
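A small sketch contrasting the two; the worker here is illustrative and deliberately never returns:

import threading
import queue  # on Python 2: import Queue as queue

q = queue.Queue()

def worker():
    while True:
        item = q.get()
        print("handled", item)   # placeholder for real processing
        q.task_done()            # q.join() counts these calls

t = threading.Thread(target=worker, daemon=True)
t.start()
for i in range(5):
    q.put(i)

q.join()    # returns once task_done was called for all 5 items,
            # even though the worker thread itself is still alive
# t.join()  # would block forever: the worker never terminates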

Python Multiprocessing - how to make processes wait while active?

Well, I'm quite new to Python and multiprocessing, and what I need to know is whether there is any way to make active processes wait for something like "all processes have finished using a given resource", then continue their work. And yes, I really need them to wait; the main purpose is related to synchronization. It's not about finishing the processes and joining them, it's about waiting while they're running. Should I use something like a Condition/Event? I couldn't find anything really helpful anywhere.
It would be something like this:
import multiprocessing

def worker(args):
    # 1. working
    # 2. takes the resource from the manager
    # 3. waits for all other processes to finish the same step above
    # 4. returns to 1.

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    resource = manager.something()
    pool = multiprocessing.Pool(n)
    result = pool.map(worker, args)
    pool.close()
    pool.join()
Edit: The "working" part takes a lot more time than the other parts, so I still take advantage of multiprocessing, even if the access to that single resource is serial. Let's say the problem works this way: I have multiple processes running a solution finder (an evolutionary algorithm), and every "n" solutions made, I use that resource to exchange some data between those processes and improve solutions using the information. So, I need all of them to wait before exchanging that info. It's a little hard to explain, and I'm not really here to discuss the theory, I just want to know if there is any way I could do what I tried to describe in the main question.
I'm not sure I understood your question, but I think you can use a Queue. It's a good solution for transmitting data from one process to another. You can implement something like:
1. Process the first chunk
2. Write the results to the queue
3. Wait until the queue is not full
4. Return to 1
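A minimal sketch of that bounded-queue idea; the producer/consumer split and the chunk math are illustrative:

import multiprocessing

def producer(q):
    for chunk in range(5):
        result = chunk * chunk   # placeholder: process one chunk
        q.put(result)            # blocks while the queue is full

def consumer(q):
    for _ in range(5):
        print("got", q.get())

if __name__ == '__main__':
    q = multiprocessing.Queue(maxsize=2)  # small capacity forces the producer to wait
    p = multiprocessing.Process(target=producer, args=(q,))
    c = multiprocessing.Process(target=consumer, args=(q,))
    p.start()
    c.start()
    p.join()
    c.join()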
I actually found out a way to do what I wanted.
As you can see in the question, the code was using a manager shared among the processes. So, in simple words, I made a shared resource which works basically like a "log". Every time a process finishes its work, it writes a permission to the log. Once all the desired permissions are there, the processes continue their work (also, using this, I could set specific orders of access to a resource, for example).
Please note that this is not a Lock or a Semaphore.
I suppose this isn't a good method at all, but it suits the problem's needs and doesn't delay the execution.
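For what it's worth, the standard library packages this same wait-for-everyone idea as multiprocessing.Barrier. Here is a minimal sketch with plain Process objects (the worker body is a placeholder):

import multiprocessing

def worker(barrier, worker_id, n_rounds):
    for step in range(n_rounds):
        result = worker_id * step   # 1. the expensive, genuinely parallel work
        barrier.wait()              # 2./3. block until every worker finishes this round
        # 4. exchange data here while everyone is in sync, then loop

if __name__ == '__main__':
    n_workers = 4
    barrier = multiprocessing.Barrier(n_workers)
    procs = [multiprocessing.Process(target=worker, args=(barrier, i, 3))
             for i in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()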

When, why, and how to call thread.join() in Python?

I have this python threading code.
import threading

def sum(value):
    sum = 0
    for i in range(value+1):
        sum += i
    print "I'm done with %d - %d\n" % (value, sum)
    return sum

r = range(500001, 500000*2, 100)
ts = []
for u in r:
    t = threading.Thread(target=sum, args=(u,))
    ts.append(t)
    t.start()
for t in ts:
    t.join()
Executing this, I have hundreds of threads working.
However, when I move the t.join() right after the t.start(), I have only two threads working.
for u in r:
    t = threading.Thread(target=sum, args=(u,))
    ts.append(t)
    t.start()
    t.join()
I tested with code that does not invoke t.join() at all, and it seems to work fine. So when, why, and how should I use thread.join()?
You seem to misunderstand what Thread.join does. When calling join, the current thread blocks until that thread has finished. So you are waiting for each thread to finish, which prevents you from starting the next one until it is done.
The idea behind join is to wait for other threads before continuing. In your case, you want to wait for all threads to finish at the end of the main program before you use their results; if they were daemon threads, the main program ending would kill them outright. So usually you should have a loop at the end that joins all created threads, to prevent the main thread from moving on too early.
Short answer: this one:
for t in ts:
    t.join()
is generally the idiomatic pattern for a small number of threads: start them all, then join them all. Calling .join means that your main thread waits until the given thread finishes before proceeding; you generally do this only after you've started all of the threads.
Longer answer:
>>> len(list(range(500001, 500000*2, 100)))
5000
You're trying to start 5000 threads at once. It's miraculous your computer is still in one piece!
Your method of .join-ing in the loop that dispatches workers is never going to be able to have more than 2 threads (i.e. only one worker thread) going at once. Your main thread has to wait for each worker thread to finish before moving on to the next one. You've prevented a computer-meltdown, but your code is going to be WAY slower than if you'd just never used threading in the first place!
At this point I'd talk about the GIL, but I'll put that aside for the moment. What you need, to limit your thread creation to a reasonable number (i.e. more than one, fewer than 5000), is a thread pool. There are various ways to get one: you could roll your own (fairly simple with a threading.Semaphore), you could use the concurrent.futures package from 3.2+, or you could use some third-party solution. Up to you; each has a different API, so I can't discuss them all further here.
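A minimal sketch with concurrent.futures; sum_to mirrors the question's worker but avoids shadowing the built-in sum:

from concurrent.futures import ThreadPoolExecutor

def sum_to(value):
    total = 0
    for i in range(value + 1):
        total += i
    return total

with ThreadPoolExecutor(max_workers=8) as pool:   # at most 8 threads alive
    results = list(pool.map(sum_to, range(500001, 500000*2, 100)))
print(len(results))  # 5000 results, computed by just 8 threads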
Obligatory GIL Discussion
CPython programmers have to live with the GIL. The Global Interpreter Lock, in short, means that only one thread can be executing Python bytecode at once. This means that on processor-bound tasks (like adding up a bunch of numbers), threading will not result in any speed-up. In fact, the overhead involved in setting up and tearing down threads (not to mention context switching) will result in a slowdown. Threading is better positioned to provide gains on I/O-bound tasks, such as retrieving a bunch of URLs.
multiprocessing and friends sidestep the GIL limitation by, well, using multiple processes. This isn't free: data transfer between processes is expensive, so care needs to be taken not to write workers that depend on shared state.
join() waits for your thread to finish, so the first use starts hundreds of threads and then waits for all of them to finish. The second use waits for each thread to end before launching another one, which kind of defeats the purpose of threading.
The first use makes the most sense: you run all of the threads to do some parallel computation, then wait until all of them finish before you move on and use the results, making sure the work is done (i.e. the results are actually there).
