Threading parameters - python

Another question from me, as I'm having some issues again. I hope someone smarter than me can spot the problem. :D
When opening URLs in threads over a range of (1, 1000), I'd like to see all the different thread numbers in the results. But when I run the code I get a lot of duplicate values (probably because the crawls go that fast). I try to record which thread each request ran on, but I get doubles. Anyway, this is my code:
import threading
import urllib2
import time
import collections

results2 = []

def crawl():
    var_Number = thread.getName().split("-")[1]
    try:
        data = urllib2.urlopen("http://www.waarmaarraar.nl").read()
        results2.append(var_Number)
    except:
        crawl()
threads = []
for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.start()
    threads.append(thread)

# wait until all threads are finished
print "Waiting..."
for thread in threads:
    thread.join()
print "Complete."

# print results (every number from 1 to 999 should appear once)
results2.sort()
print results2

# print doubles (should be [])
print [x for x, y in collections.Counter(results2).items() if y > 1]
However, if I add time.sleep(0.1) directly under the xrange line, those doubles do not occur, although this slows my program down a lot. Does anyone know a better way to fix this?

There is a recursive call to crawl() in the exception handler, so the same thread runs the function several times if there is an error. Thus results2 may contain the same var_Number several times. If you add time.sleep(.1) (a pause), your script consumes fewer resources (e.g., open file descriptors and running threads), and the requests to the remote server are more likely to succeed.
Also, default thread names may repeat. If a thread has exited, another thread may get the same name, e.g., if the implementation uses the .ident attribute to generate the name.
Notes:
Use PEP 8 naming conventions. You can run the pep8, pyflakes, or epylint command-line tools to check your code automatically.
You don't need 1000 threads to fetch 1000 URLs (see my comment on your previous question); a small worker pool is enough, as in the sketch after these notes.
It is not nice to send requests to the same site without a pause between them.
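To illustrate the last two notes, here is a minimal sketch (mine, not the answerer's, shown in Python 3; NUM_WORKERS is an arbitrary choice) of a small worker pool that fetches all 999 numbers with only a few threads and a polite pause:

import queue          # this module is named Queue in Python 2
import threading
import time
from urllib.request import urlopen   # urllib2.urlopen in Python 2

NUM_WORKERS = 10      # a handful of threads instead of one per URL
jobs = queue.Queue()
for n in range(1, 1000):
    jobs.put(n)
results = []

def worker():
    while True:
        try:
            n = jobs.get_nowait()
        except queue.Empty:
            return                    # no work left, thread exits
        urlopen("http://www.waarmaarraar.nl").read()
        results.append(n)             # list.append is atomic in CPython
        time.sleep(0.1)               # be polite to the remote server

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()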

According to the documentation on Thread.getName(), this is correct behavior.
If you want a unique name for each of your threads, you have to set it via the name attribute.
Based on what you expect in the end, replacing
for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.start()
    threads.append(thread)
with
for n in xrange(1, 1000):
    thread = threading.Thread(target=crawl)
    thread.name = n
    thread.start()
    threads.append(thread)
and replacing var_Number = thread.getName().split("-")[1] with var_Number = thread.name should help you.
EDIT
After some testing, even a user-set custom name can be reused by another thread, so the only reliable way to pass n is through the args or kwargs of threading.Thread().
This behavior makes sense: if we need some piece of data in a thread, we should pass it in properly rather than try to put it somewhere it doesn't belong.
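A minimal sketch of that approach (mine, shown with Python 3's range and print; the question's code would use xrange and print statements):

import threading

results2 = []

def crawl(var_number):
    # the thread's number arrives as an ordinary argument,
    # so no name parsing is needed and names cannot collide
    results2.append(var_number)

threads = []
for n in range(1, 1000):
    t = threading.Thread(target=crawl, args=(n,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
print(sorted(results2))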

Related

Python multithreading producing funky results

I'm fairly new to multithreading in Python and encountered an issue (likely due to concurrency problems). When I run the code below, it produces the "normal" digits 1,2,3,4,5,6,7,8,9 for the first 9 numbers. However, when it moves on to the next batch of numbers (the ones that should be printed by each thread after it "sleeps" for 2 seconds) it spits out:
different numbers each time
often very large numbers
sometimes no numbers at all
I'm guessing this is a concurrency issue where, by the time each original thread gets to printing its second number after the sleep, the i variable has been tampered with. But can someone explain step by step what exactly is happening, and why sometimes no numbers or very large numbers appear?
import threading
import time

def foo(text):
    print(text)
    time.sleep(2)
    print(text)

for i in range(1, 10):
    allTreads = []
    current_thread = threading.Thread(target=foo, args=(i,))
    allTreads.append(current_thread)
    current_thread.start()
Well, your problem is called a race condition. Sometimes one thread prints its number before the implicit '\n' of another thread has been written, and that is why you often see this kind of behaviour.
Also, what is the purpose of the allTreads list there? It is re-created on every iteration, so it only ever stores the current_thread and is thrown away at the end of that iteration.
In order to avoid race conditions, you need some kind of synchronization between threads. Consider threading.Lock(), so that no more than one thread at a time prints the given text:
import threading
import time

lock = threading.Lock()

def foo(text):
    with lock:
        print(text)
    time.sleep(2)
    with lock:
        print(text)

for i in range(1, 10):
    allTreads = []
    current_thread = threading.Thread(target=foo, args=(i,))
    allTreads.append(current_thread)
    current_thread.start()
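On the allTreads point above, a minimal sketch (my suggestion, not part of the original answer) that creates the list once and actually uses it to join the threads at the end:

import threading
import time

def foo(text):
    print(text)
    time.sleep(2)
    print(text)

all_threads = []                  # created once, outside the loop
for i in range(1, 10):
    t = threading.Thread(target=foo, args=(i,))
    all_threads.append(t)
    t.start()
for t in all_threads:             # now the list serves a purpose
    t.join()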
The threading documentation in Python is quite good. I recommend reading these two links:
Python Threading Documentation
Real Python Threading

Python Maximum Thread Count [duplicate]

import threading

threads = []
for n in range(0, 60000):
    t = threading.Thread(target=function, args=(x, n))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
It works well for a range of up to 800 on my laptop, but if I increase the range to more than 800 I get the error can't create new thread.
How can I control the number of threads that get created, or is there some other way to make it work, like a timeout? I tried using threading.BoundedSemaphore but that doesn't seem to work properly.
The problem is that no major platform (as of mid-2013) will let you create anywhere near this number of threads. There are a wide variety of different limitations you could run into, and without knowing your platform, its configuration, and the exact error you got, it's impossible to know which one you ran into. But here are two examples:
On 32-bit Windows, the default thread stack is 1MB, and all of your thread stacks have to fit into the same 2GB of virtual memory space as everything else in your program, so you will run out long before 60000.
On 64-bit linux, you will likely exhaust one of your session's soft ulimit values before you get anywhere near running out of page space. (Linux has a variety of different limits beyond the ones required by POSIX.)
So, how can I control the number of threads that get created, or otherwise make this work (with a timeout or whatever)?
Using as many threads as possible is very unlikely to be what you actually want to do. Running 800 threads on an 8-core machine means that you're spending a whole lot of time context-switching between the threads, and the cache keeps getting flushed before it ever gets primed, and so on.
Most likely, what you really want is one of the following:
One thread per CPU, serving a pool of 60000 tasks.
Maybe processes instead of threads (if the primary work is in Python, or in C code that doesn't explicitly release the GIL).
Maybe a fixed number of threads (e.g., a web browser may do, say, 12 concurrent requests at a time, whether you have 1 core or 64).
Maybe a pool of, say, 600 batches of 100 tasks apiece, instead of 60000 single tasks.
60000 cooperatively-scheduled fibers/greenlets/microthreads all sharing one real thread.
Maybe explicit coroutines instead of a scheduler.
Or "magic" cooperative greenlets via, e.g. gevent.
Maybe one thread per CPU, each running 1/Nth of the fibers.
But if you really do want to create tens of thousands of threads, it's certainly possible.
Once you've hit whichever limit you're hitting, it's very likely that trying again will fail until a thread has finished its job and been joined, and it's pretty likely that trying again will succeed after that happens. So, given that you're apparently getting an exception, you could handle this the same way as anything else in Python: with a try/except block. For example, something like this:
threads = []
for n in range(0, 60000):
    while True:
        t = threading.Thread(target=function, args=(x, n))
        try:
            t.start()
            threads.append(t)
        except WhateverTheExceptionIs as e:
            if threads:
                threads[0].join()
                del threads[0]
            else:
                raise
        else:
            break
for t in threads:
    t.join()
Of course this assumes that the first task launched is likely to be among the first to finish. If that is not true, you'll need some way to explicitly signal doneness (a condition, semaphore, queue, etc.), or some lower-level (platform-specific) library that lets you wait on a whole list until at least one thread is finished.
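Since the question mentions threading.BoundedSemaphore, here is a minimal sketch (mine, not part of this answer) of using a semaphore to cap the number of live threads; function and x are the question's names, MAX_THREADS is my placeholder:

import threading

MAX_THREADS = 100                     # hypothetical cap on live threads
slots = threading.BoundedSemaphore(MAX_THREADS)

def run_task(x, n):
    try:
        function(x, n)                # the question's worker
    finally:
        slots.release()               # free the slot, even on error

threads = []
for n in range(60000):
    slots.acquire()                   # blocks while MAX_THREADS are running
    t = threading.Thread(target=run_task, args=(x, n))
    t.start()
    threads.append(t)
for t in threads:
    t.join()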
Also, note that on some platforms (e.g., Windows XP), you can get bizarre behavior just getting near the limits.
On top of being a lot better, doing the right thing will probably be a lot simpler as well. For example, here's a process-per-CPU pool:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    fs = [executor.submit(function, x, n) for n in range(60000)]
    concurrent.futures.wait(fs)
… and a fixed-thread-count pool:
with concurrent.futures.ThreadPoolExecutor(12) as executor:
    fs = [executor.submit(function, x, n) for n in range(60000)]
    concurrent.futures.wait(fs)
… and a balancing-CPU-parallelism-with-numpy-vectorization batching pool:
import os
import numpy as np

with concurrent.futures.ThreadPoolExecutor() as executor:
    batchsize = 60000 // os.cpu_count()
    fs = [executor.submit(np.vector_function, x,
                          np.arange(n, min(n + batchsize, 60000)))
          for n in range(0, 60000, batchsize)]
    concurrent.futures.wait(fs)
In the examples above, I used a list comprehension to submit all of the jobs and gather their futures, because we're not doing anything else inside the loop. But from your comments, it sounds like you do have other stuff you want to do inside the loop. So, let's convert it back into an explicit for statement:
with concurrent.futures.ProcessPoolExecutor() as executor:
    fs = []
    for n in range(60000):
        fs.append(executor.submit(function, x, n))
    concurrent.futures.wait(fs)
And now, whatever you want to add inside that loop, you can.
However, I don't think you actually want to add anything inside that loop. The loop just submits all the jobs as fast as possible; it's the wait function that sits around waiting for them all to finish, and it's probably there that you want to exit early.
To do this, you can use wait with the FIRST_COMPLETED flag, but it's much simpler to use as_completed.
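For reference, the wait variant is essentially a one-liner (a sketch; fs is the futures list from the examples above):

import concurrent.futures

# returns as soon as any single future finishes
done, not_done = concurrent.futures.wait(
    fs, return_when=concurrent.futures.FIRST_COMPLETED)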
Also, I'm assuming error is some kind of value that gets set by the tasks. In that case, you will need to put a Lock around it, as with any other mutable value shared between threads. (This is one place where there's slightly more than a one-line difference between a ProcessPoolExecutor and a ThreadPoolExecutor: if you use processes, you need multiprocessing.Lock instead of threading.Lock.)
So:
import threading

error_lock = threading.Lock()    # note the call: a Lock instance, not the class
error = []

def function(x, n):
    # blah blah
    try:
        # blah blah
        pass
    except Exception as e:
        with error_lock:
            error.append(e)
    # blah blah

with concurrent.futures.ProcessPoolExecutor() as executor:
    fs = [executor.submit(function, x, n) for n in range(60000)]
    for f in concurrent.futures.as_completed(fs):
        do_something_with(f.result())
        with error_lock:
            if len(error) > 1: exit()
However, you might want to consider a different design. In general, if you can avoid sharing between threads, your life gets a lot easier. And futures are designed to make that easy, by letting you return a value or raise an exception, just like a regular function call. That f.result() will give you the returned value or raise the raised exception. So, you can rewrite that code as:
def function(x, n):
    # blah blah
    # don't bother to catch exceptions here; let them propagate out
    pass

with concurrent.futures.ProcessPoolExecutor() as executor:
    fs = [executor.submit(function, x, n) for n in range(60000)]
    error = []
    for f in concurrent.futures.as_completed(fs):
        try:
            result = f.result()
        except Exception as e:
            error.append(e)
            if len(error) > 1: exit()
        else:
            do_something_with(result)
Notice how similar this looks to the ThreadPoolExecutor Example in the docs. This simple pattern is enough to handle almost anything without locks, as long as the tasks don't need to interact with each other.

Does the "print" function run in a new subprocess or something similar?

I'm using praw to get new submissions from reddit:
for submission in submissions:
    print('Submission being evaluated:', submission.id)
    p = Process(target=evaluate, args=(submission.id, lock))
    p.start()
When using this code I sometimes get ids that link to older submissions.
So I changed my script to check if the submissions are new:
for submission in submissions:
    if (time.time() - submission.created) < 15:  # if submission is new
        lock.acquire()
        print('Submission being evaluated:', submission.id)
        lock.release()
        p = Process(target=evaluate, args=(submission.id, lock))
        p.start()
    else:
        lock.acquire()
        print("Submission " + submission.id + " was older than 15 seconds")
        lock.release()
But for an extended period of time the else part didn't get executed even though I got a fair amount of old submission ids with the previous script.
So my question is: when I run print(submission.id), is it running in the background while the subprocess is created, maybe causing a problem and changing the value of submission.id, or is it just a coincidence that with the second script I got no old submissions?
Thanks in advance!
To answer the question in the title, no.
sys.stdout, the stream print writes to, is usually line buffered (though that shouldn't matter here, since print appends a newline unless told not to), and it is shared between threads and subprocesses unless explicitly unshared.
Without knowing more about the code around this, it's hard to say more. (Who knows, maybe you have a background thread somewhere in there that sneakily changes submission.id?)
EDIT:
The new information in the original post, namely that
print('Submission being evaluated:', submission.id)
is being printed, not
print(submission.id)
is critical.
Each argument of a print() call is printed atomically, but if two processes or threads are print()ing simultaneously, say print('a', 'b'), it's entirely possible to get a a b b instead of a b a b.
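A low-tech mitigation (my sketch, with submission as in the question): pre-format the whole line so print receives a single argument and makes essentially one write:

# one pre-built string means the arguments can no longer interleave
print('Submission being evaluated: %s' % submission.id)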
Here is a function I like to use for safely printing to the console. That is the proper use of Lock(): use it around very simple operations. I actually use it in a class, so I don't pass the lock object around as in the example below, but it's the same principle.
Also, the answer is likely yes, but it's more uncertain than certain. Are you also using a lock every time you read and write submission.id? Generally, if you have an object shared by multiple processes, it's best to do this, and it's also best to use the Value class from the multiprocessing library, since Value objects are designed to be safely shared between processes. Below is a trivial but clear example without processes (that's your job!).
from multiprocessing import Process, Lock  # ...

myLock = Lock()
myID = ""

def SetID(id, lock):
    with lock:
        # note: this only rebinds the local name; to really share a value
        # across processes, use multiprocessing.Value instead
        id = "set with lock"
    return

def SafePrint(msg, lock):
    lock.acquire()
    print(msg)
    lock.release()
    return

SetID(myID, myLock)
SafePrint(myID, myLock)
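Following up on the Value suggestion, a minimal sketch with real processes (the names are mine, and I use an integer ID, since Value holds C-typed data):

from multiprocessing import Process, Value

def set_id(shared_id):
    # Value carries its own lock; get_lock() guards the read-modify-write
    with shared_id.get_lock():
        shared_id.value = 42

if __name__ == '__main__':
    shared_id = Value('i', 0)   # 'i' means C int, initial value 0
    p = Process(target=set_id, args=(shared_id,))
    p.start()
    p.join()
    print(shared_id.value)      # prints 42, set by the child process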

Return whichever expression returns first

I have two different functions f, and g that compute the same result with different algorithms. Sometimes one or the other takes a long time while the other terminates quickly. I want to create a new function that runs each simultaneously and then returns the result from the first that finishes.
I want to create that function with a higher order function
h = firstresult(f, g)
What is the best way to accomplish this in Python?
I suspect that the solution involves threading. I'd like to avoid discussion of the GIL.
I would simply use a Queue for this. Start the threads, and the first one that has a result ready writes it to the queue.
Code
from threading import Thread
from time import sleep
from Queue import Queue

def firstresult(*functions):
    queue = Queue()
    threads = []
    for f in functions:
        def thread_main(f=f):   # bind f now; a plain closure would see
            queue.put(f())      # only the loop's latest value of f
        thread = Thread(target=thread_main)
        threads.append(thread)
        thread.start()
    result = queue.get()
    return result

def slow():
    sleep(1)
    return 42

def fast():
    return 0

if __name__ == '__main__':
    print firstresult(slow, fast)
Live demo
http://ideone.com/jzzZX2
Notes
Stopping the threads is an entirely different topic. For that you would add some state variable to the threads, which they check at regular intervals. To keep this example short I skipped that part and assumed that all workers get to finish their work even though the result is never read.
Skipping the discussion about the GIL as requested by the questioner. ;-)
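On the stopping point, a minimal sketch (my addition, not part of this answer) of the usual state-variable approach, using threading.Event as the flag:

import threading
import time

stop = threading.Event()

def worker():
    while not stop.is_set():   # the state variable, checked every interval
        time.sleep(0.1)        # stand-in for one small unit of work
    print("worker saw the stop flag and exited")

t = threading.Thread(target=worker)
t.start()
time.sleep(1)
stop.set()                     # ask the worker to finish
t.join()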
Now, unlike my suggestion on the other answer, this piece of code does exactly what you are requesting:
from multiprocessing import Process, Queue
import random
import time

def firstresult(func1, func2):
    queue = Queue()
    proc1 = Process(target=func1, args=(queue,))
    proc2 = Process(target=func2, args=(queue,))
    proc1.start(); proc2.start()
    result = queue.get()
    proc1.terminate(); proc2.terminate()
    return result

def algo1(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 1")

def algo2(queue):
    time.sleep(random.uniform(0, 1))
    queue.put("algo 2")

print firstresult(algo1, algo2)
Run each function in a new worker thread; the two worker threads send the result back to the main thread via a one-item queue or something similar. When the main thread receives the result from the winner, it kills (do Python threads support kill yet? lol) both worker threads to avoid wasting time (one function may take hours while the other takes only a second).
Replace the word thread with process if you want.
You will need to run each function in another process (with multiprocessing) or in a different thread.
If both are CPU bound, multithreading won't help much (exactly due to the GIL),
so multiprocessing is the way.
If the return value is a pickleable (serializable) object, I have a decorator I created that simply runs the function in the background, in another process:
https://bitbucket.org/jsbueno/lelo/src
It is not exactly what you want, as both calls are non-blocking and start executing right away. The trick with this decorator is that it blocks (and waits for the function to complete) only when you try to use the return value.
But on the other hand - it is just a decorator that does all the work.
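For completeness, a sketch of firstresult on top of concurrent.futures (Python 3; my addition, not from the answers above). Since threads cannot be killed, the losing function still runs to completion in the background:

import concurrent.futures

def firstresult(*functions):
    executor = concurrent.futures.ThreadPoolExecutor(len(functions))
    futures = [executor.submit(f) for f in functions]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    executor.shutdown(wait=False)    # don't block on the slower function
    return next(iter(done)).result()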

Multi-threaded web scraping in Python/PySide/PyQt

I'm building a web scraper of a kind. Basically, what the soft would do is:
User (me) inputs some data (IDs) - IDs are complex, so not just numbers
Based on those IDs, the script visits http://localhost/ID
What is the best way to accomplish this? I'm looking at upwards of 20-30 concurrent connections.
I was thinking, would a simple loop be the solution? This loop would start QThreads (it's a Qt app), so they would run concurrently.
The problem I see with the loop, however, is how to instruct it to use only the IDs that haven't been used before, i.e. not those handled by the iteration/thread executed just before it. Would I need some sort of "delegator" function to keep track of the used IDs and hand the unused ones to the QThreads?
Now I've written some code but I am not sure if it is correct:
class GUI(QObject):
    def __init__(self):
        print "GUI CLASS INITIALIZED!!!"
        self.worker = Worker()
        for i in xrange(300):
            QThreadPool().globalInstance().start(self.worker)

class Worker(QRunnable):
    def run(self):
        print "Hello world from thread", QThread.currentThread()
Now I'm not sure whether this really achieves what I want. Is it actually running in separate threads? I'm asking because currentThread() is the same every time this is executed, so it doesn't look that way.
Basically, my question comes down to: how do I execute several identical QThreads concurrently?
Thanks in advance for the answer!
As Dikei says, Qt is a red herring here. Focus on just using Python threads, as it will keep your code much simpler.
In the code below we have a set, job_queue, containing the jobs to be executed, and a function, worker_thread, which takes a job from the passed-in queue and executes it. Here it just sleeps for a random period of time. The key thing is that set.pop is thread safe.
We create an array of thread objects, workers, and call start on each as we create it. From the Python documentation, threading.Thread.start runs the given callable in a separate thread of control. Lastly we go through each worker thread and block until it has exited.
import threading
import random
import time

pool_size = 5
job_queue = set(range(100))

def worker_thread(queue):
    while True:
        try:
            job = queue.pop()
        except KeyError:
            break
        print "Processing %i..." % (job, )
        time.sleep(random.random())
    print "Thread exiting."

workers = []
for thread in range(pool_size):
    workers.append(threading.Thread(target=worker_thread, args=(job_queue, )))
    workers[-1].start()

for worker in workers:
    worker.join()

print "All threads exited"
