Multi-threaded web scraping in Python/PySide/PyQt

Multi-threaded web scraping in Python/PySide/PyQt - python

I'm building a web scraper of a kind. Basically, what the soft would do is:
User (me) inputs some data (IDs) - IDs are complex, so not just numbers
Based on those IDs, the script visits http://localhost/ID
What is the best way to accomplish this? So I'm looking upwards of 20-30 concurrent connections to do it.
I was thinking, would a simple loop be the solution? This loop would start QThreads (it's a Qt app), so they would run concurrently.
The problem I am seeing with the loop however is how to instruct it to use only those IDs not used before i.e. in the iteration/thread that had been executed just before it was? Would I need some sort of a "delegator" function which will keep track of what IDs had been used and delegate the unused ones to the QThreads?
Now I've written some code but I am not sure if it is correct:
class GUI(QObject):
def __init__(self):
print "GUI CLASS INITIALIZED!!!"
self.worker = Worker()
for i in xrange(300):
QThreadPool().globalInstance().start(self.worker)
class Worker(QRunnable):
def run(self):
print "Hello world from thread", QThread.currentThread()
Now I'm not sure if these achieve really what I want. Is this actually running in separate threads? I'm asking because currentThread() is the same every time this is executed, so it doesn't look that way.
Basically, my question comes down to how do I execute several same QThreads concurrently?
Thanks in advance for the answer!

As Dikei says, Qt is red herring here. Focus on just using Python threads as it will keep your code much simpler.
In the code below we have a set, job_queue, containing the jobs to be executed. We also have a function, worker_thread which takes a job from the passed in queue and executes. Here it just sleeps for a random period of time. The key thing here is that set.pop is thread safe.
We create an array of thread objects, workers, and call start on each as we create it. From the Python documentation threading.Thread.start runs the given callable in a separate thread of control. Lastly we go through each worker thread and block until it has exited.
import threading
import random
import time
pool_size = 5
job_queue = set(range(100))
def worker_thread(queue):
while True:
try:
job = queue.pop()
except KeyError:
break
print "Processing %i..." % (job, )
time.sleep(random.random())
print "Thread exiting."
workers = []
for thread in range(pool_size):
workers.append(threading.Thread(target=worker_thread, args=(job_queue, )))
workers[-1].start()
for worker in workers:
worker.join()
print "All threads exited"

Related

Running multiple python scripts from a single script, and communicating back and forth between them?

I have a script that I wrote that I am able to pass arguments to, and I want launch multiple simultaneous iterations (maybe 100+) with unique arguments. My plan was to write another python script which then launch these subscripts/processes, however to be effective, I need the that script to be able to monitor the subscripts for any errors.
Is there any straightforward way to do this, or a library that offers this functionality? I've been searching for a while and am not having good luck finding anything. Creating subprocesses and multiple threads seems straight forward enough but I can't really find any guides or tutorials on how to then communicate with those threads/subprocesses.

A better way to do this would be to make use of threads. If you made the script you want to call into a function in this larger script, you could have your main function call this script as many times as you want and have the threads report back with information as needed. You can read a little bit about how threads work here.

I suggest to use threading.Thread or multiprocessing.Process despite of requirements.
Simple way to communicate between Threads / Processes is to use Queue. Multiprocessing module provides some other ways to communicate between processes (Queue, Event, Manager, ...)
You can see some elementary communication in the example:
import threading
from Queue import Queue
import random
import time
class Worker(threading.Thread):
def __init__(self, name, queue_error):
threading.Thread.__init__(self)
self.name = name
self.queue_error = queue_error
def run(self):
time.sleep(random.randrange(1, 10))
# Do some processing ...
# Report errors
self.queue_error.put((self.name, 'Error state'))
class Launcher(object):
def __init__(self):
self.queue_error = Queue()
def main_loop(self):
# Start threads
for i in range(10):
w = Worker(i, self.queue_error)
w.start()
# Check for errors
while True:
while not self.queue_error.empty():
error_data = self.queue_error.get()
print 'Worker #%s reported error: %s' % (error_data[0], error_data[1])
time.sleep(0.1)
if __name__ == '__main__':
l = Launcher()
l.main_loop()

Like someone else said, you have to use multiple processes for true parallelism instead of threads because the GIL limitation prevents threads from running concurrently.
If you want to use the standard multiprocessing library (which is based on launching multiple processes), I suggest using a pool of workers. If I understood correctly, you want to launch 100+ parallel instances. Launching 100+ processes on one host will generate too much overhead. Instead, create a pool of P workers where P is for example the number of cores in your machine and submit the 100+ jobs to the pool. This is simple to do and there are many examples on the web. Also, when you submit jobs to the pool, you can provide a callback function to receive errors. This may be sufficient for your needs (there are examples here).
The Pool in multiprocessing however can't distribute work across multiple hosts (e.g. cluster of machines) last time I looked. So, if you need to do this, or if you need a more flexible communication scheme, like being able to send updates to the controlling process while the workers are running, my suggestion is to use charm4py (note that I am a charm4py developer so this is where I have experience).
With charm4py you could create N workers which are distributed among P processes by the runtime (works across multiple hosts), and the workers can communicate with the controller simply by doing remote method invocation. Here is a small example:
from charm4py import charm, Chare, Group, Array, ArrayMap, Reducer, threaded
import time
WORKER_ITERATIONS = 100
class Worker(Chare):
def __init__(self, controller):
self.controller = controller
#threaded
def work(self, x, done_future):
result = -1
try:
for i in range(WORKER_ITERATIONS):
if i % 20 == 0:
# send status update to controller
self.controller.progressUpdate(self.thisIndex, i, ret=True).get()
if i == 5 and self.thisIndex[0] % 2 == 0:
# trigger NameError on even-numbered workers
test[3] = 3
time.sleep(0.01)
result = x**2
except Exception as e:
# send error to controller
self.controller.collectError(self.thisIndex, e)
# send result to controller
self.contribute(result, Reducer.gather, done_future)
# This custom map is used to prevent workers from being created on process 0
# (where the controller is). Not strictly needed, but allows more timely
# controller output
class WorkerMap(ArrayMap):
def procNum(self, index):
return (index[0] % (charm.numPes() - 1)) + 1
class Controller(Chare):
def __init__(self, args):
self.startTime = time.time()
done_future = charm.createFuture()
# create 12 workers, which are distributed by charm4py among processes
workers = Array(Worker, 12, args=[self.thisProxy], map=Group(WorkerMap))
# start work
for i in range(12):
workers[i].work(i, done_future)
print('Results are', done_future.get()) # wait for result
exit()
def progressUpdate(self, worker_id, current_step):
print(round(time.time() - self.startTime, 3), ': Worker', worker_id,
'progress', current_step * 100 / WORKER_ITERATIONS, '%')
# the controller can return a value here and the worker would receive it
def collectError(self, worker_id, error):
print(round(time.time() - self.startTime, 3), ': Got error', error,
'from worker', worker_id)
charm.start(Controller)
In this example, the Controller will print status updates and errors as they happen. It
will print final results from all workers when they are all done. The result for workers
that have failed will be -1.
The number of processes P is given at launch. The runtime will distribute the N workers among the available processes. This happens when the workers are created and there is no dynamic load balancing in this particular example.
Also, note that in the charm4py model remote method invocation is asynchronous and returns a future which the caller can block on, but only the calling thread blocks (not the whole process).
Hope this helps.

Python Gevent Shared Queue (Listener Process)

I am trying to get some code working where I can implement logging into a multi-threaded program using gevent. What I'd like to do is set up custom logging handlers to put log events into a Queue, while a listener process is continuously watching for new log events to handle appropriately. I have done this in the past with Multiprocessing, but never with Gevent.
I'm having an issue where the program is getting caught up in the infinite loop (listener process), and not allowing the other threads to "do work"...
Ideally, after the worker processes have finished, I can pass an arbitrary value to the listener process to tell it to break the loop, and then join all the processes together. Here's what I have so far:
import gevent
from gevent.pool import Pool
import Queue
import random
import time
def listener(q):
while True:
if not q.empty():
num = q.get()
print "The number is: %s" % num
if num <= 100:
print q.get()
# got passed 101, break out
else:
break
else:
continue
def worker(pid,q):
if pid == 0:
listener(q)
else:
gevent.sleep(random.randint(0,2)*0.001)
num = random.randint(1,100)
q.put(num)
def main():
q = Queue.Queue()
all_threads = []
all_threads = [gevent.spawn(worker, pid,q) for pid in xrange(10)]
gevent.wait(all_threads[1:])
q.put(101)
gevent.joinall(all_threads)
if __name__ == '__main__':
main()
As I said, the program seems to be getting hung up on that first process and does not allow the other workers to do their thing. I have also tried spawning the listener process completely separately itself (which is actually how I would rather do it), but that didn't seem to work either so I tried this way.
Any help would be appreciated, feel like I am probably just missing something obvious about gevent's back end.
Thanks

The first problem is that your listener is never yielding if the queue is initially empty. The first task you spawn is your listener. When it starts, there's a while True:, the q will be empty, so you go to the else branch, which just continues, looping back to the start of the while loop, and then the q is still empty. So you just sit in the first thread constantly checking the q is empty.
The key thing here is that gevent does not use "native" threads or processes. Unlike "real" threads, which can be switched to at any time by something behind the scenes (like your OS scheduler), gevent uses 'greenlets', which require that you do something to "yield control" to another task. That something is whatever gevent thinks would block, such as read from the network, disk, or use one of the blocking gevent operations.
One crude fix would be to start your listener when pid == 9 rather than 0. By making it spawn last, there will be items in the q, and it will go into the main if branch. The downside is that this doesn't fix the logic problem, so the first time the queue is empty, you'll get stuck in your infinite loop again.
A more correct fix would be to put gevent.sleep() instead of continue. sleep is a blocking operation, so your other tasks will get a chance to run. Without arguments, it waits for no time, but still gives gevent the chance to decide to switch to another task if it is ready to run. This still isn't very efficient, though, as if the Queue is empty, it's going to spend a lot of pointless time checking that over and over and asking to run again as soon as it can. sleep'ing for longer than the default of 0 will be more efficient, but would delay processing your log messages.
However, you can instead take advantage of the fact that many of gevent's types, such as Queue, can be used in more Pythonic ways and make your code a lot simpler and easier to understand, as well as more efficient.
import gevent
from gevent.queue import Queue
def listener(q):
for msg in q:
print "the number is %d" % msg
def worker(pid,q):
gevent.sleep(random.randint(0,2)*0.001)
num = random.randint(1,100)
q.put(num)
def main():
q = Queue()
listener_task = gevent.spawn(listener, q)
worker_tasks = [gevent.spawn(worker, pid, q) for pid in xrange(1, 10)]
gevent.wait(worker_tasks)
q.put(StopIteration)
gevent.join(listener_task)
Here, Queue can operate as an iterator in a for loop. As long as there are messages, it will get an item, run the loop, and then wait for another item. If there are no items, it will just block and hang around until the next one arrives. Since it blocks, though, gevent will switch to one of your other tasks to run, avoiding the infinite loop problem your example code has.
Because this version is using the Queue as a for loop iterator, there's also automatically a nice sentinel value we can put in the queue to make the listener task quit. If a for loop gets StopIteration from its iterator, it will exit cleanly. So when our for loop that's reading from q gets StopIteration from the q, it exits, and then the function exits, and the spawned task is finished.

Using Multithreaded queue in python the correct way?

I am trying to use The Queue in python which will be multithreaded. I just wanted to know the approach I am using is correct or not. And if I am doing something redundant or If there is a better approach that I should use.
I am trying to get new requests from a table and schedule them using some logic to perform some operation like running a query.
So here from the main thread I spawn a separate thread for the queue.
if __name__=='__main__':
request_queue = SetQueue(maxsize=-1)
worker = Thread(target=request_queue.process_queue)
worker.setDaemon(True)
worker.start()
while True:
try:
#Connect to the database get all the new requests to be verified
db = Database(username_testschema, password_testschema, mother_host_testschema, mother_port_testschema, mother_sid_testschema, 0)
#Get new requests for verification
verify_these = db.query("SELECT JOB_ID FROM %s.table WHERE JOB_STATUS='%s' ORDER BY JOB_ID" %
(username_testschema, 'INITIATED'))
#If there are some requests to be verified, put them in the queue.
if len(verify_these) > 0:
for row in verify_these:
print "verifying : %s" % row[0]
verify_id = row[0]
request_queue.put(verify_id)
except Exception as e:
logger.exception(e)
finally:
time.sleep(10)
Now in the Setqueue class I have a process_queue function which is used for processing the top 2 requests in every run that were added to the queue.
'''
Overridding the Queue class to use set as all_items instead of list to ensure unique items added and processed all the time,
'''
class SetQueue(Queue.Queue):
def _init(self, maxsize):
Queue.Queue._init(self, maxsize)
self.all_items = set()
def _put(self, item):
if item not in self.all_items:
Queue.Queue._put(self, item)
self.all_items.add(item)
'''
The Multi threaded queue for verification process. Take the top two items, verifies them in a separate thread and sleeps for 10 sec.
This way max two requests per run will be processed.
'''
def process_queue(self):
while True:
scheduler_obj = Scheduler()
try:
if self.qsize() > 0:
for i in range(2):
job_id = self.get()
t = Thread(target=scheduler_obj.verify_func, args=(job_id,))
t.start()
for i in range(2):
t.join(timeout=1)
self.task_done()
except Exception as e:
logger.exception(
"QUEUE EXCEPTION : Exception occured while processing requests in the VERIFICATION QUEUE")
finally:
time.sleep(10)
I want to see if my understanding is correct and if there can be any issues with it.
So the main thread running in while True in the main func connects to database gets new requests and puts it in the queue. The worker thread(daemon) for the queue keeps on getting new requests from the queue and fork non-daemon threads which do the processing and since timeout for the join is 1 the worker thread will keep on taking new requests without getting blocked, and its child thread will keep on processing in the background. Correct?
So in case if the main process exit these won`t be killed until they finish their work but the worker daemon thread would exit.
Doubt : If the parent is daemon and child is non daemon and if parent exits does child exit?).
I also read here :- David beazley multiprocessing
By david beazley in using a Pool as a Thread Coprocessor section where he is trying to solve a similar problem. So should I follow his steps :-
1. Create a pool of processes.
2. Open a thread like I am doing for request_queue
3. In that thread
def process_verification_queue(self):
while True:
try:
if self.qsize() > 0:
job_id = self.get()
pool.apply_async(Scheduler.verify_func, args=(job_id,))
except Exception as e:
logger.exception("QUEUE EXCEPTION : Exception occured while processing requests in the VERIFICATION QUEUE")
Use a process from the pool and run the verify_func in parallel. Will this give me more performance?

While its possible to create a new independent thread for the queue, and process that data separately the way you are doing it, I believe it is more common for each independent worker thread to post messages to a queue that they already "know" about. Then that queue is processed from some other thread by pulling messages out of that queue.
Design Idea
The way I invision your application would be three threads. The main thread, and two worker threads. 1 worker thread would get requests from the database and put them in the queue. The other worker thread would process that data from the queue
The main thread would just waiting for the other threads to finish by using the thread functions .join()
You would protect queue that the threads have access to and make it thread safe by using a mutex. I have seen this pattern in many other designs in other languages as well.
Suggested Reading
"Effective Python" by Brett Slatkin has a great example of this very question.
Instead of inheriting from Queue, he just creates a wrapper to it in his class
called MyQueue and adds a get() and put(message) function.
He even provides the source code at his Github repo
https://github.com/bslatkin/effectivepython/blob/master/example_code/item_39.py
I'm not affiliated with the book or its author, but I highly recommend it as I learned quite a few things from it :)

I like this explanation of the advantages & differences between using threads and processes -
".....But there's a silver lining: processes can make progress on multiple threads of execution simultaneously. Since a parent process doesn't share the GIL with its child processes, all processes can execute simultaneously (subject to the constraints of the hardware and OS)...."
He has some great explanations for getting around GIL and how to improve performance
Read more here:
http://jeffknupp.com/blog/2013/06/30/pythons-hardest-problem-revisited/

Threading advice for web crawler - scheduling with single list

I would like to add in multiple threading to my web crawler but I can see that the way the spider schedules links to be crawled may be incompatible with multi-threading. The crawler is only ever going to be active on a handful of news websites but rather than starting a new thread per domain I would prefer to have multiple threads opened on the same domain. My web crawling code is operated through the following function:
def crawl_links():
links_to_crawl.append(domain[0])
while len(links_to_crawl) > 0:
link = links_to_crawl[0]
if link in crawled_links or link in ignored_links:
del links_to_crawl[0]
else:
print '\n', link
try:
html = get_html(link)
GetLinks(html)
SaveFile(html)
crawled_links.append(links_to_crawl.pop(0))
except (ValueError, urllib2.URLError, Timeout.Timeout, httplib.IncompleteRead):
ignored_links.append(link_to_crawl.pop(0))
print 'Spider finished!'
print 'Ignored links:\n', ignored_links
print 'Crawled links:\n', crawled_links
print 'Relative links\n', relative_links
If my understanding of how threading will work is correct, if I simply opened multiple threads on this process they will all crawl the same links (potentially multiple times) or they will clash a bit. Without necessarily going into specifics, how would you advise to restructure the scheduling to make it compatible with multiple threads running at the same time?
I've given this some thought and the only workaround I could come up with is having the GetLinks() class appending links to multiple lists, with an individual list per thread... but this seems like quite a clumsy workaround.

Here is a general scheme that I have used in order to run a multi-threaded application in Python.
The scheme takes a table of input arguments, and executes in parallel one thread for each row.
Each thread takes one row, and executes sequentially one thread for each item in the row.
Each item contains a fixed number of arguments which are passed to the executed thread.
Input Example:
table = \
[
[[12,32,34],[11,20,14],[33,67,56],[10,20,45]],
[[21,21,67],[44,34,74],[23,12,54],[31,23,13]],
[[31,67,56],[34,22,67],[87,74,52],[87,74,52]],
]
In this example we will have 3 threads running in parallel, each one executing 4 threads sequentially.
In order to keep your threads balanced, it is advisable to have the same number of items in each row.
Threading Scheme:
import threading
import MyClass # This is for you to implement
def RunThreads(outFileName,errFileName):
# Create a shared object for saving the output of different threads
outFile = CriticalSection(outFileName)
# Create a shared object for saving the errors of different threads
errFile = CriticalSection(errFileName)
# Run in parallel one thread for each row in the input table
RunParallelThreads(outFile,errFile)
def RunParallelThreads(outFile,errFile):
# Create all the parallel threads
threads = [threading.Thread(target=RunSequentialThreads,args=(outFile,errFile,row)) for row in table]
# Start all the parallel threads
for thread in threads: thread.start()
# Wait for all the parallel threads to complete
for thread in threads: thread.join()
def RunSequentialThreads(outFile,errFile,row):
myObject = MyClass()
for item in row:
# Create a thread with the arguments given in the current item
thread = threading.Thread(target=myObject.Run,args=(outFile,errFile,item[0],item[1],item[2]))
# Start the thread
thread.start()
# Wait for the thread to complete, but only up to 600 seconds
thread.join(600)
# Terminate the thread if it hasn't completed up to this point
if thread.isAlive():
thread._Thread__stop()
errFile.write('Timeout on arguments: '+item[0]+' '+item[1]+' '+item[2]+'\n')
The class below implements an object which can be safely shared among different threads running in parallel. It provides a single interface method called write, which allows any thread to update the shared object in a safe manner (i.e., without the OS switching to another thread during the process).
import codecs
class CriticalSection:
def __init__(self,fileName):
self.mutex = threading.Lock()
self.fileDesc = codecs.open(fileName,mode='w',encoding='utf-8')
def __del__(self):
del self.mutex
self.fileDesc.close()
def write(self,data):
self.mutex.acquire()
self.fileDesc.write(data)
self.mutex.release()
The above scheme should allow you to control the level of "parallel-ness" and the level of "sequential-ness" within your application.
For example, you can use a single row for all the items, and have your application running in a complete sequential manner.
In contrast, you can place each item in a separate row, and have your application running in a complete parallel manner.
And of course, you can choose any combination of the above...
Note:
In MyClass, you will need to implement method Run, which will take the outFile and errFile objects, as well as the arguments that you have defined for each thread.

Return whichever expression returns first

I have two different functions f, and g that compute the same result with different algorithms. Sometimes one or the other takes a long time while the other terminates quickly. I want to create a new function that runs each simultaneously and then returns the result from the first that finishes.
I want to create that function with a higher order function
h = firstresult(f, g)
What is the best way to accomplish this in Python?
I suspect that the solution involves threading. I'd like to avoid discussion of the GIL.

I would simply use a Queue for this. Start the threads and the first one which has a result ready writes to the queue.
Code
from threading import Thread
from time import sleep
from Queue import Queue
def firstresult(*functions):
queue = Queue()
threads = []
for f in functions:
def thread_main():
queue.put(f())
thread = Thread(target=thread_main)
threads.append(thread)
thread.start()
result = queue.get()
return result
def slow():
sleep(1)
return 42
def fast():
return 0
if __name__ == '__main__':
print firstresult(slow, fast)
Live demo
http://ideone.com/jzzZX2
Notes
Stopping the threads is an entirely different topic. For this you need to add some state variable to the threads which needs to be checked in regular intervals. As I want to keep this example short I simply assumed that part and assumed that all workers get the time to finish their work even though the result is never read.
Skipping the discussion about the Gil as requested by the questioner. ;-)

Now - unlike my suggestion on the other answer, this piece of code does exactly what you are requesting:
from multiprocessing import Process, Queue
import random
import time
def firstresult(func1, func2):
queue = Queue()
proc1 = Process(target=func1,args=(queue,))
proc2 = Process(target=func2, args=(queue,))
proc1.start();proc2.start()
result = queue.get()
proc1.terminate(); proc2.terminate()
return result
def algo1(queue):
time.sleep(random.uniform(0,1))
queue.put("algo 1")
def algo2(queue):
time.sleep(random.uniform(0,1))
queue.put("algo 2")
print firstresult(algo1, algo2)

Run each function in a new worker thread, the 2 worker threads send the result back to the main thread in a 1 item queue or something similar. When the main thread receives the result from the winner, it kills (do python threads support kill yet? lol.) both worker threads to avoid wasting time (one function may take hours while the other only takes a second).
Replace the word thread with process if you want.

You will need to run each function in another process (with multiprocessing) or in a different thread.
If both are CPU bound, multithread won help much - exactly due to the GIL -
so multiprocessing is the way.
If the return value is a pickleable (serializable) object, I have this decorator I created that simply runs the function in background, in another process:
https://bitbucket.org/jsbueno/lelo/src
It is not exactly what you want - as both are non-blocking and start executing right away. The tirck with this decorator is that it blocks (and waits for the function to complete) as when you try to use the return value.
But on the other hand - it is just a decorator that does all the work.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Multi-threaded web scraping in Python/PySide/PyQt - python

Related

Running multiple python scripts from a single script, and communicating back and forth between them?

Python Gevent Shared Queue (Listener Process)

Using Multithreaded queue in python the correct way?

Threading advice for web crawler - scheduling with single list

Return whichever expression returns first

Categories

Resources