Threading advice for web crawler - scheduling with single list

Threading advice for web crawler - scheduling with single list - python

I would like to add in multiple threading to my web crawler but I can see that the way the spider schedules links to be crawled may be incompatible with multi-threading. The crawler is only ever going to be active on a handful of news websites but rather than starting a new thread per domain I would prefer to have multiple threads opened on the same domain. My web crawling code is operated through the following function:
def crawl_links():
links_to_crawl.append(domain[0])
while len(links_to_crawl) > 0:
link = links_to_crawl[0]
if link in crawled_links or link in ignored_links:
del links_to_crawl[0]
else:
print '\n', link
try:
html = get_html(link)
GetLinks(html)
SaveFile(html)
crawled_links.append(links_to_crawl.pop(0))
except (ValueError, urllib2.URLError, Timeout.Timeout, httplib.IncompleteRead):
ignored_links.append(link_to_crawl.pop(0))
print 'Spider finished!'
print 'Ignored links:\n', ignored_links
print 'Crawled links:\n', crawled_links
print 'Relative links\n', relative_links
If my understanding of how threading will work is correct, if I simply opened multiple threads on this process they will all crawl the same links (potentially multiple times) or they will clash a bit. Without necessarily going into specifics, how would you advise to restructure the scheduling to make it compatible with multiple threads running at the same time?
I've given this some thought and the only workaround I could come up with is having the GetLinks() class appending links to multiple lists, with an individual list per thread... but this seems like quite a clumsy workaround.

Here is a general scheme that I have used in order to run a multi-threaded application in Python.
The scheme takes a table of input arguments, and executes in parallel one thread for each row.
Each thread takes one row, and executes sequentially one thread for each item in the row.
Each item contains a fixed number of arguments which are passed to the executed thread.
Input Example:
table = \
[
[[12,32,34],[11,20,14],[33,67,56],[10,20,45]],
[[21,21,67],[44,34,74],[23,12,54],[31,23,13]],
[[31,67,56],[34,22,67],[87,74,52],[87,74,52]],
]
In this example we will have 3 threads running in parallel, each one executing 4 threads sequentially.
In order to keep your threads balanced, it is advisable to have the same number of items in each row.
Threading Scheme:
import threading
import MyClass # This is for you to implement
def RunThreads(outFileName,errFileName):
# Create a shared object for saving the output of different threads
outFile = CriticalSection(outFileName)
# Create a shared object for saving the errors of different threads
errFile = CriticalSection(errFileName)
# Run in parallel one thread for each row in the input table
RunParallelThreads(outFile,errFile)
def RunParallelThreads(outFile,errFile):
# Create all the parallel threads
threads = [threading.Thread(target=RunSequentialThreads,args=(outFile,errFile,row)) for row in table]
# Start all the parallel threads
for thread in threads: thread.start()
# Wait for all the parallel threads to complete
for thread in threads: thread.join()
def RunSequentialThreads(outFile,errFile,row):
myObject = MyClass()
for item in row:
# Create a thread with the arguments given in the current item
thread = threading.Thread(target=myObject.Run,args=(outFile,errFile,item[0],item[1],item[2]))
# Start the thread
thread.start()
# Wait for the thread to complete, but only up to 600 seconds
thread.join(600)
# Terminate the thread if it hasn't completed up to this point
if thread.isAlive():
thread._Thread__stop()
errFile.write('Timeout on arguments: '+item[0]+' '+item[1]+' '+item[2]+'\n')
The class below implements an object which can be safely shared among different threads running in parallel. It provides a single interface method called write, which allows any thread to update the shared object in a safe manner (i.e., without the OS switching to another thread during the process).
import codecs
class CriticalSection:
def __init__(self,fileName):
self.mutex = threading.Lock()
self.fileDesc = codecs.open(fileName,mode='w',encoding='utf-8')
def __del__(self):
del self.mutex
self.fileDesc.close()
def write(self,data):
self.mutex.acquire()
self.fileDesc.write(data)
self.mutex.release()
The above scheme should allow you to control the level of "parallel-ness" and the level of "sequential-ness" within your application.
For example, you can use a single row for all the items, and have your application running in a complete sequential manner.
In contrast, you can place each item in a separate row, and have your application running in a complete parallel manner.
And of course, you can choose any combination of the above...
Note:
In MyClass, you will need to implement method Run, which will take the outFile and errFile objects, as well as the arguments that you have defined for each thread.

Related

Running multiple python scripts from a single script, and communicating back and forth between them?

I have a script that I wrote that I am able to pass arguments to, and I want launch multiple simultaneous iterations (maybe 100+) with unique arguments. My plan was to write another python script which then launch these subscripts/processes, however to be effective, I need the that script to be able to monitor the subscripts for any errors.
Is there any straightforward way to do this, or a library that offers this functionality? I've been searching for a while and am not having good luck finding anything. Creating subprocesses and multiple threads seems straight forward enough but I can't really find any guides or tutorials on how to then communicate with those threads/subprocesses.

A better way to do this would be to make use of threads. If you made the script you want to call into a function in this larger script, you could have your main function call this script as many times as you want and have the threads report back with information as needed. You can read a little bit about how threads work here.

I suggest to use threading.Thread or multiprocessing.Process despite of requirements.
Simple way to communicate between Threads / Processes is to use Queue. Multiprocessing module provides some other ways to communicate between processes (Queue, Event, Manager, ...)
You can see some elementary communication in the example:
import threading
from Queue import Queue
import random
import time
class Worker(threading.Thread):
def __init__(self, name, queue_error):
threading.Thread.__init__(self)
self.name = name
self.queue_error = queue_error
def run(self):
time.sleep(random.randrange(1, 10))
# Do some processing ...
# Report errors
self.queue_error.put((self.name, 'Error state'))
class Launcher(object):
def __init__(self):
self.queue_error = Queue()
def main_loop(self):
# Start threads
for i in range(10):
w = Worker(i, self.queue_error)
w.start()
# Check for errors
while True:
while not self.queue_error.empty():
error_data = self.queue_error.get()
print 'Worker #%s reported error: %s' % (error_data[0], error_data[1])
time.sleep(0.1)
if __name__ == '__main__':
l = Launcher()
l.main_loop()

Like someone else said, you have to use multiple processes for true parallelism instead of threads because the GIL limitation prevents threads from running concurrently.
If you want to use the standard multiprocessing library (which is based on launching multiple processes), I suggest using a pool of workers. If I understood correctly, you want to launch 100+ parallel instances. Launching 100+ processes on one host will generate too much overhead. Instead, create a pool of P workers where P is for example the number of cores in your machine and submit the 100+ jobs to the pool. This is simple to do and there are many examples on the web. Also, when you submit jobs to the pool, you can provide a callback function to receive errors. This may be sufficient for your needs (there are examples here).
The Pool in multiprocessing however can't distribute work across multiple hosts (e.g. cluster of machines) last time I looked. So, if you need to do this, or if you need a more flexible communication scheme, like being able to send updates to the controlling process while the workers are running, my suggestion is to use charm4py (note that I am a charm4py developer so this is where I have experience).
With charm4py you could create N workers which are distributed among P processes by the runtime (works across multiple hosts), and the workers can communicate with the controller simply by doing remote method invocation. Here is a small example:
from charm4py import charm, Chare, Group, Array, ArrayMap, Reducer, threaded
import time
WORKER_ITERATIONS = 100
class Worker(Chare):
def __init__(self, controller):
self.controller = controller
#threaded
def work(self, x, done_future):
result = -1
try:
for i in range(WORKER_ITERATIONS):
if i % 20 == 0:
# send status update to controller
self.controller.progressUpdate(self.thisIndex, i, ret=True).get()
if i == 5 and self.thisIndex[0] % 2 == 0:
# trigger NameError on even-numbered workers
test[3] = 3
time.sleep(0.01)
result = x**2
except Exception as e:
# send error to controller
self.controller.collectError(self.thisIndex, e)
# send result to controller
self.contribute(result, Reducer.gather, done_future)
# This custom map is used to prevent workers from being created on process 0
# (where the controller is). Not strictly needed, but allows more timely
# controller output
class WorkerMap(ArrayMap):
def procNum(self, index):
return (index[0] % (charm.numPes() - 1)) + 1
class Controller(Chare):
def __init__(self, args):
self.startTime = time.time()
done_future = charm.createFuture()
# create 12 workers, which are distributed by charm4py among processes
workers = Array(Worker, 12, args=[self.thisProxy], map=Group(WorkerMap))
# start work
for i in range(12):
workers[i].work(i, done_future)
print('Results are', done_future.get()) # wait for result
exit()
def progressUpdate(self, worker_id, current_step):
print(round(time.time() - self.startTime, 3), ': Worker', worker_id,
'progress', current_step * 100 / WORKER_ITERATIONS, '%')
# the controller can return a value here and the worker would receive it
def collectError(self, worker_id, error):
print(round(time.time() - self.startTime, 3), ': Got error', error,
'from worker', worker_id)
charm.start(Controller)
In this example, the Controller will print status updates and errors as they happen. It
will print final results from all workers when they are all done. The result for workers
that have failed will be -1.
The number of processes P is given at launch. The runtime will distribute the N workers among the available processes. This happens when the workers are created and there is no dynamic load balancing in this particular example.
Also, note that in the charm4py model remote method invocation is asynchronous and returns a future which the caller can block on, but only the calling thread blocks (not the whole process).
Hope this helps.

How many threads are running in this python example?

I want to learn how to run a function in multithreading using python. In other words, I have a long list of arguments that I want to send to a function that might take time to finish. I want my program to loop over the arguments and call the functions on parallel (no need to wait until the function finishes fromt he forst argument).
I found this example code from here:
import Queue
import threading
import urllib2
# called by each thread
def get_url(q, url):
q.put(urllib2.urlopen(url).read())
theurls = ["http://google.com", "http://yahoo.com"]
q = Queue.Queue()
for u in theurls:
t = threading.Thread(target=get_url, args = (q,u))
t.daemon = True
t.start()
s = q.get()
print s
My question are:
1) I normally know that I have to specify a number of threads that I want my program to run in parallel. There is no specific number of threads in the code above.
2) The number of threads is something that varies from device to device (depends on the processor, memory, etc.). Since this code does not specify any number of threads, how the program knows the right number of threads to run concurrently?

The threads are being created in the for loop. The for loop gets executed twice since there are two elements in theurls . This also answers your other two questions. Thus you end up with two threads in the program
plus main loop thread
Total 3

Using Multithreaded queue in python the correct way?

I am trying to use The Queue in python which will be multithreaded. I just wanted to know the approach I am using is correct or not. And if I am doing something redundant or If there is a better approach that I should use.
I am trying to get new requests from a table and schedule them using some logic to perform some operation like running a query.
So here from the main thread I spawn a separate thread for the queue.
if __name__=='__main__':
request_queue = SetQueue(maxsize=-1)
worker = Thread(target=request_queue.process_queue)
worker.setDaemon(True)
worker.start()
while True:
try:
#Connect to the database get all the new requests to be verified
db = Database(username_testschema, password_testschema, mother_host_testschema, mother_port_testschema, mother_sid_testschema, 0)
#Get new requests for verification
verify_these = db.query("SELECT JOB_ID FROM %s.table WHERE JOB_STATUS='%s' ORDER BY JOB_ID" %
(username_testschema, 'INITIATED'))
#If there are some requests to be verified, put them in the queue.
if len(verify_these) > 0:
for row in verify_these:
print "verifying : %s" % row[0]
verify_id = row[0]
request_queue.put(verify_id)
except Exception as e:
logger.exception(e)
finally:
time.sleep(10)
Now in the Setqueue class I have a process_queue function which is used for processing the top 2 requests in every run that were added to the queue.
'''
Overridding the Queue class to use set as all_items instead of list to ensure unique items added and processed all the time,
'''
class SetQueue(Queue.Queue):
def _init(self, maxsize):
Queue.Queue._init(self, maxsize)
self.all_items = set()
def _put(self, item):
if item not in self.all_items:
Queue.Queue._put(self, item)
self.all_items.add(item)
'''
The Multi threaded queue for verification process. Take the top two items, verifies them in a separate thread and sleeps for 10 sec.
This way max two requests per run will be processed.
'''
def process_queue(self):
while True:
scheduler_obj = Scheduler()
try:
if self.qsize() > 0:
for i in range(2):
job_id = self.get()
t = Thread(target=scheduler_obj.verify_func, args=(job_id,))
t.start()
for i in range(2):
t.join(timeout=1)
self.task_done()
except Exception as e:
logger.exception(
"QUEUE EXCEPTION : Exception occured while processing requests in the VERIFICATION QUEUE")
finally:
time.sleep(10)
I want to see if my understanding is correct and if there can be any issues with it.
So the main thread running in while True in the main func connects to database gets new requests and puts it in the queue. The worker thread(daemon) for the queue keeps on getting new requests from the queue and fork non-daemon threads which do the processing and since timeout for the join is 1 the worker thread will keep on taking new requests without getting blocked, and its child thread will keep on processing in the background. Correct?
So in case if the main process exit these won`t be killed until they finish their work but the worker daemon thread would exit.
Doubt : If the parent is daemon and child is non daemon and if parent exits does child exit?).
I also read here :- David beazley multiprocessing
By david beazley in using a Pool as a Thread Coprocessor section where he is trying to solve a similar problem. So should I follow his steps :-
1. Create a pool of processes.
2. Open a thread like I am doing for request_queue
3. In that thread
def process_verification_queue(self):
while True:
try:
if self.qsize() > 0:
job_id = self.get()
pool.apply_async(Scheduler.verify_func, args=(job_id,))
except Exception as e:
logger.exception("QUEUE EXCEPTION : Exception occured while processing requests in the VERIFICATION QUEUE")
Use a process from the pool and run the verify_func in parallel. Will this give me more performance?

While its possible to create a new independent thread for the queue, and process that data separately the way you are doing it, I believe it is more common for each independent worker thread to post messages to a queue that they already "know" about. Then that queue is processed from some other thread by pulling messages out of that queue.
Design Idea
The way I invision your application would be three threads. The main thread, and two worker threads. 1 worker thread would get requests from the database and put them in the queue. The other worker thread would process that data from the queue
The main thread would just waiting for the other threads to finish by using the thread functions .join()
You would protect queue that the threads have access to and make it thread safe by using a mutex. I have seen this pattern in many other designs in other languages as well.
Suggested Reading
"Effective Python" by Brett Slatkin has a great example of this very question.
Instead of inheriting from Queue, he just creates a wrapper to it in his class
called MyQueue and adds a get() and put(message) function.
He even provides the source code at his Github repo
https://github.com/bslatkin/effectivepython/blob/master/example_code/item_39.py
I'm not affiliated with the book or its author, but I highly recommend it as I learned quite a few things from it :)

I like this explanation of the advantages & differences between using threads and processes -
".....But there's a silver lining: processes can make progress on multiple threads of execution simultaneously. Since a parent process doesn't share the GIL with its child processes, all processes can execute simultaneously (subject to the constraints of the hardware and OS)...."
He has some great explanations for getting around GIL and how to improve performance
Read more here:
http://jeffknupp.com/blog/2013/06/30/pythons-hardest-problem-revisited/

Multiprocessing with python3 only runs once

I have a problem running multiple processes in python3 .
My program does the following:
1. Takes entries from an sqllite database and passes them to an input_queue
2. Create multiple processes that take items off the input_queue, run it through a function and output the result to the output queue.
3. Create a thread that takes items off the output_queue and prints them (This thread is obviously started before the first 2 steps)
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code, I used a list of numbers as a substitute for the database entries as it still performs the same way. I have 300 items on the list and I would like it to process all 300 items, but at the moment it just processes 10 (the number of processes I have assigned)
#!/usr/bin/python3
from multiprocessing import Process,Queue
import multiprocessing
from threading import Thread
## This is the class that would be passed to the multi_processing function
class Processor:
def __init__(self,out_queue):
self.out_queue = out_queue
def __call__(self,in_queue):
data_entry = in_queue.get()
result = data_entry*2
self.out_queue.put(result)
#Performs the multiprocessing
def perform_distributed_processing(dbList,threads,processor_factory,output_queue):
input_queue = Queue()
# Create the Data processors.
for i in range(threads):
processor = processor_factory(output_queue)
data_proc = Process(target = processor,
args = (input_queue,))
data_proc.start()
# Push entries to the queue.
for entry in dbList:
input_queue.put(entry)
# Push stop markers to the queue, one for each thread.
for i in range(threads):
input_queue.put(None)
data_proc.join()
output_queue.put(None)
if __name__ == '__main__':
output_results = Queue()
def output_results_reader(queue):
while True:
item = queue.get()
if item is None:
break
print(item)
# Establish results collecting thread.
results_process = Thread(target = output_results_reader,args = (output_results,))
results_process.start()
# Use this as a substitute for the database in the example
dbList = [i for i in range(300)]
# Perform multi processing
perform_distributed_processing(dbList,10,Processor,output_results)
# Wait for it all to finish.
results_process.join()

A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simply Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor
def Processor(data_entry):
return data_entry*2
def perform_distributed_processing(dbList, threads, processor_factory):
with ProcessPoolExecutor(processes=threads) as executor:
yield from executor.map(processor_factory, dbList)
if __name__ == '__main__':
# Use this as a substitute for the database in the example
dbList = [i for i in range(300)]
for result in perform_distributed_processing(dbList, 8, Processor):
print(result)
Or, if you want to handle them as they come instead of in order:
def perform_distributed_processing(dbList, threads, processor_factory):
with ProcessPoolExecutor(processes=threads) as executor:
fs = (executor.submit(processor_factory, db) for db in dbList)
yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because it wasn't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.

Don't try to rewrite the whole multiprocessing library again. I think you can use any of multiprocessing.Pool methods depending on your needs - if this is a batch job you can even use the synchronous multiprocessing.Pool.map() - only instead of pushing to input queue, you need to write a generator that yields input to the threads.

Multi-threaded web scraping in Python/PySide/PyQt

I'm building a web scraper of a kind. Basically, what the soft would do is:
User (me) inputs some data (IDs) - IDs are complex, so not just numbers
Based on those IDs, the script visits http://localhost/ID
What is the best way to accomplish this? So I'm looking upwards of 20-30 concurrent connections to do it.
I was thinking, would a simple loop be the solution? This loop would start QThreads (it's a Qt app), so they would run concurrently.
The problem I am seeing with the loop however is how to instruct it to use only those IDs not used before i.e. in the iteration/thread that had been executed just before it was? Would I need some sort of a "delegator" function which will keep track of what IDs had been used and delegate the unused ones to the QThreads?
Now I've written some code but I am not sure if it is correct:
class GUI(QObject):
def __init__(self):
print "GUI CLASS INITIALIZED!!!"
self.worker = Worker()
for i in xrange(300):
QThreadPool().globalInstance().start(self.worker)
class Worker(QRunnable):
def run(self):
print "Hello world from thread", QThread.currentThread()
Now I'm not sure if these achieve really what I want. Is this actually running in separate threads? I'm asking because currentThread() is the same every time this is executed, so it doesn't look that way.
Basically, my question comes down to how do I execute several same QThreads concurrently?
Thanks in advance for the answer!

As Dikei says, Qt is red herring here. Focus on just using Python threads as it will keep your code much simpler.
In the code below we have a set, job_queue, containing the jobs to be executed. We also have a function, worker_thread which takes a job from the passed in queue and executes. Here it just sleeps for a random period of time. The key thing here is that set.pop is thread safe.
We create an array of thread objects, workers, and call start on each as we create it. From the Python documentation threading.Thread.start runs the given callable in a separate thread of control. Lastly we go through each worker thread and block until it has exited.
import threading
import random
import time
pool_size = 5
job_queue = set(range(100))
def worker_thread(queue):
while True:
try:
job = queue.pop()
except KeyError:
break
print "Processing %i..." % (job, )
time.sleep(random.random())
print "Thread exiting."
workers = []
for thread in range(pool_size):
workers.append(threading.Thread(target=worker_thread, args=(job_queue, )))
workers[-1].start()
for worker in workers:
worker.join()
print "All threads exited"

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Threading advice for web crawler - scheduling with single list - python

Related

Running multiple python scripts from a single script, and communicating back and forth between them?

How many threads are running in this python example?

Using Multithreaded queue in python the correct way?

Multiprocessing with python3 only runs once

Multi-threaded web scraping in Python/PySide/PyQt

Categories

Resources