Scraping Websites - python

I have been trying to access some data from a website. I have been using Python's mechanize and beautifulsoup4 packages for this purpose. But since the number of pages I have to parse is around 100,000 or more, doing it with a single thread doesn't make sense. I tried Python's Eventlet package to get some concurrency, but it didn't yield any improvement. Can anyone suggest something else that I can do, or should do, to speed up the data acquisition process?

I am going to quote my own answer to this question since it fits perfectly here as well:
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you put every unit of work (in your case a list of URLs) in a list and hand it to the worker pool.
Your output will be a list of the return values of your worker function, one for each item of work in your original list. All the cool multiprocessing goodness happens in the background. There are of course other ways of working with a worker pool as well, but this is my favourite one.
Happy multi-processing!


Flask: spawning a single async sub-task within a request

I have seen a few variants of my question but not quite exactly what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that for each request inserts some data in a store and, consequently, kicks off an indexing job. The indexing is 2-4 times longer than the main data write and I would like to do that asynchronously to reduce the response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this as resource-efficiently as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route but my bottleneck is I/O, so threads could be more efficient.
Using threads? I am not sure how and if that can be done. All I know is how to launch multiple tasks concurrently with ThreadPoolExecutor, but I only need to spawn one additional task and return immediately without waiting for the results.
asyncio? This too I am not sure how to apply to my situation. asyncio always involves a blocking call somewhere.
Launching data write and indexing concurrently: not doable. I have to wait for a response from the data write to launch indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet - it's exactly what it's for.
Having dependencies isn't a bad thing in itself, as long as you don't have unneeded ones.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
I actually should have perused the Python manual section on concurrency better: the threading module does just what I needed: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread completes even after the Flask request has completed.
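A sketch of that fire-and-forget pattern: the request handler starts the thread for the slow job and returns without joining it. The function names here are hypothetical, and the indexing job is simulated with a sleep:

```python
import threading
import time

results = []

# Hypothetical stand-in for the slow indexing job.
def index_document(doc_id):
    time.sleep(0.1)  # simulate the 2-4x-slower indexing work
    results.append(doc_id)

# What the Flask view would do: write the data synchronously,
# then kick off indexing in a thread and return immediately.
def handle_request(doc_id):
    t = threading.Thread(target=index_document, args=(doc_id,))
    t.start()  # do NOT join here; the response goes out right away
    return "accepted", t

status, worker = handle_request(42)
worker.join()  # only for demonstration; the real view wouldn't join
print(status, results)
```

One caveat: with Gunicorn, make sure the worker process isn't killed or recycled before such background threads finish, or the indexing job can be silently lost.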

Multiprocessing in python with an unknown number of processors

This is probably a simple question, but after reading through documentation, blogs, and googling for a couple days, I haven't found a straightforward answer.
When using the multiprocessing module (https://docs.python.org/3/library/multiprocessing.html) in python, does the module distribute the work evenly between the number of given processors/cores?
More specifically, if I am doing development work on my local machine with four processors, and I write a function that uses multiprocessing to execute six functions, do three or four of them run in parallel and then the others run after something has finished? And, when I deploy it to production with six processors, do all six of those run in parallel?
I am trying to understand how much I need to direct the multiprocessing library. I have seen no direction in code samples, so I am assuming it's handled. I want to be sure I can safely use this in multiple environments.
EDIT
After some comments, I wanted to clarify. I may be misunderstanding something.
I have several different functions I want to run at the same time. I want each of those functions to run on its own core. Speed is very important. My question is: "If I have five functions, and only four cores, how is this handled?"
Thank you.
The short answer is, if you don't specify a number of processes the default will be to spawn as many processes as your machine has cores, as indicated by multiprocessing.cpu_count().
The long answer is that it depends on how you are creating the subprocesses...
If you create a Pool object and then use that with a map or starmap or similar function, that will create "cpu_count" number of processes as described above. Or you can use the processes argument to specify a different number of subprocesses to spawn. The map function will then distribute the work to those processes.
with multiprocessing.Pool(processes=N) as pool:
    rets = pool.map(func, args)
How the work is distributed by the map function can be a little complicated, and you're best off reading the docs in detail if you care enough about performance to worry about chunking and so on.
There are also other libraries that can help manage parallel processing at a higher level and have lots of options, such as Joblib and parmap. Again, best to read the docs.
If you specifically want to launch a number of processes equal to the number of jobs you have, and don't care that it might be more than the number of CPUs in the machine, you can use the Process object instead of the Pool object. This interface parallels the way the threading library can be used for concurrency.
i.e.
jobs = []
for _ in range(num_jobs):
    job = multiprocessing.Process(target=func, args=args)
    job.start()
    jobs.append(job)
# wait for them all to finish
for job in jobs:
    job.join()
Treat the above as pseudocode; you won't be able to copy-paste it and expect it to work, unless you're launching multiple instances of the same function with the same arguments, of course.

How do I use multiprocessing/multithreading to make my Python script quicker?

I am fairly new to Python and programming in general. I have written a script to go through a long list (~7000) of URLs and check their status to find any broken links. Predictably, this takes a few hours to request each URL one by one. I have heard that multiprocessing (or multithreading?) can be used to speed things up. What is the best approach to this? How many processes/threads should I run in one go? Do I have to create batches of URLs to check concurrently?
The answer to the question depends on whether the process spends most of its time processing data or waiting for the network. If it is the former, then you need to use multiprocessing, and spawn about as many processes as you have physical cores on the system. Do not forget to make sure that you choose the appropriate algorithm for the task. Finally, if all else fails, coding parts of the program in C can be a viable solution as well.
If your program is slow because it spends a lot of time waiting for individual server responses, you can parallelize network access using threads or an asynchronous IO framework. In this case you can use many more threads than you have physical processor cores because most of the time your cores will be sleeping waiting for something interesting to happen. You will need to measure the results on your machine to find out the best number of threads that works for you.
Whatever you do, please make sure that your program is not hammering the remote servers with a large number of concurrent or repeated requests.
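Since checking URLs is network-bound, a thread pool is the natural fit here. A sketch using `concurrent.futures.ThreadPoolExecutor`; `check_url` is simulated rather than making real network calls (a real one would wrap `urllib.request.urlopen(url, timeout=10)` in try/except and return the status code):

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated status check; swap in a real HTTP request here.
def check_url(url):
    return url, 404 if "broken" in url else 200

urls = ["http://example.com/ok"] * 5 + ["http://example.com/broken"]

# For network-bound work, 20-50 threads is a common starting point;
# measure on your machine, and stay polite to the remote servers.
with ThreadPoolExecutor(max_workers=20) as pool:
    statuses = dict(pool.map(check_url, urls))

broken = [u for u, code in statuses.items() if code != 200]
print(broken)
```

The executor handles batching for you: it feeds the 7000 URLs to the workers as they free up, so there is no need to split the list into batches by hand.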

Should I create a new Pool object every time or reuse a single one?

I'm trying to understand the best practices with Python's multiprocessing.Pool object.
In my program I use Pool.imap very frequently. Normally every time I start tasks in parallel I create a new pool object and then close it after I'm done.
I recently encountered a hang where the number of tasks submitted to the pool was less than the number of processes. What was odd was that it only occurred in my test pipeline, which had a bunch of things run before it. Running the test standalone did not cause the hang. I assume it has to do with making multiple pools.
I'd really like to find some resources to help me understand the best practices in using Python's multiprocessing. Specifically I'm currently trying to understand the implications of making several pool objects versus using only one.
When you create a Pool of worker processes, new processes are spawned from the parent one. This is a very fast operation but it has its cost.
Therefore, as long as you don't have a very good reason, for example the Pool breaks due to one worker dying unexpectedly, it's better to always use the same Pool instance.
The reason for the hang is hard to tell without inspecting the code. You might not have cleaned up the previous instances properly (call close()/terminate() and then always call join()). You might have sent data that was too big through the Pool channel, which usually ends in a deadlock, and so on.
Surely a pool does not break if you submit fewer tasks than workers. The pool is designed exactly to decouple the number of tasks from the number of workers.

Python: Interruptable threading in wx

My wx GUI shows thumbnails, but they're slow to generate, so:
The program should remain usable while the thumbnails are generating.
Switching to a new folder should stop generating thumbnails for the old folder.
If possible, thumbnail generation should make use of multiple processors.
What is the best way to do this?
Putting the thumbnail generation in a background thread with threading.Thread will solve your first problem, making the program usable.
If you want a way to interrupt it, the usual way is to add a "stop" variable which the background thread checks every so often (e.g., once per thumbnail), and the GUI thread sets when it wants to stop it. Ideally you should protect this with a threading.Condition. (The condition isn't actually necessary in most cases—the same GIL that prevents your code from parallelizing well also protects you from certain kinds of race conditions. But you shouldn't rely on that.)
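A minimal sketch of that stop-flag pattern, using `threading.Event` (a slightly safer packaging of the flag than a bare variable); the thumbnail work is simulated with a sleep:

```python
import threading
import time

stop = threading.Event()
done = []

def generate_thumbnails(paths):
    for path in paths:
        if stop.is_set():    # checked once per thumbnail
            return
        time.sleep(0.01)     # stand-in for real thumbnail generation
        done.append(path)

worker = threading.Thread(
    target=generate_thumbnails,
    args=(["img%d" % i for i in range(1000)],),
)
worker.start()
time.sleep(0.05)             # the GUI thread decides to switch folders...
stop.set()                   # ...and asks the worker to stop
worker.join()
print(len(done))
```

The worker finishes at most the thumbnail it is currently on, then exits cleanly, which is exactly the folder-switch behaviour asked for.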
For the third problem, the first question is: Is thumbnail generation actually CPU-bound? If you're spending more time reading and writing images from disk, it probably isn't, so there's no point trying to parallelize it. But, let's assume that it is.
First, if you have N cores, you want a pool of N threads, or N-1 if the main thread has a lot of work to do too, or maybe something like 2N or 2N-1 to trade off a bit of best-case performance for a bit of worst-case performance.
However, if that CPU work is done in Python, or in a C extension that nevertheless holds the Python GIL, this won't help, because most of the time, only one of those threads will actually be running.
One solution to this is to switch from threads to processes, ideally using the standard multiprocessing module. It has built-in APIs to create a pool of processes, and to submit jobs to the pool with simple load-balancing.
The problem with using processes is that you no longer get automatic sharing of data, so that "stop flag" won't work. You need to explicitly create a flag in shared memory, or use a pipe or some other mechanism for communication instead. The multiprocessing docs explain the various ways to do this.
You can actually just kill the subprocesses. However, you may not want to do this. First, unless you've written your code carefully, it may leave your thumbnail cache in an inconsistent state that will confuse the rest of your code. Also, if you want this to be efficient on Windows, creating the subprocesses takes some time (not as in "30 minutes" or anything, but enough to affect the perceived responsiveness of your code if you recreate the pool every time a user clicks a new folder), so you probably want to create the pool before you need it, and keep it for the entire life of the program.
Other than that, all you have to get right is the job size. Hopefully creating one thumbnail isn't too big of a job—but if it's too small of a job, you can batch multiple thumbnails up into a single job—or, more simply, look at the multiprocessing API and change the way it batches jobs when load-balancing.
Meanwhile, if you go with a pool solution (whether threads or processes), if your jobs are small enough, you may not really need to cancel. Just drain the job queue—each worker will finish whichever job it's working on now, but then sleep until you feed in more jobs. Remember to also drain the queue (and then maybe join the pool) when it's time to quit.
One last thing to keep in mind is that if you successfully generate thumbnails as fast as your computer is capable of generating them, you may actually cause the whole computer—and therefore your GUI—to become sluggish and unresponsive. This usually comes up when your code is actually I/O bound and you're using most of the disk bandwidth, or when you use lots of memory and trigger swap thrash, but if your code really is CPU-bound, and you're having problems because you're using all the CPU, you may want to either use 1 fewer core, or look into setting thread/process priorities.
