Multithreading to crawl a large volume of URLs and download - python

I've been using multithreading to do this, however it hangs up a lot. I was thinking about multiprocessing, but I am not sure if that is any more advantageous.
I have a series of names, and for each name a range of dates. I spawn a thread for each date in the range and then do work inside. Once the work is complete, each thread puts its result into a Queue() for the main thread to update the GUI.
Is using a Queue() to hold the desired URLs better than starting, say, 350 threads at once and waiting? Python seems to hang when I start that many threads.

It is my understanding that threads are better at waiting (I/O-bound work) and multiprocessing is better at CPU-bound work. It would seem threading/green-threads are the way to go. Check out the aiohttp library, or may I suggest scrapy, which runs on the Twisted framework and is async. Either of these (especially scrapy) will solve your problem. But why reinvent the wheel by rolling your own when scrapy has everything you need? If scrapy seems too bloated for your use case, why not use the non-blocking request tools provided by aiohttp, using Python 3's async/await syntax?
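If you do roll your own with threads, the usual fix for hangs when starting 350 threads at once is a small bounded pool fed from the task list. A minimal sketch, where `fetch` is a placeholder for your real download code (not part of your program):

```python
import concurrent.futures
import queue

def fetch(task):
    # stand-in for the real download; in practice this would issue
    # the HTTP request for one (name, date) pair
    name, date = task
    return f"{name}:{date}"

def crawl(tasks, max_workers=16):
    # a bounded pool: you can queue 350 tasks, but only max_workers
    # threads ever exist at once, so the interpreter isn't swamped
    results = queue.Queue()  # the main thread drains this to update the GUI
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(fetch, tasks):
            results.put(result)
    return results
```

`pool.map` yields results in submission order, which keeps the GUI-update side simple.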

Related

Python Webscraping: grequests vs. multi-threaded requests?

I'm trying to make a web scraper I'm writing in Python faster.
Currently I fire up a set amount of scraper threads, create a queue with a list of URLs I want to scrape and let them dequeue entries so they can scrape.
grequests states it's asynchronous, but I'm not sure what that means beyond firing multiple threads (like I'm doing) and using gevent to trigger an event when it's finished.
Does grequests do anything more beyond create a thread per job and will it technically run more quickly than the program I've outlined above?
Check this out:
https://adl1995.github.io/a-comparison-of-response-times-using-urllib-grequests-and-asyncio.html
TL;DR:
"Using aiohttp with asyncio seems to be the best option. Its response time is almost 50% less than grequests."
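For a feel of the aiohttp/asyncio approach the article benchmarks, here is a minimal sketch. The `fetch` coroutine stands in for an aiohttp request; with aiohttp installed, its body would be an `async with session.get(url)` call instead of a sleep:

```python
import asyncio

async def fetch(url):
    # placeholder for an aiohttp call; with aiohttp the body would be:
    #   async with session.get(url) as resp:
    #       return await resp.text()
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"body of {url}"

async def fetch_all(urls):
    # every request is in flight concurrently on a single thread,
    # no thread pool needed
    return await asyncio.gather(*(fetch(u) for u in urls))

bodies = asyncio.run(fetch_all(["http://a", "http://b"]))
```

Unlike a thread per job, the concurrency here costs one coroutine per URL, which is why the asyncio numbers in the comparison come out ahead.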

Flask: spawning a single async sub-task within a request

I have seen a few variants of my question but not quite exactly what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that for each request inserts some data in a store and, consequently, kicks off an indexing job. The indexing is 2-4 times longer than the main data write and I would like to do that asynchronously to reduce the response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this, that is as resource-efficient as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route but my bottleneck is I/O, so threads could be more efficient.
Using threads? I am not sure how, or if, that can be done. All I know is how to launch multiple tasks concurrently with ThreadPoolExecutor, but here I only need to spawn one additional task and return immediately without waiting for the results.
asyncio? This too I am not sure how to apply to my situation: asyncio always seems to involve a blocking call somewhere.
Launching data write and indexing concurrently: not doable. I have to wait for a response from the data write to launch indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet - it's exactly what it's for.
Needing to introduce dependencies isn't a bad thing in itself, as long as you don't have unneeded dependencies.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
I actually should have perused the Python manual section on concurrency better: the threading module does just what I needed: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread gets completed even after the Flask request is completed.
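That one-extra-task pattern is just a plain `threading.Thread` started from the request handler. A sketch with the data write and indexing stubbed out (`handle_request` and the sink list are hypothetical names, not Flask API):

```python
import threading

def handle_request(doc_id, sink):
    # the synchronous data write would happen here first; then the
    # slower indexing step is handed off to a background thread
    worker = threading.Thread(target=lambda: sink.append(f"indexed {doc_id}"))
    worker.start()
    return "ok", worker  # return immediately; the thread finishes on its own
```

The handler returns as soon as `start()` is called, and the thread keeps running after the response goes out, which is what the dummy-sleep test confirmed.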

grequests alternative - python

I am developing a program that downloads multiple pages, and I used grequests to minimize the download time, and also because it supports requests sessions, since the program requires a login. grequests is based on gevent, which gave me a hard time when compiling the program (py2exe, bbfreeze). Is there any alternative that can use requests sessions? Or are there any tips on compiling a program with gevent?
I can't use pyinstaller: I have to use esky which allows updates.
Sure, there are plenty of alternatives. There's absolutely no reason you have to use gevent—or greenlets at all—to download multiple pages.
If you're trying to handle thousands of connections, that's one thing, but normally a parallel downloader only wants 4-16 simultaneous connections, and any modern OS can run 4-16 threads just fine. Here's an example using Python 3.2+. If you're using 2.x or 3.1, download the futures backport from PyPI—it's pure Python, so you should have no trouble building and packaging it.
import concurrent.futures
import requests

def get_url(url):
    # your existing requests-based code here, e.g. with a shared Session
    ...

urls = [your, list, of, page, urls, here]
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(get_url, urls))
If you have some simple post-processing to do after each of the downloads on the main thread, the example in the docs shows how to do exactly that.
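That docs example boils down to `submit` plus `as_completed`: futures come back in whatever order the downloads finish, and you post-process each one on the main thread as it arrives. A sketch with the download stubbed out (`get_url` here is a placeholder, not your real code):

```python
import concurrent.futures

def get_url(url):
    # stand-in for the requests-based download above
    return f"<html>{url}</html>"

urls = ["http://a", "http://b"]
pages = []
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {pool.submit(get_url, u): u for u in urls}
    for fut in concurrent.futures.as_completed(futures):
        # post-process each page on the main thread, in completion order
        pages.append(fut.result())
```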
If you've heard that "threads are bad in Python because of the GIL", you've heard wrong. Threads that do CPU-bound work in Python are bad because of the GIL. Threads that do I/O-bound work, like downloading a web page, are perfectly fine. And that's exactly the same restriction as with greenlets, like your existing grequests code, which works.
As I said, this isn't the only alternative. For example, curl (with any of its various Python bindings) is a pain to get the hang of in the first place compared to requests—but once you do, having it multiplex multiple downloads for you isn't much harder than doing one at a time. But threading is the easiest alternative, especially if you've already written code around greenlets.
* In 2.x and 3.1, it can be a problem to have a single thread doing significant CPU work while background threads are doing I/O. In 3.2+, it works the way it should.

Multi threading or Multi processing for opening pages in BeautifulSoup in python

I have a program that opens a long list of web pages using beautifulsoup, and extracts data from it.
Obviously it's fairly slow since it has to wait for each one to complete. I'd like to make it retrieve more than one at a time to speed it up.
I know that multithreading in Python is often slower than using a single thread.
What would be the best for this? Multithreading or creating multiprocessing?
That's one of the main reasons to use scrapy: scrapy is built on the Twisted library, which makes HTTP calls asynchronous without using multithreading or multiprocessing.
A good point to start from is the excellent scrapy tutorial.
It's also worth noting that multithreading/multiprocessing is usually the right approach for heavy CPU computation in a multi-core environment, but for parallel I/O operations you're better off choosing an asynchronous programming solution rather than having threads wait in blocking operations for some I/O to happen while holding system resources.

Scraping Websites

I have been trying to access some data from a website. I have been using Python's mechanize and beautifulsoup4 packages for this purpose. But since the number of pages that I have to parse is around 100,000 and more, doing it with a single thread doesn't make sense. I tried Python's EventLet package to have some concurrency, but it didn't yield any improvement. Can anyone suggest something else that I can do, or should do, to speed up the data acquisition process?
I am going to quote my own answer to this question since it fits perfectly here as well:
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default), as well as a function you want to run on each unit of work. Then you put every unit of work (in your case this would be a list of URLs) in a list and give it to the worker pool.
Your output will be a list of the return values of your worker function, one for every item of work in your original list. All the cool multiprocessing goodness will happen in the background. There are of course other ways of working with a worker pool, but this is my favourite one.
Happy multi-processing!
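A sketch of that worker-pool shape, with the per-page mechanize/BeautifulSoup work stubbed out (`fetch` is a placeholder for your real scraping function):

```python
import multiprocessing

def fetch(url):
    # placeholder for the real scraping of one page
    return f"data from {url}"

def scrape_all(urls, processes=4):
    # the pool distributes the URL list across worker processes;
    # pool.map returns the results in the original order
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(fetch, urls)

if __name__ == "__main__":
    results = scrape_all(["http://a", "http://b"], processes=2)
```

Note that because pages are mostly I/O-bound, a process pool mainly pays off if the parsing itself is CPU-heavy; otherwise the thread- or async-based approaches above are cheaper.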
