How can I process large responses using aiohttp in Python?

I have a simple script that executes asynchronous requests using aiohttp, and a response handler that runs a loop to search for some text in each response. The problem is that when I have a lot of URLs with large responses, this loop slows down the process and blocks the other tasks. How can I solve this?
One solution I tried was to save all of the responses and run the loop at the end, but that uses a lot of memory.
Here is the simple loop that I am using:
for key, value in _dict.items():              # iterate over pairs, not just keys
    if value in response.text:                # with aiohttp this is `await response.text()`
        VALID.add(str(response.request.url))  # aiohttp exposes this as str(response.url)
I also tried it with a set comprehension, but it still blocks the script. I need a way to execute it faster, or a better way to do it.
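One approach, as a rough sketch rather than a definitive answer: process each response as it arrives and push the scan into a worker thread with asyncio.to_thread, so the event loop is never blocked and nothing has to be buffered until the end. The URL list and the contents of _dict below are placeholders.

import asyncio
import aiohttp

_dict = {"label_a": "needle one", "label_b": "needle two"}   # placeholder search terms
VALID = set()

def scan(text, url):
    # The original loop, unchanged, but run in a worker thread so it cannot
    # block the event loop while it walks a large response body.
    for key, value in _dict.items():
        if value in text:
            VALID.add(url)

async def handle(session, url):
    async with session.get(url) as response:
        text = await response.text()
    await asyncio.to_thread(scan, text, url)   # Python 3.9+

async def main(urls):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(handle(session, url) for url in urls))

asyncio.run(main(["https://example.com"]))   # placeholder URL list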

Related

Python Webscraping: grequests vs. multi-threaded requests?

I'm trying to make a web scraper I'm writing in Python faster.
Currently I fire up a set number of scraper threads, create a queue with the list of URLs I want to scrape, and let the threads dequeue entries and scrape them.
grequests states that it's asynchronous, but I'm not sure quite what that means beyond firing multiple threads (like I'm doing) and using gevent to trigger an event when they finish.
Does grequests do anything beyond creating a thread per job, and will it actually run more quickly than the program I've outlined above?
Check this out:
https://adl1995.github.io/a-comparison-of-response-times-using-urllib-grequests-and-asyncio.html
TL;DR:
"Using aiohttp with asyncio seems to be the best option. Its response time is almost 50% less than grequests."

Running a Constant Number of Asynchronous Tasks at the Same Time Using the Python Asyncio Library

I have a program where I need to make a large number of URL requests. I cannot make all requests at the same time because there are always new URLs being added to the queue. Neither can I run them synchronously because some requests take a very long time to finish which would slow down the program. What I think would be best is to make sure a specific number of asynchronous tasks are running at the same time by launching new tasks whenever a task is completed.
The problem is that I have not found any way to use the asyncio library other than building a large array of tasks and awaiting them all. This is problematic because there are always a couple of requests that get stuck, which causes the program to hang at the await.
How would I solve this problem?
You could use asyncio.wait_for() to put a timeout on each task, or use the async-timeout library: https://github.com/aio-libs/async-timeout
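The suggestion above covers the timeout side; one common way to also keep a fixed cap on how many requests run at once is an asyncio.Semaphore. A rough sketch combining the two, where the URL list, concurrency limit, and timeout are assumptions:

import asyncio
import aiohttp

CONCURRENCY = 10   # assumed cap on simultaneous requests
TIMEOUT = 30       # assumed per-request timeout in seconds

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def bounded_fetch(semaphore, session, url):
    async with semaphore:   # at most CONCURRENCY of these run at the same time
        try:
            # A stuck request gets cancelled after TIMEOUT seconds instead of
            # holding up the final gather forever.
            return await asyncio.wait_for(fetch(session, url), timeout=TIMEOUT)
        except asyncio.TimeoutError:
            return None

async def main(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(bounded_fetch(semaphore, session, url) for url in urls)
        )

results = asyncio.run(main(["https://example.com"] * 50))   # placeholder URLs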

Flask: spawning a single async sub-task within a request

I have seen a few variants of my question but not quite exactly what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that for each request inserts some data in a store and, consequently, kicks off an indexing job. The indexing is 2-4 times longer than the main data write and I would like to do that asynchronously to reduce the response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this, that is as resource-efficient as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route but my bottleneck is I/O, so threads could be more efficient.
Using threads? I am not sure how, or whether, that can be done. All I know is how to launch multiple tasks concurrently with ThreadPoolExecutor, but here I only need to spawn one additional task and return immediately without waiting for its result.
asyncio? This too I am not sure how to apply to my situation, since asyncio always has a blocking call somewhere.
Launching data write and indexing concurrently: not doable. I have to wait for a response from the data write to launch indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet; this is exactly what it is for.
If you need to introduce a dependency, that's not a bad thing in itself, as long as you don't take on dependencies you don't need.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
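For illustration, a minimal Celery setup along the lines of this suggestion might look like the sketch below; the Redis broker URL and the task body are assumptions.

from celery import Celery

# Assumes a Redis broker on the default local port; RabbitMQ works the same way.
celery_app = Celery("tasks", broker="redis://localhost:6379/0")

@celery_app.task
def index_record(record_id):
    # The slow indexing job runs in a Celery worker process, not in the request.
    pass

# Inside the Flask view, after the data write has returned a record id:
#     index_record.delay(record_id)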
I actually should have perused the Python manual section on concurrency better: the threading module does just what I needed: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread gets completed even after the Flask request is completed.
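A rough sketch of that pattern, with write_to_store and index_record as hypothetical stand-ins for the data write and the indexing job:

import threading
from flask import Flask, request

app = Flask(__name__)

def write_to_store(payload):
    # Hypothetical stand-in for the synchronous data write; returns a record id.
    return 42

def index_record(record_id):
    # Hypothetical stand-in for the 2-4x longer indexing job.
    pass

@app.route("/records", methods=["POST"])
def create_record():
    record_id = write_to_store(request.get_json())
    # Fire and forget: the response goes out while indexing finishes in the background.
    threading.Thread(target=index_record, args=(record_id,)).start()
    return {"id": record_id}, 201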

Best way to check download duration of a file from multiple URLs in Python (threading or async)?

What is the best way to check the download duration of a file from around 50 URLs? I would like to use my entire bandwidth when downloading each file. Should I use multithreading, coroutines, or just the plain old synchronous way, and why?
This is the code I use to check the download duration from a single URL:
import urllib.request   # in Python 3, urlopen lives in urllib.request
import time

start = time.time()
with urllib.request.urlopen('http://example.com/file') as response:
    data = response.read()
end = time.time()
duration = end - start
Multithreading and coroutines in Python are still restricted by the Global Interpreter Lock (GIL), which lets only one thread execute Python bytecode at a time. So if the work is a pure calculation with no waiting on input or output, threads and coroutines won't actually run it in parallel. Your threads, however, spend almost all of their time waiting on the downloads, so the work is I/O bound.
Because the downloads are completely I/O bound, multithreading and coroutines should both work fine. If you're concerned about overhead, just compare the results of the two versions.
If you really are just throwing away the downloaded data from large files, consider streaming the response (for example with the requests library's iter_content method) to avoid holding more data in memory than you need.
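A sketch that combines both suggestions, timing streamed downloads from a thread pool; the requests dependency, worker count, and URL list are assumptions.

import time
from concurrent.futures import ThreadPoolExecutor

import requests   # assumed HTTP client; urllib.request would work the same way

def timed_download(url):
    start = time.time()
    with requests.get(url, stream=True) as response:
        # Stream in chunks so a large file is never held fully in memory.
        for _ in response.iter_content(chunk_size=1 << 16):
            pass
    return url, time.time() - start

urls = ["http://example.com/file"] * 50   # placeholder URL list

with ThreadPoolExecutor(max_workers=10) as pool:
    for url, duration in pool.map(timed_download, urls):
        print(f"{url}: {duration:.2f}s")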

Scraping Websites

I have been trying to access some data from a website, using Python's mechanize and beautifulsoup4 packages. But since the number of pages I have to parse is around 100,000 or more, doing it with a single thread doesn't make sense. I tried Python's Eventlet package to get some concurrency, but it didn't yield any improvement. Can anyone suggest something else that I can do, or should do, to speed up the data acquisition process?
I am going to quote my own answer to this question since it fits perfectly here as well:
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you put every unit of work (in your case, a URL) into a list and hand that list to the worker pool.
Your output will be a list of your worker function's return values, one for every item in your original list. All the cool multiprocessing goodness happens in the background. There are of course other ways of working with a worker pool, but this is my favourite one.
Happy multi-processing!
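A minimal sketch of that worker-pool pattern; the scrape function body and the URL list are placeholders, and requests stands in for whatever fetching and parsing you already do with mechanize and beautifulsoup4.

from multiprocessing import Pool

import requests   # assumed fetcher; your mechanize/beautifulsoup4 code would slot in here

def scrape(url):
    # One unit of work: fetch a page and return whatever you extract from it.
    response = requests.get(url, timeout=30)
    return url, len(response.text)

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(100)]   # placeholder URLs
    with Pool() as pool:   # defaults to one process per CPU core
        results = pool.map(scrape, urls)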
