I am making HEAD requests on anywhere between 100,000 and 500,000 URLs to return the size and the HTTP status code. I have tried four different methods: a thread pool, an asynchronous Twisted client, a grequests implementation, and a concurrent.futures based solution. In a previous question similar to this one, the thread pool implementation is said to finish in 6 to 10 minutes. Trying the exact code and feeding it a dummy list of 100,000 URLs takes over 4 hours on my machine. My Twisted solution (different from the one mentioned in the linked question) similarly takes around 3.5 hours to complete, as does the concurrent.futures solution.
I am relatively confident I have written the implementations correctly, especially given that one of them was copied and pasted from a previous example. How can I diagnose where the slowdown is occurring? My guess is that it is when making the connection, but I have no idea how to prove this or fix it if it is the problem. I am pretty certain it is not a CPU-bound problem, as the CPU time after 100,000 URLs is only 3 minutes. Any help in figuring out how to diagnose the issue, and in turn fixing it, would be greatly appreciated. A rough per-URL timing sketch follows the list below.
Some More Information:
- Using Requests to make the requests, or treq with Twisted.
- Appending the results to a list (with the garbage collector disabled) or to a pandas DataFrame does not seem to make a speed difference.
- I have experimented with anywhere between 4 and 200 workers/threads in my various tests, and 15 seems to be optimal.
- The machine I am using has 16 cores and a high-speed (100 Mbps) internet connection.
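To narrow down where the per-URL time goes, something like the following might help (a minimal sketch, assuming a list named urls; it times DNS resolution, the bare TCP connect, and the full HEAD request separately for a small sample):

import socket
import time
from urllib.parse import urlparse

import requests

def profile_url(url):
    # Split the per-URL time into DNS, TCP connect, and HEAD request phases.
    parsed = urlparse(url)
    host = parsed.hostname
    port = parsed.port or (443 if parsed.scheme == "https" else 80)

    t0 = time.perf_counter()
    socket.getaddrinfo(host, port)                      # DNS resolution only
    t1 = time.perf_counter()
    with socket.create_connection((host, port), timeout=10):
        pass                                            # bare TCP connect
    t2 = time.perf_counter()
    resp = requests.head(url, timeout=10, allow_redirects=True)
    t3 = time.perf_counter()

    return {
        "dns": t1 - t0,
        "connect": t2 - t1,
        "head": t3 - t2,
        "status": resp.status_code,
        "size": resp.headers.get("Content-Length"),
    }

for url in urls[:20]:  # profile a small sample before running the full job
    print(url, profile_url(url))

If the DNS or connect phases dominate, the workers are spending most of their time on lookups and handshakes rather than on the transfer itself, which would support the "it's the connection" guess.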
Related
I have a Django app with a Celery instance that consumes and synchronizes a very large amount of data multiple times a day. I'll note that I am using asyncio to call a library for an API that wasn't made for async. I've noticed that after a week or so the server becomes painfully slow and can even be days behind in tasks after a few weeks.
Looking at my host's profiler, the RAM and CPU usage aren't going wild, but I know it's becoming slower and slower every week, because that Celery instance also handles emails at a specific time, and those go out hours and hours later as the weeks pass.
Restarting the instance seems to fix everything instantly, leading me to believe I have something like a memory leak (but the RAM isn't going wild) or something like unclosed threads (I have no idea how to detect this, and the CPU isn't going wild).
Any ideas?
This sounds like a very familiar issue with Celery which is still open on GitHub - here
We are experiencing similar issues and unfortunately didn't find a good workaround.
It seems that this comment found the issue, but we didn't have time to find and implement a workaround, so I can't say for sure - please update if you find something that helps solve it. As this is open source, no one is responsible for making a fix but the community itself :)
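For the "unclosed threads" suspicion specifically, one thing you could try (just a sketch, not a known fix for the linked issue) is to periodically log the live thread count and resident memory from inside the worker and watch whether either number climbs week over week:

import logging
import resource
import threading

logger = logging.getLogger(__name__)

def log_process_health():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    logger.info("live threads=%d max_rss=%d", threading.active_count(), rss)

Calling this from a small periodic task every few minutes gives you a trend line to compare against the slowdown, which is more telling than a single snapshot from the host's profiler.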
Using urllib.request.urlopen("http://google.com") or requests.get("http://google.com") results in an extended delay (~1 minute or more) before getting a response.
Hey everyone,
I am trying to do some web scraping using some code that relies on urllib. Things were going well yesterday, but today I'm getting significant time lags. I've narrowed it down to urllib and reproduced the problem with requests.get. Basically, when I run the code below, it takes roughly 1 minute to get a response. This was not happening yesterday. The response is good, but I am just not aware of what is happening in the backend to cause the delay. Any suggestions on how to debug, or do you all have an idea of what could be going on?
Thanks in advance.
My OS: Ubuntu 18.04
import urllib.request

response = urllib.request.urlopen('http://google.com')
print(response)
I get the result I am looking for, but the problem I'm running into is that it takes over 1 minute of load time...
There may be a few explanations:
1. You have a slow connection due to a service provider issue.
2. You have too many tasks running / CPU overload (though I'm not sure how slow your computer is).
Try restarting your computer and restarting your Wi-Fi. It may be a hardware issue, and there are many possible solutions. I do not believe that it is a problem with your code, because storing a response in a variable is not at all a demanding task. I suggest you follow these steps and give me an update on whether they work :)
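One way to narrow this down (a rough sketch; the delay may or may not be in name resolution) is to time the DNS lookup and the HTTP request separately:

import socket
import time
import urllib.request

t0 = time.perf_counter()
socket.getaddrinfo("google.com", 80)          # DNS resolution only
t1 = time.perf_counter()
response = urllib.request.urlopen("http://google.com", timeout=120)
t2 = time.perf_counter()

print(f"DNS: {t1 - t0:.1f}s, request: {t2 - t1:.1f}s, status: {response.status}")

If the first number dominates, the lag is in name resolution (a resolver timing out is a common culprit for fixed-length stalls) rather than in urllib or requests themselves.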
I am trying to do some python based web scraping where execution time is pretty critical.
I've tried phantomjs, selenium, and pyqt4 now, and all three libraries have given me similar response times. I'd post example code, but my problem affects all three, so I believe the problem lies either in a shared dependency or outside of my code. At around 50 concurrent requests, we see a huge degradation in response time. It takes about 40 seconds to get back all 50 pages, and that time gets exponentially worse with greater page demands. Ideally I'm looking for ~200+ requests in about 10 seconds. I used multiprocessing to spawn each instance of phantomjs/pyqt4/selenium, so each url request gets its own instance and I'm not blocked by single threading.
I don't believe it's a hardware bottleneck: it's running on 32 dedicated CPU cores (64 threads), and CPU usage doesn't typically spike over 10-12%. Bandwidth sits reasonably comfortably at around 40-50% of my total throughput.
I've read about the GIL, which I believe I've addressed by using multiprocessing. Is web scraping just an inherently slow thing? Should I stop expecting to pull 200-ish web pages in ~10 seconds?
My overall question is, what is the best approach to high performance web scraping, where evaluating js on the webpage is a requirement?
"evaluating js on the webpage is a requirement" <- I think this is your problem right here. Simply downloading 50 web pages is fairly trivially parallelized and should only take as long as the slowest server takes to respond.
Now, spawning 50 JavaScript engines in parallel (which is essentially what I guess you are doing) to run the scripts on every page is a different matter. Imagine firing up 50 Chrome browsers at the same time.
Anyway: profile and measure the parts of your application to find where the bottleneck lies. Only then can you see whether you're dealing with an I/O bottleneck (sounds unlikely), a CPU bottleneck (more likely), or a global lock somewhere that serializes things (also likely, but impossible to say without any code posted).
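To separate fetch time from rendering time, a quick baseline along these lines may help (a sketch, assuming a list urls of the ~50 pages): fetch the raw HTML in parallel without any JS engine and compare the wall-clock time against the phantomjs/selenium pipeline.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    # Plain HTTP fetch, no JavaScript evaluation.
    return requests.get(url, timeout=15).status_code

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=50) as pool:
    statuses = list(pool.map(fetch, urls))
print(f"Fetched {len(statuses)} pages in {time.perf_counter() - start:.1f}s")

If this finishes in a few seconds while the rendering pipeline needs ~40 seconds, the bottleneck is the 50 parallel JS engines, not the downloads.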
I have a function I'm calling with multiprocessing.Pool
Like this:
from multiprocessing import Pool

def ingest_item(id):
    # goes and does a lot of network calls
    # adds a bunch to a remote db
    return None

if __name__ == '__main__':
    p = Pool(12)
    thing_ids = range(1000000)
    p.map(ingest_item, thing_ids)
The list pool.map is iterating over contains around 1 million items; for each ingest_item() call it will call 3rd-party services and add data to a remote PostgreSQL database.
On a 12-core machine this processes ~1,000 pool.map items in 24 hours. CPU and RAM usage are low.
How can I make this faster?
Would switching to Threads make sense as the bottleneck seems to be network calls?
Thanks in advance!
First: remember that you are performing a network task. You should expect your CPU and RAM usage to be low, because the network is orders of magnitude slower than your 12-core machine.
That said, it's wasteful to have one process per request. If you start running into issues from spawning too many processes, you might try pycurl, as suggested here: Library or tool to download multiple files in parallel
This pycurl example looks very similar to your task https://github.com/pycurl/pycurl/blob/master/examples/retriever-multi.py
It is unlikely that using threads will substantially improve performance. This is because, no matter how much you break up the task, all requests still have to go through the network.
To improve performance you might want to see if the 3rd party services have some kind of bulk request API with better performance.
If your workload permits it, you could attempt some kind of caching. However, from your explanation of the task it sounds like that would have little effect, since you're primarily sending data, not requesting it. You could also consider caching open connections (if you aren't already doing so); this helps avoid the very slow TCP handshake. This type of caching is often used in web browsers (e.g. Chrome).
Disclaimer: I have no Python experience
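As a concrete illustration of the connection-reuse point above (a sketch under assumptions, not a drop-in replacement for the original ingest_item; the API host and endpoint here are hypothetical): give each worker thread its own requests.Session so the TCP/TLS handshake is paid once per thread rather than once per item.

import threading
from concurrent.futures import ThreadPoolExecutor

import requests

thread_local = threading.local()

def get_session():
    # One Session per thread; a Session keeps connections alive between calls.
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def ingest_item(thing_id):
    session = get_session()
    # Hypothetical endpoint standing in for the real 3rd-party calls
    # and the remote PostgreSQL insert.
    session.get(f"https://api.example.com/things/{thing_id}", timeout=30)

if __name__ == "__main__":
    thing_ids = range(1_000_000)
    with ThreadPoolExecutor(max_workers=50) as pool:
        # Consume the iterator so any exceptions surface.
        list(pool.map(ingest_item, thing_ids))

Whether threads beat 12 processes here depends on how much of each item is spent waiting on the network; the point of the sketch is the connection reuse, not the exact worker count.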
Will running multiple processes that all make HTTP requests be notably faster than one?
I'm parsing about a million urls using lxml.html.parse
At first, I ran a Python process that simply looped through the urls and called lxml.html.parse(myUrl) on each, and waited for the rest of the method to deal with the data before doing so again. This way, I was able to process on the order of 10000 urls/hour.
I imagined that if I ran a few identical processes (dealing with different sets of urls), I would speed up the rate at which I could fetch these urls. Surprisingly (to me at least), I measured about 10,400 urls/hour this time, which isn't notably better, considering I'm sure both figures were fluctuating dramatically.
My question is: why isn't running three of these processes much faster than one?
I know for a fact that my requests aren't meaningfully affecting their target in any way, so I don't think it's them. Do I not have enough bandwidth to make these extra processes worthwhile? If not, how can I measure this? Am I totally misunderstanding how my MacBook is running these processes? (I'm assuming they run concurrently on different cores, or something roughly equivalent to that.) Something else entirely?
(Apologies if I mangled any web terminology -- I'm new to this kind of stuff. Corrections are appreciated.)
Note: I imagine that running these processes on three different servers would probably be about 3x as fast. (Is that correct?) I'm not interested in that - worst case, 10,000/hour is sufficient for my purposes.
Edit: from speedtest.net (two runs):
With 3 running:
Ping: 29 ms (25 ms)
Download speed: 6.63 Mbps (7.47 Mbps)
Upload speed: 3.02 Mbps (3.32 Mbps)
With all paused:
Ping: 26 ms (28 ms)
Download speed: 9.32 Mbps (8.82 Mbps)
Upload speed: 5.15 Mbps (6.56 Mbps)
Considering you have roughly 7 Mbit/s (1 MB/s, counting high) and you get 2.888 pages per second (10,400 pages per hour), I'd say you're maxing out your connection speed (especially if you're running ADSL or WiFi - you're hammering it with TCP connection handshakes for sure).
That works out to roughly 354 kB of data per page across your processes, which isn't half bad considering that's close to the limit of your bandwidth.
Take into account the TCP headers and everything that happens when you actually establish a connection (SYN, ACK, etc.), and you're running at a decent speed, to be honest.
Note: this only accounts for the download rate, which is much higher than your upload speed - also important, since that's what actually transmits your connection requests, headers to the web server, and so on. And I know most 3G modems and ADSL lines claim to be "full duplex", but they really aren't (especially ADSL). You'll never get full speed in both directions, despite what your ISP tells you. If you want to achieve such tasks you need to switch to fiber optics.
P.S. I assume you understand the basic difference between a megabit and a megabyte.
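To put a number on "am I maxing out my connection?" (a rough sketch, assuming a list urls from the scraping job): measure the actual bytes fetched per second over a sample and compare against the speedtest figures above (~7-9 Mbit/s, i.e. roughly 0.9-1.1 MB/s).

import time
import urllib.request

sample = urls[:100]
start = time.perf_counter()
total_bytes = 0
for url in sample:
    with urllib.request.urlopen(url, timeout=30) as resp:
        total_bytes += len(resp.read())
elapsed = time.perf_counter() - start
print(f"{total_bytes / elapsed / 1e6:.2f} MB/s over {len(sample)} urls")

If the measured rate already sits near the speedtest download figure, extra processes won't help, because the pipe is full.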