I am unsure what tags to give this, but I am using Selenium in Python, so I decided to start here. I am scraping a website thousands of times using Selenium and requests in Python. It starts fairly quickly, but around the 3400th page load it slows down from around 0.1 seconds per page to 3 or 4 seconds. Any ideas on what is slowing the page loads? The program is being run on a very low-power Linode (1 shared CPU and 1 GB of RAM). The CPU is pegged from the beginning, when it is still running fast, and from what I can tell it is not using all the RAM. I also gave it a 10 GB swap. My internet download and upload speeds are above 200 MB/s. I was thinking the website host themselves are limiting me, but I don't know this stuff well enough to be sure.
Pretty sure it's the host. If they are limiting by your IP, you may want to use some proxies. If the website is on shared hosting or some low-cost hosting, then proxies won't help, because the slowdown is on their end rather than tied to your IP.
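If you do end up testing the per-IP throttling theory, a minimal sketch of rotating requests through proxies might look like the following; the proxy addresses are placeholders for endpoints you would have to supply yourself.

import requests

# Hypothetical proxy endpoints -- replace with proxies you actually control or rent.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

def fetch(url, attempt=0):
    # Rotate through the proxy list so consecutive requests come from different IPs.
    proxy = PROXIES[attempt % len(PROXIES)]
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

If response times stay slow through every proxy, that points back at the site itself (or its hosting) rather than per-IP rate limiting.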
I am getting this warning email from PythonAnywhere on every single request to my website. I am using spaCy and Django and have just upgraded my account. Everything seems to work fine, apart from the warning emails. I have only 2 GB of RAM on my local machine and it can run my app along with a few other apps without any issues. So why is 3 GB of RAM not enough on PythonAnywhere? (I also have 3 GB of disk space on PythonAnywhere, of which only 27% is used.)
I have tried searching for answers on their forum and on the internet in general, but I have not found any clue about the issue.
If your initial requests to the PythonAnywhere web app work fine (i.e., your code successfully allocates, say, 2 GB of RAM and returns a result) and you see the results correctly, but you still receive emails about processes exceeding the RAM limit, then perhaps you have processes that are left hanging around rather than being cleaned up, and they accumulate until they slowly get killed. Can you correlate the number of kill messages with the number of times you hit the web app and get a result? My theory would be corroborated if there are significantly fewer kill messages than hits for that particular model endpoint.
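If you have console access, one way to check for leftover processes is to list everything running under your account along with its resident memory; this is just a sketch and assumes psutil is installed in your environment.

import psutil

# List running processes with their resident memory, to see whether old
# web-app workers are hanging around and accumulating RAM over time.
for proc in psutil.process_iter(["pid", "name", "memory_info"]):
    mem = proc.info["memory_info"]
    if mem is None:  # attribute unavailable (e.g. access denied)
        continue
    rss_mb = mem.rss / (1024 * 1024)
    print(proc.info["pid"], proc.info["name"], f"{rss_mb:.1f} MB")

If the list keeps growing between requests, that supports the leftover-process theory.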
I am trying to do some Python-based web scraping where execution time is pretty critical.
I've tried PhantomJS, Selenium, and PyQt4 now, and all three libraries have given me similar response times. I'd post example code, but my problem affects all three, so I believe the problem lies either in a shared dependency or outside of my code. At around 50 concurrent requests, we see a huge degradation in response time. It takes about 40 seconds to get back all 50 pages, and that time gets exponentially worse with greater page demands. Ideally I'm looking for ~200+ requests in about 10 seconds. I used multiprocessing to spawn each instance of PhantomJS/PyQt4/Selenium, so each URL request gets its own instance and I'm not blocked by single threading.
I don't believe it's a hardware bottleneck: it's running on 32 dedicated CPU cores (64 threads), and CPU usage doesn't typically spike over 10-12%. Bandwidth also sits comfortably at around 40-50% of my total throughput.
I've read about the GIL, which I believe I've addressed by using multiprocessing. Is web scraping just an inherently slow thing? Should I stop expecting to pull 200-ish web pages in ~10 seconds?
My overall question is, what is the best approach to high performance web scraping, where evaluating js on the webpage is a requirement?
"evaluating js on the webpage is a requirement" <- I think this is your problem right here. Simply downloading 50 web pages is fairly trivially parallelized and should only take as long as the slowest server takes to respond.
Now, spawning 50 javascript engines in parallel (which is essentially what I guess it is you are doing) to run the scripts on every page is a different matter. Imagine firing up 50 chrome browsers at the same time.
Anyway: profile and measure the parts of your application to find where the bottleneck lies. Only then can you see whether you're dealing with an I/O bottleneck (sounds unlikely), a CPU bottleneck (more likely), or a global lock somewhere that serializes things (also likely, but impossible to say without any code posted).
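As a rough way to start that measurement, you could time the fetch and the JS-rendering phases separately; this is just a sketch, and render_page is a hypothetical placeholder for whatever PhantomJS/Selenium call you actually make.

import time

def timed(label, func, *args, **kwargs):
    # Wrap any call and report how long it took, to locate the bottleneck.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Example usage (render_page is hypothetical -- substitute your own call):
# html = timed("download", requests.get, url)
# dom = timed("render", render_page, html.text)

If the download numbers are small and the render numbers dominate, the JS engines are the thing to optimize (or avoid where possible).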
I am about to develop a website, and I expect it to get a lot of traffic. It is in Python (Django). I wonder: if my web application uses 2 MB of RAM for one process (e.g., if I run it on my PC directly in a terminal, it consumes 2 MB of RAM), and I get 1000 users on my website at a particular time, will my website need 2000 MB of RAM (2 MB per user * 1000 users)? Does it work that way?
To test that, and see what kind of increase in memory consumption you can expect, open a few incognito Chrome tabs and connect as different users. Then you can see whether memory increases linearly with the number of users.
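To put rough numbers on it, you could also simulate a batch of concurrent users against a dev server and watch the server process's resident memory before and after; this is only a sketch, and the URL, PID, and request counts are placeholders (it also assumes requests and psutil are installed).

import concurrent.futures

import psutil
import requests

URL = "http://localhost:8000/"   # placeholder: your dev server
SERVER_PID = 12345               # placeholder: PID of your Django/WSGI process

def rss_mb():
    return psutil.Process(SERVER_PID).memory_info().rss / (1024 * 1024)

def visit(_):
    return requests.get(URL, timeout=10).status_code

print(f"before: {rss_mb():.1f} MB")
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(visit, range(500)))
print(f"after:  {rss_mb():.1f} MB")

Comparing the before/after numbers for different user counts shows whether memory really grows linearly per user.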
We have recently launched a Django site which, amongst other things, has a screen representing all sorts of data. A request is sent to the server every 10 seconds to get new data. The average response size is 10 KB.
The site serves approximately 30 clients, meaning every client sends a GET request every 10 seconds.
When testing locally, responses came back after about 80 ms. After deployment with ~30 users, responses are taking up to 20 seconds!!
So the initial thought is that my code sucks. I went through all my queries and did everything I can to optimize them and reduce calls to the database (which was hard; nearly everything is something like object.filter(id=num), and my tables have fewer than 5k rows at the moment...).
But then I noticed the same issue occurs in the admin panel, which is clearly optimized and doesn't have my perhaps inefficient code, since I didn't write it. Opening the users tab takes 30 seconds on certain requests!!
So, what is it? Do I argue with the company sysadmins and demand a better server? They say we don't need better hardware (running on a dual-core 2.67 GHz CPU and 4 GB of RAM, which isn't a lot, but still shouldn't be THAT slow).
Doesn't the fact that the admin site is slow imply that this is a hardware issue?
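One cheap check before arguing about hardware: with DEBUG = True, Django records every SQL query and its duration per request, so you can see whether the time is going into the database at all. This is only a sketch to run in a view or in manage.py shell around the slow code path.

# Run with DEBUG = True so Django populates connection.queries.
from django.db import connection, reset_queries

reset_queries()
# ... exercise the slow code path here, e.g. call the view logic or evaluate the queryset ...
total = sum(float(q["time"]) for q in connection.queries)
print(f"{len(connection.queries)} queries took {total:.3f}s")

If the query totals are small but responses still take many seconds, the problem is more likely in the server setup (workers, WSGI configuration, or the hardware) than in your ORM code.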
I'm using scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important.
Unfortunately, as I've profiled scrapy's speed, I'm only getting a couple pages per second. Really, about 2 pages per second on average. I've previously written my own multithreaded spiders to do hundreds of pages per second -- I thought for sure scrapy's use of twisted, etc. would be capable of similar magic.
How do I speed scrapy up? I really like the framework, but this performance issue could be a deal-breaker for me.
Here's the relevant part of the settings.py file. Is there some important setting I've missed?
LOG_ENABLED = False
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_IP = 8
A few parameters:
Using scrapy version 0.14
The project is deployed on an EC2 large instance, so there should be plenty of memory, CPU, and bandwidth to play with.
I'm scheduling crawls using the JSON protocol, keeping the crawler topped up with a few dozen concurrent crawls at any given time.
As I said at the beginning, I'm downloading pages from many sites, so remote server performance and CONCURRENT_REQUESTS_PER_IP shouldn't be a worry.
For the moment, I'm doing very little post-processing. No xpath; no regex; I'm just saving the url and a few basic statistics for each page. (This will change later once I get the basic performance kinks worked out.)
I had this problem in the past...
And a large part of it I solved with a dirty old trick.
Run a local caching DNS server.
Most of the time, when you see this kind of high CPU usage while hitting many remote sites simultaneously, it is because scrapy is busy resolving the URLs.
And please remember to change the DNS settings on your host (/etc/resolv.conf) to point at your LOCAL caching DNS server.
The first requests will be slow, but as soon as it starts caching and resolving more efficiently, you are going to see HUGE improvements.
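To confirm that resolution (rather than downloading) is where the time goes, a tiny check like the one below, run before and after setting up the cache, can help; the domains are placeholders for ones you actually crawl.

import socket
import time

# Time raw DNS lookups for a handful of the domains being crawled.
domains = ["example.com", "example.org", "example.net"]  # placeholders

for domain in domains:
    start = time.perf_counter()
    socket.getaddrinfo(domain, 80)
    print(f"{domain}: {(time.perf_counter() - start) * 1000:.1f} ms")

Once the local cache is warm, repeated lookups should come back dramatically faster.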
I hope this helps with your problem!