Will running multiple processes that all make HTTP requests be notably faster than one?
I'm parsing about a million URLs using lxml.html.parse.
At first, I ran a single Python process that looped through the URLs, called lxml.html.parse(myUrl) on each, and waited for the rest of the method to finish dealing with the data before moving on to the next one. This way, I was able to process on the order of 10,000 URLs/hour.
I imagined that if I ran a few identical processes (each dealing with a different set of URLs), I would speed up the overall rate at which I could fetch these URLs. Surprisingly (to me at least), I measured about 10,400 URLs/hour this time, which isn't notably better, especially since I'm sure both figures were fluctuating dramatically.
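For reference, the two setups boil down to something like this (a rough sketch; handle() stands in for the real per-URL processing, and the parallel version is roughly equivalent to launching the script several times with different URL lists):

```python
import lxml.html
from multiprocessing import Process

def handle(doc):
    pass  # stand-in for the real per-URL work on the parsed tree

def crawl(urls):
    # single-process version: fetch + parse one URL at a time
    for url in urls:
        handle(lxml.html.parse(url))

def crawl_in_parallel(url_chunks):
    # several processes, each looping over its own slice of the URL list
    procs = [Process(target=crawl, args=(chunk,)) for chunk in url_chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```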
My question is: why isn't running three of these processes much faster than one?
I know for a fact that my requests aren't meaningfully affecting their target in any way, so I don't think the bottleneck is on their end. Do I not have enough bandwidth to make these extra processes worthwhile? If not, how can I measure this? Am I totally misunderstanding how my MacBook is running these processes? (I'm assuming it runs them concurrently on different cores, or something roughly equivalent to that.) Something else entirely?
(Apologies if I mangled any web terminology -- I'm new to this kind of stuff. Corrections are appreciated.)
Note: I imagine that running these processes on three different servers would probably be about 3x as fast. (That correct?) I'm not interested in that -- worst case, 10000/hour is sufficient for my purposes.
Edit: from speedtest.net (twice):
With 3 running:
Ping: 29 ms (25 ms)
Download speed: 6.63 Mbps (7.47 Mbps)
Upload speed: 3.02 Mbps (3.32 Mbps)
With all paused:
Ping: 26 ms (28 ms)
Download speed: 9.32 Mbps (8.82 Mbps)
Upload speed: 5.15 Mbps (6.56 Mbps)
Considering you have roughly 7 Mbit/s (about 1 MB/s, counting high) and you're getting 2.888 pages per second (10,400 pages per hour), I'd say you're maxing out your connection speed (especially if you're on ADSL or WiFi; you're certainly hammering it with TCP connection handshakes).
You're downloading roughly 354 kB of data per page across your processes, which isn't half bad considering that's close to the limit of your bandwidth.
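A quick back-of-envelope check of those figures (just restating the arithmetic above in Python, assuming roughly 1 MB/s):

```python
# Back-of-envelope: bytes available per fetched page at ~1 MB/s
pages_per_hour = 10400
pages_per_second = pages_per_hour / 3600.0     # ~2.889 pages/s
bandwidth_bytes_per_second = 1024 * 1024       # ~1 MB/s, counting high
bytes_per_page = bandwidth_bytes_per_second / pages_per_second
print(round(pages_per_second, 3), int(bytes_per_page // 1024))  # 2.889, 354
```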
Taking into account TCP headers and everything that happens when you actually establish a connection (SYN, ACK, etc.), you're running at a decent speed, honestly.
Note: this only considers the download rate, which is much higher than your upload speed. Upload matters too, since it carries your connection requests, headers to the web server, and so on. And while most 3G modems and ADSL lines claim to be "full duplex", they really aren't (especially ADSL); you'll never get full speed in both directions at once, despite what your ISP tells you. If you want that, you need to switch to fiber optics.
PS: I assume you understand the basic difference between a megabit and a megabyte.
My small AWS EC2 instance runs two Python scripts: one receives JSON messages over a WebSocket (~2 msg/ms) and writes them to a CSV file, and the other compresses and uploads the CSVs. After testing, the data recorded by the EC2 instance (~2.4 GB/day) is sparser than what I record on my own computer (~5 GB/day). Monitoring shows the EC2 instance has consumed all its CPU credits and is running at baseline performance. My question is: does the instance drop messages because it cannot write them fast enough?
Thank you to anyone that can provide any insight!
It depends on the WebSocket server.
If your first script cannot keep up with the rate at which messages are generated on the server side, the TCP receive buffer will fill up and the server will slow down its sending. Assuming a near-constant message production rate, unprocessed messages will pile up on the server, and the server could be coded either to let them accumulate or to eventually drop them.
Even if the server never dropped a message, without enough computational power your instance would never catch up: on 8/15 it could still be processing messages from 8/10, so an instance upgrade would be needed.
Does the data rate vary greatly throughout the day (e.g. many more messages in the evening rush around 20:00)? If so, data loss may have occurred during that period.
But is Python really that slow? 5 GB/day is less than 100 KB per second, and even a fraction of one modern CPU core can easily handle that. Perhaps you should stress-test your scripts and optimize them (reduce small disk writes, etc.).
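For instance, a minimal sketch of batching the CSV writes instead of writing one row per message (BATCH_SIZE and the class name are illustrative, not taken from the original scripts):

```python
import csv

BATCH_SIZE = 1000  # illustrative; tune to the message rate

class BatchedCsvWriter:
    """Buffer rows in memory and append them to disk in large chunks."""
    def __init__(self, path):
        self.path = path
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerows(self.rows)  # one large write instead of many small ones
        self.rows = []
```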
I am trying to do some Python-based web scraping where execution time is pretty critical.
I've tried phantomjs, selenium, and pyqt4, and all three libraries give me similar response times. I'd post example code, but my problem affects all three, so I believe it lies either in a shared dependency or outside my code. At around 50 concurrent requests, I see a huge degradation in response time: it takes about 40 seconds to get back all 50 pages, and the total time grows much worse than linearly as I request more pages. Ideally I'm looking for ~200+ requests in about 10 seconds. I used multiprocessing to spawn each instance of phantomjs/pyqt4/selenium, so each URL request gets its own instance and I'm not blocked by single threading.
I don't believe it's a hardware bottleneck: it's running on 32 dedicated CPU cores (64 threads total), and CPU usage doesn't typically spike above 10-12%. Bandwidth also sits comfortably at around 40-50% of my total throughput.
I've read about the GIL, which I believe I've addressed by using multiprocessing. Is web scraping just inherently slow? Should I stop expecting to pull ~200 web pages in ~10 seconds?
My overall question is, what is the best approach to high performance web scraping, where evaluating js on the webpage is a requirement?
"evaluating js on the webpage is a requirement" <- I think this is your problem right here. Simply downloading 50 web pages is fairly trivially parallelized and should only take as long as the slowest server takes to respond.
Now, spawning 50 javascript engines in parallel (which is essentially what I guess it is you are doing) to run the scripts on every page is a different matter. Imagine firing up 50 chrome browsers at the same time.
Anyway: profile and measure the parts of your application to find where the bottleneck lies. Only then can you see whether you're dealing with an I/O bottleneck (sounds unlikely), a CPU bottleneck (more likely), or a global lock somewhere that serializes everything (also likely, but impossible to say without any code posted).
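For example, a minimal way to profile one worker's fetch-and-render step in isolation (a sketch; fetch_and_render is a placeholder for whatever phantomjs/selenium/pyqt4 call you actually make, shown here as a plain fetch):

```python
import cProfile
import pstats
import urllib.request

def fetch_and_render(url):
    # placeholder for the real phantomjs/selenium/pyqt4 call
    return urllib.request.urlopen(url, timeout=10).read()

cProfile.run('fetch_and_render("http://example.com")', "scrape.prof")
pstats.Stats("scrape.prof").sort_stats("cumulative").print_stats(20)  # top 20 by cumulative time
```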
For my bachelor's thesis I wrote a program that is distributed over many servers and exchanges messages via IPv6 multicast and unicast. The network usage is relatively high, but I don't think it is too high: in my test with 15 servers there are 2 requests every second, which work like this:
Server 1 requests information from servers 3-15 via multicast. Each of servers 3-15 must respond. If a response is still missing after 0.5 seconds, the multicast is resent, but only the missing servers must respond (so in most cases this is only one server).
Server 2 does exactly the same. If results are still missing after 5 retries, the missing servers are marked as dead and the change is synced with the other management server (1/2).
So there are 2 multicasts and 26 unicasts every second. I don't think this should be too much?
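In pseudocode, the retry logic looks roughly like this (a sketch; send_multicast and collect_responses stand in for the real IPv6 multicast/unicast plumbing):

```python
def request_with_retries(expected_servers, send_multicast, collect_responses,
                         timeout=0.5, max_retries=5):
    # collect_responses(timeout) returns the set of servers that answered in time
    missing = set(expected_servers)
    for _ in range(max_retries):
        send_multicast(missing)                # only the missing servers must answer
        missing -= collect_responses(timeout)  # wait up to 0.5 s for replies
        if not missing:
            return set()
    return missing  # these get marked as dead and synced to the other management server
```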
Servers 1 and 2 run Python web servers, which I use to trigger the request every second on each server (via a web client).
The whole scenario runs in a Mininet environment inside a VirtualBox Ubuntu VM with 2 cores (max 2.8 GHz) and 1 GB of RAM. While running the test, I see via htop that the CPUs are at 100% while RAM is at 50%, so the CPU is the bottleneck here.
I noticed that after 2-5 minutes (1 minute = 60 * (2+26) messages = 1,680 messages) there are too many missing results, causing too many resends while new requests are already coming in, so the "management server" thinks the client servers (3-15) are down and deregisters them. After syncing this with the other management server, all client servers are marked as dead on both management servers, which is not true...
I am wondering whether the problem could be my debug output. I print 3-5 lines for every message that is sent or received, so that is roughly (assuming 5 lines per sent/received message) (26 + 2) * 5 = 140 lines printed to the console per second.
I use Python 2.6 for the servers.
So the question here is: can console output slow down the whole system so much that simple requests take more than 0.5 seconds to complete, 5 times in a row? The request processing in my test is simple, with no complex calculations; it is basically something like return request_param in ["bla", "blaaaa", ...] (a small list of 5 items).
If yes, how can I disable the output completely without having to comment out every print statement? Or is there even a way to output only lines that contain "Error" or "Warning"? (Not via grep, because by the time grep gets involved the prints have already executed; I mean directly in Python.)
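For example, I imagine the prints could be replaced with the standard logging module so I can filter by level (just a sketch, not my actual code; the logger name is arbitrary), but I'm not sure whether that's the right approach:

```python
import logging

# Only WARNING and above reach the console; debug chatter is suppressed.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("my_server")

log.debug("sent multicast request to servers 3-15")      # suppressed
log.warning("missing response from server 7, retrying")  # printed
```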
What else could make my application that slow? I know this is a very generic question, but maybe someone already has experience with Mininet and network applications...
I finally found the real problem. It was not the prints (removing them improved performance a bit, but not significantly) but a thread that was using a shared lock. The lock was contended across multiple CPU cores, which made the whole thing very slow.
It even got slower the more cores I added to the VM, which was very strange...
Now the new bottleneck seems to be the APScheduler... I always get messages like "event missed" because there is too much load on the scheduler. So that's the next thing to speed up... :)
I am making HEAD requests against anywhere between 100,000 and 500,000 URLs to get back the size and HTTP status code of each. I have tried four different methods: a threadpool, an asynchronous Twisted client, a grequests implementation, and a concurrent.futures-based solution. In a previous question similar to this one, the threadpool implementation is said to finish in 6 to 10 minutes. Running that exact code on a dummy list of 100,000 URLs takes over 4 hours on my machine. My Twisted solution (different from the one mentioned in the linked question) similarly takes around 3.5 hours to complete, and the same goes for the concurrent.futures solution.
I am relatively confident I have written the implementations correctly, especially where I copy-pasted the code from a previous example. How can I diagnose where the slowdown is occurring? My guess is that it happens when making the connection, but I have no idea how to prove that, or how to fix it if it is the problem. I am pretty certain it is not a CPU-bound problem, as the CPU time after 100,000 URLs is only 3 minutes. Any help in figuring out how to diagnose the issue, and in turn fix it, would be greatly appreciated.
Some more information:
Using Requests to make the requests, or treq with Twisted.
Appending the results to a list (with the garbage collector disabled) or to a pandas DataFrame does not seem to make a speed difference.
I have experimented with anywhere between 4 and 200 workers/threads in my various tests, and 15 seems to be optimal.
The machine I am using has 16 cores and a high-speed (100 Mbps) internet connection.
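For reference, the concurrent.futures variant boils down to something like this (a sketch; the worker count and timeout are illustrative, and the real code also stores the results rather than just returning them):

```python
import concurrent.futures
import requests

def head(url):
    # one HEAD request; returns (url, status code, Content-Length header if present)
    try:
        r = requests.head(url, timeout=10, allow_redirects=True)
        return url, r.status_code, r.headers.get("Content-Length")
    except requests.RequestException:
        return url, None, None

def check_all(urls, workers=15):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(head, urls))
```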
I want to reduce the total time it takes for a web server to request and receive data from an API server for a given query.
Assuming MySQL was the bottleneck, I switched the API server's database to Cassandra, but the total time remains the same. Maybe something else is the bottleneck, which I have not been able to figure out.
Environment:
Estimated number of requests per minute: 100
Database: MySQL / Cassandra
Hardware: EC2 Small
Server used: Apache HTTP
Current Observations:
Cassandra query response time: 0.03 s
Time between request made and response received: 4 s
Required:
Time between request made and response received: 1 s
BOTTOM LINE: How can we reduce the total time taken in this case?
Feel free to ask for more details if required. Thanks
Summarizing from the chat:
Environment:
Running on a small Amazon EC2 instance (1 virtual CPU, 1.7GB RAM)
Web server is Apache
100 worker threads
Python is using Pylons (which implies WSGI)
Test client also running in EC2
Tests:
1.8k requests, single thread
Unknown CPU cost
Cassandra request time: 0.079s (spread 0.048->0.759)
MySQL request time: 0.169s (spread 0.047->1.52)
10k requests, multiple threads
CPU runs at 90%
Cassandra request time: 2.285s (spread 0.102->6.321)
MySQL request time: 7.879s (spread 0.831->14.065)
Observation: 100 threads is probably far too many on your small EC2 instance. Bear in mind that each worker occupies memory and resources even when it is doing nothing. Reducing the number of threads reduces:
Memory contention (and memory paging kills performance)
CPU cache misses
CPU contention
DB contention
Recommendation: You should aim to run only as many threads as are needed to max out your CPU (but fewer if they max out on memory or other resources). Running more threads increases overheads and decreases throughput.
Observation: Your best single-threaded time suggests a best-case cost of about 0.05 CPU-seconds per request (and since some of that time is latency, i.e. waiting for IO, the true CPU cost may be quite a lot lower). Assuming the CPU is the bottleneck in your architecture, you are probably capable of 20-40 transactions a second on your EC2 server with thread tuning alone.
Recommendation: Use a standard Python profiler to profile the system (when running with an optimal number of threads). The profiler will show where the CPU spends the most time. Distinguish between waits (e.g. for the DB to return, or for the disk to read or write data) and the inherent CPU cost of the code.
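For example, one crude way to separate wall-clock time from CPU time for a single request (a sketch; handle_request is a placeholder for your actual Pylons controller action, and time.process_time needs Python 3.3+):

```python
import time

def timed(handle_request, *args, **kwargs):
    # report wall-clock vs. CPU time for one call; the gap is mostly waiting
    wall0, cpu0 = time.monotonic(), time.process_time()
    result = handle_request(*args, **kwargs)
    wall = time.monotonic() - wall0
    cpu = time.process_time() - cpu0
    print("wall=%.3fs cpu=%.3fs wait~=%.3fs" % (wall, cpu, wall - cpu))
    return result
```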
Where you have a high inherent CPU cost: can you decrease the cost? If this is not in your code, can you avoid that code path by doing something different? Caching? Using another library?
Where there is latency: given your single-threaded results, latency is not necessarily bad, provided the CPU can service another request in the meantime. In fact, you can get a rough idea of the number of threads you need by calculating total time / (total time - wait time).
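As a worked example with illustrative numbers (not your measurements): if one request takes 0.20 s of wall-clock time, of which 0.15 s is spent waiting, then:

```python
total_time = 0.20   # wall-clock seconds per request (illustrative)
wait_time = 0.15    # seconds spent waiting on the DB / IO (illustrative)
threads_needed = total_time / (total_time - wait_time)
print(round(threads_needed, 2))  # ~4 -> about 4 threads keep one core busy
```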
However, check to see that, while Python is waiting, the DB (for instance) isn't working hard to return a result.
Other thoughts: Consider how the test harness delivers HTTP requests. Does it fire them as fast as it can (e.g. trying to open 10k TCP sockets simultaneously)? If so, this may be skewing your results; it may be better to use a different load pattern and tool.
Cassandra works faster under high load, and an average time of 3-4 seconds between two systems on different sides of the world is OK.