Using urllib.request.urlopen("http://google.com") or requests.get("http://google.com") results in an extended delay (~1 minute or more) before getting a response.
Hey everyone,
I am trying to do some web scraping with code that relies on urllib. Things were going well yesterday, but today I'm getting significant time lags. I've narrowed it down to urllib and reproduced the problem with requests.get. Basically, when I run the code below it takes roughly 1 minute to get a response. This was not happening yesterday. The response itself is fine, but I don't know what is happening behind the scenes to cause the delay. Any suggestions on how to debug this, or do you have an idea of what could be going on?
Thanks in advance.
My OS: Ubuntu 18.04
import urllib.request  # import the submodule explicitly; "import urllib" alone is not guaranteed to expose urllib.request
response = urllib.request.urlopen('http://google.com')
print(response)
I get the result I am looking for, but the problem I'm running into is that it takes >1 min of load time...
There may be a few possible causes:
1. You have a slow connection due to a service provider issue.
2. You have too many tasks running / CPU overload (not sure how slow your computer is, though).
Try restarting your computer and your wifi. It may be a hardware issue, and there are many possible solutions. I do not believe it is a problem with your code, because assigning a response to a variable is not a demanding task at all. I suggest you follow these steps and give me an update on whether they worked :)
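If you want to narrow down where the minute is going, one thing I would check (an assumption on my part, not a diagnosis) is whether the time is spent in name resolution rather than in the HTTP request itself, since a slow or misconfigured DNS resolver can produce exactly this kind of fixed delay. A minimal sketch that times the two separately:

import socket
import time
import urllib.request

host = 'google.com'

# Time the DNS lookup on its own
start = time.time()
socket.getaddrinfo(host, 80)
print('DNS lookup took %.2f seconds' % (time.time() - start))

# Time the full HTTP request
start = time.time()
urllib.request.urlopen('http://' + host, timeout=120)
print('Full request took %.2f seconds' % (time.time() - start))

If the lookup alone accounts for most of the delay, the problem is in the resolver configuration rather than in Python or your code.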
Related
I've seen this Stack Overflow question, as well as this one. The first says that the white space comes from being blocked by local work, but stepping through my program, the ~20 delay occurs right when I call dask.compute() and not in the surrounding code. The asker there said their issue was resolved by disabling garbage collection, but this did nothing for me. The second says to check the scheduler profiler, but that doesn't seem to be taking a long time either.
My task graph is dead simple: I'm calling a function on 500 objects with no task dependencies (and I repeat this 3 times; I'll link the functions once I figure out this issue). Here is my dask performance report HTML, and here is the section of code that is calling dask.compute().
Any suggestions as to what could be causing this? Any suggestions as to how I can better profile to figure this out?
This doesn't seem to be the main problem, but lines 585/587 will transfer the computed results to the local machine, which could slow things down or introduce a bottleneck. If the results are not used locally downstream, one option is to leave the computations on the remote workers by calling client.compute (assuming the client is named client):
# changing line 587: preprocessedcases = dask.compute(*preprocessedcases)
preprocessedcases = client.compute(list(preprocessedcases))  # client.compute takes the collections as a single list, not unpacked
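Note that, unlike dask.compute, client.compute returns Future objects immediately rather than results, so anything that does need the values later has to gather them explicitly. A minimal sketch of the pattern (the function and task list are illustrative, not the asker's actual code):

from dask import delayed
from dask.distributed import Client

def double(x):
    return x * 2

client = Client()  # connect to (or start) a cluster

# Stand-ins for the real preprocessing tasks
tasks = [delayed(double)(i) for i in range(5)]

# Keep the work, and the results, on the workers
futures = client.compute(tasks)

# Pull results back to the local machine only if/when they are needed
results = client.gather(futures)
print(results)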
I have a django app with a celery instance that consumes and synchronizes a very large amount of data multiple times a day. I’ll note that I am using asyncio to call a library for an API that wasn’t made for async. I’ve noticed that after a week or so the server becomes painfully slow and can even become days behind in tasks after a few weeks.
Looking at my host's profiler, the RAM and CPU usage aren't going wild, but I know it's becoming slower and slower every week, because that Celery instance also handles emails at a specific time, and those go out hours and hours later as the weeks pass.
Restarting the instance seems to fix everything instantly, leading me to believe I have something like a memory leak (but the RAM isn't going wild) or something like unclosed threads (I have no idea how to detect this, and the CPU isn't going wild).
Any ideas?
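One rough way to check the unclosed-threads hypothesis is to log the worker's thread and open-file-descriptor counts periodically and see whether they climb between restarts; a minimal sketch, assuming psutil is available:

import os
import threading
import psutil

def log_resource_usage():
    # These counts should stay roughly flat if nothing is leaking
    proc = psutil.Process(os.getpid())
    print('python threads: %d' % len(threading.enumerate()))
    print('os threads:     %d' % proc.num_threads())
    print('open fds:       %d' % proc.num_fds())  # Unix-only

log_resource_usage()

Calling something like this from a periodic task gives a rough trend line without attaching a profiler.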
This sounds like a very familiar issue with Celery which is still open on GitHub - here
We are experiencing similar issues and unfortunately didn't find a good workaround.
It seems that this comment found the cause, but we didn't have time to find and implement a workaround, so I can't say for sure - please update this thread if you find something that helps. As this is open source, no one is responsible for a fix but the community itself :)
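A common mitigation for this kind of gradual worker degradation (a workaround, not a fix for the linked issue) is to recycle worker processes periodically so that leaked memory or handles are reclaimed automatically; a minimal sketch of the relevant Celery settings (the app name is illustrative):

from celery import Celery

app = Celery('myproject')

# Replace each worker process after it has handled this many tasks
app.conf.worker_max_tasks_per_child = 1000

# Also recycle a worker if its resident memory grows past this limit (in KiB)
app.conf.worker_max_memory_per_child = 500000  # ~500 MB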
I'm currently using Python 2.7 and the requests module to upload files via HTTP POST requests in a little script of mine.
All things considered, the script is doing its job, but sometimes the servers on the other end are quite slow, and I want to abort the upload after a certain amount of time (e.g. after 60 seconds).
I looked into the timeout parameter in the requests module, but that only applies to the connection and server response times, not to the total duration of the transfer.
So, actually, I've got two questions:
1) Do you know a solution to my problem or an already existing module out there that I could use?
2) Also, do you know if it's a Python problem that upload speeds from a Python script are quite slow compared to uploads from the browser (3-5 MB/s vs. 20-30 MB/s; I'm using a server)?
If this question is a duplicate, please bear with me. I used the search but no thread was spot-on.
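On question 1: as far as I know, requests has no built-in cap on the total duration of a request, so one workaround on Unix is to interrupt the upload from a signal handler once a deadline passes. A rough sketch (the function, URL and file names are made up for illustration):

import signal
import requests

class UploadTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise UploadTimeout('upload took too long')

def upload_with_deadline(url, path, seconds=60):
    # Arm a one-shot alarm; if the POST is still running when it fires,
    # the handler raises and the upload is aborted.
    signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(seconds)
    try:
        with open(path, 'rb') as f:
            return requests.post(url, files={'file': f})
    finally:
        signal.alarm(0)  # always disarm the alarm

This only works in the main thread on Unix-like systems; an alternative is to run the upload in a separate process and terminate that process after the deadline.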
I am making HEAD requests on anywhere between 100,000 and 500,000 URLs to return the size and the HTTP status code of each. I have tried four different approaches: a thread pool, an asynchronous Twisted client, a grequests implementation, and a concurrent.futures based solution. In a previous question similar to this one, the thread pool implementation is said to finish in 6 to 10 minutes. Running that exact code on a dummy list of 100,000 URLs takes over 4 hours on my machine. My Twisted solution (different from the one mentioned in the linked question) similarly takes around 3.5 hours to complete, and the same goes for the concurrent.futures solution.
I am relatively confident I have written the implementations correctly, especially in the case where I copied and pasted the code from a previous example. How can I diagnose where the slowdown is occurring? My guess is that it happens when making the connection, but I have no idea how to prove this, or how to fix it if that is the problem. I am pretty certain it is not a CPU-bound problem, as the CPU time after 100,000 URLs is only 3 minutes. Any help in figuring out how to diagnose the issue, and in turn fix it, would be greatly appreciated.
Some More Information:
- Using Requests to make the requests, or treq with Twisted.
- Appending the results to a list (with the garbage collector off) or a pandas dataframe does not seem to make a speed difference.
- I have experimented with anywhere between 4 and 200 workers/threads in my various tests, and 15 seems to be optimal.
- The machine I am using has 16 cores and a high-speed (100 Mbps) internet connection.
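One thing worth ruling out (an assumption on my part, not a diagnosis) is that every request opens a brand-new connection instead of reusing pooled keep-alive connections, since per-request DNS + TCP setup adds up quickly at this scale. With requests that means giving each thread a Session with an adequately sized connection pool; a minimal sketch:

import concurrent.futures
import threading

import requests
from requests.adapters import HTTPAdapter

thread_local = threading.local()

def get_session():
    # One Session per thread, with a pool big enough to keep connections alive
    if not hasattr(thread_local, 'session'):
        session = requests.Session()
        adapter = HTTPAdapter(pool_connections=100, pool_maxsize=100)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        thread_local.session = session
    return thread_local.session

def check(url):
    try:
        resp = get_session().head(url, timeout=10, allow_redirects=True)
        return url, resp.status_code, resp.headers.get('Content-Length')
    except requests.RequestException as exc:
        return url, None, str(exc)

def run(urls, workers=15):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, urls))

If pooled connections change the timing dramatically, connection setup was the bottleneck; if not, timing the DNS lookup and the connect separately for a small sample of URLs would be my next step.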
I'm writing a Selenium script in Python. Something I found out is that when Selenium gets a 404 status code, it crashes. What is the best way to deal with this?
I had a similar problem. Sometimes a server we were using (i.e., not the main server we were testing, only a "sub-server") would crash during our tests. I added a minor sanity check to see whether the server was up before the main tests ran. That is, I performed a simple GET request to the server, surrounded it with try-catch, and if that passed I continued with the tests. Let me stress this point: before I even started Selenium, I would perform the GET request using Python's urllib2. It's not the best of solutions, but it's fast and it was enough for me.
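A rough sketch of that pre-check, assuming Python 2 and urllib2 as in the answer above (the URL is a placeholder):

import urllib2

def server_is_up(url, timeout=10):
    # Simple GET-based sanity check before starting the Selenium run
    try:
        urllib2.urlopen(url, timeout=timeout)
        return True
    except urllib2.URLError:  # also covers urllib2.HTTPError
        return False

if server_is_up('http://example.com/health'):
    # ...start the Selenium tests here...
    pass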