Celery: Instance becomes slow after a week or two

I have a Django app with a Celery instance that consumes and synchronizes a very large amount of data several times a day. I'll note that I'm using asyncio to call a library for an API that wasn't built for async. I've noticed that after a week or so the server becomes painfully slow, and after a few weeks it can fall days behind on tasks.
Looking at my host's profiler, neither RAM nor CPU usage is going wild, but I know it's getting slower every week, because the same Celery instance also sends emails at a fixed time, and those go out hours and hours later as the weeks pass.
Restarting the instance fixes everything instantly, which leads me to believe I have something like a memory leak (but RAM usage isn't growing) or something like unclosed threads (I have no idea how to detect this, and the CPU isn't going wild).
Any ideas?
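As an aside on the "unclosed threads" suspicion: leaked threads can be spotted from inside the running process with the standard library, e.g. logged periodically from a debug task. A minimal sketch (the function name is illustrative):

```python
import threading

def thread_report():
    # Lists every live thread in the process; a count that grows steadily
    # between calls suggests something is spawning threads without joining them.
    return sorted((t.name, t.daemon) for t in threading.enumerate())
```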

This sounds like a very familiar issue with Celery which is still open on GitHub - here.
We are experiencing similar issues and unfortunately didn't find a good workaround.
It seems that this comment identified the cause, but we didn't have time to implement a workaround, so I can't say for sure - please update if you find something that helps. As this is open source, no one is responsible for a fix but the community itself :)
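A commonly used mitigation while hunting this kind of leak is to let Celery recycle its worker processes. A sketch of a celeryconfig.py, assuming Celery 4+ (the setting names are real Celery options; the numbers are placeholders to tune):

```python
# celeryconfig.py - sketch: recycle worker children so slow leaks
# (memory, threads, stray event loops) cannot accumulate for weeks.

# Replace each child process after it has executed this many tasks
# (equivalent to the --max-tasks-per-child worker flag).
worker_max_tasks_per_child = 500

# Also replace a child whose resident memory exceeds this many kilobytes
# (equivalent to --max-memory-per-child).
worker_max_memory_per_child = 200_000  # ~200 MB
```

This doesn't fix the underlying leak, but it turns "restart the instance every few weeks by hand" into something the worker does for itself continuously.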

Why do I have such large gaps in my dask distributed task stream?

I've seen this stackoverflow question, as well as this one. The first says that the whitespace is from being blocked by local work, but stepping through my program, the ~20 delay occurs right when I call dask.compute() and not in the surrounding code. That asker's issue was resolved by disabling garbage collection, but this did nothing for me. The second says to check the scheduler profiler, but that doesn't seem to be taking a long time either.
My task graph is dead simple - I'm calling a function on 500 objects with no task dependencies. (And repeat this 3 times, I'll link the functions once I figure out this issue). Here is my dask performance report html, and here is the section of code that is calling dask.compute().
Any suggestions as to what could be causing this? Any suggestions as to how I can better profile to figure this out?
This doesn't seem to be the main problem, but lines 585/587 will transfer the computed results to the local machine, which could slow things down or introduce a bottleneck. If the results are not used locally downstream, one option is to leave them on the remote machines by calling client.compute (assuming the client is named client), which returns futures instead of pulling the data back:
# was (line 587): preprocessedcases = dask.compute(*preprocessedcases)
preprocessedcases = client.compute(preprocessedcases)  # note: takes the list whole, not unpacked

Python multithreading - Global Interpreter Lock

The Python threading module documentation says something like this:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.
Can someone explain whether I can use the threading module in my situation or not?
I'm going to detect the frameworks used by websites.
So here is how my app works:
My MySQL database contains around 10 million domains (id, domain, frameworks).
1. Fetch 1000 rows from the database
2. Scrape the domains one by one using the requests module
3. Detect the frameworks
4. Update the database row with the results
Since I have 10 million domains, it's going to take a very long time, so I would like to speed up the process by using threads.
But I'm not sure whether my app is I/O bound or not. Can someone explain?
Thank you
I guess the most time-expensive activity will be fetching all the URLs.
So the answer to your question is: yes, your app is very likely I/O bound.
You plan to scrape domains one by one, which would lead to a really long processing time. You should definitely do this concurrently. One solution is described in my answer to a similar question about scraping web sites.
Anyway, the number of your URLs seems really large, so you might need to split the work across multiple workers - for this you could use e.g. the Celery framework. However, as your task is really I/O bound, you would only gain speed if your workers run on multiple machines, ideally with independent connectivity. I did a similar task on DigitalOcean machines and it worked very well.
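A minimal sketch of the concurrent approach with a thread pool. Here detect_frameworks is a hypothetical stand-in for your detector, and fetch is injected so the sketch stays runnable without network access (in the real app it would be something like requests.get(url).text):

```python
from concurrent.futures import ThreadPoolExecutor

def detect_frameworks(html):
    # Hypothetical detector: report which known frameworks appear in the page.
    known = ("wordpress", "django", "react")
    return [name for name in known if name in html.lower()]

def process_batch(domains, fetch, max_workers=20):
    # Threads work well here because fetch() is I/O bound: each thread
    # releases the GIL while it waits on the network, so many requests
    # can be in flight at once despite CPython's one-thread-at-a-time rule.
    def worker(domain):
        return domain, detect_frameworks(fetch(domain))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, domains))
```

The framework detection itself is CPU work and won't parallelize under the GIL, but for this workload it should be dwarfed by network wait time.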

R10 Boot Timeout Error - Conceptual

So I'm getting the very common
"Web process failed to bind to $PORT within 60 seconds of launch"
But none of the solutions I've tried have worked, so my question is much more conceptual.
What is supposed to be binding? It is my understanding that I do not need to write code specifically to bind the worker dyno to the $PORT, but rather that this failure is caused primarily by computationally intensive processes.
I don't have any really great code snippets to show here, but I've included the link to the github repo for the project I'm working on.
https://github.com/therightnee/RainbowReader_MKII
There is a long start-up time when the RSS feeds are first parsed, but I've never seen it go past 30 seconds. Even so, currently when you go to the page it should just render a template; in this setup there is no data processing being done initially. Testing locally, everything runs great, and even with the data parsing it doesn't take more than a minute in any test case.
This leads me to believe that somewhere I need to be setting or using the $PORT variable in some way, but I don't know how.
Thanks!
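Conceptually, "binding" means the web process must open a listening socket on the port Heroku assigns through the PORT environment variable - nothing else satisfies the router, no matter how fast the app starts. A sketch, assuming a Flask app (adjust for your framework):

```python
import os

def get_port(default=5000):
    # Heroku injects PORT at dyno launch; failing to bind a listener to it
    # within 60 seconds is exactly what triggers the R10 boot timeout.
    return int(os.environ.get("PORT", default))

# With Flask this would be:
#     app.run(host="0.0.0.0", port=get_port())
# or, more typically, in the Procfile:
#     web: gunicorn app:app --bind 0.0.0.0:$PORT
```

A hard-coded port that works locally is the classic cause: locally nothing checks which port you bound, but on Heroku the router only talks to $PORT.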

Google App Engine Instances keep quickly shutting down

So I've been using app engine for quite some time now with no issues. I'm aware that if the app hasn't been hit by a visitor for a while then the instance will shut down, and the first visitor to hit the site will have a few second delay while a new instance fires up.
However, recently it seems that the instances only stay alive for a very short period of time (sometimes less than a minute), and if I have one instance already up and running and I refresh an app webpage, it still fires up another instance (and the page it serves is minimal homepage HTML that shouldn't require much CPU/memory). Looking at my logs, it's constantly starting up new instances, which was never the case previously.
Any tips on what I should be looking at, or any ideas of why this is happening?
Also, I'm using Python 2.7, threadsafe, python_precompiled, warmup inbound services, NDB.
Update:
So I changed my app to have at least 1 idle instance, hoping that this would solve the problem, but it is still firing up new instances even though one resident instance is already running. So when there is just the 1 resident instance (and I'm not getting any traffic except me), and I go to another page on my app, it is still starting up a new instance.
Additionally, I changed the Pending Latency to 1.5s as koma pointed out, but that doesn't seem to be helping.
The memory usage of the instances is always around 53MB, which is surprising when the pages being called aren't doing much. I'm using the F1 Frontend Instance Class, which has a limit of 128MB, but either way 53MB seems high for what it should be doing. Is that an acceptable size when it first starts up?
Update 2: I just noticed in the dashboard that in the last 14 hours, the /_ah/warmup request returned 24 404 errors. Could this be related? Why would they be responding with a 404 status?
Main question: Why would it constantly be starting up new instances (even with no traffic)? Especially where there are already existing instances, and why do they shut down so quickly?
My solution to this was to increase the Pending Latency time.
If a webpage fires 3 ajax requests at once, AppEngine was launching new instances for the additional requests. After configuring the Minimum Pending Latency time - setting it to 2.5 secs, the same instance was processing all three requests and throughput was acceptable.
My project still has little load/traffic... so in addition to raising the Pending Latency, I opened an account at Pingdom and configured it to ping my App Engine project every minute.
The combination of both, makes that I have one instance that stays alive and is serving up all requests most of the time. It will scale to new instances when really necessary.
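For reference, the same knob can be set declaratively rather than through the console sliders - a sketch of an app.yaml fragment, assuming a runtime that supports the automatic_scaling block (the value is the one discussed above):

```yaml
automatic_scaling:
  # Let a request wait ~2.5s in the pending queue before a new
  # instance is spun up for it, instead of scaling out immediately.
  min_pending_latency: 2500ms
```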
1 idle instance means that app-engine will always fire up an extra instance for the next user that comes along - that's why you are seeing an extra instance fired up with that setting.
If you remove the idle instance setting (or use the default) and just increase pending latency it should "wait" before firing the extra instance.
With regards to the main question I think #koma might be onto something in saying that with default settings app-engine will tend to fire extra instances even if the requests are coming from the same session.
In my experience app-engine is great under heavy traffic but difficult (and sometimes frustrating) to work with under low traffic conditions. In particular it is very difficult to figure out the nuances of what the criteria for firing up new instances actually are.
Personally, I have a "wake-up" cron-job to bring up an instance every couple of minutes to make sure that if someone comes to the site an instance is ready to serve. This is not ideal because it eats into my quota, but it works most of the time because traffic on my app is reasonably high.
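The wake-up cron-job can be sketched in GAE's cron.yaml; the /keepalive URL is illustrative and needs a matching (trivial) handler in the app:

```yaml
cron:
- description: keep an instance warm under low traffic
  url: /keepalive
  schedule: every 2 minutes
```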
I only started having this type of issue on Monday February 4 around 10 pm EST, and is continuing until now. I first started noticing then that instances kept firing up and shutting down, and latency increased dramatically. It seemed that the instance scheduler was turning off idle instances too rapidly, and causing subsequent thrashing.
I set minimum idle instances to 1 to stabilize latency, which worked. However, there is still thrashing of new instances. I tried the recommendations in this thread to only set minimum pending latency, but that does not help. Ultimately, idle instances are being turned off too quickly. Then when they're needed, the latency shoots up while trying to fire up new instances.
I'm not sure why you saw this a couple weeks ago, and it only started for me a couple days ago. Maybe they phased in their new instance scheduler to customers gradually? Are you not still seeing instances shutting down quickly?

Google AppEngine startup times

I've already read how to avoid slow ("cold") startup times on AppEngine, and implemented the solution from the cookbook using 10 second polls, but it doesn't seem to help a lot.
I use the Python runtime, and have installed several handlers to handle my requests, none of them doing something particularly time consuming (mostly just a DB fetch).
Although the Hot Handler is active, I experience slow load times (up to 15 seconds or more per handler), and after the app has been idle for a while the log frequently shows the "This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time..." message.
This is very odd. Do I have to fetch each URL separately in the Hot Handler?
The "appropriate" way of avoiding too many slow startup times is to use the "Always On" option. Of course, this is not free ($0.30 per day).
