Kubernetes deployment high memory usage

I am running a Python Flask app in a GKE container, and memory usage inside the pod keeps increasing. I have set a memory limit on the pod, but it is getting killed.
I suspect a memory leak; can anybody suggest something after looking at this? As disk usage increases, memory also increases, and there are some page faults as well.
Is this something on the container side in the Linux OS (I am using the python-slim base image), where memory is not returned to the OS, or is it a Python/Flask memory management issue?
To check for a memory leak, I have added stackimpact to the application.
Please help!
Thanks in advance

If you set a memory limit on each GKE Deployment, then when the limit is hit the pod is killed, rescheduled, and should restart; the other pods on the node should be fine.
You can find more information by running these commands:
kubectl describe pod <YOUR_POD_NAME>
kubectl top pods
Please note that if you set a memory request larger than the amount of memory on your nodes, the pod will never be scheduled.
If the Pod cannot be scheduled because of insufficient resources or a configuration error, you might see an error indicating a lack of memory or another resource. A Pod stuck in Pending cannot be scheduled onto a node. In that case you need to delete Pods, adjust resource requests, or add new nodes to your cluster. You can find more information here.
Additionally, as per this document, Horizontal Pod Autoscaling (HPA) scales the replicas of your deployments based on metrics like memory or CPU usage.
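Since the question is whether the Flask process itself is leaking, one way to check from inside the app is Python's built-in tracemalloc module. The following is only a minimal sketch: the `top_growth` helper and the `leaky_cache` list are hypothetical stand-ins, and in a real app the two snapshots would be taken some time apart while it serves traffic.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocations changed most between two snapshots."""
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Hypothetical stand-in for application code that keeps references alive.
leaky_cache = [bytes(1024) for _ in range(10_000)]

current = tracemalloc.take_snapshot()
growth = top_growth(baseline, current)
for stat in growth:
    print(stat)  # file, line number, size delta and allocation count delta
tracemalloc.stop()
```

If the same lines keep growing across repeated snapshots, that points to the application rather than the container. Note that tracemalloc only sees allocations made through Python's allocator, so memory held by C extensions would not show up here.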

Related

Routing requests to a specific Heroku Dyno

I built a real-time collaboration application with Prosemirror that uses a centralised operational transform algorithm (described here by Marijn Haverbeke) with a Python server using Django Channels with prosemirror-py as the central point.
The server creates a DocumentInstance for every document users are collaborating on, keeps it in memory and occasionally stores it in a Redis database. As long as there is only one Dyno, all requests are routed there. Then the server looks up the instance the request belongs to and updates it.
I would like to take advantage of Heroku's horizontal scaling and run more than one dyno. But as I understand, this would imply requests being routed to any of the running dynos. But since one DocumentInstance can only live on one server this would not work.
Is there a way to make sure that requests belonging to a specific DocumentInstance are only routed to the machine that keeps it?
Or maybe there is an alternative architecture that I am overlooking?

When building a scalable architecture, the horizontal autoscaling layer is typically ephemeral: no state is kept there, and it serves only for processing or for I/O to other services, because these boxes are shut down and spun up regularly. You should not keep the DocumentInstance data there, as it will be lost when dynos spin up and shut down during autoscaling.
If you're using Redis as a centralized datastore, you would be much better off saving to Redis in real time and having all dynos read from and write to Redis to stay in sync. Redis is capable of this, and it would move all state from ephemeral dynos to a non-ephemeral datastore. You may, however, be able to find a middle ground for your case using https://devcenter.heroku.com/articles/session-affinity.
I run 123 Dyno, an autoscaling and monitoring add-on for Heroku, for more options and faster, configurable autoscaling on Heroku when you get there.
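To make the "move all state to a shared datastore" idea concrete, here is a rough sketch of a request handler that keeps nothing in dyno memory. Everything in it is an assumption for illustration: `InMemoryStore` is a stand-in for a shared store such as Redis (only `get` and `set` are used), and `apply_steps` and the `doc:<id>` key layout are hypothetical, not part of prosemirror-py.

```python
import json


class InMemoryStore:
    """Hypothetical stand-in for a shared datastore; only get/set are used."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def apply_steps(store, doc_id, client_version, steps):
    """Accept steps only if the client is at the current version.

    All state is loaded from and written back to the store, so any dyno
    can serve any request for any document.
    """
    key = f"doc:{doc_id}"
    raw = store.get(key)
    doc = json.loads(raw) if raw else {"version": 0, "steps": []}
    if client_version != doc["version"]:
        return False, doc["version"]  # stale client: it must rebase and retry
    doc["steps"].extend(steps)
    doc["version"] += len(steps)
    store.set(key, json.dumps(doc))
    return True, doc["version"]


store = InMemoryStore()
ok_a, version_a = apply_steps(store, "d1", 0, ["insert 'a'"])
ok_b, version_b = apply_steps(store, "d1", 0, ["insert 'b'"])  # based on an old version
print(ok_a, version_a, ok_b, version_b)
```

The read-modify-write above is not atomic. With several dynos writing concurrently, a real deployment would need the datastore's transaction or locking support around it.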

How to set gunicorn worker number in a kubernetes' pod

I'm running a flask application with gunicorn and gevent worker class. In my own test environment, I follow the official guide multiprocessing.cpu_count() * 2 + 1 to set worker number.
If I want to put the application on Kubernetes' pod and assume that resources will be like
resources:
  limits:
    cpu: "10"
    memory: "5Gi"
  requests:
    cpu: "3"
    memory: "3Gi"
how to calculate the worker number? should I use limits CPU or requests CPU?
PS. I'm launching application via binary file packaged by pyinstaller, in essence flask run(python script.py), and launch gunicorn in the main thread:
def run():
    ...
    if config.RUN_MODEL == 'GUNICORN':
        # Pass gunicorn its options through sys.argv before handing over control.
        sys.argv += [
            "--worker-class", "gevent",
            "-w", config.GUNICORN_WORKER_NUMBER,
            "--worker-connections", config.GUNICORN_WORKER_CONNECTIONS,
            "--access-logfile", "-",
            "--error-logfile", "-",
            "-b", "0.0.0.0:8001",
            "--max-requests", config.GUNICORN_MAX_REQUESTS,
            "--max-requests-jitter", config.GUNICORN_MAX_REQUESTS_JITTER,
            "--timeout", config.GUNICORN_TIMEOUT,
            "--access-logformat", '%(t)s %(l)s %(u)s "%(r)s" %(s)s %(M)sms',
            "app.app_runner:app"
        ]
        sys.exit(gunicorn.run())

if __name__ == "__main__":
    run()
PS. Whether I set the worker number from the CPU limit (10*2+1=21) or the CPU request (3*2+1=7), the performance still doesn't meet my expectations. Any suggestions to try for improving performance are welcome under this question.

how to calculate the worker number? should I use limits CPU or requests CPU?
It depends on your situation. First, look at the documentation about request and limits (this example is for memory, but the same is for CPU).
If the node where a Pod is running has enough of a resource available, it's possible (and allowed) for a container to use more resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.
For example, if you set a memory request of 256 MiB for a container, and that container is in a Pod scheduled to a Node with 8GiB of memory and no other Pods, then the container can try to use more RAM.
If you set a memory limit of 4GiB for that container, the kubelet (and container runtime) enforce the limit. The runtime prevents the container from using more than the configured resource limit. For example: when a process in the container tries to consume more than the allowed amount of memory, the system kernel terminates the process that attempted the allocation, with an out of memory (OOM) error.
Answering your question: first of all, you need to know how many resources (e.g. CPU) your application needs. The request is the minimum amount of CPU the application must receive; you have to work this value out yourself, that is, you must know the minimum CPU the application needs to run properly and set the request accordingly. If your application performs better when it receives more CPU, consider adding a limit (the maximum amount of CPU the application can receive). If you want to size the worker number for the highest performance, use the limit in the calculation. If, on the other hand, you want the application to run smoothly (perhaps not as fast as possible, but consuming fewer resources), use the request.
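As a small illustration of the arithmetic, here is a sketch that applies the 2 * CPU + 1 rule from the question to a Kubernetes CPU quantity. The helper names `parse_cpu` and `worker_count` are made up for this example, and the parser only handles plain numbers and the millicore suffix "m".

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as "3" or "500m" to cores."""
    quantity = str(quantity).strip()
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores to cores
    return float(quantity)


def worker_count(cpu_quantity):
    """Apply the 2 * CPU + 1 rule of thumb, with a floor of one worker."""
    cores = parse_cpu(cpu_quantity)
    return max(1, int(cores * 2) + 1)


from_requests = worker_count("3")   # sized for the guaranteed CPU
from_limits = worker_count("10")    # sized for the burst ceiling
print(from_requests, from_limits)
```

Which of the two values to pass as `-w` is the trade-off described above: the request-based figure is the conservative choice, while the limit-based figure only pays off when the node actually has spare CPU.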

gunicorn behind nginx high memory usage

I have an issue with gunicorn behind nginx controller.
I have a microservice written in Python with aiohttp, served by gunicorn and deployed in a Kubernetes cluster. I decided to stress test my app, using locust for this purpose. The problem is: when I run my app in a Docker container locally, it shows pretty good results, but when I stress test it in the Kubernetes cluster I see high memory usage by the pod where my app is running. I thought it was a memory leak and checked docker stats while stress testing my app locally, where it was using 80-90 MiB of RAM. But when I stress test within the cluster I see growing memory usage on the Grafana dashboard. Memory usage reaches up to 1.2 GB, and when I stop locust it does not stabilize; it just jumps between 600 MB and 1.2 GB, and I see the spikes on the graph.
The pod is given 1 cpu and unlimited memory for now.
This is my gunicorn config:
workers = 1
bind = f"{SERVICE_HOST}:{SERVICE_PORT}"
worker_class = "aiohttp.GunicornUVLoopWebWorker"
#worker_connections = 4096
#max_requests = 4096
#max_requests_jitter = 100
I have tried different configuration of gunicorn with 3 workers (2*nCPU + 1) and max_request with jitter to restart workers. But haven't got good results.
One thing I discovered - when I am doing high load (500 users simultaneously) locust shows client timeouts with 'Remote disconnected'. I have read in gunicorn docs that it is a good practice to put gunicorn behind nginx because nginx can buffer the responses. And when I am testing locally or within a cluster I do not have errors like that.
The main question I have not figured out yet is why the memory usage differs locally and within a cluster?
With 1 worker when testing locally docker stats shows 80-90 MiB, but grafana graph shows what I have already described...

First of all thanks to #moonkotte for trying to help!
Today I found out what the cause of this problem is.
So, the problem is related to gunicorn workers and the prometheus_multiproc_dir env variable, which sets the path where counter data is saved. I don't know yet why this is happening, but I just deleted this env variable and everything worked fine, except Prometheus :). I think this relates to this issue and these limitations. I will dig deeper to solve this.
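While digging, one thing that may be worth checking is whether the directory behind that env variable keeps growing. The sketch below rests on an assumption I have not verified for this setup: that the Prometheus client writes per-worker files into that directory. The `dir_usage` helper and the file names in the demo are hypothetical.

```python
import os
import tempfile


def dir_usage(path):
    """Return (file_count, total_bytes) for the regular files directly in path."""
    count = 0
    total = 0
    for entry in os.scandir(path):
        if entry.is_file():
            count += 1
            total += entry.stat().st_size
    return count, total


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("counter_101.db", "counter_102.db"):
        with open(os.path.join(demo_dir, name), "wb") as f:
            f.write(bytes(4096))
    file_count, total_bytes = dir_usage(demo_dir)

print(file_count, total_bytes)
```

Logging these two numbers periodically during the load test, with the real directory path, would show whether the growth on the dashboard tracks the files in that directory.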

How can I make use of swap space/virtual RAM in Jupyter lab/notebook?

I am running processes in Jupyter (lab) in a JupyterHub-created container running on Kubernetes.
The processes are too RAM-intensive, to the extent that the pod sometimes gets evicted due to an OOM.
Without modifications to my code/algorithms etc., how can I in the general case tell Jupyter(Lab) to use swap space/virtual memory when a predefined RAM limit is reached?
PS This question has no answer mentioning swap space - Jupyter Lab freezes the computer when out of RAM - how to prevent it?

You can't actively control swap space.
In Kubernetes specifically, you can simply not set a memory limit on the pod.
That would at least keep it from being killed for exceeding its limit (OOM, out of memory). However, I doubt it would work, because the whole node will then run out of RAM, start swapping, become extremely slow, and at some point be declared dead by the Kubernetes master. That in turn will cause the Pod to be run somewhere else and start all over again.
A more scalable approach for you might be to use out-of-core algorithms, that can operate on disk directly (so just attach a PV/PVC to your pod), but that depends on the algorithm or process you're using.
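To show what "out-of-core" means in practice, here is a minimal sketch that computes a mean over a binary file of float64 values while holding only one chunk in RAM at a time. The file layout, the `chunked_mean` name, and the chunk size are assumptions for illustration; the file size is assumed to be a multiple of 8 bytes.

```python
import array
import os
import tempfile


def chunked_mean(path, chunk_items=1_000_000):
    """Mean of a file of native float64 values, read one chunk at a time."""
    item_size = array.array("d").itemsize
    total = 0.0
    count = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_items * item_size)
            if not data:
                break
            chunk = array.array("d")
            chunk.frombytes(data)  # only this chunk is held in memory
            total += sum(chunk)
            count += len(chunk)
    return total / count if count else 0.0


# Demo: write ten values to a temporary file and read them back in chunks of four.
values = array.array("d", range(10))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(values.tobytes())
    path = tmp.name
mean = chunked_mean(path, chunk_items=4)
os.remove(path)
print(mean)
```

Peak memory is bounded by `chunk_items` rather than by the size of the data, which is what lets the data live on a PV/PVC instead of in the pod's RAM.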

Managing workers on AWS

I occasionally have really high-CPU intensive tasks. They are launched into a separate high-intensity queue, that is consumed by a really large machine (lots of CPUs, lots of RAM). However, this machine only has to run about one hour per day.
I would like automate deployment of this image on AWS, to be triggered by outstanding messages in the high-intensity queue, and then safely stopped once it is not busy. Something along the lines of:
Some agent (presumably my own software running on my monitor server) checks the queue size and determines that there are x > x_threshold new jobs to be done (e.g. I want to trigger if there are 5 outstanding "big" jobs)
A specific AWS instance is started, registers itself with the broker (RabbitMQ) and consumes the jobs
Once the worker has been idle for some t > t_idle (say, longer than 10 minutes), the machine is shut down.
Are there any tools that I can use for this, to ease the automation process, or am I going to have to bootstrap everything myself?

You can publish a custom metric to AWS CloudWatch, then set up an autoscale trigger and scaling policy based on your custom metric. Autoscale can start the instance for you and will kill it based on your policy. You'll have to include the appropriate user data in the launch configuration to bootstrap your host. Just like user data for any EC2 instance, it could be a bash script, an Ansible playbook, or whatever your config management tool of choice is.
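Publishing the queue depth could look roughly like the sketch below. It assumes `cloudwatch` is a boto3 CloudWatch client (created with `boto3.client("cloudwatch")`), whose `put_metric_data` call takes a `Namespace` and a `MetricData` list. The namespace, metric name, and the `RecordingClient` stand-in are made up so the example runs without AWS credentials.

```python
def publish_queue_depth(cloudwatch, depth,
                        namespace="Custom/Workers",
                        metric_name="HighIntensityQueueDepth"):
    """Send the number of outstanding jobs as a custom CloudWatch metric."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[{
            "MetricName": metric_name,
            "Value": float(depth),
            "Unit": "Count",
        }],
    )


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
publish_queue_depth(client, 5)
print(client.calls[0]["MetricData"][0])
```

The agent on the monitor server would call this on a schedule with the real queue size, and the scaling policy would then key off the metric crossing the threshold.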

Maybe overkill for your scenario, but as a starting point you may want to check out AWS OpsWorks.
http://aws.amazon.com/opsworks/
http://aws.amazon.com/opsworks/faqs/
If that is indeed a bit higher level than you need, you could use AWS CloudFormation, which is perhaps a bit 'closer to the metal' for what you want.
http://aws.amazon.com/cloudformation/
