My small AWS EC2 instance runs two Python scripts: one receives JSON messages over a WebSocket (~2 msg/ms) and writes them to a CSV file, and the other compresses and uploads the CSVs. After testing, the data recorded by the EC2 instance (~2.4 GB/day) is sparser than the data recorded on my own computer (~5 GB/day). Monitoring shows the EC2 instance has consumed all its CPU credits and is running at baseline performance. My question is: does the instance drop messages because it cannot write them fast enough?
Thank you to anyone that can provide any insight!
It depends on the WebSocket server.
If your first script cannot run fast enough to keep up with the rate at which the server generates messages, the TCP receive buffer will fill up and the server will slow down its sending. Assuming a near-constant message production rate, unprocessed messages will pile up on the server, and the server could be coded either to let them accumulate or to eventually drop them.
Even if the server never dropped a message, without enough computational power, your instance would never catch up - on 8/15 it could be dealing with messages from 8/10 - so instance upgrade is needed.
Does the data rate vary greatly throughout the day (e.g. many more messages during the evening rush around 20:00)? If so, data loss may have occurred during that period.
But is Python really that slow? 5GB/day is less than 100KB per second, and even a fraction of one modern CPU core can easily handle it. Perhaps you should stress test your scripts and optimize them (reduce small disk writes, etc.)
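As a rough illustration of the "reduce small disk writes" suggestion, here is a minimal sketch that collects rows in memory and hands them to the csv module in batches. The names BatchedCsvWriter and bytes_per_second and the batch size are made up for this example and do not come from the asker's scripts. Whether batching helps depends on what the original script does per message (for instance, flushing or reopening the file every time), so treat it as a starting point for a stress test rather than a guaranteed fix.

```python
import csv


def bytes_per_second(gb_per_day):
    """Convert a daily volume in GB (10**9 bytes) to bytes per second."""
    return gb_per_day * 1e9 / 86400  # 86400 seconds in a day


class BatchedCsvWriter:
    """Hypothetical helper: buffer rows and write them in batches."""

    def __init__(self, fileobj, batch_size=1000):
        self._writer = csv.writer(fileobj)
        self._batch_size = batch_size  # assumed value, tune by measuring
        self._rows = []

    def add(self, row):
        # Keep the row in memory; only touch the file once per batch.
        self._rows.append(row)
        if len(self._rows) >= self._batch_size:
            self.flush()

    def flush(self):
        # Write all pending rows with a single writerows() call.
        if self._rows:
            self._writer.writerows(self._rows)
            self._rows = []


if __name__ == "__main__":
    # 5 GB/day works out to roughly 58 KB/s.
    print(round(bytes_per_second(5) / 1000), "KB/s")
```

Call flush() before closing the file so the last partial batch is not lost.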
Related
I have IoT devices (arm64) running Ubuntu with SD cards (formatted as ext4 with journaling). My application logs (via the Python logging library) to files on that SD card. The overall write speed, as reported by iotop, is around 40 KB/s, and the devices operate 24/7/365.
What I see is that once in a while (every week or so?) there is a spike in iowait (see attached screenshot from netdata).
When this happens, my process gets stuck for 5-15 seconds, which is a lot!
Now, I know that I should make my logging non-blocking so that my process does not get stuck when there is an issue with the disk, but this amount of time seems excessive considering that the write speed is very low.
It has gotten worse since I increased logging, but it is still not a lot of data.
My next steps are:
Use QueueHandler to do logging without blocking
Disable journaling on the SD card
Disable Docker logging, as it is also writing to disk.
But I want to understand the underlying issue that causes these stalls. What can it be?
Not a full solution, but adding QueueHandler made my app survive these high loads.
It is easy to simulate this with slowpokefs or by doing a lot of I/O (like tarring a big folder) while logging constantly.
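For reference, a minimal sketch of the QueueHandler approach using the standard library's logging.handlers.QueueHandler and QueueListener. The function name setup_nonblocking_logging, the logger name "app", and the file name app.log are placeholders for this example. The application thread only puts records on an in-memory queue; a background thread owned by the listener does the actual (possibly slow) file write.

```python
import logging
import logging.handlers
import queue


def setup_nonblocking_logging(target_handler, logger_name="app"):
    """Route a logger through a queue so slow disk writes do not block callers.

    target_handler is whatever handler does the real I/O (e.g. a FileHandler).
    """
    log_queue = queue.Queue(-1)  # unbounded; an assumption, bound it if memory matters
    queue_handler = logging.handlers.QueueHandler(log_queue)
    listener = logging.handlers.QueueListener(log_queue, target_handler)

    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    logger.addHandler(queue_handler)

    listener.start()  # starts the background thread that drains the queue
    return logger, listener


if __name__ == "__main__":
    file_handler = logging.FileHandler("app.log")  # placeholder path
    logger, listener = setup_nonblocking_logging(file_handler)
    logger.info("this call returns without waiting for the disk")
    listener.stop()  # drains remaining records before shutdown
```

This hides the stall from the application but does not explain or remove it; records simply wait in memory while the disk is busy.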
I have a simple Flask application that exposes one API. Calling the API runs a Python algorithm that does a lot of string manipulation and file reading (no writing). The algorithm takes about 1000 ms. I'm trying to see if there's any way to optimize concurrent requests. I'm running on a single VM instance with 4 vCPUs.
I wrote a client that makes a request every 1000 ms. There's minimal RAM usage, and CPU usage is about 35%. When I increase the rate to one request every 750 ms, RAM usage does not increase by much, but CPU usage doubles to 70%. If I increase the rate to one request every 500 ms, the responses start taking longer and eventually time out. CPU usage is at 100%, and RAM usage is still minimal.
I followed this tutorial to set up my application. I enabled threads in my uWSGI settings. However, I did not really notice much difference.
I was hoping to get some advice on what I can do software/settings-wise to respond better to concurrent requests.
I am doing my bachelor's thesis, for which I wrote a program that is distributed over many servers and exchanges messages via IPv6 multicast and unicast. The network usage is relatively high, but I don't think it is too high: my test has 15 servers, with 2 requests every second that work like this:
Server 1 requests information from servers 3-15 via multicast. Each of servers 3-15 must respond. If a response is still missing after 0.5 sec, the multicast is resent, but only the missing servers must respond (so in most cases this is only one server).
Server 2 does exactly the same. If results are still missing after 5 retries, the missing servers are marked as dead and the change is synced with the other server (1 or 2).
So there are 2 multicasts every second and 26 unicasts every second. I think this should not be too much?
Servers 1 and 2 are running Python web servers, which I use to trigger the request every second on each server (via a web client).
The whole scenario runs in a Mininet environment inside a VirtualBox Ubuntu VM that has 2 cores (max 2.8 GHz) and 1 GB RAM. While running the test, I see via htop that the CPUs are at 100% while the RAM is at 50%. So the CPU is the bottleneck here.
I noticed that after 2-5 minutes (1 minute = 60 * (2+26) messages = 1680 messages) there are too many missing results causing too many sending repetitions while new requests are already coming in, so that the "management server" thinks the client servers (3-15) are down and deregisters them. After syncing this with the other management server, all client servers are marked as dead on both management servers which is not true...
I am wondering if the problem could be my debug output. I print 3-5 lines for every message that is sent and received. So, guessing 5 lines per sent/received message, that is about (26 + 2) * 5 = 140 lines printed on the console every second.
I use Python 2.6 for the servers.
So the question here is: can the console output slow down the whole system so much that simple requests take more than 0.5 seconds to complete 5 times in a row? The request processing is simple in my test. No complex calculations or anything like that. Basically it is something like "return request_param in ["bla", "blaaaa", ...]" (a small list of 5 items).
If yes, how can I disable the output completely without having to comment out every print statement? Or is there even the possibility to output only lines that contain "Error" or "Warning"? (not via grep, because when grep becomes active all the prints already have finished... I mean directly in python)
What else could cause my application to be that slow? I know this is a very generic question, but maybe someone already has some experience with mininet and network applications...
I finally found the real problem. It was not the prints (removing them improved performance a bit, but not significantly) but a thread that was using a shared lock. This lock was shared across multiple CPU cores, which made the whole thing very slow.
It even got slower the more cores I added to the executing VM which was very strange...
Now the new bottleneck seems to be the APScheduler... I always get messages like "event missed" because there is too much load on the scheduler. So that's the next thing to speed up... :)
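On the side question of silencing the prints without commenting them out, one possible sketch is to replace sys.stdout with a small wrapper that only passes through lines containing certain keywords. The class name FilteredStdout and the keyword list are assumptions for this example. The code is written for Python 3, although the same idea should carry over to the print statement in Python 2.6, which also writes to sys.stdout. Moving to the logging module with levels is the more conventional route.

```python
import sys


class FilteredStdout:
    """Hypothetical wrapper: forward only lines that contain a keyword."""

    def __init__(self, stream, keywords=("Error", "Warning")):
        self._stream = stream
        self._keywords = keywords
        self._pending = ""  # text received so far without a newline

    def write(self, text):
        # print() may deliver a line in several pieces, so buffer until "\n".
        self._pending += text
        while "\n" in self._pending:
            line, self._pending = self._pending.split("\n", 1)
            if any(keyword in line for keyword in self._keywords):
                self._stream.write(line + "\n")
        return len(text)

    def flush(self):
        self._stream.flush()


if __name__ == "__main__":
    sys.stdout = FilteredStdout(sys.stdout)
    print("debug: sent multicast")        # suppressed
    print("Warning: retry 3 for server")  # shown
```

To disable output completely, pass an empty keywords tuple so that no line matches.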
Which is the more resource-friendly way to collect SNMP traps from a Cisco router via python:
I could use a manager on a PC running a server, where the Cisco SNMP traps are sent to in case one occurs
I could use an agent to send a GET/GETBULK request every x timeframe to check if any new traps have occurred
I am looking for a way to run the script so that it uses the least resources as possible. Not many traps will occur so the communication will be low mostly, but as soon as one does occur, the PC should know immediately.
Approach 1 is better from most perspectives.
It uses a little memory on the PC due to running a trap collecting daemon, but the footprint should be reasonably small since it only needs to listen for traps and decode them, not do any complex task.
Existing tools to receive traps include the net-snmp suite, which allows you to just configure the daemon (i.e. you don't have to do any programming if you want to save some time).
Approach 2 has a couple of problems:
No matter what polling interval you choose, you run the risk of missing an alarm that was only active on the router for a short time.
Consumes CPU and network resources even if no faults are occurring.
Depending on the MIB of the router, some types of event may not be stored in any table for later retrieval. For Cisco, I would not expect this problem, but you do need to study the MIB and make sure of this.
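To illustrate why a trap listener is cheap while idle, here is a bare-bones sketch using only the socket module. It blocks in recvfrom until a UDP datagram arrives, so it uses essentially no CPU between traps. The function names are made up for this example, and it does not decode the SNMP payload; in practice a tool such as net-snmp's snmptrapd or an SNMP library would do that. Port 162 is the standard trap port and typically needs elevated privileges to bind.

```python
import socket


def open_trap_socket(host="0.0.0.0", port=162):
    """Bind a UDP socket on which traps can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def wait_for_datagram(sock, timeout=None):
    """Block until one datagram arrives; with timeout=None, wait forever."""
    sock.settimeout(timeout)
    data, sender = sock.recvfrom(65535)  # large enough for any UDP payload
    return data, sender


if __name__ == "__main__":
    sock = open_trap_socket()
    while True:
        payload, (ip, _port) = wait_for_datagram(sock)
        # payload is the raw, still-encoded SNMP message; decoding is omitted.
        print("trap from", ip, "with", len(payload), "bytes")
```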
We have recently launched a django site which amongst other things, has a screen representing all sorts of data. A request to the server is sent every 10 seconds to get new data. The average response size is 10kb.
The site is working on approx. 30 clients, meaning every client sends a get request every 10 seconds.
When testing locally, responses came back after 80 ms. After deployment with ~30 users, responses take up to 20 seconds!!
So the initial thought is that my code sucks. I went through all my queries and did everything I could to optimize them and reduce calls to the database (which was hard; nearly everything is something like object.filter(id=num), and my tables have fewer than 5k rows at the moment...)
But then i noticed the same issue occurs in the admin panel! Which is clearly optimized and doesn't have my perhaps inefficient code, since I didn't write it. Opening the users tab takes 30 seconds at certain requests!!
So, what is it? Do I argue with the company sysadmins and demand a better server? They say we don't need better hardware (we are running on a dual-core 2.67 GHz machine with 4 GB RAM, which isn't a lot, but still shouldn't be THAT slow).
Doesn't the fact that the admin site is slow imply that this is a hardware issue?