Python/PySerial and CPU usage

I've created a script to monitor the output of a serial port that receives 3-4 lines of data every half hour. The script runs fine and grabs everything that comes off the port, which at the end of the day is what matters...
What bugs me, however, is that the CPU usage seems rather high for a program that's just monitoring a single serial port: one core sits at 100% the whole time this script is running.
I'm basically running a modified version of the code in this question: pyserial - How to Read Last Line Sent from Serial Device
I've tried polling the inWaiting() function at regular intervals and sleeping when inWaiting() is 0. I've tried intervals from 1 second down to 0.001 seconds (basically, as often as I can without driving up the CPU usage); this succeeds in grabbing the first line but seems to miss the rest of the data.
Adjusting the timeout of the serial port doesn't seem to have any effect on CPU usage, nor does putting the listening function into its own thread (not that I really expected a difference, but it was worth trying).
Should Python/pySerial be using this much CPU? (This seems like overkill.)
Am I wasting my time on this quest, or should I just bite the bullet and schedule the script to sleep during the periods when I know no data will be coming?

Maybe you could issue a blocking read(1) call, and when it succeeds use read(inWaiting()) to get the right number of remaining bytes.
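A minimal sketch of that approach (the device path and baud rate below are placeholders); with no timeout set, read(1) blocks without burning CPU until the first byte arrives:

import time
import serial

ser = serial.Serial('/dev/ttyUSB0', 9600, timeout=None)  # timeout=None -> read() blocks

while True:
    data = ser.read(1)                 # sits here idle until a byte arrives
    time.sleep(0.1)                    # give the rest of the burst a moment to arrive
    data += ser.read(ser.inWaiting())  # then drain whatever is already buffered
    print(data)

Depending on how quickly the device emits its 3-4 lines, the short pause (or a loop on inWaiting()) before the second read helps catch bytes that arrive just after the first one.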

Would a system-style solution be better? Create the Python script and have it executed via cron / Scheduled Task?
pySerial shouldn't be using that much CPU, but if it's just sitting there polling for an hour I can see how it may happen. Sleeping may be a better option, in conjunction with a periodic wakeup and poll.
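A rough sketch of that sleep-and-poll idea (port settings are placeholders); with data arriving only every half hour, even a one-second sleep keeps CPU usage near zero:

import time
import serial

ser = serial.Serial('/dev/ttyUSB0', 9600, timeout=0)  # non-blocking reads

while True:
    if ser.inWaiting():
        time.sleep(0.1)                   # let the rest of the burst arrive
        data = ser.read(ser.inWaiting())  # drain the buffer in one go
        print(data)
    time.sleep(1)                         # wake up once a second and check again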

Related

High iowait on sdcard - process stuck for 5-15 seconds

I have IoT devices (arm64) running Ubuntu with SD cards (formatted as ext4 with journaling). My application's logging (Python logging library) goes to files on that SD card; the overall write speed (as reported by iotop) is around 40 KB/s, and the device operates 24/7/365.
What I see is that once in a while (every week or so?) there is a spike in iowait (see attached screenshot from netdata).
When this happens my process gets stuck for 5-15 seconds, which is a lot!
Now I know that I should change my logging to be non-blocking so my process doesn't get stuck if there is an issue with the disk, but this amount of time seems excessive given that the write speed is very low.
It has gotten worse since I increased logging, but it is still not a lot of data.
My next steps are:
Use QueueHandler to do logging without blocking
Disable journaling on sdcard
Disable docker logging as it is also writing to disk.
But I want to understand the underlying issue that causes this kind of stalls, what can it be?
Not a full solution, but adding QueueHandler made my app survive these high loads.
It is easy to simulate this with slowpokefs, or just by doing a lot of IO (like tarring a big folder) while logging constantly.
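For reference, a minimal QueueHandler/QueueListener setup (logger name and log path are placeholders): the application thread only puts records on an in-memory queue, and a background listener thread does the slow file I/O.

import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded in-memory queue

file_handler = logging.FileHandler('/var/log/myapp/app.log')  # placeholder path
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logger = logging.getLogger('myapp')  # placeholder name
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info('this call returns immediately even if the SD card stalls')
# listener.stop() on shutdown flushes whatever is still queued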

Python: handle any and all postgres timeout situations

I am struggling to find a solution to this.
We use Python to monitor our Postgres databases that are running on AWS RDS. This means we have extremely limited control over the server side.
If there is an issue and the server fails (hardware fault, network fault, you name it), our scripts just hang, sometimes for 8-10 minutes (non-SSL) and up to 15-20 minutes (SSL connections). And by the time they recover and finally hit whatever timeout produces this seemingly random number of minutes, the server has failed over and everything works again.
Obviously, this renders our tools useless, if we can't catch these situations.
We basically run this (pseudo-code):
while True:
    try:
        query("select current_user")
    except:
        page("Database X failed")
For the basic use cases, this works just fine. E.g. if an instance is restarted, or something of the sort, no problems.
But, if there is an actual issue, the query just hangs. For minutes and minutes.
We've tried setting the statement_timeout on the psycopg2 connection. But that is a server setting, and if the instance fails and fails over, well, there is no server. So the client ends up waiting indefinitely or until it hits one of those arbitrary timeouts.
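Something we have not tried yet is passing the client-side libpq keepalive parameters straight through psycopg2.connect(); a rough sketch (the endpoint is a placeholder, and parameter availability depends on the libpq version psycopg2 was built against):

import psycopg2

conn = psycopg2.connect(
    host='db-name.xxxx.eu-west-1.rds.amazonaws.com',  # placeholder endpoint
    dbname='db-name',
    user='user',
    connect_timeout=10,     # bound the initial connection attempt
    keepalives=1,           # enable TCP keepalives on the client socket
    keepalives_idle=10,     # seconds of idle before the first probe
    keepalives_interval=5,  # seconds between probes
    keepalives_count=3,     # failed probes before the connection is declared dead
)

Even if those work, they would only bound the case where the server has vanished, not how long a live server can take to answer a query.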
I've looked into sockets and tested something like this:
import socket
import struct

pg.connect('user', 'db-name', instanceName='foo')  # pg: our own helper around psycopg2
fd = pg.Connection.fileno()                        # file descriptor of the libpq socket
s = socket.socket(fileno=fd)                       # wrap the same fd in a Python socket object
print(s)
s.settimeout(0.0000001)
timeval = struct.pack('ll', 0, 1)                  # 0 seconds, 1 microsecond
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO, timeval)
data = pg.query("SELECT pg_sleep(10);")
In dumping the socket with the print(s) statement, I can clearly see that we've got the right socket.
But the timeouts I set do nothing whatsoever.
I've tried many values, and it just has no effect. With the above, it should raise a timeout if more than 1 microsecond has elapsed. Ignoring common sense, I checked with tcpdump and made sure that we definitely do not get a response within 1 microsecond. Yet the thing just sits there and waits for pg_sleep(10) to complete.
Could someone shed some light on this?
It seems simple enough:
All I want is that ANY CALL made to postgres can NEVER TAKE LONGER than, say, 10 seconds. Regardless of what it is, regardless of what happens. If more than 10 seconds have elapsed, it needs to raise an exception.
From what I can see, the only way would be to use subprocess with a process-timeout. But we run threaded (we monitor hundreds of instances and spawn a persistent connection in a thread for each instance) and I've seen posts saying that this isn't reliable inside threads. It also seems silly to have each thread spawn yet another subprocess. Inception comes to mind. Where does it end?
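One thread-friendly alternative I have been sketching (untested, and it only helps when the server is alive but slow, not when it has vanished) is a watchdog timer that calls psycopg2's connection.cancel() from another thread:

import threading

def query_with_deadline(conn, sql, seconds=10):
    # conn is an open psycopg2 connection; cancel() interrupts the statement
    # currently running on it, so execute() returns with an error instead of hanging
    timer = threading.Timer(seconds, conn.cancel)
    timer.start()
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        timer.cancel()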
But I digress. The question seems simple enough, yet my wall is showing clear signs of a large dent developing.
Greatly appreciate any insights
PS: Python 3.6 on Ubuntu LTS
Cheers
Stefan

(AWS) What happens to a python script without enough CPU?

My small AWS EC2 instance runs two Python scripts: one receives JSON messages over a WebSocket (~2 msg/ms) and writes them to a CSV file, and the other compresses and uploads the CSVs. After testing, the data recorded by the EC2 instance (~2.4 GB/day) is sparser than when recorded on my own computer (~5 GB). Monitoring shows the EC2 instance consumed all its CPU credits and is operating at baseline performance. My question is: does the instance drop messages because it cannot write them fast enough?
Thank you to anyone that can provide any insight!
It depends on the WebSocket server.
If your first script cannot keep up with the message generation speed on the server side, the TCP receive buffer will fill up and the server will slow down its sending. Assuming a near-constant message production rate, unprocessed messages will pile up on the server, and the server could be coded either to let them accumulate or to eventually drop them.
Even if the server never dropped a message, without enough computational power your instance would never catch up: on 8/15 it could still be dealing with messages from 8/10. So an instance upgrade is needed.
Does the data rate vary greatly throughout the day (e.g. many more messages in the evening rush around 20:00)? If so, data loss may have occurred during that period.
But is Python really that slow? 5 GB/day is less than 100 KB per second, and even a fraction of one modern CPU core can easily handle that. Perhaps you should stress-test your scripts and optimize them (reduce small disk writes, etc.).
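As one example of "reduce small disk writes": buffer messages in memory and append them to the CSV in batches rather than one write per message (the batch size and file name below are arbitrary):

import csv

BATCH_SIZE = 1000   # arbitrary; tune against acceptable data-loss window
buffer = []

def on_message(row):
    buffer.append(row)
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    # one append-mode open and one bulk write per batch instead of per message
    with open('messages.csv', 'a', newline='') as f:
        csv.writer(f).writerows(buffer)
    buffer.clear()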

Comparing two different processes based on the real, user and sys times

I have been through other answers on SO about real, user and sys times. In this question, apart from the theory, I am interested in understanding the practical implications of the times reported by two different processes achieving the same task.
I have a Python program and a NodeJS program: https://github.com/rnanwani/vips_performance. Both work on a set of input images and process them to obtain different outputs, and both use libvips implementations.
Here are the times for the two
Python
real 1m17.253s
user 1m54.766s
sys 0m2.988s
NodeJS
real 1m3.616s
user 3m25.097s
sys 0m8.494s
The real time (the wall-clock time, as per other answers) is lower for NodeJS, which as per my understanding means that the entire process, from input to output, finishes much quicker with NodeJS. But the user and sys times are very high compared to Python. Also, using the htop utility, I see that the NodeJS process has a CPU usage of about 360% during the entire run, maxing out the 4 cores, while Python's CPU usage ranges from 250% down to 120%.
I want to understand a couple of things
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
My guess would be that node is running more than one vips pipeline at once, whereas python is strictly running one after the other. Pipeline startup and shutdown is mostly single-threaded, so if node starts several pipelines at once, it can probably save some time, as you observed.
You load your JPEG images in random access mode, so the whole image will be decompressed to memory with libjpeg. This is a single-threaded library, so you will never see more than 100% CPU use there.
Next, you do resize/rotate/crop/jpegsave. Running through these operations: resize will thread well, with the CPU load increasing as the square of the reduction; the rotate is too simple to have much effect on runtime; and the crop is instant. Although the jpegsave is single-threaded (of course), vips runs it in a separate background thread from a write-behind buffer, so you effectively get it for free.
I tried your program on my desktop PC (six hyperthreaded cores, so 12 hardware threads). I see:
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m2.907s
user 0m9.744s
sys 0m0.784s
That looks like we're seeing 9.7 / 2.9, or about a 3.4x speedup from threading, but that's very misleading. If I set the vips threadpool size to 1, you see something closer to the true single-threaded performance (though it still uses the jpegsave write-behind thread):
$ export VIPS_CONCURRENCY=1
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m18.160s
user 0m18.364s
sys 0m0.204s
So we're really getting 18.16 / 2.91, or about a 6.2x speedup.
Benchmarking is difficult and real/user/sys can be hard to interpret. You need to consider a lot of factors:
Number of cores and number of hardware threads
CPU features like SpeedStep and TurboBoost, which will clock cores up and down depending on thermal load
Which parts of the program are single-threaded
IO load
Kernel scheduler settings
And I'm sure many others I've forgotten.
If you're curious, libvips has its own profiler, which can help give more insight into the runtime behaviour. It can show you graphs of the various worker threads, how long they spend in synchronisation, how long in housekeeping, how long actually processing your pixels, when memory is allocated, and when it finally gets freed again. There's a blog post about it here:
http://libvips.blogspot.co.uk/2013/11/profiling-libvips.html
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
It doesn't necessarily mean they utilise the processor(s) more efficiently.
The higher user time means that Node is using more user-space processor time and, in turn, completing the task quicker. As stated by Luke Exton, the CPU is spending more time on "Code you wrote/might look at".
The higher sys time means there is more context switching happening, which makes sense from your htop utilisation numbers. This means the scheduler (kernel process) is jumping between Operating system actions, and user space actions. This is the time spent finding a CPU to schedule the task onto.
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
The question of implementation is a long one, and has many caveats. I would assume from the Python vs Node numbers that the Python threads are longer-running and in turn doing more processing inline. Another thing to note is the GIL in Python: Python bytecode execution is effectively single-threaded, and you can't easily break out of this. This could be a contributing factor to the Node implementation being quicker (using real threads).
The Node code appears to be correctly threaded and to split many tasks out. The advantages of a highly threaded application hit a tipping point where you spend MORE time trying to find a free CPU for a new thread than actually doing the work. When that happens, your Python implementation might start being faster again.
The higher user+sys time means that the process had more running threads, and as you've noticed from the 360% figure it used almost all available CPU resources of your 4 cores. That means the NodeJS process is already limited by the available CPU and unable to process more requests. Also, any other CPU-intensive process you eventually run on that machine will hurt your NodeJS process. On the other hand, the Python process doesn't take all available CPU resources and could probably scale with the number of requests.
So these times are not reliable in and of themselves; they say how long the process took to perform an action on the CPU. This is coupled very tightly to whatever else was happening at the same time on that machine and can fluctuate wildly based entirely on physical resources.
In terms of these times specifically:
real = Wall Clock time (Start to finish time)
user = Userspace CPU time (i.e. Code you wrote/might look at) e.g. node/python libs/your code
sys = Kernel CPU time (i.e. Syscalls, e.g Open a file from the OS.)
Specifically, a small real time means it actually finished faster. Does that mean it did the job better? No, not for sure. There could simply have been less happening on the machine at the same time, for instance.
In terms of scale, these numbers are a little irrelevant, and it depends on the architecture/bottlenecks. For instance, in terms of scale and specifically, cloud compute, it's about efficiently allocating resources and the relevant IO for each, generally (compute, disk, network). Does processing this image as fast as possible help with scale? Maybe? You need to examine bottlenecks and specifics to be sure. It could for instance overwhelm your network link and then you are constrained there, before you hit compute limits. Or you might be constrained by how quickly you can write to the disk.
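For concreteness, a small standard-library sketch (Unix-only, since it uses the resource module) that reports the same three numbers for the current Python process:

import resource
import time

start = time.monotonic()
sum(i * i for i in range(10 ** 7))   # stand-in workload

usage = resource.getrusage(resource.RUSAGE_SELF)
print('real %.3fs' % (time.monotonic() - start))
print('user %.3fs' % usage.ru_utime)   # CPU time spent in user space
print('sys  %.3fs' % usage.ru_stime)   # CPU time spent in the kernel on our behalf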
One potentially important aspect of this which no one mentions is the fact that your library (vips) will itself launch threads:
http://www.vips.ecs.soton.ac.uk/supported/current/doc/html/libvips/using-threads.html
When libvips calculates an image, by default it will use as many threads as you have CPU cores. Use vips_concurrency_set() to change this.
This explains the thing that initially surprised me the most: NodeJS should (to my understanding) be pretty much single-threaded, just like Python with its GIL, it being all about asynchronous processing and all.
So perhaps Python and Node bindings for vips just use different threading settings. That's worth investigating.
(that said, a quick look doesn't find any evidence of changes to the default concurrency levels in either library)
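If you want to rule it out, one way to pin the libvips worker count from Python is the documented VIPS_CONCURRENCY environment variable shown above, set before the binding initialises libvips (a sketch; pyvips as the binding and the file names are assumptions):

import os
os.environ['VIPS_CONCURRENCY'] = '1'   # must be set before libvips starts up

import pyvips                          # assumption: the Python binding in use

image = pyvips.Image.new_from_file('in.jpg', access='sequential')
image.resize(0.5).write_to_file('out.jpg')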

Console output consuming much CPU? (about 140 lines per second)

I am doing my bachelor's thesis, for which I wrote a program that is distributed over many servers and exchanges messages via IPv6 multicast and unicast. The network usage is relatively high, but I think it is not too high: I have 15 servers in my test, with 2 requests every second that go like this:
Server 1 requests information from servers 3-15 via multicast. Each of 3-15 must respond. If one response is missing after 0.5 sec, the multicast is resent, but only the missing servers must respond (so in most cases this is only one server).
Server 2 does exactly the same. If there are still missing results after 5 retries, the missing servers are marked as dead and the change is synced with the other server (1/2).
So there are 2 multicasts and 26 unicasts every second, which I don't think should be too much.
Servers 1 and 2 run Python web servers, which I use to trigger the request every second on each server (via a web client).
The whole scenario runs in a Mininet environment inside a VirtualBox Ubuntu VM that has 2 cores (max 2.8 GHz) and 1 GB RAM. While running the test, I see via htop that the CPUs are at 100% while the RAM is at 50%, so the CPU is the bottleneck here.
I noticed that after 2-5 minutes (1 minute = 60 * (2+26) messages = 1680 messages) there are too many missing results, causing too many resends while new requests are already coming in, so that the "management server" thinks the client servers (3-15) are down and deregisters them. After syncing this with the other management server, all client servers are marked as dead on both management servers, which is not true...
I am wondering if the problem could be my debug output. I am printing 3-5 messages for every message that is sent and received, so that's about (26 + 2) * 5 = 140 lines printed to the console per second (guessing 5 messages per sent/received message).
I use Python 2.6 for the servers.
So the question here is: can the console output slow down the whole system so much that simple requests take more than 0.5 seconds to complete, 5 times in a row? The request processing in my test is simple, with no complex calculations or anything like that; basically it is something like return request_param in ["bla", "blaaaa", ...] (a small list of 5 items).
If yes, how can I disable the output completely without having to comment out every print statement? Or is there even a way to output only lines that contain "Error" or "Warning"? (Not via grep, because by the time grep becomes active all the prints have already finished... I mean directly in Python.)
What else could cause my application to be that slow? I know this is a very generic question, but maybe someone already has some experience with mininet and network applications...
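One standard way to get that behaviour is to route messages through the logging module instead of print(), so console output can be silenced or limited to warnings with a single level setting; a minimal sketch:

import logging

logging.basicConfig(level=logging.WARNING)      # only WARNING and above reach the console
log = logging.getLogger(__name__)

log.debug('sent multicast to 13 servers')       # suppressed at this level
log.warning('missing response from server 7')   # still printed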
I finally found the real problem. It was not the prints (removing them improved performance a bit, but not significantly) but a thread that was using a shared lock. This lock was contended across multiple CPU cores, which made the whole thing very slow.
It even got slower the more cores I added to the executing VM, which was very strange...
Now the new bottleneck seems to be APScheduler... I always get messages like "event missed" because there is too much load on the scheduler. So that's the next thing to speed up... :)
