Here is a way to measure the peak memory usage of the current process since it started:
import os
import psutil

process = psutil.Process(os.getpid())
process.memory_full_info().peak_wset
But what if I want to make a few measurements for different parts (functions) of the program? How can I get the memory used by the program at any desired moment, so I can check the difference before and after? Maybe there is a way to reset peak_wset?
Currently you no longer need os.getpid() when inspecting the current process; just use psutil.Process().
1) To measure whether the peak memory increased (it never decreases) across a function call, invoke this before and after the call and take the difference:
On Windows:
psutil.Process().memory_info().peak_wset
psutil.Process().memory_full_info().peak_wset
On Linux:
I read here that resource.getrusage(resource.RUSAGE_SELF).ru_maxrss would do the trick, but I haven't tested it.
2) To measure the current memory variation across a function call, invoke this before and after the call and take the difference (see the sketch after this list):
On Windows and Linux:
psutil.Process().memory_info().rss
psutil.Process().memory_full_info().rss
psutil.Process().memory_full_info().uss
From docs:
memory_full_info() returns the same information as memory_info(), plus, on some platforms (Linux, macOS, Windows), also provides additional metrics (USS, PSS and swap). The additional metrics provide a better representation of “effective” process memory consumption (in case of USS) as explained in detail in this blog post. It does so by passing through the whole process address space. As such it usually requires higher user privileges than memory_info() and is considerably slower. On platforms where extra fields are not implemented this simply returns the same metrics as memory_info().
uss (Linux, macOS, Windows): aka “Unique Set Size”, this is the memory which is unique to a process and which would be freed if the process was terminated right now.
Note: uss is probably the most representative metric for determining how much memory is actually being used by a process.
3) To measure the specific memory a function call used during its execution, before any garbage collection takes place, you need a memory profiler.
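For cases 1) and 2), a minimal sketch might look like this (work() is a hypothetical stand-in for the function you want to measure; swap rss for peak_wset on Windows if you want case 1):

import psutil

def measure_rss_delta(func, *args, **kwargs):
    # Return (result, RSS delta in bytes) across a single call.
    proc = psutil.Process()
    before = proc.memory_info().rss
    result = func(*args, **kwargs)
    after = proc.memory_info().rss
    return result, after - before

# usage with a hypothetical work() function:
# result, delta = measure_rss_delta(work)
# print('RSS grew by %.1f MiB' % (delta / 2.**20))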
I've been slowly learning to use the multiprocessing library in Python these last few days, and I've come to a point where I'm asking myself this question, to which I can't find an answer.
I understand that the answer might vary depending on the application, so I'll explain what my application is.
I've created a scheduler in my main process that controls when multiple processes execute (these processes are spawned earlier, loop continuously, and execute code when my scheduler raises a flag in a shared Value). Using counters in my scheduler, I can have multiple processes executing code at different frequencies (in the 100-400 Hz range), and they are all perfectly synchronized.
For example, one process executes a dynamic model of a quadcopter (an ODE) at 400 Hz and updates the quadcopter's state. My other processes (command generation and trajectory generation) run at lower frequencies (200 Hz and 100 Hz) but require the updated state. I currently see 2 methods of doing this:
With Pipes: This requires separate Pipes for the dynamics/control and dynamics/trajectory connections. Furthermore, I need the control and trajectory processes to use the latest calculated quadcopter's state, so I need to flush the Pipes until the last value in them. This works, but doesn't look very clean.
With a shared Value/Array: I would only need one Array for the state; my dynamics process would write to it, while my other processes would read from it. I would probably have to implement locks (can I read a shared Value/Array from 2 processes at the same time without a lock?). This hasn't been tested yet, but would probably be cleaner (a rough sketch of what I have in mind is below).
I've read around that it is bad practice to use shared memory too much (why is that?). Yes, I'll be updating it at 400 Hz and reading it at 200 and 100 Hz, but it's not going to be such a large array (10-ish floats or doubles). However, I've also read that shared memory is faster than Pipes/Queues, and I would like to prioritize speed in my code, if it's not too much of an issue to use shared memory.
Mind you, I'll have to send generated commands to my dynamics process (another 5-ish floats), and generated desired states to my control process (another 10-ish floats), so that's either more shared Arrays, or more Pipes.
So I was wondering, for my application, what are the pros and cons of both methods. Thanks!
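Here is a rough sketch of the shared-Array option I have in mind (the names and the placeholder update are just illustrative, not my actual code):

import ctypes
import multiprocessing as mp

# 10-ish doubles for the quadcopter state; lock=True wraps the array in a lock
state = mp.Array(ctypes.c_double, 10, lock=True)

def dynamics_step(state):
    # 400 Hz writer: update the state under the array's lock
    with state.get_lock():
        for i in range(len(state)):
            state[i] = 0.0  # placeholder for the ODE state update

def control_step(state):
    # 200 Hz reader: take a consistent snapshot of the latest state
    with state.get_lock():
        return list(state)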
I have implemented a Kalman filter. I want to find out how much CPU energy is consumed by my script. I have checked other posts on Stack Overflow and, following them, I downloaded the psutil library. Now I am unsure where to put the statements to get the correct answer. Here is my code:
import os
import psutil

if __name__ == "__main__":
    # kalman code
    pid = os.getpid()
    py = psutil.Process(pid)
    current_process = psutil.Process()
    memoryUse = py.memory_info()[0] / 2.**30  # memory use in GB...I think
    print('memory use:', memoryUse)
    print(current_process.cpu_percent())
    print(psutil.virtual_memory())  # physical memory usage
Please inform whether I am headed in the right direction or not.
The above code generated following results.
('memory use:', 0.1001129150390625)
0.0
svmem(total=6123679744, available=4229349376, percent=30.9, used=1334358016, free=3152703488, active=1790803968, inactive=956125184, buffers=82894848, cached=1553723392, shared=289931264, slab=132927488)
Edit: Goal: find out energy consumed by CPU while running this script
I'm not sure whether you're trying to find the peak CPU usage during execution of that #kalman code, or the average, or the total, or something else.
But you can't get any of those by calling cpu_percent() after it's done. You're just measuring the CPU usage of calling cpu_percent().
Some of these, you can get by calling cpu_times(). This will tell you how many seconds were spent doing various things, including total "user CPU time" and "system CPU time". The docs indicate exactly which values you get on each platform, and what they mean.
In fact, if this is all you want (including only needing those two values), you don't even need psutil; you can just use os.times() in the standard library.
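For example, a minimal sketch wrapping your Kalman code with cpu_times() (run_kalman() here is just a hypothetical stand-in for your #kalman code):

import psutil

proc = psutil.Process()
before = proc.cpu_times()

run_kalman()  # hypothetical stand-in for the Kalman filter code

after = proc.cpu_times()
print('user CPU seconds:', after.user - before.user)
print('system CPU seconds:', after.system - before.system)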
Or you can just use the time command from outside your program, by running time python myscript.py instead of python myscript.py.
For the others, you need to call cpu_percent() while your code is running. One simple solution is to run a background process with multiprocessing or concurrent.futures that just calls cpu_percent() on your main process. To get anything useful, the child process may need to call it periodically, aggregating the results (e.g., to find the maximum), until it's told to stop, at which point it can return the aggregate.
Since this is not quite trivial to write, and definitely not easy to explain without knowing how much familiarity you have with multiprocessing, and there's a good chance cpu_times() is actually what you want here, I won't go into details unless asked.
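For a rough idea only, such a background sampler might look something like this (the names and the queue-based plumbing are just illustrative, not a drop-in solution):

import os
import psutil
from multiprocessing import Process, Queue

def sample_cpu(pid, stop_q, result_q, interval=0.1):
    # Poll cpu_percent() of the target process until told to stop.
    proc = psutil.Process(pid)
    proc.cpu_percent(None)  # prime the counter; the first call always returns 0.0
    peak = 0.0
    while stop_q.empty():
        peak = max(peak, proc.cpu_percent(interval))
    result_q.put(peak)

if __name__ == '__main__':
    stop_q, result_q = Queue(), Queue()
    sampler = Process(target=sample_cpu, args=(os.getpid(), stop_q, result_q))
    sampler.start()

    # ... run the Kalman filter code here ...

    stop_q.put(True)  # tell the sampler to stop
    print('peak CPU %:', result_q.get())
    sampler.join()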
I have been through other answers on SO about real, user and sys times. In this question, apart from the theory, I am interested in understanding the practical implications of the times reported by two different processes achieving the same task.
I have a Python program and a NodeJS program: https://github.com/rnanwani/vips_performance. Both work on a set of input images and process them to obtain different outputs, and both use libvips implementations.
Here are the times for the two
Python
real 1m17.253s
user 1m54.766s
sys 0m2.988s
NodeJS
real 1m3.616s
user 3m25.097s
sys 0m8.494s
The real time (the wall clock time, as per other answers) is lower for NodeJS, which as per my understanding means that the entire process, from input to output, finishes much quicker on NodeJS. But the user and sys times are very high compared to Python. Also, using the htop utility, I see that the NodeJS process has a CPU usage of about 360% during the entire run, maxing out the 4 cores. Python, on the other hand, has a CPU usage between 250% and 120% during the entire run.
I want to understand a couple of things
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
My guess would be that node is running more than one vips pipeline at once, whereas python is strictly running one after the other. Pipeline startup and shutdown is mostly single-threaded, so if node starts several pipelines at once, it can probably save some time, as you observed.
You load your JPEG images in random access mode, so the whole image will be decompressed to memory with libjpeg. This is a single-threaded library, so you will never see more than 100% CPU use there.
Next, you do resize/rotate/crop/jpegsave. Running through these operations: resize will thread well, with the CPU load increasing as the square of the reduction; the rotate is too simple to have much effect on runtime; and the crop is instant. Although the jpegsave is single-threaded (of course), vips runs it in a separate background thread fed from a write-behind buffer, so you effectively get it for free.
I tried your program on my desktop PC (six hyperthreaded cores, so 12 hardware threads). I see:
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m2.907s
user 0m9.744s
sys 0m0.784s
That looks like we're seeing 9.7 / 2.9, or about a 3.4x speedup from threading, but that's very misleading. If I set the vips threadpool size to 1, you see something closer to the true single-threaded performance (though it still uses the jpegsave write-behind thread):
$ export VIPS_CONCURRENCY=1
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m18.160s
user 0m18.364s
sys 0m0.204s
So we're really getting 18.1 / 2.97, or a 6.1x speedup.
Benchmarking is difficult and real/user/sys can be hard to interpret. You need to consider a lot of factors:
Number of cores and number of hardware threads
CPU features like SpeedStep and TurboBoost, which will clock cores up and down depending on thermal load
Which parts of the program are single-threaded
IO load
Kernel scheduler settings
And I'm sure many others I've forgotten.
If you're curious, libvips has its own profiler, which can help give more insight into the runtime behaviour. It can show you graphs of the various worker threads: how long they spend in synchronisation, how long in housekeeping, how long actually processing your pixels, when memory is allocated, and when it finally gets freed again. There's a blog post about it here:
http://libvips.blogspot.co.uk/2013/11/profiling-libvips.html
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
It doesn't necessarily mean they utilise the processor(s) more efficiently.
The higher user time means that Node is using more user-space processor time and, in turn, completes the task quicker. As stated by Luke Exton, the CPU is spending more time on "Code you wrote/might look at".
The higher sys time means there is more context switching happening, which makes sense given your htop utilisation numbers. It means the scheduler (in the kernel) is jumping between operating-system actions and user-space actions; this is the time spent finding a CPU to schedule the task onto.
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
The question of practical implications is a long one, and has many caveats. I would assume from the Python vs Node numbers that the Python threads are longer-running, and in turn doing more processing inline. Another thing to note is the GIL in Python: essentially, Python is a single-threaded application and you can't easily break out of this. This could be a contributing factor to the Node implementation being quicker (using real threads).
The Node code appears to be correctly threaded, splitting many tasks out. The advantages of a highly threaded application have a tipping point, where you spend MORE time trying to find a free CPU for a new thread than actually doing the work. When this happens, your Python implementation might start being faster again.
The higher user+sys time means that the process had more running threads and, as you noticed from the 360% figure, it used almost all the available CPU resources of your 4 cores. That means the NodeJS process is already limited by the available CPU resources and unable to process more requests. Also, any other CPU-intensive processes you eventually run on that machine will hit your NodeJS process. On the other hand, the Python process does not take all the available CPU resources and could probably scale with the number of requests.
So these times are not reliable in and of themselves; they say how long the process took to perform an action on the CPU. This is coupled very tightly to whatever else was happening at the same time on that machine, and can fluctuate wildly based entirely on physical resources.
In terms of these times specifically:
real = Wall Clock time (Start to finish time)
user = Userspace CPU time (i.e. Code you wrote/might look at) e.g. node/python libs/your code
sys = Kernel CPU time (i.e. Syscalls, e.g Open a file from the OS.)
Specifically, a small real time means it actually finished faster. Does that mean it did it better? For sure, no. There could simply have been less happening on the machine at the same time, for instance.
In terms of scale, these numbers are a little irrelevant; it depends on the architecture and the bottlenecks. For instance, at scale, and specifically in cloud compute, it's about efficiently allocating resources and the relevant IO for each (compute, disk, network). Does processing this image as fast as possible help with scale? Maybe? You need to examine the bottlenecks and specifics to be sure. It could, for instance, overwhelm your network link, and then you are constrained there before you hit compute limits. Or you might be constrained by how quickly you can write to disk.
One potentially important aspect of this, which no one has mentioned, is the fact that your library (vips) will itself launch threads:
http://www.vips.ecs.soton.ac.uk/supported/current/doc/html/libvips/using-threads.html
When libvips calculates an image, by default it will use as many threads as you have CPU cores. Use vips_concurrency_set() to change this.
This explains the thing that initially surprised me the most: NodeJS should (to my understanding) be pretty much single-threaded, just like Python with its GIL, it being all about asynchronous processing and all.
So perhaps Python and Node bindings for vips just use different threading settings. That's worth investigating.
(That said, a quick look doesn't find any evidence of changes to the default concurrency levels in either library.)
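For what it's worth, one way to experiment from the Python side is the VIPS_CONCURRENCY environment variable used in the benchmark above; a small sketch, assuming the Python code uses the pyvips binding (the variable must be set before libvips is first loaded):

import os

# force a single worker thread in the libvips threadpool
# (must be set before the library is imported/loaded)
os.environ['VIPS_CONCURRENCY'] = '1'

import pyvips  # assumes the Python side uses the pyvips binding

image = pyvips.Image.new_from_file('in.jpg', access='random')
image = image.resize(0.5)
image.write_to_file('out.jpg')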
I have a batch file that launches a few executables (.exe and .py Python scripts) to process some data.
With start /affinity X mybatch.bat
it works as it should, but only if X equals 0, 2, 4 or 8 (the individual cores).
But if I use a multi-core X like 15 or F or 0xF (meaning, in my opinion, all 4 cores), it will still run only on the first core.
Does it have to do with the fact that the batch file is calling .exe files that maybe cannot be affinity-controlled this way?
OS: Windows 7 64-bit
This is more of an answer to a question that arose in comments, but I hope it might help. I have to add it as an answer only because it grew too large for the comment limits:
There seems to be a misconception about two things here: what "processor affinity" actually means, and how the Windows scheduler actually works.
What SetProcessAffinityMask(...) means is "which processors can this process (i.e. all threads within the process) run on,"
whereas
SetThreadAffinityMask(...) is distinctly thread-specific.
The Windows scheduler (at the most basic level) makes absolutely no distinction between threads and processes: a "process" is simply a container that holds one or more threads. IOW (and over-simplified), there is no such thing as a process to the scheduler; "threads" are the schedulable things, and processes have nothing to do with this ("processes" are more about life-cycle-management issues: open handles, resources, etc.).
If you have a single-threaded process, it does not matter much what you set the "process" affinity mask to: that one thread will be scheduled by the scheduler (onto whatever masked processors) according to 1) which processor it was last bound to (the ideal case, with less overhead), 2) whichever processor is next available for a given runnable thread of the same priority (more complicated than this, but that's the general idea), and 3) possibly transient issues around priority inversion, waitable objects, kernel APC events, etc.
So to answer your question (much more long-windedly than expected):
"But if I will use a multicore X like 15 or F or 0xF (meaning in my opinion all 4 cores) it will still run only on the first core"
What I said earlier about the scheduler attempting to use the most-recently-used processor is important here: if you have a (or an essentially) single-threaded process, the scheduling algorithm takes the most optimistic approach: switch back to the previously-bound CPU (likely cheaper for the CPU/main-memory caches, prior branch-prediction state, etc.). This explains why you'll see an app (regardless of process-level affinity) with only one thread (again, caveats apply here) seemingly "stuck" to one CPU/core.
So:
What you are effectively doing with the /affinity X switch is:
1) constraining the scheduler to only schedule your threads on a subset of CPU cores (i.e. not all of them),
2) limiting them to a subset of what the kernel scheduler considers "available for the next runnable-thread switch-to", and
3) if they are not multithreaded apps (capable of taking advantage of that), "more cores" does not really help anything; you might just be bouncing that single thread of execution around to different cores (although the scheduler tries to minimize this, as described above).
That is why your threads are "sticky" - you are telling the scheduler to make them so.
AFAIK /AFFINITY takes a hex value as its parameter. Here's more information on this topic: Set affinity with start /AFFINITY command on Windows 7
EDIT: 0xF should be the correct parameter for you, as you want to use all 4 CPUs, which requires the bit mask 00001111, i.e. F in hex.
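As an aside, if the .py programs themselves want to check or change their affinity, psutil exposes the same process-level mask; a small sketch (psutil's cpu_affinity() is supported on Windows and Linux):

import psutil

p = psutil.Process()          # the current process
print(p.cpu_affinity())       # e.g. [0, 1, 2, 3]
p.cpu_affinity([0, 1, 2, 3])  # allow all 4 cores, equivalent to mask 0xF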
I have a Python script which reads data from a DB and loads it into a CSV file as below:
with open(filePath, 'a') as myfile:
    myfile.write(myline)
Once the file is ready, my script initiates SQLLDR via subprocess.call(...) and loads all of it.
I have to do all of the above steps with very low priority. I have set os.nice(19) and passed it to all my subprocess.call(...) invocations, which looks good for step 2 above.
How can I set "niceness" while creating and writing files on the server? Is it possible to pass os (after setting os.nice(19)) while writing/reading files, so that "nice" makes sure they run with low priority?
By default, on Linux > 2.6, the IO priority is computed from the nice value, meaning that a low-priority process will automatically have a lower IO impact; but you can specify the IO priority of a process with the ionice command if you want to set it to another value.
In general, the nice value affects the scheduling of your process on the CPU (i.e. the share of CPU cycles you get and your priority), so it will hardly have any impact on a process that just writes data to a file. Why not run the process from the get-go with an appropriately low nice priority?
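If you'd rather set the IO priority from inside the script than via the ionice command, psutil exposes the same knob; a minimal sketch, Linux-only (IOPRIO_CLASS_IDLE means the process only gets disk time when nothing else needs it):

import os
import psutil

os.nice(19)  # lowest CPU priority; inherited by subprocess.call(...) children

p = psutil.Process()
p.ionice(psutil.IOPRIO_CLASS_IDLE)      # lowest IO priority (Linux only)
# or a less drastic setting: best-effort class, lowest level
# p.ionice(psutil.IOPRIO_CLASS_BE, value=7)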
As it says in the manual for setpriority(2) upon which nice is built:
A child created by fork(2) inherits its parent's nice value. The nice value is preserved across execve(2).
so any subprocesses will inherit the current nice setting. Although, as Alexander notes, you won't see much difference for a process that spends most of its time waiting for disk writes to complete.