How can I profile a multithreaded program?

How can I profile a multithreaded program? - python

I have a program that is performing waaaay under par, and I would like to profile it. However, it is multithreaded, so I can't seem to find a good way to profile this thing. Any advice?
I've tried yappi, but it segfaults on OS X :(
EDIT: This is in python, sorry for putting it under profiling...

Are you multithreading or multiprocessing? If you are just multithreading, then that is the problem. Python currently has problems with multithreading on a multiprocessor system because of the Global Interpreter Lock (GIL). They are working on fixing it for Python 3.2 - at least so that your program will run as fast on a single core as on multiple cores.
If you aren't convinced take a look at the shootout results for the thread-ring program. Running with a single core is faster than running with quad cores.
Now, if you use multiprocessing instead, profiling can be difficult as well because then you have to run CProfiler from each separate process. There are some questions that point you in the right direction though.

Depending on how far you've come in your troubleshooting, there are some tools that might point you in the right direction.
"top" is a helpful start to show you if your problem is burning CPU time or simply waiting for stuff.
"dtruss -c" can show you where you spend time and what system calls takes most of your time.
Both these can give you a hint without knowing anything about python.
If you just want to use yappi, it isn't too much work to set up a virtualbox and install some sort of Linux on your machine. I find myself doing that from time to time when I want to try something.
There might of course be things I don't know about that makes it impossible or not worth the effort. Also, profiling on another OS running virtualized might not give the exact same results, but it might still be helpful.

Related

Python3 Search the virtual memory of a running windows process

begin TLDR;
I want to write a python3 script to scan through the memory of a running windows process and find strings.
end TLDR;
This is for a CTF binary. It's a typical Windows x86 PE file. The goal is simply to get a flag from the processes memory as it runs. This is easy with ProcessHacker you can search through the strings in the memory of the running application and find the flag with a regex. Now because I'm a masochistic geek I strive to script out solutions for CTFs (for everything really). Specifically I want to use python3, C# is also an option but would really like to keep all of the solution scripts in python.
Thought this would be a very simple task. You know... pip install some library written by someone that's already solved the problem and use it. Couldn't find anything that would let me do what I need for this task. Here are the libraries I tried out already.
ctypes - This was the first one I used, specifically ReadProcessMemory. Kept getting 299 errors which was because the buffer I was passing in was larger than that section of memory so I made a recursive function that would catch that exception, divide the buffer length by 2 until it got something THEN would read one byte at a time until it hit a 299 error. May have been on the right track there but I wasn't able to get the flag. I WAS able to find the flag only if I knew the exact address of the flag (which I'd get from process hacker). I may make a separate question on SO to address that, this one is really just me asking the community if something already exists before diving into this.
pymem - A nice wrapper for ctypes but had the same issues as above.
winappdbg - python2.x only. I don't want to use python 2.x.
haystack - Looks like this depends on winappdbg which depends on python 2.x.
angr - This is a possibility, Only scratched the surface with it so far. Looks complicated and it's on the to learn list but don't want to dive into something right now that's not going to solve the issue.
volatility - Looks like this is meant for working with full RAM dumps not for hooking into currently running processes and reading the memory.
My plan at the moment is to dive a bit more into angr to see if that will work, go back to pymem/ctypes and try more things. If all else fails ProcessHacker IS opensource. I'm not fluent in C so it'll take time to figure out how they're doing it. Really hoping there's some python3 library I'm missing or maybe I'm going about this the wrong way.

Ended up writing the script using the frida library. Also have to give soutz to rootbsd because his or her code in the fridump3 project helped greatly.

The way pyglet swaps front and back buffers (`flip()`, wrapper for OpenGL's wglSwapLayerBuffers) on Windows can be 100 times too slow

Out of the blue (although I might have missed some automated update), the flip() method of pyglet on my P.C. became about 100 times slower (my script goes from about 20 to 0.2 FPS, and profiling shows that flip() is to blame).
I don't understand this fully but since my OS is windows 10, the method seems to just be a way to run the wglSwapLayerBuffers OpenGL double-buffering cycle in python. Everything else seems to have a normal speed, including programs that use OpenGL. This has happened before and fixed itself after a restart, so I didn't really look further into it at the time.
Now, restarting doesn't change anything. I updated my GPU driver, I tried disabling vsync, I looked for unrelated processes that might use a lot of memory and/or GPU memory. I re-installed the latest stable version of pyglet.
Now I have no idea how to even begin troubleshooting this...
Here's a minimal example that prints me 0.2s instead of 20s.
from pyglet.gl import *
def timing(dt):
print(1/dt)
game_window = pyglet.window.Window(1,1)
if __name__ == '__main__':
pyglet.clock.schedule_interval(timing, 1/20.0)
pyglet.app.run()
(Within pyglet.app.run(), profiling shows me that it's the flip() method that takes basically all the time).
Edit: my real script, which displays frequently updated images using pyglet, causes no increase in GPU usage whatsoever (I also checked the effect of a random program (namely Minecraft) to make sure the GPU monitoring tool I use works, and it does cause an increase). I think this rules out the possibility that I somehow don't have enough computing power available due to some unrelated issue.

OK, I found a way to solve my issue in this google groups conversation about a different problem with the same method: https://groups.google.com/forum/#!topic/pyglet-users/7yQ9viOu75Y (the changes suggested in claudio canepa's reply, namely making flip() link to the GDI version of the same function instead of wglSwapLayerBuffers, brings things back to normal).
I'm still not sure why wglSwapLayerBuffers behaved so oddly in my case. I guess problems like mine are part of the reason why the GDI version is "recommended". However understanding why my problem is even possible would still be nice, if someone gets what's going on... And having to meddle with a relatively reliable and respected library just to perform one of its most basic tasks feels really, really dirty, there must be a more sensible solution.

python on HPC cluster computer

I asked a question very close to this, but it wasn't answered and since then I hope I have learned to better ask the question.
I was curious as to how run many jobs serially on a Cray XE6 machine. You usually qsub things with a ccmrun (for a serial job) or an aprun (instead of mpirun or mpiexec). I first wanted to use the Pool() function, but due to it not being SMP based hardware it would be limited to 32 processors. Even an mpi4py application of something like a pool wouldn't work, because I am not giving the main program all of the processors. I would be running that script 64 times if I were to say aprun -n 64 mpipool.py, whereas it does work if I do something like aprun -n 1 -d 32 pool.py.
I've had a look at the https://wiki.python.org/moin/ParallelProcessing website and was wondering if anyone had any experience running multiple serial jobs on a cluster computing machine with any of them. I did write an mpi4py code that basically had rank 0 doing all of the job selection, and then giving them out to the the other processors. It didn't want to play nice on the machine since I needed to use subprocess to launch the giant amount of C code. So, one last caveat is that it would have to play nice with subprocess.
I would like to have it look at the amount of nodes chosen, and then basically do something along the lines of:
ccmrun jobscheduler.py &
ccmrun jobrunner.py 63 & # given that I started the job with 64 processors. I may have to do a bash loop here, but that's no problem.
Once started I would want them to be able to communicate between one another, but without MPI I'm not sure of an efficient way of doing this. If anyone could get me started on the right path I would greatly appreciate it. Maybe doing pickle dumps and locking them and deleting them when a jobrunner picks it up. There might be a really simple way of doing this, but I'm very new to this.
Thanks!

I don't know anything about Cray machines but I'll take a stab at this anyway. I noticed you mentioned qsub which makes me think that the system is using PBS or Torque. Both seem to support Job Arrays which may be along the lines of what you are looking for.
Job Arrays would make the queue system responsible for job management. Each subjob would be assigned an array id out of a range you specify and would be assigned whatever resources you requested with -l. In Torque, '#PBS -l nodes=1' and '#PBS -t 1-64' would create 64 subjobs with indexes from 1 to 64 each being assigned a single node. Man pages and Google will be a good resource and from what I've seen Torque and PBS differ on syntax. If that doesn't work, you can look at using pbsdsh inside of a single, larger job.
Also, I want to mention that advice from strangers on the Internet will only take you so far. Your local admin may have limits or scheduling policies in place that you may limit your options. You may also be able to get some advice from the admin about other, proven ways that you can solve your problem.

Optimal way of matlab to python communication

So I am working on a Matlab application that has to do some communication with a Python script. The script that is called is a simple client software. As a side note, if it would be possible to have a Matlab client and a Python server communicating this would solve this issue completely but I haven't found a way to work that out.
Anyhow, after searching the web I have found two ways to call Python scripts, either by the system() command or editing the perl.m file to call Python scripts instead. Both ways are too slow though (tic tocing them to > 20ms and must run faster than 6ms) as this call will be in a loop that is very time sensitive.
As a solution I figured I could instead save a file at a certain location and have my Python script continuously check for this file and when finding it executing the command I want it to. Now after timing each of these steps and summing them up I found it to be incredibly much faster (almost 100x so for sure fast enough) and I cant really believe that, or rather I cant understand why calling python scripts is so slow (not that I have more than a superficial knowledge of the subject). I also found this solution to be really messy and ugly so just wanted to check that, first, is it a good idea and second, is there a better one?
Finally, I realize that the Python time.time() and Matlab tic, toc might not be precise enough to measure time correctly on that scale so also a reason why I ask.

Spinning up new instances of the Python interpreter takes a while. If you spin up the interpreter once, and reuse it, this cost is paid only once, rather than for every run.
This is normal (expected) behaviour, since startup includes large numbers of allocations and imports. For example, on my machine, the startup time is:
$ time python -c 'import sys'
real 0m0.034s
user 0m0.022s
sys 0m0.011s

Pythonistas, please help convert this to utilize Python Threading concepts

Update : For anyone wondering what I went with at the end -
I divided the result-set into 4 and ran 4 instances of the same program with one argument each indicating what set to process. It did the trick for me. I also consider PP module. Though it worked, it prefer the same program. Please pitch in if this is a horrible implementation! Thanks..
Following is what my program does. Nothing memory intensive. It is serial processing and boring. Could you help me convert this to more efficient and exciting process? Say, I process 1000 records this way and with 4 threads, I can get it to run in 25% time!
I read articles on how python threading can be inefficient if done wrong. Even python creator says the same. So I am scared and while I am reading more about them, want to see if bright folks on here can steer me in the right direction. Muchos gracias!
def startProcessing(sernum, name):
'''
Bunch of statements depending on result,
will write to database (one update statement)
Try Catch blocks which upon failing,
will call this function until the request succeeds.
'''
for record in result:
startProc = startProcessing(str(record[0]), str(record[1]))

Python threads can't run at the same time due to the Global Interpreter Lock; you want new processes instead. Look at the multiprocessing module.
(I was instructed to post this as an answer =p.)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.