Statistical Profiling in Python

I would like to know what the Python interpreter is doing in my production environments.
Some time ago I wrote a simple tool called live-trace which runs a daemon thread which collects stacktraces every N milliseconds.
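For reference, a sampler of that kind can be sketched in a few lines (a simplified illustration, not the actual live-trace code): a daemon thread periodically grabs every thread's current frame via sys._current_frames().

import sys
import threading
import time
import traceback

def sample_stacks(interval=0.010):
    # Every `interval` seconds, capture the stack of each running thread.
    while True:
        for thread_id, frame in sys._current_frames().items():
            stack = "".join(traceback.format_stack(frame))
            # ... aggregate the captured stacks here, e.g. count duplicates ...
        time.sleep(interval)

sampler = threading.Thread(target=sample_stacks)
sampler.daemon = True  # don't keep the process alive on exit
sampler.start()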
But signal handling in the interpreter itself has one disadvantage:
Although Python signal handlers are called asynchronously as far as the Python user is concerned, they can only occur between the “atomic” instructions of the Python interpreter. This means that signals arriving during long calculations implemented purely in C (such as regular expression matches on large bodies of text) may be delayed for an arbitrary amount of time.
Source: https://docs.python.org/2/library/signal.html
How could I work around the above constraint and get a stacktrace, even if the interpreter has been inside some C code for several seconds?
Related: https://github.com/23andMe/djdt-flamegraph/issues/5

I use py-spy with speedscope now. It is a very cool combination.
py-spy works on Windows/Linux/macOS, can output flame graphs on its own, and is actively developed; e.g. subprocess profiling support was added in October 2019.
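Because py-spy reads the target process's memory from the outside rather than relying on signals, it can sample even while the interpreter is stuck in C code. Typical invocations look something like this (the PID and file names are placeholders):

pip install py-spy
py-spy record --pid 12345 -o profile.svg                   # flame graph SVG
py-spy record --pid 12345 --format speedscope -o out.json  # open in speedscope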

Have you tried Pyflame? It's based on ptrace, so it shouldn't be affected by CPython's signal handling subtleties.

Maybe the perf tool, as described by Brendan Gregg, can help.

Related

Django 1.8 startup delay troubleshooting

I'm trying to discover the cause of delays in Django 1.8 startup, especially, but not only, when run in a debugger (WingIDE 5 and 6 in my case).
Minimal test case: the Django 1.8 tutorial "poll" example, completed just to the first point where 'manage.py runserver' works. All default configuration, using SQLite. Python 3.5.2 with Django 1.8.14, in a fresh venv.
From the command line, on Linux (Mint 18) and Windows (7-64), this may take as little as 2 seconds to reach the "Starting development server" message. But on Windows it sometimes takes 10+ seconds. And in the debugger on both machines, it can take 40 seconds.
One specific issue: by placing print statements at the beginning and end of setup() in django/__init__.py, I note that this function is called twice before the "Starting..." message, and again after that message; the first two calls each contribute half the delay. This suggests that Django is getting started three times. What is the purpose of that, or does it indicate a problem?
(I did find that I could get rid of one of the first two setup() calls using the runserver --noreload option. But why does it happen in the first place? And there's still a setup() call after the "Starting..." message.)
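For reference, the instrumentation amounts to something like the following sketch, placed at the top of manage.py instead of editing django/__init__.py (timed_setup is a made-up name):

import time
import traceback
import django

_original_setup = django.setup

def timed_setup():
    traceback.print_stack()  # show which code path is calling setup()
    start = time.time()
    _original_setup()
    print("django.setup() took %.2f s" % (time.time() - start))

django.setup = timed_setup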
To summarize the question:
-- Any insights into what might be responsible for the delay?
-- Why does Django need to start three times? (Or twice, even with --noreload).
A partial answer.
After some time with WingIDE's debugger, and some profiling with cProfile, I have located the main CPU-hogging issue.
During initial Django startup there's a cascade of imports, in which the module validators.py prepares some compiled regular expressions for later use. One in particular, URLValidator.regex, is complicated and also involves five instances of the unicode character set (variable ul). This causes re.compile to perform a large amount of processing, notably in sre_compile.py's _optimize_charset() and in a large number of calls to the fixup() function.
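That cost can be reproduced in isolation; a quick sketch, using the module path as it is in Django 1.8:

import cProfile

# Profiling just this import shows the time spent in sre_compile.py
# (_optimize_charset and fixup) triggered by compiling URLValidator.regex.
cProfile.run("import django.core.validators", sort="cumulative")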
As it happens, the particular combination of calls and data structures apparently hits a special slowness in the WingIDE 6.0b2 debugger. It's considerably faster in the WingIDE 5.1 debugger (though still much slower than when run from the command line). Not sure why yet, but Wingware is looking into it.
This doesn't explain the occasional slowness when launched from the command line on Windows; there's an outside chance it was waiting for a sleeping drive to wake up. Still observing.

How to trace random MemoryError in python script?

I have a python script, which is used to perform a lab measurement using several devices. The whole setup is rather involved, including communication over serial devices, API calls as well as the use of self-written and commercial drivers. In the end, however, everything boils down to two nested loops, which vary some parameters, collect data and write it to a file.
My problem is that I observe random occurrences of a MemoryError, typically after 10 hours, equivalent to ~15k runs of the loops. At the moment I have no idea where it comes from or how I can trace it further, so I would be happy for suggestions on how to work on my problem. My observations up to this point are as follows.
The error occurs at random states of the program. Different runs will throw the MemoryError at different lines of my script.
There is never any helpful error message. Python only says MemoryError without any error string. The traceback leads me to some point in the script where memory is needed (e.g. when building a list), but no specific instruction appears to be the problem.
My RAM is far from full. The Python process in question typically consumes some tens of MB of RAM when viewed in the task manager. In addition, the RAM usage appears to be stable for hours. Usually it increases slowly for some time, only to drop back down to the previous level quickly, which I interpret as the garbage collector kicking in periodically.
So far I have not found any indications of a memory leak. I used memory_profiler to trace the memory usage of my functions and found it to be stable. In addition, I followed this blog entry to observe what the garbage collector does in detail. Again, I could not find any hints of undeleted objects.
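For reference, the kind of garbage-collector instrumentation meant here is roughly the following (check_garbage is a made-up helper):

import gc

# Print collector statistics and keep uncollectable objects in gc.garbage.
gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_UNCOLLECTABLE)

def check_garbage():
    unreachable = gc.collect()
    print("gc: %d unreachable objects, %d in gc.garbage"
          % (unreachable, len(gc.garbage)))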
I am stuck on Win7 x86 due to a driver which will only work on a 32-bit system, so I cannot follow suggestions like this to move to a 64-bit version of Windows. Anyway, I do not see how this would help in my situation.
The iPython console from which the script is launched often behaves strangely after the error has occurred. Sometimes a new MemoryError is thrown even for very simple operations. Often the console is marked by Windows as "not responding" after some time. A menu pops up where, besides the usual options to wait for the process or to terminate it, there is a third option to "restore" the program (whatever that means). Doing so usually makes the console work normally again.
At this point I am somewhat out of ideas on how to proceed. The general recipe of commenting out parts of the script until it works is highly undesirable in my case. As stated above, each test run takes several hours, meaning a potential downtime of weeks for my lab equipment; going in that direction appears infeasible to me. Is there any more direct approach to learn what is crashing behind the scenes? How can I understand why Python apparently fails to malloc?
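One thing worth watching more directly than the task manager allows: a MemoryError despite plenty of free RAM often points to the 32-bit process's ~2 GB virtual address space being fragmented or exhausted. A minimal monitoring sketch (assuming the psutil package can be installed) that logs the address-space usage alongside the physical usage:

import os
import psutil

_proc = psutil.Process(os.getpid())

def log_memory(label=""):
    info = _proc.memory_info()
    # vms is the virtual address space in use; a 32-bit process gets
    # roughly 2 GB of it no matter how much physical RAM is free.
    print("%s rss=%.1f MB vms=%.1f MB"
          % (label, info.rss / 1e6, info.vms / 1e6))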

How to get a real callback from the OS in Python 3

I have written ActionScript and JavaScript, where adding a callback to invoke a piece of code is a normal, everyday pattern.
But in Python it seems not quite so easy; I can hardly find things written in callback style. I mean a real callback, not a fake one. Here's a fake callback example:
Given a list of files for download, you can write:
urls = []

def downloadfile(url, callback):
    # download the file ...
    callback()

def downloadNext():
    if urls:
        downloadfile(urls.pop(), downloadNext)

downloadNext()
This works, but it would eventually hit the maximum recursion limit, while a real callback won't.
A real callback, as far as I understand, cannot come from the program itself; it must come from the physical level, like the CPU clock or some hardware I/O state change. That raises an interrupt: the CPU interrupts the current flow of execution and checks whether the runtime has registered any code for this interrupt; if so, it runs it. The OS wraps this as a signal or an event or something else, and finally passes it to the application. (If I'm wrong, please point it out.) This avoids piling the call stack up until it overflows; otherwise you drop into infinite recursion.
There is something like coroutines in Python to handle multiple tasks, but they must be used very carefully: if any one routine blocks, all tasks are blocked.
There are third-party libs like Twisted or gevent, but they seem very troublesome to get and install, platform-limited, and not well supported in Python 3; that's not good for writing a simple app and distributing it.
multiprocessing is heavyweight and only works on Linux.
threading, because of the GIL, is never the first choice, and it seems like a pseudo solution anyway.
Why doesn't Python provide an implementation in the standard library? And is there another easy way to get the real callback I want?
Your example code is just a complicated way of sequentially downloading all files.
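Stripped of the recursion, it amounts to this sketch (reusing the question's own names):

# The recursive downloadNext() above is, in effect, a plain sequential loop:
while urls:
    downloadfile(urls.pop(), callback=lambda: None)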
If you really want to do asynchronous downloading, using a multiprocessing.Pool, especially the Pool.map_async member function, is the best way to go. Note that this uses callbacks.
According to the documentation for multiprocessing:
"It runs on both Unix and Windows."
But it is true that multiprocessing on MS Windows has some extra restrictions.
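A sketch of the suggested approach (download_file and the URL list are hypothetical):

from multiprocessing import Pool

def download_file(url):
    # ... fetch `url` and write it to disk ...
    return url

def on_done(results):
    # Invoked once, with the list of all results, when the pool finishes.
    print("finished:", results)

if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b"]
    pool = Pool(processes=4)
    pool.map_async(download_file, urls, callback=on_done)
    pool.close()
    pool.join()  # on_done has fired by the time join() returns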

What can change my floating point control word behind my back?

I have a 32 bit Windows application, written primarily in Delphi, which performs floating point numerical simulations using the 8087 FPU. I have recently added the ability to link in external Python code using the Python API through python2x.dll. This recent change has led to some very strange behaviour.
The application has a batch mode of operation where it performs multiple simulations in parallel to take advantage of multi-core architectures. As soon as Python code has been executed in the process, I start to see changes to the 8087 control word on different threads. I've checked this very carefully, and I have observed the control word changing even in a thread which has never called into the Python DLL.
I know this sounds fantastical, but, as I have discovered, there are mechanisms for this behaviour to manifest. I have learnt about signals. I first hypothesised that the Python DLL was setting process wide signal handlers (by calling signal()) and these signal handlers were responsible for changing the control word. For example, a thread, unrelated to the Python code, could perhaps cause SIGFPE and that may, in turn, modify the control word.
I have rather come to the conclusion that signal() is not the mechanism. I arranged to execute the Python code at startup. Then I set all of the signal handlers back to SIG_DFL. Then I started the simulations. But still the control word changes occurred.
My question (finally) is whether anyone knows of another mechanism by which the control word could be changed in such a manner. I'm looking for interrupts, APCs etc., I think!
Update
The control word is being changed to 0x037F which is the Intel default value. This differs from the MSVC/Windows default of 0x027F. I hypothesise that something is calling FPINIT.
I also discovered Py_InitializeEx which allows the caller to stop Python setting signal handlers. The control word changes occur even if I use this approach to initialisation so I'm even more convinced that is not the mechanism.
For example, a DllMain call with the DLL_THREAD_ATTACH flag; see MSDN.
Update
I have found a link to a similar problem: DLL Load "Poisons" FPU Control Word for New Threads. But yes, it is about threads created after the DLL load.
If I remember correctly, that's Delphi's problem. There are some discussions of the issue here and here. I remember bumping into it when trying to write some VST plugins in Delphi.
I have seen a case like this where the printer driver of the default printer changed the control word behind my back. When I changed the default printer, the problem went away.
To circumvent this problem I set the control word back to the default value at appropriate places with:
_control87(_CW_DEFAULT, _CW_DEFAULT);  /* restore the run-time's default FP control word */
I have also seen the same problem on all machines of a customer that had Norton Security 2011 installed, but the problem went away with the fix for the printer driver, so I'm not really sure if Norton was really the cause.

Numeric GUI bottleneck

I've made a GUI to set up and start a numerical integrator using PyQt4, Wing, Qt, and Python 2.6.6, on my Mac. The thing is, when I run the integrator from the GUI, it takes many times longer than when I crudely run the integrator from the command line.
As an example, a 1000 year integration took 98 seconds on the command line and ~570 seconds from the GUI.
In the GUI, the integration runs in a thread and then returns. It uses a queue to communicate back to the GUI.
Does anyone have any ideas as to where the bottleneck is? I suspect that others may be experiencing something like this just on a smaller scale.
t = threading.Thread(target=self.threadsafe_start_thread, args=(self.queue, self.selected))
t.start()
In general it is not a good idea to use Python threads within a PyQt application. Instead use a QThread.
Both Python threads and QThreads use the same underlying mechanisms, but they don't play well together. I don't know if this will solve your problem or not, but it might be part of the issue.
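A minimal sketch of the QThread variant (PyQt4; IntegratorThread and run_integration are made-up names):

from PyQt4 import QtCore

class IntegratorThread(QtCore.QThread):
    # Emitted with the integrator's result when the run finishes.
    finished_with_result = QtCore.pyqtSignal(object)

    def __init__(self, selected, parent=None):
        QtCore.QThread.__init__(self, parent)
        self.selected = selected

    def run(self):
        result = run_integration(self.selected)  # hypothetical worker
        self.finished_with_result.emit(result)

The GUI would connect a slot to finished_with_result instead of polling a queue.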
Is your thread code mostly Python code? If yes, then you might be a victim of the Global Interpreter Lock.
