IronPython vs Python speed with numpy

I have some Python source code that manipulates lists of lists of numbers (say about 10,000 floating point numbers) and does various calculations on them, including, for example, a lot of calls to numpy.linalg.norm.
Run time had not been an issue until we recently started calling this code from a C# UI (running the Python code from C# via IronPython). I extracted a set of function calls (doing the things described in the first paragraph) and found that this code takes about 4x longer to run in IronPython than in Python 2.7 (and that is after excluding the startup/setup time in C#/IronPython). I'm using a C# Stopwatch around the repeated IronPython calls from C#, and the timeit module around an execfile in Python 2.7 (so the Python timing includes more operations, like loading the file and creating the objects, whereas the C# timing doesn't). The former takes about 4.0 seconds while the latter takes about 0.9 seconds.
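For reference, the Python 2.7 side of the measurement looks roughly like this (the script name here is a placeholder):

import timeit

# Time one full run of the script, including file loading and object creation.
print timeit.timeit("execfile('calc_script.py')", number=1)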
Would you expect this kind of difference? Any ideas how I might resolve this issue? Other comments?
Edit:
Here is a simple example of code that runs about 10x slower on IronPython on my machine (4 seconds in Python 2.7 and 40 seconds in IronPython):
import numpy as np

n = 700
for i in range(n - 1):
    for j in range(i, n):
        dist = np.linalg.norm(np.array([i, i, i]) - np.array([j, j, j]))

You're using NUMPY?! You're lucky it works in IronPython at all! The support is being added literally as we speak!
To be exact, there's a CPython-extension-to-IronPython interface project, and there's a native CLR port of numpy. I don't know which one you're using, but both ways are orders of magnitude slower than working with the C version in CPython.
UPDATE:
The SciPy for IronPython port by Enthought that you're apparently using looks abandoned: the last commits in the linked repos are a few years old, and it's missing from http://www.scipy.org/install.html, too. Judging by the article, it was a partial port with the interface in .NET and the core in C, linked by a custom layer. The previous paragraph applies to it as well.
Using the information from Faster alternatives to numpy.argmax/argmin which is slow, you may get some speedup if you limit how much data passes back and forth between the CLR and the C core.
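For instance, the example loop above can be restructured so the work stays inside numpy's C core; a sketch (not benchmarked on IronPython) that builds the points once and computes all pairwise distances in three array operations instead of ~245,000 tiny ones:

import numpy as np

n = 700
points = np.repeat(np.arange(n, dtype=float), 3).reshape(n, 3)  # row i is [i, i, i]
diffs = points[:, None, :] - points[None, :, :]                 # shape (n, n, 3)
dists = np.sqrt((diffs ** 2).sum(axis=-1))                      # all pairwise norms at once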

Related

Py_EndInterpreter in Python 3 is very slow

The application I am working on recently migrated from embedding Python 2.7 to Python 3.8.
We noticed a significant slowdown when calling Py_EndInterpreter in Python 3.8 when using many sub-interpreters.
Looking at the CPU usage I can see that all the time is spent doing garbage collection.
Py_EndInterpreter -> PyImport_Cleanup -> _PyGC_CollectNoFail -> collect.
99% of the CPU time is spent in the collect method of _PyGC_CollectNoFail
When there are 500 sub-interpreters, each call to Py_EndInterpreter takes about 2 seconds, for a total of ~3 minutes to end all 500 sub-interpreters.
By comparison, in Python 2.7 each call to Py_EndInterpreter takes 1 or 2 ms regardless of how many sub-interpreters are alive, for a total of ~500 ms to close all of them.
When using few sub-interpreters (fewer than 20), the performance is almost identical between Python 2.7 and 3.8.
I tried looking at other applications using many sub-interpreters, but it seems to be a very rare use case and I could not find anyone else with the same issue.
Is anyone else using many sub-interpreters and having similar trouble?
It seems that currently my options are:
Take the performance hit...
Leak a bunch of memory and not call Py_EndInterpreter
Fundamentally change how my application embeds python and not use sub-interpreters
??

If I have a normal Python script, how can I get it to run on my computer's GPU, not CPU?

Assume I have a really basic script that requires a lot of calculation:
c = 2
result = 0
for i in range(0, 10000):
    c += 5
    c = i * c
print(c)  # just added this, sorry for confusion!
...takes about 15 seconds in IDLE on my MacBook Pro.
How can I get this exact script to run on the GPU instead of the CPU for faster results? Also, how would the code need to change (if at all) to work on a GPU?
UPDATE: sorry, I meant 15 seconds with the print statement at the end there. Turns out this is a bad example because IDLE executes it unusually slowly; I just tried it in Terminal and it was instant.
I would recommend looking into using something like gnumpy if you're mainly looking to do simple but repetitive mathematical calculations.
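For instance, a rough sketch of the gnumpy style (this assumes gnumpy and its cudamat backend are installed; garray is gnumpy's GPU-resident array type):

import numpy as np
import gnumpy as gpu

a = gpu.garray(np.random.rand(1000, 1000))  # copy the data to the GPU
b = gpu.garray(np.random.rand(1000, 1000))
c = a.dot(b) + a * 2.0                      # matrix multiply and elementwise ops run on the GPU
result = c.as_numpy_array()                 # copy the result back to the CPU

Note that the work has to be expressed as operations on whole arrays; a scalar loop like the one above gains nothing from a GPU.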

Optimal way of Matlab-to-Python communication

So I am working on a Matlab application that has to do some communication with a Python script. The script being called is a simple client program. As a side note, if it were possible to have a Matlab client and a Python server communicate, that would solve this issue completely, but I haven't found a way to make that work.
Anyhow, after searching the web I have found two ways to call Python scripts: either via the system() command, or by editing the perl.m file to call Python scripts instead. Both ways are too slow, though (tic/toc puts them at > 20 ms, and the call must run in under 6 ms), as this call sits in a very time-sensitive loop.
As a workaround I figured I could instead save a file at a certain location and have my Python script continuously check for this file, executing the command when it finds it. After timing each of these steps and summing them up, I found this to be very much faster (almost 100x, so certainly fast enough), and I can't really believe that, or rather I can't understand why calling Python scripts is so slow (not that I have more than superficial knowledge of the subject). I also find this solution really messy and ugly, so I wanted to check: first, is it a good idea, and second, is there a better one?
Finally, I realize that Python's time.time() and Matlab's tic/toc might not be precise enough to measure time correctly on that scale, which is another reason I ask.
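For what it's worth, Python's timeit.default_timer picks the best-resolution wall clock for each platform (time.clock on Windows, time.time elsewhere), so sub-millisecond intervals are measurable; a minimal sketch:

import timeit

t0 = timeit.default_timer()
# ... operation under test ...
t1 = timeit.default_timer()
print (t1 - t0) * 1000.0, "ms"  # elapsed milliseconds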
Spinning up new instances of the Python interpreter takes a while. If you spin up the interpreter once, and reuse it, this cost is paid only once, rather than for every run.
This is normal (expected) behaviour, since startup involves a large number of allocations and imports. For example, on my machine the startup time is:
$ time python -c 'import sys'
real    0m0.034s
user    0m0.022s
sys     0m0.011s
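If you do want the Matlab-client/Python-server setup you mention, a plain local socket keeps one interpreter alive for the whole session. A minimal sketch of the Python side (the port number and one-message-per-command protocol are arbitrary choices; the Matlab side would connect with tcpclient or similar):

import socket

# Keep one Python process alive and let Matlab send it commands over a
# local TCP connection instead of spawning a new interpreter per call.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 5678))   # arbitrary local port
server.listen(1)

conn, _ = server.accept()
while True:
    data = conn.recv(1024)         # one short command per message
    if not data:
        break                      # client closed the connection
    # ... perform the time-sensitive work for this command here ...
    conn.sendall(b"done\n")        # acknowledge back to Matlab
conn.close()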

Running Bulk Synchronous Parallel Model (BSP) in Python

The BSP parallel programming model has several benefits: the programmer need not explicitly care about synchronization, deadlocks become impossible, and reasoning about speed becomes much easier than with traditional methods. There is a Python interface to the BSPlib in SciPy:
import Scientific.BSP
I wrote a little program to test BSP. The program is a simple random experiment which "calculates" the probability that throwing n dice yields a sum of k:
from Scientific.BSP import ParSequence, ParFunction, ParRootFunction
from sys import argv
from random import randint

n = int(argv[1]); m = int(argv[2]); k = int(argv[3])

def sumWuerfe(ws): return len([w for w in ws if sum(w) == k])
glb_sumWuerfe = ParFunction(sumWuerfe)

def ausgabe(result): print float(result) / len(wuerfe)
glb_ausgabe = ParRootFunction(ausgabe)

wuerfe = [[randint(1, 6) for _ in range(n)] for _ in range(m)]
glb_wuerfe = ParSequence(wuerfe)

# The parallel calculation:
ergs = glb_sumWuerfe(glb_wuerfe)
# Collecting the results on processor 0:
ergsGesamt = ergs.reduce(lambda x, y: x + y, 0)
glb_ausgabe(ergsGesamt)
The program works fine, but it uses just one process!
My question: does anyone know how to tell this Python BSP script to use 4 (or 8 or 16) processes? I thought this BSP implementation would use MPI, but starting the script via mpiexec -n 4 randExp.py doesn't work.
A minor thing, but Scientific Python != SciPy in your question...
If you download the ScientificPython sources you'll see a README.BSP, a README.MPI, and a README.BSPlib. Unfortunately, the online web pages make little mention of the information found there.
The README.BSP is pretty explicit about what you need to do to get the BSP stuff working in real parallel:
In order to use the module Scientific.BSP using more than one real processor, you must compile either the BSPlib or the MPI interface. See README.BSPlib and README.MPI for installation details. The BSPlib interface is probably more efficient (I haven't done extensive tests yet), and allows the use of the BSP toolset; on the other hand, MPI is more widely available and might thus already be installed on your machine. For serious use, you should probably install both and make comparisons for your own applications. Application programs do not have to be modified to switch between MPI and BSPlib, only the method to run the program on a multiprocessor machine must be adapted.
To execute a program in parallel mode, use the mpipython or bsppython executable. The manual for your MPI or BSPlib installation will tell you how to define the number of processors.
and the README.MPI tells you what to do to get MPI support:
Here is what you have to do to get MPI support in Scientific Python:
1) Build and install Scientific Python as usual (i.e. "python setup.py install" in most cases).
2) Go to the directory Src/MPI.
3) Type "python compile.py".
4) Move the resulting executable "mpipython" to a directory on your system's execution path.
So you have to build more of the BSP machinery explicitly to take advantage of real parallelism. The good news is that you shouldn't have to change your program. The reason is that different systems have different parallel libraries installed, and libraries that sit on top of those need a configuration/build step like this to take advantage of whatever is available.
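Once the mpipython executable is built, the launch would look something like mpirun -np 4 mpipython randExp.py 3 100000 10 (the argument values are just an example invocation; the exact launcher name and flags depend on your MPI installation, with mpiexec -n 4 being the equivalent spelling for MPICH-style launchers).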

python on win32: how to get absolute timing / CPU cycle-count

I have a Python script that calls a USB-based data-acquisition C# .NET executable. The main Python script does many other things, e.g. it controls a stepper motor. We would like to check the relative timing of various operations; for that purpose the .NET exe generates a log with timestamps from C#'s Stopwatch.GetTimestamp(), which as far as I know yields the same number as calls to the Win32 API QueryPerformanceCounter().
Now I would like to get similar numbers from the Python script. time.clock() returns such values; unfortunately, it subtracts the value obtained at the time of the first call to time.clock(). How can I get around this? Is it easy to call QueryPerformanceCounter() from some existing Python module, or do I have to write my own Python extension in C?
I forgot to mention that the Python WMI module by Tim Golden does this:
wmi.WMI().Win32_PerfRawData_PerfOS_System()[0].Timestamp_PerfTime
but it is too slow: some 48 ms of overhead. I need something with <= 1 ms overhead. time.clock() seems to be fast enough, as is C#'s Stopwatch.GetTimestamp().
TIA,
Radim
Have you tried using ctypes?
from ctypes import c_int64, byref, windll

val = c_int64()
windll.Kernel32.QueryPerformanceCounter(byref(val))  # raw high-resolution tick count
print val.value
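If you also need to convert counts into seconds, QueryPerformanceFrequency can be called the same way; a small sketch of timing an interval with both calls:

from ctypes import c_int64, byref, windll

freq = c_int64()
windll.Kernel32.QueryPerformanceFrequency(byref(freq))  # ticks per second

start = c_int64()
end = c_int64()
windll.Kernel32.QueryPerformanceCounter(byref(start))
# ... the operation being timed goes here ...
windll.Kernel32.QueryPerformanceCounter(byref(end))
print (end.value - start.value) / float(freq.value)  # elapsed seconds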
You could just call the C# Stopwatch class directly from Python, couldn't you? Maybe a small wrapper is needed (I don't know the Python/C# interop details, sorry), but if you are already using C# for data acquisition, doing the same for timings via Stopwatch should be simpler than anything else you can do.
