I have to repeatedly move a large array from Lua to Python. Currently, I run the Lua code as a subprocess from Python and read the array from its stdout. This is much slower than I'd like, and the bottleneck seems to be almost entirely the Python p.stdout.read([byte size of array]) calls, as running the Lua code in isolation is much faster.
From what I have read, the only way to improve on pipes is to use shared memory, but this is (almost) always discussed in the context of multiprocessing between different Python processes rather than between Python and a subprocess.
Is there a reasonable way to share memory between Python and Lua? Related answers have suggested using direct calls to shm_open but I'd rather use prebuilt modules/packages if they exist.
Before going down the path of looking into shared memory I would suggest doing some profiling experiments to identify exactly where the time is being spent.
If your experiments show that you're spending too much time serializing/deserializing data between the processes, then using shared memory together with a format designed to avoid that cost, such as Cap'n Proto, could be a good solution.
A quick search turned up these two libraries:
lua-capnproto - a pure Lua implementation of Cap'n Proto based on LuaJIT.
pycapnp - a Python wrapping of the C++ implementation of the Cap'n Proto library.
But definitely do the profiling first.
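For example, a minimal sketch of such a timing experiment, separating the pipe read from the deserialization step (the command line, array size, and dtype below are placeholders, not your actual setup):
import subprocess
import time
import numpy as np

ARRAY_BYTES = 256 * 256 * 4  # assumed size of a float32 array

p = subprocess.Popen(['lua', 'main.lua'], stdout=subprocess.PIPE)

t0 = time.perf_counter()
raw = p.stdout.read(ARRAY_BYTES)            # time spent reading from the pipe
t1 = time.perf_counter()
arr = np.frombuffer(raw, dtype=np.float32)  # time spent deserializing
t2 = time.perf_counter()

print("pipe read: %.3fs, deserialize: %.3fs" % (t1 - t0, t2 - t1))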
Also, is there a reason lupa wouldn't work for you?
Here's the solution I found using Torch from Lua and NumPy from Python. The Lua code is run from Python using lupa.
In main.lua:
require 'torch'
data_array = torch.FloatTensor(256, 256)
-- return the address of the tensor's underlying storage as a plain number
function write_data()
    return tonumber(torch.data(data_array:contiguous(), true))
end
From Python:
import ctypes
import lupa
import numpy as np
data_shape = (256, 256)
lua = lupa.LuaRuntime()
with open('main.lua') as f:
    lua.execute(f.read())
data_array = np.ctypeslib.as_array(
    ctypes.cast(ctypes.c_void_p(lua.globals().write_data()),
                ctypes.POINTER(ctypes.c_float)),
    shape=data_shape)
data_array is constructed to point at the storage of the Torch tensor.
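As a quick sanity check, and assuming main.lua above has already been executed in the same LuaRuntime (and that the tensor was contiguous, so write_data() returned a pointer to its actual storage), changes made on the Lua side should be visible through the NumPy view without any copying:
# fill the tensor from the Lua side ...
lua.execute("data_array:fill(1.5)")

# ... and observe the change through the zero-copy NumPy view
print(data_array[0, 0])   # expected: 1.5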
I have a program which processes a zip file using zipfile. It works with an iterator, since the uncompressed file is bigger than 2GB and could become a memory problem.
with zipfile.ZipFile(BytesIO(my_file)) as myzip:
    for file_inside in myzip.namelist():
        with myzip.open(file_inside) as file:
            # Process here
            # for loop ....
Then I noticed that this process was extremely slow at processing my file. I can understand that it may take some time, but it should at least be using my machine's resources: let's say the Python process should be using its core at 100%.
Since it doesn't, I started researching the possible root causes. I'm not an expert in compression matters, so I first considered the basic things:
Resources don't seem to be the problem; there's plenty of RAM available, even if my coding approach wouldn't use it.
CPU usage is not high, not even on one core.
The file being opened is only about 80MB when compressed, so disk reads should not be the slowing factor either.
This made me think that the bottleneck could be one of the less visible factors: RAM bandwidth. However, I have no idea how I could measure this.
Then on the software side, I found on the zipfile docs:
Decryption is extremely slow as it is implemented in native Python rather than C.
I guess that if it's implemented in native Python, it's not even using OpenGL acceleration, so that's another point for slowness. I'm also curious about how this method works, again because of the low CPU usage.
So my question is, of course: how could I work in a similar way (not having the fully uncompressed file in RAM), but uncompress faster in Python? Is there another library, or maybe another approach, to overcome this slowness?
There is this library for Python that handles zipping files without the memory hassle.
Quoted from the docs:
Buzon - ZipFly
ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without memory inflation.
I've never used it, but it might help.
I've done some research and found the following:
You could "pip install czipfile", more information at https://pypi.org/project/czipfile/
Another solution is to use "Cython", a variant of python -https://www.reddit.com/r/Python/comments/cksvp/whats_a_python_zip_library_with_fast_decryption/
Or you could outsource to 7-Zip, as explained here: Faster alternative to Python's zipfile module?
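If you go the 7-Zip route, a minimal sketch could look like this (it assumes the 7z binary is on your PATH; the archive name, output directory, and password are placeholders):
import subprocess

# let 7-Zip do the decryption and extraction instead of zipfile's pure-Python code
subprocess.run(['7z', 'x', 'test.zip', '-o/tmp/123', '-ppassword', '-y'], check=True)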
It's quite stupid that Python doesn't implement zip decryption in pure C.
So I made it in Cython, which is 17 times faster.
Just get the dezip.pyx and setup.py from this gist.
https://gist.github.com/zylo117/cb2794c84b459eba301df7b82ddbc1ec
Then install Cython and build the Cython library:
pip3 install cython
python3 setup.py build_ext --inplace
Then run the original script with two more lines.
import zipfile
# add these two lines
from dezip import _ZipDecrypter_C
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)
z = zipfile.ZipFile('./test.zip', 'r')
z.extractall('/tmp/123', None, b'password')
I am using subprocess to run an Rscript. The script returns an R matrix. I am using subprocess.check_output in Python and get a string, but is there a way to get the output matrix directly in Python?
Thanks
Exchanging objects between two languages is not an easy task.
The generic solution
This solution works for all languages:
You launch your script
After the computation, you write your results in a generic format, for example .csv, .txt, or .json
You reload the result in the other language (a minimal sketch of this follows below)
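For instance, a sketch of that round trip on the Python side might look like the following (the script name script.R and the output file result.csv are placeholders, and the R script is assumed to write its matrix as CSV):
import subprocess
import numpy as np

# run the R script; it is assumed to write its result matrix to result.csv
subprocess.run(['Rscript', 'script.R'], check=True)

# reload the result in Python as a NumPy array
matrix = np.loadtxt('result.csv', delimiter=',')
print(matrix.shape)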
Regarding R and Python
There is an existing package to do that: rpy, but it might be tricky to use, and sometimes the errors are not very explicit (because, as I said, it is tricky to exchange objects between two languages).
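As an illustration only: rpy2 is the current incarnation of rpy, and a minimal sketch could look like the snippet below. The R expression is a placeholder for your actual script, and the conversion API (numpy2ri) has changed between rpy2 versions, so treat this as a starting point rather than a recipe.
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri

numpy2ri.activate()  # let rpy2 hand R matrices back as NumPy arrays

# evaluate R code in the embedded interpreter; source()-ing your script works the same way
r_matrix = robjects.r('matrix(1:6, nrow = 2)')

arr = np.asarray(r_matrix)
print(arr.shape)  # expected: (2, 3)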
I am using the Python based Sage Mathematics software to create a very long list of vectors. The list contains roughly 100,000,000 elements and sys.getsizeof() tells me that it is of size a little less than 1GB.
I pickle this list into a file (which already takes a long time, but fair enough). It's only when I unpickle this list that it gets annoying. The RAM usage increases from 1.15GB to 4.3GB, and I am wondering what's going on?
How can I find out in Sage what all the memory is used for? And do you have any ideas how to optimize this by maybe applying Python tricks?
This is a reply to the comment of kcrisman.
I cannot post the exact code since it would be too long, but here is a simple example where the phenomenon can be observed. I am working on Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux.
Start Sage and execute:
import pickle
L = [vector([1,2,3]) for k in range(1000000)]
f = open("mylist", 'w')
pickle.dump(L, f)
On my system the list is 8697472 bytes big, and the pickled file is roughly 130MB. Now close Sage and watch your memory (with htop, for example). Then execute the following lines:
import pickle
f = open("mylist", 'r')
pickle.load(f)
Without Sage my Linux system uses 1035MB of memory; when Sage is running, the usage increases to 1131MB. After I unpickle the file it uses 2535MB, which I find odd.
It's probably better not to use Python's pickle module directly. cPickle is already a bit better, but a lot of pickling in Sage assumes protocol 2, which (c)Pickle doesn't default to. You can use Sage's own wrappers of pickle. If I do your example with
sage: open("mylist",'w').write(dumps(L))
and then load it in a fresh session via
sage: L = loads(open("mylist",'r').read())
I observe no problems.
Note that the above interface is not the best one to pickle/unpickle in Sage to a file. You'd be better off using save/load. I just did it that way to stay as close as possible to your example.
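For reference, a minimal sketch of the save/load route (the file name is a placeholder; Sage adds the .sobj extension itself):
sage: save(L, "mylist")
and, in a fresh session:
sage: L = load("mylist")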
The BSP Parallel Programming Model has several benefits - the programmer need not explicitly care about synchronization, deadlocks become impossible and reasoning about speed becomes much easier than with traditional methods. There is a Python interface to the BSPlib in the SciPy:
import Scientific.BSP
I wrote a little program to test BSP. The program is a simple random experiment which "calculates" the probability that throwing n dice yields a sum of k:
from Scientific.BSP import ParSequence, ParFunction, ParRootFunction
from sys import argv
from random import randint
n = int(argv[1]) ; m = int(argv[2]) ; k = int(argv[3])
def sumWuerfe(ws): return len([w for w in ws if sum(w)==k])
glb_sumWuerfe= ParFunction(sumWuerfe)
def ausgabe(result): print float(result)/len(wuerfe)
glb_ausgabe = ParRootFunction(ausgabe)
wuerfe = [[randint(1,6) for _ in range(n)] for _ in range(m)]
glb_wuerfe = ParSequence(wuerfe)
# The parallel calc:
ergs = glb_sumWuerfe(glb_wuerfe)
# collecting the results in Processor 0:
ergsGesamt = ergs.reduce(lambda x, y: x + y, 0)
glb_ausgabe(ergsGesamt)
The program works fine, but it uses just one process!
My question: does anyone know how to tell this Python BSP script to use 4 (or 8 or 16) processes? I thought this BSP implementation would use MPI, but starting the script via mpiexec -n 4 randExp.py doesn't work.
A minor thing, but Scientific Python != SciPy in your question...
If you download the ScientificPython sources you'll see a README.BSP, a README.MPI, and a README.BSPlib. Unfortunately, there's not much mention of that information on the online webpages.
The README.BSP is pretty explicit about what you need to do to get the BSP stuff working in real parallel:
In order to use the module Scientific.BSP using more than one real processor, you must compile either the BSPlib or the MPI interface. See README.BSPlib and README.MPI for installation details. The BSPlib interface is probably more efficient (I haven't done extensive tests yet), and allows the use of the BSP toolset, on the other hand MPI is more widely available and might thus already be installed on your machine. For serious use, you should probably install both and make comparisons for your own applications. Application programs do not have to be modified to switch between MPI and BSPlib, only the method to run the program on a multiprocessor machine must be adapted.
To execute a program in parallel mode, use the mpipython or bsppython executable. The manual for your MPI or BSPlib installation will tell you how to define the number of processors.
and the README.MPI tells you what to do to get MPI support:
Here is what you have to do to get MPI support in Scientific Python:
1) Build and install Scientific Python as usual (i.e. "python setup.py install" in most cases).
2) Go to the directory Src/MPI.
3) Type "python compile.py".
4) Move the resulting executable "mpipython" to a directory on your system's execution path.
So you have to build more BSP stuff explicitly to take advantage of real parallelism. The good news is you shouldn't have to change your program. The reason for this is that different systems have different parallel libraries installed, and libraries that go on top of those have to have a configuration/build step like this to take advantage of whatever is available.
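Once mpipython is built, the number of processes is normally chosen by your MPI launcher rather than by the script itself; the exact syntax depends on your MPI installation, but it typically looks something like
mpirun -np 4 mpipython randExp.py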
I am attempting to profile my project in python, but I am running out of memory.
My project itself is fairly memory intensive, but even half-size runs are dying with "MemoryError" when run under cProfile.
Doing smaller runs is not a good option, because we suspect that the run time is scaling super-linearly, and we are trying to discover which functions are dominating during large runs.
Why is cProfile taking so much memory? Can I make it take less? Is this normal?
Updated: Since cProfile is built into current versions of Python (the _lsprof extension), it should be using the main allocator. If this doesn't work for you, Python 2.7.1 has a --with-valgrind configure option which causes it to switch to using malloc() at runtime. This is nice since it avoids having to use a suppressions file. You can build a version just for profiling, and then run your Python app under valgrind to look at all allocations made by the profiler as well as by any C extensions which use custom allocation schemes.
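For example, a heap profile of the whole run (profiler allocations included) could be taken with valgrind's massif tool; the script name here is a placeholder:
valgrind --tool=massif python -m cProfile myscript.py
ms_print massif.out.<pid>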
(Rest of original answer follows):
Maybe try to see where the allocations are going. If you have a place in your code where you can periodically dump out the memory usage, you can use guppy to view the allocations:
import lxml.html
from guppy import hpy

hp = hpy()
trees = {}
for i in range(10):
    # do something
    trees[i] = lxml.html.fromstring("<html>")
    print hp.heap()

# examine allocations for specific objects you suspect
print hp.iso(*trees.values())