In a recent question I asked about the fastest way to convert a large numpy array to a delimited string. I asked because I wanted to take that plain-text string and transmit it (over HTTP, for instance) to clients written in other programming languages. A delimited string of numbers is obviously something that any client program can work with easily. However, it was suggested that because string conversion is slow, it would be faster on the Python side to base64-encode the array and send it as binary. This is indeed faster.
My question now is: (1) how can I make sure my encoded numpy array will travel well to clients on different operating systems and different hardware, and (2) how do I decode the binary data on the client side?
For (1), my inclination is to do something like the following:
import numpy as np
import base64
x = np.arange(100, dtype=np.float64)
base64.b64encode(x.tostring())
Is there anything else I need to do?
For (2), I would be happy to have an example in any programming language, where the goal is to take the numpy array of floats and turn them into a similar native data structure. Assume we have already done base64 decoding and have a byte array, and that we also know the numpy dtype, dimensions, and any other metadata which will be needed.
Thanks.
You should really look into OPeNDAP to simplify all aspects of scientific data networking. For Python, check out Pydap.
You can directly store your NumPy arrays into HDF5 format via h5py (or NetCDF), then stream the data to clients over HTTP using OPeNDAP.
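For illustration, a minimal sketch of the h5py side (the file and dataset names here are arbitrary):
import h5py
import numpy as np

x = np.arange(100, dtype=np.float64)
with h5py.File("data.h5", "w") as f:
    f.create_dataset("x", data=x)  # dtype, shape, and byte order are recorded for you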
For something a little lighter-weight than HDF (though admittedly also more ad-hoc), you could also use JSON:
import json
import base64
import numpy as np
x = np.arange(100, dtype=np.float64)
# raw bytes are not JSON-safe, so base64-encode them first
print json.dumps(dict(data=base64.b64encode(x.tostring()),
                      shape=x.shape,
                      dtype=str(x.dtype)))
This would free your clients from needing to install HDF wrappers, at the expense of having to deal with a nonstandard protocol for data exchange (and possibly also needing to install JSON bindings!).
The tradeoff would be up to you to evaluate for your situation.
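On the receiving end the client just reverses the steps; here is a minimal sketch in Python (assuming payload holds the JSON string produced above), though any language with JSON and base64 support can do the same:
import json
import base64
import numpy as np

msg = json.loads(payload)  # payload: the JSON string from above
x = np.frombuffer(base64.b64decode(msg["data"]),
                  dtype=str(msg["dtype"])).reshape(msg["shape"])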
I'd recommend using an existing format for the interchange of scientific data/arrays, such as NetCDF or HDF. In Python, you can use the PyNIO library, which has numpy bindings, and there are libraries for several other languages. Both formats are built for handling large data and take care of language and machine-representation issues. They also work well with message passing, for example in parallel computing, so I suspect your use case is covered.
What the tostring method of numpy arrays does is basically give you a dump of the memory used by the array's data (not the Python object wrapper, just the array's data). This is similar to the struct stdlib module. Base64-encoding that string and sending it across should be quite good enough, although you may also need to send the actual datatype used (including its byte order, since tostring dumps whatever order the array has in memory), as well as the dimensions if it's a multidimensional array, as you won't be able to tell those just from the data.
On the other side, how to read the data depends a little on the language. Most languages have a way of addressing such a block of memory as a particular type of array. For example, in C, you could simply base64-decode the string, cast the result to (in the case of your example) a double * and index away. This doesn't give you any of the built-in safeguards and functions and other operations that numpy arrays have in Python, but that's because C is quite a different language in that respect.
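For a concrete feel of what a client has to do, here is the same decoding done with Python's own struct module (the '<' assumes the sender was little-endian; adjust if not):
import struct
import numpy as np

x = np.arange(5, dtype=np.float64)
raw = x.tostring()                  # raw memory dump, native byte order
values = struct.unpack("<5d", raw)  # five little-endian doubles
print values                        # (0.0, 1.0, 2.0, 3.0, 4.0)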
Here is a piece of code (running on Linux CentOS 7.7.1908, x86_64):
import torch #v1.3.0
import numpy as np #v1.14.3
import matplotlib.pyplot as plt
from astropy.io.fits import getdata #v3.0.2
data, hdr = getdata("afile.fits", 0, header=True) #gives dtype=float32 2d array
plt.imshow(data)
plt.show()
This gives a nice 512x512 image.
Now, I would like to convert "data" into a PyTorch tensor:
a = torch.from_numpy(data)
However, PyTorch raises:
ValueError: given numpy array has byte order different from the native
byte order. Conversion between byte orders is currently not supported.
Well, I have tried different manipulations with no success, e.g. byteswap() and copy().
Any ideas?
PS: the same error occurs when I transfer my data to Mac OS X (Mojave), while matplotlib still works fine.
FITS stores data in big-endian byte ordering (at the time FITS was developed this was a more common machine architecture; sadly the standard has never been updated to allow flexibility on this, although it could easily be done with a single header keyword to indicate endianness of the data...)
According to the Numpy docs, Numpy arrays report the endianness of the underlying data as part of their dtype (e.g. a dtype of '>i' means big-endian ints, and '<i' means little-endian ints), and they provide methods such as byteswap() to swap the underlying bytes and newbyteorder() to change the array's dtype to reflect the new byte order.
Your solution of calling .astype(np.float32) should work, but that's because the np.float32 dtype is in your machine's native byte order (little-endian on x86), so .astype(...) copies the existing array and converts the data in that array, if necessary, to match that dtype. I just wanted to explain exactly why that works, since it might be otherwise unclear why you're doing that.
As for matplotlib, it doesn't really have much to do with your question. Numpy arrays can transparently perform operations on data that does not match the endianness of your machine architecture, by automatically performing byte swaps as necessary. Matplotlib and many other scientific Python libraries work directly with Numpy arrays and thus automatically benefit from this transparent handling of endianness.
It just happens that PyTorch (in part because of its very high performance and GPU-centric data handling model) requires you to hand it data that's already in your machine's native byte order. But that's particular to PyTorch and is not specifically a contrast with matplotlib.
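For reference, a minimal sketch of the byte-swapping route (the reason byteswap() alone appeared not to work is that it swaps the bytes without updating the dtype, so the two steps have to be combined):
import numpy as np
import torch

data = np.arange(4, dtype=">f4")         # big-endian float32, as read from FITS
native = data.byteswap().newbyteorder()  # swap the bytes AND fix the dtype
a = torch.from_numpy(native)             # accepted: the data is native-endian now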
Well, I found a workaround after reading the data array from FITS:
data = data.astype(np.float32)
a = torch.from_numpy(data)
No error is thrown and everything is ok...
I'm using multiprocessing.Queue to pass numpy arrays of float64 between python processes. This is working fine, but I'm worried it may not be as efficient as it could be.
According to the multiprocessing documentation, objects placed on the Queue will be pickled. Calling pickle on a numpy array with the default protocol results in a text representation of the data, so null bytes get replaced by the string "\\x00".
>>> pickle.dumps(numpy.zeros(10))
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I10\ntp6\ncnumpy\ndtype\np7\n(S'f8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."
I'm concerned that this means my arrays are being expensively converted into something 4x the original size and then converted back in the other process.
Is there a way to pass the data through the queue in a raw unaltered form?
I know about shared memory, but if that is the correct solution, I'm not sure how to build a queue on top of it.
Thanks!
The issue isn't with numpy, but with pickle's default protocol, which represents data as printable text so that the output is human-readable. You can tell pickle to produce binary data instead.
import numpy
import cPickle as pickle
N = 1000
a0 = pickle.dumps(numpy.zeros(N))               # default: text protocol 0
a1 = pickle.dumps(numpy.zeros(N), protocol=-1)  # -1 selects the highest (binary) protocol
print "a0", len(a0) # 32155
print "a1", len(a1) # 8133
Also, note, that if you want to decrease processor work and time, you should probably use cPickle instead of pickle (but the space savings due to using the binary protocol work regardless of pickle version).
On shared memory:
On the question of shared memory, there are a few things to consider. Shared data typically adds a significant amount of complexity to code. Basically, for every line of code that uses that data, you will need to worry about whether some other line of code in another process is simultaneously using that data. How hard this will be will depend on what you're doing. The advantages are that you save time sending the data back and forth. The question that Eelco cites is for a 60GB array, and for this there's really no choice, it has to be shared. On the other hand, for most reasonably complex code, deciding to share data simply to save a few microseconds or bytes would probably be one of the worst premature optimizations one could make.
Share Large, Read-Only Numpy Array Between Multiprocessing Processes
That should cover it all. Pickling of uncompressible binary data is a pain regardless of the protocol used, so this solution is much to be preferred.
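If you do decide shared memory is worth that complexity, a minimal sketch of the usual pattern (a fixed-size block of doubles allocated up front and viewed as a numpy array without copying; child processes created afterwards see the same memory):
import multiprocessing
import numpy

shared = multiprocessing.Array('d', 1000)  # 1000 doubles, guarded by a lock
arr = numpy.frombuffer(shared.get_obj())   # zero-copy numpy view onto the block
arr[:] = 0.0                               # writes here are visible to children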
http://docs.python.org/2/c-api/buffer.html
int ndim
The number of dimensions the memory represents as a multi-dimensional array. If it is 0, strides and suboffsets must be NULL.
What's the real world usage for this? Is it used for scatter gather vector buffers?
Using ndim and shape is primarily for multidimensional fixed-shape arrays. For example, if you wanted to build something like NumPy from scratch, you might build it around the buffer API. There are also variations to make things easy for NumPy, PIL, and modules that wrap typical C and Fortran array-processing libraries.
If you read a bit further down, the next two values both say "See complex arrays for more information." If you click that link, it gives you an example of doing something like NumPy, and describes how it works.
Also see PEP 3118 for some rationale.
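You can poke at these same fields from pure Python via memoryview, which is a thin wrapper over the buffer protocol; for example, on a 2-D numpy array:
import numpy

m = memoryview(numpy.zeros((3, 4)))
print m.ndim    # 2
print m.shape   # (3, 4)
print m.strides # (32, 8)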
It's not (primarily) for jagged-shaped arrays, like the scatter/gather use. While you can use PIL-style suboffsets for that, it's generally simpler to just use a list or array of buffers (unless you're trying to interface with PIL, of course).
(The old-style buffer API did support a mode designed specifically for scatter/gather-like use, but it was dropped in Python 3.x, and deprecated in 2.6+ once the 3.x API was backported, basically because nobody ever used it.)
I have a little C program that's continuously acquiring a stream of data and sending it via UDP, in real time, to a different computer. The basic framework for what I originally set out to do has been laid. In addition, however, I'd like to visualize the data in real time as it is acquired. To that end, I was thinking of using Python and its various plotting libraries. My question is how difficult it would be to let Python have access to what is essentially a first-in, first-out circular buffer in my C program. For concreteness, let's assume there are 1024 samples in this buffer. Does the idea of "letting Python have a continuous peek at a dynamic C array" even sound reasonable/possible? If not, what sort of plotting options are best suited to this problem?
Thanks.
You can quite easily listen to your UDP port with the standard socket module. Examples are available.
As a first step, your data could go in a simple Python list, as lists are optimized for appending data. Removing the first elements takes much more time, so you might want to do that only from time to time, and in the meantime plot only the last 1024 (or however many) elements of the list.
Plotting can then conveniently be done with the famous Matplotlib plotting library: matplotlib.pyplot.plot(data_list). Since you want real time, you might find the animation examples useful.
If you need to optimize the data acquisition speed, you can have the (also famous) NumPy array-manipulation library directly interpret the data from the stream as an array of numbers (Matplotlib can plot such arrays), with the numpy.frombuffer() function.
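Putting those pieces together, here is a minimal sketch of the receiving side (HOST, PORT and the raw-float64 datagram format are assumptions to be matched to your C sender):
import socket
import numpy as np

HOST, PORT = "0.0.0.0", 9999  # placeholders: match your C sender
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind((HOST, PORT))

while True:
    payload, addr = sock.recvfrom(8192)                 # one datagram
    samples = np.frombuffer(payload, dtype=np.float64)  # bytes -> float array
    # append samples to your data list and update the Matplotlib plot here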
It is possible, but not too simple.
You should read up on the Python C API and maybe have a look at some existing implementations.
If you have done so, you can perhaps provide a function which not only gives you a peek at the raw array, but even reassembles it into the right order and length (if it is a circular buffer). This might be quite convenient, since you have to copy the data anyway.
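A lighter-weight alternative to a full C extension is ctypes; here is a sketch in which libacquire.so and its get_buffer function are hypothetical stand-ins for whatever your C program would export:
import ctypes
import numpy as np

lib = ctypes.cdll.LoadLibrary("./libacquire.so")  # hypothetical shared library
lib.get_buffer.restype = ctypes.POINTER(ctypes.c_double)

ptr = lib.get_buffer()                            # address of the 1024-sample buffer
view = np.ctypeslib.as_array(ptr, shape=(1024,))  # zero-copy view of the C memory
snapshot = view.copy()  # copy before plotting, since the C side keeps writing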
I am trying to implement an image classification algorithm in Python. The problem is that Python takes very long to loop through the array. That's why I decided to write a Delphi DLL which performs the array processing. My problem is that I don't know how to pass the multidimensional Python array to my DLL function.
Delphi DLL extract (I use this function only for testing):
type
  TImgArray = array of array of Integer;

function count(a: TImgArray): Integer; cdecl;
begin
  result := high(a);
end;
relevant Python code:
import ctypes
from ctypes import cdll
import numpy as np

arraydll = cdll.LoadLibrary("C:\\ArrayFunctions.dll")
c_int_p = ctypes.POINTER(ctypes.c_int32)
data = valBD.ReadAsArray()    # valBD is defined in code not shown here
data = data.astype(np.int32)
data_p = data.ctypes.data_as(c_int_p)
print arraydll.count(data_p)
The value returned by the DLL function is not the right one (it is 2816 instead of 7339). That's why I guess there's something wrong with my type conversion :(
Thanks in advance,
Mario
What you're doing won't work, and is likely to corrupt memory too.
A Delphi dynamic array is implemented under the hood as a data structure that holds some metadata about the array, including the length. But what you're passing to it is a C-style pointer-as-array, which is a pointer, not a Delphi dynamic array data structure. That structure is specific to Delphi and you can't use it in other languages. (Even if you did manage to implement the exact same structure in another language, you still couldn't pass it to a Delphi DLL and have it work right, because Delphi's memory manager is involved under the hood. Doing it this way is just asking for heap corruption and/or exceptions being raised by the memory manager.)
If you want to pass a C-style array into a DLL, you have to do it the C way: pass a second parameter containing the length of the array. Your Python code should already know the length, and it shouldn't take any time to calculate it. You can definitely use Delphi to speed up image processing, but the array dimensions have to come from the Python side. There's no shortcut you can take here.
Your Delphi function declaration should look something like this:
type
TImgArray = array[0..MAXINT] of Integer;
PImgArray = ^TImgArray;
function ProcessSomething(a: PImgArray; size: integer): Integer; cdecl;
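On the Python side, the call would then look something like this (a sketch assuming the ProcessSomething declaration above, with data being the int32 numpy array from the question):
import ctypes
from ctypes import cdll
import numpy as np

arraydll = cdll.LoadLibrary("C:\\ArrayFunctions.dll")
arraydll.ProcessSomething.argtypes = [ctypes.POINTER(ctypes.c_int32), ctypes.c_int]
arraydll.ProcessSomething.restype = ctypes.c_int

data = data.astype(np.int32)  # `data` as in the question; int32 matches Integer
data_p = data.ctypes.data_as(ctypes.POINTER(ctypes.c_int32))
result = arraydll.ProcessSomething(data_p, data.size)  # length passed explicitly
Note that a 2-D array arrives as one flat block of data.size elements; if the function needs the image dimensions, pass the width and height as additional parameters.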
Normal Python arrays are usually called lists. A numpy.array in Python is a special type that is more memory-efficient than a normal Python list of normal Python floating-point objects. numpy.array wraps a standard block of memory that is accessed as a native C array type. This, in turn, does NOT map cleanly to Delphi array types.
As David says, if you're willing to use C, this will all be easier. If you want to use Delphi and access Numpy.array, I suspect that the easiest way to do it would be to find a way to export some simple C functions that access the Numpy.array type. In C I would import the numpy headers, and then write functions that I can call from Pascal. Then I would import these functions from a DLL:
function GetNumpyArrayValue( arrayObj:Pointer; index:Integer):Double;
I haven't written any CPython wrapper code in a while. This would be easier if you wanted to simply access CORE PYTHON types from Delphi. The existing Python-for-delphi wrappers will help you. Using numpy with Delphi is just a lot more work.
Since you're only writing a DLL and not a whole application, I would seriously advise you forget about Delphi and write this puppy in plain C, which is what Python extensions (which is what you're writing) really should be written in.
In short, since you're writing a DLL, in Pascal, you're going to need at least another small DLL in C, just to bridge the types between the Python extension types (numpy.array) and the Python floating point values. And even then, you're not going to easily (quickly) get an array value you could read in Delphi as a native delphi array type.
The very fastest access mechanism I can think of is this:
type
PDouble = ^Double;
function GetNumpyArrayValue( arrayObj:Pointer; var doubleVector:PDouble; var doubleVectorLen:Integer):Boolean;
You could then use pointer arithmetic on doubleVector to access the underlying C array memory.