Fastest method to dump a NumPy array into a string - Python

I need to organize a data file with chunks of named data. The data are NumPy arrays, but I don't want to use the numpy.save or numpy.savez functions, because in some cases the data has to be sent to a server over a pipe or another interface. So I want to dump a NumPy array into memory, zip it, and then send it to the server.
I've tried simple pickle, like this:
try:
    import cPickle as pkl
except ImportError:
    import pickle as pkl
import zlib
import numpy as np

def send_to_db(data, compress=5):
    send(zlib.compress(pkl.dumps(data), compress))
.. but this is an extremely slow process.
Even with compression level 0 (i.e. without compression), the process is very slow, simply because of the pickling.
Is there any way to dump a NumPy array into a string without pickle? I know that numpy allows getting a buffer with numpy.getbuffer, but it isn't obvious to me how to use this dumped buffer to obtain the array back.

You should definitely use numpy.save; you can still do it in-memory:
>>> import io
>>> import numpy as np
>>> import zlib
>>> f = io.BytesIO()
>>> arr = np.random.rand(100, 100)
>>> np.save(f, arr)
>>> compressed = zlib.compress(f.getbuffer())
And to decompress, reverse the process:
>>> np.load(io.BytesIO(zlib.decompress(compressed)))
array([[ 0.80881898,  0.50553303,  0.03859795, ...,  0.05850996,
         0.9174782 ,  0.48671767],
       [ 0.79715979,  0.81465744,  0.93529834, ...,  0.53577085,
         0.59098735,  0.22716425],
       [ 0.49570713,  0.09599001,  0.74023709, ...,  0.85172897,
         0.05066641,  0.10364143],
       ...,
       [ 0.89720137,  0.60616688,  0.62966729, ...,  0.6206728 ,
         0.96160519,  0.69746633],
       [ 0.59276237,  0.71586014,  0.35959289, ...,  0.46977027,
         0.46586237,  0.10949621],
       [ 0.8075795 ,  0.70107856,  0.81389246, ...,  0.92068768,
         0.38013495,  0.21489793]])
>>>
Which, as you can see, matches what we saved earlier:
>>> arr
array([[ 0.80881898,  0.50553303,  0.03859795, ...,  0.05850996,
         0.9174782 ,  0.48671767],
       [ 0.79715979,  0.81465744,  0.93529834, ...,  0.53577085,
         0.59098735,  0.22716425],
       [ 0.49570713,  0.09599001,  0.74023709, ...,  0.85172897,
         0.05066641,  0.10364143],
       ...,
       [ 0.89720137,  0.60616688,  0.62966729, ...,  0.6206728 ,
         0.96160519,  0.69746633],
       [ 0.59276237,  0.71586014,  0.35959289, ...,  0.46977027,
         0.46586237,  0.10949621],
       [ 0.8075795 ,  0.70107856,  0.81389246, ...,  0.92068768,
         0.38013495,  0.21489793]])
>>>
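To make the round trip reusable, here is a small sketch built on the same io/zlib/np.save calls shown above (the function names are mine, not from any library):
import io
import zlib
import numpy as np

def array_to_bytes(arr, level=5):
    # Serialize with np.save into an in-memory buffer, then compress.
    buf = io.BytesIO()
    np.save(buf, arr)
    return zlib.compress(buf.getbuffer(), level)

def bytes_to_array(blob):
    # Reverse: decompress, then let np.load read from a BytesIO "file".
    return np.load(io.BytesIO(zlib.decompress(blob)))

arr = np.random.rand(100, 100)
assert np.array_equal(bytes_to_array(array_to_bytes(arr)), arr)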

The default pickle protocol produces a pure ASCII output. To get (much) better performance, use the latest protocol available. Protocols 2 and above are binary and, if memory serves me right, allow NumPy arrays to dump their buffer directly into the stream without additional operations.
To select the protocol, add the optional argument while pickling (there is no need to specify it while unpickling), for instance pkl.dumps(data, 2).
To pick the latest possible protocol, use pkl.dumps(data, -1).
Note that if you use different Python versions, you need to specify the lowest commonly supported protocol.
See the pickle documentation for details on the different protocols.
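For illustration, a minimal sketch of the difference (standard-library calls only; the array size is arbitrary and timings will vary):
import pickle as pkl
import zlib
import numpy as np

data = np.random.rand(1000, 1000)

# Protocol 0 (ASCII) versus the highest binary protocol
ascii_blob = zlib.compress(pkl.dumps(data, 0), 5)
binary_blob = zlib.compress(pkl.dumps(data, pkl.HIGHEST_PROTOCOL), 5)

# The binary protocol serializes the array buffer directly, so it is
# typically much faster and produces a smaller blob than protocol 0.
restored = pkl.loads(zlib.decompress(binary_blob))
assert np.array_equal(restored, data)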

There is a method tobytes which, according to my benchmarks, is faster than the other alternatives.
Take this with a grain of salt, as some of my experiments may be misguided or plainly wrong, but it is a way of dumping a NumPy array into a string of bytes.
Keep in mind that you will need to carry some additional data out of band, mainly the data type of the array and also its shape. That may be a deal breaker or it may not be relevant. It is easy to recover the original array by calling np.frombuffer(..., dtype=...).reshape(...).
Edit: A maybe incomplete example
##############
# Generation #
##############
import numpy as np
arr = np.random.randint(1, 7, (4,6))
arr_dtype = arr.dtype.str
arr_shape = arr.shape
arr_data = arr.tobytes()
# Now send / store arr_dtype, arr_shape, arr_data, where:
# arr_dtype is string
# arr_shape is tuple of integers
# arr_data is bytes
############
# Recovery #
############
arr = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
I am not considering the column/row ordering, because I know that numpy supports options for that, but I have never used them. If you need the memory arranged in a specific fashion (regarding row/column order for multidimensional arrays), you may need to take that into account at some point.
Also: frombuffer doesn't copy the buffer data; it creates the numpy structure as a view (maybe not exactly that, but you know what I mean). If that's undesired behaviour you can use fromstring (which is deprecated but seems to work on 1.19), or use frombuffer followed by np.copy.
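Not part of the original answer, but as a rough illustration of sending those three pieces together over a pipe, one possible (hypothetical) packing is a small JSON header followed by the raw bytes:
import json
import numpy as np

def pack(arr):
    # Hypothetical helper: 4-byte header length, JSON header (dtype + shape), raw buffer.
    header = json.dumps({"dtype": arr.dtype.str, "shape": arr.shape}).encode()
    return len(header).to_bytes(4, "big") + header + arr.tobytes()

def unpack(blob):
    n = int.from_bytes(blob[:4], "big")
    meta = json.loads(blob[4:4 + n])
    return np.frombuffer(blob[4 + n:], dtype=meta["dtype"]).reshape(meta["shape"])

arr = np.random.randint(1, 7, (4, 6))
assert np.array_equal(unpack(pack(arr)), arr)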

Related

Vectorize QR in Numpy Python

Hi, I am trying to vectorise the QR decomposition in numpy as the documentation suggests here, however I keep getting dimension issues. I am confused as to what I am doing wrong, as I believe the following follows the documentation. Does anyone know what is wrong with this:
import numpy as np
X = np.random.randn(100,50,50)
vecQR = np.vectorize(np.linalg.qr)
vecQR(X)
From the doc: "By default, pyfunc is assumed to take scalars as input and output."
So you need to give it a signature:
vecQR = np.vectorize(np.linalg.qr, signature='(m,n)->(m,p),(p,n)')
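A rough usage sketch (not from the original answer; the shapes assume the (100, 50, 50) input above):
import numpy as np

X = np.random.randn(100, 50, 50)
vecQR = np.vectorize(np.linalg.qr, signature='(m,n)->(m,p),(p,n)')

Q, R = vecQR(X)   # QR of each 50x50 matrix along the first axis
# Expected: Q.shape == (100, 50, 50) and R.shape == (100, 50, 50)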
How about just mapping np.linalg.qr over the 1st axis of the array?:
In [35]: np.array(list(map(np.linalg.qr, X)))
Out[35]:
array([[[[-3.30595447e-01, -2.06613421e-02, 2.50135751e-01, ...,
2.45828025e-02, 9.29150994e-02, -5.02663489e-02],
[-1.04193390e-01, -1.95327811e-02, 1.54158438e-02, ...,
2.62127499e-01, -2.21480958e-02, 1.94813279e-01],
[ 1.62712767e-01, -1.28304663e-01, -1.50172509e-01, ...,
1.73740906e-01, 1.31272690e-01, -2.47868876e-01]

Convert a string of ndarray to ndarray

I have the string form of an ndarray. I want to convert it back to an ndarray.
I tried newval = np.fromstring(val, dtype=float), but it gives ValueError: string size must be a multiple of element size.
Also I tried newval = ast.literal_eval(val). This gives
File "<unknown>", line 1
[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
^
SyntaxError: invalid syntax
String of ndarray
'[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
-9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
-4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
-1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
-5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]'
How can I convert this back to ndarray?
To expand upon my comment:
If you're trying to parse a human-readable string representation of a NumPy array you've acquired from somewhere, you're already doing something you shouldn't.
Instead use numpy.save() and numpy.load() to persist NumPy arrays in an efficient binary format.
Maybe use .savetxt() if you need human readability at the expense of precision and processing speed... but never consider str(arr) to be something you can ever parse again.
However, to answer your question, if you're absolutely desperate and don't have a way to get the array into a better format...
>>> data = '''
... [-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
... -9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
... -4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
... -1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
... -5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
... 2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]
... '''.strip()
>>> list_of_floats = [float(x) for x in data.strip('[]').split()]
>>> list_of_floats
[-0.145181984, 0.151671678, 0.159053639, -0.102861412, -0.0970948339, -0.175551832, -0.072443448, 0.119182713, -0.0454084426, -0.0923779532, 0.0887222588, 0.0105331177, -0.131792471, 0.0350326337, -0.065857783, 1.02670217, -0.0529987812, 0.0209167395, -0.119845152, 0.0230511073, 0.0289404951, 0.0417387672, -0.208203331, 0.0234342851]
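If an actual ndarray (rather than a plain list) is needed, as the question asks, one extra step that is not in the original answer converts it:
>>> import numpy as np
>>> arr = np.array(list_of_floats)   # 1-D float64 array with 24 elements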
EDIT: For the case OP mentioned in the comments,
I am storing these arrays in LevelDB as key value pairs. The arrays are fasttext vectors. In levelDB vector (value) for each ngram (key) are stored. Is what you mentioned above applicable here?
Yes – you'd use BytesIO from the io module to emulate an in-memory "file" NumPy can write into, then put that buffer into LevelDB, and reverse the process (read from LevelDB into an empty BytesIO and pass it to NumPy) to read:
import io
import numpy as np

bio = io.BytesIO()
np.save(bio, my_array)
ldb.put('my-key', bio.getvalue())   # ldb is your LevelDB handle
# ...
bio = io.BytesIO(ldb.get('my-key'))
my_array = np.load(bio)

Most computationally efficient way (in terms of RAM, not storage space) to save many (millions of) varying-length numpy arrays in a specific order

I need to save a list containing numpy arrays in a specific order, using the most computationally efficient method. By 'computationally efficient' I mean taking up the least amount of RAM, not storage space.
For example, would
numpy.savez('test.npz', *listofNumpyArrays)
preserve the order of numpy arrays in listofNumpyArrays while also spending minimal RAM in the process?
Order is very important because I have another single dimension int array, and each component in that array needs to match with a specific array in listofNumpyArrays
I originally put all these numbers into the same numpy variable and it looked like this, with the single int as the first component of each row and the corresponding array as the second component of each row (first 10 rows displayed below):
array([[0,
array([10158697, 5255434, 9860860, 3677049, 3451292, 7225330])],
[1,
array([ 5985929, 7356938, 5232932, 4623077, 10461651, 6629144,
2738221, 7672279, 3197654, 11678039, 1912097, 6581279,
8141689, 6694817, 6139889, 7946369, 3995629, 3169031,
3793217, 6990097, 11298098, 6120907, 5336712, 7366785,
7363171, 3933563, 6484209, 4243394, 6371367, 4361218,
11469370, 6166715, 11519607, 11602639, 10759034, 6432476,
5327726, 11390220, 7009744, 10225744, 3781058, 1305863,
462965, 1158562, 2620006, 73896, 4945223, 11780201,
3044821])],
[2,
array([10847593, 8665775, 341568, 4164850, 6509965, 8227738])],
[3,
array([ 9105020, 1896456, 2757197, 5911741, 8123078, 10629261,
5646782, 5255907, 8802504, 3735293, 5496511, 1612181,
10029269, 8911733, 8035123, 4855475, 2226494, 10448630,
2041328, 532211, 10049766, 7320606, 7783187, 11536583,
9192742, 8965808, 7750786, 2462038, 111935, 4306882,
11193228])],
[4,
array([11406300, 9947761, 2539951, 1928472, 1286647, 1360522,
9680046, 1304518, 2577907, 5903319, 6304940, 8249558,
11156695, 5704721, 9753227, 465481, 8849435, 5040956,
8124190, 11094867, 9225419, 10531161, 3796335, 6660230,
823696, 3271428, 9167574])],
[5, array([3535672, 9474011, 4708696, 9700618, 4762633])],
[6,
array([ 1352149, 6408648, 3218823, 977256, 2488662, 603435,
6576555, 2555278, 5487362, 10008975, 5066785, 5294573,
1381729])],
[7,
array([ 7078697, 2981709, 2426786, 676729, 2688166, 3353437,
7244406, 7804172, 9142652, 10594869, 95474, 3867698,
6645756, 1936281, 3728779, 10800050, 8270358, 2336174,
6785259, 7282204, 7485619, 8041965, 2445126, 5681877,
8383953, 9431775, 11442616, 9835489, 3940227, 1289543,
87876, 6148249, 3026723, 1651607, 4135482, 916739,
10408908, 9178863, 11775813, 1652898, 1190197, 1913270,
3759101, 7491500, 5183607, 8476053, 8482428, 11354398,
2370455, 6820229, 3952122, 9633865, 4498499, 1475415,
11303906, 2958509, 10639369])],
[8,
array([ 67999, 3847504, 8303130, 748628, 1321792, 3453436,
2813805, 6915028, 8459024, 10499123, 8855154, 9783163,
8580897, 3092270, 3438170, 9067374, 11068581, 4926186,
6137797, 2207028, 2806970, 8601327, 5183768, 10655752,
2775719, 6653472, 3995496, 2488194, 3610797, 5508794,
6524258, 2368954, 2070, 9522863, 1765121, 7842059,
11473248, 3077500, 130699, 1407101, 2434487, 1129155,
5643519, 7780188])],
[9, array([8528119, 6930255, 630592, 5476242])]], dtype=object)
However, when I tried to save it, my environment crashed because I ran out of memory. I later learned that numpy is optimized for fixed-length rows, so as you can see the overall dtype becomes object, which is not computationally optimal for numpy arrays and which is probably causing my environment (Google Colaboratory) to crash.
I have also tried it with the second component as a list instead of an array, and that is actually able to save to a .npy file using numpy.save. However, when I try to load the file, it crashes again.
As far as I have searched, there is no computationally optimized way to save such data in one numpy variable. So it looks like I have to save the first column (containing the single ints) in one numpy array, and then for each of those ints there is another array of numbers that I have to save separately, in that particular order, while keeping RAM usage low in the process.
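Not an answer from the thread, just a hedged sketch of the plan described above, assuming listofNumpyArrays is the list from the question and 'arrays.npz' is an arbitrary file name. savez stores positional arrays under the names arr_0, arr_1, ... in the order given, which preserves the pairing with the int keys:
import numpy as np

keys = np.arange(len(listofNumpyArrays))          # the single ints
np.savez('arrays.npz', *listofNumpyArrays, keys=keys)

with np.load('arrays.npz') as data:
    keys = data['keys']
    one_array = data['arr_1']                     # the array paired with keys[1]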

string(array) vs string(list) in python

I was constructing a database for a deep learning algorithm. The points I'm interested in are these:
with open(fname, 'a+') as f:
    f.write("intens: " + str(mean_intensity_this_object) + "," + "\n")
    f.write("distances: " + str(dists_this_object) + "," + "\n")
Where mean_intensity_this_object is a list and dists_this_object is a numpy.array, something I didn't pay enough attention to at first. After I opened the file, I found out that the second variable, distances, looks very different from intens. The former is
distances: [430.17802963 315.2197058 380.33997833 387.46190951 41.93648858
221.5210474 488.99452579],
and the latter
intens: [0.15381262,..., 0.13638344],
The important bit is that the latter is a standard list, while the former is very hard to read: multiple lines without delimiters and unclear rules for starting a new line. Essentially, as a result, I had to rerun the whole tracking algorithm and change str(dists_this_object) to str(dists_this_object.tolist()), which increased the file size.
So, my question is: why does this happen? Is it possible to save np.array objects in a more readable format, like lists?
In an interactive Python session:
>>> import numpy as np
>>> x = np.arange(10)/.33 # make an array of floats
>>> x
array([ 0. , 3.03030303, 6.06060606, 9.09090909,
12.12121212, 15.15151515, 18.18181818, 21.21212121,
24.24242424, 27.27272727])
>>> print(x)
[ 0. 3.03030303 6.06060606 9.09090909 12.12121212
15.15151515 18.18181818 21.21212121 24.24242424 27.27272727]
>>> print(x.tolist())
[0.0, 3.0303030303030303, 6.0606060606060606, 9.09090909090909, 12.121212121212121, 15.15151515151515, 18.18181818181818, 21.21212121212121, 24.242424242424242, 27.27272727272727]
The standard display for a list uses [] and commas; the display for an array omits the commas. If there are over 1000 items, the array display employs an ellipsis
>>> print(x)
[ 0. 3.03030303 6.06060606 ..., 3024.24242424
3027.27272727 3030.3030303 ]
while the list display continues to show every value.
In this line, did you add the ..., or is that part of the print?
intens: [0.15381262,..., 0.13638344],
Or doing the same with a file write:
In [299]: with open('test.txt', 'w') as f:
...: f.write('array:'+str(x)+'\n')
...: f.write('list:'+str(x.tolist())+'\n')
In [300]: cat test.txt
array:[ 0. 3.33333333 6.66666667 10. 13.33333333
16.66666667 20. 23.33333333 26.66666667 30. ]
list:[0.0, 3.3333333333333335, 6.666666666666667, 10.0, 13.333333333333334, 16.666666666666668, 20.0, 23.333333333333336, 26.666666666666668, 30.0]
np.savetxt gives more control over the formatting of an array, for example:
In [312]: np.savetxt('test.txt',[x], fmt='%10.6f',delimiter=',')
In [313]: cat test.txt
0.000000, 3.333333, 6.666667, 10.000000, 13.333333, 16.666667, 20.000000, 23.333333, 26.666667, 30.000000
The default array print is aimed mainly at interactive work, where you want to see enough of the values to tell whether they are right or not, but you don't intend to reload them. The savetxt/loadtxt pair is better for that.
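To close the loop on that pair, a small sketch (the file name and delimiter match the savetxt example above):
import numpy as np

x2 = np.loadtxt('test.txt', delimiter=',')
# x2 holds the same values back as a float array, up to the %10.6f rounding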
The savetxt does, roughly:
for row in x:
    f.write(fmt % tuple(row))
where fmt is constructed from your input parameter and the number of items in the row, e.g. ', '.join(['%10.6f']*10)+'\n'
In [320]: print('[%s]'%', '.join(['%10.6f']*10)%tuple(x))
[ 0.000000, 3.333333, 6.666667, 10.000000, 13.333333, 16.666667, 20.000000, 23.333333, 26.666667, 30.000000]
Actually Python converts both in the same way: str(object) calls object.__str__(), or object.__repr__() if the former does not exist. From that point it is the responsibility of the object to provide its string representation.
Python lists and numpy arrays are different objects, designed and implemented by different people to serve different needs, so it is to be expected that their __str__ and __repr__ methods do not behave the same.
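Not mentioned in the answers above, but as a side note: numpy also offers np.array2string, which can produce a comma-separated, list-like string without converting to a Python list first (a minimal sketch):
import numpy as np

x = np.arange(10) / .33
# A comma-separated, list-like rendering of the array as a single string
s = np.array2string(x, separator=', ', max_line_width=200)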

Array operations on dask arrays

I have got two dask arrays, i.e., a and b. I get the dot product of a and b as below:
>>>z2 = da.from_array(a.dot(b),chunks=1)
>>> z2
dask.array<from-ar..., shape=(3, 3), dtype=int32, chunksize=(1, 1)>
But when I do
sigmoid(z2)
the shell stops working. I can't even kill it.
Sigmoid is given as below:
def sigmoid(z):
    return 1/(1+np.exp(-z))
When working with Dask Arrays, it is normally best to use the functions provided in dask.array. The problem with using NumPy functions directly is that they will pull all of the data from the Dask Array into memory, which could be the cause of the shell freezing that you experienced. The functions provided in dask.array are designed to avoid this by lazily chaining computations until you wish to evaluate them. In this case, it would be better to use da.exp instead of np.exp. An example of this is provided below.
I have provided a modified version of your code to demonstrate how this would be done. In the example I have called .compute(), which also pulls the full result into memory. It is possible that this could also cause issues for you if your data is very large. Hence I have demonstrated taking a small slice of the data before calling compute, to keep the result small and memory friendly. If your data is large and you wish to keep the full result, I would recommend storing it to disk instead.
Hope this helps.
In [1]: import dask.array as da
In [2]: def sigmoid(z):
...: return 1 / (1 + da.exp(-z))
...:
In [3]: d = da.random.uniform(-6, 6, (100, 110), chunks=(10, 11))
In [4]: ds = sigmoid(d)
In [5]: ds[:5, :6].compute()
Out[5]:
array([[ 0.0067856 ,  0.31701817,  0.43301395,  0.23188129,  0.01530903,
         0.34420555],
       [ 0.24473798,  0.99594466,  0.9942868 ,  0.9947099 ,  0.98266004,
         0.99717379],
       [ 0.92617922,  0.17548207,  0.98363658,  0.01764361,  0.74843615,
         0.04628735],
       [ 0.99155315,  0.99447542,  0.99483032,  0.00380505,  0.0435369 ,
         0.01208241],
       [ 0.99640952,  0.99703901,  0.69332886,  0.97541982,  0.05356214,
         0.1869447 ]])
Got it... I tried and it worked!
ans = z2.map_blocks(sigmoid)
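For context, a small hedged sketch of why map_blocks also works here: it applies the plain NumPy sigmoid lazily to each chunk, so nothing is pulled into memory until compute() is called (the shapes below are just illustrative):
import dask.array as da
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # plain NumPy; applied per chunk

d = da.random.uniform(-6, 6, (100, 110), chunks=(10, 11))
ans = d.map_blocks(sigmoid)        # lazy, chunk-wise application
small = ans[:5, :6].compute()      # only now is anything computed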
