Hi, I am trying to vectorise the QR decomposition in NumPy as the documentation suggests here, but I keep getting dimension errors. I am confused about what I am doing wrong, as I believe the following follows the documentation. Does anyone know what is wrong with this:
import numpy as np
X = np.random.randn(100,50,50)
vecQR = np.vectorize(np.linalg.qr)
vecQR(X)
From the doc: "By default, pyfunc is assumed to take scalars as input and output."
So you need to give it a signature:
vecQR = np.vectorize(np.linalg.qr, signature='(m,n)->(m,p),(p,n)')
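With that signature, applying it to the X above should return the stacked factors directly (for the square 50×50 matrices here, p ends up being 50 as well). A quick sketch:
import numpy as np

X = np.random.randn(100, 50, 50)
vecQR = np.vectorize(np.linalg.qr, signature='(m,n)->(m,p),(p,n)')

Q, R = vecQR(X)   # Q.shape == (100, 50, 50), R.shape == (100, 50, 50)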
How about just mapping np.linalg.qr over the 1st axis of the array?
In [35]: np.array(list(map(np.linalg.qr, X)))
Out[35]:
array([[[[-3.30595447e-01, -2.06613421e-02, 2.50135751e-01, ...,
2.45828025e-02, 9.29150994e-02, -5.02663489e-02],
[-1.04193390e-01, -1.95327811e-02, 1.54158438e-02, ...,
2.62127499e-01, -2.21480958e-02, 1.94813279e-01],
[ 1.62712767e-01, -1.28304663e-01, -1.50172509e-01, ...,
1.73740906e-01, 1.31272690e-01, -2.47868876e-01]
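Note that since np.linalg.qr returns a (Q, R) pair for each matrix, the stacked result above has shape (100, 2, 50, 50). A small sketch of splitting it back into the two factors (variable names are mine):
res = np.array(list(map(np.linalg.qr, X)))   # shape (100, 2, 50, 50)
Qs, Rs = res[:, 0], res[:, 1]                # Q and R stacks, each (100, 50, 50)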
I have a string of ndarray. I want to convert it back to ndarray.
I tried newval = np.fromstring(val, dtype=float). But it gives ValueError: string size must be a multiple of element size
Also I tried newval = ast.literal_eval(val). This gives
File "<unknown>", line 1
[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
^
SyntaxError: invalid syntax
String of ndarray
'[-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
-9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
-4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
-1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
-5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]'
How can I convert this back to ndarray?
To expand upon my comment:
If you're trying to parse a human-readable string representation of a NumPy array you've acquired from somewhere, you're already doing something you shouldn't.
Instead use numpy.save() and numpy.load() to persist NumPy arrays in an efficient binary format.
Maybe use .savetxt() if you need human readability at the expense of precision and processing speed... but never consider str(arr) to be something you can ever parse again.
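For example, a minimal round trip (the filename is just an illustration):
import numpy as np

arr = np.array([-1.45181984e-01, 1.51671678e-01, 1.59053639e-01])
np.save('vector.npy', arr)        # lossless binary .npy file
restored = np.load('vector.npy')  # comes back as an ndarray with the same dtype and shape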
However, to answer your question, if you're absolutely desperate and don't have a way to get the array into a better format...
>>> data = '''
... [-1.45181984e-01 1.51671678e-01 1.59053639e-01 -1.02861412e-01
... -9.70948339e-02 -1.75551832e-01 -7.24434480e-02 1.19182713e-01
... -4.54084426e-02 -9.23779532e-02 8.87222588e-02 1.05331177e-02
... -1.31792471e-01 3.50326337e-02 -6.58577830e-02 1.02670217e+00
... -5.29987812e-02 2.09167395e-02 -1.19845152e-01 2.30511073e-02
... 2.89404951e-02 4.17387672e-02 -2.08203331e-01 2.34342851e-02]
... '''.strip()
>>> list_of_floats = [float(x) for x in data.strip('[]').split()]
>>> list_of_floats
[-0.145181984, 0.151671678, 0.159053639, -0.102861412, -0.0970948339, -0.175551832, -0.072443448, 0.119182713, -0.0454084426, -0.0923779532, 0.0887222588, 0.0105331177, -0.131792471, 0.0350326337, -0.065857783, 1.02670217, -0.0529987812, 0.0209167395, -0.119845152, 0.0230511073, 0.0289404951, 0.0417387672, -0.208203331, 0.0234342851]
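and from there, to get an actual ndarray back:
>>> import numpy as np
>>> arr = np.array(list_of_floats)
>>> arr.shape, arr.dtype
((24,), dtype('float64'))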
EDIT: For the case OP mentioned in the comments,
I am storing these arrays in LevelDB as key value pairs. The arrays are fasttext vectors. In levelDB vector (value) for each ngram (key) are stored. Is what you mentioned above applicable here?
Yes – you'd use BytesIO from the io module to emulate an in-memory "file" NumPy can write into, then put that buffer's bytes into LevelDB, and reverse the process (read the bytes from LevelDB, wrap them in a BytesIO, and hand that to NumPy) to read:
import io
import numpy as np

bio = io.BytesIO()                     # in-memory "file" for NumPy to write into
np.save(bio, my_array)
ldb.put('my-key', bio.getvalue())      # store the raw .npy bytes as the value
# ...
bio = io.BytesIO(ldb.get('my-key'))    # wrap the stored bytes in a buffer again
my_array = np.load(bio)
I need to save a list containing NumPy arrays in a specific order, using the most computationally efficient method. By 'computationally efficient' I mean that it takes up the least amount of RAM, not storage space.
For example, would
numpy.savez('test.npz', *listofNumpyArrays)
preserve the order of numpy arrays in listofNumpyArrays while also spending minimal RAM in the process?
Order is very important because I have another one-dimensional int array, and each component in that array needs to match up with a specific array in listofNumpyArrays.
I originally put all these numbers into the same numpy variable and it looked like this, with the single int as the first component of each row and the corresponding array as the second component of each row (first 10 rows displayed below):
array([[0,
array([10158697, 5255434, 9860860, 3677049, 3451292, 7225330])],
[1,
array([ 5985929, 7356938, 5232932, 4623077, 10461651, 6629144,
2738221, 7672279, 3197654, 11678039, 1912097, 6581279,
8141689, 6694817, 6139889, 7946369, 3995629, 3169031,
3793217, 6990097, 11298098, 6120907, 5336712, 7366785,
7363171, 3933563, 6484209, 4243394, 6371367, 4361218,
11469370, 6166715, 11519607, 11602639, 10759034, 6432476,
5327726, 11390220, 7009744, 10225744, 3781058, 1305863,
462965, 1158562, 2620006, 73896, 4945223, 11780201,
3044821])],
[2,
array([10847593, 8665775, 341568, 4164850, 6509965, 8227738])],
[3,
array([ 9105020, 1896456, 2757197, 5911741, 8123078, 10629261,
5646782, 5255907, 8802504, 3735293, 5496511, 1612181,
10029269, 8911733, 8035123, 4855475, 2226494, 10448630,
2041328, 532211, 10049766, 7320606, 7783187, 11536583,
9192742, 8965808, 7750786, 2462038, 111935, 4306882,
11193228])],
[4,
array([11406300, 9947761, 2539951, 1928472, 1286647, 1360522,
9680046, 1304518, 2577907, 5903319, 6304940, 8249558,
11156695, 5704721, 9753227, 465481, 8849435, 5040956,
8124190, 11094867, 9225419, 10531161, 3796335, 6660230,
823696, 3271428, 9167574])],
[5, array([3535672, 9474011, 4708696, 9700618, 4762633])],
[6,
array([ 1352149, 6408648, 3218823, 977256, 2488662, 603435,
6576555, 2555278, 5487362, 10008975, 5066785, 5294573,
1381729])],
[7,
array([ 7078697, 2981709, 2426786, 676729, 2688166, 3353437,
7244406, 7804172, 9142652, 10594869, 95474, 3867698,
6645756, 1936281, 3728779, 10800050, 8270358, 2336174,
6785259, 7282204, 7485619, 8041965, 2445126, 5681877,
8383953, 9431775, 11442616, 9835489, 3940227, 1289543,
87876, 6148249, 3026723, 1651607, 4135482, 916739,
10408908, 9178863, 11775813, 1652898, 1190197, 1913270,
3759101, 7491500, 5183607, 8476053, 8482428, 11354398,
2370455, 6820229, 3952122, 9633865, 4498499, 1475415,
11303906, 2958509, 10639369])],
[8,
array([ 67999, 3847504, 8303130, 748628, 1321792, 3453436,
2813805, 6915028, 8459024, 10499123, 8855154, 9783163,
8580897, 3092270, 3438170, 9067374, 11068581, 4926186,
6137797, 2207028, 2806970, 8601327, 5183768, 10655752,
2775719, 6653472, 3995496, 2488194, 3610797, 5508794,
6524258, 2368954, 2070, 9522863, 1765121, 7842059,
11473248, 3077500, 130699, 1407101, 2434487, 1129155,
5643519, 7780188])],
[9, array([8528119, 6930255, 630592, 5476242])]], dtype=object)
However, when I tried to save it, my environment crashed because I ran out of memory. I later learned that numpy is optimized for fixed-length rows, so as you can see the overall dtype becomes object, which is not computationally optimal for numpy arrays and which is probably what causes my environment (Google Colaboratory) to crash.
I have also tried it where the second component is a list instead of an array, and that actually is able to save to a .npy file using numpy.save. However, when I tried to load the file, it crashed again.
As far as I have searched, there is no computationally optimized way to save such data in one numpy variable. So it looks like I have to save the first column (containing the single ints) in one numpy array. And then for each of those ints, there is another array of numbers that I have to save separately in that particular order, while optimizing the RAM for that process.
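For reference, here is a minimal sketch of the savez route I am asking about (the file name and arrays are just illustrative). As far as I understand, arrays passed positionally are stored under the keys arr_0, arr_1, ..., so the original order can be recovered by indexing those keys:
import numpy as np

listofNumpyArrays = [np.array([10158697, 5255434]), np.array([5985929, 7356938, 5232932])]

np.savez('test.npz', *listofNumpyArrays)    # stored as arr_0, arr_1, ...
loaded = np.load('test.npz')
restored = [loaded['arr_%d' % i] for i in range(len(loaded.files))]   # same order as saved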
I was constructing a database for a deep learning algorithm. The points I'm interested in are these:
with open(fname, 'a+') as f:
f.write("intens: " + str(mean_intensity_this_object) + "," + "\n")
f.write("distances: " + str(dists_this_object) + "," + "\n")
Where mean_intensity_this_object is a list and dists_this_object is a numpy.array, something I didn't pay enough attention to at first. After I opened the file, I found out that the second variable, distances, looks very different from intens. The former is
distances: [430.17802963 315.2197058 380.33997833 387.46190951 41.93648858
221.5210474 488.99452579],
and the latter
intens: [0.15381262,..., 0.13638344],
The important bit is that the latter is a standard list, while the former is very hard to read: multiple lines without delimiters and unclear rules for starting a new line. Essentially, as a result, I had to rerun the whole tracking algorithm and change str(dists_this_object) to str(dists_this_object.tolist()), which increased the file size.
So, my question is: why does this happen? Is it possible to save np.array objects in a more readable format, like lists?
In an interactive Python session:
>>> import numpy as np
>>> x = np.arange(10)/.33 # make an array of floats
>>> x
array([ 0. , 3.03030303, 6.06060606, 9.09090909,
12.12121212, 15.15151515, 18.18181818, 21.21212121,
24.24242424, 27.27272727])
>>> print(x)
[ 0. 3.03030303 6.06060606 9.09090909 12.12121212
15.15151515 18.18181818 21.21212121 24.24242424 27.27272727]
>>> print(x.tolist())
[0.0, 3.0303030303030303, 6.0606060606060606, 9.09090909090909, 12.121212121212121, 15.15151515151515, 18.18181818181818, 21.21212121212121, 24.242424242424242, 27.27272727272727]
The standard display for a list uses [] and commas; the display for an array drops the commas. If there are over 1000 items, the array display uses an ellipsis
>>> print(x)
[ 0. 3.03030303 6.06060606 ..., 3024.24242424
3027.27272727 3030.3030303 ]
while the list display continues to show every value.
In this line, did you add the ..., or is that part of the print?
intens: [0.15381262,..., 0.13638344],
Or doing the same with a file write:
In [299]: with open('test.txt', 'w') as f:
...: f.write('array:'+str(x)+'\n')
...: f.write('list:'+str(x.tolist())+'\n')
In [300]: cat test.txt
array:[ 0. 3.33333333 6.66666667 10. 13.33333333
16.66666667 20. 23.33333333 26.66666667 30. ]
list:[0.0, 3.3333333333333335, 6.666666666666667, 10.0, 13.333333333333334, 16.666666666666668, 20.0, 23.333333333333336, 26.666666666666668, 30.0]
np.savetxt gives more control over the formatting of an array, for example:
In [312]: np.savetxt('test.txt',[x], fmt='%10.6f',delimiter=',')
In [313]: cat test.txt
0.000000, 3.333333, 6.666667, 10.000000, 13.333333, 16.666667, 20.000000, 23.333333, 26.666667, 30.000000
The default array print is aimed mainly at interactive work, where you want to see enough of the values to check whether they are right or not, but you don't intend to reload them. The savetxt/loadtxt pair is better for that.
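For example, reading the test.txt written above back in (a quick sketch):
x2 = np.loadtxt('test.txt', delimiter=',')   # parses the comma-separated text back into a float array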
savetxt does, roughly:
for row in x:
f.write(fmt%tuple(row))
where fmt is constructed from your input parameter and the number of items in the row, e.g. ', '.join(['%10.6f']*10)+'\n'
In [320]: print('[%s]'%', '.join(['%10.6f']*10)%tuple(x))
[ 0.000000, 3.333333, 6.666667, 10.000000, 13.333333, 16.666667, 20.000000, 23.333333, 26.666667, 30.000000]
Actually, Python converts both in the same way: str(object) calls object.__str__(), or object.__repr__() if the former does not exist. From that point it is the responsibility of the object to provide its string representation.
Python lists and numpy arrays are different objects, designed and implemented by different people to serve different needs so it is to be expected that their __str__ and __repr__ methods do not behave the same.
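If the goal is just a more list-like text rendering without going through .tolist(), one more option (a small sketch) is np.array2string, which lets you pick the separator:
import numpy as np

x = np.arange(10) / .33
s = np.array2string(x, separator=', ')   # '[0.        , 3.03030303, ...]' -- comma-separated, list-like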
I have two Dask arrays, a and b. I get the dot product of a and b as below:
>>>z2 = da.from_array(a.dot(b),chunks=1)
>>> z2
dask.array<from-ar..., shape=(3, 3), dtype=int32, chunksize=(1, 1)>
But when I do
sigmoid(z2)
the shell stops working. I can't even kill it.
Sigmoid is given as below:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
When working with Dask Arrays, it is normally best to use the functions provided in dask.array. The problem with using NumPy functions directly is that they will pull all of the data from the Dask Array into memory, which could be the cause of the shell freezing that you experienced. The functions provided in dask.array are designed to avoid this by lazily chaining computations until you wish to evaluate them. In this case, it would be better to use da.exp instead of np.exp. An example of this is provided below.
I have provided a modified version of your code to demonstrate how this would be done. In the example I have called .compute(), which also pulls the full result into memory. It is possible that this could also cause issues for you if your data is very large. Hence I have demonstrated taking a small slice of the data before calling compute, to keep the result small and memory-friendly. If your data is large and you wish to keep the full result, I would recommend storing it to disk instead.
Hope this helps.
In [1]: import dask.array as da
In [2]: def sigmoid(z):
...: return 1 / (1 + da.exp(-z))
...:
In [3]: d = da.random.uniform(-6, 6, (100, 110), chunks=(10, 11))
In [4]: ds = sigmoid(d)
In [5]: ds[:5, :6].compute()
Out[5]:
array([[ 0.0067856 , 0.31701817, 0.43301395, 0.23188129, 0.01530903,
0.34420555],
[ 0.24473798, 0.99594466, 0.9942868 , 0.9947099 , 0.98266004,
0.99717379],
[ 0.92617922, 0.17548207, 0.98363658, 0.01764361, 0.74843615,
0.04628735],
[ 0.99155315, 0.99447542, 0.99483032, 0.00380505, 0.0435369 ,
0.01208241],
[ 0.99640952, 0.99703901, 0.69332886, 0.97541982, 0.05356214,
0.1869447 ]])
Got it... I tried and it worked!
ans = z2.map_blocks(sigmoid)
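A note on why that works: map_blocks applies the NumPy-based sigmoid to each chunk lazily, so nothing is pulled into memory until you evaluate it, e.g.:
ans = z2.map_blocks(sigmoid)   # still a lazy dask array; sigmoid (with np.exp) runs per chunk
result = ans.compute()         # materialise the (3, 3) result as a NumPy array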