Array operations on dask arrays - python

I have two dask arrays, a and b, and I get their dot product as below:
>>>z2 = da.from_array(a.dot(b),chunks=1)
>>> z2
dask.array<from-ar..., shape=(3, 3), dtype=int32, chunksize=(1, 1)>
But when I do
sigmoid(z2)
the shell stops responding and I can't even kill it. sigmoid is defined as below:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

When working with Dask Arrays, it is normally best to use the functions provided in dask.array. The problem with using NumPy functions directly is that they pull all of the data from the Dask Array into memory, which could be the cause of the shell freezing that you experienced. The functions in dask.array avoid this by lazily chaining computations until you choose to evaluate them. In this case, it would be better to use da.exp instead of np.exp. An example of this is provided below.
I have included a modified version of your code to demonstrate how this would be done. In the example I call .compute(), which also pulls the full result into memory; if your data is very large, this could cause the same problem, so I take a small slice of the data before calling compute to keep the result small and memory friendly. If your data is large and you wish to keep the full result, I would recommend storing it to disk instead (a sketch of this follows the example).
Hope this helps.
In [1]: import dask.array as da
In [2]: def sigmoid(z):
   ...:     return 1 / (1 + da.exp(-z))
   ...:
In [3]: d = da.random.uniform(-6, 6, (100, 110), chunks=(10, 11))
In [4]: ds = sigmoid(d)
In [5]: ds[:5, :6].compute()
Out[5]:
array([[ 0.0067856 , 0.31701817, 0.43301395, 0.23188129, 0.01530903,
0.34420555],
[ 0.24473798, 0.99594466, 0.9942868 , 0.9947099 , 0.98266004,
0.99717379],
[ 0.92617922, 0.17548207, 0.98363658, 0.01764361, 0.74843615,
0.04628735],
[ 0.99155315, 0.99447542, 0.99483032, 0.00380505, 0.0435369 ,
0.01208241],
[ 0.99640952, 0.99703901, 0.69332886, 0.97541982, 0.05356214,
0.1869447 ]])
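For the disk-storage option mentioned above, a minimal sketch could look like the following. This assumes h5py is installed; the file name and the dataset path '/sigmoid' are arbitrary choices of mine.
import dask.array as da

d = da.random.uniform(-6, 6, (100, 110), chunks=(10, 11))
ds = 1 / (1 + da.exp(-d))          # still lazy

# da.to_hdf5 writes chunk by chunk, so the full array never has to fit in memory.
da.to_hdf5('sigmoid.hdf5', '/sigmoid', ds)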

Got it... I tried the following and it worked:
ans = z2.map_blocks(sigmoid)
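This works because map_blocks applies the NumPy-based sigmoid to each chunk separately and lazily, so np.exp only ever sees one small in-memory block at a time. A rough sketch of that pattern, using a synthetic z2 since I don't have your a and b:
import numpy as np
import dask.array as da

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z2 = da.random.uniform(-6, 6, (3, 3), chunks=(1, 1))   # stand-in for a.dot(b)
ans = z2.map_blocks(sigmoid)   # lazy; each block goes through np.exp individually
ans.compute()                  # evaluates chunk by chunk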

Related

Improving loop in loops with Numpy

I am using NumPy arrays instead of pandas for speed purposes. However, I have not managed to improve my code with broadcasting, indexing, etc. Instead, I am using nested loops as below. It works, but it seems ugly and inefficient to me.
Basically, I am trying to imitate pandas' groupby at the step mydata[mydata[:,1]==i]; you can think of column 1 as a firm ID. Then, for each entry in the lookup data, I check whether it is fully contained in the selected firm at the step all(np.isin(lookup[u],d[:,3])). But as I noted at the beginning, I am uncomfortable with this approach.
out = []
for i in np.unique(mydata[:,1]):
    d = mydata[mydata[:,1]==i]
    for u in range(0,len(lookup)):
        control = all(np.isin(lookup[u],d[:,3]))
        if(control):
            out.append(d[np.isin(d[:,3],lookup[u])])
It takes about 0.27 seconds, but there must be a cleverer alternative.
I also tried Numba's jit(), but it did not work.
Could anyone help me with this?
Thanks in advance!
Fake Data:
import numpy as np

a = np.repeat(np.arange(100)+5000, np.random.randint(50, 100, 100))
b = np.random.randint(100,200,len(a))
c = np.random.randint(10,70,len(a))
index = np.arange(len(a))
mydata = np.vstack((index,a, b,c)).T
lookup = []
for i in range(0,60):
    lookup.append(np.random.randint(10,70,np.random.randint(3,6,1) ))
I had some trouble understanding the goal of your program, but I got a decent performance improvement by refactoring your inner for loop. I was able to compress your code to three or four lines.
# For each firm slice d, append the matching rows for every lookup group
# that is fully contained in d[:, 3].
f = (
    lambda grp: out.append(d[np.isin(d[:, 3], grp)])
    if all(np.isin(grp, d[:, 3]))
    else None
)

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    list(map(f, lookup))
This produces the same output list as before, and the code runs almost twice as fast (at least on my machine).
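For reference, the same inner-loop logic can also be written as a comprehension, which avoids appending from inside a lambda; this is just a restatement of the same approach under the same mydata/lookup setup, not a fully vectorized rewrite:
out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    # keep the rows of d whose column-3 value belongs to a lookup group
    # that is fully contained in this firm's column-3 values
    out.extend(
        d[np.isin(d[:, 3], grp)]
        for grp in lookup
        if np.isin(grp, d[:, 3]).all()
    )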

Vectorize QR in Numpy Python

Hi, I am trying to vectorize the QR decomposition in NumPy as the documentation suggests here, but I keep getting dimension errors. I am confused about what I am doing wrong, since I believe the following matches the documentation. Does anyone know what is wrong with this?
import numpy as np
X = np.random.randn(100,50,50)
vecQR = np.vectorize(np.linalg.qr)
vecQR(X)
From the doc: "By default, pyfunc is assumed to take scalars as input and output."
So you need to give it a signature:
vecQR = np.vectorize(np.linalg.qr, signature='(m,n)->(m,p),(p,n)')
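A quick usage sketch of that signature-based vecQR, just to show the shapes that come back (Q and R are stacked along the leading axis of X):
import numpy as np

X = np.random.randn(100, 50, 50)
vecQR = np.vectorize(np.linalg.qr, signature='(m,n)->(m,p),(p,n)')
Q, R = vecQR(X)
print(Q.shape, R.shape)   # expected: (100, 50, 50) (100, 50, 50)
Note that, if I recall correctly, recent NumPy releases (1.22+) also accept stacked matrices in np.linalg.qr directly, so depending on your version np.linalg.qr(X) may work without any wrapping at all.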
How about just mapping np.linalg.qr over the first axis of the array?
In [35]: np.array(list(map(np.linalg.qr, X)))
Out[35]:
array([[[[-3.30595447e-01, -2.06613421e-02, 2.50135751e-01, ...,
2.45828025e-02, 9.29150994e-02, -5.02663489e-02],
[-1.04193390e-01, -1.95327811e-02, 1.54158438e-02, ...,
2.62127499e-01, -2.21480958e-02, 1.94813279e-01],
[ 1.62712767e-01, -1.28304663e-01, -1.50172509e-01, ...,
1.73740906e-01, 1.31272690e-01, -2.47868876e-01]

How is timeit affected by the length of a list literal?

Update: Apparently I'm only timing the speed with which Python can read a list. This doesn't really change my question, though.
So, I read this post the other day and wanted to compare what the speeds looked like. I'm new to pandas so any time I see an opportunity to do something moderately interesting, I jump on it. Anyway, I initially just tested this out with 100 numbers, thinking that would be sufficient to satisfy my itch to play with pandas. But this is what that graph looked like:
Notice that there are 3 different runs. These runs were run in sequential order, but they all had a spike at the same two spots. The spots were approximately 28 and 64. So my initial thought was it had something to do with bytes, specifically 4. Maybe the first byte contains additional information about it being a list, and then the next byte is all data and every 4 bytes after that causes a spike in speed, which kinda made sense. So I needed to test it with more numbers. So I created a DataFrame of 3 sets of arrays, each with 1000 lists ranging in length from 0-999. I then timed them all in the same manner, that is:
Run 1: 0, 1, 2, 3, ...
Run 2: 0, 1, 2, 3, ...
Run 3: 0, 1, 2, 3, ...
What I expected to see was a dramatic increase approximately every 32 items in the array, but instead there is no recurrence to the pattern (I did zoom in and look for spikes):
However, you'll notice that they all vary a lot between 400 and 682. Oddly, one run always has a spike in the same place, making the 28 and 64 points harder to distinguish in this graph. The green line is all over the place, really. Shameful.
Question: What's happening at the initial two spikes and why does it get "fuzzy" on the graph between 400 and 682? I just finished running a test over the 0-99 sets but this time did simple addition to each item in the array and the result was exactly linear, so I think it has something to do with strings.
I tested with other methods first and got the same results, but the graph was messed up because I joined the results incorrectly, so I ran it again overnight (this took a long time) using this code to make sure the times were correctly aligned with their indexes and the runs were performed in the correct order:
import statistics as s
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[('run_%s' % str(x + 1)), r, np.random.choice(100, r).tolist()]
                   for r in range(0, 1000) for x in range(3)],
                  columns=['run', 'length', 'array']).sort_values(['run', 'length'])
df['time'] = df.array.apply(lambda x: s.mean(timeit.repeat(str(x))))

# Graph
ax = df.groupby(['run', 'length']).mean().unstack('run').plot(y='time')
ax.set_ylabel('Time [ns]')
ax.set_xlabel('Array Length')
ax.legend(loc=3)
I also have the dataframe pickled if you'd like to see the raw data.
You are severely overcomplicating things using pandas and .apply here. There is no need - it is simply inefficient. Just do it the vanilla Python way:
In [3]: import timeit
In [4]: setup = "l = list(range({}))"
In [5]: test = "str(l)"
Note that the timeit functions take a number parameter, which is how many times the statement is run per repeat. It defaults to 1000000, so let's make that more reasonable by using number=100, so we don't have to wait around forever...
In [8]: data = [timeit.repeat(test, setup.format(n), number=100) for n in range(0, 10001, 100)]
In [9]: import statistics
In [10]: mean_data = list(map(statistics.mean, data))
Visual inspection of the results:
In [11]: mean_data
Out[11]:
[3.977467228348056e-05,
0.0012597616684312622,
0.002014552320664128,
0.002637979011827459,
0.0034494600258767605,
0.0046060653403401375,
0.006786816345993429,
0.006134035007562488,
0.006666974319765965,
0.0073876206879504025,
0.008359026357841989,
0.008946725012113651,
0.01020014965130637,
0.0110439983351777,
0.012085124345806738,
0.013095536657298604,
0.013812023680657148,
0.014505649354153624,
0.015109792332320163,
0.01541508767210568,
0.018623976677190512,
0.018014412683745224,
0.01837641668195526,
0.01806374565542986,
0.01866597666715582,
0.021138361655175686,
0.020885809014240902,
0.023644315680333722,
0.022424093661053728,
0.024507874331902713,
0.026360396664434422,
0.02618172235088423,
0.02721496132047226,
0.026609957004742075,
0.027632603014353663,
0.029077719994044553,
0.030218352350251127,
0.03213361800105,
0.0321545610204339,
0.032791375007946044,
0.033749551337677985,
0.03418213398739075,
0.03482868466138219,
0.03569800598779693,
0.035460735321976244,
0.03980560234049335,
0.0375820419867523,
0.03880414469555641,
0.03926491799453894,
0.04079093333954612,
0.0420664346893318,
0.044861480011604726,
0.045125720323994756,
0.04562378901755437,
0.04398221097653732,
0.04668888701902082,
0.04841196699999273,
0.047662509993339576,
0.047592316346708685,
0.05009777001881351,
0.04870589632385721,
0.0532167866670837,
0.05079756366709868,
0.05264475334358091,
0.05531930166762322,
0.05283398299555605,
0.055121281009633094,
0.056162080339466534,
0.05814277834724635,
0.05694748067374652,
0.05985202432687705,
0.05949359833418081,
0.05837553597909088,
0.05975819365509475,
0.06247356999665499,
0.061310798317814864,
0.06292542165222888,
0.06698586166991542,
0.06634997764679913,
0.06443380867131054,
0.06923895300133154,
0.06685209332499653,
0.06864909763680771,
0.06959929631557316,
0.06832000267847131,
0.07180017333788176,
0.07092387134131665,
0.07280202202188472,
0.07342300032420705,
0.0745120863430202,
0.07483605532130848,
0.0734497313387692,
0.0763389469939284,
0.07811927401538317,
0.07915793966579561,
0.08072184936221068,
0.08046915601395692,
0.08565403800457716,
0.08061318534115951,
0.08411134833780427,
0.0865995019945937]
This looks pretty darn linear to me. Now, pandas is a handy way to graph things, especially if you want a convenient wrapper around matplotlib's API:
In [14]: import pandas as pd
In [15]: df = pd.DataFrame({'time': mean_data, 'n':list(range(0, 10001, 100))})
In [16]: df.plot(x='n', y='time')
Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0x1102a4a58>
And here is the result:
This should get you on the right track to actually time what you've been trying to time. What you wound up timing, as I explained in the comments:
You are timing the result of str(x) which results in some list-literal,
so you are timing the interpretation of list literals, not the
conversion of list->str
I can only speculate as to the patterns you are seeing as the result of that, but that is likely interpreter/hardware dependent. Here are my findings on my machine:
In [18]: data = [timeit.repeat("{}".format(str(list(range(n)))), number=100) for n in range(0, 10001, 100)]
And using a range that isn't so large:
In [23]: data = [timeit.repeat("{}".format(str(list(range(n)))), number=10000) for n in range(0, 101)]
And the results:
Which, I guess, sort of looks like yours. Perhaps that is better suited for its own question, though.

fastest method to dump numpy array into string

I need to organize a data file with chunks of named data. The data is NumPy arrays. But I don't want to use the numpy.save or numpy.savez functions, because in some cases the data has to be sent to a server over a pipe or another interface. So I want to dump the NumPy array into memory, compress it, and then send it to the server.
I've tried simple pickle, like this:
try:
    import cPickle as pkl
except ImportError:
    import pickle as pkl
import zlib
import numpy as np

def send_to_db(data, compress=5):
    send( zlib.compress(pkl.dumps(data), compress) )
...but this is an extremely slow process.
Even with compression level 0 (no compression), the process is very slow, and that is just because of the pickling.
Is there any way to dump a NumPy array into a string without pickle? I know that NumPy can expose a buffer with numpy.getbuffer, but it isn't obvious to me how to use this dumped buffer to obtain an array back.
You should definitely use numpy.save; you can still do it in-memory:
>>> import io
>>> import numpy as np
>>> import zlib
>>> f = io.BytesIO()
>>> arr = np.random.rand(100, 100)
>>> np.save(f, arr)
>>> compressed = zlib.compress(f.getbuffer())
And to decompress, reverse the process:
>>> np.load(io.BytesIO(zlib.decompress(compressed)))
array([[ 0.80881898, 0.50553303, 0.03859795, ..., 0.05850996,
0.9174782 , 0.48671767],
[ 0.79715979, 0.81465744, 0.93529834, ..., 0.53577085,
0.59098735, 0.22716425],
[ 0.49570713, 0.09599001, 0.74023709, ..., 0.85172897,
0.05066641, 0.10364143],
...,
[ 0.89720137, 0.60616688, 0.62966729, ..., 0.6206728 ,
0.96160519, 0.69746633],
[ 0.59276237, 0.71586014, 0.35959289, ..., 0.46977027,
0.46586237, 0.10949621],
[ 0.8075795 , 0.70107856, 0.81389246, ..., 0.92068768,
0.38013495, 0.21489793]])
>>>
Which, as you can see, matches what we saved earlier:
>>> arr
array([[ 0.80881898, 0.50553303, 0.03859795, ..., 0.05850996,
0.9174782 , 0.48671767],
[ 0.79715979, 0.81465744, 0.93529834, ..., 0.53577085,
0.59098735, 0.22716425],
[ 0.49570713, 0.09599001, 0.74023709, ..., 0.85172897,
0.05066641, 0.10364143],
...,
[ 0.89720137, 0.60616688, 0.62966729, ..., 0.6206728 ,
0.96160519, 0.69746633],
[ 0.59276237, 0.71586014, 0.35959289, ..., 0.46977027,
0.46586237, 0.10949621],
[ 0.8075795 , 0.70107856, 0.81389246, ..., 0.92068768,
0.38013495, 0.21489793]])
>>>
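If you prefer to let NumPy handle the compression itself, np.savez_compressed works the same way with an in-memory buffer; a sketch of that variant:
import io
import numpy as np

arr = np.random.rand(100, 100)

f = io.BytesIO()
np.savez_compressed(f, arr=arr)      # zlib-compressed .npz, entirely in memory
payload = f.getvalue()               # bytes you can send over a pipe

restored = np.load(io.BytesIO(payload))['arr']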
The default pickle protocol (protocol 0) produces pure ASCII output. To get (much) better performance, use the latest protocol available. Protocols 2 and above are binary and, if memory serves me right, allow NumPy arrays to dump their buffer directly into the stream without additional operations.
To select the protocol, add the optional argument when pickling (there is no need to specify it when unpickling), for instance pkl.dumps(data, 2).
To pick the latest possible protocol, use pkl.dumps(data, -1).
Note that if you exchange data between different Python versions, you need to use the lowest commonly supported protocol.
See the pickle documentation for details on the different protocols.
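Putting that together with the zlib step from the question, a minimal sketch might look like this (the compression level of 5 is just an example value):
import pickle as pkl
import zlib

import numpy as np

data = np.random.rand(1000, 1000)

# protocol=-1 selects the highest protocol this Python supports (binary).
payload = zlib.compress(pkl.dumps(data, protocol=-1), 5)

# Reverse the process to get the array back.
restored = pkl.loads(zlib.decompress(payload))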
There is a method tobytes which, according to my benchmarks, is faster than the other alternatives.
Take this with a grain of salt, as some of my experiments may be misguided or plainly wrong, but it is a way of dumping a NumPy array into a string.
Keep in mind that you will need to keep some additional data out of band, mainly the data type of the array and also its shape. That may be a deal breaker, or it may not be relevant. It is easy to recover the original array by calling np.frombuffer(..., dtype=...).reshape(...).
Edit: a possibly incomplete example:
##############
# Generation #
##############
import numpy as np
arr = np.random.randint(1, 7, (4,6))
arr_dtype = arr.dtype.str
arr_shape = arr.shape
arr_data = arr.tobytes()
# Now send / store arr_dtype, arr_shape, arr_data, where:
# arr_dtype is string
# arr_shape is tuple of integers
# arr_data is bytes
############
# Recovery #
############
arr = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
I am not considering the column/row (C versus Fortran) ordering, because I know that NumPy supports it but I have never used it. If you need the memory arranged in a specific fashion for multidimensional arrays, you may have to take that into account at some point.
Also: frombuffer doesn't copy the buffer data; it creates the NumPy array as a view (maybe not exactly that, but you know what I mean). If that's undesired behaviour, you can use fromstring (which is deprecated but still seems to work in 1.19) or use frombuffer followed by np.copy.
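A small sketch of forcing an owned, writable copy when the view behaviour of frombuffer is not wanted; arr_data, arr_dtype and arr_shape are the values from the example above:
# frombuffer over a bytes object yields a read-only view; np.copy detaches it.
arr_view = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
arr_copy = np.copy(arr_view)   # independent array that owns its memory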

Numpy ifft error

I'm having a really frustrating problem with NumPy's inverse fast Fourier transform function. I know the fft function works well, based on my other results; the error seems to be introduced after calling ifft. The following should print zeros, for example:
temp = Eta[50:55]
print(temp)
print(temp-np.fft.fft(np.fft.ifft(temp)))
Output:
[ -4.70429130e+13 -3.15161484e+12j -2.45515846e+13 +5.43230842e+12j -2.96326088e+13 -4.55029496e+12j 2.99158889e+13 -3.00718375e+13j -3.87978563e+13 +9.98287428e+12j]
[ 0.00781250+0.00390625j -0.02734375+0.01757812j 0.05078125-0.02441406j 0.01171875-0.01171875j -0.01562500+0.015625j ]
Please help!
You are seeing normal floating point imprecision. Here's what I get with your data:
In [58]: temp = np.array([ -4.70429130e+13 -3.15161484e+12j, -2.45515846e+13 +5.43230842e+12j, -2.96326088e+13 -4.55029496e+12j, 2.99158889e+13 -3.00718375e+13j, -3.87978563e+13 +9.98287428e+12j])
In [59]: delta = temp - np.fft.fft(np.fft.ifft(temp))
In [60]: delta
Out[60]:
array([ 0.0000000+0.00390625j, -0.0312500+0.01953125j,
0.0390625-0.02539062j, 0.0078125-0.015625j , -0.0156250+0.015625j ])
Relative to the input, those values are in fact "small", and reasonable for 64-bit floating point calculations:
In [61]: np.abs(delta)/np.abs(temp)
Out[61]:
array([ 8.28501685e-17, 1.46553699e-15, 1.55401584e-15,
4.11837758e-16, 5.51577805e-16])
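In practice, the usual way to verify a round trip like this is np.allclose, which compares with a relative tolerance instead of expecting exact zeros. A short sketch with the same data:
import numpy as np

temp = np.array([-4.70429130e+13 - 3.15161484e+12j,
                 -2.45515846e+13 + 5.43230842e+12j,
                 -2.96326088e+13 - 4.55029496e+12j,
                  2.99158889e+13 - 3.00718375e+13j,
                 -3.87978563e+13 + 9.98287428e+12j])

roundtrip = np.fft.fft(np.fft.ifft(temp))
print(np.allclose(temp, roundtrip))   # True: the differences are within tolerance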
