Convert cumsum() output to binary array in xarray - python

I have a 3D xarray DataArray holding a cumulative sum that resets between specific time periods. I'd like to detect which time periods meet a certain condition (set to 1) and which do not (set to 0). I'll explain using the code below:
import pandas as pd
import xarray as xr
import numpy as np
# Create demo x-array
data = np.random.rand(20, 5, 5)
times = pd.date_range('2000-01-01', periods=20)
lats = np.arange(10, 0, -2)
lons = np.arange(0, 10, 2)
data = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
data.values[6:12] = 0 # Ensure some values are set to zero so that the cumsum can reset between valid time steps
data.values[18:] = 0
# This creates an xarray whereby the cumsum is calculated but resets each time a zero value is found
cumulative = data.cumsum(dim='time')-data.cumsum(dim='time').where(data.values == 0).ffill(dim='time').fillna(0)
print(cumulative[:,0,0])
>>> <xarray.DataArray (time: 20)>
array([0.13395 , 0.961934, 1.025337, 1.252985, 1.358501, 1.425393, 0. ,
0. , 0. , 0. , 0. , 0. , 0.366988, 0.896463,
1.728956, 2.000537, 2.316263, 2.922798, 0. , 0. ])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-20
lat int64 10
lon int64 0
The print statement shows that the cumulative sum resets each time a zero is encountered on the time dimension. I need a way to identify which of the two periods exceeds a value of 2, and to produce a binary array marking where the condition is met.
So my expected output would be (for this specific example):
<xarray.DataArray (time: 20)>
array([0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 1. , 1. ,
1. , 1. , 1. , 1. , 0. , 0. ])
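The reset trick in the cumulative line above can be illustrated on a plain 1-D NumPy array. This is only a minimal sketch, not the xarray code itself: the function name cumsum_with_reset is made up for illustration, and it assumes non-negative inputs (as np.random.rand produces), so that a running maximum can stand in for the forward fill.

```python
import numpy as np

def cumsum_with_reset(values):
    """Cumulative sum that restarts from zero after every zero entry.

    Hypothetical helper; assumes all values are >= 0.
    """
    values = np.asarray(values, dtype=float)
    total = np.cumsum(values)
    # Running total as it stood at the most recent zero (0 before the first zero).
    # With non-negative inputs the total never decreases, so a running maximum
    # behaves like a forward fill of those snapshot values.
    at_last_zero = np.maximum.accumulate(np.where(values == 0, total, 0.0))
    return total - at_last_zero

result = cumsum_with_reset([1, 2, 0, 3, 4, 0, 0, 5])
print(result)
```

Subtracting the snapshot taken at the last zero is the same idea as the cumsum(...) minus where(...).ffill(...) expression in the question.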

Solved this using some masking and the backfill functionality:
threshold = 2
# make something to put results in
out = xr.full_like(cumulative, fill_value=0.0)
# find the points which have met the criteria
out.values[cumulative.values > threshold] = 1
# mark the other valid (non-zero) points as NaN so they can be filled
out.values[(cumulative.values > 0) & (cumulative.values <= threshold)] = np.nan
# backfill along time: periods that never exceed the threshold take the 0 that
# follows them, and periods that do exceed it take the 1
# (a period still below the threshold at the end of the series stays NaN, so set it to 0)
out_ds = out.bfill(dim='time').fillna(0)
print('Cumulative array:')
print(cumulative.values[:,0,0])
print(' ')
print('Binary array')
print(out_ds.values[:,0,0])
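As a cross-check on the masking and backfill approach, the same flags can be computed for a single time series with plain NumPy by labelling each run of non-zero values and comparing the run total against the threshold. This is a hedged sketch rather than the answer's method: flag_runs_exceeding is a hypothetical name, and it assumes non-negative values, so the run total equals the largest cumulative value reached in that run.

```python
import numpy as np

def flag_runs_exceeding(values, threshold):
    """Return 1.0 for every step of a non-zero run whose total exceeds threshold.

    Hypothetical helper; assumes all values are >= 0.
    """
    values = np.asarray(values, dtype=float)
    nonzero = values != 0
    # A run starts where a non-zero value follows a zero (or the array start).
    previous_nonzero = np.concatenate(([False], nonzero[:-1]))
    starts = nonzero & ~previous_nonzero
    run_id = np.cumsum(starts)  # 1, 2, ... for successive runs
    # Sum the values belonging to each run.
    totals = np.bincount(run_id[nonzero], weights=values[nonzero])
    out = np.zeros_like(values)
    out[nonzero] = totals[run_id[nonzero]] > threshold
    return out

flags = flag_runs_exceeding([0.5, 0.6, 0, 0, 1.0, 0.7, 0.5, 0], threshold=2)
print(flags)
```

For the 3-D case, something like np.apply_along_axis(flag_runs_exceeding, 0, data.values, 2) should apply it along the time axis, though that is slower than the vectorised xarray version.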

Related

Condensing an array where some rows differ only by one column (to one with unique rows but more columns)

I have a long array (could be pandas or numpy, as convenient) where some rows have the first two columns identical (x-y position), and the third is unique (time), eg:
x y t
0. 0. 10.
0. 0. 11.
0. 0. 12.
0. 1. 13.
0. 1. 14.
1. 1. 15.
Positions are grouped, but there may be 1, 2 or 3 time values listed for each, meaning there may be 1, 2 or 3 rows with identical x and y. The array needs to be reshaped/condensed such that each position has its own row, with min and max values of time (and inf as the max when a position has only one time value), i.e. the target is:
x y t1 t2
0. 0. 10. 12.
0. 1. 13. 14.
1. 1. 15. inf
Is there a simple/elegant way of doing this in pandas or numpy? I've tried loops but they're messy and terribly inefficient, and I've tried using np.unique:
target_array = np.unique(initial_array[:, 0:2], axis=0)
That yields
x y
0. 0.
0. 1.
1. 1.
which is a good start, but then I'm stuck on generating the last two columns.
IIUC, you can use
out = (df.groupby(['x', 'y'])['t']
.agg(t1='min', t2='max', c='count')
.reset_index()
.pipe(lambda df: df.assign(t2=df['t2'].mask(df['c'].eq(1), np.inf)) )
.drop(columns='c')
)
print(out)
x y t1 t2
0 0.0 0.0 10.0 12.0
1 0.0 1.0 13.0 14.0
2 1.0 1.0 15.0 inf
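Since the question also allows NumPy, here is a rough NumPy-only sketch of the same condensing step. It builds on the np.unique call from the question; the variable names are illustrative, and the inverse index is flattened defensively because its shape has differed between NumPy releases.

```python
import numpy as np

initial_array = np.array([
    [0., 0., 10.],
    [0., 0., 11.],
    [0., 0., 12.],
    [0., 1., 13.],
    [0., 1., 14.],
    [1., 1., 15.],
])

# Unique (x, y) rows, the group index of every original row, and group sizes.
xy, inverse, counts = np.unique(
    initial_array[:, :2], axis=0, return_inverse=True, return_counts=True
)
inverse = inverse.ravel()

# Per-group minimum and maximum of the time column.
t1 = np.full(len(xy), np.inf)
t2 = np.full(len(xy), -np.inf)
np.minimum.at(t1, inverse, initial_array[:, 2])
np.maximum.at(t2, inverse, initial_array[:, 2])

# Positions with a single time value get inf as their upper bound.
t2[counts == 1] = np.inf

condensed = np.column_stack([xy, t1, t2])
print(condensed)
```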

Is there a python function for assigning values to several elements in a list?

I used the code shown below to create a list of lists.
Code:
num = 782
sol=4
pop_size= [sol, num]
initial_population_1 = np.random.uniform(low=0.0, high=0.0, size=pop_size)
The list of lists is shown below:
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
How can I randomly assign five values that are greater than 0 but less than 10 to five elements in each list in the list?
Thank you very much!
So, you have a 2D array (printed as a list of lists) with 4 rows, each containing 782 elements, all 0.0, and you want to set 5 elements in each row to random values between 0 and 10.
I'd like to mention that, as you are using Numpy, there is np.zeros(shape) that provides you with a zero-filled array, but whatever…
From your question it's not clear whether you want to avoid using the same location twice, but let's assume that you want to assign a random value to exactly 5 distinct entries in each row:
for row in initial_population_1:
    locations_used_in_this_row = 0
    while locations_used_in_this_row != 5:
        column = np.random.randint(num)
        if row[column] == 0.0:
            row[column] = np.random.rand()*10
            locations_used_in_this_row += 1
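A variant without the retry loop is to draw the five column indices for each row in one go, sampling without replacement. The sketch below uses NumPy's Generator API; the seed and the names population and k are illustrative assumptions. Note that uniform(0, 10) can in principle return exactly 0.0, although that is extremely unlikely.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable
sol, num, k = 4, 782, 5

population = np.zeros((sol, num))
for row in population:
    # k distinct column indices for this row
    columns = rng.choice(num, size=k, replace=False)
    # random values drawn from [0, 10)
    row[columns] = rng.uniform(0.0, 10.0, size=k)

print((population > 0).sum(axis=1))
```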

How to calculate formula for every value in an array?

I'm trying to understand how to use numpy to evaluate a formula at different times. The way the code is written gives only the values where y is greater than or equal to 0. I am experimenting with how to get the values for all y's.
Can someone explain the part ft = t * [y >= 0.0 ]? How do I use the part within the brackets?
from numpy import *
g = 10.0
h0 = 10.0
t = arange(0, 10.1 ,0.1)
y = h0 - 0.5*g*t*t
ft = t * [y >= 0.0 ]
print(ft)
This is the output, but I would like to see all the values calculated. I experimented a bit, but I could not figure out how to do it or how the [y >= 0.0] part works exactly.
[[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. 1.1 1.2 1.3 1.4 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
If I use [y] instead of [y >= 0.0], I get the following:
[[ 0.000000e+00 9.950000e-01 1.960000e+00 2.865000e+00 3.680000e+00
4.375000e+00 4.920000e+00 5.285000e+00 5.440000e+00 5.355000e+00
5.000000e+00 4.345000e+00 3.360000e+00 2.015000e+00 2.800000e-01
-1.875000e+00 -4.480000e+00 -7.565000e+00 -1.116000e+01 -1.529500e+01
-2.000000e+01 -2.530500e+01 -3.124000e+01 -3.783500e+01 -4.512000e+01
-5.312500e+01 -6.188000e+01 -7.141500e+01 -8.176000e+01 -9.294500e+01
-1.050000e+02 -1.179550e+02 -1.318400e+02 -1.466850e+02 -1.625200e+02
-1.793750e+02 -1.972800e+02 -2.162650e+02 -2.363600e+02 -2.575950e+02
-2.800000e+02 -3.036050e+02 -3.284400e+02 -3.545350e+02 -3.819200e+02
-4.106250e+02 -4.406800e+02 -4.721150e+02 -5.049600e+02 -5.392450e+02
-5.750000e+02 -6.122550e+02 -6.510400e+02 -6.913850e+02 -7.333200e+02
-7.768750e+02 -8.220800e+02 -8.689650e+02 -9.175600e+02 -9.678950e+02
-1.020000e+03 -1.073905e+03 -1.129640e+03 -1.187235e+03 -1.246720e+03
-1.308125e+03 -1.371480e+03 -1.436815e+03 -1.504160e+03 -1.573545e+03
-1.645000e+03 -1.718555e+03 -1.794240e+03 -1.872085e+03 -1.952120e+03
-2.034375e+03 -2.118880e+03 -2.205665e+03 -2.294760e+03 -2.386195e+03
-2.480000e+03 -2.576205e+03 -2.674840e+03 -2.775935e+03 -2.879520e+03
-2.985625e+03 -3.094280e+03 -3.205515e+03 -3.319360e+03 -3.435845e+03
-3.555000e+03 -3.676855e+03 -3.801440e+03 -3.928785e+03 -4.058920e+03
-4.191875e+03 -4.327680e+03 -4.466365e+03 -4.607960e+03 -4.752495e+03
-4.900000e+03]]
I would like to know how I can use numpy to calculate all the outcomes of a formula for different time intervals at once.
Thanks,
y >= 0.0 gives you an array of Booleans containing True or False depending on whether the condition y >= 0.0 is fulfilled. When you enclose it within [] as [y >= 0.0], you get a list which contains a single array of Booleans, as pointed out by @nicola in the comments.
[array([ True, True, True, True, True, False, False, False,...
... False, False, False, False])]
Now you multiply this with your arange array, which gives 0 wherever the right-hand side of the * operator is False and the actual value from the arange wherever it is True. Because the Boolean array is wrapped in a list, the result has shape (1, len(t)) rather than (len(t),), which is why the output is printed with double brackets.
The expression [y >= 0.0] produces an array of booleans, i.e. True (1) if y >= 0 and False (0) if not. That array of 1's and 0's is then multiplied by t.
It is not clear to me from your question, however, what you are trying to do with it.
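To make the mask behaviour concrete, here is a small sketch based on the question's own variables. It contrasts multiplying by the mask (same length as t, zeros where the condition fails) with indexing by the mask (only the entries where the condition holds). The names mask, masked_times, valid_times and wrapped are illustrative.

```python
import numpy as np

g = 10.0
h0 = 10.0
t = np.arange(0, 10.1, 0.1)
y = h0 - 0.5 * g * t * t      # evaluated for every t at once

mask = y >= 0.0               # Boolean array, one entry per time step
masked_times = t * mask       # same shape as t; 0 where y < 0
valid_times = t[mask]         # only the times for which y >= 0
wrapped = t * [mask]          # the list adds a leading axis of length 1

print(masked_times.shape, valid_times.shape, wrapped.shape)
```

The array y already holds the formula's value for every time step; the mask is only needed when some of those values should be zeroed out or dropped.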

python numpy - improve efficiency on column-wise cosine similarity

I am fairly new to programming and I never used numpy before.
So, I have a matrix with 19001 x 19001 dimensions. It contains a lot of zeros, so it is relatively sparse. I wrote some code to compute the pairwise cosine similarity of the columns if the item in the row is non-zero. I add all the pairwise similarity values of one row and do some mathematical operations on them to obtain one value for each row of the matrix in the end (see code below). It does what it is supposed to; however, because the matrix is so large, it is really slow. Is there any way to modify my code to make it more efficient?
import numpy as np
from scipy.spatial.distance import cosine
row_number = 0
out_file = open('outfile.txt', 'w')
for row in my_matrix:
    non_zeros = np.nonzero(my_matrix[row_number])[0]
    non_zeros = list(non_zeros)
    cosine_sim = []
    for item in non_zeros:
        if len(non_zeros) <= 1:
            break
        x = non_zeros[0]
        y = non_zeros[1]
        similarity = 1 - cosine(my_matrix[:, x], my_matrix[:, y])
        cosine_sim.append(similarity)
        non_zeros.pop(0)
    summing = np.sum(cosine_sim)
    mean = summing / len(cosine_sim)
    log = np.log(mean)
    out_file_value = log * -1
    out_file.write(str(row_number) + " " + str(out_file_value) + "\n")
    if row_number <= 19000:
        row_number += 1
    else:
        break
I know that there are functions that compute the cosine similarity between columns directly (from sklearn.metrics.pairwise import cosine_similarity), so I tried one. The output contains the same values as my code, but I still find it confusing, even though I read the documentation and the posts on this page that refer to the issue.
For instance:
my_matrix =[[0. 0. 7. 0. 5.]
[0. 0. 11. 0. 0.]
[0. 2. 0. 0. 0.]
[0. 0. 2. 11. 5.]
[0. 0. 5. 0. 0.]]
transposed = np.transpose(my_matrix)
sim_matrix = cosine_similarity(transposed)
# resulting similarity matrix
sim_matrix =[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0.14177624 0.45112924]
[0. 0. 0.14177624 1. 0.70710678]
[0. 0. 0.45112924 0.70710678 1.]]
If I compute the cosine similarity with my code above, it returns 0.45112924 for the 1st row ([0]) and 0.14177624 and 0.70710678 for row 4 ([3]).
out_file.txt
0 0.796001425306
1 nan
2 nan
3 0.856981065776
4 nan
I greatly appreciate any help or suggestions to my question!
You can consider using scipy instead. However, cdist doesn't take sparse matrix input; you have to provide a dense numpy array. cdist compares the rows of its two arguments, so pass the transpose for both to compare columns.
import numpy as np
from scipy.spatial.distance import cdist
X = np.random.randn(10000, 10000)
D = cdist(X.T, X.T, metric='cosine') # cosine distance between every pair of columns
Here is the speed that I got for 10000 x 10000 random array.
%timeit cdist(X, X.T, metric='cosine')
16.4 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try it on a small array:
X = np.array([[1,0,1], [0, 3, 2], [1,0,1]])
D = cdist(X.T, X.T, metric='cosine') # rows of X.T are the columns of X
This will give approximately (rounded to four decimals; the diagonal is zero up to floating-point error)
[[0.     1.     0.4226]
 [1.     0.     0.1835]
 [0.4226 0.1835 0.    ]]
For example, D[0, 2] is the cosine distance between columns 0 and 2:
from numpy.linalg import norm
1 - np.dot(X[:, 0], X[:,2])/(norm(X[:, 0]) * norm(X[:,2])) # gives 0.422649
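If only the similarity matrix between columns is needed, it can also be written directly in NumPy by normalising each column and taking one matrix product. This is a minimal sketch using the small matrix from the question; the function name column_cosine_similarity is hypothetical, and all-zero columns are deliberately given similarity 0 to avoid dividing by zero.

```python
import numpy as np

def column_cosine_similarity(matrix):
    """Cosine similarity between every pair of columns (hypothetical helper)."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=0)
    # Leave all-zero columns as zeros instead of dividing by zero.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit_columns = matrix / safe_norms
    return unit_columns.T @ unit_columns

my_matrix = np.array([
    [0., 0., 7., 0., 5.],
    [0., 0., 11., 0., 0.],
    [0., 2., 0., 0., 0.],
    [0., 0., 2., 11., 5.],
    [0., 0., 5., 0., 0.],
])
sim_matrix = column_cosine_similarity(my_matrix)
print(np.round(sim_matrix, 8))
```

Entry [i, j] of the result is the similarity between columns i and j, so the per-row values from the loop in the question can be looked up in it instead of being recomputed for every row.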

Theano: how to efficiently undo/reverse max-pooling

I'm using Theano 0.7 to create a convolutional neural net which uses max-pooling (i.e. shrinking a matrix down by keeping only the local maxima).
In order to "undo" or "reverse" the max-pooling step, one method is to store the locations of the maxima as auxiliary data, then simply recreate the un-pooled data by making a big array of zeros and using those auxiliary locations to place the maxima in their appropriate locations.
Here's how I'm currently doing it:
import numpy as np
import theano
import theano.tensor as T
minibatchsize = 2
numfilters = 3
numsamples = 4
upsampfactor = 5
# HERE is the function that I hope could be improved
def upsamplecode(encoded, auxpos):
    shp = encoded.shape
    upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
    for whichitem in range(minibatchsize):
        for whichfilt in range(numfilters):
            upsampled = T.set_subtensor(upsampled[whichitem, whichfilt, auxpos[whichitem, whichfilt, :]], encoded[whichitem, whichfilt, :])
    return upsampled
totalitems = minibatchsize * numfilters * numsamples
code = theano.shared(np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)))
auxpos = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)) % upsampfactor # arbitrary positions within a bin
auxpos += (np.arange(4) * 5).reshape((1,1,-1)) # shifted to the actual temporal bin location
auxpos = theano.shared(auxpos.astype(np.int))
print "code:"
print code.get_value()
print "locations:"
print auxpos.get_value()
get_upsampled = theano.function([], upsamplecode(code, auxpos))
print "the un-pooled data:"
print get_upsampled()
(By the way, in this case I have a 3D tensor, and it's only the third axis that gets max-pooled. People who work with image data might expect to see two dimensions getting max-pooled.)
The output is:
code:
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
locations:
[[[ 0 6 12 18]
[ 4 5 11 17]
[ 3 9 10 16]]
[[ 2 8 14 15]
[ 1 7 13 19]
[ 0 6 12 18]]]
the un-pooled data:
[[[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 2. 0.
0. 0. 0. 0. 3. 0.]
[ 0. 0. 0. 0. 4. 5. 0. 0. 0. 0. 0. 6. 0. 0.
0. 0. 0. 7. 0. 0.]
[ 0. 0. 0. 8. 0. 0. 0. 0. 0. 9. 10. 0. 0. 0.
0. 0. 11. 0. 0. 0.]]
[[ 0. 0. 12. 0. 0. 0. 0. 0. 13. 0. 0. 0. 0. 0.
14. 15. 0. 0. 0. 0.]
[ 0. 16. 0. 0. 0. 0. 0. 17. 0. 0. 0. 0. 0. 18.
0. 0. 0. 0. 0. 19.]
[ 20. 0. 0. 0. 0. 0. 21. 0. 0. 0. 0. 0. 22. 0.
0. 0. 0. 0. 23. 0.]]]
This method works but it's a bottleneck, taking most of my computer's time (I think the set_subtensor calls might imply cpu<->gpu data copying). So: can this be implemented more efficiently?
I suspect there's a way to express this as a single set_subtensor() call which may be faster, but I don't see how to get the tensor indexing to broadcast properly.
UPDATE: I thought of a way of doing it in one call, by working on the flattened tensors:
def upsamplecode2(encoded, auxpos):
    shp = encoded.shape
    upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
    add_to_flattened_indices = theano.shared(np.array([ [[(y + z * numfilters) * numsamples * upsampfactor for x in range(numsamples)] for y in range(numfilters)] for z in range(minibatchsize)], dtype=theano.config.floatX).flatten(), name="add_to_flattened_indices")
    upsampled = T.set_subtensor(upsampled.flatten()[T.cast(auxpos.flatten() + add_to_flattened_indices, 'int32')], encoded.flatten()).reshape(upsampled.shape)
    return upsampled
get_upsampled2 = theano.function([], upsamplecode2(code, auxpos))
print "the un-pooled data v2:"
ups2 = get_upsampled2()
print ups2
However, this is still not good efficiency-wise because when I run this (added on to the end of the above script) I find out that the Cuda libraries can't currently do the integer index manipulation efficiently:
ERROR (theano.gof.opt): Optimization failure due to: local_gpu_advanced_incsubtensor1
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/theano/gof/opt.py", line 1493, in process_node
replacements = lopt.transform(node)
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/opt.py", line 952, in local_gpu_advanced_incsubtensor1
gpu_y = gpu_from_host(y)
File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 507, in __call__
node = self.make_node(*inputs, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/basic_ops.py", line 133, in make_node
dtype=x.dtype)()])
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/type.py", line 69, in __init__
(self.__class__.__name__, dtype, name))
TypeError: CudaNdarrayType only supports dtype float32 for now. Tried using dtype int64 for variable None
I don't know whether this is faster, but it may be a little more concise. See if it is useful for your case.
import numpy as np
import theano
import theano.tensor as T
minibatchsize = 2
numfilters = 3
numsamples = 4
upsampfactor = 5
totalitems = minibatchsize * numfilters * numsamples
code = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples))
auxpos = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples)) % upsampfactor
auxpos += (np.arange(4) * 5).reshape((1,1,-1))
# first in numpy
shp = code.shape
upsampled_np = np.zeros((shp[0], shp[1], shp[2] * upsampfactor))
upsampled_np[np.arange(shp[0]).reshape(-1, 1, 1), np.arange(shp[1]).reshape(1, -1, 1), auxpos] = code
print "numpy output:"
print upsampled_np
# now the same idea in theano
encoded = T.tensor3()
positions = T.tensor3(dtype='int64')
shp = encoded.shape
upsampled = T.zeros((shp[0], shp[1], shp[2] * upsampfactor))
upsampled = T.set_subtensor(upsampled[T.arange(shp[0]).reshape((-1, 1, 1)), T.arange(shp[1]).reshape((1, -1, 1)), positions], encoded)
print "theano output:"
print upsampled.eval({encoded: code, positions: auxpos})
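The indexing in the answer relies on broadcasting: the first two index arrays have shapes (N, 1, 1) and (1, F, 1), so together with the (N, F, S) position array they address one output cell per encoded value. The NumPy-only sketch below (Python 3, no Theano) reproduces that step and compares it with np.put_along_axis, which scatters along a single axis. It illustrates the indexing idea only and says nothing about GPU performance in Theano.

```python
import numpy as np

minibatchsize, numfilters, numsamples, upsampfactor = 2, 3, 4, 5
totalitems = minibatchsize * numfilters * numsamples

code = np.arange(totalitems).reshape((minibatchsize, numfilters, numsamples))
auxpos = code % upsampfactor                                  # position within each bin
auxpos = auxpos + (np.arange(numsamples) * upsampfactor).reshape((1, 1, -1))

out_shape = (minibatchsize, numfilters, numsamples * upsampfactor)

# Variant 1: broadcast index grids, as in the answer.
grid = np.zeros(out_shape)
grid[np.arange(minibatchsize).reshape(-1, 1, 1),
     np.arange(numfilters).reshape(1, -1, 1),
     auxpos] = code

# Variant 2: scatter along the last axis only.
scattered = np.zeros(out_shape)
np.put_along_axis(scattered, auxpos, code, axis=2)

print(np.array_equal(grid, scattered))
```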
