Python silently transposing rank 1 arrays

import numpy as np
x1 = np.arange(9.0).reshape((3, 3))
print("x1\n",x1,"\n")
x2 = np.arange(3.0)
print("x2\n",x2)
print(x2.shape,"\n")
print("Here, the shape of x2 is 3 rows by 1 column ")
print("x1#x2\n",x1#x2)
print("")
print("x2#x1 should not be possible\n",x2#x1,"\n"*3)
gives
x1
[[0. 1. 2.]
[3. 4. 5.]
[6. 7. 8.]]
x2
[0. 1. 2.]
(3,)
Here, the shape of x2 is 3 rows by 1 column
x1@x2 =
[ 5. 14. 23.]
x2@x1 should not be possible, BUT
[15. 18. 21.]
Python 3 seems to silently convert x2 into a (1, 3) array so it can be multiplied by x1. Or am I missing some syntax?

The arrays are being broadcast by NumPy.
To quote the broadcasting documentation:
The term broadcasting describes how numpy treats arrays with different
shapes during arithmetic operations. Subject to certain constraints,
the smaller array is “broadcast” across the larger array so that they
have compatible shapes. Broadcasting provides a means of vectorizing
array operations so that looping occurs in C instead of Python. It
does this without making needless copies of data and usually leads to
efficient algorithm implementations. There are, however, cases where
broadcasting is a bad idea because it leads to inefficient use of
memory that slows computation.
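As a minimal illustration of the quoted rule for element-wise arithmetic (the @ operator follows its own promotion rule, sketched further below), a (3,) array is stretched across the rows of a (3, 3) array:
import numpy as np

a = np.arange(9.0).reshape(3, 3)
b = np.arange(3.0)        # shape (3,)

# b is broadcast across each row of a for element-wise addition
print(a + b)
# [[ 0.  2.  4.]
#  [ 3.  5.  7.]
#  [ 6.  8. 10.]]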
Add the two lines marked # new line added below (the second one explicitly sets the shape of x2 to (3, 1)) and you will get an error as follows:
import numpy as np
x1 = np.arange(9.0).reshape((3, 3))
print(x1.shape) # new line added
print("x1\n",x1,"\n")
x2 = np.arange(3.0)
x2 = x2.reshape(3, 1) # new line added
print("x2\n",x2)
print(x2.shape,"\n")
print("Here, the shape of x2 is 3 rows by 1 column ")
print("x1#x2\n",x1#x2)
print("")
print("x2#x1 should not be possible\n",x2#x1,"\n"*3)
Output
(3, 3)
x1
[[0. 1. 2.]
[3. 4. 5.]
[6. 7. 8.]]
x2
[[0.]
[1.]
[2.]]
(3, 1)
Here, the shape of x2 is 3 rows by 1 column
x1@x2
[[ 5.]
[14.]
[23.]]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-c61849986c5c> in <module>
12 print("x1@x2\n",x1@x2)
13 print("")
---> 14 print("x2@x1 should not be possible\n",x2@x1,"\n"*3)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 1)
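As for why both products succeed with the original rank-1 x2: np.matmul (the @ operator) promotes a 1-D operand to 2-D, prepending a length-1 axis for a left operand or appending one for a right operand, and removes that axis again afterwards. A small sketch of the equivalent explicit reshapes:
import numpy as np

x1 = np.arange(9.0).reshape(3, 3)
x2 = np.arange(3.0)                  # rank-1, shape (3,)

# x1 @ x2: x2 acts as a (3, 1) column; the trailing axis is dropped afterwards
print(np.allclose(x1 @ x2, (x1 @ x2.reshape(3, 1)).ravel()))   # True

# x2 @ x1: x2 acts as a (1, 3) row; the leading axis is dropped afterwards
print(np.allclose(x2 @ x1, (x2.reshape(1, 3) @ x1).ravel()))   # True
So the rank-1 array is never transposed; it is temporarily treated as a row or column vector depending on which side of @ it appears.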


Referencing index for vectorization in NumPy

I have a couple of for loops that I want to vectorize in order to improve performance. They operate on 1 x N matrices.
for y in range(1, len(array[0]) + 1):
    array[0, y - 1] = np.floor(np.nanmean(otherArray[0, ((y-1)*3):((y-1)*3+3)]))

for i in range(len(array[0])):
    array[0, int((i-1)*L+1)] = otherArray[0, i]
The operations are reliant on the index of the array which is given by the for loop. Is there any way to access the index while using numpy.vectorize so that I can rewrite these as vectorized functions?
First loop:
import numpy as np
array = np.zeros((1, 10))
otherArray = np.arange(30).reshape(1, -1)
print(f'array = \n{array}')
print(f'otherArray = \n{otherArray}')
for y in range(1, len(array[0]) + 1):
    array[0, y - 1] = np.floor(np.nanmean(otherArray[0, ((y-1)*3):((y-1)*3+3)]))
print(f'array = \n{array}')
array = np.floor(np.nanmean(otherArray.reshape(-1, 3), axis = 1)).reshape(1, -1)
print(f'array = \n{array}')
output:
array =
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
otherArray =
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29]]
array =
[[ 1. 4. 7. 10. 13. 16. 19. 22. 25. 28.]]
array =
[[ 1. 4. 7. 10. 13. 16. 19. 22. 25. 28.]]
Second loop:
array = np.zeros((1, 10))
otherArray = np.arange(10, dtype = float).reshape(1, -1)
L = 1
print(f'array = \n{array}')
print(f'otherArray = \n{otherArray}')
for i in range(len(otherArray[0])):
    array[0, int((i-1)*L+1)] = otherArray[0, i]
print(f'array = \n{array}')
array = otherArray
print(f'array = \n{array}')
output:
array =
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
otherArray =
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
array =
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
array =
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
It looks like in the first loop you are computing averages over non-overlapping windows of width 3. This is best done like this:
import numpy as np
window_width = 3
arr = np.arange(12)
out = np.floor(np.nanmean(arr.reshape(-1,window_width) ,axis=-1))
print(out)
Regarding your second loop, I am not sure what it does. Are you trying to copy values from otherArray to array with some offset? If so, I'd recommend looking at numpy's slicing functionality; a sketch follows.
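If the second loop really is a strided copy, a slice assignment can replace the Python loop entirely. A minimal sketch, where the stride L, the offset, and the array sizes are assumptions for illustration:
import numpy as np

L = 2                                    # assumed stride
src = np.arange(5, dtype=float)          # values to copy
dst = np.zeros(1 + L * (len(src) - 1))   # destination sized to fit the strided copy

dst[::L] = src                           # one vectorized assignment instead of a loop
print(dst)                               # [0. 0. 1. 0. 2. 0. 3. 0. 4.]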

Combining two numpy arrays with equations based on both arrays

I have a python numpy 3x4 array A:
A=np.array([[0,1,2,3],[4,5,6,7],[1,1,1,1]])
and a 3x3 array B:
B=np.array([[1,1, 1],[2, 2, 2],[3,3,3]])
I am trying to use a numpy operation to produce array C where each element in C is based on an equation using corresponding elements in A and the entire row in B. A simplified example:
C[row,col] = A[row,col] * ( A[row,col] / B[row,0] + B[row,1] + B[row,2] )
My first thought was to simply multiply all of A by a column of B, but that raises an error.
C = A * B[:,0]
Then I thought to try this but it didn't work.
C = A[:,:] * B[:,0]
I am not sure how to use the ":" operator and get access to a specific row and column at the same time. I can do this with regular loops, but I wanted something more numpy-like.
import numpy as np
A=np.array([[0,1,2,3],[4,5,6,7],[1,1,1,1]])
B=np.array([[1,1, 1],[2, 2, 2],[3,3,3]])
C=np.zeros([3,4])
row,col = A.shape
print(A.shape)
print(A)
print(B.shape)
print(B)
print(C.shape)
print(C)
print(range(row-1))
for row in range(row):
    for col in range(col):
        C[row,col] = A[row,col] * (( A[row,col] / B[row,0]) + B[row,1] + B[row,2])
print(C)
Which prints:
(3, 4)
[[0 1 2 3]
[4 5 6 7]
[1 1 1 1]]
(3, 3)
[[1 1 1]
[2 2 2]
[3 3 3]]
(3, 4)
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
range(0, 2)
[[ 0. 3. 8. 15. ]
[24. 32.5 42. 0. ]
[ 6.33333333 6.33333333 0. 0. ]]
Suggestions on a better way?
Edited:
Now that I understand broadcasting a bit better and have that code running, let me describe more generically what I am trying to solve. I am trying to map values of a category such as "Air", which fall in a range (such as 0-5), onto shades of a given RGB color. The values are recorded over a time period.
For example, at time 1, the value of Water is 4. The standard RGB color for Water is blue (0, 0, 255). There are 5 possible values for Water, so for the blue channel 255 / 5 = 51. To get the effect of the value 4 on the blue palette, multiply 51 x 4 = 204. Since we want higher values to be darker, we subtract from 255 (white): 255 - 204 = 51. The red and green components end up being 0. So the value read at time N acts as a multiplier on the weighted R, G and B values. Zero values end up subtracted from 255 so they appear white; stronger values are darker.
So to calculate the R', G' and B' for time 1 I used:
answer = data[:,1:4] - (data[:,1:4] / data[:,[0]] * data[:,[4]])
I can extract an [R, G, B] from the answer and put it into an image at some x, y; that works well. But I can't figure out how to use Range, R, G and B to calculate new R', G', B' for all times 1, 2, ... N. I am trying to expand the numpy approach if possible. I did it with standard loops as:
for row in range(rows):
    for col in range(cols):
        r = int(data[row,1] - (data[row,1] / data[row,0] * data[row,col_offset+col] ))
        g = int(data[row,2] - (data[row,2] / data[row,0] * data[row,col_offset+col] ))
        b = int(data[row,3] - (data[row,3] / data[row,0] * data[row,col_offset+col] ))
        almostImage[row,col] = [r,g,b]
I can display the image in matplotlib and save it to .png, etc. So I think the next step is to try a list comprehension over the 2D array of time points, and then refer back to the range and RGB values. I will give it a try.
Try this:
A*(A / B[:,[0]] + B[:,1:].sum(1, keepdims=True))
Output:
array([[ 0. , 3. , 8. , 15. ],
[24. , 32.5 , 42. , 52.5 ],
[ 6.33333333, 6.33333333, 6.33333333, 6.33333333]])
Explanation:
The first operation A/B[:,[0]] utilizes numpy broadcasting.
Then B[:,1:].sum(1, keepdims=True) is just B[:,1] + B[:,2], and keepdims=True allows the dimension to stay the same. Print it to see details.
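The same broadcasting idea extends to the edited image question. A hedged sketch, assuming data holds the value range in column 0, the base R, G, B in columns 1:4, and the per-time readings from col_offset onward (the layout implied by the loop above); the sample numbers are made up:
import numpy as np

col_offset = 4
data = np.array([
    # range,   R,   G,    B,  t1, t2, t3
    [     5,   0,   0,  255,   4,  0,  2],   # "Water"
    [     5, 255,   0,    0,   1,  5,  3],   # hypothetical second category
], dtype=float)

rgb    = data[:, 1:4]                     # (rows, 3) base colors
ranges = data[:, 0:1]                     # (rows, 1) number of possible values
vals   = data[:, col_offset:]             # (rows, times) readings

# Broadcast to (rows, times, 3): rgb - rgb/range * value, as in the loop body
almostImage = (rgb[:, None, :] - rgb[:, None, :] / ranges[:, :, None] * vals[:, :, None]).astype(int)

print(almostImage.shape)    # (2, 3, 3)
print(almostImage[0, 0])    # Water at time 1, value 4 -> [ 0  0 51]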

xarray.align propagate NaNs?

I'm aligning multiple datasets (model and observations) and I thought it would make a lot of sense if xarray.align had an option to propagate NaNs/missing data in one dataset to the others. For now, I'm using xr.Dataset.where in combination with np.isfinite, but my attempt to generalize this for more than two arrays in particular feels a bit tricky. Is there a better way to do this?
a = xr.DataArray(np.arange(10).astype(float))
b = xr.DataArray(np.arange(10).astype(float))
a[[4, 5]] = np.nan
print(a.values)
print(b.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Default behaviour
c, d = xr.align(a, b)
print(c.values)
print(d.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Desired behaviour
e, f = xr.align(a.where(np.isfinite(b)), b.where(np.isfinite(a)))
print(e.values)
print(f.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
# Attempt to generalize for multiple arrays
c = b.copy()
c[[1, -1]] = np.nan
def align_better(*dataarrays):
    allvalid = np.all(np.array([np.isfinite(x) for x in dataarrays]), axis=0)
    return xr.align(*[da.where(allvalid) for da in dataarrays])
g, h, i = align_better(a, b, c)
print(g.values)
print(h.values)
print(i.values)
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
From the xarray docs:
Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes and dimension sizes.
Array from the aligned objects are suitable as input to mathematical operators, because along each dimension they have the same index and size.
Missing values (if join != 'inner') are filled with NaN.
Nothing about this function deals with the values in the arrays, just the dimensions and coordinates. This function is used for setting up arrays for operations against each other.
If your desired behavior is a function that returns NaN in every array wherever any of the arrays is NaN, your align_better function seems like a decent way to do it.
The function in my initial attempt was slow because the DataArrays were cast to numpy arrays. In this modified version, I first align the datasets; then I can safely use .values. This is much faster.
def align_better(*dataarrays):
    """ Align datasets and propagate NaNs """
    aligned = xr.align(*dataarrays)
    allvalid = np.all(np.asarray([np.isfinite(x).values for x in aligned]), axis=0)
    return [da.where(allvalid) for da in aligned]
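A brief usage sketch of the revised function, rebuilding the same a, b and c from the question; the masked result matches the earlier output:
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(10).astype(float))
b = xr.DataArray(np.arange(10).astype(float))
c = b.copy()
a[[4, 5]] = np.nan
c[[1, -1]] = np.nan

g, h, i = align_better(a, b, c)
print(g.values)   # [ 0. nan  2.  3. nan nan  6.  7.  8. nan]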

python numpy - improve efficiency on column-wise cosine similarity

I am fairly new to programming and have never used numpy before.
So, I have a 19001 x 19001 matrix. It contains a lot of zeros, so it is relatively sparse. I wrote some code to compute the pairwise cosine similarity of the columns if the item in the row is non-zero. I add all the pairwise similarity values of one row and do some mathematical operations on them to obtain one value for each row of the matrix in the end (see code below). It does what it is supposed to, but because of the great number of dimensions it is really slow. Is there any way to modify my code to make it more efficient?
import numpy as np
from scipy.spatial.distance import cosine
row_number = 0
out_file = open('outfile.txt', 'w')
for row in my_matrix:
    non_zeros = np.nonzero(my_matrix[row_number])[0]
    non_zeros = list(non_zeros)
    cosine_sim = []
    for item in non_zeros:
        if len(non_zeros) <= 1:
            break
        x = non_zeros[0]
        y = non_zeros[1]
        similarity = 1 - cosine(my_matrix[:, x], my_matrix[:, y])
        cosine_sim.append(similarity)
        non_zeros.pop(0)
    summing = np.sum(cosine_sim)
    mean = summing / len(cosine_sim)
    log = np.log(mean)
    out_file_value = log * -1
    out_file.write(str(row_number) + " " + str(out_file_value) + "\n")
    if row_number <= 19000:
        row_number += 1
    else:
        break
I know that there are functions to compute the cosine similarity directly between columns (from sklearn.metrics.pairwise import cosine_similarity), so I tried it. However, the output is kind of the same but at the same time really confusing to me, even though I read the documentation and the posts on this page referring to the issue.
For instance:
my_matrix =[[0. 0. 7. 0. 5.]
[0. 0. 11. 0. 0.]
[0. 2. 0. 0. 0.]
[0. 0. 2. 11. 5.]
[0. 0. 5. 0. 0.]]
transposed = np.transpose(my_matrix)
sim_matrix = cosine_similarity(transposed)
# resulting similarity matrix
sim_matrix =[[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0.14177624 0.45112924]
[0. 0. 0.14177624 1. 0.70710678]
[0. 0. 0.45112924 0.70710678 1.]]
If I compute the cosine similarity with my code above, it returns 0.45112924 for the 1st row ([0]) and 0.14177624 and 0.70710678 for row 4 ([3]).
out_file.txt
0 0.796001425306
1 nan
2 nan
3 0.856981065776
4 nan
I greatly appreciate any help or suggestions to my question!
You can consider using scipy instead. However, cdist doesn't take sparse matrix input; you have to provide a dense numpy array.
import numpy as np
from scipy.spatial.distance import cdist
X = np.random.randn(10000, 10000)
D = cdist(X, X.T, metric='cosine') # cosine distances between the rows of X and the columns of X
Here is the speed that I got for 10000 x 10000 random array.
%timeit cdist(X, X.T, metric='cosine')
16.4 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try it on a small array:
X = np.array([[1,0,1], [0, 3, 2], [1,0,1]])
D = cdist(X, X.T, metric='cosine')
This will give
[[ 1.11022302e-16 1.00000000e+00 4.22649731e-01]
[ 6.07767730e-01 1.67949706e-01 9.41783727e-02]
[ 1.11022302e-16 1.00000000e+00 4.22649731e-01]]
For example, D[0, 2] is the cosine distance between row 0 of X and column 2 of X; since row 0 happens to equal column 0 in this example, it also matches the distance between columns 0 and 2:
from numpy.linalg import norm
1 - np.dot(X[:, 0], X[:,2])/(norm(X[:, 0]) * norm(X[:,2])) # gives 0.422649
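If the goal is the per-row value from the original loop (the negative log of the mean cosine similarity over consecutive non-zero columns, which is my reading of that loop, so treat it as an assumption), another option is to compute the full column-similarity matrix once with sklearn's cosine_similarity and index it per row. A sketch on a small random stand-in matrix:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the large, mostly-zero matrix
rng = np.random.default_rng(0)
my_matrix = rng.random((200, 200)) * (rng.random((200, 200)) > 0.9)

# All column-vs-column similarities in one pass
sim = cosine_similarity(my_matrix.T)           # shape (n_cols, n_cols)

with open('outfile.txt', 'w') as out_file:
    for row_number, row in enumerate(my_matrix):
        nz = np.nonzero(row)[0]
        if len(nz) <= 1:
            out_file.write(f"{row_number} nan\n")
            continue
        pair_sims = sim[nz[:-1], nz[1:]]       # consecutive non-zero column pairs
        out_file.write(f"{row_number} {-np.log(pair_sims.mean())}\n")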

How can I vertically concatenate 1x2 arrays in numpy?

How can I concatenate 1x2 arrays produced in another function?
I have a for loop that produces xa as output, which is a float64 array of shape (1L, 2L).
xa = [[ 1.17281823 1.210732 ]]
The code I tried for concatenate is
A = []
for i in range(5):
    # actually xa = UglyCalculation(**Inputs[i])
    xa = np.array([[ i, i+1 ]]) # for our example here
    # do something
All I want to do is concatenate/join/append these xa values vertically.
Basically the desired output should be
0 1
1 2
2 3
3 4
4 5
I have tried the following lines of code, but it's not working. Can you help?
A = np.concatenate((A, xa)) # Concatenate the matrices
A = np.concatenate((A,xa),axis=0)
A=np.vstack((A,xa))
Setting a shape on A allows concatenation of arrays with the same number of columns:
import numpy as np
A = np.array([])
A.shape=(0,2) # set A to be 0 rows x 2 cols matrix
for i in range(5):
    xa = np.array([[i, i+1]])
    A = np.concatenate( (A,xa) )
print(A)
output
[[ 0. 1.]
[ 1. 2.]
[ 2. 3.]
[ 3. 4.]
[ 4. 5.]]
Without A.shape = (0,2), an exception will be thrown:
Traceback (most recent call last):
File "./arr.py", line 5, in <module>
A = np.concatenate( (A,xa) )
ValueError: all the input arrays must have same number of dimensions
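An alternative pattern, not from the answer above but common practice: collect the per-iteration arrays in a plain Python list and stack them once at the end, which avoids reallocating A on every iteration:
import numpy as np

rows = []
for i in range(5):
    xa = np.array([[i, i + 1]])   # stand-in for UglyCalculation(**Inputs[i])
    rows.append(xa)

A = np.vstack(rows)               # a single concatenation at the end
print(A)
# [[0 1]
#  [1 2]
#  [2 3]
#  [3 4]
#  [4 5]]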
