Complex index numpy array or indexing dataframe - python

I have an array (dataframe) with shape (9800, 9800). I need to index it (without labels) like this:
x.shape == (9800, 9800)
a = x[0:7000,0:7000] (plus) x[7201:9800, 0:7000] (plus) x[0:7000, 7201:9800] (plus) x[7201:9800, 7201:9800]
b = x[7000:7200, 7000:7200]
c = x[7000:7200, 0:7000] (plus) x[7000:7200, 7201:9800]
d = x[0:7000, 7000:7200] (plus) x[7201:9800, 7000:7200]
What I mean by plus is not proper addition but concatenation: putting the resulting dataframes next to each other (see the attached image).
Is there any "easy" way of doing this? I need to repeat this for 10,000 dataframes and add them up one at a time to save memory.

You have np.r_, which basically creates an index array for you, for example:
np.r_[:3,4:6]
gives
array([0, 1, 2, 4, 5])
So in your case (note that x[a_idx, a_idx] with the same index array on both axes would select paired elements rather than a 2-D block, so use np.ix_ for the block):
a_idx = np.r_[0:7000, 7201:9800]
a = x[np.ix_(a_idx, a_idx)]
c = x[7000:7200, a_idx]
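A minimal sketch of the full decomposition along these lines, assuming x is a NumPy array (for a pandas DataFrame, x.to_numpy() gives the underlying array). np.ix_ builds an open mesh from the two 1-D index arrays, so each selection comes back as a single contiguous block:

import numpy as np

x = np.zeros((9800, 9800), dtype='int8')  # stand-in data

outer = np.r_[0:7000, 7201:9800]  # rows/cols outside the central band
band = np.r_[7000:7200]           # the 200-wide central band

a = x[np.ix_(outer, outer)]  # (9599, 9599)
b = x[np.ix_(band, band)]    # (200, 200)
c = x[np.ix_(band, outer)]   # (200, 9599)
d = x[np.ix_(outer, band)]   # (9599, 200)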

In [167]: x=np.zeros((9800,9800),'int8')
The first list of slices:
In [168]: a = [x[0:7000,0:7000], x[7201:9800, 0:7000],x[0:7000, 7201:9800], x[7201:9800, 7201:9800]]
and their shapes:
In [169]: [i.shape for i in a]
Out[169]: [(7000, 7000), (2599, 7000), (7000, 2599), (2599, 2599)]
Since the shapes vary, you can't simply concatenate them all:
In [170]: np.concatenate(a, axis=0)
Traceback (most recent call last):
  File "<ipython-input-170-c111dc665509>", line 1, in <module>
    np.concatenate(a, axis=0)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 7000 and the array at index 2 has size 2599
In [171]: np.concatenate(a, axis=1)
Traceback (most recent call last):
  File "<ipython-input-171-227af3749524>", line 1, in <module>
    np.concatenate(a, axis=1)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 7000 and the array at index 1 has size 2599
You can concatenate subsets:
In [172]: np.concatenate(a[:2], axis=0)
Out[172]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)
In [173]: _.shape
Out[173]: (9599, 7000)
I won't take the time to construct the other lists, but it looks like you could construct the first column with
np.concatenate([a[0], c[0], a[1]], axis=0)
similarly for the other columns, and then concatenate columns. Or join them by rows first.
np.block([[a[0], d[0], a[2]], [....]]) with an appropriate mix of list elements should do the same (it's just a difference in notation; the concatenation work is the same), as sketched below.
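A minimal sketch of that np.block layout for the question's slices; the placement of each piece is an assumption based on the block positions the slices imply:

x = np.zeros((9800, 9800), 'int8')
a = [x[0:7000, 0:7000], x[7201:9800, 0:7000], x[0:7000, 7201:9800], x[7201:9800, 7201:9800]]
b = x[7000:7200, 7000:7200]
c = [x[7000:7200, 0:7000], x[7000:7200, 7201:9800]]
d = [x[0:7000, 7000:7200], x[7201:9800, 7000:7200]]

out = np.block([[a[0], d[0], a[2]],
                [c[0], b,    c[1]],
                [a[1], d[1], a[3]]])
print(out.shape)  # (9799, 9799) -- row and column 7200 are dropped by the slices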


Python Numpy - Slicing assignment not assigning correctly

I have a 2-D numpy array called arm_resets that holds positive integers. The first column contains only positive integers < 360. For all columns other than the first, I need to replace all values of 360 or more with the value that is in the same row in the first column. I thought this would be a relatively easy thing to do; here's what I have:
i = 300
over_360 = arm_resets[:, [i]] >= 360
print(arm_resets[:, [i]][over_360])
print(arm_resets[:, [0]][over_360])
arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
print(arm_resets[:, [i]][over_360])
And here's what prints:
[3600 3609 3608 ... 3600 3611 3605]
[   0    9    8 ...    0   11    5]
[3600 3609 3608 ... 3600 3611 3605]
Since all the numbers shown in the first print (first 3 and last 3) are at or above 360, the third print should show them replaced by the corresponding values from the second print. Why is this not working?
edit: reproducible example:
import numpy as np
import pandas as pd

df = pd.DataFrame({"start": [1, 2, 5, 6], "freq": [1, 5, 6, 9]})
periods = 6
arm_resets = df[["start"]].values
freq = df[["freq"]].values
arm_resets = np.pad(arm_resets, ((0, 0), (0, periods - 1)))
for i in range(1, periods):
    arm_resets[:, [i]] = arm_resets[:, [i - 1]] + freq
    #over_360 = arm_resets[:, [i]] >= periods
    #arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
arm_resets
With that code commented out, here's what prints:
array([[ 1,  2,  3,  4,  5,  6],
       [ 2,  7, 12, 17, 22, 27],
       [ 3,  9, 15, 21, 27, 33],
       [ 4, 13, 22, 31, 40, 49]])
What I would expect:
array([[1, 2, 3, 4, 5, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4]])
Now if it helps, the final 2-D array I'm actually trying to create is a 1/0 array that indicates which positions are filled in, so in this example I'd want this:
array([[0, 1, 1, 1, 1, 1],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 0]])
The code I use to achieve this from the above arm_resets is this:
fin = np.zeros((len(arm_resets), periods), dtype=int)
for i in range(len(arm_resets)):
    fin[i, a[i]] = 1
The slice arm_resets[:, [i]] is a fancy index (the list makes it one), and therefore creates a copy of the i-th column of the data. arm_resets[:, [i]][over_360] = ... therefore calls __setitem__ on a temporary array that is discarded as soon as the statement executes. If you want the assignment to stick, call __setitem__ on the original array directly (raveling the 2-D mask so the shapes line up):
arm_resets[over_360.ravel(), [i]] = ...
You also don't need to make the index into a list. It's generally better to use simple indices, especially when doing assignments, since they create views rather than copies:
over_360 = arm_resets[:, i] >= 360  # 1-D mask
arm_resets[over_360, i] = ...
With slicing, even the following works, since it calls __setitem__ on a view:
arm_resets[:, i][over_360] = ...
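A minimal standalone sketch of the copy-versus-view distinction, using a hypothetical toy array:

import numpy as np

a = np.arange(6).reshape(3, 2)

# a[:, [0]] is a fancy-indexed copy: the masked assignment lands
# on a temporary, and a is left unchanged.
a[:, [0]][a[:, [0]] > 2] = 0
print(a[:, 0])  # [0 2 4] -- unchanged

# a[:, 0] is a view: the masked assignment writes through to a.
a[:, 0][a[:, 0] > 2] = 0
print(a[:, 0])  # [0 2 0]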
This still processes one column at a time, since i is a single column index. In fact, you can process the entire matrix in one step, without looping, if you use integer indices rather than a boolean mask. The reason indices are useful is that you can match each offending element with the item from the same row in the first column (note that cols is computed on the sliced view, so it is offset by one relative to the full array):
rows, cols = np.nonzero(arm_resets[:, 1:] >= 360)
arm_resets[rows, cols + 1] = arm_resets[rows, 0]
You can use np.where():
first_col = arm_resets[:, 0]                       # first column
first_col = first_col.reshape(first_col.size, 1)   # reshape into a 2-D column
arm_resets = np.where(arm_resets >= 360, first_col, arm_resets)
You can see in detail how np.where works here, but basically it evaluates arm_resets >= 360 element-wise: where the condition is true it takes the first_col value (broadcast along the row), and where it is false it keeps the arm_resets value.
Edit: as suggested by Mad Physicist, you can use arm_resets[:, 0, None] directly instead of creating the first_col variable:
arm_resets = np.where(arm_resets >= 360, arm_resets[:, 0, None], arm_resets)
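As a quick illustration on a hypothetical toy array (not the asker's data), each value of 360 or more is replaced by that row's first-column value:

import numpy as np

arm_resets = np.array([[10, 400, 365],
                       [20, 100, 500]])
fixed = np.where(arm_resets >= 360, arm_resets[:, 0, None], arm_resets)
print(fixed)
# [[ 10  10  10]
#  [ 20 100  20]]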

Numpy pack bits into 32-bit little-endian values

NumPy provides the packbits function to convert an array of 0/1 values into packed bytes. With bitorder='little' I can read them in C as uint8_t values without issues. However, I would like to read them as uint32_t values. This means that I have to reverse the order of each group of 4 bytes.
I tried to use
import numpy as np
array = np.array([1,0,1,1,0,1,0,1,0,1,0,0,1,0,1,1,0,0,1,1,0,1,0,1,1,0,0,1,0,1,0,1,
1,0,0,1,1,0,1,0,1,1,0,0,1,1,1,0,0,1])
array = np.packbits(array, bitorder='little')
array.dtype = np.uint32
array.byteswap(inplace=True)
print(array)
but have the following error:
Traceback (most recent call last):
  File "sample.py", line 5, in <module>
    array.dtype = np.uint32
ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.
I have 50 bits in the input. The first chunk of 32 bits, written in little-endian format (earliest input bit is the least significant bit), is 0b10101001101011001101001010101101 = 2846675629; the second is 0b100111001101011001 = 160601. So the expected output is
[2846675629 160601]
My other answer below only fixes the exception; this one also produces the expected values. It relies on the two questions linked in the code comments:
Pad the array on the right so its length is a multiple of 32
Reshape it into rows of 32 bits each
Pack the bits per row and only then view the result as uint32
import numpy as np
# https://stackoverflow.com/questions/49791312/numpy-packbits-pack-to-uint16-array
# https://stackoverflow.com/questions/36534035/pad-0s-of-numpy-array-to-nearest-power-of-two/36534077

a = np.array([
    1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
    1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1
])
# a = np.array([0] * 31 + [1])  # alternative test case

# Pad with zeros on the right so the length is a multiple of 32
# (the next power of two also works, but only for inputs longer than 16 bits)
padding_size = -len(a) % 32
b = np.concatenate([a, np.zeros(padding_size, dtype=a.dtype)])
# One row of 32 bits per output word
c = b.reshape((-1, 32)).astype(np.uint8)
# Pack each row into 4 bytes, then reinterpret the bytes as uint32
# (assumes a little-endian host)
d = np.packbits(c, bitorder='little').view(np.uint32)
print(d)
output:
[2846675629 160601]
You can't reassign array.dtype to np.uint32 as you did, because the 50 bits pack into 7 bytes, and 7 is not a multiple of 4 (which is exactly what the ValueError says). Instead, you can create a new array of the new type. Note that np.array(array, dtype=np.uint32) widens each byte to its own uint32 value rather than packing groups of 4 bytes into one word, so this avoids the exception but does not produce the expected words above:
import numpy as np
array = np.array([1,0,1,1,0,1,0,1,0,1,0,0,1,0,1,1,0,0,1,1,0,1,0,1,1,0,0,1,0,1,0,1,1,0,0,1,1,0,1,0,1,1,0,0,1,1,1,0,0,1])
array = np.packbits(array, bitorder='little')
array = np.array(array, dtype=np.uint32)
array.byteswap(inplace=True)
print(array)
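A shorter route that both avoids the exception and produces the expected words, sketched under the assumption that zero-padding the trailing bits is acceptable: pack first, pad the packed bytes to a multiple of 4, and view them as explicitly little-endian uint32.

import numpy as np

bits = np.array([1,0,1,1,0,1,0,1,0,1,0,0,1,0,1,1,0,0,1,1,0,1,0,1,1,0,0,1,0,1,0,1,
                 1,0,0,1,1,0,1,0,1,1,0,0,1,1,1,0,0,1])

packed = np.packbits(bits, bitorder='little')   # 7 bytes for 50 bits
packed = np.pad(packed, (0, -len(packed) % 4))  # zero-pad to a multiple of 4 bytes
words = packed.view('<u4')                      # reinterpret as little-endian uint32
print(words)                                    # [2846675629     160601]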

calculate sum of Nth column of numpy array entry grouped by the indices in first two columns?

I would like to loop over the following check_matrix in such a way that the code recognizes whether the first and second elements are 1 and 1, 1 and 2, etc. Then, for each separate class of pair, i.e. (1,1), (1,2) or (2,2), the code should store in new matrices the sum of the last element (which in this case has index 8) times exp(-i*q·(check_matrix[k][2:5] - check_matrix[k][5:8])), where i is the imaginary unit, k is the running index over check_matrix, and q is a vector defined as given below. So there are 20 q vectors.
import numpy as np
q = []
for i in np.linspace(0, 10, 20):
    q.append(np.array((0, 0, i)))
q = np.array(q)
check_matrix = np.array([[1, 1, 0, 0, 0, 0, 0, -0.7977, -0.243293],
                         [1, 1, 0, 0, 0, 0, 0, 1.5954, 0.004567],
                         [1, 2, 0, 0, 0, -1, 0, 0, 1.126557],
                         [2, 1, 0, 0, 0, 0.5, 0.86603, 1.5954, 0.038934],
                         [2, 1, 0, 0, 0, 2, 0, -0.7977, -0.015192],
                         [2, 2, 0, 0, 0, -0.5, 0.86603, 1.5954, 0.21394]])
This means in principle I will have 20 matrices of shape 2x2, one corresponding to each q vector.
At the moment my code gives only one matrix, which appears to be the last one, even though I am appending to Matrices. My code looks like this:
for i in range(2):
    i = i + 1
    for j in range(2):
        j = j + 1
        j_list = []
        Matrices = []
        for k in range(len(check_matrix)):
            if check_matrix[k][0] == i and check_matrix[k][1] == j:
                j_list.append(check_matrix[k][8] * np.exp(-1J * np.dot(q, (np.subtract(check_matrix[k][2:5], check_matrix[k][5:8])))))
        j_11 = np.sum(j_list)
        I_matrix[i-1][j-1] = j_11
        Matrices.append(I_matrix)
I_matrix is defined as below:
I_matrix= np.zeros((2,2),dtype=np.complex_)
At the moment I get the following output:
Matrices = [array([[-0.66071446-0.77603624j, -0.29038112+2.34855023j],
                   [-0.31387562-0.08116629j,  4.2788    +0.j        ]])]
But I desire to get a matrix corresponding to each q value, meaning that in total there should be 20 matrices in this case, where each element of a 2x2 matrix contains the sum over the entries belonging to the (1,1), (1,2), (2,1) and (2,2) pairs, arranged as
array([[11., 12.],
       [21., 22.]])
I shall highly appreciate your suggestions to correct it. Thanks in advance!
I am pretty sure you can solve this problem in an easier way, and I am not 100% sure that I understood you correctly, but here is some code that does what I think you want. If you have a way to check whether the results are valid, I suggest you do so.
import numpy as np

n = 20
q = np.zeros((n, 3))
q[:, -1] = np.linspace(0, 10, n)
check_matrix = np.array([[1, 1, 0, 0, 0, 0, 0, -0.7977, -0.243293],
                         [1, 1, 0, 0, 0, 0, 0, 1.5954, 0.004567],
                         [1, 2, 0, 0, 0, -1, 0, 0, 1.126557],
                         [2, 1, 0, 0, 0, 0.5, 0.86603, 1.5954, 0.038934],
                         [2, 1, 0, 0, 0, 2, 0, -0.7977, -0.015192],
                         [2, 2, 0, 0, 0, -0.5, 0.86603, 1.5954, 0.21394]])
check_matrix[:, :2] -= 1  # python indexing is zero based

matrices = np.zeros((n, 2, 2), dtype=np.complex_)
for i in range(2):
    for j in range(2):
        k_list = []
        for k in range(len(check_matrix)):
            if check_matrix[k][0] == i and check_matrix[k][1] == j:
                k_list.append(check_matrix[k][8] *
                              np.exp(-1J * np.dot(q, check_matrix[k][2:5]
                                                     - check_matrix[k][5:8])))
        matrices[:, i, j] = np.sum(k_list, axis=0)
NOTE: I changed your indices to have consistent zero-based indexing.
Here is another approach where I replaced the k-loop with a vectorized version:
for i in range(2):
    for j in range(2):
        k = np.logical_and(check_matrix[:, 0] == i, check_matrix[:, 1] == j)
        temp = np.dot(check_matrix[k, 2:5] - check_matrix[k, 5:8],
                      q[:, :, np.newaxis])[..., 0]
        temp = check_matrix[k, 8:] * np.exp(-1J * temp)
        matrices[:, i, j] = np.sum(temp, axis=0)
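As a quick sanity check, for q = (0, 0, 0) the exponential factor is exp(0) = 1, so the first matrix should reduce to the plain per-group sums of column 8 (the same sums the answer below computes):

print(matrices[0].real)
# [[-0.238726  1.126557]
#  [ 0.023742  0.21394 ]]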
3 line solution
You asked for an efficient solution in your original title, so how about this solution that avoids nested loops and if statements in a three-liner, and is thus hopefully faster?
fac=2*(check_matrix[:,0]-1)+(check_matrix[:,1]-1)
grp=np.split(check_matrix[:,8], np.cumsum(np.unique(fac,return_counts=True)[1])[:-1])
[np.sum(x) for x in grp]
output:
[-0.23872600000000002, 1.126557, 0.023742000000000003, 0.21394]
How does it work?
I combine the first two columns into a single index, treating each as "bits" (i.e. base 2)
fac=2*(check_matrix[:,0]-1)+(check_matrix[:,1]-1)
(If you have indices that exceed 2, you can still use this technique, but you will need a different base to combine the columns; i.e. if your indices go from 1 to 18, you would need to multiply column 0 by a number equal to or larger than 18 instead of 2.)
So the result of the first line is
array([0., 0., 1., 2., 2., 3.])
Note as well that it assumes the data is ordered so that the second column changes fastest; if this is not the case, you will need an extra step to sort the index and the original check matrix (see the sketch below this answer). In your example the data is ordered.
The next step groups the data according to the index, and uses the solution posted here.
np.split(check_matrix[:,8], np.cumsum(np.unique(fac,return_counts=True)[1])[:-1])
[array([-0.243293, 0.004567]), array([1.126557]), array([ 0.038934, -0.015192]), array([0.21394])]
i.e. it outputs column 8 of check_matrix grouped according to fac.
Then the last line simply sums those groups. Knowing how the first two columns were combined to give the single index allows you to map the results back; or you could simply append the per-row group sums as an extra column of check_matrix if you wanted.
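A minimal sketch of that extra sorting step, assuming check_matrix rows may arrive in any order:

import numpy as np

fac = 2 * (check_matrix[:, 0] - 1) + (check_matrix[:, 1] - 1)

# sort rows by group before splitting; 'stable' keeps ties in input order
order = np.argsort(fac, kind='stable')
fac_sorted = fac[order]
col8_sorted = check_matrix[order, 8]

counts = np.unique(fac_sorted, return_counts=True)[1]
grp = np.split(col8_sorted, np.cumsum(counts)[:-1])
print([np.sum(g) for g in grp])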

Matrix Multiplication: Multiply each row of matrix by another 2D matrix in Python

I am trying to remove the loop from this matrix multiplication (and to learn more about optimizing code in general). I think I need some form of np.broadcasting or np.einsum, but after reading up on them I'm still not sure how to apply them to my problem.
A = np.array([[ 1,  2,  3,  4,  5],
              [ 6,  7,  8,  9, 10],
              [11, 12, 13, 14, 15]])
# A is a 3x5 matrix, so the shape of A is (3, 5) (and A[0] is (5,))
B = np.array([[1, 0, 0],
              [0, 2, 0],
              [0, 0, 3]])
# B is a 3x3 (diagonal) matrix, with a shape of (3, 3)
C = np.zeros(5)
for i in range(5):
    C[i] = np.linalg.multi_dot([A[:, i].T, B, A[:, i]])
# Each step is [1x3]*[3x3]*[3x1], which becomes a scalar value in each entry
# C becomes a [5x1] matrix with a shape of (5,)
I know I can't just use np.linalg.multi_dot by itself, because that results in a (5, 5) array.
I also found this: Multiply matrix by each row of another matrix in Numpy, but I can't tell if it's actually the same problem as mine.
In [601]: C
Out[601]: array([436., 534., 644., 766., 900.])
This is a natural fit for einsum. I use i, as you do, to denote the index that carries through to the result; j and k are indices that are used in the sum of products.
In [602]: np.einsum('ji,jk,ki->i',A,B,A)
Out[602]: array([436, 534, 644, 766, 900])
It can probably also be done with matmul, though that may require adding a dimension and later squeezing it out.
dot approaches that use diag do a lot more work than necessary; the diag throws away most of the computed values.
To use matmul we have to make the i dimension the first axis of 3-D arrays; that's the 'passive' one that carries over to the result:
In [603]: A.T[:,None,:] @ B @ A.T[:,:,None]
Out[603]:
array([[[436]],     # (5,1,1) result
       [[534]],
       [[644]],
       [[766]],
       [[900]]])
In [604]: (A.T[:,None,:] @ B @ A.T[:,:,None]).squeeze()
Out[604]: array([436, 534, 644, 766, 900])
Or index the extra dimensions away: (A.T[:,None,:] @ B @ A.T[:,:,None])[:,0,0]
You can chain two calls to dot together, then take the diagonal:
# your original output:
# >>> C
# array([436., 534., 644., 766., 900.])
>>> np.diag(np.dot(np.dot(A.T,B), A))
array([436, 534, 644, 766, 900])
Or equivalently, use your original multi_dot train of thought, but take the diagonal of the resulting 5x5 array. This may give some performance boost (according to the docs):
>>> np.diag(np.linalg.multi_dot([A.T, B, A]))
array([436, 534, 644, 766, 900])
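Another loop-free option, a small sketch that never forms the 5x5 intermediate: multiply element-wise and sum down the columns, which computes the same sum of products as the einsum above:

C = np.sum(A * (B @ A), axis=0)
print(C)  # [436 534 644 766 900]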
To add to the answers: if you want to multiply the matrices element-wise, you can make use of broadcasting. Note this is element-wise multiplication, not a dot product; for dot products, use the methods above.
B[..., None] * A
Gives:
array([[[ 1,  2,  3,  4,  5],
        [ 0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0]],

       [[ 0,  0,  0,  0,  0],
        [12, 14, 16, 18, 20],
        [ 0,  0,  0,  0,  0]],

       [[ 0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0],
        [33, 36, 39, 42, 45]]])

scipy equivalent of numpy.prod() for sparse matrices

I am looking for an equivalent of numpy.prod to be used with the sparse representations that scipy offers (scipy.sparse). Specifically, I'm trying to compute the product along a single axis. I can do it by first converting to dense (M.todense().prod(axis=0)), but am looking for something more efficient.
For a prod reduction along each column, i.e. axis=0, we would only have non-zero output for columns that have all non-zeros. We can use that fact to hand-roll a custom version, like so -
import numpy as np

def sparse_prod_axis0(A):
    # Valid mask of columns that have all non-zeros
    valid_mask = A.getnnz(axis=0) == A.shape[0]  # Thanks to @hpaulj on this!
    # Initialize o/p array of zeros
    out = np.zeros(A.shape[1], dtype=A.dtype)
    # Set valid positions with the prod of each fully populated col
    out[valid_mask] = np.prod(A[:, valid_mask].A, axis=0)
    return np.matrix(out)
Sample run -
In [92]: from scipy.sparse import csr_matrix
...: a = np.random.randint(0,4,(5,10))
...: A = csr_matrix(a)
...:
In [93]: (A.todense().prod(axis=0))
Out[93]: matrix([[ 0, 0, 6, 48, 0, 0, 0, 0, 72, 0]])
In [94]: sparse_prod_axis0(A)
Out[94]: matrix([[ 0, 0, 6, 48, 0, 0, 0, 0, 72, 0]])
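A sketch of the same idea for the other axis, under the assumption that a row-wise product is wanted (only rows with no stored zeros can have a non-zero product):

import numpy as np

def sparse_prod_axis1(A):
    # Rows that are fully populated are the only ones with non-zero products
    valid_mask = A.getnnz(axis=1) == A.shape[1]
    out = np.zeros(A.shape[0], dtype=A.dtype)
    # Densify just the fully populated rows and reduce them
    out[valid_mask] = np.prod(A[valid_mask].toarray(), axis=1)
    return np.matrix(out).T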
