I have a numpy array which has 100 rows and 16026 columns. I have to find the median of every column. So median for every column will be calculated from 100 observations (100 rows in this case). I am using the following code to achieve this:
for category in categories:
    indices = np.random.randint(0, len(os.listdir(filepath + category)) - 1, 100)
    tempArray = X_train[indices, ]
    medArray = np.median(tempArray, axis=0)
    print(medArray.shape)
And here's the output that I get:
(100, 16026)
(100, 16026)
(100, 16026)
(100, 16026)
My question is - why is the shape of medArray 100*16026 and not 1*16026? Because I am calculating the median of every column, I would expect only one row with 16026 columns. What am I missing here?
Please note that X_train is a sparse matrix.
X_train.shape
output:
(2034, 16026)
Any help in this regard is much appreciated.
Edit:
The above problem has been solved by toarray() function.
tempArray = X_train[indices, ].toarray()
I also realized that I was including all the zeros in my median calculation, which is why I was always getting 0 as the median. Is there an easy way to calculate the median while ignoring the zeros in every column?
That's really strange. I think you should get (16026,). Are we missing something here?
In [241]:
X_train=np.random.random((1000,16026)) #1000 can be any int.
indices = np.random.randint(0, 60, 100) #60 can be any int.
tempArray = X_train[indices, ]
medArray = np.median(tempArray, axis=0)
print(medArray.shape)
(16026,)
And the only way you can get a 2D result is when the input is a 3D array:
In [243]:
X_train=np.random.random((100,2,16026))
indices = np.random.randint(0, 60, 100)
tempArray = X_train[indices, ]
medArray = np.median(tempArray, axis=0)
print(medArray.shape)
(2, 16026)
When it is a sparse array, a dumb way to get around this might be:
In [319]:
X_train = sparse.rand(112, 16026, 0.5, 'csr') #just make up a random sparse array
indices = np.random.randint(0, 60, 100)
tempArray = X_train[indices, ]
medArray = np.median(tempArray.toarray(), axis=0)
print(medArray.shape)
(16026,)
.toarray() could also go on the third line instead. Either way, this means the zeros are also counted, as @zhangxaochen pointed out.
Out of ideas, there may be better explanations for it.
The problem is that NumPy doesn't recognize sparse matrices as arrays or array-like objects. For example, calling asanyarray on a sparse matrix returns a 0D array whose one element is the original sparse matrix:
In [8]: numpy.asanyarray(scipy.sparse.csc_matrix([[1,2,3],[4,5,6]]))
Out[8]:
array(<2x3 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Column format>, dtype=object)
Like most of NumPy, numpy.median relies on having an array or array-like object as input. The routines it relies on, particularly sort, won't understand what they're looking at if you give it a sparse matrix.
I was finally able to solve this. I used masked arrays and the following code:
sample_size = 50
# i is the index of the current newsgroup category (set by an enclosing loop)
idx = np.flatnonzero(newsgroups_train.target == i)  # rows belonging to category i
# draw sample_size row indices from this category, with replacement
sample = np.random.choice(idx, size=sample_size)
dense = X_train[sample].toarray()
y = np.ma.masked_where(dense == 0, dense)  # mask the zeros so they are ignored
medArray = np.ma.median(y, axis=0).filled(0)
print('============median ' + newsgroups_train.target_names[i] + '=============')
order = np.argsort(medArray)[::-1][:10]  # indices of the 10 largest medians
# newer scikit-learn versions use get_feature_names_out() instead
words = np.array(vectorizer.get_feature_names())
for k in order:
    print(words[k] + ':' + str(medArray[k]))
This gave me the median ignoring zeros.
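For anyone who wants to try the masked-array idea without the newsgroups data, here is a minimal sketch on a small made-up dense array. The array `dense` and its values are invented for illustration; with a sparse matrix you would call .toarray() on the selected rows first.

```python
import numpy as np

# Made-up data: zeros stand for "no value" and should not count
dense = np.array([[0.0, 2.0, 0.0],
                  [4.0, 0.0, 0.0],
                  [6.0, 8.0, 0.0]])

# The plain median counts the zeros
plain = np.median(dense, axis=0)

# Mask the zeros, take the median of what is left, and
# fill columns that were entirely zero with 0
masked = np.ma.masked_equal(dense, 0)
nonzero_median = np.ma.median(masked, axis=0).filled(0)

print(plain)
print(nonzero_median)
```

The last column is all zeros, so it is fully masked and only gets a value because of .filled(0).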
I want to calculate the mean of a 3D array along two axes and subtract this mean from the array.
In Matlab I use the repmat function to achieve this as follows
% A is an array of size 100x50x100
mean_A = mean(mean(A,3),1); % mean_A is 1D of length 50
Am = repmat(mean_A,[100,1,100]) % Am is 3D 100x50x100
flc_A = A - Am % flc_A is 3D 100x50x100
Now, I am trying to do the same with python.
mean_A = numpy.mean(numpy.mean(A,axis=2),axis=0);
gives me the 1D array. However, I cannot find a way to copy this to form a 3D array using numpy.tile().
Am I missing something or is there another way to do this in python?
You could set keepdims to True in both cases so the resulting shape is broadcastable and use np.broadcast_to to broadcast to the shape of A:
np.broadcast_to(np.mean(np.mean(A,2,keepdims=True),axis=0,keepdims=True), A.shape)
Note that you can also specify a tuple of axes along which to take the successive means:
np.broadcast_to(np.mean(A,axis=tuple([2,0]), keepdims=True), A.shape)
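If the only goal is to subtract the mean, the explicit 3D copy may not be needed at all, since the (1, 50, 1) result already broadcasts against A. A small sketch, assuming A has the 100x50x100 shape from the question (the random data is just a placeholder):

```python
import numpy as np

A = np.random.rand(100, 50, 100)

# Mean over axes 0 and 2, kept as shape (1, 50, 1) so it broadcasts against A
mean_A = A.mean(axis=(0, 2), keepdims=True)

# Broadcasting expands mean_A on the fly; no repmat-style copy is made
flc_A = A - mean_A

print(mean_A.shape)
print(flc_A.shape)
```

After the subtraction, the mean of flc_A over axes 0 and 2 should be zero up to floating-point error.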
numpy.tile does not behave the same way as MATLAB's repmat. You could refer to this question. However, there is an easy way to reproduce what you did in MATLAB, and you don't really need to understand how numpy.tile works to use it.
import numpy as np
A = np.random.rand(100, 50, 100)
# keep the dims of the array when calculating mean values
B = np.mean(A, axis=2, keepdims=True)
C = np.mean(B, axis=0, keepdims=True) # now the shape of C is (1, 50, 1)
# then simply duplicate C in the first and the third dimensions
D = np.repeat(C, 100, axis=0)
D = np.repeat(D, 100, axis=2)
D is the 3D array you want.
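For completeness, numpy.tile can also do this once the 1D mean is given two extra axes of length 1. This is only a sketch under the shapes from the question; the name Am is borrowed from the MATLAB snippet.

```python
import numpy as np

A = np.random.rand(100, 50, 100)
mean_A = np.mean(np.mean(A, axis=2), axis=0)   # 1D, length 50

# Reshape mean_A to (1, 50, 1), then repeat it 100 times along axes 0 and 2
Am = np.tile(mean_A[np.newaxis, :, np.newaxis], (100, 1, 100))

flc_A = A - Am
print(Am.shape)
```

The tile call fails to give the wanted shape on the raw 1D array because tile treats a 1D input as lying along the last axis, not the middle one.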
I am trying to sample with replacement a base 2D numpy array with shape of (4,2) by rows, say 10 times. The final output should be a 3D numpy array.
Have tried the code below, it works. But is there a way to do it without the for loop?
base=np.array([[20,30],[50,60],[70,80],[10,30]])
print(np.shape(base))
nsample=10
tmp=np.zeros((np.shape(base)[0],np.shape(base)[1],10))
for i in range(nsample):
    id_pick = np.random.choice(np.shape(base)[0], size=(np.shape(base)[0]))
    print(id_pick)
    boot1=base[id_pick,:]
    tmp[:,:,i]=boot1
print(tmp)
Here's one vectorized approach -
m,n = base.shape
idx = np.random.randint(0,m,(m,nsample))
out = base[idx].swapaxes(1,2)
The basic idea is that we generate all the random row indices at once with np.random.randint as idx. That would be an array of shape (m,nsample). We use this array to index into the input array along the first axis, so it selects random rows of base. Indexing gives a result of shape (m,nsample,n), so to get the final output with shape (m,n,nsample) we need to swap the last two axes.
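Here is the same idea as a self-contained sketch using the base array from the question, with a check that each slice along the last axis is one bootstrap replicate:

```python
import numpy as np

base = np.array([[20, 30], [50, 60], [70, 80], [10, 30]])
nsample = 10
m, n = base.shape

# Column i of idx holds the m row indices drawn for replicate i
idx = np.random.randint(0, m, (m, nsample))

# base[idx] has shape (m, nsample, n); swap to get (m, n, nsample)
out = base[idx].swapaxes(1, 2)

print(out.shape)
# Replicate i is out[:, :, i], which equals base[idx[:, i]]
print(np.array_equal(out[:, :, 0], base[idx[:, 0]]))
```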
You can use the stack function from numpy. Your code would then look like:
base=np.array([[20,30],[50,60],[70,80],[10,30]])
print(np.shape(base))
nsample=10
tmp = []
for i in range(nsample):
    id_pick = np.random.choice(np.shape(base)[0], size=(np.shape(base)[0]))
    print(id_pick)
    boot1=base[id_pick,:]
    tmp.append(boot1)
tmp = np.stack(tmp, axis=-1)
print(tmp)
Based on @Divakar's answer, if you already know the shape of this 2D array, you can treat it as an (8,) 1D array while bootstrapping, and then reshape it. Note that this resamples individual elements rather than whole rows:
m, n = base.shape
numReps = 10  # number of bootstrap replicates
flatbase = np.reshape(base, (m*n,))
idxs = np.random.choice(m*n, (numReps, m*n))
bootflats = flatbase[idxs]
boots = np.reshape(bootflats, (numReps, m, n))
I am trying to get a 2d array, by randomly generating its rows and appending
import numpy as np
my_nums = np.array([])
for i in range(100):
    x = np.random.rand(2, 1)
    my_nums = np.append(my_nums, np.array(x))
But I do not get what I want but instead get a 1d array.
What is wrong?
Transposing x did not help either.
You could do this by using np.append(axis=0) or np.vstack. This however requires the rows appended to have the same length as the rows already in the array.
You cannot use the same code to append a row with two values to an empty array, and to append a row to an already existing 2D array: numpy will throw a
ValueError: all the input arrays must have same number of dimensions.
You could initialize my_nums to work around this:
my_nums = np.random.rand(1, 2)
for i in range(99):
    x = np.random.rand(1, 2)
    my_nums = np.append(my_nums, x, axis=0)
Note the decrease in the range by one due to the initialization row. Also note that I changed the dimensions to (1, 2) to get actual row vectors.
Much easier than appending row-wise will of course be to create the array in the wanted final shape:
my_nums = np.random.rand(100, 2)
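If the rows really have to be produced one at a time (say each one depends on some other computation), a common pattern is to collect them in a Python list and convert once at the end. A minimal sketch, with the name rows made up for the example:

```python
import numpy as np

rows = []
for i in range(100):
    x = np.random.rand(2)   # one row with two values
    rows.append(x)

# Stack the collected rows into a single (100, 2) array
my_nums = np.vstack(rows)
print(my_nums.shape)
```

This sidesteps the empty-array problem and avoids np.append copying the whole array on every iteration.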
In numpy I have a 3d array and I would like to remove some of the 2d subarrays. Think about it like this:
r = range(27)
arr = np.reshape(r, (3,3,3))
rm_idx = [[0,1,2],[0,0,2]]
flatSeam = np.ravel_multi_index(rm_idx, arr.shape)
arr = np.delete(arr, flatSeam)
So at the end I would like to have an array of the shape (3,2,3) without the elements 00, 10, 22 from the original array. My problem is that I cannot use ravel_multi_index for this, because my indices are 2d and the array shape is 3d, so the wrong indices are calculated (the code above also does not execute because the indices array and the shape have to be the same size).
Do you have any ideas how I can achieve this?
Here's an approach using advanced-indexing -
# arr: Input array, rm_idx : 2-row list/array of indices to be removed
m,n,p = arr.shape
mask = np.asarray(rm_idx[1])[:,None] != np.arange(n)
out = arr[np.arange(m)[:,None],np.where(mask)[1].reshape(m,-1)]
Alternatively, with boolean-indexing -
out = arr.reshape(-1,p)[mask.ravel()].reshape(m,-1,p)
A slightly less memory-intensive approach that avoids creating the 2D mask -
vmask = ~np.in1d(np.arange(m*n),rm_idx[1] + n*np.arange(m))
out = arr.reshape(-1,p)[vmask].reshape(m,-1,p)
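To see it on the example from the question, here is a small runnable sketch of the boolean-indexing variant. It assumes, as the snippets above do, that rm_idx[0] is simply 0..m-1 in order, so exactly one 2D subarray is removed per first-axis index.

```python
import numpy as np

arr = np.arange(27).reshape(3, 3, 3)
rm_idx = [[0, 1, 2], [0, 0, 2]]   # remove arr[0,0], arr[1,0] and arr[2,2]

m, n, p = arr.shape
# mask[i, j] is True where arr[i, j] should be kept
mask = np.asarray(rm_idx[1])[:, None] != np.arange(n)
out = arr.reshape(-1, p)[mask.ravel()].reshape(m, -1, p)

print(out.shape)
print(out[2])   # rows 0 and 1 of arr[2]; row 2 was removed
```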
I have a 2d array and a 1d array, and I need to multiply each element of the 1d array by every element in the corresponding column of the 2d array. It's basically a matrix multiplication but numpy won't allow matrix multiplication because of the 1d array. This is because matrices are inherently 2d in numpy. How can I get around this problem? This is an example of what I want:
FrMtx = np.zeros(shape=(24,24)) #2d array
elem = np.zeros(24, dtype=float) #1d array
Result = np.zeros(shape=(24,24), dtype=float) #2d array to store results
for i in range(24):
    for j in range(24):
        Result[i][j] = FrMtx[i][j] * elem[j]
Numerous efforts have given me errors such as arrays used as indices must be of integer or boolean type
Due to the NumPy broadcasting rules, a simple
Result = FrMtx * elem
will give the desired result.
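A quick way to convince yourself is to compare the broadcast product against the explicit double loop from the question. The sketch below uses a small non-square array with made-up values so the direction of the multiplication is visible:

```python
import numpy as np

FrMtx = np.arange(12, dtype=float).reshape(3, 4)
elem = np.array([1.0, 10.0, 100.0, 1000.0])   # one factor per column

# Broadcasting: elem is matched against the last axis of FrMtx
Result = FrMtx * elem

# The explicit loop from the question, for comparison
expected = np.zeros_like(FrMtx)
for i in range(FrMtx.shape[0]):
    for j in range(FrMtx.shape[1]):
        expected[i, j] = FrMtx[i, j] * elem[j]

print(np.array_equal(Result, expected))
```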
You should be able to just multiply your arrays together, but it's not immediately obvious in which 'direction' the arrays will be multiplied, since the matrix is square. To be more explicit about which axes are being multiplied, I find it helpful to always multiply arrays that have the same number of dimensions.
For example, to multiply the columns:
mtx = np.zeros(shape=(5,7))
col = np.zeros(shape=(5,))
result = mtx * col.reshape((5, 1))
By reshaping col to (5,1), we guarantee that axis 0 of mtx is multiplied against axis 0 of col. To multiply rows:
mtx = np.zeros(shape=(5,7))
row = np.zeros(shape=(7,))
result = mtx * row.reshape((1, 7))
This guarantees that axis 1 in mtx is multiplied by axis 0 in row.
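The same thing can be written with np.newaxis instead of reshape, which avoids hard-coding the sizes. A small sketch with made-up values so the two directions are easy to tell apart:

```python
import numpy as np

mtx = np.ones((2, 3))
col = np.array([1.0, 2.0])           # one factor per row of mtx (axis 0)
row = np.array([10.0, 20.0, 30.0])   # one factor per column of mtx (axis 1)

by_col = mtx * col[:, np.newaxis]    # same as col.reshape((2, 1))
by_row = mtx * row[np.newaxis, :]    # same as row.reshape((1, 3))

print(by_col)
print(by_row)
```

Here by_col has constant rows (all 1s, then all 2s), while by_row repeats 10, 20, 30 in every row.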