I am working with a 3D matrix in Python. For example, here is a matrix of size 2x3x4:
[[[1 2 1 4]
[3 2 1 1]
[4 3 1 4]]
[[2 1 3 3]
[1 4 2 1]
[3 2 3 3]]]
My task is to find the entropy of each row in each dimension of the matrix. For example, for row 1 of dimension 1 of the matrix above, [1, 2, 1, 4], the normalized values (so that the total sum is 1) are [0.125, 0.25, 0.125, 0.5], and the entropy is calculated with the formula -sum(i*log(i)), where i is a normalized value. The resulting matrix is a 2x3 matrix, with 3 entropy values in each dimension (because there are 3 rows).
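For instance, checking that single row directly (a minimal sketch, using the natural log as scipy does):
import numpy as np
row = np.array([1, 2, 1, 4])
p = row / float(row.sum())     # [0.125, 0.25, 0.125, 0.5]
print(-np.sum(p * np.log(p)))  # ~1.213, the entropy of this row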
Here is a working example of my code, using a random matrix each time:
from scipy.stats import entropy
import numpy as np
matrix = np.random.randint(low=1, high=5, size=(2, 3, 4))  # what if size is (200, 50, 1000)?
entropy_matrix = np.zeros((matrix.shape[0], matrix.shape[1]))
for i in range(matrix.shape[0]):
    normalized = np.array([float(k)/np.sum(j) for j in matrix[i] for k in j]).reshape(matrix.shape[1], matrix.shape[2])
    entropy_matrix[i] = np.array([entropy(m) for m in normalized])
My question is: how do I scale up this program to work with a very large 3D matrix (for example of size 200x50x1000)?
I am using Python on Windows 10 (with the Anaconda distribution).
Using a 3D matrix of size 200x50x1000, I get a running time of 290 s on my computer.
Using the definition of entropy for the second part and a broadcasted operation for the first part, one vectorized solution would be:
p1 = matrix/matrix.sum(-1,keepdims=True).astype(float)
entropy_matrix_out = -np.sum(p1 * np.log(p1), axis=-1)
Alternatively, we can use einsum for the second part for a further performance boost:
entropy_matrix_out = -np.einsum('ijk,ijk->ij',p1,np.log(p1),optimize=True)
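As a quick check on a small random array, both vectorized versions agree with the loop-based scipy.stats.entropy approach (a sketch; scipy's entropy normalizes its input and uses the natural log, matching np.log here):
import numpy as np
from scipy.stats import entropy
matrix = np.random.randint(low=1, high=5, size=(2, 3, 4))
p1 = matrix / matrix.sum(-1, keepdims=True).astype(float)
vectorized = -np.sum(p1 * np.log(p1), axis=-1)
looped = np.array([[entropy(row) for row in block] for block in matrix])
print(np.allclose(vectorized, looped))  # expected: True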
I am comparing the Jaccard distance matrix I get when I process a dataset using pdist and a DIY Jaccard distance matrix function. I'm getting different results in my output distance matrices and I'm not sure why.
I think one of the following is the cause:
My implementation of jaccard distance calculation is wrong
scipy.spatial.distance.pdist(metric = 'jaccard') and scipy.spatial.distance.jaccard calculate Jaccard distance in different ways (seems unlikely, as they're both in scipy.spatial.distance)
squareform is doing something to my data, potentially a normalisation
The docs for squareform go a bit over my head, so some form of normalisation might be what's happening. However, the squareform-ed distance matrix does not have the same relative distance magnitudes between cells, which is confusing (e.g. row 0 in my DIY distance matrix is 0, 0.571429, 1, while with pdist it is 0, 1, 1; the middle value is much higher with pdist).
Can anyone explain why I'm getting a different distance matrix when it's being analysed with the same metric?
My code:
import numpy as np
from scipy.spatial.distance import jaccard, squareform, pdist
def jaccard_dissimilarity(feature_list1, feature_list2, filler_val):  # binary
    # I don't care about every value in the array for my use case, so I don't want to include them in the comparison
    all_features = set([i for i in feature_list1 if i != filler_val])
    all_features.update(set([i for i in feature_list2 if i != filler_val]))
    counts_1 = [1 if feature in feature_list1 else 0 for feature in all_features]
    counts_2 = [1 if feature in feature_list2 else 0 for feature in all_features]
    return jaccard(counts_1, counts_2)
data_array = np.array([[1, 2, 3, 4, 5],
                       [3, 4, 5, 6, 7],
                       [8, 9, 10, 11, 12]])
# =============================================================================
# DIY distance matrix
# =============================================================================
#set filler val to None, so the arrays being compared are equivalent to pdist
dist_diy = np.array([[jaccard_dissimilarity(a,b, None) for a in data_array] for b in data_array])
# =============================================================================
# pdist distance matrix
# =============================================================================
dist_pdist = squareform(pdist(data_array, metric = 'jaccard'))
Input array:
1 2 3 4 5
3 4 5 6 7
8 9 10 11 12
dist_diy:
0 0.571429 1
0.571429 0 1
1 1 0
dist_pdist:
0 1 1
1 0 1
1 1 0
It looks like pdist considers objects at a given index when comparing arrays, rather than just which objects are present in the array itself. If I change data_array[1] to 3, 4, 5, 4, 5, then the distance matrix changes to reflect the fact that data_array[0][3:5] == data_array[1][3:5]:
0 0.6 1
0.6 0 1
1 1 0
The behaviour is discussed here, but the arrays don't have to be boolean based on the above tests (if the arrays were treated as boolean then the distance matrix would not change, since all the numbers are non-zero and would therefore be treated as True).
The DIY function considered the objects present rather than the index at which those objects were found, hence the discrepancy!
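For illustration, here is a small sketch of that position-wise interpretation (my reading of the behaviour from the tests above, not scipy's actual source): the distance is the fraction of positions whose values disagree, out of the positions where at least one array is non-zero.
import numpy as np
u = np.array([1, 2, 3, 4, 5])
v = np.array([3, 4, 5, 4, 5])
either_nonzero = (u != 0) | (v != 0)
mismatch = (u != v) & either_nonzero
print(mismatch.sum() / float(either_nonzero.sum()))  # 0.6, matching the pdist result above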
This answer about performing outer addition with NumPy discusses the outer method of NumPy ufuncs as well as NumPy broadcasting; the examples there are summarized below.
In the case of addition, subtraction, multiplication and division, is the calculation the same under the hood, or will there be differences in performance, especially when the array size gets large?
Minimal, 2D example from the linked answer:
import numpy as np
a, b = np.arange(3), np.arange(5)
print(np.add.outer(a, b))
print(a[:, None] + b) # or a[:, np.newaxis] + b
both result in:
[[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]]
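For what it's worth, here is a rough way to compare the two (a sketch; exact timings will vary by machine and array size):
import numpy as np
from timeit import timeit
a, b = np.arange(1000), np.arange(2000)
t_outer = timeit(lambda: np.add.outer(a, b), number=100)
t_bcast = timeit(lambda: a[:, None] + b, number=100)
print("np.add.outer: %.4f s, broadcasting: %.4f s" % (t_outer, t_bcast))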
I am doing Principal Component Analysis (PCA) and I'd like to find out which features contribute the most to the result.
My intuition is to sum up all the absolute values of the individual contribution of the features to the individual components.
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1, 4, 1], [-2, -1, 4, 2], [-3, -2, 4, 3], [1, 1, 4, 4], [2, 1, 4, 5], [3, 2, 4, 6]])
pca = PCA(n_components=0.95, whiten=True, svd_solver='full').fit(X)
pca.components_
array([[ 0.71417303, 0.46711713, 0. , 0.52130459],
[-0.46602418, -0.23839061, -0. , 0.85205128]])
np.sum(np.abs(pca.components_), axis=0)
array([1.18019721, 0.70550774, 0. , 1.37335586])
This yields, in my eyes, a measure of importance of each of the original features. Note that the 3rd feature has zero importance, because I intentionally created a column that is just a constant value.
Is there a better "measure of importance" for PCA?
The measure of importance for PCA is explained_variance_ratio_. This array gives the percentage of variance explained by each component, sorted by importance in descending order. It sums to 1 when all components are used, or otherwise to the minimal possible value above the requested threshold. In your example you set the threshold to 95% (of the variance to be explained), so the array sums to 0.9949522861608583: the first component explains 92.021143% of the variance and the second 7.474085%, hence the 2 components you receive.
components_ is the array that stores the directions of maximum variance in the feature space. Its dimensions are n_components_ by n_features_. This is what you multiply the data point(s) by when applying transform() to get a reduced-dimensionality projection of the data.
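As a sanity check, transform() can be reproduced by hand (a sketch using the pca and X from the question; the division by the square root of explained_variance_ is only there because whiten=True was used when fitting):
import numpy as np
manual = np.dot(X - pca.mean_, pca.components_.T) / np.sqrt(pca.explained_variance_)
print(np.allclose(manual, pca.transform(X)))  # expected: True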
Update
To get the percentage contribution of the original features to each of the principal components, you just need to normalize components_, since they set the amount each original vector contributes to the projection.
r = np.abs(pca.components_.T)
r/r.sum(axis=0)
array([[0.41946155, 0.29941172],
[0.27435603, 0.15316146],
[0. , 0. ],
[0.30618242, 0.54742682]])
As you can see, the third feature does not contribute to the PCs.
If you need the total contribution of the original features to the explained variance, you need to take into account each PC contribution (i.e. explained_variance_ratio_):
ev = np.abs(pca.components_.T).dot(pca.explained_variance_ratio_)
ttl_ev = pca.explained_variance_ratio_.sum()*ev/ev.sum()
print(ttl_ev)
[0.40908847 0.26463667 0. 0.32122715]
If you just sum the PCs with np.sum(np.abs(pca.components_), axis=0), that assumes all PCs are equally important, which is rarely true. To use PCA for crude feature selection, sum only after discarding low-contribution PCs and/or after scaling the PCs by their relative contributions.
Here is a visual example that highlights why a plain sum doesn't work as desired.
Given 3 observations of 20 features (visualized as three 5x4 heatmaps):
>>> print(X.T)
[[2 1 1 1 1 1 1 1 1 4 1 1 1 4 1 1 1 1 1 2]
[1 1 1 1 1 1 1 1 1 4 1 1 1 6 3 1 1 1 1 2]
[1 1 1 2 1 1 1 1 1 5 2 1 1 5 1 1 1 1 1 2]]
These are the resulting PCs:
>>> pca = PCA(n_components=None, whiten=True, svd_solver='full').fit(X.T)
Note that PC3 has high magnitude at (2,1), but if we check its explained variance, it offers ~0 contribution:
>>> pca.explained_variance_ratio_
array([0.6638886943392722, 0.3361113056607279, 2.2971091700327738e-32])
This causes a feature selection discrepancy when summing the unscaled PCs (left) vs summing the PCs scaled by their explained variance ratios (right):
>>> unscaled = np.sum(np.abs(pca.components_), axis=0)
>>> scaled = np.sum(pca.explained_variance_ratio_[:, None] * np.abs(pca.components_), axis=0)
With the unscaled sum (left), the meaningless PC3 is still given 33% weight. This causes (2,1) to be considered the most important feature, but if we look back to the original data, (2,1) offers low discrimination between observations.
With the scaled sum (right), PC1 and PC2 respectively have 66% and 33% weight. Now (3,1) and (3,2) are the most important features which actually tracks with the original data.
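As a numeric companion to the pictures, here is a small sketch (my own addition, reusing unscaled and scaled from above) that reports which of the 20 features each weighting ranks highest; mapping an index back to a heatmap cell assumes the features were laid out on the 5x4 grid in row-major order:
>>> import numpy as np
>>> print(np.argmax(unscaled), np.argmax(scaled))        # top-ranked feature index under each weighting
>>> print(np.unravel_index(np.argmax(unscaled), (5, 4))) # corresponding heatmap cell (row-major assumption)
>>> print(np.unravel_index(np.argmax(scaled), (5, 4)))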
I have a data array of 30 trials (columns), each of 256 data points (rows), and would like to run a wavelet transform (which requires a 1D array) on each column, with the eventual aim of obtaining the mean coefficients of the 30 trials.
Can someone point me in the right direction please?
If you have a multidimensional numpy array then you can use a for loop:
import numpy as np
A = np.array([[1,2,3], [4,5,6]])
# A is the matrix: 1 2 3
# 4 5 6
for col in A.transpose():
    print("Column:", col)
    # Perform your wavelet transform here; you can save the
    # results to another multidimensional array.
This gives you access to each column as a 1D array.
Output:
Column: [1 4]
Column: [2 5]
Column: [3 6]
If you want to access the rows rather than the columns then loop through A rather than A.transpose().
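If the wavelet library in question is PyWavelets (an assumption, since the question does not name one), a minimal sketch for collecting and averaging the per-trial coefficients could look like this:
import numpy as np
import pywt  # PyWavelets, assumed here for the wavelet transform
data = np.random.rand(256, 30)        # 256 data points (rows) x 30 trials (columns), as in the question
approx, detail = [], []
for col in data.transpose():          # each column is one 1D trial
    cA, cD = pywt.dwt(col, 'db1')     # single-level DWT; the wavelet choice is only illustrative
    approx.append(cA)
    detail.append(cD)
mean_cA = np.mean(approx, axis=0)     # mean approximation coefficients over the 30 trials
mean_cD = np.mean(detail, axis=0)     # mean detail coefficients over the 30 trials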
I have a classified raster (with n classes) that I am reading into a numpy array.
I want to use a 2D moving window (e.g. 3 by 3) to create an n-dimensional vector that stores the % cover of each class within the window. Because the raster is large, it would be useful to store this information so as not to re-compute it each time; therefore I think the best solution is to create a 3D array to act as the vector. A new raster will be created based on these %/count values.
My idea is to:
1) create a 3D array of n+1 'bands'
2) band 1 = the original classified raster; each other 'band' = the count of cells of one class value within the window (i.e. one band per class). For example:
[[2 0 1 2 1]
[2 0 2 0 0]
[0 1 1 2 1]
[0 2 2 1 1]
[0 1 2 1 1]]
[[2 2 3 2 2]
[3 3 3 2 2]
[3 3 2 2 2]
[3 3 0 0 0]
[2 2 0 0 0]]
[[0 1 1 2 1]
[1 3 3 4 2]
[1 2 3 4 3]
[2 3 5 6 5]
[1 1 3 4 4]]
[[2 3 2 2 1]
[2 3 3 3 2]
[2 4 4 3 1]
[1 3 5 3 1]
[1 3 3 2 0]]
4) read these bands into a VRT so it only needs to be created once and can be read in by further modules.
Question: what is the most efficient 'moving window' method to 'count' within the window?
Currently I am trying, and failing, with the following code:
import numpy as np
from osgeo import gdal, gdal_array
from scipy import ndimage

def lcc_binary_vrt(raster, dim, bands):
    footprint = np.ones((dim, dim), dtype=int)
    g = gdal.Open(raster)
    data = gdal_array.DatasetReadAsArray(g)
    # loop through the band values
    for i in bands:
        print(i)
        # create a duplicate '0' array of the raster
        a_band = data*0
        # we create the binary dataset for the band
        a_band = np.where(data == i, 1, a_band)
        count_a_band_fname = raster[:-4] + '_' + str(i) + '.tif'
        # run the moving window (footprint) across the band to create a 'count'
        count_a_band = ndimage.generic_filter(a_band, np.count_nonzero, footprint=footprint, mode='constant')
        geoTiff.create(count_a_band_fname, g, data, count_a_band, gdal.GDT_Byte, np.nan)
Any suggestions very much appreciated.
Becky
I don't know anything about the spatial sciences stuff, so I'll just focus on the main question :)
what is the most efficient 'moving window' method to 'count' within the window?
A common way to do moving window statistics with Numpy is to use numpy.lib.stride_tricks.as_strided, see for example this answer. Basically, the idea is to make an array containing all the windows, without any increase in memory usage:
from numpy.lib.stride_tricks import as_strided
...
m, n = a_band.shape
newshape = (m - dim + 1, n - dim + 1, dim, dim)
newstrides = a_band.strides * 2  # strides is a tuple
count_a_band = as_strided(a_band, newshape, newstrides).sum(axis=(2, 3))
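As a quick sanity check on a tiny array (a sketch), the strided window sums match a naive double loop:
import numpy as np
from numpy.lib.stride_tricks import as_strided
a = np.arange(16).reshape(4, 4)
dim = 3
windows = as_strided(a, (a.shape[0] - dim + 1, a.shape[1] - dim + 1, dim, dim), a.strides * 2)
print(windows.sum(axis=(2, 3)))
print(np.array([[a[i:i+dim, j:j+dim].sum() for j in range(2)] for i in range(2)]))  # same result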
However, for your use case this method is inefficient, because you're summing the same numbers over and over again, especially if the window size increases. A better way is to use a cumsum trick, like in this answer:
def windowed_sum_1d(ar, ws, axis=None):
    if axis is None:
        ar = ar.ravel()
    else:
        ar = np.swapaxes(ar, axis, 0)
    ans = np.cumsum(ar, axis=0)
    ans[ws:] = ans[ws:] - ans[:-ws]
    ans = ans[ws-1:]
    if axis:
        ans = np.swapaxes(ans, 0, axis)
    return ans

def windowed_sum(ar, ws):
    for axis in range(ar.ndim):
        ar = windowed_sum_1d(ar, ws, axis)
    return ar
...
count_a_band = windowed_sum(a_band, dim)
Note that in both snippets above it would be tedious to handle the edge cases. Luckily, there is an easy way to include them and get the same efficiency as the second approach:
count_a_band = ndimage.uniform_filter(a_band, size=dim, mode='constant') * dim**2
Though very similar to what you already had, this will be much faster! The downside is that you may need to round to integers to get rid of floating point rounding errors.
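For example (a small sketch of my own), on a toy binary band the rounded filter output gives the per-pixel count of ones in each window:
import numpy as np
from scipy import ndimage
a_band = (np.random.randint(0, 3, (6, 6)) == 1).astype(float)  # toy binary band
dim = 3
counts = np.rint(ndimage.uniform_filter(a_band, size=dim, mode='constant') * dim**2).astype(int)
print(counts)  # count of ones in each 3x3 window, zero-padded at the edges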
As a final note, your code
# create a duplicate '0' array of the raster
a_band = data*0
# we create the binary dataset for the band
a_band = np.where(data == i, 1, a_band)
is a bit redundant: You can just use a_band = (data == i).
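Putting the pieces together, here is a minimal sketch of the per-class count loop (my own assembly of the suggestions above, using a random stand-in for the classified raster):
import numpy as np
from scipy import ndimage
data = np.random.randint(0, 3, (5, 5))    # stand-in for the classified raster
dim = 3                                   # moving-window size
classes = np.unique(data)
count_bands = []
for i in classes:
    a_band = (data == i).astype(float)    # binary mask for class i
    counts = ndimage.uniform_filter(a_band, size=dim, mode='constant') * dim**2
    count_bands.append(np.rint(counts).astype(int))
stacked = np.stack([data] + count_bands)  # original raster plus one count band per class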