I have two symmetric matrices: one is a correlation matrix, and the other is similar to a correlation matrix. Examples of these matrices are shown below:
Correlation Matrix (c):
A B C D
A 1 0.5 0.1 0.4
B 0.5 1 0.9 0.3
C 0.1 0.9 1 0.3
D 0.4 0.3 0.3 1
Other Matrix (z):
A B C D
A 3 2 2 2
B 2 3 3 2
C 2 3 3 2
D 2 2 2 3
I'm sorting the correlation matrix in descending order so I can look at the largest correlation values, using the following code:
c = corrMatrixMin10.abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
pd.DataFrame(so[so.values!=1].sort_values(ascending=False))
My question is as follows:
When I arrange the correlation matrix c in descending order, the matrix itself loses its shape. How do I get the other matrix z into the exact same order?
For example, the intersection of columns A and B in matrix c is 0.5, and the intersection of columns A and B in matrix z is 2. How can I preserve the association between these two values after sorting matrix c in descending order?
Any help would be greatly appreciated. TIA.
The code to generate the two matrices is as follows:
import pandas as pd

c = pd.DataFrame([[1, 0.5, 0.1, 0.4],
                  [0.5, 1, 0.9, 0.3],
                  [0.1, 0.9, 1, 0.3],
                  [0.4, 0.3, 0.3, 1]],
                 columns=list('ABCD'), index=list('ABCD'))
z = pd.DataFrame([[3, 2, 2, 2],
                  [2, 3, 3, 2],
                  [2, 3, 3, 2],
                  [2, 2, 2, 3]],
                 columns=list('ABCD'), index=list('ABCD'))
You can use Series.reindex:
c_series = c.unstack().drop([(x, x) for x in c]).sort_values(ascending=False)
z_series = z.unstack().reindex(c_series.index)
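From there you can, for example, view the paired values side by side (a small sketch continuing from the snippet above; the 'c' and 'z' column labels are just illustrative):

import pandas as pd

# Each sorted correlation next to the z value at the same (row, column) pair
paired = pd.concat([c_series.rename('c'), z_series.rename('z')], axis=1)
print(paired.head())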
Scenario:
I want to create a majority-vote system that takes into account the weight of each observer's vote about N observations.
So, M observers will give their guess about N observations, selecting from 3 classes (1, 2, 3). For each observation, each observer has an associated weight.
Defining:
G: Matrix of guesses per observation / observer (N observations × M observers);
W: Weights for each observation / observer (N observations × M observers).
Example:
# 2 observations, 3 observers
G = [[1, 2, 3],
[2, 2, 1]]
# Weights (influence) each observer has about each observation
W = [[0.1, 0.2, 0.3],
[0.3, 0.1, 0.2]]
I need to compute another matrix with shape (N observations × C classes) that stores the probability that an observation comes from a specific class.
Example using values above:
G = [[1, 2, 3],
[2, 2, 1]]
W = [[0.1, 0.2, 0.3],
[0.3, 0.1, 0.2]]
P = [[0.1, 0.2, 0.3],
[0.2, (0.3 + 0.1), 0]]
After computing the P matrix, I could apply np.argmax() row-wise to get the column (class) with the highest value:
P = [[0.1, 0.2, 0.3], #class 3 has highest value (0.3)
[0.2, 0.4, 0]] #class 2 has highest value (0.4)
result = [3, 2]
I would like to know how I can combine G and W to generate the P matrix.
You can get the job done in a vectorized manner by using NumPy's indices and advanced indexing:
In [569]: import numpy as np
In [570]: G = np.array([[1, 2, 3], [2, 2, 1]] )
In [571]: W = np.array([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]])
In [572]: C = 3
In [573]: M, N = G.shape  # note: here M counts observations and N counts observers, the reverse of the question's naming
In [574]: row, col = np.indices((M, N))
In [575]: P3d = np.zeros(shape=(M, N, C))
In [576]: P3d[row, col, G-1] = W
In [577]: P = P3d.sum(axis=1)
In [578]: P
Out[578]:
array([[0.1, 0.2, 0.3],
[0.2, 0.4, 0. ]])
Initialize P with zeros, then iterate over the observations (rows) of G: for observer j in observation i, the guessed class is G[i][j], so add the corresponding weight W[i][j] to P[i][G[i][j]-1]. In the sample test case, the first row is G[0] = [1, 2, 3], where each observer's guess happens to equal its column position plus one, so P[0] simply ends up equal to W[0].
For the second row, G[1] = [2, 2, 1], observers 0 and 1 both guessed class 2, so P[1][1] accumulates 0.3 + 0.1 = 0.4, and observer 2 guessed class 1, so P[1][0] gets 0.2. Once the P matrix is ready, use np.argmax() for the required answer, as sketched below.
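A minimal sketch of this loop-based approach, using the G, W, and C = 3 classes from the question:

import numpy as np

G = [[1, 2, 3], [2, 2, 1]]
W = [[0.1, 0.2, 0.3], [0.3, 0.1, 0.2]]
C = 3                      # number of classes
N, M = len(G), len(G[0])   # N observations, M observers

P = np.zeros((N, C))
for i in range(N):                      # each observation
    for j in range(M):                  # each observer
        P[i, G[i][j] - 1] += W[i][j]    # add the observer's weight to its guessed class

result = P.argmax(axis=1) + 1  # classes are 1-based
print(P)       # [[0.1 0.2 0.3], [0.2 0.4 0. ]]
print(result)  # [3 2]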
I have two dataframes: sd with one column and dd with three columns.
import pandas as pd

s = {0: [0, 0.3, 0.5, -0.1, -0.2, 0.7, 0]}
d = {0: [0.1, 0.2, -0.2, 0, 0, 0, 0], 1: [0.3, 0.4, -0.7, 0, 0.8, 0, 0.1], 2: [-0.5, 0.4, -0.1, 0.5, 0.5, 0, 0]}
sd = pd.DataFrame(data=s)
dd = pd.DataFrame(data=d)
result = pd.DataFrame()
I want to get the result dataframe (1 column) based on values in those two:
1. When the value in sd is 0, return 0.
2. When the value in sd is non-zero, check whether that row has at least one non-zero value in dd; if yes, return the average of the non-zero values; if no, return OK.
Here is what I would like to get:
results:
0 0
1 0.333
2 -0.333
3 0.5
4 0.65
5 OK
6 0
I know I can use dd[dd != 0].mean(axis=1) to calculate the mean of the non-zero values for each row, but I don't know how to combine all three conditions.
Using np.where twice (note that the result is a string array, since 'OK' and the numbers must share a single NumPy dtype):
np.where(sd[0]==0,0,np.where(dd.eq(0).all(1),'OK',dd.mask(dd==0).mean(1)))
Out[232]:
array(['0', '0.3333333333333333', '-0.3333333333333333', '0.5', '0.65',
'OK', '0'], dtype='<U32')
Using numpy.select:
c1 = sd[0].eq(0)
c2 = dd.eq(0).all(1)
res = np.select([c1, c2], [0, 'OK'], dd.where(dd.ne(0)).mean(1))
pd.Series(res)
0 0
1 0.3333333333333333
2 -0.3333333333333333
3 0.5
4 0.65
5 OK
6 0
dtype: object
Thank you for your help. I managed to do it in a somewhat different way.
I used:
res1 = pd.Series(np.where(sd[0]==0, 0, dd[dd != 0].mean(axis=1))).fillna('OK')
The difference is that it returns float values (for rows that are not 'OK') rather than strings. It also appears to be a little bit faster.
I'm trying to define my own discrete distribution. The code I have works for integer values but not for decimal values. For example, this works:
>>> from scipy.stats import rv_discrete
>>> probabilities = [0.2, 0.5, 0.3]
>>> values = [1, 2, 3]
>>> distrib = rv_discrete(values=(values, probabilities))
>>> print(distrib.rvs(size=10))
[1 3 3 2 2 2 2 2 1 3]
But if I use decimal values, it doesn't work:
>>> from scipy.stats import rv_discrete
>>> probabilities = [0.2, 0.5, 0.3]
>>> values = [.1, .2, .3]
>>> distrib = rv_discrete(values=(values, probabilities))
>>> print(distrib.rvs(size=10))
[0 0 0 0 0 0 0 0 0 0]
Thanks.
Per stats.rv_discrete's doc string:
values : tuple of two array_like, optional
(xk, pk) where xk are *integers* with non-zero
probabilities pk with sum(pk) = 1.
(my emphasis). So the discrete distributions created by rv_discrete must use integer values. However, it is not hard to map those integer values to floats by using the rvs values as integer indices into values:
In [4]: values = np.array([0.1, 0.2, 0.3])
In [5]: idx = distrib.rvs(size=10); idx
Out[5]: array([1, 1, 0, 0, 1, 1, 0, 2, 1, 1])
In [6]: values[idx]
Out[6]: array([ 0.2, 0.2, 0.1, 0.1, 0.2, 0.2, 0.1, 0.3, 0.2, 0.2])
Thus you could use:
import numpy as np
import scipy.stats as stats
np.random.seed(2016)
probabilities = np.array([0.2, 0.5, 0.3])
values = np.array([0.1, 0.2, 0.3])
distrib = stats.rv_discrete(values=(range(len(probabilities)), probabilities))
idx = distrib.rvs(size=10)
result = values[idx]
print(result)
# [ 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.3 0.3 0.2]
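As an aside, if the goal is simply to draw float samples with given probabilities, numpy's np.random.choice can also do this directly (a sketch, not part of the original answer):

import numpy as np

np.random.seed(2016)
probabilities = [0.2, 0.5, 0.3]
values = [0.1, 0.2, 0.3]
# Sample directly from `values`, each value picked with probability `p`
samples = np.random.choice(values, size=10, p=probabilities)
print(samples)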
I have two grids of equal shape, one is for land class and the other is land area.
Examples:
Land class
[[1 4 3]
 [3 2 3]
 [1 3 3]]
Land area
[[0.3 0.8 2.0]
 [5.0 1.5 0.5]
 [0.1 1.0 3.2]]
I need to sum up land area based on land class, and it would be delightful to print something like this:
1 0.4
2 1.5
3 11.7
4 0.8
The only module I've imported is numpy, and I would like to avoid importing others if possible. Suggestions?
You can do as follows:
import numpy as np

lc = np.array([[1, 4, 3],
               [3, 2, 3],
               [1, 3, 3]])
la = np.array([[0.3, 0.8, 2.0],
               [5.0, 1.5, 0.5],
               [0.1, 1.0, 3.2]])

calc_areas = []
for v in np.unique(lc):
    area = np.sum(la[lc == v])  # total area covered by land class v
    print(v, area)
    calc_areas.append([v, area])

calc_areas.sort(key=lambda pair: pair[1], reverse=True)
print("Max area", calc_areas[0])
Gives:
1 0.4
2 1.5
3 11.7
4 0.8
Max area [3, 11.699999999999999]
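If you prefer to avoid the Python loop entirely, here is a vectorized sketch (not from the original answer) using np.unique and np.bincount:

# Map each cell to its class index, then sum the areas per class
classes, inv = np.unique(lc, return_inverse=True)
totals = np.bincount(inv.ravel(), weights=la.ravel())
for cls, total in zip(classes, totals):
    print(cls, total)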
I need to determine whether the positions (indices) of the k largest values in matrix a match the positions of the ones in the binary indicator matrix b.
import numpy as np
a = np.matrix([[.8,.2,.6,.4],[.9,.3,.8,.6],[.2,.6,.8,.4],[.3,.3,.1,.8]])
b = np.matrix([[1,0,0,1],[1,0,1,1],[1,1,1,0],[1,0,0,1]])
print "a:\n", a
print "b:\n", b
d = argsort(a)
d[:,2:] # Return whether these indices are in 'b'
Returns:
a:
[[ 0.8 0.2 0.6 0.4]
[ 0.9 0.3 0.8 0.6]
[ 0.2 0.6 0.8 0.4]
[ 0.3 0.3 0.1 0.8]]
b:
[[1 0 0 1]
[1 0 1 1]
[1 1 1 0]
[1 0 0 1]]
matrix([[2, 0],
[2, 0],
[1, 2],
[1, 3]])
I would like to compare the indices returned from the last result and, if b has ones in those positions, return the count.
For this example, the final desired result would be:
1
2
2
1
In other words, in the first row of a, the top-2 values correspond to only one of the ones in b, etc.
Any ideas how to do this efficiently? Maybe the argsort is the wrong approach here.
Thanks.
np.argsort orders indices from the minimum (position 0) to the maximum (position 3), so you can reverse each row with [::-1] to put the maximum's index first and the minimum's last:
s = np.argsort(a, axis=1)[:,::-1]
#array([[0, 2, 3, 1],
# [0, 2, 3, 1],
# [2, 1, 3, 0],
# [3, 1, 0, 2]])
Now you can use np.take to build a matrix of ranks, with 0 at the position of each row's maximum, 1 at the second-largest value, and so on:
s2 = s + (np.arange(s.shape[0])*s.shape[1])[:,None]
s = np.take(s.flatten(),s2)
#array([[0, 3, 1, 2],
# [0, 3, 1, 2],
# [3, 1, 0, 2],
# [2, 1, 3, 0]])
In b, the 0 values should be replaced by np.nan so that 0 == np.nan gives False:
b = np.float_(b)
b[b==0] = np.nan
#array([[ 1., nan, nan, 1.],
# [ 1., nan, 1., 1.],
# [ 1., 1., 1., nan],
# [ 1., nan, nan, 1.]])
and the following comparison will give you the desired result:
print(np.logical_or(s==b-1, s==b).sum(axis=1))
#[[1]
# [2]
# [2]
# [1]]
The general case, to compare the n biggest values of a against a binary b:
def check_a_b(a, b, n=2):
    b = np.float_(b)
    b[b == 0] = np.nan
    s = np.argsort(a, axis=1)[:, ::-1]
    s2 = s + (np.arange(s.shape[0]) * s.shape[1])[:, None]
    s = np.take(s.flatten(), s2)
    ans = s == (b - 1)
    for i in range(n - 1):
        ans = np.logical_or(ans, s == b + i)
    return ans.sum(axis=1)
This will do pair-wise comparisons in the logical_or.
Another, simpler and much faster approach, based on the fact that True*1 = 1, True*0 = 0, False*0 = 0, and False*1 = 0, is:
def check_a_b_new(a, b, n=2):
    s = np.argsort(a.view(np.ndarray), axis=1)[:, ::-1]
    s2 = s + (np.arange(s.shape[0]) * s.shape[1])[:, None]
    s = np.take(s.flatten(), s2)
    return ((s < n) * b.view(np.ndarray)).sum(axis=1)
This avoids the 0-to-np.nan conversion and the Python for loop, which make things pretty slow for a high value of n.
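For example, with the original a and b from the question, the fast version reproduces the expected result:

print(check_a_b_new(a, b, n=2))  # [1 2 2 1]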
In response to Saullo's huge help, I was able to take his work and reduce the solution to three lines. Thanks Saullo!
# Inputs
k = 2
a = np.matrix([[.8, .2, .6, .4], [.9, .3, .8, .6], [.2, .6, .8, .4], [.3, .3, .1, .8]])
b = np.matrix([[1, 0, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0], [1, 0, 0, 1]])
print("a:\n", a)
print("b:\n", b)

# Return values of interest
s = np.argsort(a.view(np.ndarray), axis=1)[:, ::-1]
s2 = s + (np.arange(s.shape[0]) * s.shape[1])[:, None]
out = np.take(b, s2).view(np.ndarray)[:, :k].sum(axis=1)
print(out)
Gives:
a:
[[ 0.8 0.2 0.6 0.4]
[ 0.9 0.3 0.8 0.6]
[ 0.2 0.6 0.8 0.4]
[ 0.3 0.3 0.1 0.8]]
b:
[[1 0 0 1]
[1 0 1 1]
[1 1 1 0]
[1 0 0 1]]
Out:
[1 2 2 1]