How to append array as column to the original Pandas dataframe - python

I have a data frame that looks like this:
x_1  x_2  x_3  x_combined
0    1    0    [0, 1, 0]
1    0    1    [1, 0, 1]
1    1    0    [1, 1, 0]
0    0    1    [0, 0, 1]
Then I calculated the centroid of each dimension by using np.mean(df['x_combined'], axis=0) and got
array([0.5, 0.5, 0.5])
How do I now append it back to the original DF as a fifth column and have the same value for each row? It should look like this:
x_1  x_2  x_3  x_combined  centroid
0    1    0    [0, 1, 0]   [0.5, 0.5, 0.5]
1    0    1    [1, 0, 1]   [0.5, 0.5, 0.5]
1    1    0    [1, 1, 0]   [0.5, 0.5, 0.5]
0    0    1    [0, 0, 1]   [0.5, 0.5, 0.5]

Here is a self-contained example that works:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x_1': [0, 1, 1, 0],
    'x_2': [1, 0, 1, 0],
    'x_3': [0, 1, 0, 1],
    'x_combined': [np.array([0, 1, 0]), np.array([1, 0, 1]),
                   np.array([1, 1, 0]), np.array([0, 0, 1])]
})

a = np.mean(df['x_combined'], axis=0)  # or: a = df['x_combined'].mean(axis=0)
df['centroid'] = [a] * len(df)         # repeat the same centroid array on every row
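If np.mean over the object column ever complains (behavior varies across pandas/NumPy versions), a more robust route is to stack the per-row arrays into a 2-D matrix first. A minimal sketch:

import numpy as np

# Stack the per-row arrays into a (4, 3) matrix, then average over the rows.
stacked = np.vstack(df['x_combined'].to_numpy())
a = stacked.mean(axis=0)        # array([0.5, 0.5, 0.5])
df['centroid'] = [a] * len(df)  # the same centroid repeated for every row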


How to use slice notation in this case?

I am solving a problem in which I am using two large matrices, A and B. Matrix A contains only ones and zeros, and matrix B contains integers in the range [0, ..., 10].
At some point I have to update the components of A such that, if a component is 1, it stays 1; if it is 0, it stays 0 with probability p or changes to 1 with probability 1 - p. The parameter p depends on the corresponding component of B: I have a list probabilities such that, when updating A[i,j], p equals probabilities[B[i,j]].
I can do the updating with the following code:
import numpy as np

for i in range(n):
    for j in range(n):
        if A[i, j] == 0:
            A[i, j] = np.random.choice([0, 1], p=[probabilities[B[i, j]], 1 - probabilities[B[i, j]]])
I think that there should be a faster way to update matrix A using slice notation. Any advice?
Note that this problem is equivalent to the following: given a vector a and a matrix B of non-negative integer entries, obtain a matrix C in which C[i,j] = a[B[i,j]].
The general answer
You can achieve this by generating a random value from a continuous distribution and comparing it against the desired probability (inverse transform sampling).
To be practical
Let's use a uniform random variable X in the range [0, 1); then the probability that X < a is exactly a. Assume that A and B have the same shape.
You can use numpy.where for this (make sure that your probabilities variable is a NumPy array):
A[:, :] = np.where(A == 0, np.where(np.random.rand(*A.shape) < probabilities[B], 1, 0), A)
If you want to avoid computing the random values for positions where A is non-zero, the indexing is a bit more involved:
A[A == 0] = np.where(np.random.rand(*A[A == 0].shape) < probabilities[B[A == 0]], 1, 0)
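For a self-contained check, here is a minimal sketch of the masked variant with made-up shapes and a made-up probabilities lookup table (note that under this answer's convention, probabilities[B[i,j]] is the chance that a zero flips to 1):

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.integers(0, 2, (n, n))             # binary matrix
B = rng.integers(0, 11, (n, n))            # integer labels in [0, 10]
probabilities = np.linspace(0.0, 1.0, 11)  # one probability per possible B value

mask = A == 0
# A zero entry becomes 1 when its uniform draw falls below probabilities[B[i, j]].
A[mask] = np.where(rng.random(mask.sum()) < probabilities[B[mask]], 1, 0)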
Make many p values in the 0-1 range:
In [36]: p=np.arange(1000)/1000
Your np.random.choice approach, summed to get an overall statistic:
In [37]: sum([np.random.choice([0,1],p=[p[i],1-p[i]]) for i in range(p.shape[0])])
Out[37]: 485
and a statistically similar random array without the comprehension:
In [38]: (np.random.rand(p.shape[0])>p).astype(int).sum()
Out[38]: 496
You are welcome to perform other tests to verify their equivalence.
If probabilities is a function that takes the whole B or B[mask], we should be able to do:
mask = A==0
n = mask.ravel().sum() # the number of true elements
A[mask] = (np.random.rand(n)>probabilities(B[mask])).astype(int)
To test this:
In [39]: A = np.random.randint(0,2, (4,5))
In [40]: A
Out[40]:
array([[1, 1, 0, 0, 0],
[1, 1, 1, 1, 1],
[1, 1, 0, 0, 1],
[1, 1, 1, 1, 0]])
In [41]: mask = A==0
In [42]: A[mask]
Out[42]: array([0, 0, 0, 0, 0, 0])
In [43]: B = np.arange(1,21).reshape(4,5)
In [44]: def foo(B):  # scale the B values to probabilities
...: return B/20
...:
In [45]: foo(B)
Out[45]:
array([[0.05, 0.1 , 0.15, 0.2 , 0.25],
[0.3 , 0.35, 0.4 , 0.45, 0.5 ],
[0.55, 0.6 , 0.65, 0.7 , 0.75],
[0.8 , 0.85, 0.9 , 0.95, 1. ]])
In [46]: n = mask.ravel().sum()
In [47]: n
Out[47]: 6
In [51]: (np.random.rand(n)>foo(B[mask])).astype(int)
Out[51]: array([1, 1, 0, 1, 1, 0])
In [52]: A[mask] = (np.random.rand(n)>foo(B[mask])).astype(int)
In [53]: A
Out[53]:
array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 0, 1, 1],
[1, 1, 1, 1, 0]])
In [54]: foo(B[mask])
Out[54]: array([0.15, 0.2 , 0.25, 0.65, 0.7 , 1. ])

Creating dataframe based on conditions on other dataframes

I have two dataframes: sd with one column and dd with three columns:
import pandas as pd

s = {0: [0, 0.3, 0.5, -0.1, -0.2, 0.7, 0]}
d = {0: [0.1, 0.2, -0.2, 0, 0, 0, 0],
     1: [0.3, 0.4, -0.7, 0, 0.8, 0, 0.1],
     2: [-0.5, 0.4, -0.1, 0.5, 0.5, 0, 0]}
sd = pd.DataFrame(data=s)
dd = pd.DataFrame(data=d)
result = pd.DataFrame()
I want to get the result dataframe (one column) based on the values in those two:
1. When the value in sd is 0, the result is 0.
2. When the value in sd is not 0, check whether that row in dd has at least one non-zero value: if yes, take the average of the non-zero values; if no, return OK.
Here is what I would like to get:
results:
0     0
1     0.333
2    -0.333
3     0.5
4     0.65
5     OK
6     0
I know I can use dd[dd != 0].mean(axis=1) to calculate the mean of the non-zero values in each row, but I don't know how to connect all three conditions together.
Using np.where twice:
np.where(sd[0] == 0, 0, np.where(dd.eq(0).all(1), 'OK', dd.mask(dd == 0).mean(1)))
Out[232]:
array(['0', '0.3333333333333333', '-0.3333333333333333', '0.5', '0.65',
'OK', '0'], dtype='<U32')
Using numpy.select:
c1 = sd[0].eq(0)
c2 = dd.eq(0).all(1)
res = np.select([c1, c2], [0, 'OK'], dd.where(dd.ne(0)).mean(1))
pd.Series(res)
0 0
1 0.3333333333333333
2 -0.3333333333333333
3 0.5
4 0.65
5 OK
6 0
dtype: object
Thank you for your help. I managed to do it in a quite different way.
I used:
res1 = pd.Series(np.where(sd[0] == 0, 0, dd[dd != 0].mean(axis=1))).fillna('OK')
The difference is that it returns float values (for the rows that are not 'OK') rather than strings. It also appears to be a little faster.
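For completeness, here is that self-answer assembled end to end on the sample data from the question (a minimal sketch; the all-zero row of dd produces a NaN mean, which fillna then turns into 'OK'):

import numpy as np
import pandas as pd

sd = pd.DataFrame({0: [0, 0.3, 0.5, -0.1, -0.2, 0.7, 0]})
dd = pd.DataFrame({0: [0.1, 0.2, -0.2, 0, 0, 0, 0],
                   1: [0.3, 0.4, -0.7, 0, 0.8, 0, 0.1],
                   2: [-0.5, 0.4, -0.1, 0.5, 0.5, 0, 0]})

row_means = dd[dd != 0].mean(axis=1)  # NaN for the all-zero row
res1 = pd.Series(np.where(sd[0] == 0, 0, row_means)).fillna('OK')
print(res1)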

How do I sort the rows of a 2d numpy array based on indices given by another 2d numpy array

Example:
import numpy as np

arr = np.array([[.5, .25, .19, .05, .01],
                [.25, .5, .19, .05, .01],
                [.5, .25, .19, .05, .01]])
print(arr)
[[ 0.5 0.25 0.19 0.05 0.01]
[ 0.25 0.5 0.19 0.05 0.01]
[ 0.5 0.25 0.19 0.05 0.01]]
idxs = np.argsort(arr)
print(idxs)
[[4 3 2 1 0]
[4 3 2 0 1]
[4 3 2 1 0]]
How can I use idxs to index arr? I want to do something like arr[idxs], but this does not work.
It's not the prettiest, but I think something like
>>> arr[np.arange(len(arr))[:,None], idxs]
array([[ 0.01, 0.05, 0.19, 0.25, 0.5 ],
[ 0.01, 0.05, 0.19, 0.25, 0.5 ],
[ 0.01, 0.05, 0.19, 0.25, 0.5 ]])
should work. The first term gives the row indices we want (using broadcasting over the last singleton axis):
>>> np.arange(len(arr))[:,None]
array([[0],
[1],
[2]])
with idxs providing the column indices. Note that if we had used unravel_index, the row indices to use would always have been 0 instead:
>>> np.unravel_index(idxs, arr.shape)[0]
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
How about something like this:
I changed the variable names to make the example clearer, but you basically need to index with two 2D arrays.
In [102]: a = np.array([[1,2,3], [4,5,6]])
In [103]: b = np.array([[0,2,1], [2,1,0]])
In [104]: temp = np.repeat(np.arange(a.shape[0]), a.shape[1]).reshape(a.shape).T
# temp is just [[0,1], [0,1], [0,1]]
# probably can be done more elegantly
In [105]: a[temp, b.T].T
Out[105]:
array([[1, 3, 2],
[6, 5, 4]])
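As a side note, on NumPy 1.15+ the same row-wise gather can be written with np.take_along_axis, which hides the index bookkeeping. A sketch using the arrays from the question:

import numpy as np

arr = np.array([[.5, .25, .19, .05, .01],
                [.25, .5, .19, .05, .01],
                [.5, .25, .19, .05, .01]])
idxs = np.argsort(arr, axis=1)

# out[i, j] = arr[i, idxs[i, j]]: each row sorted ascending
out = np.take_along_axis(arr, idxs, axis=1)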

Python - Find K max values in each row of one matrix and compare to binary matrix

I need to determine whether the positions (indices) of the k largest values in matrix a coincide with the ones in the binary indicator matrix b.
import numpy as np
a = np.matrix([[.8,.2,.6,.4],[.9,.3,.8,.6],[.2,.6,.8,.4],[.3,.3,.1,.8]])
b = np.matrix([[1,0,0,1],[1,0,1,1],[1,1,1,0],[1,0,0,1]])
print "a:\n", a
print "b:\n", b
d = np.argsort(a)
d[:, 2:]  # indices of the top-2 values per row; are these positions ones in b?
Returns:
a:
[[ 0.8 0.2 0.6 0.4]
[ 0.9 0.3 0.8 0.6]
[ 0.2 0.6 0.8 0.4]
[ 0.3 0.3 0.1 0.8]]
b:
[[1 0 0 1]
[1 0 1 1]
[1 1 1 0]
[1 0 0 1]]
matrix([[2, 0],
[2, 0],
[1, 2],
[1, 3]])
I would like to compare the indices returned from the last result and, if b has ones in those positions, return the count.
For this example, the final desired result would be:
1
2
2
1
In other words, in the first row of a, the top-2 values correspond to only one of the ones in b, etc.
Any ideas how to do this efficiently? Maybe the argsort is the wrong approach here.
Thanks.
When you take the argsort you get ranks from the minimum (0) up to the maximum (3), so you can reverse it with [::-1] to get 0 for the maximum and 3 for the minimum:
s = np.argsort(a, axis=1)[:,::-1]
#array([[0, 2, 3, 1],
# [0, 2, 3, 1],
# [2, 1, 3, 0],
# [3, 1, 0, 2]])
Now you can use np.take to build the rank of each position: 0 where the maximum is, 1 where the second-largest is, and so on:
s2 = s + (np.arange(s.shape[0])*s.shape[1])[:,None]
s = np.take(s.flatten(),s2)
#array([[0, 3, 1, 2],
# [0, 3, 1, 2],
# [3, 1, 0, 2],
# [2, 1, 3, 0]])
In b, the 0 values should be replaced by np.nan, so that comparing against them always gives False (0 == np.nan is False):
b = np.float_(b)
b[b==0] = np.nan
#array([[ 1., nan, nan, 1.],
# [ 1., nan, 1., 1.],
# [ 1., 1., 1., nan],
# [ 1., nan, nan, 1.]])
and the following comparison will give you the desired result:
print np.logical_or(s==b-1, s==b).sum(axis=1)
#[[1]
# [2]
# [2]
# [1]]
The general case, to compare the n biggest values of a against a binary b:
def check_a_b(a, b, n=2):
    b = np.float_(b)
    b[b == 0] = np.nan
    s = np.argsort(a, axis=1)[:, ::-1]
    s2 = s + (np.arange(s.shape[0]) * s.shape[1])[:, None]
    s = np.take(s.flatten(), s2)
    ans = s == (b - 1)
    for i in range(n - 1):
        ans = np.logical_or(ans, s == b + i)
    return ans.sum(axis=1)
This will do pair-wise comparisons in the logical_or.
Another, simpler and much faster approach, based on the fact that:
True*1 == 1, True*0 == 0, False*0 == 0, and False*1 == 0
is:
def check_a_b_new(a, b, n=2):
    s = np.argsort(a.view(np.ndarray), axis=1)[:, ::-1]
    s2 = s + (np.arange(s.shape[0]) * s.shape[1])[:, None]
    s = np.take(s.flatten(), s2)
    return ((s < n) * b.view(np.ndarray)).sum(axis=1)
This avoids the 0-to-np.nan conversion and the Python for loop, which makes things pretty slow for a high value of n.
In response to Saullo's huge help, I was able to take his work and reduce the solution to three lines. Thanks Saullo!
#Inputs
k = 2
a = np.matrix([[.8,.2,.6,.4],[.9,.3,.8,.6],[.2,.6,.8,.4],[.3,.3,.1,.8]])
b = np.matrix([[1,0,0,1],[1,0,1,1],[1,1,1,0],[1,0,0,1]])
print "a:\n", a
print "b:\n", b
# Return values of interest
s = np.argsort(a.view(np.ndarray), axis=1)[:, ::-1]
s2 = s + (np.arange(s.shape[0]) * s.shape[1])[:, None]
out = np.take(b, s2).view(np.ndarray)[:, :k].sum(axis=1)
print out
Gives:
a:
[[ 0.8 0.2 0.6 0.4]
[ 0.9 0.3 0.8 0.6]
[ 0.2 0.6 0.8 0.4]
[ 0.3 0.3 0.1 0.8]]
b:
[[1 0 0 1]
[1 0 1 1]
[1 1 1 0]
[1 0 0 1]]
Out:
[1 2 2 1]
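On NumPy 1.15+, np.take_along_axis makes the gather step even more direct. A sketch with the same inputs as plain ndarrays (for large matrices, np.argpartition could replace the full argsort, at the cost of arbitrary tie-breaking):

import numpy as np

k = 2
a = np.array([[.8, .2, .6, .4], [.9, .3, .8, .6],
              [.2, .6, .8, .4], [.3, .3, .1, .8]])
b = np.array([[1, 0, 0, 1], [1, 0, 1, 1],
              [1, 1, 1, 0], [1, 0, 0, 1]])

# Column indices sorted by descending value; keep the first k per row.
top_k = np.argsort(a, axis=1)[:, ::-1][:, :k]
counts = np.take_along_axis(b, top_k, axis=1).sum(axis=1)
print(counts)  # [1 2 2 1]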

Assigning a list element with the sum of the other elements

I have a 2d matrix, which can be any size but is always square. I want to loop through the matrix, and for each diagonal element (x in the example) assign the value 1 minus the sum of all the other values in that row, e.g.:
Mtx = [[ x , .2,  0 , .2, .2],
       [ 0 ,  x, .4 , .2, .2],
       [.2 , .2,  x ,  0,  0],
       [ 0 ,  0, .2 ,  x, .2],
       [ 0 ,  0,  0 ,  0,  x]]

for i, row in enumerate(Mtx):
    for j, value in enumerate(row):
        if Mtx[i][j] == 'x':
            Mtx[i][j] = 1 - (sum of all other values in the row)  # pseudocode
I can't figure out how to get the sum of the other values in each row.
for i, row in enumerate(Mtx):  # same thing as `for i in range(len(Mtx)):`
    Mtx[i][i] = 0
    Mtx[i][i] = 1 - sum(Mtx[i])
    ## could also use (if it makes more sense to you):
    # row[i] = 0
    # Mtx[i][i] = 1 - sum(row)
You could do it like this:
from copy import copy

for i, row in enumerate(Mtx):
    row_copy = copy(row)
    row_copy.pop(i)  # drop the diagonal element
    row[i] = 1 - sum(row_copy)
mtx = [[ 0, .2,  0, .2, .2],
       [ 0,  0, .4, .2, .2],
       [.2, .2,  0,  0,  0],
       [ 0,  0, .2,  0, .2],
       [ 0,  0,  0,  0,  0]]
for i in range(len(mtx)):
    summ = sum(mtx[i])
    mtx[i][i] = round(1 - summ, 2)  # use round to get 0.4 instead of 0.39999999999999999
print(mtx)
output:
[[0.4, 0.2, 0, 0.2, 0.2], [0, 0.2, 0.4, 0.2, 0.2], [0.2, 0.2, 0.6, 0, 0], [0, 0, 0.2, 0.6, 0.2], [0, 0, 0, 0, 1.0]]
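If the matrix lives in a NumPy array rather than nested lists, np.fill_diagonal can do the whole update without an explicit loop. A sketch (assuming the diagonal starts at 0, as in the example above):

import numpy as np

M = np.array([[ 0, .2,  0, .2, .2],
              [ 0,  0, .4, .2, .2],
              [.2, .2,  0,  0,  0],
              [ 0,  0, .2,  0, .2],
              [ 0,  0,  0,  0,  0]])

# With the diagonal at 0, each row's off-diagonal sum is just the row sum.
np.fill_diagonal(M, 1 - M.sum(axis=1))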
