Calculating the mean values between every row inside a range - python

I have a dataframe of size 700x20. My data are pixel intensity coordinates for specific locations on an image; I have 14 people, each with 50 images. I am trying to perform dimensionality reduction, and one of the steps requires me to calculate the mean of each class, where I have two classes. In my dataframe, each block of 50 rows holds the features that belong to one class, so rows 0 to 50 are features for class A, 51 to 100 for class B, 101-150 for class A, 151-200 for class B and so on.
What I want to do is calculate the mean over every such block of rows, from N to M. Here's a link for the dataframe for better visualization of the problem: Dataframe pickle file
What I tried was ordering the dataframe and calculating separately, but it didn't work: it calculated the mean for every row and grouped them in 14 different classes.
class_feature_means = pd.DataFrame(columns=target_names)
for c, rows in df.groupby('class'):
    class_feature_means[c] = rows.mean()
class_feature_means
Minimal reproducible example:
my_array = np.asarray([[31, 25, 17, 62],
[31, 26, 19, 59,],
[31, 23, 17, 67,],
[31, 23, 19, 67,],
[31, 28, 17, 65,],
[32, 26, 19, 62,],
[32, 26, 17, 66,],
[30, 24, 17, 68],
[29, 24, 17, 68],
[33, 24, 17, 68],
[32, 52, 16, 68],
[29, 24, 17, 68],
[33, 24, 17, 68],
[32, 52, 16, 68],
[29, 24, 17, 68],
[33, 24, 17, 68],
[32, 52, 16, 68],
[30, 25, 16, 97]])
my_array = my_array.reshape(18, 4)
indices = sorted(list(range(0,int(my_array.shape[0]/3)))*3)
class_dict = dict(zip(range(0,int((my_array.shape[0]/3))), string.ascii_uppercase))
target_names = ["Index_" + c for c in class_dict.values()]
pixel_index = [1, 2, 3, 4]
X = pd.DataFrame(my_array, columns= pixel_index)
y = pd.Categorical.from_codes(indices,target_names)
df = X.join(pd.Series(y,name='class'))
df
Basically, what I want to do is group classes A, C and E into a single class, take their sum and divide by 3, giving the mean value for class A, or let's call it class 0.
Then group classes B, D and F into a single class, take their sum and divide by 3, giving the mean value for class B, or class 1.

Create a helper array with integer division and modulo to label the groups, pass it to groupby to aggregate with sum, and finally divide by N:
N = 3
arr = np.arange(len(df)) // N % 2
print (arr)
[0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1]
df = df.groupby(arr).sum() / N
print (df)
1 2 3 4
0 92.666667 82.666667 51.333333 198.000000
1 94.333333 92.666667 51.333333 210.333333
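To see how the helper array above is formed, here is a small sketch on a hypothetical 12-row frame (the frame and the names block_id and class_id are illustrative, not from the question). It also shows, as an aside, how the sum divided by N differs from a plain mean over all rows of a class, in case the latter is what is actually needed.

```python
import numpy as np
import pandas as pd

N = 3  # rows per block, as in the answer above

# Hypothetical data: 12 rows, 2 feature columns
frame = pd.DataFrame(np.arange(24).reshape(12, 2), columns=["p1", "p2"])

# Integer division labels consecutive blocks of N rows: 0,0,0,1,1,1,2,2,2,...
block_id = np.arange(len(frame)) // N
# Modulo 2 folds the blocks into two alternating classes: 0,0,0,1,1,1,0,0,0,...
class_id = block_id % 2

# What the answer computes: sum over all rows of a class, divided by N
scaled_sum = frame.groupby(class_id).sum() / N
# Plain mean over all rows of a class (divides by the class row count instead)
class_mean = frame.groupby(class_id).mean()

print(class_id)
print(scaled_sum)
print(class_mean)
```

With two blocks per class, as in this sketch, the scaled sum is exactly twice the plain mean, so which one to use depends on the definition of "mean" the later steps expect.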


Matlab to Python - extracting lower subdiagonal triangle, why different order?

I am translating code from MATLAB to Python. I need to extract the lower subdiagonal values of a matrix. My attempt in python seems to extract the same values (sum is equal), but in different order. This is a problem as I need to apply corrcoef after.
The original Matlab code is using an array of indices to subset a matrix.
MATLAB code:
values = 1:100;
matrix = reshape(values,[10,10]);
subdiag = find(tril(ones(10),-1));
matrix_subdiag = matrix(subdiag);
subdiag_sum = sum(matrix_subdiag);
disp(matrix_subdiag(1:10))
disp(subdiag_sum)
Output:
2
3
4
5
6
7
8
9
10
13
1530
My attempt in Python
import numpy as np
matrix = np.arange(1,101).reshape(10,10)
matrix_t = matrix.T #to match MATLAB arrangement
matrix_subdiag = matrix_t[np.tril_indices((10), k = -1)]
subdiag_sum = np.sum(matrix_subdiag)
print(matrix_subdiag[0:10], subdiag_sum)
Output:
[2 3 13 4 14 24 5 15 25 35] 1530
How do I get the same order output? Where is my error?
Thank you!
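MATLAB fills and indexes matrices column by column, so the NumPy matrix built with reshape is already the transpose of the MATLAB one: the lower triangle of the MATLAB matrix read column by column is the upper triangle of the NumPy matrix read row by row. For the sum, use numpy.triu directly on the non-transposed matrix: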
S = np.triu(matrix, k=1).sum()
# 1530
For the values in MATLAB order, use numpy.triu_indices_from and index the matrix with the result, which returns them as a flat array:
idx = matrix[np.triu_indices_from(matrix, k=1)]
output:
array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18, 19, 20,
24, 25, 26, 27, 28, 29, 30, 35, 36, 37, 38, 39, 40, 46, 47, 48, 49,
50, 57, 58, 59, 60, 68, 69, 70, 79, 80, 90])
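As a cross-check of the answer above, the following sketch rebuilds the MATLAB matrix in NumPy with column-major filling (order='F') and walks its strict lower triangle column by column. The names matlab_like, lower_mask and col_major are illustrative assumptions, not part of the original code.

```python
import numpy as np

# NumPy matrix from the question (row-major fill)
matrix = np.arange(1, 101).reshape(10, 10)
values = matrix[np.triu_indices_from(matrix, k=1)]

# Same layout as MATLAB's reshape(1:100, [10, 10]) (column-major fill)
matlab_like = np.arange(1, 101).reshape(10, 10, order="F")
# Strict lower triangle, like tril(ones(10), -1)
lower_mask = np.tril(np.ones((10, 10), dtype=bool), k=-1)
# Boolean indexing walks rows first; transposing both operands
# turns that into a column-by-column walk of the MATLAB-style matrix
col_major = matlab_like.T[lower_mask.T]

print(values[:10], values.sum())
print(np.array_equal(values, col_major))
```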

How to select elements on given axis base on the value of another array

I am trying to solve the following problem in a more numpy-friendly way (without loops):
G is an NxM matrix filled with 0, 1 or 2
D is a 3xNxM matrix
We want an NxM matrix (R) with R[i,j] = D[k,i,j], where k = G[i,j]
A loop-based solution is:
def getVals(g, d):
    arr=np.zeros(g.shape)
    for row in range(g.shape[0]):
        for column in range(g.shape[1]):
            arr[row,column]=d[g[row,column],row,column]
    return arr
Try with ogrid and advanced indexing:
x,y = np.ogrid[:N,:M]
out = D[G, x[None], y[None]]
Test:
N,M=4,5
G = np.random.randint(0,3, (N,M))
D = np.random.rand(3,N,M)
np.allclose(getVals(G,D), D[G, x[None], y[None]])
# True
You could also use np.take_along_axis, which extracts values along one specific axis:
# Example input data:
G = np.random.randint(0,3,(4,5)) # 4x5 array
D = np.random.randint(0,9,(3,4,5)) # 3x4x5 array
# Get the results:
R = np.take_along_axis(D,G[None,:],axis=0)[0]
Since G must have the same number of dimensions as D, we add a new leading dimension to G with G[None,:]. The result then has shape 1xNxM, and the trailing [0] drops that extra dimension to give the NxM matrix.
Here's my try (I assume g and d are Numpy Ndarrays):
def getVals(g, d):
    m,n = g.shape
    indexes = g.flatten()*m*n + np.arange(m*n)
    arr = d.flatten()[indexes].reshape(m,n)
    return arr
So if
d = [[[96, 89, 51, 40, 51],
[31, 72, 39, 77, 33]],
[[34, 11, 54, 86, 73],
[12, 21, 74, 39, 14]],
[[14, 91, 38, 77, 97],
[44, 55, 93, 88, 55]]]
and
g = [[2, 1, 2, 1, 1],
[0, 2, 0, 0, 2]]
then you are going to get
arr = [[14, 11, 38, 86, 73],
[31, 55, 39, 77, 55]]
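The answers above can be compared side by side. The sketch below runs the loop, the ogrid indexing, np.take_along_axis and the flat-index approach on the same random input and checks that they agree; the seed and the variable names are assumptions made for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
N, M = 4, 5
G = rng.integers(0, 3, size=(N, M))  # selector, values in {0, 1, 2}
D = rng.random((3, N, M))

def get_vals_loop(g, d):
    # Reference implementation from the question
    out = np.zeros(g.shape)
    for i in range(g.shape[0]):
        for j in range(g.shape[1]):
            out[i, j] = d[g[i, j], i, j]
    return out

# Advanced indexing: G picks the first axis, the open grid picks (i, j)
rows, cols = np.ogrid[:N, :M]
r_index = D[G, rows, cols]

# take_along_axis needs indices with the same ndim as D; [0] drops the extra axis
r_take = np.take_along_axis(D, G[None, :, :], axis=0)[0]

# Flat indexing: element (k, i, j) sits at k*N*M + i*M + j in the raveled array
r_flat = D.ravel()[G.ravel() * N * M + np.arange(N * M)].reshape(N, M)

expected = get_vals_loop(G, D)
print(np.allclose(expected, r_index),
      np.allclose(expected, r_take),
      np.allclose(expected, r_flat))
```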

creating a range of numbers in pandas based on single column

I have a pandas dataframe:
df2 = pd.DataFrame({'ID':['A','B','C','D','E'], 'loc':['Lon','Tok','Ber','Ams','Rom'], 'start':[20,10,30,40,43]})
ID loc start
0 A Lon 20
1 B Tok 10
2 C Ber 30
3 D Ams 40
4 E Rom 43
I'm looking to add in a column called range which takes the value in 'start' and produces a range of values which (including the initial value) are 10 less than the initial value, all in the same row.
The desired output:
ID loc start range
0 A Lon 20 20,19,18,17,16,15,14,13,12,11,10
1 B Tok 10 10,9,8,7,6,5,4,3,2,1,0
2 C Ber 30 30,29,28,27,26,25,24,23,22,21,20
3 D Ams 40 40,39,38,37,36,35,34,33,32,31,30
4 E Rom 43 43,42,41,40,39,38,37,36,35,34,33
I have tried:
df2['range'] = [i for i in range(df2.start, df2.start -10)]
and
def create_range2(row):
    return df2['start'].between(df2.start, df2.start - 10)
df2.loc[:, 'range'] = df2.apply(create_range2, axis = 1)
however I can't seem to get the desired output. I intend to apply this solution to multiple dataframes, one of which has > 2,000,000 rows.
thanks
You might write a range-creating function and .apply it to the start column in the following way:
import pandas as pd
df2 = pd.DataFrame({'ID':['A','B','C','D','E'], 'loc':['Lon','Tok','Ber','Ams','Rom'], 'start':[20,10,30,40,43]})
def make_10(x):
    return list(range(x, x-10-1, -1))
df2["range"] = df2["start"].apply(make_10)
print(df2)
output
ID loc start range
0 A Lon 20 [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10]
1 B Tok 10 [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
2 C Ber 30 [30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20]
3 D Ams 40 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30]
4 E Rom 43 [43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33]
Explanation: the .apply method of pandas.Series (a column of a pandas.DataFrame) accepts a function that is applied element-wise. Note the extra -1 in the stop value x-10-1, needed because range includes the start but excludes the stop, and the -1 step size, which produces descending values.
does this work?
df2['range'] = df2.apply(lambda row: list(range(row['start'],row['start']-11,-1)),axis=1)
df2
output
ID loc start range
0 A Lon 20 [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10]
1 B Tok 10 [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
2 C Ber 30 [30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20]
3 D Ams 40 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30]
4 E Rom 43 [43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33]
or if you want comma-separated:
df2['range'] = df2.apply(lambda row: ','.join([str(v) for v in range(row['start'],row['start']-11,-1)]),axis=1)
to get
ID loc start range
0 A Lon 20 20,19,18,17,16,15,14,13,12,11,10
1 B Tok 10 10,9,8,7,6,5,4,3,2,1,0
2 C Ber 30 30,29,28,27,26,25,24,23,22,21,20
3 D Ams 40 40,39,38,37,36,35,34,33,32,31,30
4 E Rom 43 43,42,41,40,39,38,37,36,35,34,33
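Since the question mentions frames with more than 2,000,000 rows, a broadcasting-based variant may be worth trying, as it builds all ranges in one NumPy operation instead of calling a Python function per row. This is only a sketch; the names span, offsets and grid are illustrative, and any speed-up should be measured on the real data.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E'],
                    'loc': ['Lon', 'Tok', 'Ber', 'Ams', 'Rom'],
                    'start': [20, 10, 30, 40, 43]})

span = 10                         # how far below 'start' each range goes
offsets = np.arange(span + 1)     # 0, 1, ..., 10
# Broadcasting: (n_rows, 1) - (11,) -> (n_rows, 11), one descending range per row
grid = df2['start'].to_numpy()[:, None] - offsets

# Store each row of the grid as a Python list in a single column
df2['range'] = pd.Series(grid.tolist(), index=df2.index)
print(df2)
```

If the comma-separated form is preferred, each list can still be joined afterwards, for example with ','.join(map(str, values)) per row.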

How to find average, Max and largest(similar to excel function) in a list in python?

I have a list of numbers, and from this list I want to create 3 more lists that contain the maximum, the average, and the 5th largest number. My original list, overdraw, is a list of blocks: there are 3 blocks in total and each block has 6 numbers in it (a 6x3 matrix or array).
overdraw:
[[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
I know how to calculate the max, average and 5th largest value of each block, but I want the answer in a specific format: each value should be printed 4 times. I know all the values:
Max = [45, 76, 54]
Average = [24, 37, 34]
Largest(5th) = [14, 23, 22]
my approach:
overdraw = [[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
x = [sorted(block, reverse=True) for block in overdraw] # first sort the whole list
max = [x[i][0] for i in range(0, len(x))] # for max
largest = [x[i][4] for i in range(0, len(x))] #5th largest
average = [sum(x[i])/len(x[i]) for i in range(0, len(x))] #average
print("max: ", max)
print("5th largest: ", largest)
print("average: ", average)
Running this code gives the values listed above, but I want the output in this format:
Average = [24, 24, 24, 24, 37, 37, 37, 37, 34, 34, 34, 34]
Max = [45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
Largest(5th) = [14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
As you can see, each average, max, and 5th largest number is repeated 4 times in its respective list. Can anyone help with this?
What about using pandas.DataFrame.explode?
import pandas as pd
df = pd.DataFrame({
'OvIdx' : 3 * [range(4)],
'Average' : average,
'Max' : max, # should be renamed/assigned as max_ instead
'Largest(5th)': largest
}).explode('OvIdx').set_index('OvIdx').astype(int)
print(df)
which shows
Average Max Largest(5th)
OvIdx
0 24 45 14
1 24 45 14
2 24 45 14
3 24 45 14
0 36 76 23
1 36 76 23
2 36 76 23
3 36 76 23
0 34 54 22
1 34 54 22
2 34 54 22
3 34 54 22
From here, you can still do all the calculations you want and/or get a NumPy array with df.values.
Following your comment, you can also get the columns as individual lists, e.g.
>>> df.Average.tolist()
[24, 24, 24, 24, 36, 36, 36, 36, 34, 34, 34, 34]
>>> df.Max.tolist()
[45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
>>> df['Largest(5th)'].tolist() # as string key since the name is a little bit exotic
[14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
This approach starts to be a little overkill, though it stays readable.
A solution that returns lists like you specified
import itertools
import numpy as np
n_times = 4
overdraw = [[16,13,23,14,33,45],[23,11,54,34,23,76],[22,54,34,43,41,11]]
y = [sorted(block, reverse=True) for block in overdraw]
maximum = list(itertools.chain(*[[max(x)]*n_times for x in y]))
average = list(itertools.chain(*[[int(round(sum(x)/len(x)))]*n_times for x in y]))
fifth_largest = list(itertools.chain(*[[x[4]]*n_times for x in y]))
print(f"Average = {average}")
print(f"Max = {maximum}")
print(f"Largest(5th): {fifth_largest}")
Outputs:
Average = [24, 24, 24, 24, 37, 37, 37, 37, 34, 34, 34, 34]
Max = [45, 45, 45, 45, 76, 76, 76, 76, 54, 54, 54, 54]
Largest(5th): [14, 14, 14, 14, 23, 23, 23, 23, 22, 22, 22, 22]
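A NumPy-only variant of the same idea is sketched below, using np.repeat for the repetition. It assumes every block has the same length and that the average should be rounded to the nearest integer, as in the desired output; the variable names are illustrative.

```python
import numpy as np

overdraw = [[16, 13, 23, 14, 33, 45],
            [23, 11, 54, 34, 23, 76],
            [22, 54, 34, 43, 41, 11]]
n_times = 4

blocks = np.array(overdraw)
# Sort each block ascending, then reverse the columns to get descending order
descending = np.sort(blocks, axis=1)[:, ::-1]

maximum = np.repeat(descending[:, 0], n_times).tolist()
fifth_largest = np.repeat(descending[:, 4], n_times).tolist()
# Round the per-block mean to the nearest integer before repeating
average = np.repeat(np.rint(blocks.mean(axis=1)).astype(int), n_times).tolist()

print(f"Average = {average}")
print(f"Max = {maximum}")
print(f"Largest(5th) = {fifth_largest}")
```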

Find largest value from multiple colums in each group of row index in Python, arrange those values diagonally in matrix, and find determinant

I am new to Python. I want to find the largest value across all the columns for each group of repeating row indexes (i.e. 5 to 130), and also show its row and column index label in the output. The largest value should be taken in absolute terms (irrespective of + or - sign). There should not be duplicates for row indexes in different groups.
After finding the largest value from each group, I want to arrange those values diagonally in a square matrix, then fill the rest of the matrix with the corresponding values of the indexes for each group from the main dataframe, and find its determinant.
df=pd.DataFrame(
{'0_deg': [43, 50, 45, -17, 5, 19, 11, 32, 36, 41, 19, 11, 32, 36, 1, 19, 7, 1, 36, 10],
'10_deg': [47, 41, 46, -18, 4, 16, 12, 34, -52, 31, 16, 12, 34, -71, 2, 9, 52, 34, -6, 9],
'20_deg': [46, 43, -56, 29, 6, 14, 13, 33, 43, 6, 14, 13, 37, 43, 3, 14, 13, 25, 40, 8],
'30_deg': [-46, 16, -40, -11, 9, 15, 33, -39, -22, 21, 15, 63, -39, -22, 4, 6, 25, -39, -22, 7]
}, index=[5, 10, 12, 101, 130, 5, 10, 12, 101, 130, 5, 10, 12, 101, 130, 5, 10, 12, 101, 130]
)
Data set :
Expected Output:
My code is showing only till output 1.
Actual Output:
Code:
df = pd.read_csv ('Matrixfile.csv')
df = df.set_index('Number')
def f(x):
    x1 = x.abs().stack()
    x2 = x.stack()
    x = x2.iloc[np.argsort(-x1)].head(1)
    return x
groups = (df.index == 5).cumsum()
df1 = df.groupby(groups).apply(f).reset_index(level=[1,2])
df1.columns = ['Number','Angle','Value']
print (df1)
df1.to_csv('Matrix_OP.csv', encoding='utf-8', index=True)
I am not sure about @piRSquared's output, given what I understood from your question. There might be some errors in there: for instance, in group 2, max(abs(values)) = 52 (underlined in red in the picture) but 41 is displayed on the left...
Here is a less elegant way of doing it, but maybe one that is easier for you to understand:
import numpy as np
# INPUT
data_dict ={'0_deg': [43, 50, 45, -17, 5, 19, 11, 32, 36, 41, 19, 11, 32, 36, 1, 19, 7, 1, 36, 10],
'10_deg': [47, 41, 46, -18, 4, 16, 12, 34, -52, 31, 16, 12, 34, -71, 2, 9, 52, 34, -6, 9],
'20_deg': [46, 43, -56, 29, 6, 14, 13, 33, 43, 6, 14, 13, 37, 43, 3, 14, 13, 25, 40, 8],
'30_deg': [-46, 16, -40, -11, 9, 15, 33, -39, -22, 21, 15, 63, -39, -22, 4, 6, 25, -39, -22, 7],
}
# Row idx of a group in this list
idx = [5, 10, 12, 101, 130]
# Getting some dimensions and sorting the data
row_idx_length = len(idx)
group_length = len(data_dict['0_deg'])
number_of_groups = len(data_dict.keys())
idx = idx*number_of_groups
data_arr = np.zeros((group_length,number_of_groups),dtype=np.int32)
#
col = 0
keys = []
for key in sorted(data_dict):
    data_arr[:,col] = data_dict[key]
    keys.append(key)
    col+=1
def get_extrema_value_group(arr):
    # function to find absolute extrema value of a 2d array
    extrema = 0
    for i in range(0, len(arr)):
        max_value = max(arr[i])
        min_value = min(arr[i])
        if (abs(min_value) > max_value) and (abs(extrema) < abs(min_value)):
            extrema = min_value
        elif (abs(min_value) < max_value) and (abs(extrema) < max_value):
            extrema = max_value
    return extrema
# For output 1
max_values = []
for i in range(0,row_idx_length*number_of_groups,row_idx_length):
    # get the max value for the current group
    value = get_extrema_value_group(data_arr[i:i+row_idx_length])
    # get the row and column idx associated with the max value
    # (compare the signed data, since value keeps its sign)
    idx_angle_number = np.nonzero(data_arr[i:i+row_idx_length,:]==value)
    print('Group number : ' + str(i//row_idx_length+1))
    print('Number : '+ str(idx[idx_angle_number[0][0]]))
    print('Angle : '+ keys[idx_angle_number[1][0]])
    print('Absolute extrema value : ' + str(value))
    print('------')
    max_values.append(value)
# Arrange those values diagonally in square matrix for output 2
A = np.diag(max_values)
print('A = ' + str(A))
# Fill A with desired values
for i in range(0,number_of_groups,1):
    A[i,0] = data_arr[i*row_idx_length+2,2] # 20 deg 12
    A[i,1:3] = data_arr[i*row_idx_length+3,1] # x2 : 10 deg 101
    A[i,3] = data_arr[i*row_idx_length+1,1] # 10 deg 10
# Final output
# replace the diagonal of A with max values
# get the idx of diag
A_di = np.diag_indices(number_of_groups)
# replace with max values
A[A_di] = max_values
print ('A = ' + str(A))
# Compute determinant of A
det_A = np.linalg.det(A)
print ('det(A) = '+str(det_A))
Output 1:
Group number : 1
Number : 12
Angle : 20_deg
Absolute extrema value : -56
------
Group number : 2
Number : 101
Angle : 10_deg
Absolute extrema value : -52
------
Group number : 3
Number : 101
Angle : 10_deg
Absolute extrema value : -71
------
Group number : 4
Number : 10
Angle : 10_deg
Absolute extrema value : 52
------
Output 2 :
A = [[-56 0 0 0]
[ 0 -52 0 0]
[ 0 0 -71 0]
[ 0 0 0 52]]
Output 3 :
A = [[-56 -18 -18 41]
[ 33 -52 -52 12]
[ 37 -71 -71 12]
[ 25 -6 -6 52]]
det(A) = -5.4731330578761246e-11
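The search for the signed value with the largest magnitude in each group (output 1 above) can also be sketched without explicit loops. The version below assumes every group has exactly 5 consecutive rows, as in the sample data; the names group_size, blocks and summary are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'0_deg': [43, 50, 45, -17, 5, 19, 11, 32, 36, 41, 19, 11, 32, 36, 1, 19, 7, 1, 36, 10],
     '10_deg': [47, 41, 46, -18, 4, 16, 12, 34, -52, 31, 16, 12, 34, -71, 2, 9, 52, 34, -6, 9],
     '20_deg': [46, 43, -56, 29, 6, 14, 13, 33, 43, 6, 14, 13, 37, 43, 3, 14, 13, 25, 40, 8],
     '30_deg': [-46, 16, -40, -11, 9, 15, 33, -39, -22, 21, 15, 63, -39, -22, 4, 6, 25, -39, -22, 7]},
    index=[5, 10, 12, 101, 130] * 4)

group_size = 5  # assumption: each group is 5 consecutive rows
values = df.to_numpy()
n_groups = len(df) // group_size
n_cols = values.shape[1]

# One (group_size x n_cols) block per group
blocks = values.reshape(n_groups, group_size, n_cols)
# Position of the largest absolute value inside each flattened block
flat_pos = np.abs(blocks).reshape(n_groups, -1).argmax(axis=1)
row_pos, col_pos = np.unravel_index(flat_pos, (group_size, n_cols))
# Signed value at that position
extrema = blocks[np.arange(n_groups), row_pos, col_pos]

summary = pd.DataFrame({
    'Number': df.index[:group_size].to_numpy()[row_pos],
    'Angle': df.columns.to_numpy()[col_pos],
    'Value': extrema,
})
print(summary)
print(np.diag(extrema))  # extrema on the diagonal, as in output 2
```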
