I have a probability table like this:
BC_array =[np.array(['B=n','B=m','B=s','B=n','B=m','B=s']),np.array(['C=F', 'C=F', 'C=F', 'C=T', 'C=T', 'C=T'])]
pD_BC_array=np.array([[0.9,0.8,0.1,0.3,0.4,0.01],[0.08,0.17,0.01,0.05,0.05,0.01],[0.01,0.01,0.87,0.05,0.15,0.97],[0.01,0.02,0.02,0.6,0.4,0.01]])
pD_BC=pd.DataFrame(pD_BC_array,index=['D=h','D=c','D=s','D=r'],columns=BC_array)
B=n B=m B=s B=n B=m B=s
C=F C=F C=F C=T C=T C=T
D=h 0.90 0.80 0.10 0.30 0.40 0.01
D=c 0.08 0.17 0.01 0.05 0.05 0.01
D=s 0.01 0.01 0.87 0.05 0.15 0.97
D=r 0.01 0.02 0.02 0.60 0.40 0.01
How can I marginalize out 'C' (that is, sum the 'C=F' and 'C=T' columns for each value of 'B') to get this table:
B=n B=m B=s
D=h 1.20 1.20 0.11
D=c 0.13 0.22 0.02
D=s 0.06 0.16 1.84
D=r 0.61 0.42 0.03
like this?
You can call sum on the DataFrame, passing axis=1 to sum across the columns and level=0 to group by the first level of the column MultiIndex:
In [259]:
pD_BC.sum(axis=1, level=0)
Out[259]:
B=m B=n B=s
D=h 1.20 1.20 0.11
D=c 0.22 0.13 0.02
D=s 0.16 0.06 1.84
D=r 0.42 0.61 0.03
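As far as I know, the level argument of sum was deprecated in pandas 1.3 and removed in 2.0, so the call above may fail on a recent install. A minimal sketch of an equivalent, assuming a newer pandas: transpose, group on the first index level, sum, and transpose back. The name marginal and the sort=False choice (to keep the B=n, B=m, B=s order from the question) are mine, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the table from the question.
BC_array = [
    np.array(['B=n', 'B=m', 'B=s', 'B=n', 'B=m', 'B=s']),
    np.array(['C=F', 'C=F', 'C=F', 'C=T', 'C=T', 'C=T']),
]
pD_BC_array = np.array([
    [0.9, 0.8, 0.1, 0.3, 0.4, 0.01],
    [0.08, 0.17, 0.01, 0.05, 0.05, 0.01],
    [0.01, 0.01, 0.87, 0.05, 0.15, 0.97],
    [0.01, 0.02, 0.02, 0.6, 0.4, 0.01],
])
pD_BC = pd.DataFrame(pD_BC_array,
                     index=['D=h', 'D=c', 'D=s', 'D=r'],
                     columns=BC_array)

# Group the transposed frame by the 'B' level, sum over 'C', transpose back.
# sort=False keeps the groups in order of first appearance.
marginal = pD_BC.T.groupby(level=0, sort=False).sum().T
print(marginal)
```

The double transpose avoids grouping along axis=1 directly, which newer pandas versions also warn about.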
I have a problem in Python. My table looks like the following, with value columns numbered 1 to 6 (the values are random, just to show the general idea):
time   sensor  sample  value1  value2  value3  value4  value5  value6
22.10  ACCX    6       0.23    0.44    0.53    0.23    0.44    0.53
22.10  ACCY    6       0.87    0.32    0.12    0.87    0.32    0.12
22.10  ACCZ    6       0.44    0.33    0.45    0.63    0.44    0.93
22.12  ACCX    6       0.63    0.44    0.93    0.87    0.32    0.12
22.12  ACCY    6       0.87    0.32    0.12    0.44    0.33    0.45
22.12  ACCZ    6       0.44    0.33    0.45    0.34    0.22    0.78
22.15  ACCX    6       0.23    0.44    0.53    0.64    0.53    0.25
22.15  ACCY    6       0.87    0.32    0.12    0.87    0.32    0.12
22.15  ACCZ    6       0.44    0.33    0.45    0.44    0.33    0.45
22.18  ACCX    6       0.63    0.44    0.93    0.87    0.32    0.12
22.18  ACCY    6       0.87    0.32    0.12    0.44    0.33    0.45
22.18  ACCZ    6       0.44    0.33    0.45    0.87    0.32    0.12
I need to convert the rows that share the same time and sensor into columns. Each sensor should become a column, and each time should be repeated 6 times (once per value column), like this:
time   ACCX  ACCY  ACCZ
22.10  0.23  0.44  0.23
22.10  0.87  0.32  0.12
22.10  0.44  0.33  0.45
22.10  0.23  0.44  0.23
22.10  0.87  0.32  0.12
22.10  0.44  0.33  0.45
22.12  0.23  0.44  0.53
22.12  0.87  0.32  0.12
22.12  0.44  0.33  0.45
22.12  0.44  0.33  0.45
22.12  0.63  0.44  0.93
22.12  0.87  0.32  0.12
22.15  0.44  0.33  0.45
22.15  0.23  0.44  0.53
22.15  0.87  0.32  0.12
22.15  0.44  0.33  0.45
22.15  0.63  0.44  0.93
22.15  0.87  0.32  0.12
22.18  0.44  0.33  0.45
22.18  0.44  0.33  0.45
22.18  0.63  0.44  0.93
22.18  0.87  0.32  0.12
22.18  0.44  0.33  0.45
22.18  0.44  0.33  0.45
First move the valueN columns into a single column (together with their column labels) by melting the dataframe with .melt, then .pivot the sensors into the columns, and do some cleaning up:
res = (
    df.drop(columns="sample")
    .melt(id_vars=["time", "sensor"])
    .pivot(index=["time", "variable"], columns="sensor")
    .droplevel(-1).reset_index()
    .droplevel(0, axis=1).rename(columns={"": "time"})
)
But note: for your sample, the values in the result don't match your expected output. Is the expected output correct?
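To make the melt-then-pivot idea easier to try out, here is a small self-contained sketch on a cut-down version of the sample (one time stamp, two value columns). It is a slight variation on the chain above, not the same code: passing values="value" to pivot is my own simplification so that no column level has to be dropped afterwards. It assumes pandas 1.1 or newer, where pivot accepts a list for index.

```python
import pandas as pd

# Cut-down version of the question's table (assumed sample, two value columns).
df = pd.DataFrame({
    "time": [22.10, 22.10, 22.10],
    "sensor": ["ACCX", "ACCY", "ACCZ"],
    "sample": [6, 6, 6],
    "value1": [0.23, 0.87, 0.44],
    "value2": [0.44, 0.32, 0.33],
})

res = (
    df.drop(columns="sample")
      # long format: one row per (time, sensor, valueN)
      .melt(id_vars=["time", "sensor"])
      # one row per (time, valueN), one column per sensor
      .pivot(index=["time", "variable"], columns="sensor", values="value")
      .reset_index()
      .drop(columns="variable")
      # remove the leftover "sensor" label on the column axis
      .rename_axis(columns=None)
)
print(res)
```

Each time value ends up repeated once per valueN column, with the sensors side by side.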
Suppose we have a DataFrame with summary values, as shown below. Thanks to @jezrael's answer here, the sum is now in the first row and the average is in the second row, but it looks ugly. How can I put the sum and the average in the same row, with the index name Total, and place that row first, as below?
# Total 27.56 25.04 -1.31
The pandas code is as below:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df[['value_a','value_b']].sum().to_frame().T
df2 = df[['difference']].mean().to_frame().T
df = pd.concat([df1,df2, df], ignore_index=True)
df
value_a value_b name up_or_down difference
project_name
27.56 25.04
-1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Thanks so much for any advice
Use DataFrame.agg with a dictionary that maps each column to its aggregate function:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df.agg({'value_a':'sum', 'value_b':'sum', 'difference':'mean'}).to_frame('Total').T
df = pd.concat([df1,df])
print (df.head())
value_a value_b difference name up_or_down
Total 27.56 25.04 -0.035405 NaN NaN
2021-project11 0.43 0.48 0.050000 2021-project11 up
2021-project1 0.62 0.56 -0.060000 2021-project1 down
2021-project2 0.51 0.47 -0.040000 2021-project2 down
2021-porject3 0.37 0.34 -0.030000 2021-porject3 down
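For anyone who wants to run this without the full table, here is a minimal sketch of the same agg approach on three made-up rows. The project names p1 to p3 and the numbers are placeholders I picked for illustration.

```python
import pandas as pd

# Tiny stand-in for the real table (hypothetical rows).
df = pd.DataFrame(
    {
        "value_a": [0.43, 0.62, 0.51],
        "value_b": [0.48, 0.56, 0.47],
        "difference": [0.05, -0.06, -0.04],
    },
    index=["p1", "p2", "p3"],
)

# One aggregate per column gives a Series; turn it into a one-row frame
# labelled "Total" and put it on top of the original rows.
total = df.agg({"value_a": "sum", "value_b": "sum", "difference": "mean"}).to_frame("Total").T
out = pd.concat([total, df])
print(out)
```

Columns that are not listed in the dictionary (such as name and up_or_down in the full table) simply come out as NaN in the Total row.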
I have the following numpy 3d array:
mat.data['Sylvain_2015'].shape = (180, 12, 15)
This array is populated with a variable (muscle activation) for each participant (first dimension: 180), for each muscle (second dimension: 12), for each condition (third dimension: 15).
I want to transform this array to the following pandas dataframe:
muscle participant activation test
0 1 1 100.000000 1
1 1 1 69.322225 2
2 1 1 84.917656 3
3 1 1 80.983069 4
4 1 1 65.163384 5
5 1 1 30.528706 6
Is there a more efficient way than using three for loops?
participants, datasets, muscles, tests, relative_mvc = ([] for i in range(5))
for iparticipant in range(mat.data[idataset].shape[0]):
    for imuscle in range(mat.data[idataset].shape[1]):
        max_mvc = np.nanmax(mat.data[idataset][iparticipant, imuscle, :])
        for itest in range(mat.data[idataset].shape[2]):
            participants.append(iparticipant+1)
            datasets.append(idataset)
            muscles.append(imuscle+1)
            tests.append(itest+1)
            # normalize mvc (relative to max)
            relative_mvc.append(mat.data[idataset][iparticipant, imuscle, itest]*100/max_mvc)
df = pd.DataFrame({
    'participant': participants,
    'dataset': datasets,
    'muscle': muscles,
    'test': tests,
    'relative_mvc': relative_mvc,
}).dropna()
Here is a sample of the 3d array for two participants (created with this useful post)
# Array shape: (2, 12, 15)
0.13 0.09 0.11 0.11 0.09 0.04 0.03 0.06 0.11 0.09 0.03 0.10 0.01 0.03 0.08
0.21 0.36 0.34 0.18 0.25 0.23 0.11 0.05 0.27 0.27 0.13 0.26 0.04 0.02 0.34
0.16 0.09 0.41 0.28 0.20 0.10 0.16 0.04 0.15 0.25 0.04 0.18 0.02 0.09 0.24
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
0.09 0.09 0.10 0.09 0.08 0.05 0.01 0.02 0.08 0.07 0.08 0.08 0.01 0.02 0.09
0.17 0.39 0.33 0.21 0.17 0.29 0.06 0.01 0.21 0.25 0.27 0.22 0.03 0.01 0.31
0.01 0.01 0.01 0.03 0.01 0.01 0.03 0.06 0.01 0.01 0.04 0.01 0.03 0.06 0.01
0.06 0.01 0.07 0.07 0.07 0.03 0.06 0.12 0.09 0.08 0.04 0.04 0.04 0.03 0.10
0.01 0.03 0.02 0.01 0.01 0.11 0.10 0.01 0.01 0.01 0.09 0.01 0.04 0.01 0.02
0.10 0.10 0.14 0.11 0.08 0.03 0.01 0.02 0.05 0.06 0.01 0.09 0.01 0.01 0.10
0.05 0.03 0.06 0.08 0.08 0.01 0.03 0.02 0.03 0.04 0.02 0.07 0.00 0.02 0.06
0.04 0.05 0.03 0.02 0.08 0.03 0.02 0.02 0.06 0.05 0.02 0.06 0.03 0.01 0.02
# New slice
0.21 0.08 0.15 0.11 0.15 0.05 0.01 0.01 0.06 0.04 0.02 0.13 0.02 0.02 0.16
0.26 0.14 0.18 0.12 0.22 0.10 0.10 0.07 0.12 0.17 0.09 0.18 0.03 0.02 0.13
0.10 0.13 0.13 0.05 0.08 0.08 0.08 0.03 0.03 0.06 0.10 0.06 0.05 0.02 0.05
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
0.11 0.08 0.10 0.07 0.10 0.05 0.02 0.02 0.05 0.03 0.03 0.10 0.05 0.04 0.10
0.13 0.20 0.18 0.12 0.12 0.17 0.03 0.01 0.12 0.10 0.12 0.15 0.09 0.04 0.16
0.02 0.01 0.01 0.06 0.03 0.01 0.03 0.06 0.02 0.01 0.04 0.04 0.04 0.05 0.04
0.02 0.02 0.04 0.03 0.05 0.04 0.07 0.03 0.04 0.01 0.02 0.06 0.03 0.03 0.03
0.02 0.03 0.02 0.02 0.02 0.07 0.04 0.02 0.01 0.01 0.04 0.02 0.03 0.02 0.02
0.07 0.11 0.14 0.03 0.04 0.08 0.01 0.01 0.10 0.11 0.01 0.02 0.01 0.01 0.02
0.03 0.02 0.03 0.05 0.04 0.01 0.01 0.02 0.01 0.03 0.01 0.04 0.01 0.01 0.03
0.04 0.05 0.03 0.03 0.04 0.06 0.02 0.01 0.01 0.03 0.05 0.03 0.03 0.02 0.02
# New slice
I can think of a couple of ways. Here is one way without using loops, where you make a Panel first and then convert it to a dataframe.
# normalize values first
max_values = np.nanmax(mat.data[idataset], axis=2)
values = mat.data[idataset]*100/max_values.reshape(max_values.shape +(1,))
# get sizes
iparticipant, imuscle, itest = mat.data[idataset].shape
# set axes labels
items = np.arange(1, 1+iparticipant)
major_axis = np.arange(1, imuscle+1)
minor_axis = np.arange(1, itest + 1)
# make a panel (3-d dataframe)
panel = pd.Panel(values, items=items, major_axis=major_axis, minor_axis=minor_axis)
# convert to dataframe and fix column labels
df = panel.to_frame().stack().reset_index()
df.columns = ['muscle', 'test', 'participant', 'relative_mvc']
df['dataset'] = idataset
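Note that pd.Panel was removed in pandas 0.25, so the Panel route only works on old installs. A rough sketch of a loop-free alternative that should work on current pandas: normalize with broadcasting, then label the flattened array with a MultiIndex built from the three axes. The random array named data stands in for mat.data[idataset], and the single NaN is just there to show dropna at work; both are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for mat.data[idataset]: (participants, muscles, tests).
rng = np.random.default_rng(0)
data = rng.random((2, 3, 4))
data[0, 1, 2] = np.nan  # pretend one measurement is missing

# Normalize each (participant, muscle) row by its max over the tests.
max_values = np.nanmax(data, axis=2, keepdims=True)
relative = data * 100 / max_values

# Labels start at 1; from_product varies the last level fastest,
# which matches the C-order of ravel().
n_participants, n_muscles, n_tests = data.shape
index = pd.MultiIndex.from_product(
    [
        np.arange(1, n_participants + 1),
        np.arange(1, n_muscles + 1),
        np.arange(1, n_tests + 1),
    ],
    names=["participant", "muscle", "test"],
)
df = (
    pd.DataFrame({"relative_mvc": relative.ravel()}, index=index)
    .reset_index()
    .dropna()
)
print(df.head())
```

If a muscle is NaN for every test (as in the sample), np.nanmax emits an "All-NaN slice" warning and returns NaN for that row, and dropna then removes those rows.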
I have the following example dataframe:
a b c d e F
0.02 0.62 0.31 0.67 0.27 a
0.30 0.07 0.23 0.42 0.00 a
0.82 0.59 0.34 0.73 0.29 a
0.90 0.80 0.13 0.14 0.07 d
0.50 0.62 0.94 0.34 0.53 d
0.59 0.84 0.95 0.42 0.54 d
0.13 0.33 0.87 0.20 0.25 d
0.47 0.37 0.84 0.69 0.28 e
Column F holds the name of one of the other columns.
For each row, I want to look up the value in the column named by F and collect those values into a new column.
The outcome will look like this:
a b c d e F To_Be_Filled
0.02 0.62 0.31 0.67 0.27 a 0.02
0.30 0.07 0.23 0.42 0.00 a 0.30
0.82 0.59 0.34 0.73 0.29 a 0.82
0.90 0.80 0.13 0.14 0.07 d 0.14
0.50 0.62 0.94 0.34 0.53 d 0.34
0.59 0.84 0.95 0.42 0.54 d 0.42
0.13 0.33 0.87 0.20 0.25 d 0.20
0.47 0.37 0.84 0.69 0.28 e 0.28
I can pick out a single case with the code below, but I'm not sure how to do it across the whole dataframe.
test.loc[test.iloc[:,5]=='a',test.columns=='a']
Many thanks in advance.
You can use lookup:
df['To_Be_Filled'] = df.lookup(np.arange(len(df)), df['F'])
df
Out:
a b c d e F To_Be_Filled
0 0.02 0.62 0.31 0.67 0.27 a 0.02
1 0.30 0.07 0.23 0.42 0.00 a 0.30
2 0.82 0.59 0.34 0.73 0.29 a 0.82
3 0.90 0.80 0.13 0.14 0.07 d 0.14
4 0.50 0.62 0.94 0.34 0.53 d 0.34
5 0.59 0.84 0.95 0.42 0.54 d 0.42
6 0.13 0.33 0.87 0.20 0.25 d 0.20
7 0.47 0.37 0.84 0.69 0.28 e 0.28
np.arange(len(df)) can be replaced with df.index.
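One caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of the replacement pattern suggested in the pandas indexing docs, using factorize plus NumPy integer indexing, on a shortened copy of the sample (four rows, column c left out to keep it small):

```python
import numpy as np
import pandas as pd

# Shortened copy of the question's dataframe.
df = pd.DataFrame({
    "a": [0.02, 0.30, 0.90, 0.47],
    "b": [0.62, 0.07, 0.80, 0.37],
    "d": [0.67, 0.42, 0.14, 0.69],
    "e": [0.27, 0.00, 0.07, 0.28],
    "F": ["a", "a", "d", "e"],
})

# idx[i] is the position of row i's label within cols (the unique labels).
idx, cols = pd.factorize(df["F"])

# Keep only the referenced columns, then pick one cell per row.
df["To_Be_Filled"] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)
```

Reindexing to just the referenced columns keeps the intermediate array numeric instead of pulling the string column F into it.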
I have a Python function that randomizes a dictionary representing a position-specific scoring matrix.
For example:
mat = {
    'A' : [ 0.53, 0.66, 0.67, 0.05, 0.01, 0.86, 0.03, 0.97, 0.33, 0.41, 0.26 ],
    'C' : [ 0.14, 0.04, 0.13, 0.92, 0.99, 0.04, 0.94, 0.00, 0.07, 0.23, 0.35 ],
    'T' : [ 0.25, 0.07, 0.01, 0.01, 0.00, 0.04, 0.00, 0.03, 0.06, 0.12, 0.14 ],
    'G' : [ 0.08, 0.23, 0.20, 0.02, 0.00, 0.06, 0.04, 0.00, 0.54, 0.24, 0.25 ],
}
The scrambling function:
import random
def scramble_matrix(matrix, iterations):
    mat_len = len(matrix["A"])
    pos1 = pos2 = 0
    for count in range(iterations):
        pos1,pos2 = random.sample(range(mat_len), 2)
        # shuffle the matrix: swap the two positions in every row
        for nuc in matrix.keys():
            matrix[nuc][pos1],matrix[nuc][pos2] = matrix[nuc][pos2],matrix[nuc][pos1]
    return matrix
def print_matrix(matrix):
    for nuc in matrix.keys():
        print nuc+"[",
        for count in matrix[nuc]:
            print "%.2f"%count,
        print "]"
Now to the problem...
When I scramble a matrix directly, it works fine:
print_matrix(mat)
print ""
print_matrix(scramble_matrix(mat,10))
gives:
A[ 0.53 0.66 0.67 0.05 0.01 0.86 0.03 0.97 0.33 0.41 0.26 ]
C[ 0.14 0.04 0.13 0.92 0.99 0.04 0.94 0.00 0.07 0.23 0.35 ]
T[ 0.25 0.07 0.01 0.01 0.00 0.04 0.00 0.03 0.06 0.12 0.14 ]
G[ 0.08 0.23 0.20 0.02 0.00 0.06 0.04 0.00 0.54 0.24 0.25 ]
A[ 0.41 0.97 0.03 0.86 0.53 0.66 0.33 0.05 0.67 0.26 0.01 ]
C[ 0.23 0.00 0.94 0.04 0.14 0.04 0.07 0.92 0.13 0.35 0.99 ]
T[ 0.12 0.03 0.00 0.04 0.25 0.07 0.06 0.01 0.01 0.14 0.00 ]
G[ 0.24 0.00 0.04 0.06 0.08 0.23 0.54 0.02 0.20 0.25 0.00 ]
But when I try to collect several scrambled matrices in a list, it does not work:
print_matrix(mat)
s=[]
for x in range(3):
    s.append(scramble_matrix(mat,10))
for matrix in s:
    print ""
    print_matrix(matrix)
result:
A[ 0.53 0.66 0.67 0.05 0.01 0.86 0.03 0.97 0.33 0.41 0.26 ]
C[ 0.14 0.04 0.13 0.92 0.99 0.04 0.94 0.00 0.07 0.23 0.35 ]
T[ 0.25 0.07 0.01 0.01 0.00 0.04 0.00 0.03 0.06 0.12 0.14 ]
G[ 0.08 0.23 0.20 0.02 0.00 0.06 0.04 0.00 0.54 0.24 0.25 ]
A[ 0.01 0.66 0.97 0.67 0.03 0.05 0.33 0.53 0.26 0.41 0.86 ]
C[ 0.99 0.04 0.00 0.13 0.94 0.92 0.07 0.14 0.35 0.23 0.04 ]
T[ 0.00 0.07 0.03 0.01 0.00 0.01 0.06 0.25 0.14 0.12 0.04 ]
G[ 0.00 0.23 0.00 0.20 0.04 0.02 0.54 0.08 0.25 0.24 0.06 ]
A[ 0.01 0.66 0.97 0.67 0.03 0.05 0.33 0.53 0.26 0.41 0.86 ]
C[ 0.99 0.04 0.00 0.13 0.94 0.92 0.07 0.14 0.35 0.23 0.04 ]
T[ 0.00 0.07 0.03 0.01 0.00 0.01 0.06 0.25 0.14 0.12 0.04 ]
G[ 0.00 0.23 0.00 0.20 0.04 0.02 0.54 0.08 0.25 0.24 0.06 ]
A[ 0.01 0.66 0.97 0.67 0.03 0.05 0.33 0.53 0.26 0.41 0.86 ]
C[ 0.99 0.04 0.00 0.13 0.94 0.92 0.07 0.14 0.35 0.23 0.04 ]
T[ 0.00 0.07 0.03 0.01 0.00 0.01 0.06 0.25 0.14 0.12 0.04 ]
G[ 0.00 0.23 0.00 0.20 0.04 0.02 0.54 0.08 0.25 0.24 0.06 ]
What is the problem?
Why does the scrambling not work after the first time, and why is the whole list filled with the same matrix?
Your scrambling function modifies the existing matrix; it does not create a new one.
You create a matrix, scramble it and add it to a list. Then you scramble it again and add it to the list again. Both elements of the list now refer to the same matrix object, which has been scrambled twice.
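A quick way to see this is to check object identity. The sketch below is written for Python 3 and uses a simplified stand-in, swap_first_two (a name I made up), which mutates its argument and returns it just as scramble_matrix does:

```python
def swap_first_two(matrix):
    # Mutates the dict's lists in place and returns the very same dict.
    for row in matrix.values():
        row[0], row[1] = row[1], row[0]
    return matrix

mat = {"A": [0.53, 0.66], "C": [0.14, 0.04]}

# Three calls, three list entries, but only one underlying object.
s = [swap_first_two(mat) for _ in range(3)]
all_same = all(item is mat for item in s)
print(all_same)
```

Since every entry is the same object as mat, printing any of them shows the state after the last call.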
You are shuffling the same matrix in place 3 times, but what you really want is to shuffle 3 copies of the original matrix. So you should do:
from copy import deepcopy
print_matrix(mat)
s=[]
for x in range(3):
    s.append(scramble_matrix(deepcopy(mat),10)) # note the deepcopy()
for matrix in s:
    print ""
    print_matrix(matrix)
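Another option, if you would rather not remember the deepcopy at every call site, is to copy inside the function so it never touches its input. A Python 3 sketch of that idea; scrambled_copy is a hypothetical name, not something from the question, and the four-column matrix is a shortened example:

```python
import random
from copy import deepcopy

def scrambled_copy(matrix, iterations):
    """Return a scrambled deep copy; the input matrix is left untouched."""
    result = deepcopy(matrix)
    length = len(next(iter(result.values())))
    for _ in range(iterations):
        pos1, pos2 = random.sample(range(length), 2)
        # Swap the same two positions in every row so columns stay aligned.
        for row in result.values():
            row[pos1], row[pos2] = row[pos2], row[pos1]
    return result

mat = {
    "A": [0.53, 0.66, 0.67, 0.05],
    "C": [0.14, 0.04, 0.13, 0.92],
    "T": [0.25, 0.07, 0.01, 0.01],
    "G": [0.08, 0.23, 0.19, 0.02],
}
original = deepcopy(mat)
s = [scrambled_copy(mat, 10) for _ in range(3)]
```

The trade-off is one extra copy per call, which is negligible for matrices of this size.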