How would you flip and fold a matrix diagonally with pandas? - python
I have some data that I would like to organize for visualization and statistics, but I don't know how to proceed.
The data come from a pairwise comparison test and are stored in a pandas DataFrame with 3 columns (stimA, stimB and subjectAnswer) and 10 rows (one per pair). Example:
stimA  stimB  subjectAnswer
1      2      36
3      1      55
5      3      98
...    ...    ...
My goal is to organize them as a matrix in which each row and each column corresponds to one stimulus, with the subjectAnswer values grouped below (to the left of) the matrix diagonal. In my example, the subjectAnswer 36 for stimA 1 and stimB 2 should go to index [2][1], like this:
stimA/stimB  1    2    3    4    5
1            ...
2            36
3            55
4            ...
5            ...  ...  98
I managed to pivot the first table into a matrix, but I could not arrange the data below the diagonal. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
session1 = pd.read_csv(filepath, names=['stimA', 'stimB', 'subjectAnswer'])
pivoted = session1.pivot(index='stimA', columns='stimB', values='subjectAnswer')
Which gives :
session1 :
   stimA  stimB  subjectAnswer
0      1      3              6
1      4      3             21
2      4      5             26
3      2      3             10
4      1      2              6
5      1      5              6
6      4      1              6
7      5      2             13
8      3      5             15
9      2      4             26
pivoted :
stimB    1     2     3     4     5
stimA
1      NaN   6.0   6.0   NaN   6.0
2      NaN   NaN  10.0  26.0   NaN
3      NaN   NaN   NaN   NaN  15.0
4      6.0   NaN  21.0   NaN  26.0
5      NaN  13.0   NaN   NaN   NaN
The expected output for pivoted :
stimB    1     2     3     4    5
stimA
1      NaN   NaN   NaN   NaN  NaN
2      6.0   NaN   NaN   NaN  NaN
3      6.0  10.0   NaN   NaN  NaN
4      6.0  26.0  21.0   NaN  NaN
5      6.0  13.0  15.0  26.0  NaN
Thanks a lot for your help !
If I understand you correctly, the stimuli A and B are interchangeable. So to get the matrix layout you want, you can swap A with B in those rows where A is smaller than B. In other words, you don't use the original A and B for the pivot table, but the maximum and minimum of A and B:
session1['stim_min'] = np.min(session1[['stimA', 'stimB']], axis=1)
session1['stim_max'] = np.max(session1[['stimA', 'stimB']], axis=1)
pivoted = session1.pivot(index='stim_max', columns='stim_min', values='subjectAnswer')
pivoted
stim_min    1     2     3     4
stim_max
2         6.0   NaN   NaN   NaN
3         6.0  10.0   NaN   NaN
4         6.0  26.0  21.0   NaN
5         6.0  13.0  15.0  26.0
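The min/max idea above can be tried end to end with the sample data from the question. This is a minimal sketch, not the answerer's exact code: the data are typed in by hand instead of being read from the CSV file, the name `tidy` is made up for illustration, and the final `reindex` is an addition that pads the result to the full 5 x 5 shape the question asks for.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (inlined instead of read from a CSV).
session1 = pd.DataFrame({
    "stimA": [1, 4, 4, 2, 1, 1, 4, 5, 3, 2],
    "stimB": [3, 3, 5, 3, 2, 5, 1, 2, 5, 4],
    "subjectAnswer": [6, 21, 26, 10, 6, 6, 6, 13, 15, 26],
})

# Row-wise minimum and maximum of the two stimulus columns.
pair = session1[["stimA", "stimB"]]
tidy = session1.assign(stim_min=pair.min(axis=1), stim_max=pair.max(axis=1))

# The larger stimulus becomes the row label, the smaller one the column label,
# so every answer lands below the diagonal.
pivoted = tidy.pivot(index="stim_max", columns="stim_min", values="subjectAnswer")

# Stimulus 1 is never the larger of a pair and stimulus 5 is never the smaller,
# so reindex on the full label set to obtain a square matrix.
labels = np.union1d(session1["stimA"], session1["stimB"])
square = pivoted.reindex(index=labels, columns=labels)
print(square)
```

The keyword form of `pivot` is used because recent pandas releases no longer accept these arguments positionally.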
Sort stimA and stimB within each row (along the columns axis) and assign the result to two temporary columns, x and y. Sorting is required so that x always holds the smaller stimulus and y the larger one, which keeps every value below the diagonal and leaves the upper-right part of the matrix empty.
Then pivot the dataframe with y as the index, x as the columns and subjectAnswer as the values, and reindex the reshaped frame so that every unique stimulus name appears in both the index and the columns of the matrix:
session1[['x', 'y']] = np.sort(session1[['stimA', 'stimB']], axis=1)
i = np.union1d(session1['x'], session1['y'])
session1.pivot(index='y', columns='x', values='subjectAnswer').reindex(index=i, columns=i)
x    1     2     3     4    5
y
1  NaN   NaN   NaN   NaN  NaN
2  6.0   NaN   NaN   NaN  NaN
3  6.0  10.0   NaN   NaN  NaN
4  6.0  26.0  21.0   NaN  NaN
5  6.0  13.0  15.0  26.0  NaN
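One caveat with folding pairs this way: if the same pair was presented in both orders (for example 1/2 and 2/1), both rows map to the same cell and `pivot` raises a `ValueError` about duplicate entries. The sketch below uses hypothetical data, not the question's, and assumes that averaging repeated pairs is acceptable; `pivot_table` with another `aggfunc` would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the pair (1, 2) was presented in both orders.
trials = pd.DataFrame({
    "stimA": [1, 2, 3, 3],
    "stimB": [2, 1, 1, 2],
    "subjectAnswer": [30, 40, 55, 10],
})

# Sort each pair so that x is the smaller and y the larger stimulus.
ordered = np.sort(trials[["stimA", "stimB"]].to_numpy(), axis=1)
trials["x"] = ordered[:, 0]
trials["y"] = ordered[:, 1]

# pivot_table aggregates duplicate (y, x) cells instead of raising an error;
# here repeated pairs are averaged (an assumption made for this sketch).
labels = np.union1d(trials["x"], trials["y"])
matrix = (
    trials.pivot_table(index="y", columns="x", values="subjectAnswer", aggfunc="mean")
    .reindex(index=labels, columns=labels)
)
print(matrix)
```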
Related
Maintaining dataframe shape when slicing in pandas
I've imported a .csv into pandas and want to extract specific values and put them into a new column whilst maintaining the existing shape. So df[::3] extracts the data:
1 1
2 4
3 7
4
5
6
7
I want it to look like:
1 1
2
3
4 4
5
6
7 7
Here is a solution:
df = pd.read_csv(r"C:/users/k_sego/colsplit.csv", sep=";")
df1 = df[['col1']]
df2 = df[['col2']]
DF = pd.merge(df1, df2, how='outer', left_on=['col1'], right_on=['col2'])
and the result is
    col1  col2
0    1.0   1.0
1    2.0   NaN
2    3.0   NaN
3    4.0   4.0
4    5.0   NaN
5    6.0   NaN
6    7.0   7.0
7    NaN   NaN
8    NaN   NaN
9    NaN   NaN
10   NaN   NaN
Pandas set all values after first NaN to NaN
For each row I would like to set all values to NaN after the appearance of the first NaN. E.g.:
   a    b    c
1  2    3    4
2  nan  2    nan
3  3    nan  23
Should become this:
   a    b    c
1  2    3    4
2  nan  nan  nan
3  3    nan  nan
So far I only know how to do this with an apply with a for loop over each column per row - it's very slow!
Check with cumprod:
df = df.where(df.notna().cumprod(axis=1).eq(1))
     a    b    c
1  2.0  3.0  4.0
2  NaN  NaN  NaN
3  3.0  NaN  NaN
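To see why the cumulative product works, here is a small self-contained sketch built on the three-row example from the question. The explicit `astype(int)` and the names `keep` and `result` are additions for readability, not part of the answer.

```python
import numpy as np
import pandas as pd

# The example frame from the question.
df = pd.DataFrame(
    {"a": [2, np.nan, 3], "b": [3, 2, np.nan], "c": [4, np.nan, 23]},
    index=[1, 2, 3],
)

# notna() marks present values with 1 and missing ones with 0. The running
# product along each row stays 1 until the first NaN and is 0 from then on.
keep = df.notna().astype(int).cumprod(axis=1).eq(1)

# where() keeps values under a True mask and replaces the rest with NaN.
result = df.where(keep)
print(result)
```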
Pandas: grab positions in dataframe which indexes are listed in another dataframe
Suppose that I have 2 dataframes, with indexes populated so that elements in columns are unique, because in real data they are:
vals = pd.DataFrame(np.random.randint(0,10,(10, 3)), columns=list('ABC'))
indexes = pd.DataFrame(np.argsort(np.random.randint(0,10,(10, 3)), axis=0)[:5], columns=list('ABC'))
>>> vals
    A   B   C
0  64  20  48
1  28  60  81
2   5  73  77
3  74  66  86
4  41  39  21
5  65  37  98
6  10  20  73
7   6  70   3
8  36  29  28
9  43  13  12
>>> indexes
   A  B  C
0  4  2  3
1  3  3  8
2  5  1  7
3  9  8  9
4  2  4  0
I would like to retain only those values in vals which indexes are listed in indexes. I don't care about row integrity or NAs, as I'll use the columns as Series later.
This is what I came up with:
vals_indexes = pd.DataFrame()
for i in range(vals.shape[1]):
    vals_indexes = pd.concat([vals_indexes, vals.iloc[[e for e in indexes.iloc[:, i] if e in vals.index], i]], axis=1)
>>> vals_indexes
      A     B     C
0   NaN   NaN  48.0
1   NaN  60.0   NaN
2   5.0  73.0   NaN
3  74.0  66.0  86.0
4  41.0  39.0   NaN
5  65.0   NaN   NaN
7   NaN   NaN   3.0
8   NaN  29.0  28.0
9  43.0   NaN  12.0
Which is a bit ugly, but works for me.
Question: is there a more effective way to do this?
Use .loc within a loop to replace non-existing index with NaN:
for i in vals.columns:
    vals.loc[vals[i].isin(list(indexes[i].unique())), i] = np.nan
print(vals)
     A    B    C
0  NaN  2.0  NaN
1  NaN  5.0  NaN
2  2.0  3.0  NaN
3  NaN  NaN  NaN
4  NaN  NaN  6.0
5  9.0  NaN  NaN
6  NaN  NaN  4.0
7  NaN  7.0  NaN
8  2.0  NaN  NaN
9  NaN  NaN  NaN
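Note that the loop above tests the values of each column, not the row labels, against `indexes`. If the goal is to keep the values whose row label is listed, as in the question's own output, one possible alternative is sketched below. It is not taken from the answer, and the small frames are hypothetical stand-ins for the random data.

```python
import pandas as pd

# Small hypothetical frames (not the random data from the question).
vals = pd.DataFrame({"A": [10, 11, 12, 13, 14], "B": [20, 21, 22, 23, 24]})
indexes = pd.DataFrame({"A": [0, 2], "B": [3, 1]})

# For each column, keep a value only if its row label is listed in the
# same-named column of `indexes`; everything else becomes NaN.
selected = vals.apply(lambda col: col.where(col.index.isin(indexes[col.name])))

# Drop rows in which no column was selected, as in the question's output.
selected = selected.dropna(how="all")
print(selected)
```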
How to remove clustered/unclustered values less than a certain length from pandas dataframe?
If I have a pandas data frame like this:
      A
1     1
2     1
3   NaN
4     1
5   NaN
6     1
7     1
8     1
9     1
10  NaN
11    1
12    1
13    1
How do I remove values that are clustered in a length less than some value (in this case four) for example? Such that I get an array like this:
      A
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6     1
7     1
8     1
9     1
10  NaN
11  NaN
12  NaN
13  NaN
Using groupby and np.where:
s = df.groupby(df.A.isnull().cumsum()).transform(lambda s: pd.notnull(s).sum())
df['B'] = np.where(s.A>=4, df.A, np.nan)
Outputs
      A    B
1   1.0  NaN
2   1.0  NaN
3   NaN  NaN
4   1.0  NaN
5   NaN  NaN
6   1.0  1.0
7   1.0  1.0
8   1.0  1.0
9   1.0  1.0
10  NaN  NaN
11  1.0  NaN
12  1.0  NaN
13  1.0  NaN
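The same run-length idea can be written step by step on a Series. This is a sketch under the question's example data; the names `run_id`, `run_length`, `min_length` and `filtered` are chosen here for illustration, and `transform("count")` is used in place of the lambda.

```python
import numpy as np
import pandas as pd

# The column from the question, indexed 1..13.
A = pd.Series(
    [1, 1, np.nan, 1, np.nan, 1, 1, 1, 1, np.nan, 1, 1, 1],
    index=range(1, 14),
    name="A",
)

# Every NaN starts a new group, so each run of values gets its own id.
run_id = A.isna().cumsum()

# "count" ignores NaN, so this is the number of values in each run,
# broadcast back to every row of that run.
run_length = A.groupby(run_id).transform("count")

# Keep only values that belong to a run of at least min_length values.
min_length = 4
filtered = A.where(run_length >= min_length)
print(filtered)
```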
split pandas column with tuple
I have a dictionary of the form:
data = {A:[(1,2),(3,4),(5,6),(7,8),(8,9)], B:[(3,4),(4,5),(5,6),(6,7)], C:[(10,11),(12,13)]}
I create a dataFrame by:
df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in data.iteritems()]))
which in turn becomes:
A      B      C
(1,2)  (3,4)  (10,11)
(3,4)  (4,5)  (12,13)
(5,6)  (5,6)  NaN
(6,7)  (6,7)  NaN
(8,9)  NaN    NaN
Is there a way to go from the dataframe above to the one below:
A         B         C
one  two  one  two  one  two
1    2    3    4    10   11
3    4    4    5    12   13
5    6    5    6    NaN  NaN
6    7    6    7    NaN  NaN
8    9    NaN  NaN  NaN  NaN
You can use a list comprehension with the DataFrame constructor, converting each column to a list of tuples with values + tolist, and then concat:
cols = ['A','B','C']
L = [pd.DataFrame(df[x].values.tolist(), columns=['one','two']) for x in cols]
df = pd.concat(L, axis=1, keys=cols)
print (df)
    A       B       C
  one two one two one two
0   1   2   3   4   5   6
1   7   8   9  10  11  12
2  13  14  15  16  17  18
EDIT: A similar solution with a dict comprehension. Integer values are converted to floats, because the type of NaN is float too.
data = {'A':[(1,2),(3,4),(5,6),(7,8),(8,9)], 'B':[(3,4),(4,5),(5,6),(6,7)], 'C':[(10,11),(12,13)]}
cols = ['A','B','C']
d = {k: pd.DataFrame(v, columns=['one','two']) for k,v in data.items()}
df = pd.concat(d, axis=1)
print (df)
    A        B          C
  one two  one  two   one   two
0   1   2  3.0  4.0  10.0  11.0
1   3   4  4.0  5.0  12.0  13.0
2   5   6  5.0  6.0   NaN   NaN
3   7   8  6.0  7.0   NaN   NaN
4   8   9  NaN  NaN   NaN   NaN
EDIT: To multiply by one column, it is possible to use slicers:
s = df[('A', 'one')]
print (s)
0    1
1    3
2    5
3    7
4    8
Name: (A, one), dtype: int64
df.loc(axis=1)[:, 'one'] = df.loc(axis=1)[:, 'one'].mul(s, axis=0)
print (df)
      A         B          C
    one two   one  two   one   two
0   1.0   2   3.0  4.0  10.0  11.0
1   9.0   4  12.0  5.0  36.0  13.0
2  25.0   6  25.0  6.0   NaN   NaN
3  49.0   8  42.0  7.0   NaN   NaN
4  64.0   9   NaN  NaN   NaN   NaN
Another solution:
idx = pd.IndexSlice
df.loc[:, idx[:, 'one']] = df.loc[:, idx[:, 'one']].mul(s, axis=0)
print (df)
      A         B          C
    one two   one  two   one   two
0   1.0   2   3.0  4.0  10.0  11.0
1   9.0   4  12.0  5.0  36.0  13.0
2  25.0   6  25.0  6.0   NaN   NaN
3  49.0   8  42.0  7.0   NaN   NaN
4  64.0   9   NaN  NaN   NaN   NaN