I have some data I would like to organize for visualization and statistics, but I don't know how to proceed.
The data come from a pairwise comparison test and are stored in a pandas DataFrame with 3 columns (stimA, stimB and subjectAnswer) and 10 rows (the number of pairs). Example:
stimA  stimB  subjectAnswer
1      2      36
3      1      55
5      3      98
...    ...    ...
My goal is to organize them as a matrix in which each row and column corresponds to one stimulus, with the subjectAnswer data grouped on the left side of the matrix's diagonal (in my example, the subjectAnswer 36 corresponding to stimA 1 and stimB 2 should go to index [2][1]), like this:
stimA/stimB    1     2     3     4     5
1             ...
2             36
3             55
4             ...
5             ...   ...   98
I succeeded in pivoting the first table into a matrix, but I couldn't get the data arranged on the left side of the diagonal. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
session1 = pd.read_csv(filepath, names=['stimA', 'stimB', 'subjectAnswer'])
pivoted = session1.pivot(index='stimA', columns='stimB', values='subjectAnswer')
Which gives :
session1 :
stimA stimB subjectAnswer
0 1 3 6
1 4 3 21
2 4 5 26
3 2 3 10
4 1 2 6
5 1 5 6
6 4 1 6
7 5 2 13
8 3 5 15
9 2 4 26
pivoted :
stimB 1 2 3 4 5
stimA
1 NaN 6.0 6.0 NaN 6.0
2 NaN NaN 10.0 26.0 NaN
3 NaN NaN NaN NaN 15.0
4 6.0 NaN 21.0 NaN 26.0
5 NaN 13.0 NaN NaN NaN
The expected output for pivoted:
stimB 1 2 3 4 5
stimA
1 NaN NaN NaN NaN NaN
2 6.0 NaN NaN NaN NaN
3 6.0 10.0 NaN NaN NaN
4 6.0 26.0 21.0 NaN NaN
5 6.0 13.0 15.0 26.0 NaN
Thanks a lot for your help!
If I understand you correctly, the stimuli A and B are interchangeable. So to get the matrix layout you want, you can swap A with B in those rows where A is smaller than B. In other words, you don't use the original A and B for the pivot table, but the maximum and minimum of A and B:
session1['stim_min'] = np.min(session1[['stimA', 'stimB']], axis=1)
session1['stim_max'] = np.max(session1[['stimA', 'stimB']], axis=1)
pivoted = session1.pivot(index='stim_max', columns='stim_min', values='subjectAnswer')
pivoted
stim_min 1 2 3 4
stim_max
2 6.0 NaN NaN NaN
3 6.0 10.0 NaN NaN
4 6.0 26.0 21.0 NaN
5 6.0 13.0 15.0 26.0
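For reference, here is the same approach as a self-contained sketch, using keyword arguments for pivot (positional arguments are no longer accepted by recent pandas versions):

```python
import numpy as np
import pandas as pd

# The question's session1 data
session1 = pd.DataFrame({
    'stimA': [1, 4, 4, 2, 1, 1, 4, 5, 3, 2],
    'stimB': [3, 3, 5, 3, 2, 5, 1, 2, 5, 4],
    'subjectAnswer': [6, 21, 26, 10, 6, 6, 6, 13, 15, 26],
})

# Row = larger stimulus, column = smaller stimulus, so every
# answer lands below the diagonal
session1['stim_min'] = session1[['stimA', 'stimB']].min(axis=1)
session1['stim_max'] = session1[['stimA', 'stimB']].max(axis=1)
pivoted = session1.pivot(index='stim_max', columns='stim_min',
                         values='subjectAnswer')
print(pivoted)
```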
Sort the columns stimA and stimB along the columns axis and assign the results to two temporary columns, x and y, in the dataframe. Sorting is required to ensure that the resulting matrix is clipped on the upper-right side of the diagonal.
Pivot the dataframe with y as the index, x as the columns and subjectAnswer as the values, then reindex the reshaped frame to ensure that all the available unique stim names are present in both the index and the columns of the matrix:
session1[['x', 'y']] = np.sort(session1[['stimA', 'stimB']], axis=1)
i = np.union1d(session1['x'], session1['y'])
session1.pivot(index='y', columns='x', values='subjectAnswer').reindex(index=i, columns=i)
x 1 2 3 4 5
y
1 NaN NaN NaN NaN NaN
2 6.0 NaN NaN NaN NaN
3 6.0 10.0 NaN NaN NaN
4 6.0 26.0 21.0 NaN NaN
5 6.0 13.0 15.0 26.0 NaN
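A runnable version of this answer, with keyword arguments for pivot and reindex (required on current pandas); the reindex step is what guarantees that stimulus 1 appears as a row and 5 as a column even though neither occurs in y/x respectively:

```python
import numpy as np
import pandas as pd

# The question's session1 data
session1 = pd.DataFrame({
    'stimA': [1, 4, 4, 2, 1, 1, 4, 5, 3, 2],
    'stimB': [3, 3, 5, 3, 2, 5, 1, 2, 5, 4],
    'subjectAnswer': [6, 21, 26, 10, 6, 6, 6, 13, 15, 26],
})

# np.sort puts the smaller stimulus in x and the larger in y
session1[['x', 'y']] = np.sort(session1[['stimA', 'stimB']], axis=1)
labels = np.union1d(session1['x'], session1['y'])

# Reindex so every stimulus appears in both the rows and the columns
matrix = (session1.pivot(index='y', columns='x', values='subjectAnswer')
                  .reindex(index=labels, columns=labels))
print(matrix)
```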
I've imported a .csv into pandas and want to extract specific values and put them into a new column while maintaining the existing shape.
So df[::3] extracts the data:
1 1
2 4
3 7
4
5
6
7
I want it to look like
1 1
2
3
4 4
5
6
7 7
Here is a solution:
df = pd.read_csv(r"C:/users/k_sego/colsplit.csv",sep=";")
df1 = df[['col1']]
df2 = df[['col2']]
DF = pd.merge(df1,df2, how='outer',left_on=['col1'],right_on=['col2'])
and the result is
col1 col2
0 1.0 1.0
1 2.0 NaN
2 3.0 NaN
3 4.0 4.0
4 5.0 NaN
5 6.0 NaN
6 7.0 7.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
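A simpler alternative worth noting (a sketch, assuming a single source column named col1 as in the answer above): slicing with a step of 3 keeps the original index labels, and pandas aligns on the index during assignment, so the skipped rows become NaN automatically and the shape is preserved:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7]})

# df['col1'][::3] keeps index labels 0, 3, 6; assignment aligns on
# the index, so every other row of the new column becomes NaN
df['col2'] = df['col1'][::3]
print(df)
```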
This is my dataframe:
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
id value
0 1 5
1 1 6
2 1 NaN
3 2 NaN
4 2 8
5 2 4
6 2 NaN
7 2 10
8 3 NaN
This is my expected output:
id value
0 1 5
1 1 6
2 1 7
3 2 NaN
4 2 8
5 2 4
6 2 2
7 2 10
8 3 NaN
This is my current output using this code:
df.value.interpolate(method="krogh")
0 5.000000
1 6.000000
2 9.071429
3 10.171429
4 8.000000
5 4.000000
6 2.357143
7 10.000000
8 36.600000
Basically, I want to do two important things here:
Group by id, then interpolate using only the rows above each value, not the rows below it.
This should do the trick:
df["value_interp"] = df.value.combine_first(
    df.groupby("id")["value"].apply(
        lambda y: y.expanding().apply(
            lambda x: x.interpolate(method="krogh").to_numpy()[-1], raw=False
        )
    )
)
Outputs:
id value value_interp
0 1.0 5.0 5.0
1 1.0 6.0 6.0
2 1.0 NaN 7.0
3 2.0 NaN NaN
4 2.0 8.0 8.0
5 2.0 4.0 4.0
6 2.0 NaN 0.0
7 2.0 10.0 10.0
8 3.0 NaN NaN
(It interpolates based only on the previous values within the group - hence index 6 will return 0 not 2)
You can group by id and then loop over the groups to interpolate each one separately. For id = 2 the interpolation will not give you the value 2:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
data = []
for name, group in df.groupby('id'):
    group_interpolation = group.interpolate(method='krogh', limit_direction='forward', axis=0)
    data.append(group_interpolation)
df = (pd.concat(data)).round(1)
Output:
id value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 4.7
7 2.0 10.0
8 3.0 NaN
The current pandas.Series.interpolate does not support what you want, so to achieve your goal you need two groupby's that account for your desire to use only previous rows. The idea is as follows: combine each missing value (!!!) into one group with the previous rows (this might have limitations if you have several missing values in a row, but it serves well for your toy example).
Suppose we have a df:
print(df)
ID Value
0 1 5.0
1 1 6.0
2 1 NaN
3 2 NaN
4 2 8.0
5 2 4.0
6 2 NaN
7 2 10.0
8 3 NaN
Then we will combine any missing values within a group with previous rows:
df["extrapolate"] = df.groupby("ID")["Value"].apply(lambda grp: grp.isnull().cumsum().shift().bfill())
print(df)
ID Value extrapolate
0 1 5.0 0.0
1 1 6.0 0.0
2 1 NaN 0.0
3 2 NaN 1.0
4 2 8.0 1.0
5 2 4.0 1.0
6 2 NaN 1.0
7 2 10.0 2.0
8 3 NaN NaN
You can see that, when grouped by ["ID","extrapolate"], each missing value falls into the same group as the non-null values of the previous rows.
Now we are ready to do the extrapolation (with a spline of order=1):
df.groupby(["ID","extrapolate"], as_index=False).apply(lambda grp:grp.interpolate(method="spline",order=1)).drop("extrapolate", axis=1)
ID Value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 0.0
7 2.0 10.0
8 NaN NaN
Hope this helps.
How can I separate each dataframe with an empty row?
I've combined them using this snippet:
frames1 = [df4, df5, df6]
Summary = pd.concat(frames1)
So how can I split them with an empty row?
You can use the example below, which works:
Create test dfs
df1 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
df3 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
dfs=[df1,df2,df3]
Solution:
pd.concat([df.append(pd.Series(), ignore_index=True) for df in dfs])
A B C D
0 17.0 16.0 15.0 7.0
1 13.0 6.0 12.0 18.0
2 0.0 2.0 10.0 17.0
3 8.0 13.0 10.0 17.0
4 4.0 18.0 8.0 19.0
5 NaN NaN NaN NaN
0 14.0 0.0 13.0 12.0
1 10.0 3.0 6.0 3.0
2 15.0 10.0 15.0 3.0
3 9.0 16.0 11.0 4.0
4 5.0 7.0 6.0 2.0
5 NaN NaN NaN NaN
0 10.0 18.0 13.0 12.0
1 1.0 6.0 10.0 0.0
2 2.0 19.0 4.0 18.0
3 4.0 3.0 9.0 16.0
4 16.0 6.0 5.0 6.0
5 NaN NaN NaN NaN
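Note that DataFrame.append was removed in pandas 2.0, so on recent versions the same separator row has to be built with pd.concat instead; a sketch with the same kind of test frames:

```python
import numpy as np
import pandas as pd

dfs = [pd.DataFrame(np.random.randint(0, 20, (5, 4)), columns=list('ABCD'))
       for _ in range(3)]

# A one-row, all-NaN frame with matching columns acts as the separator
blank = pd.DataFrame([[np.nan] * 4], columns=list('ABCD'))
stacked = pd.concat([pd.concat([df, blank], ignore_index=True) for df in dfs])
print(stacked)
```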
For horizontal stack:
pd.concat([df.assign(test=np.nan) for df in dfs],axis=1)
A B C D test A B C D test A B C D test
0 17 16 15 7 NaN 14 0 13 12 NaN 10 18 13 12 NaN
1 13 6 12 18 NaN 10 3 6 3 NaN 1 6 10 0 NaN
2 0 2 10 17 NaN 15 10 15 3 NaN 2 19 4 18 NaN
3 8 13 10 17 NaN 9 16 11 4 NaN 4 3 9 16 NaN
4 4 18 8 19 NaN 5 7 6 2 NaN 16 6 5 6 NaN
Is this what you want?:
fname = 'test2.csv'
frames1 = [df4, df5, df6]
with open(fname, mode='a+') as f:
    for df in frames1:
        df.to_csv(fname, mode='a', header=f.tell() == 0)
        f.write('\n')
test2.csv:
,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8
0,0,1,2
1,3,4,5
2,6,7,8
0,0,1,2
1,3,4,5
2,6,7,8
f.tell() == 0 checks whether the file handle is at the beginning of the file, i.e. at position 0; if it is, the header is written, otherwise it isn't.
NOTE: I have used the same values for all the dfs, which is why all the results look the same.
For columns:
fname = 'test3.csv'
frames1 = [df1, df2, df3]
Summary = pd.concat([df.assign(**{' ':' '}) for df in frames1], axis=1)
Summary.to_csv(fname)
test3.csv:
,a,b,c, ,a,b,c, ,a,b,c,
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,
But the columns will not be equally spaced. If you save with header=False:
test3.csv:
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,
How can I apply a function element-wise to a pandas DataFrame and pass a column-wise calculated value (e.g. quantile of column)? For example, what if I want to replace all elements in a DataFrame (with NaN) where the value is lower than the 80th percentile of the column?
def _deletevalues(x, quantile):
    if x < quantile:
        return np.nan
    else:
        return x
df.applymap(lambda x: _deletevalues(x, x.quantile(0.8)))
Using applymap only allows one to access each value individually and (of course) throws AttributeError: 'float' object has no attribute 'quantile'.
Thank you in advance.
Use DataFrame.mask:
df = df.mask(df < df.quantile())
print (df)
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0
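For the 80th percentile the question actually asks about, pass the quantile explicitly; mask broadcasts the per-column thresholds across the rows (a sketch with random example data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(10, 3)), columns=list('abc'))

# Replace every value below its column's 80th percentile with NaN
masked = df.mask(df < df.quantile(0.8))
print(masked)
```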
In [139]: df
Out[139]:
a b c
0 1 7 3
1 1 2 6
2 3 0 5
3 8 2 1
4 7 3 5
5 6 7 2
6 0 2 1
7 8 4 1
8 5 0 6
9 7 7 6
for all columns:
In [145]: df.apply(lambda x: np.where(x < x.quantile(),np.nan,x))
Out[145]:
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0
or
In [149]: df[df < df.quantile()] = np.nan
In [150]: df
Out[150]:
a b c
0 NaN 7.0 NaN
1 NaN NaN 6.0
2 NaN NaN 5.0
3 8.0 NaN NaN
4 7.0 3.0 5.0
5 6.0 7.0 NaN
6 NaN NaN NaN
7 8.0 4.0 NaN
8 NaN NaN 6.0
9 7.0 7.0 6.0