I've imported a .csv into pandas and want to extract specific values and put them into a new column whilst maintaining the existing shape.
df[::3] extracts the data, but the extracted values end up packed at the top of the new column:
1 1
2 4
3 7
4
5
6
7
I want it to look like
1 1
2
3
4 4
5
6
7 7
Here is a solution:
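import pandas as pd
# read the semicolon-separated file
df = pd.read_csv(r"C:/users/k_sego/colsplit.csv",sep=";")
# split the two columns, then outer-merge so each col2 value lands on the row with the matching col1 value
df1 = df[['col1']]
df2 = df[['col2']]
DF = pd.merge(df1,df2, how='outer',left_on=['col1'],right_on=['col2'])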
and the result is
col1 col2
0 1.0 1.0
1 2.0 NaN
2 3.0 NaN
3 4.0 4.0
4 5.0 NaN
5 6.0 NaN
6 7.0 7.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
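Another option, offered as a minimal sketch rather than a drop-in fix: mask the column by position with Series.where, which keeps the original shape and avoids the merge altogether. The frame below is made up to mirror the question (col1 holding 1 to 7), and col2 is just an assumed name for the new column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV in the question: one column holding 1..7.
df = pd.DataFrame({"col1": range(1, 8)})

# True on every third row (positions 0, 3, 6), i.e. the rows df[::3] selects.
keep = np.arange(len(df)) % 3 == 0

# where() keeps values where the mask is True and puts NaN everywhere else,
# so the new column has the same length as the original frame.
df["col2"] = df["col1"].where(keep)
print(df)
```

Because the mask is positional, this does not depend on the index labels, and no extra NaN rows are appended the way the outer merge does.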
I have some data I would like to organize for visualization and statistics, but I don't know how to proceed.
The data are in 3 columns (stimA, stimB and subjectAnswer) and 10 rows (the number of pairs). They come from a pairwise comparison test and are stored in a pandas DataFrame. Example:
stimA  stimB  subjectAnswer
1      2      36
3      1      55
5      3      98
...    ...    ...
My goal is to organize them as a matrix in which each row and column corresponds to one stimulus, with the subjectAnswer data grouped on the lower-left side of the matrix diagonal (in my example, the subjectAnswer 36 corresponding to stimA 1 and stimB 2 should go to the index [2][1]), like this:
stimA/stimB  1    2    3    4    5
1            ...
2            36
3            55
4            ...
5            ...  ...  98
I succeeded in pivoting the first table into the matrix, but I couldn't get the data arranged on the lower-left side of the diagonal. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
session1 = pd.read_csv(filepath, names=['stimA', 'stimB', 'subjectAnswer'])
pivoted = session1.pivot(index='stimA', columns='stimB', values='subjectAnswer')
Which gives :
session1 :
stimA stimB subjectAnswer
0 1 3 6
1 4 3 21
2 4 5 26
3 2 3 10
4 1 2 6
5 1 5 6
6 4 1 6
7 5 2 13
8 3 5 15
9 2 4 26
pivoted :
stimB 1 2 3 4 5
stimA
1 NaN 6.0 6.0 NaN 6.0
2 NaN NaN 10.0 26.0 NaN
3 NaN NaN NaN NaN 15.0
4 6.0 NaN 21.0 NaN 26.0
5 NaN 13.0 NaN NaN NaN
The expected output for pivoted :
stimB 1 2 3 4 5
stimA
1 NaN NaN NaN NaN NaN
2 6.0 NaN NaN NaN NaN
3 6.0 10.0 NaN NaN NaN
4 6.0 26.0 21.0 NaN NaN
5 6.0 13.0 15.0 26.0 NaN
Thanks a lot for your help !
If I understand you correctly, the stimuli A and B are interchangeable. So to get the matrix layout you want, you can swap A with B in those rows where A is smaller than B. In other words, you don't use the original A and B for the pivot table, but the maximum and minimum of A and B:
session1['stim_min'] = np.min(session1[['stimA', 'stimB']], axis=1)
session1['stim_max'] = np.max(session1[['stimA', 'stimB']], axis=1)
pivoted = session1.pivot(index='stim_max', columns='stim_min', values='subjectAnswer')
pivoted
stim_min 1 2 3 4
stim_max
2 6.0 NaN NaN NaN
3 6.0 10.0 NaN NaN
4 6.0 26.0 21.0 NaN
5 6.0 13.0 15.0 26.0
Sort the values of stimA and stimB within each row (along the columns axis) and assign them to two temporary columns, x and y, in the dataframe. Sorting is required so that x always holds the smaller stimulus and y the larger one, which leaves the upper-right side of the resulting matrix empty.
Pivot the dataframe with y as the index, x as the columns and subjectAnswer as the values, then reindex the reshaped frame so that every unique stimulus name is present in both the index and the columns of the matrix:
session1[['x', 'y']] = np.sort(session1[['stimA', 'stimB']], axis=1)
i = np.union1d(session1['x'], session1['y'])
session1.pivot(index='y', columns='x', values='subjectAnswer').reindex(index=i, columns=i)
x 1 2 3 4 5
y
1 NaN NaN NaN NaN NaN
2 6.0 NaN NaN NaN NaN
3 6.0 10.0 NaN NaN NaN
4 6.0 26.0 21.0 NaN NaN
5 6.0 13.0 15.0 26.0 NaN
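For anyone who wants to try both answers on the sample data from the question, here is a self-contained sketch. The helper name lower_triangle and the temporary column names lo and hi are my own, and the keyword arguments to pivot and reindex are used on the assumption that a recent pandas version is installed.

```python
import numpy as np
import pandas as pd

# The ten pairs shown in the question.
session1 = pd.DataFrame({
    "stimA": [1, 4, 4, 2, 1, 1, 4, 5, 3, 2],
    "stimB": [3, 3, 5, 3, 2, 5, 1, 2, 5, 4],
    "subjectAnswer": [6, 21, 26, 10, 6, 6, 6, 13, 15, 26],
})
pairs = session1[["stimA", "stimB"]]

# Variant 1: row-wise minimum and maximum of the two stimuli.
via_minmax = session1.assign(lo=pairs.min(axis=1), hi=pairs.max(axis=1))

# Variant 2: sort each row so the smaller stimulus comes first.
sorted_pairs = np.sort(pairs.to_numpy(), axis=1)
via_sort = session1.assign(lo=sorted_pairs[:, 0], hi=sorted_pairs[:, 1])

# Every stimulus that appears anywhere, used to square up the matrix.
stimuli = np.union1d(session1["stimA"], session1["stimB"])

def lower_triangle(frame):
    """Pivot so the larger stimulus is the row and the smaller one the column."""
    return (frame.pivot(index="hi", columns="lo", values="subjectAnswer")
                 .reindex(index=stimuli, columns=stimuli))

m1 = lower_triangle(via_minmax)
m2 = lower_triangle(via_sort)
print(m1)
```

Both variants should give the same 5 x 5 matrix with values only below the diagonal. Note that pivot raises an error if the same unordered pair occurs twice, so duplicated pairs would need to be aggregated first.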
I've reviewed several posts on here about better ways to loop through dataframes, but can't seem to figure out how to apply them to my specific situation.
I have a dataframe of about 2M rows and I need to calculate six statistics for each row, one per column. There are 3 columns so 18 total. However, the issue is that I need to update those stats using a sample of the dataframe so that the mean/median, etc is different per row.
Here's what I have so far:
r = 0
for i in imputed_df.iterrows():
    t = imputed_df.sample(n=10)
    for (columnName) in cols:
        imputed_df.loc[r,columnName + '_mean'] = t[columnName].mean()
        imputed_df.loc[r,columnName + '_var'] = t[columnName].var()
        imputed_df.loc[r,columnName + '_std'] = t[columnName].std()
        imputed_df.loc[r,columnName + '_skew'] = t[columnName].skew()
        imputed_df.loc[r,columnName + '_kurt'] = t[columnName].kurt()
        imputed_df.loc[r,columnName + '_med'] = t[columnName].median()
    r += 1  # move on to the next row
But this has been running for two days without finishing. I tried to take a subset of 2000 rows from the original dataframe and even that one has been running for hours.
Is there a better way to do this?
EDIT: Added a sample dataset of what it should look like. Each suffixed column should hold the value calculated from the subset of 10 rows.
timestamp activityID w2 w3 w4
0 41.21 1.0 -1.34587 9.57245 2.83571
1 41.22 1.0 -1.76211 10.63590 2.59496
2 41.23 1.0 -2.45116 11.09340 2.23671
3 41.24 1.0 -2.42381 11.88590 1.77260
4 41.25 1.0 -2.31581 12.45170 1.50289
The problem is that you run the operation for each column in unnecessary loops.
We could use DataFrame.agg with DataFrame.unstack and Series.set_axis to get the correct column names.
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (10, 100))).add_prefix('col')
new_serie = df.agg(['sum', 'mean',
                    'var', 'std',
                    'skew', 'kurt', 'median']).unstack()
new_df = pd.concat([df, new_serie.set_axis([f'{x}_{y}'
                                            for x, y in new_serie.index])
                              .to_frame().T], axis=1)
# if new_df already exists:
#new_df.loc[0, :] = new_serie.set_axis([f'{x}_{y}' for x, y in new_serie.index])
col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 ... \
0 8 7 6 7 6 5 8 7 8 4 ...
1 8 1 8 7 0 8 8 4 6 1 ...
2 5 6 3 5 4 9 3 0 2 5 ...
3 3 3 3 3 5 4 5 1 3 5 ...
4 7 9 4 5 6 7 0 3 4 6 ...
5 0 5 2 0 8 0 3 7 6 5 ...
6 7 0 1 4 8 9 4 9 2 9 ...
7 0 6 1 0 6 1 3 0 3 4 ...
8 3 6 1 8 3 0 7 6 8 6 ...
9 2 5 8 5 8 4 9 1 9 9 ...
col98_skew col98_kurt col98_median col99_sum col99_mean col99_var \
0 0.456435 -0.939607 3.0 39.0 3.9 6.322222
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
col99_std col99_skew col99_kurt col99_median
0 2.514403 0.402601 1.099343 4.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
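If every row really needs statistics from its own random subset of 10 rows, as the question describes, one possible approach is to draw all the row positions at once with NumPy and compute the statistics along axis 1. This is only a sketch under some assumptions: the demo frame is synthetic, the names idx, n_sample and stats are mine, and the positions are drawn with replacement, which differs slightly from sample(n=10) but should rarely matter when the frame has millions of rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for imputed_df; the real frame has about 2M rows.
imputed_df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["w2", "w3", "w4"])
cols = ["w2", "w3", "w4"]
n_sample = 10

# One row of random positions per original row (drawn with replacement,
# an assumption that differs from DataFrame.sample's default).
idx = rng.integers(0, len(imputed_df), size=(len(imputed_df), n_sample))

stats = {}
for col in cols:
    # Shape (n_rows, n_sample): the sampled values for each original row.
    draws = pd.DataFrame(imputed_df[col].to_numpy()[idx], index=imputed_df.index)
    stats[f"{col}_mean"] = draws.mean(axis=1)
    stats[f"{col}_var"] = draws.var(axis=1)
    stats[f"{col}_std"] = draws.std(axis=1)
    stats[f"{col}_skew"] = draws.skew(axis=1)
    stats[f"{col}_kurt"] = draws.kurt(axis=1)
    stats[f"{col}_med"] = draws.median(axis=1)

# 3 original columns plus 18 statistic columns.
result = imputed_df.join(pd.DataFrame(stats))
```

The same sampled positions are reused for all three columns, which matches the loop in the question where one sample t feeds every column. The intermediate draws frame holds n_rows x 10 floats, so memory use stays modest even at 2M rows.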
This is my dataframe:
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
id value
0 1 5
1 1 6
2 1 NaN
3 2 NaN
4 2 8
5 2 4
6 2 NaN
7 2 10
8 3 NaN
This is my expected output:
id value
0 1 5
1 1 6
2 1 7
3 2 NaN
4 2 8
5 2 4
6 2 2
7 2 10
8 3 NaN
This is my current output using this code:
df.value.interpolate(method="krogh")
0 5.000000
1 6.000000
2 9.071429
3 10.171429
4 8.000000
5 4.000000
6 2.357143
7 10.000000
8 36.600000
Basically, I want to do two important things here:
Group by id, then interpolate using only the values in the rows above, not the rows below.
This should do the trick:
df["value_interp"]=df.value.combine_first(df.groupby("id")["value"].apply(lambda y: y.expanding().apply(lambda x: x.interpolate(method="krogh").to_numpy()[-1], raw=False)))
Outputs:
id value value_interp
0 1.0 5.0 5.0
1 1.0 6.0 6.0
2 1.0 NaN 7.0
3 2.0 NaN NaN
4 2.0 8.0 8.0
5 2.0 4.0 4.0
6 2.0 NaN 0.0
7 2.0 10.0 10.0
8 3.0 NaN NaN
(It interpolates based only on the previous values within the group - hence index 6 will return 0 not 2)
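To make the "previous values only" rule concrete, here is a simplified sketch that does not use Krogh interpolation at all: it fills a NaN by extending the straight line through the two values directly above it in the same group. The function name extrapolate_from_previous and the column value_filled are made up for illustration. On this toy data the result agrees with the output above, but with more than two preceding points a Krogh polynomial would generally give different numbers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 5], [1, 6], [1, np.nan], [2, np.nan], [2, 8],
              [2, 4], [2, np.nan], [2, 10], [3, np.nan]]),
    columns=["id", "value"],
)

def extrapolate_from_previous(values):
    """Fill each NaN from the line through the two preceding values."""
    out = values.to_numpy(dtype=float).copy()
    for i in range(2, len(out)):
        # Only fill when both preceding entries are available; values filled
        # earlier in this loop count as available, so a run of NaNs continues
        # the same line.
        if np.isnan(out[i]) and not np.isnan(out[i - 1]) and not np.isnan(out[i - 2]):
            out[i] = 2 * out[i - 1] - out[i - 2]
    return pd.Series(out, index=values.index)

# transform applies the helper to each id group and keeps the original row order.
df["value_filled"] = df.groupby("id")["value"].transform(extrapolate_from_previous)
print(df)
```

Index 2 becomes 7 (continuing 5, 6) and index 6 becomes 0 (continuing 8, 4), while a NaN with fewer than two values above it in its group stays NaN.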
You can group by id and then loop over the groups to interpolate each one. Note that for id = 2 the interpolation will not give you the value 2:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
data = []
for name, group in df.groupby('id'):
    group_interpolation = group.interpolate(method='krogh', limit_direction='forward', axis=0)
    data.append(group_interpolation)
df = (pd.concat(data)).round(1)
Output:
id value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 4.7
7 2.0 10.0
8 3.0 NaN
The current pandas.Series.interpolate does not support what you want, so to achieve your goal you need two groupbys that account for your wish to use only previous rows. The idea is as follows: combine into one group only the missing value (!!!) and the previous rows (this may have limitations if you have several missing values in a row, but it serves well for your toy example).
Suppose we have a df:
print(df)
ID Value
0 1 5.0
1 1 6.0
2 1 NaN
3 2 NaN
4 2 8.0
5 2 4.0
6 2 NaN
7 2 10.0
8 3 NaN
Then we will combine any missing values within a group with previous rows:
df["extrapolate"] = df.groupby("ID")["Value"].apply(lambda grp: grp.isnull().cumsum().shift().bfill())
print(df)
ID Value extrapolate
0 1 5.0 0.0
1 1 6.0 0.0
2 1 NaN 0.0
3 2 NaN 1.0
4 2 8.0 1.0
5 2 4.0 1.0
6 2 NaN 1.0
7 2 10.0 2.0
8 3 NaN NaN
As you can see, when grouped by ["ID","extrapolate"] the missing value falls into the same group as the non-null values of the previous rows.
Now we are ready to do extrapolation (with spline of order=1):
df.groupby(["ID","extrapolate"], as_index=False).apply(lambda grp:grp.interpolate(method="spline",order=1)).drop("extrapolate", axis=1)
ID Value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 0.0
7 2.0 10.0
8 NaN NaN
Hope this helps.
I have a dictionary of the form:
data = {'A':[(1,2),(3,4),(5,6),(7,8),(8,9)],
'B':[(3,4),(4,5),(5,6),(6,7)],
'C':[(10,11),(12,13)]}
I create a DataFrame with:
df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in data.items()]))
which in turn becomes:
A B C
(1,2) (3,4) (10,11)
(3,4) (4,5) (12,13)
(5,6) (5,6) NaN
(7,8) (6,7) NaN
(8,9) NaN NaN
Is there a way to go from the dataframe above to the one below:
A B C
one two one two one two
1 2 3 4 10 11
3 4 4 5 12 13
5 6 5 6 NaN NaN
7 8 6 7 NaN NaN
8 9 NaN NaN NaN NaN
You can use a list comprehension with the DataFrame constructor, converting each column to a list of tuples via values + tolist, and then concat:
cols = ['A','B','C']
L = [pd.DataFrame(df[x].values.tolist(), columns=['one','two']) for x in cols]
df = pd.concat(L, axis=1, keys=cols)
print (df)
A B C
one two one two one two
0 1 2 3 4 5 6
1 7 8 9 10 11 12
2 13 14 15 16 17 18
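The output above comes from a frame without missing cells. When the frame is padded with NaN as in the question, a variant that drops the missing cells before expanding the tuples may be safer. This is a sketch under that assumption, with parts, present and wide as made-up names. It relies on concat aligning the pieces on the original index.

```python
import pandas as pd

data = {"A": [(1, 2), (3, 4), (5, 6), (7, 8), (8, 9)],
        "B": [(3, 4), (4, 5), (5, 6), (6, 7)],
        "C": [(10, 11), (12, 13)]}

# Same construction as in the question: shorter lists are padded with NaN.
df = pd.DataFrame({k: pd.Series(v) for k, v in data.items()})

parts = {}
for name in df.columns:
    # Keep only the rows that actually hold a tuple.
    present = df[name].dropna()
    # Expand each tuple into two columns, keeping the original row labels.
    parts[name] = pd.DataFrame(present.tolist(), index=present.index,
                               columns=["one", "two"])

# The dict keys become the top column level; missing rows come back as NaN.
wide = pd.concat(parts, axis=1)
print(wide)
```

Columns that end up with NaN are stored as floats, while a column without gaps (A here) keeps its integer dtype.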
EDIT:
A similar solution with a dict comprehension. The integer values are converted to floats wherever NaN is introduced, because NaN is itself a float.
data = {'A':[(1,2),(3,4),(5,6),(7,8),(8,9)],
'B':[(3,4),(4,5),(5,6),(6,7)],
'C':[(10,11),(12,13)]}
cols = ['A','B','C']
d = {k: pd.DataFrame(v, columns=['one','two']) for k,v in data.items()}
df = pd.concat(d, axis=1)
print (df)
A B C
one two one two one two
0 1 2 3.0 4.0 10.0 11.0
1 3 4 4.0 5.0 12.0 13.0
2 5 6 5.0 6.0 NaN NaN
3 7 8 6.0 7.0 NaN NaN
4 8 9 NaN NaN NaN NaN
EDIT:
To multiply by one column, it is possible to use slicers:
s = df[('A', 'one')]
print (s)
0 1
1 3
2 5
3 7
4 8
Name: (A, one), dtype: int64
df.loc(axis=1)[:, 'one'] = df.loc(axis=1)[:, 'one'].mul(s, axis=0)
print (df)
A B C
one two one two one two
0 1.0 2 3.0 4.0 10.0 11.0
1 9.0 4 12.0 5.0 36.0 13.0
2 25.0 6 25.0 6.0 NaN NaN
3 49.0 8 42.0 7.0 NaN NaN
4 64.0 9 NaN NaN NaN NaN
Another solution:
idx = pd.IndexSlice
df.loc[:, idx[:, 'one']] = df.loc[:, idx[:, 'one']].mul(s, axis=0)
print (df)
A B C
one two one two one two
0 1.0 2 3.0 4.0 10.0 11.0
1 9.0 4 12.0 5.0 36.0 13.0
2 25.0 6 25.0 6.0 NaN NaN
3 49.0 8 42.0 7.0 NaN NaN
4 64.0 9 NaN NaN NaN NaN
I have a data frame:
A B C
Timestamp
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN 5
4 NaN NaN 4
5 NaN 3 3
6 NaN 2 NaN
7 3 1 NaN
8 2 NaN NaN
9 1 NaN NaN
I would like to backfill it by incrementing the last available value in each column so it looks like this:
A B C
Timestamp
1 9 7 7
2 8 6 6
3 7 5 5
4 6 4 4
5 5 3 3
6 4 2 NaN
7 3 1 NaN
8 2 NaN NaN
9 1 NaN NaN
Let's try this:
df1 = df1[::-1].ffill()
(df1 + (df1 == df1.shift()).cumsum()).sort_index()
Output:
A B C
Timestamp
1 9.0 7.0 7.0
2 8.0 6.0 6.0
3 7.0 5.0 5.0
4 6.0 4.0 4.0
5 5.0 3.0 3.0
6 4.0 2.0 NaN
7 3.0 1.0 NaN
8 2.0 NaN NaN
9 1.0 NaN NaN
You can try this:
def bfill_increment(col):
    col_null = col.isnull()[::-1]
    groups = col_null.diff().fillna(0).cumsum()
    return col_null.groupby(groups).cumsum()[::-1] + col.bfill()
df.apply(bfill_increment)
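Here is a self-contained variation on the same idea, written as a sketch with the frame from the question rebuilt by hand. The function name backfill_incrementing and its step parameter are hypothetical. It counts how many missing rows separate each gap from the nearest valid value below it, so it does not rely on comparing neighbouring values.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {"A": [nan] * 6 + [3, 2, 1],
     "B": [nan] * 4 + [3, 2, 1, nan, nan],
     "C": [nan, nan, 5, 4, 3, nan, nan, nan, nan]},
    index=pd.Index(range(1, 10), name="Timestamp"),
)

def backfill_incrementing(col, step=1):
    """Back-fill NaNs, adding `step` for every row moved away from the last valid value."""
    rev = col[::-1]                # walk from the bottom of the column upwards
    filled = rev.ffill()           # carry the nearest valid value along
    missing = rev.isna()
    run_id = (~missing).cumsum()   # a new run starts at every valid value
    # Within each run, count how many missing rows have been seen so far.
    steps = missing.astype(int).groupby(run_id).cumsum()
    return (filled + step * steps)[::-1]

result = df.apply(backfill_incrementing)
print(result)
```

Trailing NaNs below the last valid value (as in B and C) stay NaN, because there is nothing beneath them to carry upwards.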