get means and SEM in one df with pandas groupby - python

I'd like to find an efficient way to use the df.groupby() function in pandas to return both the means and standard deviations of a data frame - preferably in one shot!
import pandas as pd
df = pd.DataFrame({'case':[1, 1, 2, 2, 3, 3],
                   'condition':[1, 2, 1, 2, 1, 2],
                   'var_a':[0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
                   'var_b':[0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})
With that data, I'd like an easier way (if there is one!) to perform the following:
grp_means = df.groupby('case', as_index=False).mean()
grp_sems = df.groupby('case', as_index=False).sem()
grp_means.rename(columns={'var_a':'var_a_mean', 'var_b':'var_b_mean'},
                 inplace=True)
grp_sems.rename(columns={'var_a':'var_a_SEM', 'var_b':'var_b_SEM'},
                inplace=True)
grouped = pd.concat([grp_means, grp_sems[['var_a_SEM', 'var_b_SEM']]], axis=1)
grouped
Out[1]:
   case  condition  var_a_mean  var_b_mean  var_a_SEM  var_b_SEM
0     1        1.5       0.900        0.18      0.020       0.03
1     2        1.5       0.845        0.13      0.055       0.03
2     3        1.5       0.895        0.20      0.045       0.03
I also recently learned of the .agg() function, and tried df.groupby('grouper column').agg('var':'mean', 'var':'sem'), but this just returns a SyntaxError.

I think you need DataFrameGroupBy.agg, then drop the ('condition', 'sem') column and use map to flatten the MultiIndex columns:
# aggregate mean and SEM for every column, then drop the unwanted ('condition', 'sem')
df = df.groupby('case').agg(['mean','sem']).drop(('condition','sem'), axis=1)
# flatten the MultiIndex columns into names like 'var_a_mean'
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
   case  condition_mean  var_a_mean  var_a_sem  var_b_mean  var_b_sem
0     1             1.5       0.900      0.020        0.18       0.03
1     2             1.5       0.845      0.055        0.13       0.03
2     3             1.5       0.895      0.045        0.20       0.03
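On pandas 0.25+, named aggregation is an alternative that produces the renamed columns directly, with no MultiIndex to flatten afterwards. A minimal sketch, assuming the same df as above:

grouped = df.groupby('case', as_index=False).agg(
    condition_mean=('condition', 'mean'),
    var_a_mean=('var_a', 'mean'),
    var_a_sem=('var_a', 'sem'),
    var_b_mean=('var_b', 'mean'),
    var_b_sem=('var_b', 'sem'),
)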

Related

Initial value of multiple variables dataframe for time dilation

Dataframe:
product1  product2  product3  product4  product5
straws    orange    melon     chair     bread
melon     milk      book      coffee    cake
bread     melon     coffe     chair     book

CountProduct1  CountProduct2  CountProduct3  CountProduct4  CountProduct5
1              1              1              1              1
2              1              1              1              1
2              3              2              2              2

RatioProduct1  RatioProduct2  RatioProduct3  RatioProduct4  RatioProduct5
0.28           0.54           0.33           0.35           0.11
0.67           0.25           0.13           0.11           0.59
2.5            1.69           1.9            2.5            1.52
I want to create five other columns that keep the initial ratio of each item along the dataframe.
Output:
InitialRatio1  InitialRatio2  InitialRatio3  InitialRatio4  InitialRatio5
0.28           0.54           0.33           0.35           0.11
0.33           0.25           0.13           0.31           0.59
0.11           0.33           0.31           0.35           0.13
Check the data again: do you have a typo between product3 = 'coffe' and product4 = 'coffee'? I fixed 'coffe' to 'coffee' below; as a result, the 0.31 values should not appear.
import pandas as pd
pd.set_option('display.max_rows', None)     # print all rows
pd.set_option('display.max_columns', None)  # print all columns
df = pd.DataFrame(
    {
        'product1':['straws', 'melon', 'bread'],
        'product2':['orange', 'milk', 'melon'],
        'product3':['melon', 'book', 'coffee'],
        'product4':['chair', 'coffee', 'chair'],
        'product5':['bread', 'cake', 'book'],
        'time':[1, 2, 3],
        'Count1':[1, 2, 2],
        'Count2':[1, 1, 3],
        'Count3':[1, 1, 2],
        'Count4':[1, 1, 2],
        'Count5':[1, 1, 2],
        'ratio1':[0.28, 0.67, 2.5],
        'ratio2':[0.54, 0.25, 1.69],
        'ratio3':[0.33, 0.13, 1.9],
        'ratio4':[0.35, 0.11, 2.5],
        'ratio5':[0.11, 0.59, 1.52],
    })
print(df)
# stack each block of five columns into one long column
product = df[['product1', 'product2', 'product3', 'product4', 'product5']].stack().reset_index()
count = df[['Count1', 'Count2', 'Count3', 'Count4', 'Count5']].stack().reset_index()
ratio = df[['ratio1', 'ratio2', 'ratio3', 'ratio4', 'ratio5']].stack().reset_index()
print(ratio)
arr = pd.unique(product[0])
# indexes into arr of the products that occur more than once
aaa = [i for i in range(len(arr)) if product[product[0] == arr[i]].count()[0] > 1]
for i in aaa:
    prod_ind = product[product[0] == arr[i]].index  # every occurrence of this product
    val_ratio = ratio.loc[prod_ind[0], 0]           # ratio at the first occurrence
    ratio.loc[prod_ind, 0] = val_ratio              # propagate it to all occurrences
print(ratio.pivot_table(index='level_0', columns='level_1', values=[0]))
Output:
level_1  ratio1  ratio2  ratio3  ratio4  ratio5
level_0
0          0.28    0.54    0.33    0.35    0.11
1          0.33    0.25    0.13    0.11    0.59
2          0.11    0.33    0.11    0.35    0.13
To work with the data, each block of columns first has to be turned into one long column using stack().reset_index(). arr is the list of unique products, and aaa collects the indexes in arr of the products that occur more than once.
prod_ind = product[product[0] == arr[i]].index
In the loop, this gets the row indexes of every occurrence of the product.
val_ratio = ratio.loc[prod_ind[0], 0]
This gets the ratio at the product's first occurrence.
ratio.loc[prod_ind, 0] = val_ratio
This sets that value for all occurrences of the product.
To access the values, explicit .loc indexing is used: the row indices go on the left of the comma and the column names on the right.
pivot_table then rebuilds the original wide table.
To insert the processed data into the original dataframe, simply use the following:
table = ratio.pivot_table(index='level_0', columns='level_1', values=[0])
df[['ratio1', 'ratio2', 'ratio3', 'ratio4', 'ratio5']] = table
print(df)
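For reference, the same first-occurrence propagation can be written without the explicit loop, using groupby().transform('first') on the stacked data. This is my own variation, not the original answer's approach, and it assumes the df defined above:

prod_long = df[['product1', 'product2', 'product3', 'product4', 'product5']].stack()
ratio_long = df[['ratio1', 'ratio2', 'ratio3', 'ratio4', 'ratio5']].stack()

# group the stacked ratios by the product at the same position and
# broadcast each group's first ratio to every occurrence
first_ratio = ratio_long.groupby(prod_long.to_numpy()).transform('first')

init = first_ratio.unstack()
init.columns = ['InitialRatio1', 'InitialRatio2', 'InitialRatio3', 'InitialRatio4', 'InitialRatio5']
print(init)

Both stacks walk the rows and columns in the same order, so the two long Series line up positionally.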
If you're after code to create the init_rateX columns, then the following will work:
import numpy as np

pd.DataFrame(
    np.divide(
        df[["ratio1", "ratio2", "ratio3", "ratio4", "ratio5"]].to_numpy(),
        df[["Count1", "Count2", "Count3", "Count4", "Count5"]].to_numpy(),
    ),
    columns=["init_rate1", "init_rate2", "init_rate3", "init_rate4", "init_rate5"],
)
which gives
   init_rate1  init_rate2  init_rate3  init_rate4  init_rate5
0       0.280    0.540000        0.33        0.35        0.11
1       0.335    0.250000        0.13        0.11        0.59
2       1.250    0.563333        0.95        1.25        0.76
However, it does not agree with your calcs for init_rate4 or init_rate5, so some clarification might be needed.

Multiindex data.frame from two data.frames join by column headers

There are dozens of similar-sounding questions here; I think I've searched them all and could not find a solution to my problem:
I have 2 df: df_c:
     CAN-01  CAN-02  CAN-03
CE
ce1    0.84    0.73    0.50
ce2    0.06    0.13    0.05
And df_z:
        CAN-01  CAN-02  CAN-03
marker
cell1     0.29     1.5       7
cell2     1.00     3.0       1
I want to join for each 'marker' + 'CE' combination over their column names
Example: cell1 + ce1:
[[0.29, 0.84],[1.5,0.73],[7,0.5], ...]
(Continuing for cell1 + ce2, cell2 + ce1, cell2 + ce2)
I have a working example using two loops and .loc twice, but it takes forever on the full data set.
I think the best thing to build is a MultiIndex DF with some merge/join/concat magic:
           CAN-01  CAN-02  CAN-03
  Source
0 CE         0.84    0.73    0.50
  Marker     0.29     1.5       7
1 CE          ...
  Marker      ...
Sample Code
dc = [['ce1', 0.84, 0.73, 0.5], ['ce2', 0.06, 0.13, 0.05]]
dat_c = pd.DataFrame(dc, columns=['CE', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_c.set_index('CE', inplace=True)
dz = [['cell1', 0.29, 1.5, 7], ['cell2', 1, 3, 1]]
dat_z = pd.DataFrame(dz, columns=['marker', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_z.set_index('marker', inplace=True)
Bad/Slow Solution
for ci, c_row in dat_c.iterrows():              # for each CE in the CE table
    for marker in dat_z.index:                  # outer loop over markers (implied in the original snippet)
        tmp = []
        for j, colz in enumerate(dat_z.columns[1:]):
            if colz not in dat_c:
                continue
            entry_c = c_row.loc[colz]
            if len(entry_c.shape) > 0:          # skip duplicated-label lookups
                continue
            tmp.append([dat_z.loc[marker, colz], entry_c])
IIUC:
use append()+groupby():
dat_c.index=[f"cell{x+1}" for x in range(len(dat_c))]
df=dat_c.append(dat_z).groupby(level=0).agg(list)
output of df:
             CAN-01       CAN-02       CAN-03
cell1  [0.84, 0.29]  [0.73, 1.5]   [0.5, 7.0]
cell2   [0.06, 1.0]  [0.13, 3.0]  [0.05, 1.0]
If you need a list:
dat_c.index=[f"cell{x+1}" for x in range(len(dat_c))]
lst=dat_c.append(dat_z).groupby(level=0).agg(list).to_numpy().tolist()
output of lst:
[[[0.84, 0.29], [0.73, 1.5], [0.5, 7.0]],
[[0.06, 1.0], [0.13, 3.0], [0.05, 1.0]]]
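Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the drop-in replacement. The same idea, sketched with concat:

lst = (pd.concat([dat_c, dat_z])
         .groupby(level=0)
         .agg(list)
         .to_numpy()
         .tolist())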

How to display `.value_counts()` in interval in pandas dataframe

I need to display .value_counts() in interval in pandas dataframe. Here's my code
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
   min_prob  max_prob  counts
0      0.26      0.48     NaN
1      0.49      0.52     NaN
2      0.53      0.54     NaN
3      0.55      0.56     NaN
4      0.57      0.58     NaN
I know that I have a problem in the kstable['counts'] syntax, but how do I solve this?
Use named aggregation to simplify your code; for counts, GroupBy.size is applied to the bucket column to create the new counts column:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good', 'min'),
                                                     max_prob=('prob good', 'max'),
                                                     counts=('bucket', 'size'))
Your solution produced NaN because value_counts returns a Series indexed by the intervals themselves, which cannot align with kstable's RangeIndex. It works with DataFrame.assign once the counts are sorted into bucket order and the interval index is dropped:
kstable = kstable.assign(counts = prob['bucket'].value_counts().sort_index().to_numpy())
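For illustration, a minimal runnable sketch; the data here is made up, since prob is not shown in the question:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prob = pd.DataFrame({'prob good': rng.uniform(0.2, 0.8, 1000)})

prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good', 'min'),
                                                     max_prob=('prob good', 'max'),
                                                     counts=('bucket', 'size'))
print(kstable.head())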

python / pandas - Find common columns between two dataframes, and create another one with same columns showing their difference

My version of pandas is:
pd.__version__
'0.25.3'
I have two dataframes; below is a sample, with the majority of the columns being the same across the two. I am trying to find the common columns and create a new dataframe with all the common columns, showing the difference in their values.
A sample from c_r dataframe:
Comp_name  EOL - CL Per $  Access - CL Per $  Total Impact - CL Per $
Nike                -0.02              -0.39                    -0.01
Nike                -0.02              -0.39                    -0.02
Adidas              -0.02              -0.39                    -0.01
Adidas              -0.02              -0.39                    -0.02
A sample from x dataframe:
Comp_name  EOL - CL Per $  Access - CL Per $  Total Impact - CL Per $
Nike                -0.02              -0.39                     0.05
Nike                -0.02              -0.39                     0.03
Adidas              -0.02              -0.39                     0.08
Adidas              -0.02              -0.39                     0.08
new_df: (to have the same column names, and show the difference, i.e:)
EOL - CL Per $ - Diff  Access - CL Per $ - Diff  Total Impact - CL Per $ - Diff
                -0.00                     -0.00                           -0.06
                -0.00                     -0.00                           -0.05
                -0.00                     -0.00                           -0.09
                -0.00                     -0.00                           -0.10
I have tried - please see where the error is in the code:
new_df = pd.DataFrame()
for i in c_r:
    for j in x:
        if c_r[i].dtype != object and x[j].dtype != object:
            if i == j:
                ## THE ISSUE IS IN THE LINE BELOW ##
                new_df[i + '-Diff'] = (c_r[i]) - (x[j])
            else:
                pass
but for some reason I get back only 1 row of values.
Any ideas why my code does not work? And how can I get the resulting dataframe to include the initial Comp_name column?
Thanks all.
Have you tried using intersection (or symmetric_difference, for the columns that differ)? i.e.
a = dataframe2.columns.intersection(dataframe1.columns)
print(a)
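Building on that, a minimal sketch of the whole task; the names common and num_cols are mine, and it assumes both frames share the same row index so the subtraction aligns row by row:

common = c_r.columns.intersection(x.columns)
num_cols = [c for c in common if c_r[c].dtype != object]

new_df = (c_r[num_cols] - x[num_cols]).add_suffix(' - Diff')  # label-aligned subtraction
new_df.insert(0, 'Comp_name', c_r['Comp_name'])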
I think I understood the problem now; I have a small code example below.
import pandas as pd
d = {'col1': [-0.02 , -0.02 ,-0.02 ,-0.02 ], 'col2': [-0.39, -0.39, -0.39, -0.39],'col3': [-0.01,-0.02,-0.01,-0.02]}
d1 = {'col1': [-0.02 , -0.02 ,-0.02 ,-0.02 ], 'col2': [-0.39, -0.39, -0.39, -0.39],'col3': [0.05,0.03,0.06,0.04]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d1)
df = df.apply(pd.to_numeric, errors='coerce')
df2 = df2.apply(pd.to_numeric, errors='coerce')
print(df)
print(df2)
col1 = df.col1 - df2.col1
col2 = df.col2 - df2.col2
col3 = df.col3 - df2.col3
dfnew = pd.concat([col1, col2,col3], axis=1)
print(type(col1))
print(dfnew)

Create new DF with values representing difference between two dataframes

I am working with two numeric data.frames, both with 13803 obs. and 13803 variables. Their column and row names are identical, but their entries are different. What I want to do is create a new data.frame where I have subtracted the df2 values from the df1 values.
The "formula" would be: df1(entry values) - df2(entry values) = df3 (the difference). In other words, the purpose is to find the difference between all entries.
My problem is illustrated here.
DF1
           [GENE128]  [GENE271]  [GENE2983]
[GENE231]       0.71       0.98        0.32
[GENE128]       0.23       0.61        0.90
[GENE271]       0.87       0.95        0.63

DF2
           [GENE128]  [GENE271]  [GENE2983]
[GENE231]       0.70       0.94        0.30
[GENE128]       0.25       0.51        0.80
[GENE271]       0.82       0.92        0.60

NEW DF3
           [GENE128]  [GENE271]  [GENE2983]
[GENE231]       0.01       0.04        0.02
[GENE128]      -0.02       0.10        0.10
[GENE271]       0.05       0.03        0.03
So, in DF3 the values are the difference between DF1 and DF2 for each entry.
DF1(GENE231) - DF2(GENE231) = DF3(DIFFERENCE-GENE231)
DF1(GENE271) - DF2(GENE271) = DF3(DIFFERENCE-GENE271)
and so on...
Help would be much appreciated!
Kind regards,
Harebell
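Since the row and column names are identical, label-aligned elementwise subtraction does this in one step. A minimal sketch in pandas, with hypothetical frames standing in for DF1 and DF2:

import pandas as pd

rows = ['GENE231', 'GENE128', 'GENE271']
cols = ['GENE128', 'GENE271', 'GENE2983']
df1 = pd.DataFrame([[0.71, 0.98, 0.32],
                    [0.23, 0.61, 0.90],
                    [0.87, 0.95, 0.63]], index=rows, columns=cols)
df2 = pd.DataFrame([[0.70, 0.94, 0.30],
                    [0.25, 0.51, 0.80],
                    [0.82, 0.92, 0.60]], index=rows, columns=cols)

# subtraction aligns on both row and column labels
df3 = df1 - df2
print(df3)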
