I have a dataframe like as shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cf = pd.DataFrame({'grade': rng.choice(list('ACD'),size=(8)),
'dash': rng.choice(list('PQRS'),size=(8)),
'dumeel': rng.choice(list('QWER'),size=(8)),
'dumma': rng.choice((1234),size=(8)),
'target': rng.choice([0,1],size=(8))
})
I would like to do the below
a) Find the total and %total for each of my value in the categorical columns against the target column
I tried the below but it only gets me to half way of the results.
cols = cf.select_dtypes('object')
cf.melt('target',cols).groupby(['variable','value']).size().reset_index(name='cnt of records')
How can I use the above result to compute target met and target not met details using the target column?
I expect my output to be like as shown below (note that I have shown only two columns grade and dash for sample). Code should follow the same logic for all string columns
Select your columns to flatten with melt then join the target column. Finally, group by variable and value columns and apply a dict of functions to each group.
funcs = {
'cnt of records': 'count',
'target met': lambda x: sum(x),
'target not met': lambda x: len(x) - sum(x),
'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%",
'target not met %': lambda x: f"{round(100 * (len(x) - sum(x)) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
.groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
Output:
>>> out
variable value cnt of records target met target not met target met % target not met %
0 dash Q 2 0 2 0.00% 100.00%
1 dash R 2 2 0 100.00% 0.00%
2 dash S 4 2 2 50.00% 50.00%
3 dumeel E 3 2 1 66.67% 33.33%
4 dumeel Q 3 2 1 66.67% 33.33%
5 dumeel R 1 0 1 0.00% 100.00%
6 dumeel W 1 0 1 0.00% 100.00%
7 grade A 2 0 2 0.00% 100.00%
8 grade C 3 2 1 66.67% 33.33%
9 grade D 3 2 1 66.67% 33.33%
You can use agg after you groupby for this:
cols = cf.select_dtypes('object')
df = (
cf.melt('target', cols)
.groupby(['variable','value'])
['target']
.agg([('l', 'size'), ('s', 'sum')]) # l = length (total count of rows in this group), s = sum (total count of rows in the group where target = 1)
.pipe(lambda x: (
x.assign(
met_pct=x.s / x.l * 100,
not_met_pct=100 - (x.s / x.l * 100),
met=x.s,
not_met=x.l - x.s
)
)).reset_index()
.drop(['l', 's'], axis=1)
)
Output:
>>> df
variable value met_pct not_met_pct met not_met
0 dash P 100.000000 0.000000 1 0
1 dash Q 0.000000 100.000000 0 3
2 dash R 50.000000 50.000000 1 1
3 dash S 50.000000 50.000000 1 1
4 dumeel E 0.000000 100.000000 0 1
5 dumeel Q 100.000000 0.000000 1 0
6 dumeel R 50.000000 50.000000 2 2
7 dumeel W 0.000000 100.000000 0 2
8 grade A 0.000000 100.000000 0 1
9 grade C 50.000000 50.000000 2 2
10 grade D 33.333333 66.666667 1 2
Related
For the DF below - in the Value Column, Product 3(i.e, 100) and Product 4 (i.e. 98) have amounts that are outliers. I want to
group by ['Class']
obtain the mean of the [Value] excluding the outlier amount
replace the outlier amount with the mean calculated in step 2.
Any suggestions of how to structure the code greatly appreciated. I have my code that works given the sample table, but I have a feeling that when I implement in the real solution it might not work.
Product,Class,Value
0 1 A 5
1 2 A 4
2 3 A 100
3 4 B 98
4 5 B 20
5 6 B 25
My code implementation:
# Establish the condition to remove the outlier rows from the DF
stds = 1.0
filtered_df = df[~df.groupby('Class')['Value'].transform(lambda x: abs((x-x.mean()) / x.std()) > stds)]
Output:
Product Class Value
0 1 A 5
1 2 A 4
4 5 B 20
5 6 B 25
# compute mean of each class without the outliers
class_means = filtered_df[['Class', 'Value']].groupby(['Class'])['Value'].mean()
Output:
Class
A 4.5
B 22.5
#extract rows in DF that are outliers and fail the test
outlier_df = df[df.groupby('Class')['Value'].transform(lambda x: abs((x-x.mean()) / x.std()) > stds)]
outlier_df
Output:
Product Class Value
2 3 A 100
3 4 B 98
#replace outlier values with computed means grouped by class
outlier_df['Value'] = np.where((outlier_df.Class == class_means.index), class_means,outlier_df.Value)
outlier_df
Output:
Product Class Value
2 3 A 4.5
3 4 B 22.5
#recombine cleaned dataframes
df_cleaned = pd.concat([filtered_df,outlier_df], axis=0 )
df_cleaned
Output:
Product Class Value
0 1 A 5.0
1 2 A 4.0
4 5 B 20.0
5 6 B 25.0
2 3 A 4.5
3 4 B 22.5
Proceed as follows:
Start from your code:
stds = 1.0
Save your lambda function under a variable:
isOutlier = lambda x: abs((x - x.mean()) / x.std()) > stds
Define the following function, to be applied to each group:
def newValue(grp):
val = grp.Value
outl = isOutlier(val)
return val.mask(outl, val[~outl].mean())
Generate new Value column:
df.Value = df.groupby('Class', group_keys=False).apply(newValue)
The result is:
Product Class Value
0 1 A 5.0
1 2 A 4.0
2 3 A 4.5
3 4 B 22.5
4 5 B 20.0
5 6 B 25.0
You even don't lose the original row order.
Edit
Or you can "incorporate" the content of your lambda function in newValue
(as you don't call it in any other place):
def newValue(grp):
val = grp.Value
outl = abs((val - val.mean()) / val.std()) > stds
return val.mask(outl, val[~outl].mean())
Let us say, I have the following data frame.
Frequency
20
14
10
8
6
2
1
I want to scale Frequency value from 0 to 1.
Is there a way to do this in Python? I have found something similar here But it doesn't serve my purpose.
I am sure there's a more standard way to do this in Python, but I use a self-defined function that you can select the range to be scaled on:
def my_scaler(min_scale_num,max_scale_num,var):
return (max_scale_num - min_scale_num) * ( (var - min(var)) / (max(var) - min(var)) ) + min_scale_num
# You can input your range
df['scaled'] = my_scaler(0,1,df['Frequency'].astype(float)) # scaled between 0,1
df['scaled2'] = my_scaler(-5,5,df['Frequency'].astype(float)) # scaled between -5,5
df
Frequency scaled scaled2
0 20 1.000000 5.000000
1 14 0.684211 1.842105
2 10 0.473684 -0.263158
3 8 0.368421 -1.315789
4 6 0.263158 -2.368421
5 2 0.052632 -4.473684
6 1 0.000000 -5.000000
Just change a, b = 10, 50 to a, b = 0, 1 in linked answer for upper and lower values for scale:
a, b = 0, 1
x, y = df.Frequency.min(), df.Frequency.max()
df['normal'] = (df.Frequency - x) / (y - x) * (b - a) + a
print (df)
Frequency normal
0 20 1.000000
1 14 0.684211
2 10 0.473684
3 8 0.368421
4 6 0.263158
5 2 0.052632
6 1 0.000000
You can use applymap to apply any function on each cell of the df.
For example:
df = pd.DataFrame([20, 14, 10, 8, 6, 2, 1], columns=["Frequency"])
min = df.min()
max = df.max()
df2 = df.applymap(lambda x: (x - min)/(max-min))
df
Frequency
0 20
1 14
2 10
3 8
4 6
5 2
6 1
df2
0 Frequency 1.0
dtype: float64
1 Frequency 0.684211
dtype: float64
2 Frequency 0.473684
dtype: float64
3 Frequency 0.368421
dtype: float64
4 Frequency 0.263158
dtype: float64
5 Frequency 0.052632
dtype: float64
6 Frequency 0.0
dtype: float64
i have a small sample data set:
import pandas as pd
d = {
'measure1_x': [10,12,20,30,21],
'measure2_x':[11,12,10,3,3],
'measure3_x':[10,0,12,1,1],
'measure1_y': [1,2,2,3,1],
'measure2_y':[1,1,1,3,3],
'measure3_y':[1,0,2,1,1]
}
df = pd.DataFrame(d)
df = df.reindex_axis([
'measure1_x','measure2_x', 'measure3_x','measure1_y','measure2_y','measure3_y'
], axis=1)
it looks like:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y
10 11 10 1 1 1
12 12 0 2 1 0
20 10 12 2 1 2
30 3 1 3 3 1
21 3 1 1 3 1
i created the column names almost the same except for '_x' and '_y' to help identify which pair should be multiplying: i want to multiply the pair with the same column name when '_x' and '_y' are disregarded, then i want sum the numbers to get a total number, keep in mind my actual data set is huge and the columns are not in this perfect order so this naming is a way for identifying correct pairs to multiply:
total = measure1_x * measure1_y + measure2_x * measure2_y + measure3_x * measure3_y
so desired output:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y total
10 11 10 1 1 1 31
12 12 0 2 1 0 36
20 10 12 2 1 2 74
30 3 1 3 3 1 100
21 3 1 1 3 1 31
my attempt and thought process, but cannot proceed anymore syntax wise:
#first identify the column names that has '_x' and '_y', then identify if
#the column names are the same after removing '_x' and '_y', if the pair has
#the same name then multiply them, do that for all pairs and sum the results
#up to get the total number
for colname in df.columns:
if "_x".lower() in colname.lower() or "_y".lower() in colname.lower():
if "_x".lower() in colname.lower():
colnamex = colname
if "_y".lower() in colname.lower():
colnamey = colname
#if colnamex[:-2] are the same for colnamex and colnamey then multiply and sum
filter + np.einsum
Thought I'd try something a little different this time—
get your _x and _y columns separately
do a product-sum. This is very easy to specify with einsum (and fast).
df = df.sort_index(axis=1) # optional, do this if your columns aren't sorted
i = df.filter(like='_x')
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j) # (i.values * j).sum(axis=1)
df
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
A slightly more robust version which filters out non-numeric columns and performs an assertion beforehand—
df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x')
j = df.filter(regex='.*_y')
assert i.shape == j.shape
df['Total'] = np.einsum('ij,ij->i', i, j)
If the assertion fails, the the assumptions of 1) your columns being numeric, and 2) the number of x and y columns being equal, as your question would suggest, do not hold for your actual dataset.
Use df.columns.str.split to generate a new MultiIndex
Use prod with axis and level arguments
Use sum with axis argument
Use assign to create new column
df.assign(
Total=df.set_axis(
df.columns.str.split('_', expand=True),
axis=1, inplace=False
).prod(axis=1, level=0).sum(1)
)
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
Restrict dataframe to just columns that look like 'meausre[i]_[j]'
df.assign(
Total=df.filter(regex='^measure\d+_\w+$').pipe(
lambda d: d.set_axis(
d.columns.str.split('_', expand=True),
axis=1, inplace=False
)
).prod(axis=1, level=0).sum(1)
)
Debugging
See if this gets you the correct Totals
d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)
d_.prod(axis=1, level=0).sum(1)
0 31
1 36
2 74
3 100
4 31
dtype: int64
I have a function defined like below:
def process_trans(chunk):
grouped_object=chunk.groupby('msno',sort=False) # not sorting results in a minor speedup
func = {
'late_count':['sum'],
'is_discount':['count'],
'is_not_discount':['count'],
'discount':['sum'], 'is_auto_renew':['mean'], 'is_cancel':['mean'], 'payment_type' : ['??'}
result=grouped_object.agg(func)
return result
As you can see, I know that I can insert sum, count, mean for each column. What type of keyword I can insert for determine the payment_type that appear most frequently. Note that each type is represented by integer.
I see people are introducing mode but the index 0 is needed to identify the most frequent item. Any better idea?
I believe you need value_counts and select first value of index, because function return sorted Series:
'payment_type' : lambda x: x.value_counts().index[0]
All together in sample:
chunk = pd.DataFrame({'msno':list('aaaddd'),
'late_count':[4,5,4,5,5,4],
'is_discount':[7,8,9,4,2,3],
'is_not_discount':[4,5,4,5,5,4],
'discount':[7,8,9,4,2,3],
'is_auto_renew':[1,3,5,7,1,0],
'is_cancel':[5,3,6,9,2,4],
'payment_type':[1,0,0,1,1,0]})
print (chunk)
discount is_auto_renew is_cancel is_discount is_not_discount \
0 7 1 5 7 4
1 8 3 3 8 5
2 9 5 6 9 4
3 4 7 9 4 5
4 2 1 2 2 5
5 3 0 4 3 4
late_count msno payment_type
0 4 a 1
1 5 a 0
2 4 a 0
3 5 d 1
4 5 d 1
5 4 d 0
grouped_object=chunk.groupby('msno',sort=False)
func = {
'late_count':['sum'],
'is_discount':['count'],
'is_not_discount':['count'],
'discount':['sum'],
'is_auto_renew':['mean'],
'is_cancel':['mean'],
'payment_type' : [lambda x: x.value_counts().index[0]]}
result=grouped_object.agg(func)
print (result)
is_not_discount is_discount is_cancel discount late_count is_auto_renew \
count count mean sum sum mean
msno
a 3 3 4.666667 24 13 3.000000
d 3 3 5.000000 9 14 2.666667
payment_type
<lambda>
msno
a 0
d 1
You can make use of series.mode i.e
func = {
'late_count':['sum'],
'is_discount':['count'],
'is_not_discount':['count'],
'discount':['sum'], 'is_auto_renew':['mean'], 'is_cancel':['mean'],
'payment_type': lambda x : x.mode()}
# Data from # jezrael.
grouped_object.agg(func).rename(columns={'<lambda>': 'mode'})
Output :
is_not_discount is_auto_renew late_count payment_type discount \
count mean sum mode sum
msno
a 3 3.000000 13 0 24
d 3 2.666667 14 1 9
is_discount is_cancel
count mean
msno
a 3 4.666667
d 3 5.000000
I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.
Source.0 Source.1 Source.2 Source.3
cluster
0 0.001182 0.184535 0.814230 0.000054
1 0.000001 0.160490 0.839508 0.000001
2 0.000001 0.173829 0.826114 0.000055
3 0.000432 0.180065 0.819502 0.000001
4 0.000152 0.157041 0.842694 0.000113
5 0.000183 0.174142 0.825674 0.000001
6 0.000001 0.151556 0.848405 0.000038
7 0.000771 0.177583 0.821645 0.000001
8 0.000001 0.202059 0.797939 0.000001
9 0.000025 0.189537 0.810410 0.000028
10 0.006142 0.003041 0.493912 0.496905
11 0.003739 0.002367 0.514216 0.479678
12 0.002334 0.001517 0.529041 0.467108
13 0.003458 0.000001 0.532265 0.464276
14 0.000405 0.005655 0.527576 0.466364
15 0.002557 0.003233 0.507954 0.486256
16 0.004161 0.000001 0.491271 0.504568
17 0.001364 0.001330 0.528311 0.468996
18 0.002886 0.000001 0.506392 0.490721
19 0.001823 0.002498 0.509620 0.486059
Source.0 Source.1 Source.2 Source.3
cluster
0 0.000001 0.197108 0.802495 0.000396
1 0.000001 0.157860 0.842076 0.000063
2 0.094956 0.203057 0.701662 0.000325
3 0.000001 0.181948 0.817841 0.000210
4 0.000003 0.169680 0.830316 0.000001
5 0.000362 0.177194 0.822443 0.000001
6 0.000001 0.146807 0.852924 0.000268
7 0.001087 0.178994 0.819564 0.000354
8 0.000001 0.202182 0.797333 0.000485
9 0.000348 0.181399 0.818252 0.000001
10 0.003050 0.000247 0.506777 0.489926
11 0.004420 0.000001 0.513927 0.481652
12 0.006488 0.001396 0.527197 0.464919
13 0.001510 0.000001 0.525987 0.472502
14 0.000001 0.000001 0.520737 0.479261
15 0.000001 0.001765 0.515658 0.482575
16 0.000001 0.000001 0.492550 0.507448
17 0.002855 0.000199 0.526535 0.470411
18 0.000001 0.001952 0.498303 0.499744
19 0.001232 0.000001 0.506612 0.492155
Then I want to get the mean of these two dataframes.
What is the easiest way to do this?
Just to clarify I want to get the mean for each particular cell when the indexes and columns of all the dataframes are exactly the same.
So in the example I gave, the average for [0,Source.0] would be (0.001182 + 0.000001) / 2 = 0.0005915.
Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:
import numpy as np
import pandas as pd
# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
# concatenate them
df_concat = pd.concat((df1, df2))
print df_concat.mean()
# x -0.163044
# y 2.120000
# dtype: float64
print df_concat.median()
# x -0.192037
# y 2.000000
# dtype: float64
Update
If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:
by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()
print df_means.head()
# x y
# 0 -0.850794 1.5
# 1 0.159038 1.5
# 2 0.083278 1.0
# 3 -0.540336 0.5
# 4 0.390954 3.5
This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.
I go similar as #ali_m, but since you want one mean per row-column combination, I conclude differently:
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df = pd.concat([df1, df2])
foo = df.groupby(level=1).mean()
foo.head()
x y
0 0.841282 2.5
1 0.716749 1.0
2 -0.551903 2.5
3 1.240736 1.5
4 1.227109 2.0
As per Niklas' comment, the solution to the question is panel.mean(axis=0).
As a more complete example:
import pandas as pd
import numpy as np
dfs = {}
nrows = 4
ncols = 3
for i in range(4):
dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
columns=list('abc'))
print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))
panel = pd.Panel(dfs)
print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))
Will give the following output:
DF0:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
DF1:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
DF2:
a b c
0 2 3 4
1 5 6 7
2 8 9 10
3 11 12 13
DF3:
a b c
0 3 4 5
1 6 7 8
2 9 10 11
3 12 13 14
Mean of stacked DFs:
a b c
0 1.5 2.5 3.5
1 4.5 5.5 6.5
2 7.5 8.5 9.5
3 10.5 11.5 12.5
Here is a solution first unstack both dataframes so they are series with multiindexes(cluster, colnames)... then you can use Series addition and division, which automattically do the operation on the indexes, finally unstack them... here it is in code...
averages = (df1.stack()+df2.stack())/2
averages = averages.unstack()
And your done...
Or for more general purposes...
dfs = [df1,df2]
averages = pd.concat([each.stack() for each in dfs],axis=1)\
.apply(lambda x:x.mean(),axis=1)\
.unstack()
You can simply assign a label to each frame, call it group and then concat and groupby to do what you want:
In [57]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [58]: df2 = df.copy()
In [59]: dfs = [df, df2]
In [60]: df
Out[60]:
a b c d
0 0.1959 0.1260 0.1464 0.1631
1 0.9344 -1.8154 1.4529 -0.6334
2 0.0390 0.4810 1.1779 -1.1799
3 0.3542 0.3819 -2.0895 0.8877
4 -2.2898 -1.0585 0.8083 -0.2126
5 0.3727 -0.6867 -1.3440 -1.4849
6 -1.1785 0.0885 1.0945 -1.6271
7 -1.7169 0.3760 -1.4078 0.8994
8 0.0508 0.4891 0.0274 -0.6369
9 -0.7019 1.0425 -0.5476 -0.5143
In [61]: for i, d in enumerate(dfs):
....: d['group'] = i
....:
In [62]: dfs[0]
Out[62]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
7 -1.7169 0.3760 -1.4078 0.8994 0
8 0.0508 0.4891 0.0274 -0.6369 0
9 -0.7019 1.0425 -0.5476 -0.5143 0
In [63]: final = pd.concat(dfs, ignore_index=True)
In [64]: final
Out[64]:
a b c d group
0 0.1959 0.1260 0.1464 0.1631 0
1 0.9344 -1.8154 1.4529 -0.6334 0
2 0.0390 0.4810 1.1779 -1.1799 0
3 0.3542 0.3819 -2.0895 0.8877 0
4 -2.2898 -1.0585 0.8083 -0.2126 0
5 0.3727 -0.6867 -1.3440 -1.4849 0
6 -1.1785 0.0885 1.0945 -1.6271 0
.. ... ... ... ... ...
13 0.3542 0.3819 -2.0895 0.8877 1
14 -2.2898 -1.0585 0.8083 -0.2126 1
15 0.3727 -0.6867 -1.3440 -1.4849 1
16 -1.1785 0.0885 1.0945 -1.6271 1
17 -1.7169 0.3760 -1.4078 0.8994 1
18 0.0508 0.4891 0.0274 -0.6369 1
19 -0.7019 1.0425 -0.5476 -0.5143 1
[20 rows x 5 columns]
In [65]: final.groupby('group').mean()
Out[65]:
a b c d
group
0 -0.394 -0.0576 -0.0682 -0.4339
1 -0.394 -0.0576 -0.0682 -0.4339
Here, each group is the same, but that's only because df == df2.
Alternatively, you can throw the frames into a Panel:
In [69]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [70]: df2 = DataFrame(np.random.randn(10, 4), columns=list('abcd'))
In [71]: panel = pd.Panel({0: df, 1: df2})
In [72]: panel
Out[72]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 9
Minor_axis axis: a to d
In [73]: panel.mean()
Out[73]:
0 1
a 0.3839 0.2956
b 0.1855 -0.3164
c -0.1167 -0.0627
d -0.2338 -0.0450
With Pandas version 1.3.4 this works for me:
import numpy as np
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100), z=np.random.randint(-3, 2, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 2, 100), z=np.random.randint(-1, 2, 100)))
pd.concat([df1, df2]).groupby(level=0).mean()