expanding mean on a groupby object - python

I have a dataframe that looks something like this (Col1–Col3 omitted for brevity):
Name  DownIndicator
A     0
A     1
B     0
C     0
C     1
C     1
I want to create a column that shows the expanding mean of the DownIndicator column, so the desired output would be:
Name  DownIndicator  MeanDown
A     0              0
A     1              0.5
B     0              0
C     0              0
C     1              0.5
C     1              0.66
Could you please help? I am looking at expanding_mean, but I am unable to apply it in practice.

Use Expanding.mean with groupby. Because this creates a MultiIndex, it is necessary to remove the first level with reset_index(level=0, drop=True):
df['new'] = (df.groupby('Name')['DownIndicator']
               .expanding()
               .mean()
               .reset_index(level=0, drop=True))
print (df)
Name DownIndicator new
0 A 0 0.000000
1 A 1 0.500000
2 B 0 0.000000
3 C 0 0.000000
4 C 1 0.500000
5 C 1 0.666667
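An equivalent approach that avoids the reset_index step is transform, which returns a result already aligned to the original index. A minimal sketch, assuming a reasonably recent pandas version:
import pandas as pd

df = pd.DataFrame({'Name': list('AABCCC'),
                   'DownIndicator': [0, 1, 0, 0, 1, 1]})

# transform applies the expanding mean per group and aligns the
# result with the original index, so no MultiIndex cleanup is needed
df['new'] = (df.groupby('Name')['DownIndicator']
               .transform(lambda s: s.expanding().mean()))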

pandas groupby two columns and generate columns from second

I have a pandas dataframe like:
index col1 col2 col3 col4 col5
0 a c 1 2 f
1 a c 1 2 f
2 a d 1 2 f
3 b d 1 2 g
4 b e 1 2 g
5 b e 1 2 g
If I group by two columns, like the following:
df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
I get:
col3 col4
col1 col2
a c 2 4
d 1 2
b d 1 2
e 2 4
Is it possible to convert this to:
col1  c_col3  d_col3  c_col4  d_col4  e_col3  e_col4
a     2       1       4       2       NaN     NaN
b     NaN     1       NaN     2       2       4
in an efficient manner where col1 is the index?
Add unstack, which creates a MultiIndex in the columns, so flattening is necessary:
df1 = df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'}).unstack()
#python 3.6+
df1.columns = [f'{j}_{i}' for i, j in df1.columns]
#python below 3.6
#df1.columns = ['{}_{}'.format(j, i) for i, j in df1.columns]
print (df1)
c_col3 d_col3 e_col3 c_col4 d_col4 e_col4
col1
a 2.0 1.0 NaN 4.0 2.0 NaN
b NaN 1.0 2.0 NaN 2.0 4.0
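The aggregation and reshape can also be done in one call with pivot_table; a sketch of the same result, assuming sum aggregation as above:
import pandas as pd

df = pd.DataFrame({'col1': list('aaabbb'),
                   'col2': list('ccddee'),
                   'col3': [1, 1, 1, 1, 1, 1],
                   'col4': [2, 2, 2, 2, 2, 2]})

# pivot_table groups, aggregates and unstacks in one step; the
# MultiIndex columns are then flattened the same way as above
df1 = df.pivot_table(index='col1', columns='col2',
                     values=['col3', 'col4'], aggfunc='sum')
df1.columns = [f'{j}_{i}' for i, j in df1.columns]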

Re-occurrence count

I have a dataframe like the follow:
Col1
0 C
1 A
3 D
4 A
5 A
I would like to count, for each row, after how many steps the same value re-occurs, so I would get the following:
Col1 Col2
0 C NaN
1 A 2
3 D NaN
4 A 1
5 A NaN
Any ideas on how to do it? Thanks for the help!
Use GroupBy.cumcount and then replace 0 with NaN:
import numpy as np

df['Col2'] = df.groupby('Col1').cumcount(ascending=False).replace(0, np.nan)
print (df)
Col1 Col2
0 C NaN
1 A 2.0
3 D NaN
4 A 1.0
5 A NaN
Alternative solution with mask:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).mask(lambda x: x == 0)
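Note that cumcount(ascending=False) counts how many occurrences remain in the group, which happens to coincide with the desired output here. If the intent is instead the positional distance to the next occurrence of the same value, a sketch using groupby + shift on row positions (this is an assumption about the intent, not what the answer above computes):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': list('CADAA')}, index=[0, 1, 3, 4, 5])

# row positions 0, 1, 2, ..., independent of the actual index labels
pos = pd.Series(np.arange(len(df)), index=df.index)

# distance from each row to the next row with the same Col1 value;
# NaN where the value never re-occurs
df['Col2'] = pos.groupby(df['Col1']).shift(-1) - pos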

Python Pandas - Deal with duplicates

I want to deal with duplicates in a pandas df:
df=pd.DataFrame({'A':[1,1,1,2,1],'B':[2,2,1,2,1],'C':[2,2,1,1,1],'D':['a','c','a','c','c']})
df
I want to keep only rows with unique values of A, B, C and create binary columns D_a and D_c, so the result will be something like this, without doing super slow loops on each row:
result= pd.DataFrame({'A':[1,1,2],'B':[2,1,2],'C':[2,1,1],'D_a':[1,1,0],'D_c':[1,1,1]})
Thanks a lot
You can use:
df1 = (df.groupby(['A','B','C'])['D']
         .value_counts()
         .unstack(fill_value=0)
         .add_prefix('D_')
         .clip(upper=1)
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
A B C D_a D_c
0 1 1 1 1 1
1 1 2 2 1 1
2 2 2 1 0 1
Using get_dummies + sum -
df = (df.set_index(['A', 'B', 'C'])
        .D.str.get_dummies()
        .groupby(level=[0, 1, 2]).sum()
        .add_prefix('D_')
        .reset_index())
df
A B C D_a D_c
0 1 1 1 1 1
1 1 2 2 1 1
2 2 2 1 0 1
You can do something like this
df.loc[df['D']=='a', 'D_a'] = 1
df.loc[df['D']=='c', 'D_c'] = 1
This will put a 1 in a new column wherever an "a" or "c" appears.
A B C D D_a D_c
0 1 2 2 a 1.0 NaN
1 1 2 2 c NaN 1.0
2 1 1 1 a 1.0 NaN
3 2 2 1 c NaN 1.0
4 1 1 1 c NaN 1.0
But then you have to replace the NaNs with 0:
df = df.fillna(0)
Next you only have to select the columns you need and then drop the duplicates.
df = df[["A","B","C", "D_a", "D_c"]].drop_duplicates()
Hope this is the solution you were looking for.
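A related one-liner combines the one-hot encoding and the de-duplication in a single groupby; a sketch, assuming the same df as above:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 1], 'B': [2, 2, 1, 2, 1],
                   'C': [2, 2, 1, 1, 1], 'D': ['a', 'c', 'a', 'c', 'c']})

# one-hot encode D, then take the max per (A, B, C) group, so each
# indicator is 1 if that letter occurred at least once in the group
result = (pd.get_dummies(df, columns=['D'], dtype=int)
            .groupby(['A', 'B', 'C'], as_index=False)
            .max())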

Python/Pandas - Combining groupby mean and min

What's the syntax for combining mean and min on a dataframe? I want to group by 2 columns, calculate the mean within a group for col3 and keep the min value of col4. Would something like
groupeddf = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean().min('col4')
work? If not, what's the correct syntax? Thank you!
EDIT
Okay, so the question wasn't quite clear without an example. I'll update it now. Also changes in text above.
I have:
ungrouped
col1 col2 col3 col4
1 2 3 4
1 2 4 1
2 4 2 1
2 4 1 3
2 3 1 3
The wanted output is grouped by columns 1–2, with the mean for col3 (the real data has more columns; this is simplified) and the minimum of col4:
grouped
col1 col2 col3 col4
1 2 3.5 1
2 4 1.5 1
2 3 1 3
I think you need the mean first and then the min of column col4:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean()['col4'].min()
or min of Series:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
Sample:
nongrouped = pd.DataFrame({'col1':[1,1,3],
                           'col2':[1,1,6],
                           'col3':[1,1,9],
                           'col4':[1,3,5]})
print (nongrouped)
col1 col2 col3 col4
0 1 1 1 1
1 1 1 1 3
2 3 6 9 5
print (nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean())
col1  col2  col3
1     1     1       2
3     6     9       5
Name: col4, dtype: int64
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
print (min_val)
2
EDIT:
You need aggregate:
groupeddf = (nongrouped.groupby(['col1', 'col2'], sort=False)
                       .agg({'col3':'mean','col4':'min'})
                       .reset_index()
                       .reindex(columns=nongrouped.columns))
print (groupeddf)
col1 col2 col3 col4
0 1 2 3.5 1
1 2 4 1.5 1
2 2 3 1.0 3
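On pandas 0.25 or later, the same result can be written with named aggregation, which avoids the reindex step; a sketch under that version assumption:
groupeddf = (nongrouped.groupby(['col1', 'col2'], as_index=False, sort=False)
                       .agg(col3=('col3', 'mean'), col4=('col4', 'min')))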

Mapping in Pandas with multiple keys

I have two pandas DataFrames:
df:
   col1 col2  val
0     1    a  1.2
1     2    b  3.2
2     2    a  4.2
3     1    b -1.2

df2:
   col1 col2     s
0     1    a  0.90
1     3    b  0.70
2     1    b  0.02
3     2    a  0.10
I want to multiply df['val'] by df2['s'] whenever the combination of ['col1', 'col2'] matches. If a row does not match, I don't need to do the multiplication.
I create a mapper
mapper = df2.set_index(['col1','col2'])['s']
where I get
mapper
col1 col2
1 a 0.90
3 b 0.70
1 b 0.02
2 a 0.10
Name: s, dtype: float64
and I want to match it with df[['col1','col2']]
df[['col1','col2']]
col1 col2
0 1 a
1 2 b
2 2 a
3 1 b
but when I do the mapping
mapped_value = df[['col1','col2']].map(mapper)
I get the following error
AttributeError: 'DataFrame' object has no attribute 'map'
Any hint?
I think you need mul:
df = df2.set_index(['col1','col2'])['s'].mul(df.set_index(['col1','col2'])['val'])
print (df)
col1 col2
1 a 1.080
b -0.024
2 a 0.420
b NaN
3 b NaN
dtype: float64
If you need to replace the NaNs, pass fill_value=1:
df = (df2.set_index(['col1','col2'])['s']
         .mul(df.set_index(['col1','col2'])['val'], fill_value=1))
print (df)
col1 col2
1 a 1.080
b -0.024
2 a 0.420
b 3.200
3 b 0.700
dtype: float64
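If you would rather keep df's original shape and only scale the matching rows (closer to the map idea in the question), a sketch using MultiIndex.from_frame, available in pandas 0.24+:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 1], 'col2': list('abab'),
                   'val': [1.2, 3.2, 4.2, -1.2]})
df2 = pd.DataFrame({'col1': [1, 3, 1, 2], 'col2': list('abba'),
                    's': [0.90, 0.70, 0.02, 0.10]})

mapper = df2.set_index(['col1', 'col2'])['s']

# look up each (col1, col2) pair of df in the mapper; unmatched
# pairs get 1 so their val is left unchanged
keys = pd.MultiIndex.from_frame(df[['col1', 'col2']])
df['val'] = df['val'] * mapper.reindex(keys).fillna(1).to_numpy()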
