Mapping in Pandas with multiple keys - python

I have two Pandas DataFrames:
df:
   col1 col2  val
0     1    a  1.2
1     2    b  3.2
2     2    a  4.2
3     1    b -1.2
df2:
   col1 col2     s
0     1    a  0.90
1     3    b  0.70
2     1    b  0.02
3     2    a  0.10
and I want to multiply df['val'] by df2['s'] whenever the combination of ['col1', 'col2'] matches. If a row does not match, I don't need to do the multiplication.
I create a mapper
mapper = df2.set_index(['col1','col2'])['s']
where I get
mapper
col1 col2
1 a 0.90
3 b 0.70
1 b 0.02
2 a 0.10
Name: s, dtype: float64
and I want to match it with df[['col1','col2']]
df[['col1','col2']]
col1 col2
0 1 a
1 2 b
2 2 a
3 1 b
but when I do the mapping
mapped_value = df[['col1','col2']].map(mapper)
I get the following error
AttributeError: 'DataFrame' object has no attribute 'map'
any hint?

I think you need mul:
df = df2.set_index(['col1','col2'])['s'].mul(df.set_index(['col1','col2'])['val'])
print (df)
col1 col2
1 a 1.080
b -0.024
2 a 0.420
b NaN
3 b NaN
dtype: float64
If you need to replace NaN, add fill_value=1:
df = (df2.set_index(['col1','col2'])['s']
         .mul(df.set_index(['col1','col2'])['val'], fill_value=1))
print (df)
col1 col2
1 a 1.080
b -0.024
2 a 0.420
b 3.200
3 b 0.700
dtype: float64
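As a side note (an addition, not part of the original answer), a left merge keeps df's original shape and gives the same "multiply only where the keys match" behaviour; rows of df2 with no counterpart in df are simply ignored:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 1],
                   'col2': list('abab'),
                   'val': [1.2, 3.2, 4.2, -1.2]})
df2 = pd.DataFrame({'col1': [1, 3, 1, 2],
                    'col2': list('abba'),
                    's': [0.90, 0.70, 0.02, 0.10]})

# The left merge brings in s only where (col1, col2) matches; missing
# combinations get NaN, and fillna(1) turns them into "multiply by 1".
s = df.merge(df2, on=['col1', 'col2'], how='left')['s'].fillna(1)
df['val'] = df['val'] * s.to_numpy()
print(df)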

Related

Pandas update columns with loc and apply

I'm trying to update some columns of a dataframe where some condition is met (only some rows meet the condition).
I'm using apply with loc, and my function returns a pandas Series.
The problem is that the columns are updated with NaN.
Simplifying my problem, we can consider the following dataframe df_test:
col1 col2 col3 col4
0 A 1 1 2
1 B 2 1 2
2 A 3 1 2
3 B 4 1 2
I now want to update col3 and col4 when col1=A. For that I'll use the apply method
df_test.loc[df_test['col1']=='A', ['col3', 'col4']] = df_test[df_test['col1']=='A'].apply(lambda row: pd.Series([10,20]), axis=1)
Doing that I get:
col1 col2 col3 col4
0 A 1 NaN NaN
1 B 2 1.0 2.0
2 A 3 NaN NaN
3 B 4 1.0 2.0
If instead of pd.Series([10, 20]) I use np.array([10, 20]) or [10, 20] I get the following error
ValueError: shape mismatch: value array of shape (2,2) could not be broadcast to indexing result of shape (2,)
What do I need to return to obtain
col1 col2 col3 col4
0 A 1 10 20
1 B 2 1 2
2 A 3 10 20
3 B 4 1 2
thanks!
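No answer appears in the thread for this one, but the NaN comes from label alignment: apply here returns a frame whose columns are 0 and 1, which do not match col3 and col4. A minimal sketch of one fix (my addition, not from the original page) is to index the returned Series by the target column names:
import pandas as pd

df_test = pd.DataFrame({'col1': list('ABAB'),
                        'col2': [1, 2, 3, 4],
                        'col3': [1, 1, 1, 1],
                        'col4': [2, 2, 2, 2]})

mask = df_test['col1'] == 'A'
# Indexing the Series by the target column names makes the right-hand
# side a frame with columns col3/col4, so the labels align on assignment.
df_test.loc[mask, ['col3', 'col4']] = df_test[mask].apply(
    lambda row: pd.Series({'col3': 10, 'col4': 20}), axis=1)
print(df_test)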

expanding mean on a groupby object

I have a dataframe that looks something like:
Name Col1 Col2 Col3 DownIndicator
A 0
A 1
B 0
C 0
C 1
C 1
I want to create column that shows expanding mean for the DownIndicator Column
So the desired output would be
Name Col1 Col2 Col3 DownIndicator MeanDown
A 0 0
A 1 0.5
B 0 0
C 0 0
C 1 0.5
C 1 0.66
Could you please help?
I am looking at expanding_mean, but I am unable to apply it in practice.
Use Expanding.mean with groupby; because this creates a MultiIndex, it is necessary to remove the first level with reset_index(level=0, drop=True):
df['new'] = (df.groupby('Name')['DownIndicator']
               .expanding()
               .mean()
               .reset_index(level=0, drop=True))
print (df)
Name DownIndicator new
0 A 0 0.000000
1 A 1 0.500000
2 B 0 0.000000
3 C 0 0.000000
4 C 1 0.500000
5 C 1 0.666667
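As a side note (an addition, not from the original answer), the expanding mean per group is just a running sum over a running count, so the MultiIndex round-trip can be avoided entirely:
g = df.groupby('Name')['DownIndicator']
# Expanding mean = cumulative sum / cumulative count within each group;
# cumcount() is zero-based, hence the + 1.
df['new'] = g.cumsum() / (g.cumcount() + 1)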

pandas groupby two columns and generate columns from second

I have a pandas dataframe like:
index col1 col2 col3 col4 col5
0 a c 1 2 f
1 a c 1 2 f
2 a d 1 2 f
3 b d 1 2 g
4 b e 1 2 g
5 b e 1 2 g
If I group by two columns, like the following:
df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
I get:
col3 col4
col1 col2
a c 2 4
d 1 2
b d 1 2
e 2 4
Is it possible to convert this to:
col1 c_col3 d_col3 c_col4 d_col4 e_col3 e_col4
a 2 1 4 2 NaN NaN
b NaN 1 NaN 2 2 4
in an efficient manner where col1 is the index?
Add unstack, which creates a MultiIndex in the columns, so flattening is necessary:
df1 = df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'}).unstack()
#python 3.6+
df1.columns = [f'{j}_{i}' for i, j in df1.columns]
#python below 3.6
#df1.columns = ['{}_{}'.format(j, i) for i, j in df1.columns]
print (df1)
c_col3 d_col3 e_col3 c_col4 d_col4 e_col4
col1
a 2.0 1.0 NaN 4.0 2.0 NaN
b NaN 1.0 2.0 NaN 2.0 4.0
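For reference (an alternative I'm adding, not from the original answer), pivot_table builds the same wide frame in one call, leaving only the column flattening:
df1 = df.pivot_table(index='col1', columns='col2',
                     values=['col3', 'col4'], aggfunc='sum')
# Same (measure, col2-value) MultiIndex in the columns as the unstack version.
df1.columns = [f'{j}_{i}' for i, j in df1.columns]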

Python/Pandas - Combining groupby mean and min

What's the syntax for combining mean and a min on a dataframe? I want to group by 2 columns, calculate the mean within a group for col3 and keep the min value of col4. Would something like
groupeddf = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean().min('col4')
work? If not, what's the correct syntax? Thank you!
EDIT
Okay, so the question wasn't quite clear without an example. I'll update it now. Also changes in text above.
I have:
ungrouped
col1 col2 col3 col4
1 2 3 4
1 2 4 1
2 4 2 1
2 4 1 3
2 3 1 3
Wanted output is grouped by columns 1-2, mean for column 3 (and actually some more columns on the data, this is simplified) and the minimum of col4:
grouped
col1 col2 col3 col4
1 2 3.5 1
2 4 1.5 1
2 3 1 3
I think you need the mean first and then the min of column col4:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'], as_index=False).mean()['col4'].min()
or min of Series:
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
Sample:
nongrouped = pd.DataFrame({'col1':[1,1,3],
                           'col2':[1,1,6],
                           'col3':[1,1,9],
                           'col4':[1,3,5]})
print (nongrouped)
col1 col2 col3 col4
0 1 1 1 1
1 1 1 1 3
2 3 6 9 5
print (nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean())
col1  col2  col3
1     1     1       2
3     6     9       5
Name: col4, dtype: int64
min_val = nongrouped.groupby(['col1', 'col2', 'col3'])['col4'].mean().min()
print (min_val)
2
EDIT:
You need aggregate:
groupeddf = (nongrouped.groupby(['col1', 'col2'], sort=False)
                       .agg({'col3':'mean','col4':'min'})
                       .reset_index()
                       .reindex(columns=nongrouped.columns))
print (groupeddf)
col1 col2 col3 col4
0 1 2 3.5 1
1 2 4 1.5 1
2 2 3 1.0 3
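On pandas 0.25 or newer (an addition, not from the original answer), named aggregation expresses the same thing more explicitly and avoids the reset_index/reindex round-trip:
groupeddf = (nongrouped.groupby(['col1', 'col2'], as_index=False, sort=False)
                       .agg(col3=('col3', 'mean'), col4=('col4', 'min')))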

calculate percentage values depending on size group in dataframe - pandas

I have a dataframe as below:
idx col1 col2 col3
0 1.1 A 100
1 1.1 A 100
2 1.1 A 100
3 2.6 B 100
4 2.5 B 100
5 3.4 B 100
6 2.6 B 100
I want to update col3 with percentage values depending on the group size of (col1, col2); e.g., every row with the combination (1.1, A) should get 33.33 in col3.
Desired output:
idx col1 col2 col3
0 1.1 A 33.33
1 1.1 A 33.33
2 1.1 A 33.33
3 2.6 B 50
4 2.5 B 100
5 3.4 B 100
6 2.6 B 50
I think you need groupby with transform('size'):
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print (df)
col1 col2 col3
idx
0 1.1 A 33.333333
1 1.1 A 33.333333
2 1.1 A 33.333333
3 2.6 B 50.000000
4 2.5 B 100.000000
5 3.4 B 100.000000
6 2.6 B 50.000000
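A self-contained version of the snippet above (my reconstruction of the sample data, not from the original answer):
import pandas as pd

df = pd.DataFrame({'col1': [1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6],
                   'col2': list('AAABBBB'),
                   'col3': [100] * 7})
df.index.name = 'idx'

# Each row's col3 becomes 100 divided by the size of its (col1, col2) group,
# i.e. the row's percentage share within its group.
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print (df)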
