Fill values of a column based on mean of another column - python

I have a pandas DataFrame. I'm trying to fill the NaNs of the Price column with the average price of the corresponding level in the Section column. What's an efficient and elegant way to do this? My data looks something like this:
Name Sex Section Price
Joe M 1 2
Bob M 1 nan
Nancy F 2 5
Grace F 1 6
Jen F 2 3
Paul M 2 nan

You could combine groupby, transform, and mean. Note that I've modified your example, because otherwise both Sections would have the same mean value. Starting from
In [21]: df
Out[21]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 NaN
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 NaN
we can use
df["Price"] = df["Price"].fillna(df.groupby("Section")["Price"].transform("mean"))
to produce
In [23]: df
Out[23]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
This works because we can compute the mean by Section:
In [29]: df.groupby("Section")["Price"].mean()
Out[29]:
Section
1 4.0
2 7.5
Name: Price, dtype: float64
and broadcast this back up to a full Series we can pass to fillna() using transform:
In [30]: df.groupby("Section")["Price"].transform("mean")
Out[30]:
0 4.0
1 4.0
2 7.5
3 4.0
4 7.5
5 7.5
Name: Price, dtype: float64
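
As an aside (a sketch, not part of the answer above), the same fill can be written as a single group-wise apply; group_keys=False keeps the result aligned with the original index, though transform("mean") is generally faster because apply runs a Python lambda once per group:

# equivalent alternative: fill within each Section group directly
df["Price"] = df.groupby("Section", group_keys=False)["Price"].apply(
    lambda s: s.fillna(s.mean())
)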

Pandas: surgical but slower
Refer to @DSM's answer above for a quicker pandas solution.
This is a more surgical approach that may provide some useful perspective.
Use groupby to calculate the mean for each Section:
means = df.groupby('Section').Price.mean()
Identify the nulls with isnull, to be used for boolean slicing:
nulls = df.Price.isnull()
Use map, slicing the Section column down to just those rows with a null Price:
fills = df.Section[nulls].map(means)
Use loc to fill in the spots of df only where the nulls are:
df.loc[nulls, 'Price'] = fills
All together:
means = df.groupby('Section').Price.mean()
nulls = df.Price.isnull()
fills = df.Section[nulls].map(means)
df.loc[nulls, 'Price'] = fills
print(df)
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5

By "corresponding level" I am assuming you mean rows with an equal Section value. If so, you can solve this with:

for section_value in sorted(set(df.Section)):
    mask = df['Section'] == section_value
    df.loc[mask, 'Price'] = df.loc[mask, 'Price'].fillna(df.loc[mask, 'Price'].mean())

Hope it helps! Peace.

Related

Apply a softmax function on groupby in the same pandas dataframe

I have been looking to apply the following softmax function from https://machinelearningmastery.com/softmax-activation-function-with-python/
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
[0.09003057 0.66524096 0.24472847]
I am trying to apply this to a dataframe which is split by groups, and return the probabilities row by row for each group.
My dataframe is:
import pandas as pd
#Create DF
d = {
    'EventNo': ['10', '10', '12', '12', '12'],
    'Name': ['Joe', 'Jack', 'John', 'James', 'Jim'],
    'Rating': [30, 32, 2.5, 3, 4],
}
df = pd.DataFrame(data=d)
df
EventNo Name Rating
0 10 Joe 30.0
1 10 Jack 32.0
2 12 John 2.5
3 12 James 3.0
4 12 Jim 4.0
In this instance there are two different events (10 and 12) where for event 10 the values are data = [30,32] and event 12 data = [2.5,3,4]
My expected result would be a new column probabilities with the results:
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.1192
1 10 Jack 32.0 0.8807
2 12 John 2.5 0.1402
3 12 James 3.0 0.2312
4 12 Jim 4.0 0.6285
Any help on how to do this on all groups in the dataframe would be much appreciated! Thanks!
You can use groupby followed by transform which returns results indexed by the original dataframe. A simple way to do it would be
df["Probabilities"] = df.groupby('EventNo')["Rating"].transform(softmax)
The result is
EventNo Name Rating Probabilities
0 10 Joe 30.0 0.119203
1 10 Jack 32.0 0.880797
2 12 John 2.5 0.140244
3 12 James 3.0 0.231224
4 12 Jim 4.0 0.628532
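
For completeness, a self-contained sketch of the above (column values taken from the question; requires scipy):

import pandas as pd
from scipy.special import softmax

df = pd.DataFrame({
    'EventNo': ['10', '10', '12', '12', '12'],
    'Name': ['Joe', 'Jack', 'John', 'James', 'Jim'],
    'Rating': [30, 32, 2.5, 3, 4],
})

# transform hands each group's Rating values to softmax and aligns
# the returned probabilities with the original index
df['Probabilities'] = df.groupby('EventNo')['Rating'].transform(softmax)
print(df)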

Calculate Mean on Multiple Groups

I have a Table
Sex Value1 Value2 City
M 2 1 Berlin
W 3 5 Paris
W 1 3 Paris
M 2 5 Berlin
M 4 2 Paris
I want to calculate the average of Value1 and Value2 for different groups. In my original dataset I have 10 group variables (with a max of 5 characteristics each, like 5 cities) that I have shortened to Sex and City (2 characteristics) in this example. The result should look like this:
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2.4 2.6 2 2 2.66
Value2 3.2 2.6 4 3 3.3
I am familiar with groupby and tried
df.groupby('City').mean()
But here we have the problem that Sex also gets into the calculation. Does anyone have an idea how to solve this? Thanks in advance!
You can group by each of the 2 columns separately to get 2 dataframes, and then use concat together with the overall means of the numeric columns (non-numeric columns are excluded):
df1 = df.groupby('City').mean().T
df2 = df.groupby('Sex').mean().T
df3 = pd.concat([df.mean().rename('Overall'), df2, df1], axis=1).add_prefix('Avg')
print (df3)
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2.4 2.666667 2.0 2.0 2.666667
Value2 3.2 2.666667 4.0 3.0 3.333333
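
One caveat (an addition, not part of the original answer): since pandas 2.0, mean() raises a TypeError when non-numeric columns such as Sex or City are still present, so the same approach needs numeric_only=True:

df1 = df.groupby('City').mean(numeric_only=True).T
df2 = df.groupby('Sex').mean(numeric_only=True).T
df3 = pd.concat(
    [df.mean(numeric_only=True).rename('Overall'), df2, df1], axis=1
).add_prefix('Avg')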

Derive multiple df from single df such that each df has no NaN values

I want to convert this table

movie name rating
0 thg John 3.0
1 thg James 4.0
2 mol NaN 5.0
3 mol NaN NaN
4 lob NaN NaN
into the following tables:
df1
movie name rating
0 thg John 3.0
1 thg James 4.0
df2
movie rating
2 mol 5.0
df3
movie
3 mol
4 lob
where each dataframe has no NaN values. Also, please tell me how the method changes if I need to separate on blank values instead of NaN.
I think that the start of a new target DataFrame should occur not only when the number of NaN values changes (compared to the previous row), but also when this number is the same but the NaN values are in different columns.
So I propose the following formula:
dfs = [g.dropna(how='all', axis=1) for _, g in
       df.groupby(df.isna().ne(df.isna().shift()).any(axis=1).cumsum())]
You can print partial DataFrames (any number of them) running:
n = 0
for grp in dfs:
    print(f'\ndf No {n}:\n{grp}')
    n += 1
The advantage of my solution over the other becomes obvious when you add another row to the source DataFrame:
5 NaN NaN 3.0
It also contains one non-null value (like the two previous rows). The other solution will treat all these rows as one partial DataFrame containing:
movie rating
3 mol NaN
4 lob NaN
5 NaN 3.0
As you can see, it still contains NaN values, whereas my solution divides these rows into 2 separate DataFrames, without any NaN.
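
A self-contained sketch of this approach (the DataFrame is rebuilt here from the expected output, so the column names are assumptions):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'movie': ['thg', 'thg', 'mol', 'mol', 'lob'],
    'name': ['John', 'James', np.nan, np.nan, np.nan],
    'rating': [3.0, 4.0, 5.0, np.nan, np.nan],
})

# a new group starts whenever a row's NaN pattern differs from the
# previous row's (different count or different columns)
key = df.isna().ne(df.isna().shift()).any(axis=1).cumsum()
dfs = [g.dropna(how='all', axis=1) for _, g in df.groupby(key)]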
Create a list of dfs with a groupby and dropna:
dfs = [g.dropna(how='all',axis=1) for _,g in df.groupby(df.isna().sum(1))]
print(dfs[0],'\n\n',dfs[1],'\n\n',dfs[2])
Or dict:
d = {f"df{e+1}": g[1].dropna(how='all', axis=1)
     for e, g in enumerate(df.groupby(df.isna().sum(1)))}
print(d['df1'],'\n\n',d['df2'],'\n\n',d['df3']) #read the keys of d
movie name rating
0 thg John 3.0
1 thg James 4.0
movie rating
2 mol 5.0
movie
3 mol
4 lob

Moving average with pandas using the 2 prior occurrences

I was able to find the proper formula for a Moving average here: Moving Average SO Question
The issue is that it uses the 1 prior occurrence plus the current row's input. I am trying to use the 2 occurrences prior to the row I am trying to predict.
import pandas as pd
import numpy as np
df = pd.DataFrame({'person': ['john', 'mike', 'john', 'mike', 'john', 'mike'],
                   'pts': [10, 9, 2, 2, 5, 5]})
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.rolling(2).mean())
OUTPUT:
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 6.0
3 mike 2 5.5
4 john 5 3.5
5 mike 5 3.5
From the output we see that John's second entry is using his first entry and the current row to compute the average. What I am looking for is John's and Mike's last occurrences to be John: 6 and Mike: 5.5, using the two prior occurrences rather than the previous one plus the current row's input. I am using this for a prediction and would not know the current row's pts because they haven't happened yet. New to Machine Learning and this was my first thought for a feature.
If you want the shift applied per group, add Series.shift to the lambda function:
df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.rolling(2).mean().shift())
print (df)
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 NaN
3 mike 2 NaN
4 john 5 6.0
5 mike 5 5.5
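
An equivalent way to write this (a sketch producing the same result) is to shift first and then roll, which makes it explicit that the window never includes the current row:

df['avg'] = df.groupby('person')['pts'].transform(lambda x: x.shift().rolling(2).mean())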
Try the following (it takes a rolling sum of 3 values including the current row, then subtracts the current pts and halves the remainder, leaving the mean of the two prior values):
df['avg'] = df.groupby('person').rolling(3)['pts'].sum().reset_index(level=0, drop=True)
df['avg'] = df['avg'].sub(df['pts']).div(2)
Outputs:
person pts avg
0 john 10 NaN
1 mike 9 NaN
2 john 2 NaN
3 mike 2 NaN
4 john 5 6.0
5 mike 5 5.5

Loop that counts unique values in a pandas df

I am trying to create a loop or a more efficient process that can count the number of current values in a pandas df. At the moment I'm selecting the value I want to perform the function on.
So for the df below, I'm trying to determine two counts.
1) ['u'] returns the count of the same values remaining in ['Code', 'Area'], i.e. how many more times the same values will occur.
2) ['On'] returns the number of values that are currently occurring in ['Area']. It achieves this by parsing through the df to see if those values occur again, so it essentially looks into the future.
import pandas as pd
d = {
    'Code': ['A', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'A'],
    'Area': ['Home', 'Work', 'Shops', 'Park', 'Cafe', 'Home', 'Cafe', 'Work', 'Home', 'Park'],
}
df = pd.DataFrame(data=d)
#Select value
df1 = df[df.Code == 'A'].copy()
df1['u'] = df1[::-1].groupby('Area').Area.cumcount()
ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)
    seen.add(val)
    dec = u == 0
df1['On'] = ids
df1 = df1.reindex(df.index).fillna(df1)
The problem is that I want to run this script on all values in Code instead of selecting one at a time. For instance, if I want to do the same thing on Code 'B', I would have to change the selection to df1 = df[df.Code == 'B'].copy() and then run the script again.
If I have numerous values in Code this becomes very inefficient. I need a loop that finds all unique values in 'Code'. Ideally, the script would look like:
df1 = df[df.Code == 'All unique values'].copy()
Intended Output:
Code Area u On
0 A Home 2.0 1.0
1 A Work 1.0 2.0
2 A Shops 0.0 3.0
3 A Park 1.0 3.0
4 B Cafe 1.0 1.0
5 A Home 1.0 3.0
6 B Cafe 0.0 1.0
7 A Work 0.0 3.0
8 A Home 0.0 2.0
9 A Park 0.0 1.0
I find your "On" logic very confusing. That said, I think I can reproduce it:
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)
df["On"] = (df["nunique"] -
            (df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0))
which gives me
In [212]: df
Out[212]:
Code Area u nunique On
0 A Home 2 1 1.0
1 A Work 1 2 2.0
2 A Shops 0 3 3.0
3 A Park 1 4 3.0
4 B Cafe 1 1 1.0
5 A Home 1 4 3.0
6 B Cafe 0 1 1.0
7 A Work 0 4 3.0
8 A Home 0 4 2.0
9 A Park 0 4 1.0
In this, u is the number of matching (Code, Area) pairs after that row. nunique is the number of unique Area values seen so far in that Code.
On is the number of unique Areas seen so far, except that once we "run out" of an Area (once it's not used any more) we start subtracting it from nunique.
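
Putting the above together as a runnable sketch, using the question's df:

import pandas as pd

df = pd.DataFrame({
    'Code': ['A', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'A'],
    'Area': ['Home', 'Work', 'Shops', 'Park', 'Cafe', 'Home', 'Cafe', 'Work', 'Home', 'Park'],
})

# u: remaining occurrences of this (Code, Area) pair after the current row
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)

# nunique: distinct Areas seen so far within each Code
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)

# On: unique Areas seen so far, minus Areas that have already run out
df["On"] = (df["nunique"] -
            (df["u"] == 0).groupby(df.Code).cumsum()
                          .groupby(df.Code).shift().fillna(0))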
Using GroupBy with size and cumcount, you can construct your u series.
Your logic for On isn't clear: this requires clarification.
g = df.groupby(['Code', 'Area'])
df['u'] = g['Code'].transform('size') - (g.cumcount() + 1)
print(df)
Code Area u
0 A Home 2
1 A Work 1
2 A Shops 0
3 A Park 1
4 B Cafe 1
5 A Home 1
6 B Cafe 0
7 A Work 0
8 A Home 0
9 A Park 0
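
As a side note (my observation, not part of either answer), the two u constructions above are equivalent: cumcount(ascending=False) directly counts how many occurrences of each (Code, Area) pair remain after the current row:

g = df.groupby(['Code', 'Area'])
# size - (0-based position + 1) == occurrences remaining after this row
assert (g['Code'].transform('size') - (g.cumcount() + 1)).equals(
    g.cumcount(ascending=False))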
