I've got the following dataframe
lst=[['01012021','',100],['01012021','','50'],['01022021',140,5],['01022021',160,12],['01032021','',20],['01032021',200,25]]
df1=pd.DataFrame(lst,columns=['Date','AuM','NNA'])
I am looking for a code which sums the columns AuM and NNA only if the values of column AuM contains a value. The result is showed below:
lst=[['01012021','',100,''],['01012021','','50',''],['01022021',140,5,145],['01022021',160,12,172],['01032021','',20,'']]
df2=pd.DataFrame(lst,columns=['Date','AuM','NNA','Sum'])
It is not a good practice to use '' in place of NaN when you have numeric data.
That said, a generic solution to your issue would be to use sum with the skipna=False option:
df1['Sum'] = (df1[['AuM', 'NNA']] # you can use as many columns as you want
.apply(pd.to_numeric, errors='coerce') # convert to numeric
.sum(1, skipna=False) # sum if all are non-NaN
.fillna('') # fill NaN with empty string (bad practice)
)
output:
Date AuM NNA Sum
0 01012021 100
1 01012021 50
2 01022021 140 5 145.0
3 01022021 160 12 172.0
4 01032021 20
5 01032021 200 25 225.0
I assume you mean to include the last row too:
df2 = (df1.assign(Sum=df1.loc[df1.AuM.ne(""), ["AuM", "NNA"]].sum(axis=1))
.fillna(""))
print(df2)
Result:
Date AuM NNA Sum
0 01012021 100
1 01012021 50
2 01022021 140 5 145.0
3 01022021 160 12 172.0
4 01032021 20
5 01032021 200 25 225.0
I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date Value PercentDifference ValueDifference
10/01/2020 1
10/02/2020 2 100 2
10/03/2020 5 150 3
10/04/2020 8 60 3
This is what I am doing:
import pandas as pd
df = pd.read_csv('df.csv')
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
on='Date')
.assign(Value = lambda x: x['Value_y']-x['Value_x'])
[['Date','Value']]
)
df['PercentDifference'] = [f'{x:.2%}' for x in (df['Value'].div(df['Value'].shift(1)) -
1).fillna(0)]
A member has helped me with the code above, I am also trying to incorporate the value difference as shown in my desired output.
Note - Is there a way to incorporate a 'period' - say, checking the percent difference and value difference over a 7 day period and 30 day period and so on?
Any suggestion is appreciated
Use Series.pct_change and Series.diff
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Or you use df.assign
df.assign(
percentageDiff=df["Value"].pct_change().mul(100),
ValueDiff=df["Value"].diff()
)
My intention is to replace labels. I found out about using a dictionary and map it to the dataframe. To that end, I first extracted the necessary fields and created a dictionary which I then fed to the map function.
My programme is as follows:
factor_name = 'Help in household'
df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')
fact_df = labels.loc[labels['Column'] == factor_name]
fact_dict = dict(zip(fact_df['Level'], fact_df['Rename']))
print df.index.to_series().map(fact_dict)
My labels.csv is as follows:
Column,Name,Level,Rename
Help in household,Every day,4,Every day
Help in household,Never,1,Never
Help in household,Once a month,2,Once a month
Help in household,Once a week,3,Once a week
State,AN,AN,Andaman & Nicobar
State,AP,AP,Andhra Pradesh
State,AR,AR,Arunachal Pradesh
State,BR,BR,Bihar
State,CG,CG,Chattisgarh
State,CH,CH,Chandigarh
State,DD,DD,Daman & Diu
State,DL,DL,Delhi
State,DN,DN,Dadra & Nagar Haveli
State,GA,GA,Goa
State,GJ,GJ,Gujarat
State,HP,HP,Himachal Pradesh
State,HR,HR,Haryana
State,JH,JH,Jharkhand
State,JK,JK,Jammu & Kashmir
State,KA,KA,Karnataka
State,KL,KL,Kerala
State,MG,MG,Meghalaya
State,MH,MH,Maharashtra
State,MN,MN,Manipur
State,MP,MP,Madhya Pradesh
State,MZ,MZ,Mizoram
State,NG,NG,Nagaland
State,OR,OR,Orissa
State,PB,PB,Punjab
State,PY,PY,Pondicherry
State,RJ,RJ,Rajasthan
State,SK,SK,Sikkim
State,TN,TN,Tamil Nadu
State,TR,TR,Tripura
State,UK,UK,Uttarakhand
State,UP,UP,Uttar Pradesh
State,WB,WB,West Bengal
My dat.csv is as follows:
Id,Help in household,Maths,Reading,Science,Social
11011001001,4,20.37,,27.78,
11011001002,3,12.96,,38.18,
11011001003,4,27.78,70,,
11011001004,4,,56.67,,36
11011001005,1,,,14.55,8.33
11011001006,4,,23.33,,30
11011001007,4,40.74,70,,
11011001008,3,,26.67,,22.92
Intended result is as follows:
4 Every day
1 Never
2 Once a month
3 Once a week
The mapping fails. The result always causes NaNs to appear which I do not want. Can anyone tell me why?
Try this:
In [140]: df['Help in household'] \
.astype(str) \
.map(labels.loc[labels['Column']=='Help in household',['Level','Rename']]
.set_index('Level')['Rename'])
Out[140]:
0 Every day
1 Once a week
2 Every day
3 Every day
4 Never
5 Every day
6 Every day
7 Once a week
Name: Help in household, dtype: object
You may also consider using merge:
In [147]: df.assign(Level=df['Help in household'].astype(str)) \
.merge(labels.loc[labels['Column']=='Help in household',['Level','Rename']],
on='Level')
Out[147]:
Id Help in household Maths Reading Science Social Level Rename
0 11011001001 4 20.37 NaN 27.78 NaN 4 Every day
1 11011001003 4 27.78 70.00 NaN NaN 4 Every day
2 11011001004 4 NaN 56.67 NaN 36.00 4 Every day
3 11011001006 4 NaN 23.33 NaN 30.00 4 Every day
4 11011001007 4 40.74 70.00 NaN NaN 4 Every day
5 11011001002 3 12.96 NaN 38.18 NaN 3 Once a week
6 11011001008 3 NaN 26.67 NaN 22.92 3 Once a week
7 11011001005 1 NaN NaN 14.55 8.33 1 Never
I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups with multiple time periods), the .groupby and .apply() functions are taking way too long to compute (has been running over an hour and still nothing -- entire data set only contains around 300k observations).
I'm ultimately wanting to iterate over multiple columns, doing the following:
Compute a rolling average for each time step in a given column, per group ID
Create a new column containing the difference between the original value and the moving average [x_t - (x_t-1 + x_t)/2]
Store column in a new DataFrame, which would be identical to original data set, except that it has the residual from #2 instead of the original value.
Repeat and append new residuals to df_resid (as seen below)
df_resid
date id rev_resid exp_resid
2005-09-01 1 NaN NaN
2005-12-01 1 -10000 -5500
2006-03-01 1 -352584 -262058.5
2006-06-01 1 240000 190049.5
2006-09-01 1 82648.75 37724.25
2005-09-01 2 NaN NaN
2005-12-01 2 4206.5 24353
2006-03-01 2 -302574 -331951
2006-06-01 2 103179 117405.5
2006-09-01 2 -52650 -72296.5
Here's small sample of the original data.
df
date id rev exp
2005-09-01 1 745168.0 545168.0
2005-12-01 1 725168.0 534168.0
2006-03-01 1 20000.0 10051.0
2006-06-01 1 500000.0 390150.0
2006-09-01 1 665297.5 465598.5
2005-09-01 2 956884.0 736987.0
2005-12-01 2 965297.0 785693.0
2006-03-01 2 360149.0 121791.0
2006-06-01 2 566507.0 356602.0
2006-09-01 2 461207.0 212009.0
And the (very slow) code:
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x:x.rolling(center=False,window=2).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), and could be extended to multiple columns.
Any help would be truly appreciated.
To quicken up the calculation, if dataframe is already sorted on 'id' then you don't have to do rolling within a groupby (if it isn't sorted... do so). Then since your window is only length 2 then we mask the result by checking where id == id.shift This works because it's sorted.
d1 = df[['rev', 'exp']]
df.join(
d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)
date id rev exp rev_resid exp_resid
0 2005-09-01 1 745168.0 545168.0 NaN NaN
1 2005-12-01 1 725168.0 534168.0 -10000.00 -5500.00
2 2006-03-01 1 20000.0 10051.0 -352584.00 -262058.50
3 2006-06-01 1 500000.0 390150.0 240000.00 190049.50
4 2006-09-01 1 665297.5 465598.5 82648.75 37724.25
5 2005-09-01 2 956884.0 736987.0 NaN NaN
6 2005-12-01 2 965297.0 785693.0 4206.50 24353.00
7 2006-03-01 2 360149.0 121791.0 -302574.00 -331951.00
8 2006-06-01 2 566507.0 356602.0 103179.00 117405.50
9 2006-09-01 2 461207.0 212009.0 -52650.00 -72296.50
I've got a 'DataFrame` which has occasional missing values, and looks something like this:
Monday Tuesday Wednesday
================================================
Mike 42 NaN 12
Jenna NaN NaN 15
Jon 21 4 1
I'd like to add a new column to my data frame where I'd calculate the average across all columns for every row.
Meaning, for Mike, I'd need
(df['Monday'] + df['Wednesday'])/2, but for Jenna, I'd simply use df['Wednesday amt.']/1
Does anyone know the best way to account for this variation that results from missing values and calculate the average?
You can simply:
df['avg'] = df.mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 27.000000
Jenna NaN NaN 15 15.000000
Jon 21 4 1 8.666667
because .mean() ignores missing values by default: see docs.
To select a subset, you can:
df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 42.0
Jenna NaN NaN 15 NaN
Jon 21 4 1 12.5
Alternative - using iloc (can also use loc here):
df['avg'] = df.iloc[:,0:2].mean(axis=1)
Resurrecting this Question because all previous answers currently print a Warning.
In most cases, use assign():
df = df.assign(avg=df.mean(axis=1))
For specific columns, one can input them by name:
df = df.assign(avg=df.loc[:, ["Monday", "Tuesday", "Wednesday"]].mean(axis=1))
Or by index, using one more than the last desired index as it is not inclusive:
df = df.assign(avg=df.iloc[:,0:3]].mean(axis=1))