I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date Value PercentDifference ValueDifference
10/01/2020 1
10/02/2020 2 100 1
10/03/2020 5 150 3
10/04/2020 8 60 3
This is what I am doing:
import pandas as pd
df = pd.read_csv('df.csv')
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
               on='Date')
        .assign(Value = lambda x: x['Value_y']-x['Value_x'])
        [['Date','Value']]
     )
df['PercentDifference'] = [f'{x:.2%}' for x in
                           (df['Value'].div(df['Value'].shift(1)) - 1).fillna(0)]
A member helped me with the code above. I am also trying to incorporate the value difference, as shown in my desired output.
Note - Is there a way to incorporate a 'period', say, checking the percent difference and value difference over a 7-day period, a 30-day period and so on?
Any suggestion is appreciated.
Use Series.pct_change and Series.diff:
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Alternatively, use df.assign, which returns a new dataframe with the same two columns:
df.assign(
    PercentageDiff=df["Value"].pct_change().mul(100),
    ValueDiff=df["Value"].diff()
)
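For the 'period' part of the question, both pct_change and diff accept a periods argument. The sketch below assumes one row per calendar day with no gaps, so that periods=7 would mean "7 days ago"; the helper name add_period_diffs and the suffixed column names are made up for illustration, and periods=2 is used only because the sample has four rows.

```python
import pandas as pd

# Sample data from the question (dates are month/day/year).
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["10/01/2020", "10/02/2020", "10/03/2020", "10/04/2020"],
        format="%m/%d/%Y",
    ),
    "Value": [1, 2, 5, 8],
})

def add_period_diffs(frame, periods):
    """Hypothetical helper: add percent and raw differences against the
    row `periods` positions earlier (rows, not calendar days)."""
    out = frame.copy()
    out[f"PercentageDiff_{periods}"] = (
        out["Value"].pct_change(periods=periods).mul(100)
    )
    out[f"ValueDiff_{periods}"] = out["Value"].diff(periods=periods)
    return out

# With daily data, periods=7 or periods=30 would give weekly or
# monthly comparisons; 2 is used here to fit the tiny sample.
result = add_period_diffs(df, 2)
print(result)
```

If days can be missing, row offsets no longer correspond to day offsets, and the data would need to be reindexed to a full daily range first.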
Existing dataframe :
df_1
Id dates time(sec)_1 time(sec)_2
1 02/02/2022 15 20
1 04/02/2022 20 30
1 03/02/2022 30 40
1 06/02/2022 50 40
2 10/02/2022 10 10
2 11/02/2022 15 20
df_2
Id min_date action_date
1 02/02/2022 04/02/2022
2 06/02/2022 10/02/2022
Expected Dataframe :
df_2
Id min_date action_date count_of_dates avg_time_1 avg_time_2
1 02/02/2022 04/02/2022 3 21.67 30
2 06/02/2022 10/02/2022 1 10 10
count_of_dates, avg_time_1 and avg_time_2 are to be created from df_1.
count_of_dates is the number of dates in df_1 (for the same Id) that fall between min_date and action_date, inclusive.
avg_time_1 and avg_time_2 are the averages of time(sec)_1 and time(sec)_2 over those same dates.
I am stuck on applying the date condition. Any leads?
If the data is small, you can filter row by row with a custom function:
df_1['dates'] = df_1['dates'].apply(pd.to_datetime)
df_2[['min_date','action_date']] = df_2[['min_date','action_date']].apply(pd.to_datetime)
def f(x):
    m = df_1['Id'].eq(x['Id']) & df_1['dates'].between(x['min_date'], x['action_date'])
    s = df_1.loc[m, ['time(sec)_1','time(sec)_2']].mean()
    return pd.Series([m.sum()] + s.to_list(), index=['count_of_dates'] + s.index.tolist())
df = df_2.join(df_2.apply(f, axis=1))
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3.0 21.666667 30.0
1 2 2022-06-02 2022-10-02 1.0 10.000000 10.0
If Id in df_2 is unique, you can improve performance by merging with df_1, filtering on the date range, and aggregating size and mean per Id:
df = df_2.merge(df_1, on='Id')
d = {'count_of_dates':('Id','size'),
'time(sec)_1':('time(sec)_1','mean'),
'time(sec)_2':('time(sec)_2','mean')}
df = df_2.join(df[df['dates'].between(df['min_date'], df['action_date'])]
.groupby('Id').agg(**d), on='Id')
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3 21.666667 30
1 2 2022-06-02 2022-10-02 1 10.000000 10
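Note that the outputs above parse the dates month-first (04/02/2022 becomes 2022-04-02). If the dates are actually day-first, which is only an assumption here, the same merge-and-filter approach can be sketched with dayfirst=True. The output column names avg_time_1 and avg_time_2 follow the question's expected dataframe.

```python
import pandas as pd

# Sample data from the question.
df_1 = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2],
    "dates": ["02/02/2022", "04/02/2022", "03/02/2022",
              "06/02/2022", "10/02/2022", "11/02/2022"],
    "time(sec)_1": [15, 20, 30, 50, 10, 15],
    "time(sec)_2": [20, 30, 40, 40, 10, 20],
})
df_2 = pd.DataFrame({
    "Id": [1, 2],
    "min_date": ["02/02/2022", "06/02/2022"],
    "action_date": ["04/02/2022", "10/02/2022"],
})

# Assumption: dates are day/month/year.
df_1["dates"] = pd.to_datetime(df_1["dates"], dayfirst=True)
for col in ["min_date", "action_date"]:
    df_2[col] = pd.to_datetime(df_2[col], dayfirst=True)

# Pair every df_1 row with its Id's date window, then keep the rows
# whose date falls inside the window (both ends inclusive).
merged = df_2.merge(df_1, on="Id")
in_range = merged[merged["dates"].between(merged["min_date"],
                                          merged["action_date"])]

stats = in_range.groupby("Id").agg(
    count_of_dates=("dates", "size"),
    avg_time_1=("time(sec)_1", "mean"),
    avg_time_2=("time(sec)_2", "mean"),
)
result = df_2.join(stats, on="Id")
print(result)
```

On this sample the counts and averages come out the same under either date interpretation, but that is a coincidence of the data and will not hold in general.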
df: (DataFrame)
Open High Close Volume
2020/1/1 1 2 3 323232
2020/1/2 2 3 4 321321
....
2020/12/31 4 5 6 123213
....
2021
The output I need is (Graph No. 1):
Open High Close Volume Year_Sum_Volume
2020/1/1 1 2 3 323232 (323232 + 321321 +....+ 123213)
2020/1/2 2 3 4 321321 (323232 + 321321 +....+ 123213)
....
2020/12/31 4 5 6 123213 (323232 + 321321 +....+ 123213)
....
2021 (x+x+x.....x)
I want the sum of Volume for each year (Year_Sum_Volume is the total volume of that row's year).
This is the code I tried to calculate the yearly sum of volume, but how can I add this data
to the daily data? I want to add Year_Sum_Volume to df, like (Graph No. 1).
df.resample('Y', on='Date')['Volume'].sum()
Thank you for answering.
I believe groupby.sum() and merge should be your friends:
import pandas as pd
df = pd.DataFrame({"date":['2021-12-30', '2021-12-31', '2022-01-01'], "a":[1,2.1,3.2]})
df.date = pd.to_datetime(df.date)
df["year"] = df.date.dt.year
df_sums = df.groupby("year")[["a"]].sum().rename(columns={"a":"a_sum"})  # select "a" so the datetime column is not summed
df = df.merge(df_sums, right_index=True, left_on="year")
which gives:
   date                 a    year  a_sum
0  2021-12-30 00:00:00  1    2021  3.1
1  2021-12-31 00:00:00  2.1  2021  3.1
2  2022-01-01 00:00:00  3.2  2022  3.2
Based on your output, Year_Sum_Volume is the same value for every row and can be calculated using df['Volume'].sum().
Then you join a column of a scaled list:
df.join(pd.DataFrame( {'Year_Sum_Volume': [your_sum_val] * len(df['Volume'])} ))
Try the code below (after converting the date column with pd.to_datetime). transform('sum') broadcasts each year's total back to every row of that year:
df.assign(Year_Sum_Volume = df.groupby(df['date'].dt.year)['a'].transform('sum'))
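Applied to the column names from the question, the transform approach might look like the sketch below. The three 2020 volumes are the ones visible in the question; the 2021 row and its value are made up so that a second year exists, and Date is assumed to be a regular column rather than the index.

```python
import pandas as pd

# Small illustrative frame; the 2021 row is invented for the example.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-12-31", "2021-01-04"]
    ),
    "Volume": [323232, 321321, 123213, 1000],
})

# Group by calendar year and broadcast each year's total back to
# every row belonging to that year.
df["Year_Sum_Volume"] = (
    df.groupby(df["Date"].dt.year)["Volume"].transform("sum")
)
print(df)
```

Unlike resample followed by a merge, transform keeps the original row order and index, so the result can be assigned directly as a new column.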
I've got the following dataframe
lst=[['01012021','',100],['01012021','','50'],['01022021',140,5],['01022021',160,12],['01032021','',20],['01032021',200,25]]
df1=pd.DataFrame(lst,columns=['Date','AuM','NNA'])
I am looking for code that sums the columns AuM and NNA only if column AuM contains a value. The result is shown below:
lst=[['01012021','',100,''],['01012021','','50',''],['01022021',140,5,145],['01022021',160,12,172],['01032021','',20,'']]
df2=pd.DataFrame(lst,columns=['Date','AuM','NNA','Sum'])
It is not a good practice to use '' in place of NaN when you have numeric data.
That said, a generic solution to your issue would be to use sum with the skipna=False option:
df1['Sum'] = (df1[['AuM', 'NNA']] # you can use as many columns as you want
.apply(pd.to_numeric, errors='coerce') # convert to numeric
.sum(1, skipna=False) # sum if all are non-NaN
.fillna('') # fill NaN with empty string (bad practice)
)
output:
       Date  AuM  NNA    Sum
0  01012021       100
1  01012021        50
2  01022021  140    5  145.0
3  01022021  160   12  172.0
4  01032021        20
5  01032021  200   25  225.0
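Since replacing NaN with '' is flagged above as bad practice, here is a sketch of the same computation that leaves missing sums as NaN. It reuses the question's df1; skipping the final fillna is the only change, and is a suggestion rather than part of the original answer.

```python
import pandas as pd

# Data from the question (note the empty strings and the '50' string).
lst = [['01012021', '', 100], ['01012021', '', '50'],
       ['01022021', 140, 5], ['01022021', 160, 12],
       ['01032021', '', 20], ['01032021', 200, 25]]
df1 = pd.DataFrame(lst, columns=['Date', 'AuM', 'NNA'])

# Coerce both columns to numbers: '' becomes NaN, '50' becomes 50.
numeric = df1[['AuM', 'NNA']].apply(pd.to_numeric, errors='coerce')

# skipna=False makes the row sum NaN whenever any input is NaN,
# so rows without an AuM value get no sum.
df1['Sum'] = numeric.sum(axis=1, skipna=False)
print(df1)
```

Keeping NaN leaves the Sum column as float, so later arithmetic and aggregation on it continue to work.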
I assume you mean to include the last row too:
df2 = (df1.assign(Sum=df1.loc[df1.AuM.ne(""), ["AuM", "NNA"]].sum(axis=1))
.fillna(""))
print(df2)
Result:
       Date  AuM  NNA    Sum
0  01012021       100
1  01012021        50
2  01022021  140    5  145.0
3  01022021  160   12  172.0
4  01032021        20
5  01032021  200   25  225.0
I have a dataframe df_corp:
ID arrival_date leaving_date
1 01/02/20 05/02/20
2 01/03/20 07/03/20
1 12/02/20 20/02/20
1 07/03/20 10/03/20
2 10/03/20 15/03/20
I would like to find the difference between leaving_date of a row and arrival date of the next entry with respect to ID. Basically I want to know how long before they book again.
So it'll look something like this.
ID arrival_date leaving_date time_between
1 01/02/20 05/02/20 NaN
2 01/03/20 07/03/20 NaN
1 12/02/20 20/02/20 7
1 07/03/20 10/03/20 16
2 10/03/20 15/03/20 3
I've tried grouping by ID to do the sum but I'm seriously lost on how to get the value from the next row and a different column in one.
You need to convert the date columns with pd.to_datetime and use GroupBy.shift to get the previous departure date for each ID:
# arrival
a = pd.to_datetime(df_corp['arrival_date'], dayfirst=True)
# previous departure per ID
l = pd.to_datetime(df_corp['leaving_date'], dayfirst=True).groupby(df_corp['ID']).shift()
# difference in days
df_corp['time_between'] = (a-l).dt.days
output:
ID arrival_date leaving_date time_between
0 1 01/02/20 05/02/20 NaN
1 2 01/03/20 07/03/20 NaN
2 1 12/02/20 20/02/20 7.0
3 1 07/03/20 10/03/20 16.0
4 2 10/03/20 15/03/20 3.0
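GroupBy.shift takes the previous row in the existing row order, so the result above relies on each ID's bookings already being in chronological order. A sketch that does not make that assumption sorts a temporary copy first and lets index alignment put the values back on the original rows; the names tmp and prev_leaving are illustrative.

```python
import pandas as pd

# Sample data from the question (day/month/year).
df_corp = pd.DataFrame({
    "ID": [1, 2, 1, 1, 2],
    "arrival_date": ["01/02/20", "01/03/20", "12/02/20",
                     "07/03/20", "10/03/20"],
    "leaving_date": ["05/02/20", "07/03/20", "20/02/20",
                     "10/03/20", "15/03/20"],
})

# Parsed copies, sorted so that shift() really sees the previous stay.
tmp = pd.DataFrame({
    "ID": df_corp["ID"],
    "arrival": pd.to_datetime(df_corp["arrival_date"], dayfirst=True),
    "leaving": pd.to_datetime(df_corp["leaving_date"], dayfirst=True),
}).sort_values(["ID", "arrival"])

# Previous departure within each ID, then the gap in whole days.
prev_leaving = tmp.groupby("ID")["leaving"].shift()
# Assignment aligns on the index, so df_corp keeps its row order.
df_corp["time_between"] = (tmp["arrival"] - prev_leaving).dt.days
print(df_corp)
```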
I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups with multiple time periods), the .groupby and .apply() functions are taking way too long to compute (has been running over an hour and still nothing -- entire data set only contains around 300k observations).
I'm ultimately wanting to iterate over multiple columns, doing the following:
1. Compute a rolling average for each time step in a given column, per group ID
2. Create a new column containing the difference between the original value and the moving average [x_t - (x_{t-1} + x_t)/2]
3. Store column in a new DataFrame, which would be identical to original data set, except that it has the residual from #2 instead of the original value.
4. Repeat and append new residuals to df_resid (as seen below)
df_resid
date id rev_resid exp_resid
2005-09-01 1 NaN NaN
2005-12-01 1 -10000 -5500
2006-03-01 1 -352584 -262058.5
2006-06-01 1 240000 190049.5
2006-09-01 1 82648.75 37724.25
2005-09-01 2 NaN NaN
2005-12-01 2 4206.5 24353
2006-03-01 2 -302574 -331951
2006-06-01 2 103179 117405.5
2006-09-01 2 -52650 -72296.5
Here's small sample of the original data.
df
date id rev exp
2005-09-01 1 745168.0 545168.0
2005-12-01 1 725168.0 534168.0
2006-03-01 1 20000.0 10051.0
2006-06-01 1 500000.0 390150.0
2006-09-01 1 665297.5 465598.5
2005-09-01 2 956884.0 736987.0
2005-12-01 2 965297.0 785693.0
2006-03-01 2 360149.0 121791.0
2006-06-01 2 566507.0 356602.0
2006-09-01 2 461207.0 212009.0
And the (very slow) code:
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x:x.rolling(center=False,window=2).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), and could be extended to multiple columns.
Any help would be truly appreciated.
To speed up the calculation: if the dataframe is already sorted on 'id', you don't have to do the rolling within a groupby (if it isn't sorted, sort it first). Since your window has length 2, a rolling value mixes two ids only on the first row of each group, so we mask the result by keeping the rows where id == id.shift(). This works because the data is sorted.
d1 = df[['rev', 'exp']]
df.join(
d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)
date id rev exp rev_resid exp_resid
0 2005-09-01 1 745168.0 545168.0 NaN NaN
1 2005-12-01 1 725168.0 534168.0 -10000.00 -5500.00
2 2006-03-01 1 20000.0 10051.0 -352584.00 -262058.50
3 2006-06-01 1 500000.0 390150.0 240000.00 190049.50
4 2006-09-01 1 665297.5 465598.5 82648.75 37724.25
5 2005-09-01 2 956884.0 736987.0 NaN NaN
6 2005-12-01 2 965297.0 785693.0 4206.50 24353.00
7 2006-03-01 2 360149.0 121791.0 -302574.00 -331951.00
8 2006-06-01 2 566507.0 356602.0 103179.00 117405.50
9 2006-09-01 2 461207.0 212009.0 -52650.00 -72296.50
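The mask trick above depends on the window being exactly 2. For other window sizes, one alternative is the built-in groupby rolling, which avoids the Python-level lambda in apply. The sketch below uses the first three rows per id from the sample data and builds df_resid as described in the question; the variable names cols and roll_mean are illustrative, and no timing was measured here.

```python
import pandas as pd

# First three observations per id from the question's sample.
df = pd.DataFrame({
    "date": pd.to_datetime(["2005-09-01", "2005-12-01", "2006-03-01",
                            "2005-09-01", "2005-12-01", "2006-03-01"]),
    "id": [1, 1, 1, 2, 2, 2],
    "rev": [745168.0, 725168.0, 20000.0, 956884.0, 965297.0, 360149.0],
    "exp": [545168.0, 534168.0, 10051.0, 736987.0, 785693.0, 121791.0],
})

cols = ["rev", "exp"]

# Rolling mean per id; the result carries an (id, original index)
# MultiIndex, so drop the id level to realign with df.
roll_mean = (
    df.groupby("id")[cols]
      .rolling(window=2)
      .mean()
      .reset_index(level=0, drop=True)
)

# Residual = original value minus its rolling mean, for all columns.
df_resid = df[["date", "id"]].join(
    (df[cols] - roll_mean).add_suffix("_resid")
)
print(df_resid)
```

Because the rolling is done per group, the first row of every id is NaN without any extra masking, and changing window needs no other code changes.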