Python/Pandas - Sum dataframe items if indexes have the same month

I have these two DataFrames:
Seasonal_Component:
# DataFrame that has the seasonal component of a time series
Date
2014-12 -1.08
2015-01 -0.28
2015-02 0.15
2015-03 0.46
2015-04 0.48
2015-05 0.37
2015-06 0.20
2015-07 0.15
2015-08 0.12
2015-09 -0.02
2015-10 -0.17
2015-11 -0.39
Prediction_df:
# DataFrame with the prediction of the trend of that same time series
Prediction MAPE Score
2015-11-01 7.93 1.83 1
2015-12-01 7.93 1.67 1
2016-01-01 7.92 1.71 1
2016-02-01 7.95 1.84 1
2016-03-01 7.94 1.53 1
2016-04-01 7.87 1.45 1
2016-05-01 7.91 1.53 1
2016-06-01 7.87 1.40 1
2016-07-01 7.84 1.40 1
2016-08-01 7.89 1.77 1
2016-09-01 7.87 1.99 1
What I need to do:
Check which Prediction_df index entries have the same month as the Seasonal_Component index entries, and add the corresponding seasonal component to the prediction, so that Prediction_df looks like this:
Prediction MAPE Score
2015-11-01 7.54 1.83 1
2015-12-01 6.85 1.67 1
2016-01-01 7.64 1.71 1
2016-02-01 8.10 1.84 1
2016-03-01 8.40 1.53 1
2016-04-01 8.35 1.45 1
2016-05-01 8.28 1.53 1
2016-06-01 8.07 1.40 1
2016-07-01 7.99 1.40 1
2016-08-01 8.01 1.77 1
2016-09-01 7.85 1.99 1
Anyone available to enlighten my journey?
I'm already at the "almost mad" stage trying to solve this.
EDIT
Important note to make it clearer: I need to disregard the year and consider only the month when making the sum. Something like: "every time an April appears (no matter whether it is 2006 or 2025), I need to add the April value from the Seasonal_Component frame."

Consider a data frame merge on the date fields (month values), then a simple addition of the two fields. The date fields may require conversion from string values:
import pandas as pd
...
# IF DATES ARE REGULAR COLUMNS
seasonal_component['Date'] = pd.to_datetime(seasonal_component['Date'])
seasonal_component['Month'] = seasonal_component['Date'].dt.month
predict_df['Date'] = pd.to_datetime(predict_df['Date'])
predict_df['Month'] = predict_df['Date'].dt.month
# IF DATES ARE INDICES
seasonal_component.index = pd.to_datetime(seasonal_component.index)
seasonal_component['Month'] = seasonal_component.index.month
predict_df.index = pd.to_datetime(predict_df.index)
predict_df['Month'] = predict_df.index.month
However, think about how you need to join the two data sets (akin to SQL's join clauses):
inner (default) - keeps only the records that match in both frames
left - keeps all records of predict_df and only the matching records of seasonal_component (with predict_df as the first argument)
right - keeps all records of seasonal_component and only the matching records of predict_df (with predict_df as the first argument)
outer - keeps all records, matching and non-matching
The code below assumes an outer join, where records from both sides remain and NaNs fill in for missing values.
# MERGING DATA FRAMES
merge_df = pd.merge(predict_df, seasonal_component[['Month', 'SeasonalComponent']],
                    on=['Month'], how='outer')
# ADDING COLUMNS
merge_df['Prediction'] = merge_df['Prediction'] + merge_df['SeasonalComponent']
Outcome (using posted data)
Date Prediction MAPE Score Month SeasonalComponent
0 2015-11-01 7.54 1.83 1 11 -0.39
1 2015-12-01 6.85 1.67 1 12 -1.08
2 2016-01-01 7.64 1.71 1 1 -0.28
3 2016-02-01 8.10 1.84 1 2 0.15
4 2016-03-01 8.40 1.53 1 3 0.46
5 2016-04-01 8.35 1.45 1 4 0.48
6 2016-05-01 8.28 1.53 1 5 0.37
7 2016-06-01 8.07 1.40 1 6 0.20
8 2016-07-01 7.99 1.40 1 7 0.15
9 2016-08-01 8.01 1.77 1 8 0.12
10 2016-09-01 7.85 1.99 1 9 -0.02
11 NaT NaN NaN NaN 10 -0.17
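To get back the layout from the question, one possible cleanup (a sketch, assuming the merge_df built above with dates as regular columns) is to drop the unmatched seasonal row, restore the date index, and remove the helper columns:
# drop the NaT row left by the outer join, re-index by date,
# and remove the Month/SeasonalComponent helper columns
final_df = (merge_df.dropna(subset=['Date'])
                    .set_index('Date')
                    .sort_index()
                    .drop(columns=['Month', 'SeasonalComponent']))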

First extract the month from both dataframes, then merge on the month. Finally, add the relevant columns together to create a new column with the desired output. Here is the code:
import pandas as pd
from pandas import DataFrame

Seasonal_Component = DataFrame({
    'Date': ['2014-12','2015-01','2015-02','2015-03','2015-04','2015-05','2015-06','2015-07','2015-08','2015-09','2015-10','2015-11'],
    'Value': [-1.08,-0.28,0.15,0.46,0.48,0.37,0.20,0.15,0.12,-0.02,-0.17,-0.39]
})
Prediction_df = DataFrame({
    'Date': ['2015-11-01','2015-12-01','2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-01','2016-07-01','2016-08-01','2016-09-01'],
    'Prediction': [7.93,7.93,7.92,7.95,7.94,7.87,7.91,7.87,7.84,7.89,7.87],
    'MAPE': [1.83,1.67,1.71,1.84,1.53,1.45,1.53,1.40,1.40,1.77,1.99],
    'Score': [1,1,1,1,1,1,1,1,1,1,1]
})

# the month sits after the first '-' in both 'YYYY-MM' and 'YYYY-MM-DD'
# strings, so a single extractor works for both frames
def mon_extract(date):
    return date.split('-')[1]

Seasonal_Component['Month'] = Seasonal_Component['Date'].apply(mon_extract)
Prediction_df['Month'] = Prediction_df['Date'].apply(mon_extract)

# right join keeps every prediction row
FinalDF = pd.merge(Seasonal_Component, Prediction_df, on='Month', how='right')
FinalDF['PredictionF'] = FinalDF['Value'] + FinalDF['Prediction']
FinalDF.loc[:, ['Date_y', 'PredictionF', 'MAPE', 'Score']]
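As a side note, the month extraction can also be done without apply, using the vectorized .str accessor (a small sketch on the same frames):
# vectorized alternative to the apply-based extraction above
Seasonal_Component['Month'] = Seasonal_Component['Date'].str.split('-').str[1]
Prediction_df['Month'] = Prediction_df['Date'].str.split('-').str[1]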

Related

Sum all columns by month?

I have a dataframe:
date C P
0 15.4.21 0.06 0.94
1 16.4.21 0.15 1.32
2 2.5.21 0.06 1.17
3 8.5.21 0.20 0.82
4 9.6.21 0.04 -5.09
5 1.2.22 0.05 7.09
I need to sum both C and P for each month.
So the new df will have one column per month; for example, for month 4 (April): (0.06+0.94+0.15+1.32) = 2.47, so the new df:
4/21 5/21 6/21 2/22
0 2.47 2.25 .. ..
Column names and order don't matter; actually a string month name would be even better (April 22).
I was playing with something like this, which is not what I need:
df[['C','P']].groupby(df['date'].dt.to_period('M')).sum()
You almost had it; you need to convert to datetime first:
out = (df[['C','P']]
       .groupby(pd.to_datetime(df['date'], dayfirst=True)
                  .dt.to_period('M'))
       .sum()
       )
Output:
            C     P
date
2021-04  0.21  2.26
2021-05  0.26  1.99
2021-06  0.04 -5.09
2022-02  0.05  7.09
If you want the grand total, sum again:
out = (df[['C','P']]
       .groupby(pd.to_datetime(df['date'], dayfirst=True).dt.to_period('M'))
       .sum().sum(axis=1)
       )
Output:
date
2021-04    2.47
2021-05    2.25
2021-06   -5.05
2022-02    7.14
Freq: M, dtype: float64
as "Month year"
If you want a string, it is better to convert at the end to keep the chronological order:
out.index = out.index.strftime('%B %y')
Output:
date
April 21       2.47
May 21         2.25
June 21       -5.05
February 22    7.14
dtype: float64
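If you specifically want the one-row wide layout from the question (one column per month), a small follow-up sketch, assuming the out Series from above:
# transpose the Series into a single-row frame, one column per month
wide = out.to_frame().T
which should give something like:
  April 21  May 21  June 21  February 22
0     2.47    2.25    -5.05         7.14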

week of the year aggregation using python (week starts from 01 01 YYYY)

I searched previous questions, but they do not resolve what I am looking for; please can you help me?
I have a dataset:
Date T2M Y T F H G Week_Number
0 1981-01-01 11.08 17.35 6.94 0.00 5.37 4.63 1
1 1981-01-02 10.82 16.41 7.51 0.00 5.55 2.73 1
2 1981-01-03 10.74 15.64 7.35 0.00 6.23 2.33 1
3 1981-01-04 11.17 15.99 8.46 0.00 6.16 1.66 1
4 1981-01-05 10.20 15.60 6.87 0.12 6.10 2.78 2
5 1981-01-06 10.35 16.16 5.95 0.00 6.59 3.92 2
6 1981-01-07 12.26 18.24 9.30 0.00 6.10 2.30 2
7 1981-01-08 12.76 19.23 8.72 0.00 6.29 3.96 2
8 1981-01-09 12.61 17.80 8.90 0.00 6.71 2.05 2
I already created a column of the week number using this code
df['Week_Number'] = df['Date'].dt.week
but it gives me only the first four days of the year as the first week, which probably means the week starts from Monday. In my case I don't mind whether it starts from Monday or another day; I just want to subdivide each year into groups of seven days (group every 7 days of each year, e.g. from 1 1 1980 to 07 1 1980 is the FIRST week, and so on), and every next year the first week starts again from 1 1 xxxx.
If you want your week numbers to start from the 1st of January, irrespective of the day of week, simply get the day of year, subtract 1 and compute the integer division by 7 (e.g. January 8th is day 8, and (8-1)//7 = 1, so week 2 after adding 1):
df['Date'] = pd.to_datetime(df['Date'])
df['week_number'] = df['Date'].dt.dayofyear.sub(1).floordiv(7).add(1)
NB. you do not need to add 1 if you want the first week to start with 0
output:
Date T2M Y T F H G Week_Number week_number
0 1981-01-01 11.08 17.35 6.94 0.00 5.37 4.63 1 1
1 1981-01-02 10.82 16.41 7.51 0.00 5.55 2.73 1 1
2 1981-01-03 10.74 15.64 7.35 0.00 6.23 2.33 1 1
3 1981-01-04 11.17 15.99 8.46 0.00 6.16 1.66 1 1
4 1981-01-05 10.20 15.60 6.87 0.12 6.10 2.78 2 1
5 1981-01-06 10.35 16.16 5.95 0.00 6.59 3.92 2 1
6 1981-01-07 12.26 18.24 9.30 0.00 6.10 2.30 2 1
7 1981-01-08 12.76 19.23 8.72 0.00 6.29 3.96 2 2
8 1981-01-09 12.61 17.80 8.90 0.00 6.71 2.05 2 2
Then you can use the new column to groupby, for example:
df.groupby('week_number').agg({'Date': ['min', 'max'], 'T2M': 'sum'})
output:
Date T2M
min max sum
week_number
1 1981-01-01 1981-01-07 76.62
2 1981-01-08 1981-01-09 25.37
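Since dayofyear resets every 1st of January, the week counter restarts automatically with each new year. A quick sketch to check this, using hypothetical dates around a year boundary:
import pandas as pd
# hypothetical dates spanning a year boundary
dates = pd.to_datetime(['1981-12-31', '1982-01-01', '1982-01-07', '1982-01-08'])
week_number = (dates.dayofyear - 1) // 7 + 1
print(list(week_number))  # [53, 1, 1, 2]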

How to decompose cohort data?

I'm trying to decompose cohort data into time series for further analysis. I'm imagining the algorithm pretty well, but my code doesn't work at all.
The input data in df is like:
Cohort Day      0    1     2     3     4     5
2020-12-27   5.87  4.9  2.89  1.47  1.38  0.95
2020-12-28   13.2  3.1  0.79  1.47  1.38  0.95
I'm trying to decompose it in this format:
day          sum
2020-12-27   5.87
2020-12-28   4.9
2020-12-29   2.89
2020-12-30   1.47
2020-12-31   1.38
2021-01-01   0.95
2020-12-28   13.2
2020-12-29   3.1
2020-12-30   0.79
2020-12-31   1.47
2021-01-01   1.38
2021-01-02   0.95
To achieve that I created an empty dataframe test and then used a for loop to try to create a column with dates first:
for row in test.itertuples():
    test[0:5, 0] = df['Cohort Day'] + df.apply(lambda x: int(str(df.iloc[0, 4:].columns)) for x in df.iteritems())
    test[0:5, 1] = df[0, 1:].transpose()
But all I receive is an empty test dataframe.
Any suggestions will be appreciated!
Avoid looping code, which is slow. Use fast vectorized Pandas built-in functions whenever possible.
You can transform the dataframe from wide to long by .stack(). Set day as Cohort Day plus the day offsets 0, 1, ..., 5, as follows:
# convert `Cohort Day` to datetime format
df['Cohort Day'] = pd.to_datetime(df['Cohort Day'])
# transform from wide to long
df2 = (df.set_index('Cohort Day')
         .rename_axis(columns='day_offset')
         .stack()
         .reset_index(name='sum')
       )
# convert day offsets 0, 1, 2, ..., 5 to timedelta format
df2['day_offset'] = pd.to_timedelta(df2['day_offset'].astype(int), unit='d')
# set up column `day` as the `Cohort Day` + day offsets
df2['day'] = df2['Cohort Day'] + df2['day_offset']
# Get the desired columns
df_out = df2[['day', 'sum']]
Result:
print(df_out)
day sum
0 2020-12-27 5.87
1 2020-12-28 4.90
2 2020-12-29 2.89
3 2020-12-30 1.47
4 2020-12-31 1.38
5 2021-01-01 0.95
6 2020-12-28 13.20
7 2020-12-29 3.10
8 2020-12-30 0.79
9 2020-12-31 1.47
10 2021-01-01 1.38
11 2021-01-02 0.95
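An equivalent reshape can also be done with melt (a sketch on the same df; melt orders by column rather than by row, so the rows need re-sorting to keep each cohort's days together):
# melt-based alternative to stack(); assumes 'Cohort Day' is already datetime
df2 = df.melt(id_vars='Cohort Day', var_name='day_offset', value_name='sum')
df2['day'] = df2['Cohort Day'] + pd.to_timedelta(df2['day_offset'].astype(int), unit='d')
df_out = df2.sort_values(['Cohort Day', 'day'])[['day', 'sum']].reset_index(drop=True)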

Reindexing data frame Pandas

I am trying to split a data set for training and testing using Pandas.
import pandas as pd

data = pd.read_csv("housingdata.csv", header=None)
train = data.sample(frac=0.6)
train.reindex()
test = data.loc[~data.index.isin(train.index)]
print(train)
print(test)
when I print the data, I get
0 1 2 3 4
9 0.17004 12.5 7.87 0 0.524
1 0.02731 0.0 7.07 0 0.469
5 0.02985 0.0 2.18 0 0.458
3 0.03237 0.0 2.18 0 0.458
7 0.14455 12.5 7.87 0 0.524
6 0.08829 12.5 7.87 0 0.524
0 1 2 3 4
0 0.00632 18.0 2.31 0 0.538
2 0.02729 0.0 7.07 0 0.469
4 0.06905 0.0 2.18 0 0.458
8 0.21124 12.5 7.87 0 0.524
As you can see, the row indices are shuffled. How do I re-index the rows in both data sets?
This, however, does not change global settings. E.g.,
train.iloc[0,4]
gives 0.524
As @EdChum's comments point out, it's not exactly clear what behavior you're looking for. But if all you want is to give both new dataframes indices going from 0, 1, 2, ..., n, then you can use reset_index():
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)
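A minimal self-contained sketch (with hypothetical stand-in data, since housingdata.csv is not available here) showing the effect:
import numpy as np
import pandas as pd
# hypothetical 10-row frame standing in for housingdata.csv
data = pd.DataFrame(np.arange(50).reshape(10, 5))
train = data.sample(frac=0.6)
test = data.loc[~data.index.isin(train.index)]
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)
print(train.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(test.index.tolist())   # [0, 1, 2, 3]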

Re-shaping Dataframe so that Column Headers are made into Rows

I am trying to reshape the dataframe below.
Tenor 2013M06D12 2013M06D13 2013M06D14 \
1 1 1.24 1.26 1.23
4 2 2.01 0.43 0.45
5 3 1.21 2.24 1.03
8 4 0.39 2.32 1.23
So that it looks as follows. I was looking at using pivot_table, but this is sort of the opposite of what pivot_table would do, as I need to convert column headers into rows and not the other way around. Hence, I am not sure how to proceed to obtain this dataframe.
Date Tenor Rate
1 2013-06-12 1 1.24
2 2013-06-13 1 1.26
4 2013-06-14 1 1.23
The code just involves reading from a CSV:
result = pd.read_csv("BankofEngland.csv")
I think you can do this with a melt, a sort, a date parse, and some column shuffling:
dfm = pd.melt(df, id_vars="Tenor", var_name="Date", value_name="Rate")
dfm = dfm.sort_values("Tenor").reset_index(drop=True)
dfm["Date"] = pd.to_datetime(dfm["Date"], format="%YM%mD%d")
dfm = dfm[["Date", "Tenor", "Rate"]]
produces
In [104]: dfm
Out[104]:
Date Tenor Rate
0 2013-06-12 1 1.24
1 2013-06-13 1 1.26
2 2013-06-14 1 1.23
3 2013-06-12 2 2.01
4 2013-06-13 2 0.43
5 2013-06-14 2 0.45
6 2013-06-12 3 1.21
7 2013-06-13 3 2.24
8 2013-06-14 3 1.03
9 2013-06-12 4 0.39
10 2013-06-13 4 2.32
11 2013-06-14 4 1.23
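Incidentally, pivot is the inverse of this operation; a quick sketch, assuming the dfm from above, that recovers the original wide shape:
# pivot back: Tenor as rows, dates as columns
wide = dfm.pivot(index='Tenor', columns='Date', values='Rate')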
import pandas as pd
import numpy as np
# try to read your sample data, replace with your read_csv func
df = pd.read_clipboard()
Out[139]:
Tenor 2013M06D12 2013M06D13 2013M06D14
1 1 1.24 1.26 1.23
4 2 2.01 0.43 0.45
5 3 1.21 2.24 1.03
8 4 0.39 2.32 1.23
# reshaping
df.set_index('Tenor', inplace=True)
df = df.stack().reset_index()
df.columns=['Tenor', 'Date', 'Rate']
# suggested by DSM, use the date parser
df.Date = pd.to_datetime(df.Date, format='%YM%mD%d')
Out[147]:
Tenor Date Rate
0 1 2013-06-12 1.24
1 1 2013-06-13 1.26
2 1 2013-06-14 1.23
3 2 2013-06-12 2.01
4 2 2013-06-13 0.43
.. ... ... ...
7 3 2013-06-13 2.24
8 3 2013-06-14 1.03
9 4 2013-06-12 0.39
10 4 2013-06-13 2.32
11 4 2013-06-14 1.23
[12 rows x 3 columns]
