How to decompose cohort data? - python

I'm trying to decompose cohort data into time series for further analysis. I'm imagining the algorithm pretty well, but my code doesn't work at all.
The input data in df is like:
Cohort Day     0      1     2     3     4     5
2020-12-27     5.87   4.9   2.89  1.47  1.38  0.95
2020-12-28     13.2   3.1   0.79  1.47  1.38  0.95
I'm trying to decompose it in this format:
day           sum
2020-12-27    5.87
2020-12-28    4.9
2020-12-29    2.89
2020-12-30    1.47
2020-12-31    1.38
2021-01-01    0.95
2020-12-28    13.2
2020-12-29    3.1
2020-12-30    0.79
2020-12-31    1.47
2021-01-01    1.38
2021-01-02    0.95
To achieve that, I created an empty dataframe test and then used a for loop, first to create a column with dates:
for row in test.itertuples():
    test[0:5, 0] = df['Cohort Day'] + df.apply(lambda x: int(str(df.iloc[0, 4:].columns)) for x in df.iteritems())
    test[0:5, 1] = df[0, 1:].transpose()
But all I receive is an empty test dataframe.
Any suggestions will be appreciated!

Avoid looping code, which is slow. Use fast vectorized Pandas built-in functions whenever possible.
You can transform the dataframe from wide to long by .stack(). Set day as Cohort Day plus the day offsets 0, 1, ..., 5, as follows:
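For a self-contained run, the wide input can be reconstructed like this (a minimal sketch; when read from a file the offset column labels may be strings rather than integers, which the astype(int) step below also handles):
import pandas as pd

# hypothetical reconstruction of the question's wide cohort table
df = pd.DataFrame({
    'Cohort Day': ['2020-12-27', '2020-12-28'],
    0: [5.87, 13.2],
    1: [4.9, 3.1],
    2: [2.89, 0.79],
    3: [1.47, 1.47],
    4: [1.38, 1.38],
    5: [0.95, 0.95],
})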
# convert `Cohort Day` to datetime format
df['Cohort Day'] = pd.to_datetime(df['Cohort Day'])

# transform from wide to long
df2 = (df.set_index('Cohort Day')
         .rename_axis(columns='day_offset')
         .stack()
         .reset_index(name='sum')
      )

# convert day offsets 0, 1, 2, ..., 5 to timedelta format
df2['day_offset'] = pd.to_timedelta(df2['day_offset'].astype(int), unit='d')

# set up column `day` as the `Cohort Day` + day offset
df2['day'] = df2['Cohort Day'] + df2['day_offset']

# get the desired columns
df_out = df2[['day', 'sum']]
Result:
print(df_out)
day sum
0 2020-12-27 5.87
1 2020-12-28 4.90
2 2020-12-29 2.89
3 2020-12-30 1.47
4 2020-12-31 1.38
5 2021-01-01 0.95
6 2020-12-28 13.20
7 2020-12-29 3.10
8 2020-12-30 0.79
9 2020-12-31 1.47
10 2021-01-01 1.38
11 2021-01-02 0.95
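An equivalent reshape with melt, for comparison (a sketch under the same assumptions; melt emits values column by column, so a sort is needed to match the row order above):
df2 = df.melt(id_vars='Cohort Day', var_name='day_offset', value_name='sum')
df2['day'] = df2['Cohort Day'] + pd.to_timedelta(df2['day_offset'].astype(int), unit='d')
df_out = df2.sort_values(['Cohort Day', 'day'])[['day', 'sum']].reset_index(drop=True)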

Related

Sum all columns by month?

I have a dataframe:
date C P
0 15.4.21 0.06 0.94
1 16.4.21 0.15 1.32
2 2.5.21 0.06 1.17
3 8.5.21 0.20 0.82
4 9.6.21 0.04 -5.09
5 1.2.22 0.05 7.09
I need to sum both C and P for each month.
So the new df will have a column per month; for example, for month 4 (April): 0.06 + 0.94 + 0.15 + 1.32 = 2.47, so the new df:
4/21 5/21 6/21 2/22
0 2.47 2.25 .. ..
Column names and order don't matter; actually a string month name would be even better (April 22).
I was playing with something like this, which is not what I need:
df[['C','P']].groupby(df['date'].dt.to_period('M')).sum()
You almost had it; you need to convert to datetime first with to_datetime (note dayfirst=True, since dates like 15.4.21 are day-first):
out = (df[['C','P']]
       .groupby(pd.to_datetime(df['date'], dayfirst=True)
                  .dt.to_period('M'))
       .sum()
      )
Output:
            C     P
date
2021-04  0.21  2.26
2021-05  0.26  1.99
2021-06  0.04 -5.09
2022-02  0.05  7.09
If you want the grand total, sum again:
out = (df[['C','P']]
       .groupby(pd.to_datetime(df['date'], dayfirst=True).dt.to_period('M'))
       .sum().sum(axis=1)
      )
Output:
date
2021-04    2.47
2021-05    2.25
2021-06   -5.05
2022-02    7.14
Freq: M, dtype: float64
as "Month year"
If you want a string, it is better to convert at the end, to keep the chronological order:
out.index = out.index.strftime('%B %y')
Output:
date
April 21        2.47
May 21          2.25
June 21        -5.05
February 22     7.14
dtype: float64
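If you literally want the months as columns, as sketched in the question, you can transpose the grand-total Series (a small sketch reusing out from the grand-total step, before its index was converted to strings):
wide = out.to_frame('total').T  # one row, one column per month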

Can I add a new column in a DataFrame with interpolation?

This is my current DataFrame:
Df:
DATA
4.15
4.02
3.70
3.51
3.17
2.95
2.86
NaN
NaN
I already know that 4.15 (first value) is 100%, 2.86 (last non-NaN value) is 30%, and 2.5 is 0%. First, I want to interpolate the NaN values in the first column, given that the last NaN is 2.5 (this is already predefined). After this I want to create a second column and interpolate it based on the first column and these three available percentage values.
Is it possible?
I have tried this code, but it is not giving the expected results:
df = pd.DataFrame({'DATA':range(df.DATA.min(), df.DATA.max()+1)}).merge(df, on='DATA', how='left')
df.Voltage = df.Voltage.interpolate()
Expected output:
Df:
DATA %
4.15 100%
4.02 89%
3.70 75%
3.51 70%
3.17 50%
2.95 35%
2.86 30%
2.74 15%
2.5 0%
Your logic is unclear; my understanding is that you want to compute a rank, but the provided output is unclear, so please detail the computations.
What I would do:
# replace the trailing NaN with the predefined 2.5, then fill the remaining NaN
df.loc[df.index[-1], 'DATA'] = 2.5
df['DATA'] = df['DATA'].interpolate()
# compute the percentile rank
s = df['DATA'].rank(pct=True)
# rescale to 0-1 and convert to %
df['%'] = ((s - s.min()) / (1 - s.min())).mul(100)
output:
DATA %
0 4.15 100.0
1 4.02 87.5
2 3.70 75.0
3 3.51 62.5
4 3.17 50.0
5 2.95 37.5
6 2.86 25.0
7 2.68 12.5
8 2.50 0.0
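If instead the percentage should follow the DATA values through the three known anchor points (an assumption on my side, since the expected output doesn't exactly match either rule), a piecewise-linear mapping with np.interp is a possible sketch:
import numpy as np

# hypothetical anchors from the question: 2.5 -> 0%, 2.86 -> 30%, 4.15 -> 100%
df['%'] = np.interp(df['DATA'], [2.5, 2.86, 4.15], [0, 30, 100])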

Getting the average of the previous "x" amount of days into the current position of the new Pandas column

I need help with getting the average of the previous X amount of days into the current position of the new column.
The problem I am having is at the line of code df['avg'] = (df['Close'].shift(0) + df['Close'].shift(1)) / 2.
This is what I want, but of course, I want it to be dynamic. That is where I need help! I can't figure out how to do so because of how it already seems to be looping itself when called.
I understand what it is doing and why (...I think) but can't figure out a way around it to get my desired result.
import pandas as pd
import os
import sys
import NasdaqTickerSymbols as nts

class MY_PANDA_INDICATORS():
    def __init__(self, days, csvFile):
        self.days = days
        self.df = None
        self.csvFile = csvFile

    def GetDataFrame(self):
        modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
        datapath = os.path.join(modpath, "CSV\\" + self.csvFile + ".csv")
        df = pd.read_csv(datapath)
        return df

    def GetEMA(self):
        df = self.GetDataFrame()  # load the data (df was undefined here before)
        df['avg'] = (df['Close'].shift(0) + df['Close'].shift(1)) / 2
        return df

myD = MY_PANDA_INDICATORS(2, nts.matches[0])
print(myD.GetEMA())
Here is what I am getting and also what I want, but I want to be able to change the number of days and get the average of that "x" amount I pass to it. I have tried looping, but nothing works as intended.
Date Open High Low Close Adj Close Volume avg
0 2020-11-16 1.15 1.15 1.11 1.12 1.12 17100 NaN
1 2020-11-17 1.15 1.15 1.11 1.13 1.13 29900 1.125
2 2020-11-18 1.15 1.20 1.12 1.16 1.16 127700 1.145
3 2020-11-19 1.17 1.22 1.16 1.16 1.16 64500 1.160
4 2020-11-20 1.18 1.18 1.14 1.15 1.15 32600 1.155
.. ... ... ... ... ... ... ... ...
246 2021-11-08 2.40 2.40 2.31 2.32 2.32 20000 2.340
247 2021-11-09 2.35 2.35 2.28 2.31 2.31 19700 2.315
248 2021-11-10 2.29 2.31 2.20 2.20 2.20 24200 2.255
249 2021-11-11 2.20 2.22 2.18 2.21 2.21 18700 2.205
250 2021-11-12 2.21 2.22 2.18 2.21 2.21 7800 2.210
You can index your DataFrame by the date (converted to datetime) and then take a rolling mean, passing the window as a number of days in string form (such as "2D"):
df['avg'] = df.set_index(pd.to_datetime(df['Date']))['Close'].rolling(f"{self.days}D").mean().values
On a smaller example:
df = pd.DataFrame({'date': pd.date_range('2021-01-01','2021-01-05'), 'close':[1,3,5,7,9]})
Input:
>>> df
date close
0 2021-01-01 1
1 2021-01-02 3
2 2021-01-03 5
3 2021-01-04 7
4 2021-01-05 9
df['avg'] = df.set_index(["date"]).rolling("2D").mean().values
Output:
>>> df
date close avg
0 2021-01-01 1 1.0
1 2021-01-02 3 2.0
2 2021-01-03 5 4.0
3 2021-01-04 7 6.0
4 2021-01-05 9 8.0
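Note that a "2D" window counts calendar days, so gaps such as weekends in daily stock data reduce how many rows fall inside the window. If the intent is the previous x rows regardless of calendar gaps, an integer window may be closer to what's wanted (a sketch; inside the class the count would be self.days):
df['avg'] = df['Close'].rolling(self.days).mean()  # NaN until `days` rows exist, matching row 0 above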

Time Series: Fill NaNs from another dataframe

I am working with temperature data and I have created a file that has multi-year averages for a few thousand cities; the format is as below (df1):
Date City PRCP TMAX TMIN TAVG
01-Jan Zurich 0.94 3.54 0.36 1.95
01-Feb Zurich 4.12 9.14 3.04 6.09
01-Mar Zurich 4.1 5.9 0.3 3.1
01-Apr Zurich 0.32 13.78 4.22 9
01-May Zurich 9.42 11.32 5.34 8.33
...
I have the above data for all 365 days with no nulls. Notice that the date column only has day and month, because the year is irrelevant.
Based on the above data I am trying to clean yearly files; my second dataframe has data in the below format (df2):
ID Date City PRCP TAVG TMAX TMIN
abcd1 2020-01-01 Zurich 0 -1.9 -0.9
abcd1 2020-01-02 Zurich 9.1 12.7 4.9
abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.9
abcd1 2020-01-04 Zurich 0 4.1 10.8 -2.6
...
Each city has a unique ID. The date column has the format %Y-%m-%d.
I am trying to replace the nulls in the second dataframe with the values in my first dataframe by matching day and month. This is what I tried:
df1["Date"] = pd.to_datetime(df1["Date"], errors = 'coerce') ##date format change##
df1["Date"] = df1['Date'].dt.strftime('%d-%m')
df2 = df2.drop(columns='ID')
df2 = df2.fillna(df1) ##To replace nulls##
df1["Date"] = pd.to_datetime(df1["Date"], errors = 'coerce')
df1["Date"] = df1['Date'].dt.strftime('%Y-%m-%d') ## Change data back to original format##
Even with this I end up with nulls in my yearly file, i.e. df2 (note: df1 has no nulls).
Please suggest a better way to replace only nulls or any corrections to the code if necessary.
We can approach this by adding a column Date2 to df2 with the same format as the Date column of df1. Then, setting this date format plus City as the index on both dataframes, we update df2 using .update(), as follows:
df2["Date2"] = pd.to_datetime(df2["Date"], errors = 'coerce').dt.strftime('%d-%b') # dd-MMM (e.g. 01-JAN)
df2a = df2.set_index(['Date2', 'City']) # Create df2a from df2 with set index on Date2 and City
df2a.update(df1.set_index(['Date', 'City']), overwrite=False) # update only NaN values of df2a by corresponding values of df1
df2 = df2a.reset_index(level=1).reset_index(drop=True) # result put back to df2 throwing away the temp `Date2` row index
df2.insert(2, 'City', df2.pop('City')) # relocate column City back to its original position
.update() modifies a DataFrame in place using non-NA values from another DataFrame. The DataFrame's length does not increase; only values at matching index/column labels are updated. Hence, we give both dataframes the same row index so that updates are performed on the corresponding columns with matching labels.
Note that we use the parameter overwrite=False in .update() to ensure we only update values that are NaN in the original DataFrame df2.
Demo
Data Setup:
Added data onto df1 to showcase replacing values of df2 from df1:
print(df1)
Date City PRCP TMAX TMIN TAVG
0 01-Jan Zurich 0.94 3.54 0.36 1.95
1 02-Jan Zurich 0.95 3.55 0.37 1.96 <=== Added this row
2 01-Feb Zurich 4.12 9.14 3.04 6.09
3 01-Mar Zurich 4.10 5.90 0.30 3.10
4 01-Apr Zurich 0.32 13.78 4.22 9.00
5 01-May Zurich 9.42 11.32 5.34 8.33
print(df2) # before processing
ID Date City PRCP TAVG TMAX TMIN
0 abcd1 2020-01-01 Zurich 0.0 -1.90 -0.9 NaN <=== with NaN value
1 abcd1 2020-01-02 Zurich 9.1 NaN 12.7 4.9 <=== with NaN value
2 abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.9
3 abcd1 2020-01-04 Zurich 0.0 4.10 10.8 -2.6
Run the same code as above.
Result:
print(df2)
ID Date City PRCP TAVG TMAX TMIN
0 abcd1 2020-01-01 Zurich 0.0 -1.90 -0.9 0.36 <== TMIN updated with df1 value
1 abcd1 2020-01-02 Zurich 9.1 1.96 12.7 4.90 <== TAVG updated with df1 value
2 abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.90
3 abcd1 2020-01-04 Zurich 0.0 4.10 10.8 -2.60
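For comparison, the same NaN-filling can be written as a left merge followed by a per-column fillna (a sketch; it assumes the column names shown above and a default RangeIndex on df2):
cols = ['PRCP', 'TAVG', 'TMAX', 'TMIN']
key = pd.to_datetime(df2['Date']).dt.strftime('%d-%b')  # same '01-Jan' style key as df1
merged = df2.assign(Date2=key).merge(
    df1.rename(columns={'Date': 'Date2'}),
    on=['Date2', 'City'], how='left', suffixes=('', '_avg'))
for c in cols:
    df2[c] = merged[c].fillna(merged[c + '_avg'])  # fill only where df2 had NaN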

Python/Pandas - Sum dataframe items if indexes have the same month

I have this two DataFrames:
Seasonal_Component:
# DataFrame that has the seasonal component of a time series
Date
2014-12 -1.08
2015-01 -0.28
2015-02 0.15
2015-03 0.46
2015-04 0.48
2015-05 0.37
2015-06 0.20
2015-07 0.15
2015-08 0.12
2015-09 -0.02
2015-10 -0.17
2015-11 -0.39
Prediction_df:
# DataFrame with the prediction of the trend of that same time series
Prediction MAPE Score
2015-11-01 7.93 1.83 1
2015-12-01 7.93 1.67 1
2016-01-01 7.92 1.71 1
2016-02-01 7.95 1.84 1
2016-03-01 7.94 1.53 1
2016-04-01 7.87 1.45 1
2016-05-01 7.91 1.53 1
2016-06-01 7.87 1.40 1
2016-07-01 7.84 1.40 1
2016-08-01 7.89 1.77 1
2016-09-01 7.87 1.99 1
What I need to do:
Check which Prediction_df indexes have the same month as the Seasonal_Component index and add the corresponding seasonal component to the prediction, so that Prediction_df looks like this:
Prediction MAPE Score
2015-11-01 7.54 1.83 1
2015-12-01 6.85 1.67 1
2016-01-01 7.64 1.71 1
2016-02-01 8.10 1.84 1
2016-03-01 8.40 1.53 1
2016-04-01 8.35 1.45 1
2016-05-01 8.28 1.53 1
2016-06-01 8.07 1.40 1
2016-07-01 7.99 1.40 1
2016-08-01 8.01 1.77 1
2016-09-01 7.85 1.99 1
Anyone available to enlighten my journey?
I'm already at the "almost mad" stage trying to solve this.
EDIT
Important note to make it clearer: I need to ignore the year and consider only the month when making the sum. Something like: every time an April appears (doesn't matter if it is 2006 or 2025), I need to add the April value from the Seasonal_Component frame.
Consider a data frame merge on the date fields (month values), then a simple addition of the two fields. The date fields may require conversion from string values:
import datetime as dt
...
# IF DATES ARE REGULAR COLUMNS
seasonal_component['Date'] = pd.to_datetime(seasonal_component['Date'])
seasonal_component['Month'] = seasonal_component['Date'].dt.month
predict_df['Date'] = pd.to_datetime(predict_df['Date'])
predict_df['Month'] = predict_df['Date'].dt.month
# IF DATES ARE INDICES
seasonal_component.index = pd.to_datetime(seasonal_component.index)
seasonal_component['Month'] = seasonal_component.index.month
predict_df.index = pd.to_datetime(predict_df.index)
predict_df['Month'] = predict_df.index.month
However, think about how you need to join the two data sets (akin to SQL's join clauses):
inner (default) - keeps only records that match in both
left - keeps all records of predict_df and only the matching records of seasonal_component, where predict_df is the first argument
right - keeps all records of seasonal_component and only the matching records of predict_df, where predict_df is the first argument
outer - keeps all records, those that match and those that don't
Below assumes an outer join, where data from both sides remain and NaNs fill in for missing values.
# MERGING DATA FRAMES
merge_df = pd.merge(predict_df, seasonal_component[['Month', 'SeasonalComponent']],
                    on=['Month'], how='outer')
# ADDING COLUMNS
merge_df['Prediction'] = merge_df['Prediction'] + merge_df['SeasonalComponent']
Outcome (using posted data)
Date Prediction MAPE Score Month SeasonalComponent
0 2015-11-01 7.54 1.83 1 11 -0.39
1 2015-12-01 6.85 1.67 1 12 -1.08
2 2016-01-01 7.64 1.71 1 1 -0.28
3 2016-02-01 8.10 1.84 1 2 0.15
4 2016-03-01 8.40 1.53 1 3 0.46
5 2016-04-01 8.35 1.45 1 4 0.48
6 2016-05-01 8.28 1.53 1 5 0.37
7 2016-06-01 8.07 1.40 1 6 0.20
8 2016-07-01 7.99 1.40 1 7 0.15
9 2016-08-01 8.01 1.77 1 8 0.12
10 2016-09-01 7.85 1.99 1 9 -0.02
11 NaT NaN NaN NaN 10 -0.17
First extract the month from both dataframes, then merge on month. Then add the required columns and create a new column with the desired output. Here is the code:
import pandas as pd
from pandas import DataFrame

Seasonal_Component = DataFrame({
    'Date': ['2014-12', '2015-01', '2015-02', '2015-03', '2015-04', '2015-05',
             '2015-06', '2015-07', '2015-08', '2015-09', '2015-10', '2015-11'],
    'Value': [-1.08, -0.28, 0.15, 0.46, 0.48, 0.37, 0.20, 0.15, 0.12, -0.02, -0.17, -0.39]
})
Prediction_df = DataFrame({
    'Date': ['2015-11-01', '2015-12-01', '2016-01-01', '2016-02-01', '2016-03-01',
             '2016-04-01', '2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01',
             '2016-09-01'],
    'Prediction': [7.93, 7.93, 7.92, 7.95, 7.94, 7.87, 7.91, 7.87, 7.84, 7.89, 7.87],
    'MAPE': [1.83, 1.67, 1.71, 1.84, 1.53, 1.45, 1.53, 1.40, 1.40, 1.77, 1.99],
    'Score': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
})

# extract the 'MM' part from a 'YYYY-MM' or 'YYYY-MM-DD' string
def mon_extract(date):
    return date.split('-')[1]

Seasonal_Component['Month'] = Seasonal_Component['Date'].apply(mon_extract)
Prediction_df['Month'] = Prediction_df['Date'].apply(mon_extract)

FinalDF = pd.merge(Seasonal_Component, Prediction_df, on='Month', how='right')
FinalDF['PredictionF'] = FinalDF['Value'] + FinalDF['Prediction']
FinalDF.loc[:, ['Date_y', 'PredictionF', 'MAPE', 'Score']]
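A more concise variant of the same month-matching idea, using a month-indexed Series and Series.map instead of a merge (a sketch; it assumes the Date strings parse with pd.to_datetime):
season = Seasonal_Component.set_index(
    pd.to_datetime(Seasonal_Component['Date']).dt.month)['Value']
months = pd.to_datetime(Prediction_df['Date']).dt.month
Prediction_df['PredictionF'] = Prediction_df['Prediction'] + months.map(season)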
