Difference between a multi-year timeseries and its 'standard year' - python

Assume I have a timeseries covering a number of years, as in:
import numpy as np
import pandas as pd

rng = pd.date_range(start='2001-01-01', periods=5113)
ts = pd.Series(np.random.randn(len(rng)), rng)  # pd.TimeSeries no longer exists; pd.Series is the current equivalent
Then I can calculate its 'standard year' (the average value of each calendar day over all years) by doing:
std = ts.groupby([ts.index.month, ts.index.day]).mean()
Now I was wondering how I could subtract this standard year from my multi-year timeseries, in order to get a timeseries that shows which days were below or above its standard.

You can do this using the same groupby: just subtract each group's mean from the values in that group:
average_diff = ts.groupby([ts.index.month, ts.index.day]).apply(
    lambda g: g - g.mean()
)
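As a side note (my own addition, not part of the original answer), groupby(...).transform does the same job while always preserving the original index, which avoids any ambiguity about how apply stacks the group keys:
# Equivalent anomaly series: transform broadcasts each group's mean
# back onto the original DatetimeIndex
average_diff = ts - ts.groupby([ts.index.month, ts.index.day]).transform('mean')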

Related

Calculate monthly customer churn with the 1st of each month

I am working with a subscription-based data set, of which this is an exemplar:
import random
from datetime import timedelta

import numpy as np
import pandas as pd

start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
cancel_date = [d + timedelta(days=np.random.exponential(scale=100)) for d in start_date]
churned = [bool(random.randint(0, 1)) for _ in range(len(start_date))]
df = pd.DataFrame(
    {"start_date": start_date,
     "cancel_date": cancel_date,
     "churned": churned}
)
df["cancel_date"] = df["cancel_date"].dt.date
df["cancel_date"] = df["cancel_date"].astype("datetime64[ns]")
I need a way to calculate monthly customer churn in python using the following steps:
Firstly, I need to obtain the number of subscriptions that started before the 1st of each month and are still active
Secondly, I need to obtain the number of subscriptions that started before the 1st of each month and were cancelled after the 1st of that month
The counts from these two steps together form the denominator of the monthly calculation
Finally, I need to obtain the number of subscriptions that were cancelled in each month
This step produces the numerator of the monthly calculation.
The numerator is divided by the denominator and multiplied by 100 to obtain the percentage of customers that churn each month
I am really lost with this problem. Can someone please point me in the right direction? I have been working on it for a long time.
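Not part of the original thread, but one way these steps could translate into pandas, as a rough sketch rather than a tested solution (the month grid, the interpretation of the churned flag, and all names are my own choices, assuming the df built above):
# Sketch: for each month start m, build the denominator (subscriptions that
# started before m and were still active at m) and the numerator
# (subscriptions cancelled between m and the next month start).
months = pd.date_range("2015-02-01", "2022-09-01", freq="MS")
rows = []
for m in months:
    nxt = m + pd.offsets.MonthBegin(1)
    started = df["start_date"] < m
    # steps 1+2: never churned, or cancelled on/after the 1st
    # (assumption: churned == False means the subscription is still active)
    active = started & (~df["churned"] | (df["cancel_date"] >= m))
    # numerator: churned during this month
    cancelled = started & df["churned"] & (df["cancel_date"] >= m) & (df["cancel_date"] < nxt)
    denom = active.sum()
    rows.append({"month": m, "churn_pct": 100 * cancelled.sum() / denom if denom else None})
churn = pd.DataFrame(rows)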

Group by custom period annually in Xarray

I'm trying to group an xarray.Dataset object into a custom 5-month period spanning October to January with an annual frequency. This is complicated because the period crosses New Year.
I've been trying to use the approach
wb_start = temperature.sel(time=temperature.time.dt.month.isin([10,11,12,1]))
wb_start1 = wb_start.groupby('time.year')
But this predictably groups January with the October-December of its own calendar year, rather than with the preceding October-December; the autumn months would need to be labelled +1 year. Any help would be appreciated!
I fixed this in a somewhat clunky albeit effective way by adding a year to the October-December timestamps. My method essentially moves months 10, 11 and 12 up one year while leaving the January data in place, and then does a groupby('time.year') on the reindexed time data.
import pandas as pd
from dateutil.relativedelta import relativedelta

wb_start = temperature.sel(time=temperature.time.dt.month.isin([10, 11, 12, 1]))
# convert cftime to datetime
datetimeindex = wb_start.indexes['time'].to_datetimeindex()
wb_start['time'] = pd.to_datetime(datetimeindex)
# add custom group-by-year functionality
custom_year = wb_start['time'].dt.year
# convert time values to pd.Timestamp (relativedelta does not work on np.datetime64)
time1 = [pd.Timestamp(i) for i in custom_year['time'].values]
# add a year to the October-December timestamps, leaving January in place
time2 = [i + relativedelta(years=1) if i.month >= 10 else i for i in time1]
wb_start['time'] = time2
# group by using the new time index
wb_start1 = wb_start.groupby('time.year')
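A hedged variant (my own, untested against the original data): once the index is a plain DatetimeIndex, the per-element loop can be replaced with one vectorized shift:
# Shift the Oct-Dec timestamps forward one year in a single step
dti = pd.DatetimeIndex(wb_start['time'].values)
wb_start['time'] = dti.where(dti.month < 10, dti + pd.DateOffset(years=1))
wb_start1 = wb_start.groupby('time.year')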

How can I get the last 10 records of each day?

I have a DataFrame with 96 records each day, for 5 consecutive days.
Data: {'value': {Timestamp('2018-05-03 00:07:30'): 13.02657778, Timestamp('2018-05-03 00:22:30'): 10.89890556, Timestamp('2018-05-03 00:37:30'): 11.04877222, ... (more days and times)
Datatypes: DatetimeIndex (index column) and float64 ('flow' column).
I want to keep the last 10 records before an indicated hour (H) of each day.
I only managed to do that for one day:
df.loc[df['time'] < '09:07:30'].tail(10)
You can group your data by day (or by month or by other ranges) using pandas.Grouper (see also this discussion).
In your case, use something like:
df.groupby(pd.Grouper(freq='D')).tail(10)
EDIT:
For getting all rows before a given hour, use df.loc[df.index.hour < H] (as already proposed in simpleApp's answer) where H is the hour as an integer value.
So in one line:
df.loc[df.index.hour < H].groupby(pd.Grouper(freq='D')).tail(10)
I would suggest filtering the records by hour and then grouping by date.
Data setup:
import pandas as pd

start, end = '2020-10-01 01:00:00', '2021-04-30 23:30:00'
rng = pd.date_range(start, end, freq='5min')
df = pd.DataFrame(rng, columns=['DateTS'])
Set the hour:
noon_hour = 12  # the hour used for filtering
Result (if head or tail does not return the expected rows, you may need to sort your data first):
df_before_noon = df.loc[df['DateTS'].dt.hour < noon_hour]  # records before noon
df_result = df_before_noon.groupby([df_before_noon['DateTS'].dt.date]).tail(10)  # last 10 per date

Select nearest date first day of month in a python dataframe

I have this kind of dataframe:
These data represent the value of a consumption index, generally encoded once a month (at the end or at the beginning of the following month), but sometimes more often. The value can be reset to 0 when the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, and this entry has to be the nearest to the first day of the month AND before the 15th day of the month (because a later day could be a measurement for the end of the month). Another condition is that if the difference between two consecutive values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data need to be:
The purpose is to calculate only one consumption value per month.
A solution is to parse the dataframe (as an array) and apply some if-condition statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can anchor each date to its month end with MonthEnd, sort by the distance to that anchor, and then drop duplicates on the anchor column, keeping the closest entry.
from pandas.tseries.offsets import MonthEnd

df['New'] = df.index + MonthEnd(1)
df['Diff'] = (df['New'] - df.index).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test, so please copy and paste the sample data into StackOverflow if this isn't doing the job.
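A quick illustration of what that snippet keeps (my own, hypothetical data; it retains the reading closest to each month-end anchor):
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Three hypothetical readings, two of them in October
df = pd.DataFrame({'Value': [1254, 1277, 1301]},
                  index=pd.to_datetime(['2019-10-05', '2019-10-30', '2019-11-04']))
df['New'] = df.index + MonthEnd(1)
df['Diff'] = (df['New'] - df.index).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
# keeps 2019-10-30 and 2019-11-04
print(df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1))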
Defining the dataframe, converting the index to datetime, defining helper columns,
using them with the shift method to conditionally remove rows, and finally dropping the helper columns:
from datetime import datetime as dt

import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthBegin, MonthEnd

df = pd.DataFrame(
    [[1254], [1265], [1277], [1301], [1345], [1541]],
    columns=["Value"],
    index=[dt.strptime("05-10-19", '%d-%m-%y'),
           dt.strptime("29-10-19", '%d-%m-%y'),
           dt.strptime("30-10-19", '%d-%m-%y'),
           dt.strptime("04-11-19", '%d-%m-%y'),
           dt.strptime("30-11-19", '%d-%m-%y'),
           dt.strptime("03-02-20", '%d-%m-%y')]
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
# helper columns (day_offset and start_of_month end up unused by the final filter)
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
# keep a row only when its neighbours belong to different months
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
            Value
2019-10-05   1254
2019-10-30   1277
2019-11-04   1301
2019-11-30   1345
2020-02-03   1541

Pandas fill a DataFrame from another by DatetimeIndex

I have a DataFrame of sales numbers with a DatetimeIndex, for data that extends over a couple of years at minute resolution, and I want to first calculate totals (of sales) per year, month, day, hour and location, then average over years and months.
With that data, I want to extrapolate to a new month, per day, hour and location. To do that, I calculate the sales numbers per hour for each day of the week (expecting that weekend days will behave differently from work-week days), then I create a new DataFrame for the month I want to extrapolate to. For each day in that month, I compute (day of week, hour, POS) and use the past data for the corresponding (day of week, hour, POS) as my "prediction" for what will be sold at that POS at the given hour and day in the given month.
The reason I'm doing it this way is that once I have the per-day-of-week means from the past and I populate the DataFrame for the month of June, the 1st of June could be any day of the week, and that matters because weekdays and weekend days behave differently. I want the past sales numbers for a Friday if the 1st is a Friday.
I have the following, which is unfortunately too slow, or maybe wrong; in any case there is no error message, but it doesn't complete (on the real data):
import numpy as np
import pandas as pd

# Set up some sales data for the past 2 years for some stores
hours = pd.date_range('2018-01-01', '2019-12-31', freq='h')
sales = pd.DataFrame(index=hours, columns=['Store', 'Count'])
sales['Store'] = np.random.randint(0, 10, sales.shape[0])
sales['Count'] = np.random.randint(0, 100, sales.shape[0])
# Calculate the average of sales over these 2 years for each hour in
# each day of the week and each store
avg_sales = sales.groupby([sales.index.year, sales.index.month, sales.index.dayofweek, sales.index.hour, 'Store'])['Count'] \
    .sum() \
    .rename_axis(index=['Year', 'Month', 'DayOfWeek', 'Hour', 'Store']) \
    .reset_index() \
    .groupby(['DayOfWeek', 'Hour', 'Store'])['Count'] \
    .mean()
# Set up a DataFrame to predict May sales per store/day/hour
may_hours = pd.date_range('2020-05-01', '2020-05-31', freq='h')
predicted = pd.DataFrame(index=pd.MultiIndex.from_product([may_hours, range(0, 10)]),  # stores 0-9, matching the data above
                         columns=['Count']) \
    .rename_axis(index=['Datetime', 'Store'])
# "Predict" sales for each (day, hour, store) in May 2020
# by retrieving the average sales for the corresponding
# (day of week, hour, store)
for idx in predicted.index:
    qidx = (idx[0].dayofweek, idx[0].hour, idx[1])
    predicted.loc[idx] = avg_sales[qidx] if qidx in avg_sales.index else 0
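Since the row-by-row .loc assignment is almost certainly the bottleneck, a vectorized alternative (my own sketch, not from the thread, reusing the avg_sales series above) is to build the lookup keys as columns and fetch all the averages with one merge:
# Merge the (DayOfWeek, Hour, Store) averages onto the prediction frame
lookup = avg_sales.reset_index()  # columns: DayOfWeek, Hour, Store, Count
pred = predicted.reset_index().drop(columns='Count')
pred['DayOfWeek'] = pred['Datetime'].dt.dayofweek
pred['Hour'] = pred['Datetime'].dt.hour
pred = pred.merge(lookup, on=['DayOfWeek', 'Hour', 'Store'], how='left')
predicted = pred.set_index(['Datetime', 'Store'])[['Count']].fillna(0)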
