Convert daily data to quarterly data and convert back after forecasting - python

My idea is about forecasting data at different time resolutions.
For example:
            A      day  month  year  quarter  week
Date
2016-01-04  36.81    4      1  2016        1     1
2016-01-05  35.97    5      1  2016        1     1
2016-01-06  33.97    6      1  2016        1     1
2016-01-07  33.29    7      1  2016        1     1
2016-01-08  33.20    8      1  2016        1     2
2016-01-11  31.42   11      1  2016        1     2
2016-01-12  30.42   12      1  2016        1     2
I have daily data and I want to forecast it at the monthly level, then convert the monthly forecast back to daily.
The method I used is to compute each (month, day) combination's percentage of the overall total.
Here is some code I used:
converted_data = data.groupby([data['month'], data['day']])['A'].sum()
average = converted_data / converted_data.sum()
average
which gives the following result:
month  day         A
1      3     0.002218
       4     0.003815
       5     0.003801
...
12     26    0.002522
       27    0.004764
       28    0.004822
       29    0.004839
       30    0.002277
Using this, when I want to convert back from yearly data to daily, I just multiply these averages by the result of the forecast.
But this does not work when I want to convert daily data to quarterly data.
Can anyone suggest an idea on how to do it?
Thank you for your consideration.
Edit
The data I want is each day's percentage relative to its quarter, something like:
A = the total over all rows where the day equals 1 (and likewise for day 2, 3, ...).
#for example my data is
Date        value
1/1/2000    50
1/2/2000    50
1/3/2000    40
Then A for day 1 is 140 (dates are day/month/year, so all three rows fall on day 1).
B = the total over all rows where the quarter equals 1 (and likewise for quarters 2, 3 and 4).
#for example my data is
Date         value
1/1/2000     4000    #--> quarter 1
1/4/2000     5000    #--> quarter 2
1/7/2000     2000    #--> quarter 3
1/10/2000    1000    #--> quarter 4
1/1/2001     2000    #--> quarter 1
Then the average for day 1 relative to its quarter is 140/6000, since A lies in quarter 1 and the quarter-1 total is 4000 + 2000 = 6000.
The data above is what I converted.
My input arrives as a daily pandas Series; I converted it to the DataFrame shown above, extracting the day, month, year, quarter and week columns in order to group the data. This method works well for converting to monthly and yearly totals.
The reason I do this is that my input is daily and I want to convert it to yearly data in order to do the forecasting.
After forecasting I get the forecast value in yearly form, so I want to convert it back to daily, and my method for that is to take the proportions from the previous data.
I am sorry for the unclear question.
Again, thanks for your help.
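A minimal sketch of the quarterly version of this approach (my own assumption, mirroring the monthly code above; it presumes a DataFrame data with a DatetimeIndex and a value column 'A'):
import pandas as pd

# each row's quarter (1-4), taken from the DatetimeIndex
data['quarter'] = data.index.quarter

# each day's share of its quarter's total
# (with several years of data, group by year and quarter instead)
quarter_totals = data.groupby('quarter')['A'].transform('sum')
data['day_share'] = data['A'] / quarter_totals

# after forecasting quarterly totals, multiply each forecast quarter's
# total by the matching day_share values to recover a daily series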

Related

Evaluation of a data set with conditional selection of columns

I want to evaluate a data set with precipitation data. The data is available as a csv file, which I have read into a pandas DataFrame, giving the following table:
       year  month  day      value
0      1981      1    1   0.522592
1      1981      1    2   2.692495
2      1981      1    3   0.556698
3      1981      1    4   0.000000
4      1981      1    5   0.000000
...     ...    ...  ...        ...
43824  2100     12   27   0.000000
43825  2100     12   28   0.185120
43826  2100     12   29  10.252080
43827  2100     12   30  13.389290
43828  2100     12   31   3.523566
Now I want to convert the daily precipitation values into monthly precipitation values, for every month (for this I need the sum over all days of a month). I probably need a loop or something similar, but I do not know how to proceed. Maybe via a conditional selection over 'year' and 'month'?!
I would be very happy about feedback! :)
That's what I tried so far:
for i in range(len(dataframe)):
    print(dataframe.loc[i, 'year'], dataframe.loc[i, 'month'])
I would start out by making a single column with the date:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
From here you can make the date the index:
df.set_index('date', inplace=True)
# I'll drop the unneeded year, month, and day columns as well.
df = df[['value']]
My data now looks like:
value
date
1981-01-01 0.522592
1981-01-02 2.692495
1981-01-03 0.556698
1981-01-04 0.000000
1981-01-05 0.000000
From here, let's try resampling the data!
# let's do a 2-day sum. To do monthly, you'd replace '2d' with 'M'.
df.resample('2d').sum()
Output:
value
date
1981-01-01 3.215087
1981-01-03 0.556698
1981-01-05 0.000000
Hopefully this gives you something to start with~
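For the monthly totals the question actually asks for, the same resample call should work; a small sketch (note the month-end alias is 'M' in older pandas and 'ME' from pandas 2.2 on):
# monthly precipitation sums, using the datetime index built above
monthly = df.resample('M').sum()   # use 'ME' on pandas >= 2.2
print(monthly.head())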
Have you tried groupby?
df.groupby(['year', 'month'])['value'].agg('sum')

Calculating values from time series in pandas multi-indexed pivot tables

I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd
df = pd.DataFrame({'Id': ['A4G8','A4G8','A4G8','P9N3','P9N3','P9N3','P9N3','C7R5','L4U7'],
                   'Date': ['2016-1-1','2016-1-15','2016-1-30','2017-2-12','2017-2-28','2017-3-10','2019-1-1','2018-6-1','2019-8-6'],
                   'Quality': [2,3,6,1,5,10,10,2,2]})
pt = df.pivot_table(values='Quality', index=['Id','Date'])
print(pt)
Leads to this:
                Quality
Id   Date
A4G8 2016-1-1         2
     2016-1-15        3
     2016-1-30        6
P9N3 2017-2-12        1
     2017-2-28        5
     2017-3-10       10
     2019-1-1        10
C7R5 2018-6-1         2
L4U7 2019-8-6         2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below
                Quality  Time From First  Time To Prev
Id   Date
A4G8 2016-1-1         2           0 days       NA days
     2016-1-15        4          14 days       14 days
     2016-1-30        6          29 days       14 days
P9N3 2017-2-12        1           0 days       NA days
     2017-2-28        5          15 days       15 days
     2017-3-10       10          24 days        9 days
The Id column is a string; I've converted the Date column to datetime and the Quality column to an integer.
The data is rather large (>10,000 unique Ids), so for performance reasons I'm trying to avoid for loops. I'm guessing the solution somehow uses pd.eval, but I'm stuck on how to apply it correctly.
Apologies, I'm a Python, pandas, & Stack Overflow noob and I haven't found the answer anywhere yet, so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Date to datetimes, then subtract the minimal datetime per group (GroupBy.transform('min')) from the Date column; for the second new column use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print (df)
Id Date Quality Time From First Time To Prev
0 A4G8 2016-01-01 2 0 days NaT
1 A4G8 2016-01-15 3 14 days 14 days
2 A4G8 2016-01-30 6 29 days 15 days
3 P9N3 2017-02-12 1 0 days NaT
4 P9N3 2017-02-28 5 16 days 16 days
5 P9N3 2017-03-10 10 26 days 10 days
6 P9N3 2019-01-01 10 688 days 662 days
7 C7R5 2018-06-01 2 0 days NaT
8 L4U7 2019-08-06 2 0 days NaT
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
    df.groupby(["Id"]).Date.first(),
    on="Id",
    how="left",
    suffixes=["", "_first"]
)
df["Time From First"] = df.Date - df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df
output:

How to create a multiindex chart in Pandas that combines categories and numericals

This is a tricky one for me, so bear with me. I'm creating a daily dataset that compiles units using timestamps.
day of the week    month  day  hour  week of year  units
Monday           January    3    16             1      1
Monday           January    3    19             1      1
Tuesday          January    4    21             1      1
Tuesday          January    4    22             1      1
Wednesday        January    5    23             1      1
Monday           January   10    16             2      1
Monday           January   10    19             2      1
Tuesday          January   11    21             2      1
Tuesday          January   11    22             2      1
Wednesday        January   12    23             2      1
The various columns are created using Pandas' excellent time functions, and it is relatively trivial to create pivot plots based on a single column, such as day (date of the month), month, hour, or even day of the week (thanks to this excellent code sample, although lord knows where I found it on SO):
from pandas.api.types import CategoricalDtype
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
cat_type = CategoricalDtype(categories=cats, ordered=True)
df['day of the week'] = df['day of the week'].astype(cat_type)
As the dataset increases in size, what I'd like to be able to do is pivot on both, say, week of year and day of the week:
                              units
week of year  day of the week
1             Friday            15.2
              Monday            22.8
2             Friday            19.0
3             Thursday          28.0
Unfortunately, when I perform pd.pivot_table using week_of_year (numeric) and then the categorical (day_of_the_week), I get the numeric column but lose the ordering of the categorical.
I'd also like to be able to visualise the units trend over time (by week as well as by day of the week).
My head says create a matrix plot by week, but that misses out the time (day of the week) dimension.
Any ideas? I'm not necessarily looking for a solution, although I'll happily write this up if I fix it, as I can't see this as a unique problem.
Update: I have a solution in my head for how I'd solve this in Excel. I'd select day_of_the_week as rows (and then sort), pick the numerical week_of_year as columns, aggregate units as necessary and then plot.
With the sample data and code you provided, you could try this:
new_df = (
    df.groupby(["week of year", "day of the week"]).sum().drop(columns=["day", "hour"])
)
new_df = new_df[new_df["units"] > 0]
So that:
print(new_df)
# Output
                              units
week of year  day of the week
1             Monday               2
              Tuesday              2
              Wednesday            1
2             Monday               2
              Tuesday              2
              Wednesday            1
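To also visualise the units trend the question mentions (by week and by day of the week), one option is to unstack the weekday level of new_df so each day becomes its own line; this is only a sketch, and it assumes matplotlib is installed. Grouping on the ordered categorical means the column order comes out in weekday order:
# weeks on the x-axis, one line per weekday (category order is preserved)
new_df['units'].unstack('day of the week').plot(marker='o')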

Create and append rows based on average of previous rows and condition columns

I'm working on a dataframe named df that contains a year of daily information for a float variable (balance) for many account values (used as the main key). I'm trying to create a new column expected_balance by matching the dates of previous months, calculating an average and using it as the expected future value. I'll explain in detail now:
The dataset is generated by appending and parsing multiple json values, and once I finish working on it I get this:
              date  balance account  day  month  year   fdate
0       2018-04-13   470.57   SP014   13      4  2018  201804
1       2018-04-14   375.54   SP014   14      4  2018  201804
2       2018-04-15   375.54   SP014   15      4  2018  201804
3       2018-04-16   229.04   SP014   16      4  2018  201804
4       2018-04-17   216.62   SP014   17      4  2018  201804
...            ...      ...     ...  ...    ...   ...     ...
414857  2019-02-24   381.26   KO012   24      2  2019  201902
414858  2019-02-25   181.26   KO012   25      2  2019  201902
414859  2019-02-26   160.82   KO012   26      2  2019  201902
414860  2019-02-27     0.82   KO012   27      2  2019  201902
414861  2019-02-28   109.50   KO012   28      2  2019  201902
Each account value has 365 rows (a starting date when the information was obtained plus a year of info), resampled by day. After that, I'm splitting this dataframe into train and test: train consists of all values except the last 2 months of information, and test is those last 2 months (the last month is not necessarily full; if the last/max date value is 20-04-2019, then train runs from 20-04-2018 to 28-02-2019 and test from 01-03-2019 to 20-04-2019). This is how I manage it:
df_test_1 = df[df.fdate==df.groupby('account').fdate.transform('max')].copy()
dft = df.drop(df_test_1.index)
df_test_2 = dft[dft.fdate==dft.groupby('account').fdate.transform('max')].copy()
df_train = dft.drop(df_test_2.index)
df_test = pd.concat([df_test_2,df_test_1])
#print("Shape df: ",df.shape) #for validation purposes
#print("Shape test: ",df_test.shape) #for validation purposes
#print("Shape train: ",df_train.shape) #for validation purposes
What I need to do now is create a new column exp_bal (expected balance) for each date in df_test, calculated by averaging all train values for that particular day of the month (this is the method requested, so I must follow the instructions).
Here is an example of the expected output/result. I'm only printing account AA000's values for a specific day of the last 2 months (suppose these values always repeat for the other months):
        date  balance account  day  month  year   fdate
...      ...      ...     ...  ...    ...   ...     ...
0 2019-03-20   200.00   AA000   20      3  2019  201903
1 2019-04-20   100.00   AA000   20      4  2019  201904
I should be able to use this information to append a new column for each day that is the average of the same-day values across all months of df_train:
         date  balance account  day  month  year   fdate  exp_bal
0  2018-05-20   470.57   AA000   20      5  2018  201805   150.00
30 2019-06-20   381.26   AA000   20      6  2019  201906   150.00
So then I can calculate an MSE for that prediction for that account.
First of all, I'm using this to iterate over each account:
ids = list(df['account'].unique())
for i in range(0, len(ids)):
    dft_train = df_train[df_train['account'] == ids[i]]
    dft_test = df_test[df_test['account'] == ids[i]]
    first_date = min(dft_test['date'])
    last_date = max(dft_test['date'])
    dft_train = dft_train.set_index('date')
    dft_test = dft_test.set_index('date')
And after this I'm lost on how to use the dft_train values to create this average for a given day that will be appended in a new column in dft_test.
I appreciate any help or suggestions; feel free to ask for clarification/more info, I'll gladly edit this. Thanks in advance!
Not sure if it's the only question you have with the above, but this is how to calculate the expected balance of the train data:
import pandas as pd, numpy as np
# make test data
n = 60
df = pd.DataFrame({'Date': np.tile(pd.date_range('2018-01-01', periods=n).values, 2),
                   'Account': np.repeat(['A', 'B'], n),
                   'Balance': range(2 * n)})
df['Day'] = df.Date.dt.day
# calculate expected balance
df['exp_bal'] = df.groupby(['Account', 'Day']).Balance.transform('mean')
# example output for day 5
print(df[df.Day == 5])
Output:
Date Account Balance Day exp_bal
4 2018-01-05 A 4 5 19.5
35 2018-02-05 A 35 5 19.5
64 2018-01-05 B 64 5 79.5
95 2018-02-05 B 95 5 79.5
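As for getting those per-day averages onto df_test (the step the question is stuck on), a hedged sketch using the question's own column names would compute the means on df_train only and merge them into the test frame:
# per-account, per-day-of-month averages from the train data only
train_means = (df_train.groupby(['account', 'day'])['balance']
                       .mean()
                       .rename('exp_bal')
                       .reset_index())
# attach the expected balance to each matching test row
df_test = df_test.merge(train_means, on=['account', 'day'], how='left')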

How to subtract the mean of past calendar weeks from the current value?

I have a dataframe df_pct_Max with the following shape:
Date        Value1  Value2
01.01.2015       5       6
08.01.2015       3       2
...            ...     ...
28.01.2017       7       8
and I would like to calculate the average per calendar week and then subtract it from the actual values of that calendar week.
I have created a dataframe with the average per calendar week as follows:
df_weekly_avg_Max = df_pct_Max.groupby(df_pct_Max.index.week).mean()
This results in a dataframe df_weekly_avg_Max:
KW  Value1  Value2
1      3.5     4.3
2      4       3
…      …       …
52     8.33    6.2
Now I am trying to subtract df_weekly_avg_Max from df_pct_Max, and I would like to do this by calendar week.
I have tried adding a column 'KW' and then
dfresult = df_pct_Max.sub(df_weekly_avg_Max, axis='KW')
but I am getting errors there.
Is there also a way of doing this on a rolling basis (subtracting the average of calendar week 1 over the past 3 years from calendar week 1 of 2015, then 2016, ...)?
Could someone please help with this issue?
I have found a solution for the whole dataframe.
I added a column 'KW' for the calendar week and then performed a groupby on it with a lambda function that subtracts the mean of calendar week "1" from the current values of calendar week "1", and so on for every week:
df_pct_Max ['KW'] = df_pct_Max.index.week
dfresult = df_pct_Max.groupby(by='KW').transform(lambda x: x-x.mean())
This works for me.
It would have been nicer to be able to adjust the timeframe of the mean, e.g. subtracting from the current calendar week "1" value the mean of calendar week one over the past 3 years or so, but that seems rather complicated, and this solution works for the current analysis.
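For that rolling variant (subtracting the mean of the same calendar week over the past 3 years), a sketch along these lines might work; it assumes a DatetimeIndex and pandas >= 1.1 for isocalendar():
df_pct_Max['KW'] = df_pct_Max.index.isocalendar().week
df_pct_Max['Year'] = df_pct_Max.index.year

# mean of each calendar week per year
weekly = df_pct_Max.groupby(['Year', 'KW'])[['Value1', 'Value2']].mean()

# within each calendar week, average the previous 3 years; shift(1)
# keeps the current year out of its own average
past3 = (weekly.sort_index()
               .groupby(level='KW')
               .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean()))
deviation = weekly - past3   # NaN in the first year, where no history exists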
This answer isn't clean, as it doesn't make the best use of pandas, but I also don't think it will be slow (that depends on how large your dataframe is). The basic idea is to build up a list of the means, repeated once for each day, so you can subtract them directly.
CODE:
from collections import Counter
import pandas as pd
import numpy as np
#Build up example data frame
num_days = 15
dates = pd.date_range('1/1/2015', periods=num_days, freq='D')
val1s = np.random.randint(1, 31, num_days)  # randint's upper bound is exclusive
val2s = np.random.randint(1, 31, num_days)
df_pct_MAX = pd.DataFrame({'Date': dates, 'Value1': val1s, 'Value2': val2s})
df_pct_MAX['Day'] = df_pct_MAX['Date'].dt.day_name()
df_pct_MAX['Week'] = df_pct_MAX['Date'].dt.isocalendar().week
#OPs logic to get means
df_weekly_avg_Max = df_pct_MAX.groupby(df_pct_MAX['Week']).mean(numeric_only=True)
#Build up a list of the means repeated once for each day in that week
mean_fields = ['Value1', 'Value2']  #<-- only hardcoded portion
means_dict = {k: list(df_weekly_avg_Max[k]) for k in mean_fields}  #<-- convert means into lists keyed by field
week_counts = Counter(df_pct_MAX['Week']).values()  #<-- count how many days are represented in each week
#Build up a dict keyed by field with the means repeated the correct number of times
means = {k: [means_dict[k][i] for i, count in enumerate(week_counts)
             for x in range(count)] for k in mean_fields}
#Assign a new column to the means for each field (not necessary, just to show done correctly)
for k in mean_fields:
    df_pct_MAX[k + 'Mean'] = means[k]
print(df_pct_MAX)
OUTPUT:
Date Value1 Value2 Day Week Value1Mean Value2Mean
0 2015-01-01 12 19 Thursday 1 9.000000 19.250000
1 2015-01-02 15 27 Friday 1 9.000000 19.250000
2 2015-01-03 2 30 Saturday 1 9.000000 19.250000
3 2015-01-04 7 1 Sunday 1 9.000000 19.250000
4 2015-01-05 6 20 Monday 2 17.571429 14.142857
5 2015-01-06 9 24 Tuesday 2 17.571429 14.142857
6 2015-01-07 25 17 Wednesday 2 17.571429 14.142857
7 2015-01-08 22 8 Thursday 2 17.571429 14.142857
8 2015-01-09 30 7 Friday 2 17.571429 14.142857
9 2015-01-10 10 1 Saturday 2 17.571429 14.142857
10 2015-01-11 21 22 Sunday 2 17.571429 14.142857
11 2015-01-12 23 29 Monday 3 23.750000 19.750000
12 2015-01-13 23 16 Tuesday 3 23.750000 19.750000
13 2015-01-14 21 17 Wednesday 3 23.750000 19.750000
14 2015-01-15 28 17 Thursday 3 23.750000 19.750000
