Evaluation of a data set with conditional selection of columns - python

I want to evaluate a data set with precipitation data. The data is available as a CSV file, which I have read in with pandas as a DataFrame. This gives the following table:
       year  month  day      value
0      1981      1    1   0.522592
1      1981      1    2   2.692495
2      1981      1    3   0.556698
3      1981      1    4   0.000000
4      1981      1    5   0.000000
...     ...    ...  ...        ...
43824  2100     12   27   0.000000
43825  2100     12   28   0.185120
43826  2100     12   29  10.252080
43827  2100     12   30  13.389290
43828  2100     12   31   3.523566
Now I want to convert the daily precipitation values into monthly precipitation values, for each month (for this I need the sum over all days of a month). I probably need a loop or something similar, but I do not know how to proceed. Maybe via a conditional selection over 'year' and 'month'?!
I would be very happy about feedback! :)
That's what I have tried so far:
for i in range(len(dataframe)):
    print(dataframe.loc[i, 'year'], dataframe.loc[i, 'month'])

I would start out by making a single column with the date:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
From here you can make the date the index:
df.set_index('date', inplace=True)
# I'll drop the unneeded year, month, and day columns as well.
df = df[['value']]
My data now looks like:
               value
date
1981-01-01  0.522592
1981-01-02  2.692495
1981-01-03  0.556698
1981-01-04  0.000000
1981-01-05  0.000000
From here, let's try resampling the data!
# let's do a 2-day sum. To do monthly, you'd replace '2d' with 'M'.
df.resample('2d').sum()
Output:
               value
date
1981-01-01  3.215087
1981-01-03  0.556698
1981-01-05  0.000000
Hopefully this gives you something to start with~

Have you tried groupby?
df.groupby(['year', 'month'])['value'].agg('sum')
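That groupby approach can be sketched end-to-end on a tiny made-up frame (the column names follow the question, but the values here are invented):

```python
import pandas as pd

# Tiny sample in the same shape as the question's dataframe (values invented).
dataframe = pd.DataFrame({
    'year':  [1981, 1981, 1981, 1981, 1981],
    'month': [1, 1, 1, 2, 2],
    'day':   [1, 2, 3, 1, 2],
    'value': [0.5, 2.7, 0.6, 1.0, 2.0],
})

# One row per (year, month) holding the summed daily precipitation.
monthly = dataframe.groupby(['year', 'month'], as_index=False)['value'].sum()
print(monthly)
```

With as_index=False the result keeps year and month as ordinary columns instead of a MultiIndex, which is often easier to work with downstream.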

calc churn rate in pandas

I have a sales dataset (simplified) with sales from existing customers (first_order = 0):
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date': ['2020-06-30 00:00:00', '2020-05-05 00:00:00', '2020-04-10 00:00:00', '2020-02-26 00:00:00'],
                   'email': ['1#abc.de', '2#abc.de', '3#abc.de', '1#abc.de'],
                   'first_order': [1, 1, 1, 1],
                   'Last_Order_Date': ['2020-06-30 00:00:00', '2020-05-05 00:00:00', '2020-04-10 00:00:00', '2020-02-26 00:00:00']
                   })
I would like to analyze how many existing customers we lose per month.
My idea is to
group (count) by month and
then count how many made their last purchase in the following months, which gives me a churn cross table where I can see that e.g. we had 300 purchases in January, and 10 of them bought for the last time in February.
Like this:
Column B is the total number of repeat customers, and column C onwards is the last month they bought something.
E.g. we had 2400 customers in January; 677 of them made their last purchase in that month, 203 more followed in February, etc.
I guess I could first group the total number of sales per month and then group a second dataset by Last_Order_Date and filter by month,
but I guess there is a handy pandas way?! :)
Any ideas?
Thanks!
The code below identifies how many purchases were made in each month.
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.groupby(pd.Grouper(freq="M")).size()
O/P:
Date
2020-02-29 1
2020-03-31 0
2020-04-30 1
2020-05-31 1
2020-06-30 1
Freq: M, dtype: int64
I couldn't fully work out the required data and explanation, but this could be your starting point. Please let me know if it helped you in any way.
Update-1:
df.pivot_table(index='Date', columns='email', values='Last_Order_Date', aggfunc='count')
Output:
email 1#abc.de 2#abc.de 3#abc.de
Date
2020-02-26 00:00:00 1.0 NaN NaN
2020-04-10 00:00:00 NaN NaN 1.0
2020-05-05 00:00:00 NaN 1.0 NaN
2020-06-30 00:00:00 1.0 NaN NaN
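For the cross table itself (first-purchase month in rows, last-purchase month in columns), pd.crosstab is one option. A minimal sketch with invented data — the first_order_date and last_order_date columns and all values here are hypothetical, not from the question:

```python
import pandas as pd

# Hypothetical per-customer data: first and last order date for each customer.
df = pd.DataFrame({
    'email': ['a#abc.de', 'b#abc.de', 'c#abc.de', 'd#abc.de'],
    'first_order_date': ['2020-01-10', '2020-01-20', '2020-01-25', '2020-02-05'],
    'last_order_date':  ['2020-01-10', '2020-02-15', '2020-03-01', '2020-02-05'],
})
df['first_month'] = pd.to_datetime(df['first_order_date']).dt.to_period('M')
df['last_month'] = pd.to_datetime(df['last_order_date']).dt.to_period('M')

# Rows: month of first purchase; columns: month of last purchase.
churn = pd.crosstab(df['first_month'], df['last_month'])
print(churn)
```

Each cell then reads as "of the customers who started in month R, this many bought for the last time in month C", which matches the churn cross table described in the question.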

Incorrect Parsing of dates in pandas dataframe

Hello, I am trying to change my dataframe dates into a format I can use to extract useful information.
The dataset comes with a 'week' feature that is in the form DD/MM/YY as follows:
In [128]: df_train[['week', 'units_sold']]
Out[128]:
week units_sold
0 17/01/11 20
1 17/01/11 28
2 17/01/11 19
3 17/01/11 44
4 17/01/11 52
I have changed the dates as follows:
df_train['new_date'] = pd.to_datetime(df_train['week'])
new_date units_sold
0 2011-01-17 20.0
1 2011-01-17 28.0
2 2011-01-17 19.0
3 2011-01-17 44.0
4 2011-01-17 52.0
Using the 'new_date' feature I created, I did the following for some information extraction:
df_train['weekday'] = df_train['new_date'].dt.weekofyear #week of the year
df_train['QTR'] = df_train['new_date'].apply(lambda x: x.quarter) #current quarter of the year
df_train['month'] = df_train['new_date'].apply(lambda x: x.month) #current month
df_train['year'] = df_train['new_date'].dt.year #current year
However, when reviewing my data I ran into some errors. For example, a certain date in my dataset is 07/02/11, which should translate to a month of 2, except my parsing shows that the month is 7, which I know is incorrect: see entry 3483.
Out[127]:
week month
18 17/01/11 1
1173 24/01/11 1
2328 31/01/11 1
3483 07/02/11 7
4638 14/02/11 2
Can anyone tell me where I went wrong?
Any help is appreciated!
Use dayfirst=True parameter:
df_train['new_date'] = pd.to_datetime(df_train['week'], dayfirst=True)
Then use the .dt accessor instead of apply to improve performance, because apply loops under the hood:
df_train['weekday'] = df_train['new_date'].dt.weekofyear #week of the year
df_train['QTR'] = df_train['new_date'].dt.quarter #current quarter of the year
df_train['month'] = df_train['new_date'].dt.month #current month
df_train['year'] = df_train['new_date'].dt.year #current year
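A quick scalar check of what dayfirst changes, using the problematic date from the question:

```python
import pandas as pd

# Default parsing is month-first, so 07/02/11 is read as 2011-07-02 (July).
default = pd.to_datetime('07/02/11')
# dayfirst=True reads DD/MM/YY, giving 2011-02-07 (February) as intended.
fixed = pd.to_datetime('07/02/11', dayfirst=True)
print(default.month, fixed.month)
```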

Create and append rows based on average of previous rows and condition columns

I'm working on a dataframe named df that contains a year of daily information for a float variable (balance) for many account values (used as main key). I'm trying to create a new column expected_balance by matching the date of previous months, calculating an average and using it as expected future value. I'll explain in detail now:
The dataset is generated after appending and parsing multiple json values, once I finish working on it, I get this:
              date  balance account  day  month  year   fdate
0       2018-04-13   470.57   SP014   13      4  2018  201804
1       2018-04-14   375.54   SP014   14      4  2018  201804
2       2018-04-15   375.54   SP014   15      4  2018  201804
3       2018-04-16   229.04   SP014   16      4  2018  201804
4       2018-04-17   216.62   SP014   17      4  2018  201804
...            ...      ...     ...  ...    ...   ...     ...
414857  2019-02-24   381.26   KO012   24      2  2019  201902
414858  2019-02-25   181.26   KO012   25      2  2019  201902
414859  2019-02-26   160.82   KO012   26      2  2019  201902
414860  2019-02-27     0.82   KO012   27      2  2019  201902
414861  2019-02-28   109.50   KO012   28      2  2019  201902
Each account value has 365 values (a starting date when the information was obtained and a year of info), resampled by day. After that, I'm splitting this dataframe into train and test. Train consists of all values except for the last 2 months of information, and test is these last 2 months (the last month is not necessarily full: if the last/max date value is 20-04-2019, then train will be from 20-04-2018 to 28-02-2019 and test from 01-03-2019 to 20-04-2019). This is how I manage it:
df_test_1 = df[df.fdate==df.groupby('account').fdate.transform('max')].copy()
dft = df.drop(df_test_1.index)
df_test_2 = dft[dft.fdate==dft.groupby('account').fdate.transform('max')].copy()
df_train = dft.drop(df_test_2.index)
df_test = pd.concat([df_test_2,df_test_1])
#print("Shape df: ",df.shape) #for validation purposes
#print("Shape test: ",df_test.shape) #for validation purposes
#print("Shape train: ",df_train.shape) #for validation purposes
What I need to do now is create a new column exp_bal (expected balance) for each date in the df_test that's calculated by averaging all train values for the particular day (this is the method requested so I must follow the instructions).
Here is an example of an expected output/result; I'm only printing account AA000's values for a specific day for the last 2 train months (suppose these values always repeat for the other 8 months):
date balance account day month year fdate
... ... ... ... ... ... ... ...
0 2019-03-20 200.00 AA000 20 3 2019 201903
1 2019-04-20 100.00 AA000 20 4 2019 201904
I should be able to use this information to append a new column for each day that is the average of the same day value for all months of df_train
date balance account day month year fdate exp_bal
0 2018-05-20 470.57 AA000 20 5 2018 201805 150.00
30 2019-06-20 381.26 AA000 20 6 2019 201906 150.00
So then I can calculate an MSE for that prediction for that account.
First of all I'm using this to iterate over each account:
ids = list(df['account'].unique())
for i in range(0, len(ids)):
    dft_train = df_train[df_train['account'] == ids[i]]
    dft_test = df_test[df_test['account'] == ids[i]]
    first_date = min(dft_test['date'])
    last_date = max(dft_test['date'])
    dft_train = dft_train.set_index('date')
    dft_test = dft_test.set_index('date')
And after this I'm lost on how to use the dft_train values to create this average for a given day, which will be appended as a new column in dft_test.
I appreciate any help or suggestion, also feel free to ask for clarification/ more info, I'll gladly edit this. Thanks in advance!
Not sure if it's the only question you have with the above, but this is how to calculate the expected balance of the train data:
import pandas as pd
import numpy as np
# make test data
n = 60
df = pd.DataFrame({'Date': np.tile(pd.date_range('2018-01-01', periods=n).values, 2),
                   'Account': np.repeat(['A', 'B'], n),
                   'Balance': range(2*n)})
df['Day'] = df.Date.dt.day
# calculate expected balance
df['exp_bal'] = df.groupby(['Account', 'Day']).Balance.transform('mean')
# example output for day 5
print(df[df.Day==5])
Output:
Date Account Balance Day exp_bal
4 2018-01-05 A 4 5 19.5
35 2018-02-05 A 35 5 19.5
64 2018-01-05 B 64 5 79.5
95 2018-02-05 B 95 5 79.5
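To get those per-day averages from the train frame onto the test frame (the step the loop above was heading toward), one option is to aggregate and then merge. A sketch with invented data; the column names follow the question but the values are made up:

```python
import pandas as pd

# Invented train data: the same calendar day observed in two different months.
df_train = pd.DataFrame({
    'account': ['AA000', 'AA000'],
    'day': [20, 20],
    'balance': [200.0, 100.0],
})
# Invented test rows that should receive the expected balance.
df_test = pd.DataFrame({
    'account': ['AA000', 'AA000'],
    'day': [20, 20],
    'balance': [470.57, 381.26],
})

# Average train balance per (account, day)...
exp = (df_train.groupby(['account', 'day'], as_index=False)['balance']
               .mean()
               .rename(columns={'balance': 'exp_bal'}))
# ...merged onto the matching test rows.
df_test = df_test.merge(exp, on=['account', 'day'], how='left')
print(df_test)
```

From there the MSE per account is just the mean squared difference between balance and exp_bal.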

Convert daily data to quarterly data and convert back after forecasting

My idea is about forecasting data at different time periods:
For example:
                A  day  month  year  quarter  week
Date
2016-01-04  36.81    4      1  2016        1     1
2016-01-05  35.97    5      1  2016        1     1
2016-01-06  33.97    6      1  2016        1     1
2016-01-07  33.29    7      1  2016        1     1
2016-01-08  33.20    8      1  2016        1     2
2016-01-11  31.42   11      1  2016        1     2
2016-01-12  30.42   12      1  2016        1     2
I have daily data and I want to forecast the data monthly, then convert the monthly forecast back to daily.
The method I used is to get the percentage of each day and month relative to its total.
Here is some code I used:
converted_data = data.groupby( [data['month'], data['day']] )['A'].sum()
average = converted_data/converted_data.sum()
average
which gives the following result:
month day A
1 3 0.002218
4 0.003815
5 0.003801
...
12 26 0.002522
27 0.004764
28 0.004822
29 0.004839
30 0.002277
Using this, when I want to convert back from yearly data to daily, I just multiply the average by the result of the forecast.
But this does not work when I want to convert daily data to quarterly data.
Can anyone suggest an idea on how to do it?
Thank you for your consideration.
Edit
the data I want is the percentage of each day's data relative to its quarter
something like:
A = the total when day is equal to 1 in the data (and likewise for days 2, 3, ...)
#for example my data is
Date value
1/1/2000 50
1/2/2000 50
1/3/2000 40
then A of day 1 is 140
B = the total when quarter is equal to 1 in the data (and likewise for quarters 2, 3, and 4)
#for example my data is
Date value
1/1/2000 4000 #-->quarter 1
1/4/2000 5000 #-->quarter 2
1/7/2000 2000 #-->quarter 3
1/10/2000 1000 #-->quarter 4
1/1/20001 2000 #-->quarter 1
then the average of day 1 relative to its quarter is 140/6000, as A is in quarter one
The data above is what I converted.
First, I receive the input data as daily values; I converted it from a pandas Series to the dataframe shown above, and the day, month, year, quarter, and week columns were extracted in order to group the data. My method works well for converting to year and month.
The reason I do this is that my input is given daily and I want to convert it to yearly in order to do the forecasting.
After the forecasting I will get the forecast value in yearly form, so I want to convert it back to daily, and the method I use is to find each day's portion of the previous data.
I am so sorry for the unclear question.
Again, thanks for your help.
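For the quarterly version of that percentage method, a groupby-transform over (year, quarter) gives each day its share of its own quarter's total. A sketch with invented values:

```python
import pandas as pd

# Invented daily data spanning three quarters of 2000.
df = pd.DataFrame({
    'Date': pd.date_range('2000-01-01', periods=6, freq='45D'),
    'value': [4000, 5000, 2000, 1000, 2000, 3000],
})
df['year'] = df['Date'].dt.year
df['quarter'] = df['Date'].dt.quarter

# Each day's value divided by the total of its (year, quarter) group.
df['share_of_quarter'] = (
    df['value'] / df.groupby(['year', 'quarter'])['value'].transform('sum')
)
print(df[['Date', 'value', 'quarter', 'share_of_quarter']])
```

Multiplying a quarterly forecast by these stored shares distributes it back to daily values, mirroring the yearly/monthly approach described above.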

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate that, so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DatetimeIndex.
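One way to get PreviousMonthMean pegged to each row's own date (rather than to the start of the frame, as resample is) is to evaluate, per row, the mean of x over the half-open window [date - 1 calendar month, date). A sketch on the same toy data; it is O(n^2), so it is a starting point rather than an efficient solution, but it reproduces the table above:

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

rng = pd.date_range('2017-01-03', periods=20, freq='8D')
df = pd.DataFrame({'x': rng.day}, index=rng)

def prev_month_mean(ts):
    # Mean of x over [ts - 1 calendar month, ts); NaN if the window is empty.
    window = df.loc[(df.index >= ts - DateOffset(months=1)) & (df.index < ts), 'x']
    return window.mean()

df['PreviousMonthMean'] = [prev_month_mean(ts) for ts in df.index]
print(df)
```

Using DateOffset(months=1) instead of a fixed '30D' window keeps each lookback anchored to the same day of the previous month, which avoids the drift described in the question.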
