I have a dataframe of the following form:
datetime JD YEAR
2000-01-01 1 2000
2000-01-02 2 2000
2000-01-03 3 2000
2000-01-04 4 2000
2000-01-05 5 2000
2000-01-06 6 2000
2000-01-07 7 2000
2000-01-08 8 2000
2000-01-09 9 2000
...
2010-12-31 365 2014
The JD value is the Julian day, i.e. it starts at 1 on Jan 1st of each year (going up to 366 for leap years and 365 for others). I would like to reduce the JD value by 1 for each day starting on Feb 29th of each leap year. JD values should not be changed for non-leap years. Here is what I am doing right now:
def reduce_JD(row):
    if calendar.isleap(row.YEAR) & row.JD > 59:
        row.JD = row.JD - 1
    return row
def remove_leap_JD(df):
    # Reduce JD by 1 for each day following Feb 29th
    df.apply(reduce_JD, axis=1)
    return df
pdf = remove_leap_JD(pdf)
However, I do not see any change in JD values for leap years. What am I doing wrong?
--EDIT:
datetime is the index column
There are two issues:
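In reduce_JD(), and should be used instead of &. The & operator binds more tightly than >, so calendar.isleap(row.YEAR) & row.JD > 59 is evaluated as (calendar.isleap(row.YEAR) & row.JD) > 59. If you keep &, the second part of the condition, row.JD > 59, must be wrapped in parentheses. Note that, using df.iloc[59] (Feb 29th, 2000) as an example: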
calendar.isleap(df.iloc[59].YEAR) & (df.iloc[59].JD > 59)
# True
calendar.isleap(df.iloc[59].YEAR) & df.iloc[59].JD > 59
# False!
The apply function returns a new DataFrame instead of modifying the input in-place. Therefore, in remove_leap_JD(), the code should be changed to something like:
df = df.apply(reduce_JD, axis=1)
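As an alternative to the row-wise apply, the same adjustment can be sketched without a Python-level loop. This is only a sketch: it assumes that datetime is the index (as stated in the question's edit) and that it is a DatetimeIndex, and the four sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; datetime is the index, as in the question
idx = pd.to_datetime(["2000-02-28", "2000-02-29", "2000-03-01", "2001-03-01"])
pdf = pd.DataFrame(
    {"JD": [59, 60, 61, 60], "YEAR": [2000, 2000, 2000, 2001]},
    index=idx,
)

# Rows from Feb 29th (JD 60) onwards, restricted to leap years.
# Each comparison is parenthesised, so & combines two boolean arrays.
mask = (pdf["JD"] > 59) & pdf.index.is_leap_year

# Shift only the selected rows down by one day
pdf.loc[mask, "JD"] -= 1
print(pdf)
```

Because the assignment goes through .loc on pdf itself, there is no returned copy to forget about.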
I have a datetime column of elements which are something like 2010-05-31 00:00:00 and continue for various months. I want to find a way to extract the hours per month for each datetime object as I have another column I want to divide by the hours per month. Is there a way to do this cleanly?
The values are between a couple of months like this:
Time Value
2000-01-02 00:00:00 200
2000-01-03 00:00:00 300
...
I would want another column that is basically Value / total hours in the month of Time, so for January it would be 200/744, 300/744 (31 days = 744 hours), etc., and this continues for February, March, etc.
pandas comes with all you need; use .days_in_month and multiply by 24 to get hours, e.g.
df['Time'] = pd.to_datetime(df['Time'])
df['hours_in_month'] = df['Time'].dt.days_in_month * 24
print(df)
Time Value hours_in_month
0 2000-01-02 200 744
1 2000-01-03 300 744
2 2000-02-01 400 696
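To get the column the question actually asks for, the value can then be divided by the hours in its month. A minimal sketch, rebuilding the three sample rows shown above; the column name value_per_hour is just an illustrative choice:

```python
import pandas as pd

# Sample rows taken from the output above
df = pd.DataFrame(
    {
        "Time": ["2000-01-02 00:00:00", "2000-01-03 00:00:00", "2000-02-01 00:00:00"],
        "Value": [200, 300, 400],
    }
)
df["Time"] = pd.to_datetime(df["Time"])

# Hours in the month of each timestamp (February 2000 has 29 days)
df["hours_in_month"] = df["Time"].dt.days_in_month * 24

# Value divided by the total hours of its month
df["value_per_hour"] = df["Value"] / df["hours_in_month"]
print(df)
```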
What would be the correct way to show the average sales volume in Carlisle city for each year between 2010 and 2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({'Date': ['01/09/2009','01/10/2009','01/11/2009','01/12/2009','01/01/2010','01/02/2010','01/03/2010','01/04/2010','01/05/2010','01/06/2010','01/07/2010','01/08/2010','01/09/2010','01/10/2010','01/11/2010','01/12/2010','01/01/2011','01/02/2011'],
'RegionName': ['Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle'],
'SalesVolume': [118,137,122,132,83,81,105,114,110,106,137,130,129,121,129,100,84,62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(lambda x:
    '{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I try to run this code, it doesn't filter the 'Date' column to only calculate the average SalesVolume for the years beginning in '01/01/2010' and ending at '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me to answer this question correctly?
This is the result I've got
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
& (df["RegionName"] == "Carlisle")] \
.groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-01-01 112.083333
2011-01-01 73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
Going further:
The only difference from @nocibambi's answer is the groupby parameter, in particular the freq argument of pd.Grouper. Imagine your accounting year starts on the 1st of September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for the full list of freq aliases and anchoring suffixes.
You can access year with the datetime accessor:
df[
    (df["RegionName"] == "Carlisle")
    & (df["Date"].dt.year >= 2010)
    & (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64
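For reference, here is a self-contained sketch of the whole pipeline on the sample data from the question. One assumption is made loudly: the date strings look like day/month/year (one row per month, always on the 1st), so they are parsed with an explicit format. Without it, pd.to_datetime would read '01/09/2009' as January 9th; the yearly means happen to come out the same, but any monthly grouping would not.

```python
import pandas as pd

# Rebuild the question's sample: monthly rows from Sep 2009 to Feb 2011
dates = pd.date_range("2009-09-01", periods=18, freq="MS").strftime("%d/%m/%Y")
df = pd.DataFrame(
    {
        "Date": dates,
        "RegionName": ["Carlisle"] * 18,
        "SalesVolume": [118, 137, 122, 132, 83, 81, 105, 114, 110,
                        106, 137, 130, 129, 121, 129, 100, 84, 62],
    }
)

# Assumption: the strings are day/month/year
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Keep Carlisle rows whose year lies in 2010..2020 (inclusive)
mask = (df["RegionName"] == "Carlisle") & df["Date"].dt.year.between(2010, 2020)
subset = df.loc[mask]

# Mean of SalesVolume only, one value per calendar year
yearly_mean = subset.groupby(subset["Date"].dt.year)["SalesVolume"].mean()
print(yearly_mean)
```

Selecting ["SalesVolume"] before calling mean() is what keeps the other columns out of the result.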
I'm working on a dataframe named df that contains a year of daily information for a float variable (balance) for many account values (used as main key). I'm trying to create a new column expected_balance by matching the date of previous months, calculating an average and using it as expected future value. I'll explain in detail now:
The dataset is generated after appending and parsing multiple json values, once I finish working on it, I get this:
date balance account day month year fdate
0 2018-04-13 470.57 SP014 13 4 2018 201804
1 2018-04-14 375.54 SP014 14 4 2018 201804
2 2018-04-15 375.54 SP014 15 4 2018 201804
3 2018-04-16 229.04 SP014 16 4 2018 201804
4 2018-04-17 216.62 SP014 17 4 2018 201804
... ... ... ... ... ... ... ...
414857 2019-02-24 381.26 KO012 24 2 2019 201902
414858 2019-02-25 181.26 KO012 25 2 2019 201902
414859 2019-02-26 160.82 KO012 26 2 2019 201902
414860 2019-02-27 0.82 KO012 27 2 2019 201902
414861 2019-02-28 109.50 KO012 28 2 2019 201902
Each account value has 365 values (a starting date when the information was obtained and a year of info), resampled by day. After that, I'm splitting this dataframe into train and test. Train consists of all previous values except for the last 2 months of information, and test is these last 2 months (the last month is not necessarily full: if the last/max date value is 20-04-2019, then train will be from 20-04-2018 to 28-02-2019 and test from 01-03-2019 to 20-04-2019). This is how I manage it:
df_test_1 = df[df.fdate==df.groupby('account').fdate.transform('max')].copy()
dft = df.drop(df_test_1.index)
df_test_2 = dft[dft.fdate==dft.groupby('account').fdate.transform('max')].copy()
df_train = dft.drop(df_test_2.index)
df_test = pd.concat([df_test_2,df_test_1])
#print("Shape df: ",df.shape) #for validation purposes
#print("Shape test: ",df_test.shape) #for validation purposes
#print("Shape train: ",df_train.shape) #for validation purposes
What I need to do now is create a new column exp_bal (expected balance) for each date in the df_test that's calculated by averaging all train values for the particular day (this is the method requested so I must follow the instructions).
Here is an example of the expected output/result. I'm only printing account AA000's values for a specific day for the last 2 train months (suppose these values always repeat for the other 8 months):
date balance account day month year fdate
... ... ... ... ... ... ... ...
0 2019-03-20 200.00 AA000 20 3 2019 201903
1 2019-04-20 100.00 AA000 20 4 2019 201904
I should be able to use this information to append a new column for each day that is the average of the same day value for all months of df_train
date balance account day month year fdate exp_bal
0 2018-05-20 470.57 AA000 20 5 2018 201805 150.00
30 2019-06-20 381.26 AA000 20 6 2019 201906 150.00
Then I can calculate an MSE for that prediction for that account.
First of all I'm using this to iterate over each account:
ids = list(df['account'].unique())
for i in range(0, len(ids)):
    # Select the train and test rows for the current account
    dft_train = df_train[df_train['account'] == ids[i]]
    dft_test = df_test[df_test['account'] == ids[i]]
    first_date = min(dft_test['date'])
    last_date = max(dft_test['date'])
    dft_train = dft_train.set_index('date')
    dft_test = dft_test.set_index('date')
And after this I'm lost on how to use the dft_train values to create this average for a given day that will be appended in a new column in dft_test.
I appreciate any help or suggestion, also feel free to ask for clarification/ more info, I'll gladly edit this. Thanks in advance!
Not sure if it's the only question you have with the above, but this is how to calculate the expected balance of the train data:
import pandas as pd, numpy as np
# make test data
n = 60
df = pd.DataFrame({'Date': np.tile(pd.date_range('2018-01-01',periods=n).values, 2), 'Account': np.repeat(['A', 'B'], n), 'Balance': range(2*n)})
df['Day'] = df.Date.dt.day
# calculate expected balance
df['exp_bal'] = df.groupby(['Account', 'Day']).Balance.transform('mean')
# example output for day 5
print(df[df.Day==5])
Output:
Date Account Balance Day exp_bal
4 2018-01-05 A 4 5 19.5
35 2018-02-05 A 35 5 19.5
64 2018-01-05 B 64 5 79.5
95 2018-02-05 B 95 5 79.5
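The snippet above averages over the whole frame. To follow the question more closely (averages computed on the train rows only, then attached to the test rows), one possible sketch is to aggregate the train set per account and day and merge the result onto the test set. The tiny df_train and df_test frames below are made-up stand-ins, not the asker's data:

```python
import pandas as pd

# Hypothetical train/test rows for a single account
df_train = pd.DataFrame(
    {
        "account": ["AA000", "AA000", "AA000", "AA000"],
        "day": [20, 20, 21, 21],
        "balance": [200.0, 100.0, 50.0, 70.0],
    }
)
df_test = pd.DataFrame(
    {
        "account": ["AA000", "AA000"],
        "day": [20, 21],
        "balance": [160.0, 60.0],
    }
)

# Mean train balance for every (account, day-of-month) pair
day_means = (
    df_train.groupby(["account", "day"], as_index=False)["balance"]
    .mean()
    .rename(columns={"balance": "exp_bal"})
)

# Attach the expected balance to each test row; unmatched days become NaN
df_test = df_test.merge(day_means, on=["account", "day"], how="left")

# Mean squared error of the prediction, per account
sq_err = (df_test["balance"] - df_test["exp_bal"]) ** 2
mse = sq_err.groupby(df_test["account"]).mean()
print(df_test)
print(mse)
```

This avoids the explicit loop over accounts, since both the averaging and the error are grouped by account.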
My idea is about forecasting data over different time periods.
For example:
A day month year quarter week
Date
2016-01-04 36.81 4 1 2016 1 1
2016-01-05 35.97 5 1 2016 1 1
2016-01-06 33.97 6 1 2016 1 1
2016-01-07 33.29 7 1 2016 1 1
2016-01-08 33.20 8 1 2016 1 2
2016-01-11 31.42 11 1 2016 1 2
2016-01-12 30.42 12 1 2016 1 2
I have daily data and I want to forecast it at a monthly level, then convert the monthly forecast back to daily.
The method I use is to compute each (month, day) pair's share of the overall total.
Here is the code I used:
converted_data = data.groupby( [data['month'], data['day']] )['A'].sum()
average = converted_data/converted_data.sum()
average
which give the following result:
month day A
1 3 0.002218
4 0.003815
5 0.003801
...
12 26 0.002522
27 0.004764
28 0.004822
29 0.004839
30 0.002277
With this, when I want to convert yearly data back to daily, I just multiply these shares by the forecast result.
But this does not work when I want to convert between daily and quarterly data.
Can anyone suggest how to do it?
Thank you for your consideration.
Edit
The data I want is each day's share relative to its quarter,
something like:
A = the total over all rows where day equals 1 (and likewise for day 2, 3, ...)
#for example my data is
Date value
1/1/2000 50
1/2/2000 50
1/3/2000 40
then A for day 1 is 140
B = the total over all rows where quarter equals 1 (and likewise for quarters 2, 3 and 4)
#for example my data is
Date value
1/1/2000 4000 #-->quarter 1
1/4/2000 5000 #-->quarter 2
1/7/2000 2000 #-->quarter 3
1/10/2000 1000 #-->quarter 4
1/1/2001 2000 #-->quarter 1
then the share of day 1 relative to its quarter is 140/6000, since A falls in quarter 1
The data above is what I converted.
First, I receive the input data as a daily pandas Series and convert it to the dataframe shown above. The day, month, year, quarter and week columns were extracted in order to group the data. My method works well for converting to year and month.
The reason I do this is that my input is daily and I want to convert it to yearly data in order to do the forecasting.
After forecasting, I get the forecast values in yearly form, so I want to convert them back to daily. The way I do that is to use each day's proportion in the previous data.
I am sorry for the unclear question.
Again, thanks for your help.
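One possible reading of this request is: for every daily row, compute its share of the total of the quarter it belongs to, so that a quarterly forecast can later be spread back over the days by multiplying with these shares. The sketch below works under that assumption only. It uses made-up constant data in a column named A with a DatetimeIndex, and groups by actual calendar quarter (year and quarter together) rather than by quarter number across years.

```python
import pandas as pd

# Hypothetical daily data: six months of a constant value
dates = pd.date_range("2016-01-01", "2016-06-30", freq="D")
data = pd.DataFrame({"A": 1.0}, index=dates)

# Calendar quarter of every row, e.g. 2016Q1
quarter = data.index.to_period("Q")

# Each day's share of its own quarter total
quarter_total = data.groupby(quarter)["A"].transform("sum")
data["share_of_quarter"] = data["A"] / quarter_total

# Within every quarter the shares add up to 1
print(data["share_of_quarter"].groupby(quarter).sum())
```

If a seasonal profile across several years is wanted instead, grouping by data.index.quarter alone would pool the same quarter of every year; which of the two is appropriate depends on the forecasting setup.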
I'm trying to create a simple column using Pandas that will calculate the number of days in the year of the adjacent date column.
I've already done this fairly easily for the number of days in the month using the daysinmonth attribute of DatetimeIndex, with the following:
def daysinmonth(row):
    x = pd.DatetimeIndex(row['Date']).daysinmonth
    return x
daysinmonth(df)
I'm having trouble mimicking these results for the year without a nifty pre-defined attribute.
My dataframe looks like the following (sans the Days_in_year column, since I'm trying to create that):
Date Days_in_month Days_in_year
1 2/28/2018 28 365
2 4/14/2019 30 365
3 1/1/2020 31 366
4 2/15/2020 29 366
Thanks to anyone who takes a look!
Take the year modulo 4: a result of 0 means 366 days, anything else means 365. (Note that this does not handle the special cases for century years; see the updated function and the link I provided.)
(pd.to_datetime(df.Date,format='%m/%d/%Y').dt.year%4).eq(0).map({True:366,False:365})
Out[642]:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
You can use this instead, which follows the full leap-year definition (from this site) and is therefore more accurate:
def daysinyear(x):
    if x % 4 == 0:
        if x % 100 == 0:
            if x % 400 == 0:
                return 366  # divisible by 400: leap year
            else:
                return 365  # century year not divisible by 400
        else:
            return 366  # divisible by 4 but not by 100: leap year
    else:
        return 365
pd.to_datetime(df.Date,format='%m/%d/%Y').dt.year.apply(daysinyear)
Out[656]:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
You can also use YearEnd. You'll get a timedelta64 column with this method.
import pandas as pd
from pandas.tseries.offsets import YearEnd
df['Date'] = pd.to_datetime(df.Date)
(df.Date + YearEnd(1)) - (df.Date - YearEnd(1))
1 365 days
2 365 days
3 366 days
4 366 days
Name: Date, dtype: timedelta64[ns]
Here's another way using periods:
df['Date'].dt.to_period('A').dt.to_timestamp('A').dt.dayofyear
Output:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
I would do something like this:
import datetime
import numpy as np
def func(date):
    year = date.year
    begin = datetime.datetime(year, 1, 1)
    # Jan 1st of the following year, so the difference spans the whole year
    end = datetime.datetime(year + 1, 1, 1)
    diff = end - begin
    result = np.timedelta64(diff, "D").astype("int")
    return result
print(func(datetime.datetime(2016,12,31)))
One solution is to take the first day of one year and of the next year. Then calculate the difference. You can then apply this using pd.Series.apply:
def days_in_year(x):
    day1 = x.replace(day=1, month=1)
    day2 = day1.replace(year=day1.year+1)
    return (day2 - day1).days
df['Date'] = pd.to_datetime(df['Date'])
df['Days_in_year'] = df['Date'].apply(days_in_year)
print(df)
Date Days_in_month Days_in_year
1 2018-02-28 28 365
2 2019-04-14 30 365
3 2020-01-01 31 366
4 2020-02-15 29 366
You can use the basic formula to check if a year is a leap year and add the result to 365 to get the number of days in a year.
# Not needed if df['Date'] is already of type datetime
dates = pd.to_datetime(df['Date'])
years = dates.dt.year
ndays = 365 + ((years % 4 == 0) & ((years % 100 != 0) | (years % 400 == 0))).astype(int)
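The same idea can also be written with the is_leap_year attribute of the pandas datetime accessor, which saves spelling out the divisibility rules by hand. A minimal sketch on the four sample dates from the question:

```python
import pandas as pd

# Sample dates from the question, in month/day/year form
df = pd.DataFrame({"Date": ["2/28/2018", "4/14/2019", "1/1/2020", "2/15/2020"]})
dates = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# is_leap_year is a boolean Series; True adds one day to 365
df["Days_in_year"] = 365 + dates.dt.is_leap_year.astype(int)
print(df)
```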