I'm trying to create a simple column using Pandas that will calculate the number of days in the year of the adjacent date column.
I've already done this fairly easily for the number of days in the month, using the daysinmonth attribute of DatetimeIndex:
def daysinmonth(row):
    x = pd.DatetimeIndex(row['Date']).daysinmonth
    return x

daysinmonth(df)
I'm having trouble mimicking these results for the year, since there is no equivalent pre-defined attribute.
My dataframe looks like the following (sans the Days_in_year column, since that's what I'm trying to create):
Date Days_in_month Days_in_year
1 2/28/2018 28 365
2 4/14/2019 30 365
3 1/1/2020 31 366
4 2/15/2020 29 366
Thanks to anyone who takes a look!
Take the year modulo 4: equal to 0 means 366 days, otherwise 365. (Notice this does not cover the century special cases; you can check the updated function below and the link I provided.)
(pd.to_datetime(df.Date,format='%m/%d/%Y').dt.year%4).eq(0).map({True:366,False:365})
Out[642]:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
You can use the following, which is more accurate for deciding leap years (definition from this site):
def daysinyear(x):
    if x % 4 == 0:
        if x % 100 == 0:
            if x % 400 == 0:
                return 366
            else:
                return 365
        else:
            return 366  # divisible by 4 but not by 100: still a leap year
    else:
        return 365

pd.to_datetime(df.Date, format='%m/%d/%Y').dt.year.apply(daysinyear)
Out[656]:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
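A vectorized alternative, assuming a pandas version with the dt.is_leap_year accessor (0.19+): the boolean Series adds directly to 365, so no Python-level apply is needed.
# is_leap_year is True for 366-day years, so 365 + 1 where it holds
365 + pd.to_datetime(df.Date, format='%m/%d/%Y').dt.is_leap_year.astype(int)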
You can also use YearEnd. You'll get a timedelta64 column with this method.
import pandas as pd
from pandas.tseries.offsets import YearEnd
df['Date'] = pd.to_datetime(df.Date)
(df.Date + YearEnd(1)) - (df.Date - YearEnd(1))
1 365 days
2 365 days
3 366 days
4 366 days
Name: Date, dtype: timedelta64[ns]
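If plain integers are preferred over timedeltas, a small follow-up sketch: the .dt.days accessor converts the result.
days = (df.Date + YearEnd(1)) - (df.Date - YearEnd(1))
days.dt.days  # timedelta64[ns] -> int64 number of days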
Here's another way using periods:
df['Date'].dt.to_period('A').dt.to_timestamp('A').dt.dayofyear
Output:
1 365
2 365
3 366
4 366
Name: Date, dtype: int64
I would do something like this:
import datetime
import numpy as np

def func(date):
    year = date.year
    begin = datetime.datetime(year, 1, 1)
    end = datetime.datetime(year + 1, 1, 1)  # first day of the next year, so the difference spans the whole year
    diff = end - begin
    result = np.timedelta64(diff, "D").astype("int")
    return result

print(func(datetime.datetime(2016, 12, 31)))  # 366
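Equivalently, a shorter sketch leaning on the standard library: calendar.isleap already encodes the century rules, and its bool result adds cleanly to an int (func2 here is just an illustrative name).
import calendar

def func2(date):
    # 365, plus 1 when the year is a leap year
    return 365 + calendar.isleap(date.year)

print(func2(datetime.datetime(2016, 12, 31)))  # 366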
One solution is to take the first day of the year and the first day of the next year, then calculate the difference. You can apply this using pd.Series.apply:
def days_in_year(x):
    day1 = x.replace(day=1, month=1)
    day2 = day1.replace(year=day1.year + 1)
    return (day2 - day1).days

df['Date'] = pd.to_datetime(df['Date'])
df['Days_in_year'] = df['Date'].apply(days_in_year)
print(df)
Date Days_in_month Days_in_year
1 2018-02-28 28 365
2 2019-04-14 30 365
3 2020-01-01 31 366
4 2020-02-15 29 366
You can use the basic formula to check if a year is a leap year and add the result to 365 to get the number of days in a year.
# Not needed if df['Date'] is already of type datetime
dates = pd.to_datetime(df['Date'])
years = dates.dt.year
ndays = 365 + ((years % 4 == 0) & ((years % 100 != 0) | (years % 400 == 0))).astype(int)
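As a quick sanity check against the sample data in the question (years 2018, 2019, 2020, 2020):
df['Days_in_year'] = ndays
print(ndays.tolist())  # [365, 365, 366, 366]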
Related
What would be the correct way to show the average sales volume in Carlisle for each year between 2010 and 2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({'Date': ['01/09/2009','01/10/2009','01/11/2009','01/12/2009','01/01/2010','01/02/2010','01/03/2010','01/04/2010','01/05/2010','01/06/2010','01/07/2010','01/08/2010','01/09/2010','01/10/2010','01/11/2010','01/12/2010','01/01/2011','01/02/2011'],
'RegionName': ['Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle'],
'SalesVolume': [118,137,122,132,83,81,105,114,110,106,137,130,129,121,129,100,84,62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I try to run this code, it doesn't filter the 'Date' column to only calculate the average SalesVolume for the years beginning at '01/01/2010' and ending at '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me answer this question correctly?
This is the result I've got
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
& (df["RegionName"] == "Carlisle")] \
.groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-01-01 112.083333
2011-01-01 73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
Going further: the only difference from the answer of @nocibambi is the groupby parameter, particularly the freq argument of pd.Grouper. Imagine your accounting year starts on the 1st of September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for all the freq aliases and anchoring suffixes.
You can access year with the datetime accessor:
df[
(df["RegionName"] == "Carlisle")
& (df["Date"].dt.year >= 2010)
& (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64
Hello, I am trying to change my dataframe dates into a format I can use to extract useful information.
The dataset comes with a 'week' feature in the form DD/MM/YY, as follows:
In [128]: df_train[['week', 'units_sold']]
Out[128]:
week units_sold
0 17/01/11 20
1 17/01/11 28
2 17/01/11 19
3 17/01/11 44
4 17/01/11 52
I have changed the dates as follows:
df_train['new_date'] = pd.to_datetime(df_train['week'])
new_date units_sold
0 2011-01-17 20.0
1 2011-01-17 28.0
2 2011-01-17 19.0
3 2011-01-17 44.0
4 2011-01-17 52.0
Using the 'new_date' feature I created, I did the following for some information extraction:
df_train['weekday'] = df_train['new_date'].dt.weekofyear #week of the year
df_train['QTR'] = df_train['new_date'].apply(lambda x: x.quarter) #current quarter of the year
df_train['month'] = df_train['new_date'].apply(lambda x: x.month) #current month
df_train['year'] = df_train['new_date'].dt.year #current year
However, when reviewing my data I ran into some errors. For example, a certain date in my dataset is 07/02/11, which should translate to a month of 2, except my parsing shows that the month is 7, which I know is incorrect: see entry 3483.
Out[127]:
week month
18 17/01/11 1
1173 24/01/11 1
2328 31/01/11 1
3483 07/02/11 7
4638 14/02/11 2
Can anyone tell me where I went wrong?
Any help is appreciated!
Use the dayfirst=True parameter:
df_train['new_date'] = pd.to_datetime(df_train['week'], dayfirst=True)
Then use the .dt accessor to improve performance, because apply loops under the hood:
df_train['weekday'] = df_train['new_date'].dt.weekofyear #week of the year
df_train['QTR'] = df_train['new_date'].dt.quarter #current quarter of the year
df_train['month'] = df_train['new_date'].dt.month #current month
df_train['year'] = df_train['new_date'].dt.year
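One caveat, assuming pandas 1.1 or later: Series.dt.weekofyear is deprecated there in favor of the ISO calendar accessor, so the first line would become:
df_train['weekday'] = df_train['new_date'].dt.isocalendar().week  # ISO week number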
I have a relatively large data set of weather data spanning 10 years, and I want to group by day of year to get the 10-year low or high for each and every day. To use groupby, I created a column this way:
df['dms'] = df['Date'].dt.strftime('%j')
The thing is, when I use dt.strftime('%j') I get two numbers for the same day, which is weird. For instance, when I filter only by Dec 31st and do value_counts() I get this:
365 363
366 82
Name: dms, dtype: int64
On the other hand, everything works just fine if I use dt.strftime('%m-%d'):
Dec-31 445
Name: dm, dtype: int64
I even did dt.strftime('%b-%d-%r').value_counts() and got the same correct count:
Dec-31-12:00:00 AM 445
Name: Date, dtype: int64
What is actually going wrong, or (to sound less newbie) what is happening behind the scenes in the %j case?
Let us consider an example with the following data:
df = pd.DataFrame({'Date' : ['2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31']})
df['Date'] = pd.to_datetime(df['Date'])
df
Date
0 2016-12-31
1 2017-12-31
2 2018-12-31
3 2019-12-31
4 2020-12-31
In the data above, 2016 and 2020 are leap years, with an extra day on February 29th to make up for the fact that an actual year is roughly 365 days and six hours long (every fourth year those extra six hours have added up to a full day, 4 x 6 = 24, and that is why we have Leap Day!), so we should expect %j to return 366 for those years when we do:
import pandas as pd
df = pd.DataFrame({'Date' : ['2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31']})
df['Date'] = pd.to_datetime(df['Date'])
df['Day'] = df['Date'].dt.strftime('%j')
df
Date Day
0 2016-12-31 366
1 2017-12-31 365
2 2018-12-31 365
3 2019-12-31 365
4 2020-12-31 366
However, when you do value_counts(), it returns:
365 3
366 2
Name: Day, dtype: int64
This is also expected behavior, so %j is working correctly behind the scenes: it is accommodating leap years.
%j returns a day number of year 001-366 (366 for leap year, 365 otherwise). Since your data spans 10 years, 366 would be a valid day for leap year.
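Coming back to the original goal (a 10-year low/high per calendar day), a sketch of one workaround: group on the '%m-%d' string instead of '%j', so every Dec 31st lands in a single bucket and Feb 29th gets its own. The 'Temp' column name here is hypothetical, standing in for whatever value column the weather data has.
# 'Temp' is a placeholder for the actual value column
daily_extremes = df.groupby(df['Date'].dt.strftime('%m-%d'))['Temp'].agg(['min', 'max'])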
Python pandas (0.24.1) is adding a seemingly arbitrary number of hours, minutes, and seconds to my datetime objects. This seems unexpected as default behavior; I would expect the time component to default to midnight (00:00:00). Is this a bug?
import pandas as pd
df = pd.DataFrame({'yr': [2019, 2019],
                   'mo': [9, 9],
                   'dy': [25, 26]})
df['dtime'] = (pd.to_datetime(df['yr'], format='%Y')
               + pd.to_timedelta(df['mo'] - 1, unit='M')
               + pd.to_timedelta(df['dy'] - 1, unit='d'))
print('pandas version == '+pd.__version__)
df
################################################
OUTPUT:
################################################
pandas version == 0.24.1
yr mo dy dtime
0 2019 9 25 2019-09-25 11:52:48
1 2019 9 26 2019-09-26 11:52:48
The problem is with converting months: a 'rounded' year length (because of leap years) is divided by 12 to get a 'rounded' month:
print (pd.to_timedelta(365.2425, unit='d') / 12)
30 days 10:29:06
print (pd.to_timedelta(1, unit='M'))
30 days 10:29:06
print (pd.to_timedelta(df['mo']-1,unit='M'))
0 243 days 11:52:48
1 243 days 11:52:48
Name: mo, dtype: timedelta64[ns]
The better solution is to use to_datetime with the year, month and day columns, and if necessary filter the subset with list(d.values()) (if there are other columns in the real data):
d = {'yr':'year', 'mo':'month', 'dy':'day'}
df['dtime'] = pd.to_datetime(df.rename(columns=d)[list(d.values())])
print (df)
yr mo dy dtime
0 2019 9 25 2019-09-25
1 2019 9 26 2019-09-26
To add detail to the issue with timedelta that Jezrael pointed out above: the problem with the month conversion is that pandas timedelta defines a month as 1/12 of a year, which is 365.2425 days based on leap-year logic.
243 days 11:52:48 is 21037968 seconds.
>>> 243*60*60*24+11*60*60+52*60+48
21037968
Some dimensional analysis confirms this is 8/12 of a year that is 365.2425 days long.
>>> 21037968/((8/12)*365.2425*60*60*24)
1.0
As noted above, use to_datetime to avoid this.
I have a dataframe of the following form:
datetime JD YEAR
2000-01-01 1 2000
2000-01-02 2 2000
2000-01-03 3 2000
2000-01-04 4 2000
2000-01-05 5 2000
2000-01-06 6 2000
2000-01-07 7 2000
2000-01-08 8 2000
2000-01-09 9 2000
...
2010-12-31 365 2014
The JD value is the Julian day, i.e. it starts at 1 on Jan 1st of each year (going up to 366 for leap years and 365 for others). I would like to reduce the JD value by 1 for each day starting on Feb 29th of each leap year. JD values should not be changed for non-leap years. Here is what I am doing right now:
def reduce_JD(row):
    if calendar.isleap(row.YEAR) & row.JD > 59:
        row.JD = row.JD - 1
    return row

def remove_leap_JD(df):
    # Reduce JD by 1 for each day following Feb 29th
    df.apply(reduce_JD, axis=1)
    return df

pdf = remove_leap_JD(pdf)
However, I do not see any change in JD values for leap years. What am I doing wrong?
--EDIT:
datetime is the index column
There are two issues:
In reduce_JD(), and should be used instead of &. Otherwise, due to operator precedence, the second part of the condition needs to be bracketed, as in (df.iloc[59].JD > 59). Note that:
calendar.isleap(df.iloc[59].YEAR) & (df.iloc[59].JD > 59)
# True
calendar.isleap(df.iloc[59].YEAR) & df.iloc[59].JD > 59
# False!
The apply function returns a new DataFrame instead of modifying the input in place. Therefore, in remove_leap_JD(), the code should be changed to:
df = df.apply(reduce_JD, axis=1)
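Alternatively, a vectorized sketch that sidesteps apply entirely, assuming the same column names: build a boolean mask for leap-year rows on or after Feb 29th (JD 60) and subtract in place.
import calendar

# Leap-year rows whose Julian day falls on/after Feb 29th (JD 60)
mask = df['YEAR'].map(calendar.isleap) & (df['JD'] > 59)
df.loc[mask, 'JD'] -= 1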