Ultimately I want to calculate the number of days from every date in df['start'] to the last day of its month and populate the 'count' column with the result.
As a first step towards that goal, the calendar.monthrange function takes (year, month) arguments and returns a (weekday of the first day, number of days) tuple.
I seem to be making a general mistake in how I apply functions to DataFrame or Series objects, and I would like to understand why this isn't working.
import numpy as np
import pandas as pd
import calendar
def last_day(row):
    return calendar.monthrange(row['start'].dt.year, row['start'].dt.month)
This line raises an AttributeError: "Timestamp object has no attribute 'dt'":
df['count'] = df.apply(last_day, axis=1)
This is what my dataframe looks like:
       start  count
0 2016-02-15    NaN
1 2016-02-20    NaN
2 2016-04-23    NaN
df.dtypes
start    datetime64[ns]
count           float64
dtype: object
Remove the .dt. The .dt accessor is only needed on a Series (a whole column of datetimes). An individual element is already a Timestamp, which exposes year and month directly:
Code:
def last_day(row):
    return calendar.monthrange(row['start'].year, row['start'].month)
Why:
With axis=1, apply calls last_day once per row and passes that row as a Series:
df['count'] = df.apply(last_day, axis=1)
In last_day you then select a single element of that Series, which is a Timestamp rather than a Series:
row['start'].year
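To make the difference concrete, here is a minimal sketch using the three dates from the question. The helper name days_to_month_end is only illustrative; it takes element [1] of the monthrange tuple (the number of days in the month) and subtracts the day of the month.

```python
import calendar
import pandas as pd

df = pd.DataFrame({"start": pd.to_datetime(["2016-02-15", "2016-02-20", "2016-04-23"])})

# On a whole column (a Series), the datetime fields sit behind the .dt accessor.
years = df["start"].dt.year

# A single element is a Timestamp, which exposes the fields directly.
first = df["start"].iloc[0]

def days_to_month_end(row):
    # Hypothetical helper: row["start"] is a Timestamp here, so no .dt is needed.
    ts = row["start"]
    days_in_month = calendar.monthrange(ts.year, ts.month)[1]
    return days_in_month - ts.day

df["count"] = df.apply(days_to_month_end, axis=1)
print(df)
```

February 2016 has 29 days, so the first row is 29 - 15 = 14 days away from the end of its month.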
I would do it like this:
from pandas.tseries.offsets import MonthEnd
## sample data
d = pd.DataFrame({'start':['2016-02-15','2016-02-20','2016-04-23']})
## solution
d['start'] = pd.to_datetime(d['start'])
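## MonthEnd(0) rolls forward to the month end but leaves dates already on a month end unchanged; MonthEnd(1) would move those into the next month
d['end'] = d['start'] + MonthEnd(0)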
d['count'] = (d['start'] - d['end']) / np.timedelta64(-1, 'D')
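As an alternative sketch (not part of the answer above), the same count can be computed without an intermediate 'end' column by subtracting the day of the month from the number of days in that month. The extra date 2016-02-29 is added here only to show the month-end case.

```python
import pandas as pd

d = pd.DataFrame(
    {"start": pd.to_datetime(["2016-02-15", "2016-02-20", "2016-04-23", "2016-02-29"])}
)

# days_in_month and day are both vectorized .dt properties, so no apply is needed.
d["count"] = d["start"].dt.days_in_month - d["start"].dt.day
print(d)
```

This yields integers rather than the floats produced by dividing by a timedelta.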
solar["DATE"]= solar['DATE'].strftime('%Y-%m-%d')
display(solar)
I want to remove the time component from the DATE column and keep only the date. How do I do that?
[1]: https://i.stack.imgur.com/8G8Jg.png
The error I get is below:
AttributeError: 'Series' object has no attribute 'strftime'
The error shows that solar['DATE'] is a pandas Series, which has no strftime method of its own. One way to transform each value is the .apply() function.
You can do it via:
# If you want datetime.date objects (the values must already be datetimes):
solar['DATE'].apply(lambda d: d.date())
# If you want the dates as strings (the values must also be datetimes; plain strings have no strftime):
solar['DATE'].apply(lambda d: d.strftime('%Y-%m-%d'))
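If the column already has a datetime64 dtype, the .dt accessor offers vectorized equivalents, which avoids apply altogether. The sketch below assumes made-up sample values, since the original data is not shown.

```python
import pandas as pd

# Hypothetical sample data standing in for the 'solar' dataframe.
solar = pd.DataFrame(
    {"DATE": pd.to_datetime(["2021-03-01 14:30:00", "2021-03-02 08:15:00"])}
)

as_text = solar["DATE"].dt.strftime("%Y-%m-%d")  # strings
as_dates = solar["DATE"].dt.date                 # datetime.date objects (object dtype)
as_midnight = solar["DATE"].dt.normalize()       # still datetime64, time set to 00:00:00

print(as_text.tolist())
```

Which of the three to pick depends on whether later steps need real datetimes (normalize) or just display text (strftime).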
What I came up with is what follows:
import pandas as pd
import datetime
date = pd.date_range("2018-01-01", periods=500, freq="H")
dataframe = pd.DataFrame({"date":date})
def removeDayTime(date):
    dateStr = str(date)  # Convert the Timestamp to a string. You may not need this line if your dates are already strings.
    dateWithoutTime = datetime.datetime.strptime(dateStr, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d")
    return dateWithoutTime
dataframe["date"] = dataframe["date"].apply(removeDayTime)
dataframe.head()
Note that in order to have example data to work with, I have generated 500 periods of dates. You probably do not need to use my dataframe. So just use the rest of the code.
Output
         date
0  2018-01-01
1  2018-01-01
2  2018-01-01
3  2018-01-01
4  2018-01-01
I have this kind of dataframe
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one) but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the one nearest to the first day of the month AND before the 15th day of the month (because a later day could be the end-of-month measurement). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data need to be
The purpose is to calculate a single consumption figure per month.
One solution is to iterate over the dataframe (as an array) and apply some if statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can map every date to its month end with MonthEnd, then drop duplicates on that column, keeping the entry closest to the month end.
from pandas.tseries.offsets import MonthEnd
# Assumes the dates are in a datetime column named 'Index'
df['New'] = df['Index'] + MonthEnd(1)
df['Diff'] = abs((df['Index'] - df['New']).dt.days)
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New','Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
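A runnable sketch of this idea follows, using the sample values that appear in the other answer in this thread and assuming the dates live in the index rather than in a column. It swaps in MonthEnd(0) so that a date already on a month end stays in its own month; that is my assumption, not part of the answer. Note that it only keeps the entry closest to each month end and does not implement the "before the 15th" or counter-reset conditions from the question.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame(
    {"Value": [1254, 1265, 1277, 1301, 1345, 1541]},
    index=pd.to_datetime(
        ["2019-10-05", "2019-10-29", "2019-10-30",
         "2019-11-04", "2019-11-30", "2020-02-03"]
    ),
)

tmp = df.copy()
month_end = tmp.index + MonthEnd(0)          # month end of each date's own month
tmp["New"] = month_end                       # grouping key: one value per month
tmp["Diff"] = (month_end - tmp.index).days   # days remaining until that month end

one_per_month = (
    tmp.sort_values(["New", "Diff"])                 # closest to month end first
       .drop_duplicates(subset="New", keep="first")  # keep one row per month
       .drop(columns=["New", "Diff"])
)
print(one_per_month)
```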
Defining dataframe, converting index to datetime, defining helper columns,
using them to run shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
        [1254],
        [1265],
        [1277],
        [1301],
        [1345],
        [1541]
    ], columns=["Value"],
    index=[dt.strptime("05-10-19", '%d-%m-%y'),
           dt.strptime("29-10-19", '%d-%m-%y'),
           dt.strptime("30-10-19", '%d-%m-%y'),
           dt.strptime("04-11-19", '%d-%m-%y'),
           dt.strptime("30-11-19", '%d-%m-%y'),
           dt.strptime("03-02-20", '%d-%m-%y')
    ]
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
date['Maturity_date'] = data.apply(lambda data: relativedelta(months=int(data['TRM_LNTH_MO'])) + data['POL_EFF_DT'], axis=1)
Tried this also:
date['Maturity_date'] = date['POL_EFF_DT'] + date['TRM_LNTH_MO'].values.astype("timedelta64[M]")
TypeError: 'type' object does not support item assignment
import pandas as pd
import datetime
#Convert the date column to date format
date['date_format'] = pd.to_datetime(date['Maturity_date'])
#Add a month column
date['Month'] = date['date_format'].apply(lambda x: x.strftime('%b'))
If you are using pandas, you can use its "frequency aliases" with pd.date_range:
# periods=2 with freq='M' (month end) returns the first two month-end dates on or after _existing_date.
# Note: newer pandas versions spell this alias 'ME' and warn about 'M'.
import pandas as pd
_new_period = pd.date_range(_existing_date, periods=2, freq='M')
Now you can get the period you want as the second element returned:
# Index 0 is the month end on or after the existing date (the date itself only if it already falls on a month end); index 1 is the following month end.
_new_period.strftime('%Y-%m-%d')[1]
# You can format the result in different ways: only the year, the month or the day.
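The following sketch shows what such a date_range call returns for a concrete date, and contrasts it with pd.DateOffset, which adds calendar months without snapping to a month end. The date 2019-02-13 is only an example, and passing a pd.offsets.MonthEnd() object as freq is used here in place of the string alias so that the snippet does not depend on how a given pandas version spells it.

```python
import pandas as pd

existing_date = pd.Timestamp("2019-02-13")  # example value, not from the original post

# Two consecutive month ends, starting with the first one on or after existing_date.
month_ends = pd.date_range(existing_date, periods=2, freq=pd.offsets.MonthEnd())
print(month_ends.strftime("%Y-%m-%d").tolist())

# Adding one calendar month keeps the day of the month instead.
plus_one_month = existing_date + pd.DateOffset(months=1)
print(plus_one_month.date())
```

So date_range with a month-end frequency is useful when the target should fall on a month end, while DateOffset(months=...) fits a maturity date that keeps the original day.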
I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed; as far as I know, the parameter of DateOffset has to be a single variable or a fixed number, not a Series.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
edit:
This is a large dataset with millions of rows.
You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
# pd.datetime is no longer available in recent pandas versions, so pd.Timestamp is used here
df = pd.DataFrame({'Date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)], 'month_diff': [1,2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or list comprehension
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would use the following approach to compute your "target_date":
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (for example with calendar.monthrange, see also "Get last day of the month"). This gives you the "flexible" part of the date offset.
3. Apply the flexible day offset, i.e. the difference between the results of step 1 and step 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
for ii in df.index:
    new_ = df.at[ii, 'start_date'] + relativedelta(months=df.at[ii, 'month_diff'])
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster
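Since the question mentions millions of rows, a fully vectorized variant may be worth trying. The sketch below is one possible approach, not the answer's method: it does the month arithmetic in NumPy's datetime64[M] unit, moves one month further, and steps back one day to land on the last day of the target month. The column names Date and month_diff follow the question; the sample rows are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-13", "2019-01-31", "2019-11-15"]),
    "month_diff": [3, 1, 2],
})

# Truncate each date to its month, then add the per-row month offset.
start_month = df["Date"].to_numpy().astype("datetime64[M]")
offset = df["month_diff"].to_numpy().astype("timedelta64[M]")

# First day of the month AFTER the target month, minus one day = last day of target month.
next_month = start_month + offset + np.timedelta64(1, "M")
last_day = next_month.astype("datetime64[D]") - np.timedelta64(1, "D")

df["Target_Date"] = pd.to_datetime(last_day)
print(df)
```

For 2019-02-13 with month_diff 3 this gives 2019-05-31, matching the example in the question.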
import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach in solving your issue.
However for some reason I am getting a semantic error in my output even though I am sure it is the correct way. Please everyone correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
    b = {'Date': today, 'month_diff':month_diff,"Target_Date": datetime.now()+timedelta(days=i*n)}
df = pd.DataFrame(data=b)
Output:
For some reason the i is not getting updated.
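A likely explanation is that the loop rebuilds the whole dictionary b on every pass, so only the last value of i survives, and that single Target_Date is then broadcast to every row. A minimal sketch of a fix, keeping this answer's assumption of 30 days per month, builds one target per entry with a list comprehension. The fixed base date is only there to make the result reproducible.

```python
from datetime import datetime, timedelta
import pandas as pd

base = datetime(2019, 2, 13)   # fixed example date instead of datetime.now()
month_diff = [30, 5, 7]
n = 30                         # days per "month", as assumed in the answer above

# One target date per month_diff entry, instead of a single scalar from the last loop pass.
targets = [base + timedelta(days=m * n) for m in month_diff]

df = pd.DataFrame({"Date": base, "month_diff": month_diff, "Target_Date": targets})
print(df)
```

Treating a month as 30 days is only an approximation and will not land on a month end.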
I was looking for a solution I can write in one line only and apply does the job. However, by default apply function performs action on each column, so you have to remember to specify correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.
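To see what the two relativedelta terms do on a single value, here is a small sketch with plain datetime.date objects. relativedelta(months=...) shifts by whole months (clamping the day when the target month is shorter), and relativedelta(day=31) sets the day to 31 or to the last valid day of the month. The dates are examples only.

```python
from datetime import date
from dateutil.relativedelta import relativedelta

start = date(2019, 2, 13)
# Shift three months forward, then snap to the last day of that month.
target = start + relativedelta(months=3) + relativedelta(day=31)
print(target)

# A short month: day=31 is clamped to the last valid day.
short_month = date(2019, 1, 31) + relativedelta(months=1) + relativedelta(day=31)
print(short_month)
```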
I have a basic code snippet that I need to recreate in pandas:
import datetime as dt
date1 = dt.date(2016,10,10)
date2 = dt.date.today()
print('Week#', round((date2 - date1).days / 7 +.5))
# output: Week# 36
I have a datetime64[ns] datatype column and I cannot crack it. Using this basic example I'm stumped:
import pandas as pd
import numpy as np
import datetime as dt
dfp = pd.DataFrame({'A' : [dt.date(2016,10,6)]})
dfp['A'] = pd.to_datetime(dfp['A'])
def week(col):
    print((col.dt.date - dt.date(2015, 10, 6)))
week(dfp['A']) #output: 366 days
When I try re-creating the week number calculation I'm running into multiple errors: print((col.dt.date - dt.date(2015, 10, 6)).days) returns AttributeError: 'Series' object has no attribute 'days'
I'd like to try and solve this on my own so I can learn from the pain.
How do I return the pandas column values in terms of "number of days" or as an int like using the first calculation in the first code snippet? (ie, instead of 366 days, just 366)
If you're feeling adventurous, how could I get the Week# xxx output in a more efficient way?
I think you forgot .dt before .days:
dfp = pd.DataFrame({'A' : [date2]})
dfp['A'] = pd.to_datetime(dfp['A'])
print (dfp)
print (((dfp['A'].dt.date - dt.date(2016, 10, 10)).dt.days / 7 + .5).round().astype(int))
0 36
Name: A, dtype: int32
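A variant that stays in datetime64 throughout is sketched below: subtracting a pd.Timestamp from the column gives a timedelta64 Series, and .dt.days turns it into plain integers. The two sample dates are made up so the result is reproducible (the original uses today's date).

```python
import pandas as pd

dfp = pd.DataFrame({"A": pd.to_datetime(["2017-06-15", "2016-10-20"])})

# datetime64 Series minus a Timestamp gives a timedelta64 Series.
elapsed = dfp["A"] - pd.Timestamp("2016-10-10")

# .dt.days extracts the whole number of days as integers (e.g. 248 rather than "248 days").
days = elapsed.dt.days

# Same week formula as in the question, applied to the whole column at once.
week = (days / 7 + 0.5).round().astype(int)
print(week.tolist())
```

Skipping the detour through .dt.date avoids creating an object-dtype column of Python date objects.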