Python Dataframe: Date plus a months variable which comes from another column - python

I have a dataframe with a Date column and a month_diff column. I would like to get a new date (call it Target_Date) based on the following logic:
For example, if the date is 2/13/2019 and month_diff is 3, then the target date should be the month end of the original date plus 3 months, which is 5/31/2019.
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed; as far as I know, the parameter in DateOffset should be a single variable or a fixed number, not a whole column.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
Edit:
This is a large dataset with millions of rows.

You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
df = pd.DataFrame({'Date': [pd.Timestamp(2019, 1, 1), pd.Timestamp(2019, 2, 1)], 'month_diff': [1, 2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or use a list comprehension:
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]

I would approach computing your "target_date" in the following way:
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (using, for example, calendar.monthrange; see also "Get last day of the month"). This provides the "flexible" part of the date offset.
3. Apply that flexible day offset by comparing the results of steps 1 and 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
for ii in df.index:
    new_ = df.at[ii, 'start_date'] + relativedelta(months=df.at[ii, 'month_diff'])
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster

import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach to solving your issue.
However, for some reason I am getting a semantic error in my output even though I am sure this is the correct approach. Please correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
    b = {'Date': today, 'month_diff': month_diff, "Target_Date": datetime.now() + timedelta(days=i*n)}
    df = pd.DataFrame(data=b)
Output:
For some reason the i is not getting updated.
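The reason i appears not to update is that the DataFrame is rebuilt on every pass through the loop with a single scalar Target_Date, so only the value from the last i survives. A hedged fix, keeping the same 30-days-per-month approximation, is to build the Target_Date column as a list and drop the loop:
# build Target_Date element by element so each row keeps its own offset
b = {'Date': today,
     'month_diff': month_diff,
     'Target_Date': [datetime.now() + timedelta(days=i * n) for i in month_diff]}
df = pd.DataFrame(data=b)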

I was looking for a solution I could write in one line only, and apply does the job. However, by default the apply function acts on each column, so you have to remember to specify the correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.
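Since the question mentions millions of rows, here is a hedged, fully vectorized sketch (assuming the Date column is already datetime64) that avoids apply entirely by using numpy month arithmetic: truncate each date to its month, add month_diff + 1 months, then step back one day to land on the month end.
import numpy as np
months = df['Date'].values.astype('datetime64[M]')               # truncate each date to its month
offset = (df['month_diff'].values + 1).astype('timedelta64[M]')  # month_diff + 1 whole months
# first day of the month after the target month, minus one day = target month end
df['Target_Date'] = (months + offset).astype('datetime64[D]') - np.timedelta64(1, 'D')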

Related

Python pandas select rows based on datetime condition

Here is the code for sample simulated data. Actual data can have varying start and end dates.
import pandas as pd
import numpy as np
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
dfb=df.resample('B').apply(lambda x:x[-1])
From dfb, I want to select only the rows that fall in months for which values exist for all the days of the month.
In dfb, 2010 January and 2020 January have incomplete data. So I would like data from 2010 Feb till 2019 December.
For this particular dataset, I could do
df_out=dfb['2010-02':'2019-12']
But please help me with a better solution
Edit -- It seems there is plenty of confusion in the question. I want to omit rows that do not begin with the first day of the month and rows that do not end on the last day of the month. I hope that's clear.
When you say "better" solution - I assume you mean make the range dynamic based on input data.
OK, since you mention that your data is continuous after the start date - it is a safe assumption that dates are sorted in increasing order. With this in mind, consider the code:
import pandas as pd
import numpy as np
from datetime import date, timedelta
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
print(df)
dfb=df.resample('B').apply(lambda x:x[-1])
# fd is the first index in your dataframe
fd = df.index[0]
# check if the first month's data is incomplete, i.e. does not start on day 1
if fd.day != 1:
    if fd.month == 12:
        # December rolls over to January of the following year
        first_day_of_next_month = fd.replace(year=fd.year + 1, month=1, day=1)
    else:
        first_day_of_next_month = fd.replace(month=fd.month + 1, day=1)
else:
    first_day_of_next_month = fd

# ld is the last index in your dataframe
ld = df.index[-1]
# compute the next day; if it falls in a different month, ld is already a month end
next_day = ld + timedelta(days=1)
if next_day.month != ld.month:
    last_day_of_prev_month = ld  # keep the index when the month changes
else:
    last_day_of_prev_month = ld.replace(day=1) - timedelta(days=1)

df_out = dfb[first_day_of_next_month:last_day_of_prev_month]
There is another way using dateutil.relativedelta, but you would need to install the python-dateutil module. The above solution attempts to do it without using any extra modules.
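A hedged sketch of that relativedelta variant (same trimming logic, fewer manual branches; it assumes df and dfb from the code above):
from dateutil.relativedelta import relativedelta
fd, ld = df.index[0], df.index[-1]
# roll an incomplete first month forward to the 1st of the next month
start = fd if fd.is_month_start else (fd + relativedelta(months=1)).replace(day=1)
# roll an incomplete last month back to the last day of the previous month
end = ld if ld.is_month_end else ld.replace(day=1) - relativedelta(days=1)
df_out = dfb[start:end]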
I assume that in the general case the table is chronologically ordered (if not, use .sort_index). The idea is to extract the year and month from the date and select only the lines where (year, month) does not equal that of the first or last line.
dfb['year'] = dfb.index.year # col#1
dfb['month'] = dfb.index.month # col#2
first_month = (dfb['year']==dfb.iloc[0, 1]) & (dfb['month']==dfb.iloc[0, 2])
last_month = (dfb['year']==dfb.iloc[-1, 1]) & (dfb['month']==dfb.iloc[-1, 2])
dfb = dfb.loc[(~first_month) & (~last_month)]
dfb = dfb.drop(['year', 'month'], axis=1)
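A hedged variant of the same idea, without the helper columns, compares a monthly PeriodIndex against its first and last entries (this assumes, as in the question, that only the first and last months can be incomplete):
# drop every row that falls in the same calendar month as the first or last row
months = dfb.index.to_period('M')
df_out = dfb[(months != months[0]) & (months != months[-1])]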

Select the nearest date to the first day of the month in a python dataframe

I have this kind of dataframe:
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following month) but sometimes more often. The value can be reset to "0" if the counter breaks and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the nearest to the first day of the month AND earlier than the 15th day of the month (because if the day is later, it could be a measurement from the end of the month). Another condition is that if the difference between two consecutive values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be:
The purpose is to calculate only one consumption figure per month.
A solution is to iterate over the dataframe (as an array) and apply some if statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the month data with MonthEnd, then drop duplicates based on that column, keeping the last value in each month.
from pandas.tseries.offsets import MonthEnd

df['New'] = df.index + MonthEnd(1)
df['Diff'] = abs((df['New'] - df.index).dt.days)
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into StackOverflow if this isn't doing the job.
Defining the dataframe, converting the index to datetime, defining helper columns, using them with the shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
    [1254],
    [1265],
    [1277],
    [1301],
    [1345],
    [1541]
], columns=["Value"],
   index=[dt.strptime("05-10-19", '%d-%m-%y'),
          dt.strptime("29-10-19", '%d-%m-%y'),
          dt.strptime("30-10-19", '%d-%m-%y'),
          dt.strptime("04-11-19", '%d-%m-%y'),
          dt.strptime("30-11-19", '%d-%m-%y'),
          dt.strptime("03-02-20", '%d-%m-%y')])
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541

How to add month column to a date column in python?

date['Maturity_date'] = data.apply(lambda data: relativedelta(months=int(data['TRM_LNTH_MO'])) + data['POL_EFF_DT'], axis=1)
Tried this also:
date['Maturity_date'] = date['POL_EFF_DT'] + date['TRM_LNTH_MO'].values.astype("timedelta64[M]")
TypeError: 'type' object does not support item assignment
import pandas as pd
import datetime
#Convert the date column to date format
date['date_format'] = pd.to_datetime(date['Maturity_date'])
#Add a month column
date['Month'] = date['date_format'].apply(lambda x: x.strftime('%b'))
If you are using Pandas, you may use a feature called "Frequency Aliases". Something very out of the box:
# With periods=2 you get two dates: index 0 for the month of your existing date and index 1 for the following month, stepping by the 'M' (month-end) frequency.
import pandas as pd
_new_period = pd.date_range(_existing_date, periods=2, freq='M')
Now you can get exactly the period you want as the second element returned:
# The element you want is at index 1; index 0 is the month end of the existing date's own month.
_new_period.strftime('%Y-%m-%d')[1]
# You can format it in different ways: only year, month, or day, whatever.
Consult this link for further information
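As a hedged worked example (using an assumed existing date of '2019-02-13' for illustration):
import pandas as pd
_existing_date = '2019-02-13'
_new_period = pd.date_range(_existing_date, periods=2, freq='M')
# _new_period is ['2019-02-28', '2019-03-31']; element 1 is the end of the following month
print(_new_period.strftime('%Y-%m-%d')[1])  # 2019-03-31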

Python calculate the number of years in a date column

I've recently started coding with Python, and I'm struggling to calculate the number of years between the current date and a given date.
Dataframe
I would like to calculate the number of years for each column.
I tried this, but it's not working:
def Number_of_years(d1, d2):
    if d1 is not None:
        return relativedelta(d2, d1).years

for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col] = Number_of_years(df[col], date.today())
Can anyone help me find a solution to this?
I see that the format of the dates is day/month/year.
Given that this format is the same for all the cells, you can parse the date string using the datetime module like so:
from datetime import datetime  # import module
import numpy as np

def numberOfYears(element):
    # parse the date string according to the fixed format
    date = datetime.strptime(element, '%d/%m/%Y')
    # return the difference in the years
    return datetime.today().year - date.year

# make things more interesting by vectorizing this function
function = np.vectorize(numberOfYears)
# This returns a numpy array containing the difference between years.
# Call this for each column, and you should be good.
difference = function(df.Date_creation)
Your code is basically right, but you're operating over a pandas Series, so you can't just call relativedelta on it directly:
from datetime import date
from dateutil.relativedelta import relativedelta

def number_of_years(d1, d2):
    return relativedelta(d2, d1).years

for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col] = df[col].apply(lambda d: number_of_years(d, date.today()))
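If the dataframe is large, a hedged vectorized alternative in the spirit of the first answer (a plain calendar-year difference, not the full relativedelta computation) avoids the row-wise apply:
import pandas as pd
for col in df.select_dtypes(include=['datetime64[ns]']):
    # subtract the year components directly; no per-row Python call needed
    df[col] = pd.Timestamp.today().year - df[col].dt.year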

Python: Pick a specific date from a dataframe with Pandas

The following short script uses findatapy to collect data from the Dukascopy website. Note that this package uses Pandas, so it doesn't need to be imported separately.
from findatapy.market import Market, MarketDataRequest, MarketDataGenerator
market = Market(market_data_generator=MarketDataGenerator())
md_request = MarketDataRequest(start_date='08 Feb 2017', finish_date='09 Feb 2017', category='fx', fields=['bid', 'ask'], freq='tick', data_source='dukascopy', tickers=['EURUSD'])
df = market.fetch_market(md_request)
#Group everything by an hourly frequency.
df=df.groupby(pd.TimeGrouper('1H')).head(1)
#Deleting the milliseconds from the Dataframe
df.index =df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M:%S'))
#Computing Average between columns 1 and 2, and storing it in a new one.
df['Avg'] = (df['EURUSD.bid'] + df['EURUSD.ask'])/2
The outcome looks like this:
Until this point, everything runs properly, but I need to extract a specific hour from this dataframe. I'd like to pick, let's say, all the values (bid, ask, avg... or just one of them) at a certain hour, 10:00:00 AM.
By seeing other posts, I thought I could do something like this:
match_timestamp = "10:00:00"
df.loc[(df.index.strftime("%H:%M:%S") == match_timestamp)]
But the outcome is an error message saying:
AttributeError: 'Index' object has no attribute 'strftime'
I can't even perform df.index.hour; it used to work before the line where I remove the milliseconds (the dtype is datetime64[ns] until that point), but after that the dtype is 'object'. It looks like I need to reverse this formatting in order to use strftime.
Can you help me out?
You should take a look at resample:
df = df.resample('H').first() # resample for each hour and use first value of hour
then:
df.loc[df.index.hour == 10] # index is still a date object, play with it
If you don't like that, you can just set your index to a datetime object like so:
df.index = pd.to_datetime(df.index)
Then your code should work as is.
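As a hedged sketch of that last suggestion, reusing the question's match_timestamp, the original filter works once the index is converted back:
# keep the index as real datetimes, then filter on the formatted time of day
df.index = pd.to_datetime(df.index)
df_10am = df.loc[df.index.strftime("%H:%M:%S") == match_timestamp]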
Try resetting the index:
match_timestamp = "10:00:00"
df = df.reset_index()
df = df.assign(Date=pd.to_datetime(df.Date))
df.loc[(df.Date.dt.strftime("%H:%M:%S") == match_timestamp)]
