Combining pandas with datetime - python

I have a dataframe with start and end dates. I am trying to create a third column with the following conditions:
if dt < 24 hours; return the actual difference between start and end date
if dt > 24 hours; return start date + 24 hours
I have been able to create a column with a 24 hour difference, but I am not able to create a loc-statement that can do the above. Any help?
df2['end_shutdown_analysis'] = df2['Shutdown timestamp'] + timedelta(hours=24)

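You can try via np.where(), testing the elapsed time between start and end (.dt.hour is always below 24, so it cannot serve as the condition). This assumes the new column should hold a timestamp: the actual end, capped at start + 24 hours. 'start' and 'end' stand for your own column names:
import numpy as np
import pandas as pd
df2['end_shutdown_analysis'] = np.where(
    (df2['end'] - df2['start']) < pd.Timedelta(hours=24), # condition
    df2['end'], # value if true: the actual end
    df2['start'] + pd.DateOffset(hours=24) # else value: start + 24 hours
)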
OR
via loc:
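# True where less than 24 hours elapsed; 'start' and 'end' stand for your own columns
m = (df2['end'] - df2['start']) < pd.Timedelta(hours=24)
df2.loc[m, 'end_shutdown_analysis'] = df2['end']
df2.loc[~m, 'end_shutdown_analysis'] = df2['start'] + pd.DateOffset(hours=24)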
Note: you can also use pd.Timedelta(hours=24) in place of pd.DateOffset(hours=24)
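As a small runnable illustration of the capping idea, here is a sketch on made-up data. The column names start and end and the sample timestamps are assumptions, and it presumes the wanted result is a timestamp (the real end, but never later than start + 24 hours):

```python
import pandas as pd

# Hypothetical sample data: one short shutdown, one longer than 24 hours.
df2 = pd.DataFrame({
    "start": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:00"]),
    "end": pd.to_datetime(["2021-01-01 10:00", "2021-01-03 00:00"]),
})

# Latest timestamp the analysis window is allowed to reach.
cap = df2["start"] + pd.Timedelta(hours=24)

# Series.where keeps the value where the condition holds, else takes `cap`.
df2["end_shutdown_analysis"] = df2["end"].where(df2["end"] <= cap, cap)
print(df2)
```

Because both branches are timestamps, the resulting column keeps a datetime dtype, which would not be the case if a duration and a timestamp were mixed in one column.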

Related

How do I create a future pandas datetime index based on a previous datetime index without specifying the frequency?

I have a pandas dataframe and it has a datetime index. I would like to take that index and create a new index that starts from one freq step after the last time and extends for n future steps. My problem is in creating the pd.DateOffset I need to specify the frequency, but I don't want to hardcode that. Is there a way to determine the future index's frequency from the original index? Here is my hardcoded example:
import pandas as pd
base_idx = pd.date_range('2022-10-05', '2022-10-12', name='times', freq='D')
print(base_idx)
DatetimeIndex(['2022-10-05', '2022-10-06', '2022-10-07', '2022-10-08',
'2022-10-09', '2022-10-10', '2022-10-11', '2022-10-12'],
dtype='datetime64[ns]', name='times', freq='D')
n = 5
future_idx = pd.date_range(base_idx.max() + pd.DateOffset(days=1), base_idx.max() + pd.DateOffset(days=5))
print(future_idx)
DatetimeIndex(['2022-10-13', '2022-10-14', '2022-10-15', '2022-10-16',
'2022-10-17'],
dtype='datetime64[ns]', freq='D')
I want to not have to state that it is in days because I might end up needing seconds, or weeks, etc.
You can get the frequency from the base_idx index and use it to construct the new index:
# pass freq explicitly: without it date_range falls back to daily steps
future_idx = pd.date_range(
    base_idx.max() + base_idx.freq, periods=n, freq=base_idx.freq
)
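One caveat is that base_idx.freq is None when the index was not built with an explicit frequency. The sketch below wraps the idea in a helper (future_index is a hypothetical name, not a pandas function) that falls back to pd.infer_freq in that case:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def future_index(base_idx, n):
    """Return n steps following base_idx, reusing its frequency."""
    freq = base_idx.freq
    if freq is None:
        # Try to infer the frequency from the values (needs >= 3 points).
        freq = pd.infer_freq(base_idx)
    if freq is None:
        raise ValueError("could not determine a frequency for the index")
    # Turn a frequency string such as 'D' into an offset object.
    step = to_offset(freq)
    return pd.date_range(base_idx.max() + step, periods=n, freq=step,
                         name=base_idx.name)


base_idx = pd.date_range("2022-10-05", "2022-10-12", name="times", freq="D")
print(future_index(base_idx, 5))
```

The same call should work for other regular frequencies (for example an index built with freq='15min'), since nothing in the helper refers to days.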

Python pandas select rows based on datetime condition

Here is the code for sample simulated data. Actual data can have varying start and end dates.
import pandas as pd
import numpy as np
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
dfb=df.resample('B').apply(lambda x:x.iloc[-1])
From the dfb, I want to select the rows that contain values for all the days of the month.
In dfb, 2010 January and 2020 January have incomplete data. So I would like data from 2010 Feb till 2019 December.
For this particular dataset, I could do
df_out=dfb['2010-02':'2019-12']
But please help me with a better solution
Edit-- Seems there is plenty of confusion in the question. I want to omit the months whose rows do not begin on the first day of the month and the months whose rows do not end on the last day of the month. Hope that's clear.
When you say "better" solution - I assume you mean make the range dynamic based on input data.
OK, since you mention that your data is continuous after the start date - it is a safe assumption that dates are sorted in increasing order. With this in mind, consider the code:
import pandas as pd
import numpy as np
from datetime import date, timedelta
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
print(df)
dfb=df.resample('B').apply(lambda x:x.iloc[-1])
# fd is the first index in your dataframe
fd = df.index[0]
# checks if the first month data is incomplete, i.e. does not start with date = 1
if fd.day != 1:
    if fd.month == 12:
        # December rolls over to January of the following year
        first_day_of_next_month = fd.replace(year=fd.year + 1, month=1, day=1)
    else:
        first_day_of_next_month = fd.replace(month=fd.month + 1, day=1)
else:
    first_day_of_next_month = fd
# ld is the last index in your dataframe
ld = df.index[-1]
# computes the next day
next_day = ld + timedelta(days=1)
# != rather than > so that 31 December (next day in January) is handled too
if next_day.month != ld.month:
    last_day_of_prev_month = ld # keeps the index if month is changed
else:
    last_day_of_prev_month = ld.replace(day=1) - timedelta(days=1)
df_out=dfb[first_day_of_next_month:last_day_of_prev_month]
There is another way using dateutil.relativedelta; python-dateutil is a separate package, although it is normally already present because pandas depends on it. The above solution sticks to the standard datetime module.
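The same two boundaries can also be sketched with pandas' own month offsets instead of manual month arithmetic. This is only an illustration under the same assumption as above (contiguous, sorted daily data); the names start and end are introduced here:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2010-01-21", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=["A"])
dfb = df.resample("B").last()

fd, ld = df.index[0], df.index[-1]
# First day of the first complete month.
start = fd if fd.is_month_start else fd + pd.offsets.MonthBegin(1)
# Last day of the last complete month.
end = ld if ld.is_month_end else ld - pd.offsets.MonthEnd(1)

df_out = dfb.loc[start:end]
print(start.date(), end.date())
```

The offsets take care of year rollover and varying month lengths, so no special case for December is needed.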
I assume that in the general case the table is chronologically ordered (if not use .sort_index). The idea is to extract the year and month from the date and select only the lines where (year, month) is not equal to the first and last lines.
dfb['year'] = dfb.index.year # col#1
dfb['month'] = dfb.index.month # col#2
first_month = (dfb['year']==dfb.iloc[0, 1]) & (dfb['month']==dfb.iloc[0, 2])
last_month = (dfb['year']==dfb.iloc[-1, 1]) & (dfb['month']==dfb.iloc[-1, 2])
dfb = dfb.loc[(~first_month) & (~last_month)]
dfb = dfb.drop(['year', 'month'], axis=1)

Select nearest date first day of month in a python dataframe

I have this kind of dataframe
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one) but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the nearest to the first day of the month AND earlier than the 15th day of the month (because if the day is later it could be the measurement for the end of the month). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be
The purpose is to calculate a single consumption figure per month.
One solution is to iterate over the dataframe (as an array) and apply some if statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the dates to their month end with MonthEnd, then drop duplicates based on that column, keeping for each month the row closest to the month end. This assumes the dates are in a column named 'Index'.
from pandas.tseries.offsets import MonthEnd
df['New'] = df['Index'] + MonthEnd(1)
df['Diff'] = (df['Index'] - df['New']).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New','Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
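For illustration, here is a runnable variant of that sort-then-drop-duplicates idea. It is a sketch, not the answer's exact code: the sample values are the ones used elsewhere on this page, the dates are assumed to live in a column named Index, and the month key is a monthly period rather than MonthEnd(1), because MonthEnd(1) moves a date that already sits on a month end to the following month end. It keeps one row per month (the one closest to the month end) and does not implement the counter-reset rule from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Index": pd.to_datetime(["2019-10-05", "2019-10-29", "2019-10-30",
                             "2019-11-04", "2019-11-30", "2020-02-03"]),
    "Value": [1254, 1265, 1277, 1301, 1345, 1541],
})

# Helper columns: calendar month of each row and days left until month end.
df["New"] = df["Index"].dt.to_period("M")
df["Diff"] = df["Index"].dt.days_in_month - df["Index"].dt.day

# Within each month the smallest Diff comes first, so keep='first' retains it.
out = (df.sort_values(["New", "Diff"])
         .drop_duplicates(subset="New", keep="first")
         .drop(columns=["New", "Diff"]))
print(out)
```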
Defining dataframe, converting index to datetime, defining helper columns,
using them to run shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
[1254],
[1265],
[1277],
[1301],
[1345],
[1541]
], columns=["Value"]
, index=[dt.strptime("05-10-19", '%d-%m-%y'),
dt.strptime("29-10-19", '%d-%m-%y'),
dt.strptime("30-10-19", '%d-%m-%y'),
dt.strptime("04-11-19", '%d-%m-%y'),
dt.strptime("30-11-19", '%d-%m-%y'),
dt.strptime("03-02-20", '%d-%m-%y')
]
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541

Python Dataframe Date plus months variable which comes from the other column

I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed; as far as I know, the parameter of DateOffset has to be a single variable or a fixed number.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
edit:
this is a large dataset with millions rows.
You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
df = pd.DataFrame({'Date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)], 'month_diff': [1,2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or list comprehension
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would compute your "target_date" in the following steps:
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (using for example calendar.monthrange, see also "Get last day of the month"). This will provide you with the "flexible" part of that date offset.
3. Apply the flexible day offset, obtained by comparing the results of step 1 and step 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
for ii in df.index:
    # step 1: shift the date by month_diff months
    new_ = df.at[ii, 'Date'] + relativedelta(months=int(df.at[ii, 'month_diff']))
    # step 2: number of days in the target month
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    # step 3: move to the last day of that month
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster
This is my approach in solving your issue.
However for some reason I am getting a semantic error in my output even though I am sure it is the correct way. Please everyone correct me if you notice something wrong.
import pandas as pd
from datetime import datetime
from datetime import timedelta
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
    b = {'Date': today, 'month_diff':month_diff,"Target_Date": datetime.now()+timedelta(days=i*n)}
df = pd.DataFrame(data=b)
Output:
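For some reason the i is not getting updated: the loop rebuilds b on every pass, so only the value from the last iteration (i = 7) ends up in the DataFrame, and that single Target_Date is repeated on every row.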
I was looking for a solution I can write in one line only and apply does the job. However, by default apply function performs action on each column, so you have to remember to specify correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.
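Since the question mentions millions of rows, a row-wise apply may be slow. One possible vectorized alternative, shown here only as a sketch on made-up sample data, uses NumPy month arithmetic: truncate each date to its month, add the per-row month count, then step back one day from the first day of the following month. The column names Date and month_diff follow the question; everything else is an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, including the 2/13/2019 + 3 months case.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-13", "2019-11-30", "2020-01-31"]),
    "month_diff": [3, 3, 1],
})

# Truncate each date to its month, then add the per-row number of months.
month = df["Date"].to_numpy().astype("datetime64[M]")
target_month = month + df["month_diff"].to_numpy().astype("timedelta64[M]")

# First day of the following month minus one day is the month end.
next_month_start = (target_month + np.timedelta64(1, "M")).astype("datetime64[D]")
month_end = next_month_start - np.timedelta64(1, "D")

df["Target_Date"] = month_end.astype("datetime64[ns]")
print(df)
```

Working at month resolution sidesteps the day-of-month clamping that relativedelta performs, and leap years are handled by the calendar conversion itself.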

PYTHON Numpy where time condition

I have the following target: I need to compare two date columns in the same table and create a 3rd column based on the result of the comparison. I do not know how to compare dates in a np.where statement.
This is my current code:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
And here is the np.where statement:
DB['s_date'] = np.where((DB['Start Date']<=time_delta | DB['Start Date'] = (None,"")),DB['Start Date'],RW['date'])
There is an OR condition to take into account the possibility that Start Date column might be empty
Would lambda apply work for you Filippo? Called on a single column (a Series), it passes each value to a function of your choice, and whatever the function returns fills the new series.
def compare(date):
    # test for a missing value first, then compare the date
    if pd.isnull(date) or date <= time_delta:
        pass  # return something
    else:
        pass  # return something else
DB['s_date'] = DB['Start Date'].apply(lambda x: compare(x))
EDIT: This will work as well (thanks EyuelDK)
DB['s_date'] = DB['Start Date'].apply(compare)
Thank you for the insights. I updated (and adjusted for my purposes) the code as following and it works:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
DB['Start'] = np.where(((DB['Start Date']<=time_delta) | (DB['Start Date'].isnull()) | (DB['Start Date'] == "")),DB['Start'],DB['Start Date'])
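The key was to wrap each condition separated by | in its own parentheses. Otherwise | is evaluated before the comparisons, because it has higher precedence than <= and ==, and that raised an error about comparing two different data types.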
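A self-contained sketch of the parenthesized np.where pattern follows. The sample rows and the fixed cutoff are made up for illustration, and the date column is assumed to be a real datetime column, so missing entries are NaT and isna() covers the "empty" case:

```python
import numpy as np
import pandas as pd

cutoff = pd.Timestamp("2024-01-08")  # hypothetical stand-in for now + 7 days

DB = pd.DataFrame({
    "Start Date": pd.to_datetime(["2024-01-05", "2024-02-01", None]),
    "Start": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# Each comparison is parenthesized before being combined with |.
use_start = (DB["Start Date"] <= cutoff) | (DB["Start Date"].isna())

# Take 'Start' where the condition holds, otherwise keep 'Start Date'.
DB["s_date"] = np.where(use_start, DB["Start"], DB["Start Date"])
print(DB)
```

Comparing NaT with a timestamp yields False rather than raising, which is why the explicit isna() term is needed to route missing dates to the first branch.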
