NumPy where with a time condition - python

My goal: I need to compare two date columns in the same table and create a third column based on the result of the comparison. I do not know how to compare dates in an np.where statement.
This is my current code:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
And here is the np.where statement:
DB['s_date'] = np.where((DB['Start Date']<=time_delta | DB['Start Date'] = (None,"")),DB['Start Date'],RW['date'])
There is an OR condition to take into account the possibility that the Start Date column might be empty.

Would a lambda with apply work for you, Filippo? apply looks at a Series row-wise and applies a function of your choice to every value; whatever the function returns fills the resulting Series.
def compare(date):
    if date is None or date <= time_delta:  # check for None first, so the comparison doesn't blow up
        ...  # return something
    else:
        ...  # return something else

DB['s_date'] = DB['Start Date'].apply(lambda x: compare(x))
EDIT: This will work as well, without the lambda wrapper (thanks EyuelDK):
DB['s_date'] = DB['Start Date'].apply(compare)

Thank you for the insights. I updated (and adjusted for my purposes) the code as follows, and it works:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
DB['Start'] = np.where(
    (DB['Start Date'] <= time_delta) | (DB['Start Date'].isnull()) | (DB['Start Date'] == ""),
    DB['Start'],
    DB['Start Date']
)
The key was to wrap each condition joined by | in its own parentheses. Without them, | binds more tightly than the comparison operators, so Python tried to combine mismatched data types and raised an error.
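For anyone hitting the same error, here is a minimal reproduction of the precedence pitfall on made-up data:
import datetime
import pandas as pd

s = pd.Series([datetime.datetime(2021, 4, 1), pd.NaT])  # one date, one missing value
time_delta = datetime.datetime.now() + datetime.timedelta(days=7)

ok = (s <= time_delta) | s.isnull()    # parentheses: comparisons run first
# bad = s <= time_delta | s.isnull()   # | runs first -> datetime | bool -> TypeError
print(ok)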

Related

update Pandas DataFrame time column based on a date range

I have uploaded a big file and created a DataFrame from it.
Now I want to update some of the columns containing timestamps and, if possible, update the date columns based on them.
The reason is that I want to adjust for daylight saving time: the list I am working with is in GMT, so I need to adjust its timestamps.
Example that works:
df_winter2['Confirmation_Time'] = pd.to_datetime(df_winter2['Confirmation_Time'].astype(str)) + pd.DateOffset(hours=7)
df_summer['Confirmation_Time'] = pd.to_datetime(df_summer['Confirmation_Time'].astype(str)) + pd.DateOffset(hours=6)
I want to write a function that first adds 6 or 7 hours to the DataFrame depending on whether it is summertime or wintertime.
If possible, I also want to update the date column, df['Creation_Date'], by +1 day when the timestamp is after 16:00.
This should work as the function for checking whether it is wintertime:
def wintertime(date_time):
    # unpack year, month, day from the timestamp being checked
    year, month, day = date_time.timetuple()[0:3]
    if (month < 3) or (month == 12 and day < 21):
        return True
    else:
        return False
Now I am guessing you also want to loop through your df and update the times accordingly, which you could do as follows:
for i in df.index:  # enumerate(df) would iterate over column names, not rows
    date_time = df.loc[i, 'Confirmation_Time']
    if wintertime(date_time):
        df.loc[i, 'Confirmation_Time'] = pd.to_datetime(str(date_time)) + pd.DateOffset(hours=7)
    else:
        df.loc[i, 'Confirmation_Time'] = pd.to_datetime(str(date_time)) + pd.DateOffset(hours=6)
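If you'd rather skip the row-by-row loop, here is a vectorized sketch (assuming the column names from the question and the wintertime helper above; it treats anything at or after 16:00 as late, so adjust the cutoff if 16:00 sharp should not count):
import pandas as pd

df['Confirmation_Time'] = pd.to_datetime(df['Confirmation_Time'].astype(str))
# +7 hours for winter rows, +6 for the rest
is_winter = df['Confirmation_Time'].apply(wintertime)
df['Confirmation_Time'] += pd.to_timedelta(is_winter.map({True: 7, False: 6}), unit='h')
# bump Creation_Date by one day where the adjusted timestamp is late
late = df['Confirmation_Time'].dt.hour >= 16
df.loc[late, 'Creation_Date'] = pd.to_datetime(df.loc[late, 'Creation_Date']) + pd.Timedelta(days=1)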

Combining pandas with datetime

I have a dataframe with start and end dates. I am trying to create a third column with the following conditions:
if dt < 24 hours: return the actual difference between start and end date
if dt > 24 hours: return start date + 24 hours
I have been able to create a column with a 24 hour difference, but I am not able to create a loc-statement that can do the above. Any help?
df2['end_shutdown_analysis'] = df2['Shutdown timestamp'] + timedelta(hours=24)
You can try via np.where():
import numpy as np

m = (df2['end'] - df2['start']) < pd.Timedelta(hours=24)  # dt < 24 hours
df2['end_shutdown_analysis'] = np.where(
    m,                                           # condition
    (df2['end'] - df2['start']).astype(object),  # value if true: the actual difference
    df2['start'] + pd.DateOffset(hours=24)       # else value: start + 24 hours
)
# the column mixes timedeltas and timestamps, hence the object cast
Or via loc:
m = (df2['end'] - df2['start']) < pd.Timedelta(hours=24)
df2.loc[m, 'end_shutdown_analysis'] = df2['end'] - df2['start']
df2.loc[~m, 'end_shutdown_analysis'] = df2['start'] + pd.DateOffset(hours=24)
Note: you can also use pd.Timedelta(hours=24) in place of pd.DateOffset(hours=24)
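A quick check on hypothetical data (the column names 'start' and 'end' are assumptions here, since the question only names 'Shutdown timestamp'):
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'start': pd.to_datetime(['2021-01-01 00:00', '2021-01-01 00:00']),
    'end':   pd.to_datetime(['2021-01-01 05:00', '2021-01-03 00:00']),
})
m = (df2['end'] - df2['start']) < pd.Timedelta(hours=24)
df2['end_shutdown_analysis'] = np.where(
    m,
    (df2['end'] - df2['start']).astype(object),
    df2['start'] + pd.DateOffset(hours=24),
)
print(df2)
# row 0 -> 0 days 05:00:00    (under 24h: the actual difference)
# row 1 -> 2021-01-02 00:00   (24h or more: capped at start + 24 hours)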

Calculating difference in years for the whole dataframe

I have a dataframe with two dates and I want to add a new column that is the difference between the two in years.
birthDate | created_at | diff_in_years
2000-10 | 2019-06-17 13:15:04.598799+00:00 |
I have written the following code to compute the difference. Since I do not know the exact day from birthDate, I manually set it to 1 for both. It works great for one row.
import datetime
from dateutil.relativedelta import relativedelta

def convert_to_datetime(s):  # renamed the parameter; str shadows the built-in
    x = int(s[0:4])
    y = int(s[5:7])
    z = 1                    # the day is unknown, so fix it to 1
    return datetime.datetime(x, y, z)

start_date = "2000-10"
start_date = convert_to_datetime(start_date)
end_date = "2019-06-17 13:15:04.598799+00:00"
end_date = convert_to_datetime(end_date)
diff = relativedelta(end_date, start_date)
But the problem is: how can I run this computation for the whole dataframe? I've tried the apply function, but it doesn't work; I'm not using it properly.
data.apply(relativedelta(convert_to_datetime(data["created_at"]),convert_to_datetime(data["birthDate"]), axis=1))
Try the following; use the pandas built-in function for the datetime conversion:
df['birthDate'] = pd.to_datetime(df['birthDate'])
# created_at carries a UTC offset; drop it so the two columns can be subtracted
df['created_at'] = pd.to_datetime(df['created_at']).dt.tz_localize(None)
# from here you can simply subtract
df['difference'] = df['created_at'] - df['birthDate']
# note: this gives a Timedelta; for a rough year count, divide its days by 365
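If you specifically want whole years (as the diff_in_years column suggests), a row-wise sketch with relativedelta, applied per row via axis=1, could look like this:
from dateutil.relativedelta import relativedelta

df['diff_in_years'] = df.apply(
    lambda row: relativedelta(row['created_at'], row['birthDate']).years,
    axis=1,
)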

Filter df within same date range

I have a weekly report, where I need to find duplicates of IDs in one column (Customer ID) happening within the same date range (Submit Date).
import pandas as pd
from datetime import timedelta
report = pd.read_excel('report.xlsx', engine='openpyxl', parse_dates=['Submit Date'])
customer = report['Customer ID']
submit_date = report['Submit Date'].dt.date
submit_date = submit_date.sort_index()
mask1 = customer.duplicated(keep=False)  # keep=False marks every occurrence of a duplicate
report = report[mask1]
And this part comes easy - I have the result of all duplicated IDs. However, I'm not sure how I should approach limiting it to a single day, e.g. 2021-04-12, so I only get duplicates within that particular date. I tried creating a for loop with an if statement checking
if day < day + timedelta(days=1)
but that didn't seem to bring any results. I could hard-code the dates and create a different mask for every date of the week, but I'd like to keep the report automated.
Thanks in advance for all ideas!
I actually managed to sort it out:
# assumes an empty df = pd.DataFrame() and the start_date/end_date bounds are defined beforehand
while start_date <= end_date:
    same_date_values = report.loc[start_date == submit_date]
    print(f" ################# DATE CHECKED: {start_date} #################")
    mask1 = same_date_values['Customer ID'].duplicated(keep=False)
    same_date_values = same_date_values[mask1]
    df = df.append(same_date_values)  # note: append was removed in pandas 2.0; pd.concat is the modern equivalent
    start_date += timedelta(days=1)
It's a single while loop (the snippet omits the setup before it, noted in the comment above), but it basically allowed me to iterate through every day (12-04, 13-04, etc.) and break the report into smaller dataframes sharing the same date.
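For what it's worth, the loop can likely be avoided by treating the (Customer ID, calendar day) pair as the duplication key; a sketch along those lines:
tmp = report.assign(day=report['Submit Date'].dt.date)
mask = tmp.duplicated(subset=['Customer ID', 'day'], keep=False)
same_day_dupes = report[mask]
This marks every row whose customer appears more than once on the same date, in a single pass.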

For a list of dates, check if it is between another list of 2 dates

I'm trying to compare two lists of dates by checking whether each date in the first dataframe's 'timekey' column falls between two dates, namely a date from timelist and that same date minus one year.
An example would be checking whether 30Aug2020 lies between 30Nov2019 (i.e. 30Nov2020 - 1 year) and 30Nov2020.
I then want a third column in the original df showing the difference between the timekey date and the compared timelist date.
I'm doing all of this in python using pandas.
import pandas as pd
import datetime as dt
datelist = pd.date_range(start = dt.datetime(2016,8,31), end = dt.datetime(2020,11,30), freq = '3M')
data = {'ID': ['1', '2', '3'], 'timekey': ['31Dec2016', '30Jun2017', '30Aug2018']}
df = pd.DataFrame(data)
df['timekey'] = pd.to_datetime(df['timekey'])
print(df)
print(datelist)
Here is the code I tried, but I get a ValueError saying lengths must match to compare. What's going on?
for date in datelist:
    if (df['timekey'] <= datelist) & (df['timekey'] >= (datelist - pd.offsets.DateOffset(years=1))):
        df['diff'] = df['timekey'] - (datelist - pd.offsets.DateOffset(years=1))
The expected output: for each timekey that falls within a window from the datelist, generate a new row with the same ID and timekey, plus a third column holding the difference in months.
For example, a timekey of 30Jun2020 falls in both 30Nov2019-30Nov2020 and 30Aug2019-30Aug2020, so two rows would be created, with time differences of 5 and 2 months respectively.
The easiest way I could think of to solve your problem would be comparing unix timestamps (the number of seconds passed since 1970-01-01), so you would need to convert your dates to unix time first.
Something like this would work:
unixTime = (pd.to_datetime(<yourTime>) - pd.Timestamp('1970-01-01T00:00:00.000Z')) // pd.Timedelta('1s')
So a working example to check whether a date is in between two dates could look like this:
def checkIfInbetween(date1, date2, dateToCheck):
    epoch = pd.Timestamp('1970-01-01')  # naive epoch, matching the naive input dates
    date1 = (pd.to_datetime(date1) - epoch) // pd.Timedelta('1s')
    date2 = (pd.to_datetime(date2) - epoch) // pd.Timedelta('1s')  # was converting date1 twice
    dateToCheck = (pd.to_datetime(dateToCheck) - epoch) // pd.Timedelta('1s')
    if dateToCheck < date2 and dateToCheck > date1:  # Python uses and, not &&
        return True
    else:
        return False

df['isInbetween'] = df.apply(lambda x: checkIfInbetween(x['date1'], x['date2'], x['dateToCheck']), axis=1)
(Code not tested)
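As an aside, the ValueError in the original attempt comes from comparing the whole timekey Series against the whole datelist (mismatched lengths) instead of against the loop's scalar date. Since pandas timestamps compare directly, the unix detour is also optional; here is a cross-join sketch for the row-per-window output the question describes (the window_* names are hypothetical):
import pandas as pd

windows = pd.DataFrame({'window_end': datelist})
windows['window_start'] = windows['window_end'] - pd.offsets.DateOffset(years=1)

merged = df.merge(windows, how='cross')  # cross join needs pandas >= 1.2
inside = (merged['timekey'] > merged['window_start']) & (merged['timekey'] <= merged['window_end'])
result = merged[inside].copy()
# whole-month difference between the window end and the timekey
result['diff_months'] = (
    (result['window_end'].dt.year - result['timekey'].dt.year) * 12
    + (result['window_end'].dt.month - result['timekey'].dt.month)
)
print(result[['ID', 'timekey', 'window_end', 'diff_months']])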
