I calculated a column using the following code:
df_EVENT5_5['age'] = dt.datetime.now().date() - df_EVENT5_5['dt_old']
df_EVENT5_5['age_no_days'] = df_EVENT5_5['age'].dt.total_seconds()/ (24 * 60 * 60)
The output column contains the timestamp for some reason.
How do I remove the time stamp?
I tried the code below, but it didn't work:
remove_timestamp_col = ['COL_1', 'COL_2']
for i in remove_timestamp_col:
    df_EVENT5_13[i] = df_EVENT5_13[i].age.days()
I think some of the dates in the data might carry a time component. Try subtracting the date part only:
df_EVENT5_5['age'] = dt.date.today() - df_EVENT5_5['dt_old'].apply(dt.datetime.date)
#This gives you the days difference as a number
df_EVENT5_5['age_no_days'] = df_EVENT5_5['age'].dt.days
#If you want it to have 'Days' in the end, you can use concatenation:
df_EVENT5_5['age_with_days'] = df_EVENT5_5['age_no_days'].astype('str') + ' Days'
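If the column is (or can be converted to) a proper datetime column, a pandas-only variant avoids mixing `datetime.date` objects with pandas values. This is only a sketch: the helper name `age_in_days` and its `today` parameter are made up for illustration, and it assumes the column parses cleanly with `pd.to_datetime`.

```python
import pandas as pd

def age_in_days(dates, today=None):
    """Whole-day age of each date in a Series, ignoring the time of day.

    `age_in_days` and `today` are hypothetical names, not part of the question.
    """
    today = pd.Timestamp.today() if today is None else pd.Timestamp(today)
    dates = pd.to_datetime(dates)
    # normalize() resets the time to midnight, so only the dates are compared
    return (today.normalize() - dates.dt.normalize()).dt.days

sample = pd.Series(['2020-01-01 13:45:00', '2020-01-10 00:00:00'])
ages = age_in_days(sample, today='2020-01-11 09:30:00')
print(ages.tolist())  # [10, 1]
```

With the question's frame this would be called as `age_in_days(df_EVENT5_5['dt_old'])`.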
I have uploaded a big file and created a DataFrame for it.
Now I want to update some of the columns containing timestamps and, if possible, also update the date columns based on that.
The reason is that I want to adjust for daylight saving time. The list I am working with is in GMT, so I need to adjust its timestamps.
Example that works:
df_winter2['Confirmation_Time'] = pd.to_datetime(df_winter2['Confirmation_Time'].astype(str)) + pd.DateOffset(hours=7)
df_summer['Confirmation_Time'] = pd.to_datetime(df_summer['Confirmation_Time'].astype(str)) + pd.DateOffset(hours=6)
I want to write a function that first adds 6 or 7 hours to the DataFrame, depending on whether it is summertime or wintertime.
If possible, I also want to add 1 day to the date column when the timestamp is later than 16:00.
The date column is called df['Creation_Date'].
This should work for the function if it is wintertime.
def wintertime(date_time):
    # Use the function argument here, not the datetime module alias dt
    year, month, day = date_time.timetuple()[0:3]
    if (month < 3) or (month == 12 and day < 21):
        return True
    else:
        return False
Now I am guessing you also want to loop through your df and update the time respectively which you could do with the following:
df['Confirmation_Time'] = pd.to_datetime(df['Confirmation_Time'].astype(str))
for i in df.index:
    # Shift by 7 hours in wintertime and by 6 hours otherwise
    hours = 7 if wintertime(df.loc[i, 'Confirmation_Time']) else 6
    df.loc[i, 'Confirmation_Time'] += pd.DateOffset(hours=hours)
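For a big file, the same rule can probably be applied to the whole column at once instead of row by row. The following is a rough sketch, not a tested solution for your data: the function name `shift_confirmation_time` and its parameters are invented here, and the winter test simply mirrors the `wintertime` rule above rather than real daylight-saving dates.

```python
import pandas as pd

def shift_confirmation_time(df, winter_hours=7, summer_hours=6):
    """Return a copy of df with 'Confirmation_Time' shifted by season.

    Hypothetical helper; the winter rule copies the wintertime() function
    above and is not a true daylight-saving calendar.
    """
    out = df.copy()
    ts = pd.to_datetime(out['Confirmation_Time'].astype(str))
    is_winter = (ts.dt.month < 3) | ((ts.dt.month == 12) & (ts.dt.day < 21))
    # Pick the number of hours per row, then add it as a timedelta
    hours = is_winter.map({True: winter_hours, False: summer_hours})
    out['Confirmation_Time'] = ts + pd.to_timedelta(hours, unit='h')
    return out

example = pd.DataFrame({'Confirmation_Time': ['2020-01-15 10:00:00',
                                              '2020-07-15 10:00:00']})
shifted = shift_confirmation_time(example)
print(shifted['Confirmation_Time'].tolist())
```

If the target time zone is known, converting with `tz_localize('UTC')` and `tz_convert(...)` would handle the real switch-over dates, but the zone is not stated in the question.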
I want to find a way that could give me next month/quarter/year/bi-annual date given a Pandas timestamp.
If the timestamp is already an end of month/quarter/year/bi-annual date, then I can get the next quarter date as follows:
pd.Timestamp('1999-12-31') + pd.tseries.offsets.DateOffset(months=3)
But if the timestamp is pd.Timestamp('1999-12-30'), the above won't work.
Expected output
input = pd.Timestamp('1999-12-30')
next_quarter_end = '2000-03-31'
next_month_end = '2000-01-31'
next_year_end = '2000-12-31'
next_biannual_end = '2000-06-30'
This works. I used pandas.tseries.offsets.QuarterEnd, .MonthEnd, and .YearEnd, multiplied by specific factors that change based on the input, to achieve the four values you're looking for.
date = pd.Timestamp('1999-12-31')
month_factor = 1 if date.day == date.days_in_month else 2
year_factor = 1 if date.day == date.days_in_month and date.month == 12 else 2
next_month_end = date + pd.tseries.offsets.MonthEnd() * month_factor
next_quarter_end = date + (pd.tseries.offsets.QuarterEnd() * month_factor)
next_year_end = date + pd.tseries.offsets.YearEnd() * year_factor
next_biannual_end = date + pd.tseries.offsets.DateOffset(months=6)
Technically, the next quarter end after Timestamp('1999-12-30') is Timestamp('1999-12-31 00:00:00')
You can use pandas.tseries.offsets.QuarterEnd
>>> pd.Timestamp('1999-12-30') + pd.tseries.offsets.QuarterEnd()
Timestamp('1999-12-31 00:00:00')
>>> pd.Timestamp('1999-12-30') + pd.tseries.offsets.QuarterEnd()*2
Timestamp('2000-03-31 00:00:00')
Similarly, use pandas.tseries.offsets.MonthEnd() and pandas.tseries.offsets.YearEnd()
For biannual, I guess you can take 2*QuarterEnd().
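The "factor" idea from the first answer can also be written with `is_on_offset`, which tells you whether the timestamp already sits on a period end. This is a small sketch under that reading of the question (skip the end of the current period unless the date is already on it); `following_period_end` is a made-up name, and `is_on_offset` needs a reasonably recent pandas (older versions call it `onOffset`).

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd, QuarterEnd, YearEnd

def following_period_end(ts, offset_cls):
    """Period end after the current one (hypothetical helper).

    If ts is already a period end, step one period; otherwise step two,
    because the first step only rolls forward to the current period's end.
    """
    ts = pd.Timestamp(ts)
    n = 1 if offset_cls().is_on_offset(ts) else 2
    return ts + offset_cls(n)

ts = pd.Timestamp('1999-12-30')
print(following_period_end(ts, MonthEnd))    # 2000-01-31 00:00:00
print(following_period_end(ts, QuarterEnd))  # 2000-03-31 00:00:00
print(following_period_end(ts, YearEnd))     # 2000-12-31 00:00:00
```

The bi-annual case is left out on purpose, since it depends on which months you treat as half-year ends.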
I'm trying to compare 2 lists of dates, by checking if the date in the first dataframe with column 'timekey' is between the 2 dates, where the 2 dates is the date in timelist and timelist - 1 year.
An example would be checking if 30Aug2020 is between 30Nov2020 and 30Nov2020 minus 1 year, i.e. 30Nov2019.
I then want to have a 3rd column in the original df where it shows the difference between the timekey date and the compared timelist date.
I'm doing all of this in python using pandas.
import pandas as pd
import datetime as dt
datelist = pd.date_range(start = dt.datetime(2016,8,31), end = dt.datetime(2020,11,30), freq = '3M')
data = {'ID': ['1', '2', '3'], 'timekey': ['31Dec2016', '30Jun2017', '30Aug2018']}
df = pd.DataFrame(data)
df['timekey'] = pd.to_datetime(df['timekey'])
print(df)
print(datelist)
Here is the code I tried, but I get a ValueError saying that lengths must match to compare. What's going on?
for date in datelist:
    if (df['timekey'] <= datelist) & (df['timekey'] >= (datelist - pd.offsets.DateOffset(years=1))):
        df['diff'] = df['timekey'] - (datelist - pd.offsets.DateOffset(years=1))
The expected output should be that for each timekey, if it is within the date range specified by the datelist, it should generate an entire new row with the same ID and timekey with the 3rd new column being the difference in months.
For example, if the timekey is 30Jun2020, it would be between 30Nov2019-30Nov2020, 30Aug2019-30Aug2020. There would be 2 rows created whereby the time difference in months would be 5 and 2 respectively.
Easiest way I could think of to solve your problem would be using the unix timestamp (which will return you the seconds passed since 1970-01-01) to compare. Therefore you would need to convert your dates to unix.
Something like this would work:
unixTime = (pd.to_datetime(<yourTime>) - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
So a working example to check if a date is in between two dates could look like this:
def checkIfInbetween(date1, date2, dateToCheck):
    # Seconds since the Unix epoch; assumes timezone-naive dates
    epoch = pd.Timestamp('1970-01-01')
    date1 = (pd.to_datetime(date1) - epoch) // pd.Timedelta('1s')
    date2 = (pd.to_datetime(date2) - epoch) // pd.Timedelta('1s')
    dateToCheck = (pd.to_datetime(dateToCheck) - epoch) // pd.Timedelta('1s')
    return date1 < dateToCheck < date2
# axis=1 passes each row, so the three columns can be read by name
df['isInbetween'] = df.apply(lambda x: checkIfInbetween(x['date1'], x['date2'], x['dateToCheck']), axis=1)
(Code not tested)
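For the output described in the question (one row per matching window, plus a month difference), pairing every timekey with every window and then filtering may be simpler than looping. This is only a sketch: the column names `window_end`, `window_start`, `months_diff` and the helper column `_key` are invented here, and the month difference is counted by calendar month, which matches the 5 and 2 in the question's example.

```python
import pandas as pd

# Quarterly month-end dates, same range as in the question
datelist = pd.date_range(start='2016-08-31', end='2020-11-30',
                         freq=pd.offsets.MonthEnd(3))
df = pd.DataFrame({'ID': ['1', '2', '3'],
                   'timekey': ['31Dec2016', '30Jun2017', '30Aug2018']})
df['timekey'] = pd.to_datetime(df['timekey'], format='%d%b%Y')

windows = pd.DataFrame({'window_end': datelist})
windows['window_start'] = windows['window_end'] - pd.DateOffset(years=1)

# Pair every row with every window (cross join via a constant key)
pairs = df.assign(_key=1).merge(windows.assign(_key=1), on='_key').drop(columns='_key')

# Keep only the pairs where timekey falls inside the one-year window
inside = (pairs['timekey'] >= pairs['window_start']) & (pairs['timekey'] <= pairs['window_end'])
result = pairs[inside].copy()
result['months_diff'] = ((result['window_end'].dt.year - result['timekey'].dt.year) * 12
                         + (result['window_end'].dt.month - result['timekey'].dt.month))
print(result[['ID', 'timekey', 'window_end', 'months_diff']])
```

The ValueError in the original loop comes from comparing the 3-row column with the whole `datelist` at once; the cross join sidesteps that by lining the values up pair by pair.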
I'm new to Python. After a couple days researching and trying things out, I've landed on a decent solution for creating a list of timestamps, for each hour, between two dates.
Example:
import datetime
from datetime import datetime, timedelta
timestamp_format = '%Y-%m-%dT%H:%M:%S%z'
earliest_ts_str = '2020-10-01T15:00:00Z'
earliest_ts_obj = datetime.strptime(earliest_ts_str, timestamp_format)
latest_ts_str = '2020-10-02T00:00:00Z'
latest_ts_obj = datetime.strptime(latest_ts_str, timestamp_format)
num_days = latest_ts_obj - earliest_ts_obj
num_hours = int(round(num_days.total_seconds() / 3600,0))
ts_raw = []
for ts in range(num_hours):
    ts_raw.append(latest_ts_obj - timedelta(hours = ts + 1))
dates_formatted = [d.strftime('%Y-%m-%dT%H:%M:%SZ') for d in ts_raw]
# Need timestamps in ascending order
dates_formatted.reverse()
dates_formatted
Which results in:
['2020-10-01T00:00:00Z',
'2020-10-01T01:00:00Z',
'2020-10-01T02:00:00Z',
'2020-10-01T03:00:00Z',
'2020-10-01T04:00:00Z',
'2020-10-01T05:00:00Z',
'2020-10-01T06:00:00Z',
'2020-10-01T07:00:00Z',
'2020-10-01T08:00:00Z',
'2020-10-01T09:00:00Z',
'2020-10-01T10:00:00Z',
'2020-10-01T11:00:00Z',
'2020-10-01T12:00:00Z',
'2020-10-01T13:00:00Z',
'2020-10-01T14:00:00Z',
'2020-10-01T15:00:00Z',
'2020-10-01T16:00:00Z',
'2020-10-01T17:00:00Z',
'2020-10-01T18:00:00Z',
'2020-10-01T19:00:00Z',
'2020-10-01T20:00:00Z',
'2020-10-01T21:00:00Z',
'2020-10-01T22:00:00Z',
'2020-10-01T23:00:00Z']
Problem:
If I change earliest_ts_str to include minutes, say earliest_ts_str = '2020-10-01T19:45:00Z', the resulting list does not increment the minute intervals accordingly.
Results:
['2020-10-01T20:00:00Z',
'2020-10-01T21:00:00Z',
'2020-10-01T22:00:00Z',
'2020-10-01T23:00:00Z']
I need it to be:
['2020-10-01T20:45:00Z',
'2020-10-01T21:45:00Z',
'2020-10-01T22:45:00Z',
'2020-10-01T23:45:00Z']
Feels like the problem is in the num_days and num_hours calculation, but I can't see how to fix it.
Ideas?
If you don't mind using a third-party package, have a look at pandas.date_range:
import pandas as pd
earliest, latest = '2020-10-01T15:45:00Z', '2020-10-02T00:00:00Z'
dti = pd.date_range(earliest, latest, freq='H') # just specify hourly frequency...
l = dti.strftime('%Y-%m-%dT%H:%M:%SZ').to_list()
print(l)
# ['2020-10-01T15:45:00Z', '2020-10-01T16:45:00Z', '2020-10-01T17:45:00Z', '2020-10-01T18:45:00Z', '2020-10-01T19:45:00Z', '2020-10-01T20:45:00Z', '2020-10-01T21:45:00Z', '2020-10-01T22:45:00Z', '2020-10-01T23:45:00Z']
import datetime
from datetime import datetime, timedelta
timestamp_format = '%Y-%m-%dT%H:%M:%S%z'
earliest_ts_str = '2020-10-01T00:00:00Z'
ts_obj = datetime.strptime(earliest_ts_str, timestamp_format)
latest_ts_str = '2020-10-02T00:00:00Z'
latest_ts_obj = datetime.strptime(latest_ts_str, timestamp_format)
ts_raw = []
while ts_obj <= latest_ts_obj:
    ts_raw.append(ts_obj)
    ts_obj += timedelta(hours=1)
dates_formatted = [d.strftime('%Y-%m-%dT%H:%M:%SZ') for d in ts_raw]
print(dates_formatted)
EDIT:
Here is an example with Maya:
import maya
earliest_ts_str = '2020-10-01T00:00:00Z'
latest_ts_str = '2020-10-02T00:00:00Z'
start = maya.MayaDT.from_iso8601(earliest_ts_str)
end = maya.MayaDT.from_iso8601(latest_ts_str)
# end is not included, so we add 1 second
my_range = maya.intervals(start=start, end=end.add(seconds=1), interval=60*60)
dates_formatted = [d.iso8601() for d in my_range]
print(dates_formatted)
Both output
['2020-10-01T00:00:00Z',
'2020-10-01T01:00:00Z',
... some left out ...
'2020-10-01T23:00:00Z',
'2020-10-02T00:00:00Z']
Just change
num_hours = num_days.days*24 + num_days.seconds//3600
The problem is that the days attribute of a timedelta holds only whole days, so if the span is not a multiple of 24 hours you get the floor value (i.e. for your example you get 0). To compute the hours you therefore need both days and seconds.
Also, you can build the list directly in ascending order by counting forward from the earliest timestamp, which keeps its minutes as well. I am not sure if you are doing it the other way for some reason.
ts_raw.append(earliest_ts_obj + timedelta(hours = ts + 1))
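Putting the pieces together with only the standard library, something like the following should give the list asked for in the question. It is a sketch: `hourly_timestamps` is a made-up name, it starts one hour after the earliest timestamp (as the expected output does), and it assumes Python 3.7+ so that `%z` accepts the trailing `Z`.

```python
from datetime import datetime, timedelta

def hourly_timestamps(earliest_str, latest_str, fmt='%Y-%m-%dT%H:%M:%S%z'):
    """Hourly steps after earliest_str, up to and including latest_str.

    Hypothetical helper; counting forward from the earliest timestamp
    keeps its minutes and yields ascending order without reversing.
    """
    current = datetime.strptime(earliest_str, fmt) + timedelta(hours=1)
    latest = datetime.strptime(latest_str, fmt)
    stamps = []
    while current <= latest:
        stamps.append(current.strftime('%Y-%m-%dT%H:%M:%SZ'))
        current += timedelta(hours=1)
    return stamps

print(hourly_timestamps('2020-10-01T19:45:00Z', '2020-10-02T00:00:00Z'))
# ['2020-10-01T20:45:00Z', '2020-10-01T21:45:00Z',
#  '2020-10-01T22:45:00Z', '2020-10-01T23:45:00Z']
```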
I have the following target: I need to compare two date columns in the same table and create a 3rd column based on the result of the comparison. I do not know how to compare dates in a np.where statement.
This is my current code:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
And here is the np.where statement:
DB['s_date'] = np.where((DB['Start Date']<=time_delta | DB['Start Date'] = (None,"")),DB['Start Date'],RW['date'])
There is an OR condition to take into account the possibility that Start Date column might be empty
Would lambda apply work for you Filippo? It looks at a series row-wise, then applies a function of your choice to every value of the row. Whatever is returned in the function will fill up the series with the values it returns.
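def compare(date):
    # Check for a missing value first so it is never compared with a datetime
    if pd.isnull(date) or date <= time_delta:
        ...  # return something
    else:
        ...  # return something else
DB['s_date'] = DB['Start Date'].apply(lambda x: compare(x))
EDIT: This will work as well (thanks EyuelDK)
DB['s_date'] = DB['Start Date'].apply(compare)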
Thank you for the insights. I updated the code (and adjusted it for my purposes) as follows, and it works:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
DB['Start'] = np.where(((DB['Start Date']<=time_delta) | (DB['Start Date'].isnull()) | (DB['Start Date'] == "")),DB['Start'],DB['Start Date'])
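The key was to wrap each condition separated by | in parentheses. Otherwise | is evaluated before the comparisons, because it binds more tightly than <= and ==, and that raised an error from comparing two different data types.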
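For anyone landing here later, this is a small self-contained sketch of the parenthesized pattern. The data and the `cutoff` value are invented for illustration, and the choice of which column fills which case follows the condition in the original question rather than the final code above.

```python
import numpy as np
import pandas as pd

cutoff = pd.Timestamp('2021-01-08')  # stands in for now + 7 days
DB = pd.DataFrame({
    'Start Date': pd.to_datetime(['2021-01-05', None, '2021-02-01']),
    'date': pd.to_datetime(['2021-01-06', '2021-01-07', '2021-01-09']),
})

# Each comparison is built separately, then combined with |
use_start = (DB['Start Date'] <= cutoff) | DB['Start Date'].isnull()
DB['s_date'] = np.where(use_start, DB['Start Date'], DB['date'])
print(DB)
```

Without the parentheses, `time_delta | DB['Start Date']` would be evaluated first, which is where the type error came from.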