I have uploaded a big file and created a DataFrame for it.
Now I want to update some of the columns containing timestamps and, if possible, update the date columns based on that.
The reason is that I want to adjust for daylight saving time: the list I am working with is in GMT, so I need to adjust its timestamps.
Example that works:
df_winter2['Confirmation_Time'] = pd.to_datetime(df_winter2['Confirmation_Time'].astype(str)) + pd.DateOffset(hours=7)
df_summer['Confirmation_Time'] = pd.to_datetime(df_summer['Confirmation_Time'].astype(str)) + pd.DateOffset(hours=6)
I want to write a function that first adds 6 or 7 hours to the DataFrame, depending on whether it is summertime or wintertime.
If possible, I also want to add 1 day to the date column when the timestamp is later than 16:00.
The date column is called df['Creation_Date'].
This should work as the function that checks whether it is wintertime.
def wintertime(date_time):
    year, month, day = date_time.timetuple()[0:3]
    if (month < 3) or (month == 12 and day < 21):
        return True
    else:
        return False
I am guessing you also want to loop through your df and update each time accordingly, which you could do with the following:
# parse the whole column once, then adjust it row by row
df['Confirmation_Time'] = pd.to_datetime(df['Confirmation_Time'].astype(str))
for i in df.index:
    date_time = df.loc[i, 'Confirmation_Time']
    # add 7 hours in wintertime, 6 hours otherwise
    if wintertime(date_time):
        df.loc[i, 'Confirmation_Time'] = date_time + pd.DateOffset(hours=7)
    else:
        df.loc[i, 'Confirmation_Time'] = date_time + pd.DateOffset(hours=6)
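As an alternative to hard-coded offsets, here is a minimal sketch that lets pandas work out the daylight saving shift itself. It assumes the timestamps are local wall-clock times in a zone that is 7 hours behind GMT in winter and 6 hours behind in summer; "America/Denver" is only a guess at such a zone, and the sample rows and the Confirmation_Time_GMT column name are made up for illustration. Applying the 16:00 rule to the converted time is also an assumption.

```python
import pandas as pd

# Hypothetical sample data: one winter row and one summer row.
df = pd.DataFrame({
    "Confirmation_Time": ["2021-01-15 10:00:00", "2021-07-15 10:00:00"],
    "Creation_Date": pd.to_datetime(["2021-01-15", "2021-07-15"]),
})

# Attach the assumed local zone, convert to UTC/GMT, then drop the tz info again.
local = pd.to_datetime(df["Confirmation_Time"]).dt.tz_localize("America/Denver")
gmt = local.dt.tz_convert("UTC").dt.tz_localize(None)
df["Confirmation_Time_GMT"] = gmt

# Rows whose converted time of day is later than 16:00 get the date moved forward one day.
late = (gmt - gmt.dt.normalize()) > pd.Timedelta(hours=16)
df.loc[late, "Creation_Date"] = df.loc[late, "Creation_Date"] + pd.Timedelta(days=1)

print(df)
```

With real data, local times that fall inside the clock change can be ambiguous or nonexistent; tz_localize has ambiguous and nonexistent parameters to control how those rows are handled.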
I want to build a time series of temperature data from 1850 to 2014. My issue is that when I plot the series, the time axis starts at 0, which corresponds to 1 January 1850, and stops at day 60 230, which corresponds to 31 December 2014.
I tried to write a loop that creates a new list with the time in months and years, so that I could plot this new list against my initial temperature list, but it didn't succeed.
This is the kind of loop that I tested:
days = list(range(1,365+1))
years = []
y = 1850
years.append(y)
while y<2015:
for i in days:
years.append(y+i)
y = y+1
del years [-1]
dsetyears = Dataset(years)
I also tried the "datetime" module, but that didn't work either (maybe this tool is better because it takes leap years into account...).
day_number = "0"
year = "1850"
res = datetime.strptime(year + "-" + day_number, "%Y-%j").strftime("%m-%d-%Y")
If anyone has a clue or a lead I can look into, I'm interested.
Thanks in advance!
You can achieve that using the datetime module. Let's declare the starting and ending dates.
import datetime
dates = []
starting_date = datetime.datetime(1850, 1, 1)
ending_date = datetime.datetime(2014, 12, 31)
Then we can write a while loop that runs as long as the starting date is less than or equal to the ending date and adds one day with timedelta on every iteration. Before incrementing, we append the formatted date as a string to the dates list.
while starting_date <= ending_date:
    dates.append(starting_date.strftime("%m-%d-%Y"))
    starting_date += datetime.timedelta(days=1)
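If the time axis already exists as day counts, the same idea can be applied per value instead of building the full list. A minimal sketch, assuming day 0 is 1 January 1850 and that the data follows the standard calendar with leap days (with a fixed 365-day calendar this mapping would drift); START, day_index_to_date and the sample day_indices are hypothetical names and values:

```python
from datetime import date, timedelta

START = date(1850, 1, 1)  # assumed origin of the day counter


def day_index_to_date(day_index):
    """Return the calendar date that lies day_index days after START."""
    return START + timedelta(days=day_index)


# A few sample day counts; in practice this would be the existing time axis.
day_indices = [0, 58, 59, 365]
dates = [day_index_to_date(n) for n in day_indices]

# Month-year labels for plotting.
labels = [d.strftime("%m-%Y") for d in dates]
print(labels)
```

The resulting date objects can be passed to the plotting call in place of the raw day numbers.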
I'm trying to compare two lists of dates by checking whether each date in the 'timekey' column of a dataframe falls between two dates, where the two dates are a date in datelist and that same date minus 1 year.
An example would be checking whether 30Aug2020 is between 30Nov2020 and 30Nov2020 minus 1 year, i.e. 30Nov2019.
I then want a third column in the original df that shows the difference between the timekey date and the datelist date it was compared with.
I'm doing all of this in python using pandas.
import pandas as pd
import datetime as dt
datelist = pd.date_range(start = dt.datetime(2016,8,31), end = dt.datetime(2020,11,30), freq = '3M')
data = {'ID': ['1', '2', '3'], 'timekey': ['31Dec2016', '30Jun2017', '30Aug2018']}
df = pd.DataFrame(data)
df['timekey'] = pd.to_datetime(df['timekey'])
print(df)
print(datelist)
Here is the code I tried, but I get a ValueError saying that lengths must match to compare. What's going on?
for date in datelist:
if (df['timekey'] <= datelist) & (df['timekey'] >= (datelist - pd.offsets.DateOffset(years=1))):
df['diff'] = df['timekey'] - (datelist - pd.offsets.DateOffset(years=1))
The expected output should be that for each timekey, if it is within the date range specified by the datelist, it should generate an entire new row with the same ID and timekey with the 3rd new column being the difference in months.
For example, if the timekey is 30Jun2020, it would be between 30Nov2019-30Nov2020, 30Aug2019-30Aug2020. There would be 2 rows created whereby the time difference in months would be 5 and 2 respectively.
The easiest way I could think of to solve your problem is to compare Unix timestamps (the seconds elapsed since 1970-01-01). For that you need to convert your dates to Unix time.
Something like this would work:
unixTime = (pd.to_datetime(<yourTime>) - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
So a working example to check whether a date is in between two dates could look like this:
def checkIfInbetween(date1, date2, dateToCheck):
    epoch = pd.Timestamp('1970-01-01')
    date1 = (pd.to_datetime(date1) - epoch) // pd.Timedelta('1s')
    date2 = (pd.to_datetime(date2) - epoch) // pd.Timedelta('1s')
    dateToCheck = (pd.to_datetime(dateToCheck) - epoch) // pd.Timedelta('1s')
    # True only when dateToCheck lies strictly between date1 and date2
    return date1 < dateToCheck < date2
df['isInbetween'] = df.apply(lambda x: checkIfInbetween(x['date1'], x['date2'], x['dateToCheck']), axis=1)
(Code not tested)
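The ValueError in the question appears because df['timekey'] is compared element by element with the whole datelist, and the two have different lengths. For the expected output described there (one row per matching window), a cross join may be simpler than a loop. A minimal sketch, assuming pandas 1.2 or later for how="cross"; the sample rows, the windows frame and the window_start, window_end and diff_months names are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data in the shape described in the question.
df = pd.DataFrame({
    "ID": ["1", "2"],
    "timekey": pd.to_datetime(["2016-12-31", "2020-06-30"]),
})
windows = pd.DataFrame({
    "window_end": pd.to_datetime(["2020-08-30", "2020-11-30", "2017-02-28"]),
})
windows["window_start"] = windows["window_end"] - pd.DateOffset(years=1)

# Pair every timekey with every window, then keep the pairs where it falls inside.
pairs = df.merge(windows, how="cross")
inside = pairs["timekey"].between(pairs["window_start"], pairs["window_end"])
result = pairs[inside].copy()

# Difference in whole calendar months between the window end and the timekey.
result["diff_months"] = (
    (result["window_end"].dt.year - result["timekey"].dt.year) * 12
    + (result["window_end"].dt.month - result["timekey"].dt.month)
)
print(result[["ID", "timekey", "window_end", "diff_months"]])
```

A cross join builds len(df) * len(windows) rows, so for large inputs it may be worth filtering in chunks.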
I have a large csv file with format below:
date event-type case event
2020-07-23 00:00:00.000257032 wake-up 0 patient wakes
2020-07-23 10:30:00.005042270 meal 1 patient has breakfast
2020-07-23 10:32:30.088683558 lavatory 2 1st - sample collected
I have around 600k entries like this.
The values in the case column don't exist beforehand.
The question is: with every changing minute in the date column, how do I insert a unique number in the case column, like this:
date case
2020-07-23 10:30:00.005042270 1
2020-07-23 10:31:00.005042270 2
2020-07-23 10:32:00.005042270 3
Also, apart from a change in the minute, all other changes are to be ignored; i.e., as long as the time in the date column is 10:30, the number entered in the case column stays 1 until 10:31 appears in the date column.
Being new to python, I am not sure how to do this.
Try this:
from datetime import datetime
import pandas as pd
df = df.sort_values('date', ascending=True).reset_index(drop=True)
# drop the last 3 digits (nanoseconds) so strptime can parse the microseconds, then keep only the minute
date_to_minute = df['date'].map(lambda d: datetime.strptime(d[:-3], '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d %H:%M'))
previous_date_time = date_to_minute[0]
# if you want the case column to start from 1, change this variable to 1
current_case = 0
cases = []
for current_date_time in date_to_minute:
    # a later minute than the previous row starts a new case
    if current_date_time > previous_date_time:
        current_case += 1
    cases.append(current_case)
    previous_date_time = current_date_time
df['cases'] = pd.Series(cases, name='cases')
I assume that your dataframe is sorted by date. Try this:
# pandas stores Timestamps internally as nanoseconds since the epoch (Jan 1, 1970)
# You first need to convert them to whole minutes since the epoch
minutes = pd.to_datetime(df['date']).astype('int64') // (60 * 10**9)
# Every new minute makes a new case number
df['case'] = minutes.diff().gt(0).cumsum()
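A variant of the same idea that avoids the integer arithmetic: floor each timestamp to its minute and start a new case whenever that minute differs from the previous row. A minimal sketch, assuming the frame is already sorted by date; the sample rows are made up and numbering starts at 1:

```python
import pandas as pd

# Hypothetical sample rows, sorted by date.
df = pd.DataFrame({"date": [
    "2020-07-23 10:30:00.005042270",
    "2020-07-23 10:30:45.000000000",
    "2020-07-23 10:31:00.100000000",
    "2020-07-23 10:32:30.088683558",
]})

# Truncate every timestamp to the start of its minute.
minute = pd.to_datetime(df["date"]).dt.floor("min")

# A row opens a new case when its minute differs from the row above;
# the first row always differs from the missing value produced by shift().
df["case"] = minute.ne(minute.shift()).cumsum()
print(df)
```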
I have a data frame with time series data. One column holds signup dates and another holds cancel dates. I want to fill in the missing cancel dates with a date that is earlier than a specific date, but at most 40 weeks later.
How should I proceed?
if df['cancel_date'] is NaT, then add date max. + 40 weeks.
df['cancel_date'] - df['signup_date'] should not be less than 0.
IIUC, you can use Series.fillna with the pandas.Timedelta class.
If adding 40 weeks to each record's signup_date:
df['cancel_date'] = df['cancel_date'].fillna(df['signup_date'] + pd.Timedelta(40, 'W'))
If adding 40 weeks to the maximum date in the signup_date column:
df['cancel_date'] = df['cancel_date'].fillna(df['signup_date'].max() + pd.Timedelta(40, 'W'))
Or, if using some predefined max date value with the constraint that cancel_date is never earlier than signup_date, chain on the clip method:
max_date = pd.Timestamp(2018, 4, 30)
df['cancel_date'] = df['cancel_date'].fillna(max_date + pd.Timedelta(40, 'W')).clip(lower=df.signup_date)
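Another possible reading of the question is "fill a missing cancel date with signup + 40 weeks, but never later than a fixed cutoff". A minimal sketch under that assumption; cutoff, candidate, capped and the sample rows are hypothetical:

```python
import pandas as pd

# Hypothetical sample data: two of the three cancel dates are missing.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2017-01-02", "2018-03-05", "2017-06-01"]),
    "cancel_date": pd.to_datetime(["2017-03-01", None, None]),
})

cutoff = pd.Timestamp(2018, 4, 30)  # assumed "specific date"

# Candidate fill value: 40 weeks after signup, capped at the cutoff.
candidate = df["signup_date"] + pd.Timedelta(weeks=40)
capped = candidate.where(candidate <= cutoff, cutoff)

# Only the missing cancel dates are replaced; existing ones stay untouched.
df["cancel_date"] = df["cancel_date"].fillna(capped)
print(df)
```

If a signup can lie after the cutoff, the capped value would precede the signup date, so that case would need its own rule.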
I would use numpy.where if you want to add a column holding the difference between the signup date and the cancel date directly:
df['date difference between signup and cancel'] = np.where(df['cancel_date'].isna(), (df['signup_date'].max() + pd.Timedelta(40, 'W')) - df['signup_date'], df['cancel_date'] - df['signup_date'])
This gives you a new column that directly holds the difference between the signup date and the cancel date. Note that the missing values are detected with isna(), because a comparison with np.nan is never True.
I've written this function to get the last Thursday of the month:
def last_thurs_date(date):
    month=date.dt.month
    year=date.dt.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
But it's not working with the lambda function.
datelist['Date'].map(lambda x: last_thurs_date(x))
Where datelist is
datelist = pd.DataFrame(pd.date_range(start = pd.to_datetime('01-01-2014',format='%d-%m-%Y')
, end = pd.to_datetime('06-03-2019',format='%d-%m-%Y'),freq='D').tolist()).rename(columns={0:'Date'})
datelist['Date']=pd.to_datetime(datelist['Date'])
Jpp already added the solution, but just to add a slightly more readable formatted string - see this awesome website.
import calendar
def last_thurs_date(date):
    year, month = date.year, date.month
    cal = calendar.monthcalendar(year, month)
    # each row is a week (Monday first) and days outside the month are 0,
    # so the largest value in the Thursday column is the last Thursday
    last_thurs_date = max(week[calendar.THURSDAY] for week in cal)
    return f'{year}-{month:02d}-{last_thurs_date:02d}'
Also added a bit of logic - e.g. you got 2019-02-0 because cal[4][4] is empty (0) for February 2019, and column 4 of the calendar is Friday, not Thursday.
Scalar datetime objects don't have a dt accessor; series do: see pd.Series.dt. If you remove it, your function runs. The key is understanding that pd.Series.map and pd.Series.apply pass scalars to your custom function one at a time, not the entire series. Note also that cal[4][4] is the Friday of the fifth calendar row, which can be 0 or missing, so the version below takes the largest entry in the Thursday column instead.
import calendar
def last_thurs_date(date):
    month = date.month
    year = date.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = max(week[calendar.THURSDAY] for week in cal)
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
You can rewrite your logic more succinctly via f-strings (Python 3.6+) and a conditional expression:
def last_thurs_date(date):
    month = date.month
    year = date.year
    last_thurs_date = max(week[calendar.THURSDAY] for week in calendar.monthcalendar(year, month))
    return f'{year}{"-0" if month < 10 else "-"}{month}-{last_thurs_date}'
I know that a lot of time has passed since this was posted, but I think it is worth adding another option in case someone comes across this thread.
Even though I use pandas every day at work, in this case my suggestion would be to just use the dateutil library. The solution is a simple one-liner, without unnecessary complications.
from dateutil.rrule import rrule, MONTHLY, FR, SA
from datetime import datetime as dt
import pandas as pd
# monthly options expiration dates (third Friday of each month) calculated for 2022
monthly_options = list(rrule(MONTHLY, count=12, byweekday=FR, bysetpos=3, dtstart=dt(2022,1,1)))
# last Saturday of the month
last_saturdays = list(rrule(MONTHLY, count=12, byweekday=SA, bysetpos=-1, dtstart=dt(2022,1,1)))
and then of course:
pd.DataFrame({'LAST_ST':last_saturdays}) #or whatever you need
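For the last Thursday asked about in the question, the same rrule pattern should work with the TH weekday constant. A minimal sketch; the start date and count are arbitrary sample values:

```python
from datetime import datetime
from dateutil.rrule import rrule, MONTHLY, TH

# bysetpos=-1 picks the last matching weekday within each monthly period.
last_thursdays = list(
    rrule(MONTHLY, count=3, byweekday=TH, bysetpos=-1, dtstart=datetime(2019, 1, 1))
)

# Format as YYYY-MM-DD strings, like the function in the question returns.
labels = [d.strftime("%Y-%m-%d") for d in last_thursdays]
print(labels)
```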
This answers the question Calculate Last Friday of Month in Pandas.
This can be modified by selecting the appropriate day of the week; here, freq='W-FRI'.
I think the easiest way is to create a pandas.DataFrame using pandas.date_range and specifying freq='W-FRI'.
W-FRI is Weekly Fridays
pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')
Creates all the Fridays in the date range between the min and max of the dates in df
Use a .groupby on year and month, and select .last(), to get the last Friday of every month for every year in the date range.
Because this method finds all the Fridays for every month in the range and then chooses .last() for each month, there's not an issue with trying to figure out which week of the month has the last Friday.
With this, use pandas: Boolean Indexing to find values in the Date column of the dataframe that are in last_fridays_in_daterange.
Use the .isin method to determine containment.
pandas: DateOffset objects
import pandas as pd
# test data: given a dataframe with a datetime column
df = pd.DataFrame({'Date': pd.date_range(start=pd.to_datetime('2014-01-01'), end=pd.to_datetime('2020-08-31'), freq='D')})
# create a dataframe with all Fridays in the date range between the min and max of df.Date
fridays = pd.DataFrame({'datetime': pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')})
# use groupby and last to get the last Friday of each month into a list
last_fridays_in_daterange = fridays.groupby([fridays.datetime.dt.year, fridays.datetime.dt.month]).last()['datetime'].tolist()
# find the data for the last Friday of the month
df[df.Date.isin(last_fridays_in_daterange)]
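Switching the weekday is a matter of changing the frequency alias. A minimal sketch of the same approach for Thursdays, using freq='W-THU'; the date range and the thursdays and last_thursdays names are sample choices:

```python
import pandas as pd

# All Thursdays in a sample date range.
thursdays = pd.Series(pd.date_range("2019-01-01", "2019-03-31", freq="W-THU"))

# Group by year and month and keep the latest Thursday of each group.
last_thursdays = thursdays.groupby([thursdays.dt.year, thursdays.dt.month]).max()

# Format the result as YYYY-MM-DD strings.
labels = [d.strftime("%Y-%m-%d") for d in last_thursdays]
print(labels)
```

The resulting values can be passed to .isin in the same way as last_fridays_in_daterange.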