I recently started using pandas and I am trying to teach myself how to train models. I have a dataset with end_time and start_time columns, and I am struggling to find the time elapsed, in seconds, between these two columns in the same row.
This is the code I tried:
[IN]
from datetime import datetime
from datetime import date
st = pd.to_datetime(df['start_time'], format='%Y-%m-%d')
et = pd.to_datetime(df['end_time'], format='%Y-%m-%d')
print((et-st).dt.days)*60*60*24
[OUT]
0 0
1 0
2 0
3 0
4 0
..
10000 0
Length: 10001, dtype: int64
I looked up other similar questions. Where this one differs is that the data comes from a CSV file. I can easily apply the steps from the other answers to dummy data, but they don't work in my case.
See the following. I fabricated some data; if you have a data example that reproduces the error, please feel free to add it to the question.
import pandas as pd
# Fabricated data: start times one day apart, end times 23 hours apart
df = pd.DataFrame({'start_time':pd.date_range('2015-01-01 01:00:00', periods=3), 'end_time':pd.date_range('2015-01-02 02:00:00', periods=3, freq='23H')})
st = pd.to_datetime(df['start_time'], format='%Y-%m-%d')
et = pd.to_datetime(df['end_time'], format='%Y-%m-%d')
# Subtracting two datetime columns gives a timedelta column
diff = et-st
# total_seconds() returns the full duration, not just the whole-day part
df['seconds'] = diff.dt.total_seconds()
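One possible reason for the all-zero output in the question is that `.dt.days` keeps only the whole-day part of each difference, so any interval shorter than 24 hours reports 0. Here is a minimal sketch of the contrast, using made-up timestamps (the values and the `%Y-%m-%d %H:%M:%S` format are assumptions, not data from the question):

```python
import pandas as pd

# Hypothetical data: one interval shorter than a day, one longer than a day
df = pd.DataFrame({
    "start_time": ["2015-01-01 01:00:00", "2015-01-02 01:00:00"],
    "end_time": ["2015-01-01 01:30:00", "2015-01-03 02:00:00"],
})

# Parse the full timestamp, including the time of day
st = pd.to_datetime(df["start_time"], format="%Y-%m-%d %H:%M:%S")
et = pd.to_datetime(df["end_time"], format="%Y-%m-%d %H:%M:%S")

elapsed = et - st
whole_days = elapsed.dt.days          # whole-day component only
seconds = elapsed.dt.total_seconds()  # full duration in seconds

print(whole_days.tolist())
print(seconds.tolist())
```

Note also that in the question the multiplication sits outside `print(...)`, so it is applied to the return value of `print` rather than to the day counts.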
Related
Here I got a pandas data frame with daily return of stocks and columns are date and return rate.
But if I only want to keep the last day of each week, and the data has some missing days, what can I do?
import pandas as pd
df = pd.read_csv('Daily_return.csv')
df.Date = pd.to_datetime(df.Date)
count = 300
for last_day in ('2017-01-01' + 7n for n in range(count)):
Actually, my brain stopped working at this point... Maybe the biggest problem is that the "+7n" approach is meaningless when some dates are missing.
I'll create a sample dataset with 40 dates and 40 sample returns, then sample 90 percent of that randomly to simulate the missing dates.
The key here is that you need to convert your date column into datetime if it isn't already, and make sure your df is sorted by the date.
Then you can groupby year/week and take the last value. If you run this repeatedly you'll see that the selected dates can change if the value dropped was the last day of the week.
Putting that together:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = pd.date_range(start='04-18-2022',periods=40, freq='D')
df['return'] = np.random.uniform(size=40)
# Keep 90 percent of the records so we can see what happens when some days are missing
df = df.sample(frac=.9)
# In case your dates are actually strings
df['date'] = pd.to_datetime(df['date'])
# Make sure they are sorted from oldest to newest
df = df.sort_values(by='date')
df = df.groupby([df['date'].dt.isocalendar().year,
df['date'].dt.isocalendar().week], as_index=False).last()
print(df)
Output
date return
0 2022-04-24 0.299958
1 2022-05-01 0.248471
2 2022-05-08 0.506919
3 2022-05-15 0.541929
4 2022-05-22 0.588768
5 2022-05-27 0.504419
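As a variation on the approach above, the week can also be expressed as a weekly period and the last available row kept with `tail(1)`. This is only a sketch with a small hand-made dataset (the dates and returns are hypothetical) so that the result is reproducible; it swaps the `isocalendar()` grouping for `dt.to_period("W")`, which groups Monday through Sunday:

```python
import pandas as pd

# Hypothetical daily returns with some days missing (2022-04-18 is a Monday)
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-04-18", "2022-04-20", "2022-04-22",  # week 1: Saturday and Sunday missing
        "2022-04-25", "2022-04-30", "2022-05-01",  # week 2: Sunday present
        "2022-05-03",                              # week 3: only Tuesday present
    ]),
    "return": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
})

# Sort from oldest to newest so the last row of each group is the latest date
df = df.sort_values("date")

# Weekly periods ending on Sunday, i.e. Monday-to-Sunday weeks
week = df["date"].dt.to_period("W")

# Keep the last available row in each week
last_per_week = df.groupby(week).tail(1)
print(last_per_week)
```

Because `tail(1)` returns the original rows, the kept dates are the actual last trading days present in the data, not the nominal week end.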
So, basically, I have two DataFrame columns containing dates. The initial content is in dd/mm/YYYY format, and I want to subtract one from the other. I can't subtract strings, so I converted them to datetime, but when I do that the day and month get swapped for some values (the string is read as mm/dd/YYYY), so the subtraction gives a wrong result. For example:
Initial content:
a: 05/09/2021
b: 30/09/2021
Expected result: 25 days.
Converted to datetime:
a: 2021-05-09
b: 2021-09-30 (for some reason this date stays the same)
Result: 144 days.
I'm using pandas and datetime for this project.
So, I would like to know how I can subtract these two columns and get the correct result.
--- Answer
When I used
pd.to_datetime(date, format="%d/%m/%Y")
it worked. Thank you all for your time. This is my first project in pandas. :)
df = pd.DataFrame({'Date1': ['05/09/2021'], 'Date2': ['30/09/2021']})
df = df.apply(lambda x:pd.to_datetime(x,format=r'%d/%m/%Y')).assign(Delta=lambda x: (x.Date2-x.Date1).dt.days)
print(df)
Date1 Date2 Delta
0 2021-09-05 2021-09-30 25
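To make the cause of the 144-day result concrete, here is a small sketch comparing an explicit day-first format with the default parsing of a single ambiguous string. The column names `a` and `b` are taken from the question's example; everything else is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": ["05/09/2021"], "b": ["30/09/2021"]})

# Explicit day-first format: 05/09/2021 is read as 5 September
a = pd.to_datetime(df["a"], format="%d/%m/%Y")
b = pd.to_datetime(df["b"], format="%d/%m/%Y")
delta_days = (b - a).dt.days

# Without a format, a single ambiguous string is read month-first (9 May)
ambiguous = pd.to_datetime("05/09/2021")

print(delta_days.tolist())
print(ambiguous.month, ambiguous.day)
```

The date 30/09/2021 cannot be read month-first because there is no month 30, which would explain why it appeared unchanged in the question while 05/09/2021 was silently swapped.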
I just answered a similar question about subtracting dates in Python:
from datetime import datetime
date_format_str = '%Y-%m-%d %H:%M:%S.%f'
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = datetime.strptime(date_1, date_format_str)
end = datetime.strptime(date_2, date_format_str)
# Interval between the two timestamps as a timedelta object
diff = end - start
# Convert the interval to hours
diff_in_hours = diff.total_seconds() / 3600
print(diff_in_hours)
# Difference between the two calendar dates (time of day ignored)
diff = end.date() - start.date()
print(diff.days)
Pandas
import pandas as pd
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = pd.to_datetime(date_1, format='%Y-%m-%d %H:%M:%S.%f')
end = pd.to_datetime(date_2, format='%Y-%m-%d %H:%M:%S.%f')
# Difference between the two datetimes as a Timedelta object
diff = end - start
# Whole days only; the remaining partial day is dropped
print(diff.days)
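The two snippets above can print different day counts for the same pair of timestamps, because one subtracts calendar dates and the other truncates the elapsed time to whole days. The following sketch shows the difference side by side; using `normalize()` to drop the time of day is my choice here, not something from the original answer:

```python
import pandas as pd

start = pd.to_datetime("2016-09-24 17:42:27.839496")
end = pd.to_datetime("2017-01-18 10:24:08.629327")

diff = end - start

# Whole 24-hour periods that have elapsed
elapsed_days = diff.days

# Difference between the calendar dates (both set to midnight first)
calendar_days = (end.normalize() - start.normalize()).days

# Full interval expressed in hours
hours = diff.total_seconds() / 3600

print(elapsed_days, calendar_days, hours)
```

Which of the two day counts is "right" depends on whether partial days should count, so it is worth deciding that before picking `.days` or a date subtraction.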
I have a csv file with a long timestamp column (years):
1990-05-12 14:01
.
.
1999-01-10 10:00
where the time is in hh:mm format. I'm trying to extract each day worth of data into a new csv file. Here's my code:
import datetime
import pandas as pd
df = pd.read_csv("/home/parallels/Desktop/ewh_log/hpwh_log.csv",parse_dates=True)
#change timestamp column format
def extract_months_data(df):
    df = pd.to_datetime(df['timestamp'])
    print(df)
def write_to_csv(df):
    print('writing ..')
    #todo
x1 = pd.to_datetime(df['timestamp'],format='%m-%d %H:%M').notnull().all()
if (x1)==True:
    extract_months_data(df)
else:
    x2 = pd.to_datetime(df['timestamp'])
    x2 = x2.dt.strftime('%m-%d %H:%M')
    write_to_csv(df)
The issue is that when I get to the following line
def extract_months_data(df):
    df = pd.to_datetime(df['timestamp'])
I get the following error:
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime
Is there an alternative way to do this with pandas without discarding the rest of the data? I saw posts that suggested using errors='coerce', but that replaces the rest of the data with NaT.
Thanks
UPDATE:
This post answers half of the question, namely how to filter hours (or minutes) out of a timestamp column. The second part is how to extract a full day into another CSV file. I'll post updates here once I get to a solution.
You are converting to datetime twice, which is not needed.
Something like this should work:
import pandas as pd
df = pd.read_csv('data.csv')
df['month_data'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M')
df['month_data'] = df['month_data'].dt.strftime('%m-%d %H:%M')
# If you don't want rows where month_data is NaN
df = df[df['month_data'].notna()]
print(df)
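For the second half of the question (writing each day's rows to its own CSV file), one option is to group by the calendar date and call `to_csv` per group. This is a rough sketch with made-up rows; the `value` column, the temporary output directory and the `<date>.csv` file naming are all assumptions for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical log data in the question's "YYYY-MM-DD HH:MM" layout
df = pd.DataFrame({
    "timestamp": ["1990-05-12 14:01", "1990-05-12 15:30", "1990-05-13 09:00"],
    "value": [1, 2, 3],
})
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M")

# Assumed output location: a fresh temporary directory
out_dir = Path(tempfile.mkdtemp())

written = []
# Group rows by calendar date and write one file per day
for day, day_df in df.groupby(df["timestamp"].dt.date):
    path = out_dir / f"{day}.csv"
    day_df.to_csv(path, index=False)
    written.append(path.name)

print(written)
```

Passing the full format to `pd.to_datetime` keeps the year in the parsed value, so every row is retained and nothing has to be coerced to NaT.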
I have a dataset and I need to calculate working days from a given date to today, excluding the given list of holidays. I will be including weekends.
Date Sample:
This is the code I tried:
import pandas as pd
import numpy as np
from datetime import date
df = pd.read_excel('C:\\sample.xlsx')
#get todays date
df["today"] = date.today()
#Convert data type
start = df["R_REL_DATE"].values.astype('datetime64[D]')
end = df["today"].values.astype('datetime64[D]')
holiday = ['2021-06-19', '2021-06-20']
#Numpy function to find in between days
days = np.busday_count(start, end, weekmask='1111111', holidays=holiday)
#Add this column to dataframe
df["Days"] = days
df
When I run this code, it gives the difference between R_REL_DATE and today, but doesn't subtract the given holidays.
Please help. I want the given list of holidays deducted from the day count.
Make sure today and R_REL_DATE are in pandas datetime format with pd.to_datetime():
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame({'R_REL_DATE': {0: '7/23/2020', 1: '8/26/2020'},
'DAYS IN QUEUE': {0: 338, 1: 304}})
df["today"] = pd.to_datetime(datetime.date.today())
df["R_REL_DATE"] = pd.to_datetime(df["R_REL_DATE"])
start = df["R_REL_DATE"].values.astype('datetime64[D]')
end = df["today"].values.astype('datetime64[D]')
holiday = ['2021-06-19', '2021-06-20']
#Numpy function to find in between days
days = np.busday_count(start, end, weekmask='1111111', holidays=holiday)
#Add this column to dataframe
df["Days"] = days - 1
df
Out[1]:
R_REL_DATE DAYS IN QUEUE today Days
0 2020-07-23 338 2021-06-27 336
1 2020-08-26 304 2021-06-27 302
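To check that the holidays really are deducted, it helps to pin the end date instead of using today's date. The sketch below fixes the end at 2021-06-27 (the date shown in the output above) and compares the count with and without the holiday list. `np.busday_count` includes the start date and excludes the end date, which may be why the answer subtracts 1 afterwards; that reading is my assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"R_REL_DATE": ["7/23/2020", "8/26/2020"]})

# Convert to NumPy day-resolution dates
start = pd.to_datetime(df["R_REL_DATE"], format="%m/%d/%Y").values.astype("datetime64[D]")

# Fixed end date so the result is reproducible
end = np.datetime64("2021-06-27")

holidays = ["2021-06-19", "2021-06-20"]

# weekmask '1111111' counts every day of the week, weekends included
all_days = np.busday_count(start, end, weekmask="1111111")
minus_holidays = np.busday_count(start, end, weekmask="1111111", holidays=holidays)

print(all_days.tolist())
print(minus_holidays.tolist())
```

The two results differ by exactly two days per row, one for each holiday that falls inside the range.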
I need to calculate the difference in hours between two dates (format: year-month-dayTHH:MM:SS; I could also transform the data to year-month-day HH:MM:SS) from a huge Excel file. What is the most efficient way to do it in Python? I have tried a datetime/time object (TypeError: expected string or buffer), Timestamp (ValueError) and DataFrame (does not give the result in hours).
Excel File:
Order_Date Received_Customer Column3
2000-10-06T13:00:58 2000-11-06T13:00:58 1
2000-10-21T15:40:15 2000-12-27T10:09:29 2
2000-10-23T10:09:29 2000-10-26T10:09:29 3
..... ....
Datetime/time object code (TypeError: expected string or buffer):
import pandas as pd
import time as t
data=pd.read_excel('/path/file.xlsx')
s1 = (data,['Order_Date'])
s2 = (data,['Received_Customer'])
s1Time = t.strptime(s1, "%Y:%m:%d:%H:%M:%S")
s2Time = t.strptime(s2, "%Y:%m:%d:%H:%M:%S")
deltaInHours = (t.mktime(s2Time) - t.mktime(s1Time))
print deltaInHours, "hours"
Timestamp (ValueError) code:
import pandas as pd
import datetime as dt
data=pd.read_excel('/path/file.xlsx')
df = pd.DataFrame(data,columns=['Order_Date','Received_Customer'])
df.to = [pd.Timestamp('Order_Date')]
df.fr = [pd.Timestamp('Received_Customer')]
(df.fr-df.to).astype('timedelta64[h]')
DataFrame (does not return the desired result)
import pandas as pd
data=pd.read_excel('/path/file.xlsx')
df = pd.DataFrame(data,columns=['Order_Date','Received_Customer'])
df['Order_Date'] = pd.to_datetime(df['Order_Date'])
df['Received_Customer'] = pd.to_datetime(df['Received_Customer'])
answer = df.dropna()['Order_Date'] - df.dropna()['Received_Customer']
answer.astype('timedelta64[h]')
print(answer)
Output:
0 24 days 16:38:07
1 0 days 00:00:00
2 20 days 12:39:52
dtype: timedelta64[ns]
Should be something like this:
0 592 hour
1 0 hour
2 492 hour
Is there another way to convert timedelta64[ns] into hours than answer.astype('timedelta64[h]')?
In each of your attempts you mixed up data types and methods. I don't have the time to explain each mistake in detail, but I want to help by providing a (probably non-optimal) solution.
I built the solution from your previous attempts and combined it with knowledge from other questions such as:
Convert a timedelta to days, hours and minutes
Get total number of hours from a Pandas Timedelta?
Note that I used Python 3. I hope my solution points you in the right direction. Here it is:
import pandas as pd
import numpy as np
data = pd.read_excel('C:\\Users\\nrieble\\Desktop\\check.xlsx',header=0)
# Parse each column, skipping empty or very short cells
start = [pd.to_datetime(e) for e in data['Order_Date'] if len(str(e))>4]
end = [pd.to_datetime(e) for e in data['Received_Customer'] if len(str(e))>4]
# Element-wise difference between the end and start timestamps
delta = np.asarray(end)-np.asarray(start)
# Dividing a timedelta by one hour gives the difference in hours
deltainhours = [e/np.timedelta64(1, 'h') for e in delta]
print (deltainhours, "hours")
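As a vectorized alternative to the list comprehensions above, the subtraction can be done on whole columns and converted with `dt.total_seconds()`. This sketch builds the DataFrame from the three sample rows in the question instead of reading the Excel file, and the `hours` column name is my own:

```python
import pandas as pd

# Sample rows copied from the question's Excel excerpt
df = pd.DataFrame({
    "Order_Date": ["2000-10-06T13:00:58", "2000-10-21T15:40:15", "2000-10-23T10:09:29"],
    "Received_Customer": ["2000-11-06T13:00:58", "2000-12-27T10:09:29", "2000-10-26T10:09:29"],
})

# Parse the ISO-style strings, including the literal 'T' separator
order = pd.to_datetime(df["Order_Date"], format="%Y-%m-%dT%H:%M:%S")
received = pd.to_datetime(df["Received_Customer"], format="%Y-%m-%dT%H:%M:%S")

# Column-wise difference, converted from seconds to hours
df["hours"] = (received - order).dt.total_seconds() / 3600

print(df["hours"])
```

Unlike the `astype('timedelta64[h]')` call in the question, whose result was never assigned, this stores the hours as an ordinary float column and keeps fractional hours.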