I have a column that came from Excel that is supposed to contain durations in hours, for example 02:00:00.
It works fine as long as all durations are less than 24:00, but if one exceeds that, it appears in pandas as 1900-01-03 08:00:00 (i.e. as a datetime);
as a result, the column's datatype is dtype('O').
import datetime
import pandas as pd

df = pd.DataFrame({'duration': [datetime.time(2, 0), datetime.time(2, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.time(1, 0), datetime.time(1, 0)]})
# Output
duration
0 02:00:00
1 02:00:00
2 1900-01-03 08:00:00
3 1900-01-03 08:00:00
4 1900-01-03 08:00:00
5 1900-01-03 08:00:00
6 1900-01-03 08:00:00
7 1900-01-03 08:00:00
8 01:00:00
9 01:00:00
But if I try to convert the column to either time or datetime, I always get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
If I don't fix this, all durations greater than 24:00 are lost.
Your problem lies in the engine that reads the Excel file. It converts cells that have a certain format (e.g. [h]:mm:ss or hh:mm:ss) to datetime.datetime or datetime.time objects. Those then get transferred into the pandas DataFrame, so it's not actually a pandas problem.
Before you start hacking the Excel reader engine, it might be easier to tackle the issue in Excel itself. Here's a small sample file;
You can download it here.
duration is auto-formatted by Excel; duration_text is what you get if you set the column format to 'text' before entering the values; duration_to_text is what you get if you change the format to text after Excel has auto-formatted the values (first column).
Now you have everything you need after import with pandas:
df = pd.read_excel('path_to_file')
df
duration duration_text duration_to_text
0 12:30:00 12:30:00 0.520833
1 1900-01-01 00:30:00 24:30:00 1.020833
# now you can parse to timedelta:
pd.to_timedelta(df['duration_text'], errors='coerce')
0 0 days 12:30:00
1 1 days 00:30:00
Name: duration_text, dtype: timedelta64[ns]
# or
pd.to_timedelta(df['duration_to_text'], unit='d', errors='coerce')
0 0 days 12:29:59.999971200 # note the precision issue ;-)
1 1 days 00:29:59.999971200
Name: duration_to_text, dtype: timedelta64[ns]
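If the sub-second leftovers from the fractional-day representation bother you, you can round the result with `Series.dt.round` (a sketch with the fractional-day values hard-coded instead of read from the file):

```python
import pandas as pd

# fractional days, as Excel stores durations internally
s = pd.Series([0.520833, 1.020833])

# parse as days, then round away the float error
td = pd.to_timedelta(s, unit='d').dt.round('s')
print(td)
# 0   0 days 12:30:00
# 1   1 days 00:30:00
```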
Another viable option is to save the Excel file as a CSV and import that into a pandas DataFrame. The sample xlsx used above would then look like this, for example.
If you have no other option than to re-process in pandas, you could treat datetime.time objects and datetime.datetime objects separately, e.g.
import datetime
# where you have datetime (incorrect from excel)
m = [isinstance(i, datetime.datetime) for i in df.duration]
# convert to timedelta where it's possible
df['timedelta'] = pd.to_timedelta(df['duration'].astype(str), errors='coerce')
# where you have datetime, some special treatment is needed...
df.loc[m, 'timedelta'] = df.loc[m, 'duration'].apply(lambda t: pd.Timestamp(str(t)) - pd.Timestamp('1899-12-31'))
df['timedelta']
0 0 days 12:30:00
1 1 days 00:30:00
Name: timedelta, dtype: timedelta64[ns]
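Put together as a self-contained sketch (with the sample values typed in directly rather than read from the xlsx):

```python
import datetime
import pandas as pd

df = pd.DataFrame({'duration': [datetime.time(12, 30),
                                datetime.datetime(1900, 1, 1, 0, 30)]})

# mask: rows where Excel produced a full datetime instead of a time
m = [isinstance(i, datetime.datetime) for i in df['duration']]

# plain times parse directly as timedeltas; datetimes coerce to NaT here
df['timedelta'] = pd.to_timedelta(df['duration'].astype(str), errors='coerce')

# datetimes: subtract Excel's day-zero (1899-12-31) to recover the duration
df.loc[m, 'timedelta'] = df.loc[m, 'duration'].apply(
    lambda t: pd.Timestamp(str(t)) - pd.Timestamp('1899-12-31'))
print(df['timedelta'])
# 0   0 days 12:30:00
# 1   1 days 00:30:00
```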
IIUC, use pd.to_timedelta.
Set up an MRE:
df = pd.DataFrame({'duration': ['43:24:57', '22:12:52', '-', '78:41:33']})
print(df)
# Output
duration
0 43:24:57
1 22:12:52
2 -
3 78:41:33
df['duration'] = pd.to_timedelta(df['duration'], errors='coerce')
print(df)
# Output
duration
0 1 days 19:24:57
1 0 days 22:12:52
2 NaT
3 3 days 06:41:33
Update
@MrFuppes' Excel file is exactly what I have in my column 'duration'.
Try:
import numpy as np

df['duration'] = np.where(df['duration'].apply(len) == 8,
                          '1899-12-31 ' + df['duration'], df['duration'])
df['duration'] = (pd.to_datetime(df['duration'], errors='coerce')
                  - pd.Timestamp('1899-12-31'))
print(df)
# Output (with a slightly modified example from @MrFuppes)
duration
0 0 days 12:30:00
1 1 days 00:30:00
2 NaT
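As a self-contained sketch (string durations typed in directly; the second value stands in for what Excel's auto-formatting makes of 24:30:00, and '-' for an unparseable cell):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'duration': ['12:30:00', '1900-01-01 00:30:00', '-']})

# prefix bare HH:MM:SS values (length 8) with Excel's day-zero date
df['duration'] = np.where(df['duration'].apply(len) == 8,
                          '1899-12-31 ' + df['duration'], df['duration'])

# parse, then subtract day-zero to turn the result into a timedelta
df['duration'] = (pd.to_datetime(df['duration'], errors='coerce')
                  - pd.Timestamp('1899-12-31'))
print(df)
# 0   0 days 12:30:00
# 1   1 days 00:30:00
# 2               NaT
```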
Related
I have a table with a DateTime column as shown below. The time interval is in hours.
ID TimeInterval Temperature
1 00:00:00 27
2 01:00:00 26
3 02:00:00 24
4 03:00:00 24
5 04:00:00 25
I tried to use the time interval for plotting, but I got an error:
float() argument must be a string or a number, not 'datetime.time'
So I want to extract the first two digits of the TimeInterval column into a new column.
Any ideas how to extract that?
You can use strftime with %H for the hours:
# Create test data
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'TimeInterval': ['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00'], 'Temperature': [27, 26, 24, 24, 25]})
# Change the format to %H:%M:%S.
df['TimeInterval'] = pd.to_datetime(df['TimeInterval'], format='%H:%M:%S')
# Create a new column
df['new'] = df['TimeInterval'].dt.strftime('%H')
output:
ID TimeInterval Temperature new
1 1900-01-01 00:00:00 27 00
2 1900-01-01 01:00:00 26 01
3 1900-01-01 02:00:00 24 02
4 1900-01-01 03:00:00 24 03
5 1900-01-01 04:00:00 25 04
If the "TimeInterval" column is a string ...
you can select the first 2 characters from it and parse them into an integer:
df["new"] = df["TimeInterval"].str[:2].astype(int)
But that would be an ugly solution.
Here is a better one:
# test data
df = pd.DataFrame([{'TimeInterval':'00:00:00'},
{'TimeInterval':'01:00:00'}])
# cast it to a datetime object
df['TimeInterval'] = pd.to_datetime(df['TimeInterval'], format='%H:%M:%S')
# select the hours
df['hours'] = df['TimeInterval'].dt.hour
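If the column already holds `datetime.time` objects (which is what the error in the question suggests) rather than strings, there is nothing to parse: you can read the `.hour` attribute off each object directly (a sketch assuming that object dtype):

```python
import datetime
import pandas as pd

df = pd.DataFrame({'TimeInterval': [datetime.time(0, 0),
                                    datetime.time(1, 0),
                                    datetime.time(2, 0)]})

# pull the hour straight off each datetime.time object
df['hours'] = [t.hour for t in df['TimeInterval']]
print(df['hours'].tolist())  # [0, 1, 2]
```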
I have a data frame with a time column holding minutes from 0-1439, i.e. the 1440 minutes of a day. I want to add a datetime column representing the day 2021-3-21, including hh and mm, like 1980-03-01 11:00. I tried the following code:
from datetime import datetime, timedelta
date = datetime.date(2021, 3, 21)
days = date - datetime.date(1900, 1, 1)
df['datetime'] = pd.to_datetime(df['time'],format='%H:%M:%S:%f') + pd.to_timedelta(days, unit='d')
But I get the error descriptor 'date' requires a 'datetime.datetime' object but received a 'int'.
Is there another way to solve this problem, or a fix for this code? Please help me figure this out.
>>df
time
0
1
2
3
..
1439
I want to convert these minutes to the format 1980-03-01 11:00, using the date 2021-3-21 and turning the minutes into the hh:mm part. The dataframe will look like:
>df
datetime time
2021-3-21 00:00 0
2021-3-21 00:01 1
2021-3-21 00:02 2
...
How can I format my data in this way?
Let's try pd.to_timedelta instead to get the duration in minutes from time, then add a Timestamp:
df['datetime'] = (
pd.Timestamp('2021-3-21') + pd.to_timedelta(df['time'], unit='m')
)
df.head():
time datetime
0 0 2021-03-21 00:00:00
1 1 2021-03-21 00:01:00
2 2 2021-03-21 00:02:00
3 3 2021-03-21 00:03:00
4 4 2021-03-21 00:04:00
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': np.arange(0, 1440)})
df['datetime'] = (
pd.Timestamp('2021-3-21') + pd.to_timedelta(df['time'], unit='m')
)
print(df)
from datetime import datetime, timedelta
import numpy as np
import pandas as pd

jan_21 = [datetime(2021, 1, 1) + timedelta(hours=i) for i in range(5)]
jan_21
[datetime.datetime(2021, 1, 1, 0, 0),
datetime.datetime(2021, 1, 1, 1, 0),
datetime.datetime(2021, 1, 1, 2, 0),
datetime.datetime(2021, 1, 1, 3, 0),
datetime.datetime(2021, 1, 1, 4, 0)]
prices = np.random.randint(1,100,size=(5,))
prices
[46 23 13 26 52]
df = pd.DataFrame({'datetime':jan_21, 'price':prices})
df
datetime price
0 2021-01-01 00:00:00 83
1 2021-01-01 01:00:00 60
2 2021-01-01 02:00:00 29
3 2021-01-01 03:00:00 97
4 2021-01-01 04:00:00 67
All good so far; this is how I expected the dataframe and datetime values to be displayed. The problem comes when I save the dataframe to an Excel file and then read it back into a dataframe: the datetime values get messed up.
df.to_excel('price_data.xlsx', index=False)
new_df = pd.read_excel('price_data.xlsx')
new_df
datetime price
0 2021-01-01 00:00:00.000000 83
1 2021-01-01 00:59:59.999999 60
2 2021-01-01 02:00:00.000001 29
3 2021-01-01 03:00:00.000000 97
4 2021-01-01 03:59:59.999999 67
I'd like df == new_df to evaluate to True
Against the backdrop of the likely cause of the issue (see sophros' answer), what you could do to superficially circumvent the problem is convert the cells of df["datetime"] to strings before producing the Excel file, then convert the strings back to datetime after new_df has been created:
df["datetime"] = df["datetime"].dt.strftime("%m/%d/%Y, %H:%M:%S")
df.to_excel('price_data.xlsx', index=False)
new_df = pd.read_excel('price_data.xlsx')
new_df["datetime"] = pd.to_datetime(new_df["datetime"], format="%m/%d/%Y, %H:%M:%S")
The reason for the differences in the time part, like 00:59:59.999999, 02:00:00.000001 and 03:59:59.999999, is most likely a slightly different binary representation of date/time types in Excel and Python/pandas.
The time is frequently stored as a float, but implementations differ in where the zero point lies (e.g. year 1 AD, or 1970 as in Linux; good explanation here). Therefore the conversion may lose some least-significant parts of the date/time, and there is not much you can do about it but round it or use approximate comparisons, as with any float.
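Since the round-trip error is sub-microsecond, rounding after the read is usually enough to make the comparison pass. A sketch that simulates the off-by-a-microsecond values rather than writing an actual xlsx:

```python
import pandas as pd

original = pd.Series(pd.to_datetime(['2021-01-01 01:00:00',
                                     '2021-01-01 02:00:00']))

# the kind of values that can come back from the Excel round trip
roundtrip = pd.Series(pd.to_datetime(['2021-01-01 00:59:59.999999',
                                      '2021-01-01 02:00:00.000001']))

# round to the nearest second to absorb the float error
print((roundtrip.dt.round('s') == original).all())  # True
```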
We have a Pandas dataframe with start_date and end_date columns (input format given below). I need to check whether a given input time range falls between start_date and end_date.
For example, if the time range is 09:30-10:30, the output should be the first row (student1), and if the time range is 16:00-17:30, the output should be the second row (student2). Please guide me on how I can achieve this.
Input Dataframe:
name start_date end_date
0 student1 2020-08-30 09:00:00 2020-08-30 10:00:00
1 student2 2020-08-30 15:00:00 2020-08-30 18:00:00
2 student3 2020-08-30 11:00:00 2020-08-30 12:30:00
Supposing your date columns are in datetime format:
from datetime import datetime
start_time = datetime(2020, 1, 1, 9, 30)
end_time = datetime(2020, 1, 1, 10, 30)
df['start_date'].dt.time.le(end_time.time()) & df['end_date'].dt.time.ge(start_time.time())
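A self-contained sketch with the sample data, using that mask to filter the rows (only the `datetime.time` parts are compared, so the dates drop out):

```python
from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    'name': ['student1', 'student2', 'student3'],
    'start_date': pd.to_datetime(['2020-08-30 09:00:00',
                                  '2020-08-30 15:00:00',
                                  '2020-08-30 11:00:00']),
    'end_date': pd.to_datetime(['2020-08-30 10:00:00',
                                '2020-08-30 18:00:00',
                                '2020-08-30 12:30:00']),
})

start_time = datetime(2020, 1, 1, 9, 30)
end_time = datetime(2020, 1, 1, 10, 30)

# rows whose [start, end] interval overlaps the queried time range
mask = (df['start_date'].dt.time.le(end_time.time())
        & df['end_date'].dt.time.ge(start_time.time()))
print(df[mask])  # only student1 matches 09:30-10:30
```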
I am trying to construct a datetime column in Pandas that represents multiple columns describing the year, month, day, etc. Most of the other answers I can find on this topic involve processing data in the opposite direction (from datetime to integer hour, for instance).
df = pd.DataFrame()
df['year'] = [2019, 2019, 2019, 2019, 2019, 2019]
df['month'] = [8, 8, 8, 8, 8, 8]
df['day'] = [1, 1, 1, 1, 1, 1]
df['hour'] = [10,10,11,11,12,12]
df['minute'] = [15,45,20,40,10,50]
df['second'] = [0, 1, 5, 10, 10, 11]
Goal:
df['datetime_val'] =
0 2019-08-01 10:15:00
1 2019-08-01 10:45:01
2 2019-08-01 11:20:05
3 2019-08-01 11:40:10
4 2019-08-01 12:10:10
5 2019-08-01 12:50:11
Name: datetime_vals, dtype: datetime64[ns]
In the example above, how could I rapidly create a datetime column representing the constituent time information? I could easily do this with .apply() and a helper function but I envision performing this operation for millions of rows. I would love something inbuilt / vectorized. Thanks!
IIUC, to_datetime can take a whole dataframe, but only if the columns are named like yours:
pd.to_datetime(df)
0 2019-08-01 10:15:00
1 2019-08-01 10:45:01
2 2019-08-01 11:20:05
3 2019-08-01 11:40:10
4 2019-08-01 12:10:10
5 2019-08-01 12:50:11
dtype: datetime64[ns]
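As a runnable snippet (a single sample row; the column names year/month/day/hour/minute/second are exactly what pd.to_datetime expects when handed a whole DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'year': [2019], 'month': [8], 'day': [1],
                   'hour': [10], 'minute': [15], 'second': [0]})

# assemble one datetime per row from the component columns
dt = pd.to_datetime(df)
print(dt.iloc[0])  # 2019-08-01 10:15:00
```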
After reading through this comparison of string concatenation methods for pandas dataframes, it looks like you could benefit from using df.assign:
df.assign(datetime_val=[
    f"{year}-{month}-{day} {hour}:{minute}:{second}"
    for year, month, day, hour, minute, second
    in zip(df['year'], df['month'], df['day'],
           df['hour'], df['minute'], df['second'])
])
EDIT2:
My method does not return datetime64 objects, as pointed out below by Andy L. In fact, method 3 becomes incredibly slow when swapping the strings out for datetime objects. The method 1 vs. method 2 comparison is still valid, however.
EDIT:
Did some testing to compare the three methods presented here
You may convert the whole df to str, use agg to concatenate the strings, and parse with the format parameter of pd.to_datetime:
df = df.astype(str)
pd.to_datetime(df.agg('-'.join, axis=1), format='%Y-%m-%d-%H-%M-%S')
Out[170]:
0 2019-08-01 10:15:00
1 2019-08-01 10:45:01
2 2019-08-01 11:20:05
3 2019-08-01 11:40:10
4 2019-08-01 12:10:10
5 2019-08-01 12:50:11
dtype: datetime64[ns]