Pandas problem with a column with mixed time and date time - python

I have a column that came from Excel and is supposed to contain durations in hours (example: 02:00:00).
It works well if all of these durations are less than 24:00, but if one is longer than that, it appears in pandas as 1900-01-03 08:00:00 (a datetime).
As a result, the column's datatype is dtype('O').
df = pd.DataFrame({'duration':[datetime.time(2, 0), datetime.time(2, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0),
datetime.datetime(1900, 1, 3, 8, 0), datetime.time(1, 0),
datetime.time(1, 0)]})
# Output
duration
0 02:00:00
1 02:00:00
2 1900-01-03 08:00:00
3 1900-01-03 08:00:00
4 1900-01-03 08:00:00
5 1900-01-03 08:00:00
6 1900-01-03 08:00:00
7 1900-01-03 08:00:00
8 01:00:00
9 01:00:00
But if I try to convert the column to either time or datetime, I always get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
As it stands, if I don't fix this, all the durations greater than 24:00 are lost.

Your problem lies in the engine that reads the Excel file. It converts cells that have a certain format (e.g. [h]:mm:ss or hh:mm:ss) to datetime.datetime or datetime.time objects. Those then get transferred into the pandas DataFrame, so it's not actually a pandas problem.
Before you start hacking the Excel reader engine, it might be easier to tackle the issue in Excel. Here's a small sample file.
You can download it here.
The sample file has three columns:
duration is auto-formatted by Excel.
duration_text is what you get if you set the column format to 'text' before you enter the values.
duration_to_text is what you get if you change the format to text after Excel has auto-formatted the values (first column).
Now you have everything you need after import with pandas:
df = pd.read_excel('path_to_file')
df
duration duration_text duration_to_text
0 12:30:00 12:30:00 0.520833
1 1900-01-01 00:30:00 24:30:00 1.020833
# now you can parse to timedelta:
pd.to_timedelta(df['duration_text'], errors='coerce')
0 0 days 12:30:00
1 1 days 00:30:00
Name: duration_text, dtype: timedelta64[ns]
# or
pd.to_timedelta(df['duration_to_text'], unit='d', errors='coerce')
0 0 days 12:29:59.999971200 # note the precision issue ;-)
1 1 days 00:29:59.999971200
Name: duration_to_text, dtype: timedelta64[ns]
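If the float column is the only one available, the precision noise shown above can probably be removed by rounding to whole seconds. A minimal sketch, assuming the durations were entered with at most one-second resolution; the values are copied from the sample output and the variable names are only illustrative:

```python
import pandas as pd

# Excel day fractions as they appear in the duration_to_text column above
day_fractions = pd.Series([0.520833, 1.020833], name='duration_to_text')

# Interpret the floats as days, then round away the sub-second float noise
durations = pd.to_timedelta(day_fractions, unit='D').dt.round('s')
print(durations)
```

Rounding is only safe when you know the original resolution; durations that really carry fractions of a second would be altered by it.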
Another viable option could be to save the Excel file as a CSV and import that into a pandas DataFrame. The sample xlsx used above would then look like this, for example.
If you have no other option than to re-process in pandas, you could handle datetime.time objects and datetime.datetime objects separately, e.g.
import datetime
# where you have datetime (incorrect from excel)
m = [isinstance(i, datetime.datetime) for i in df.duration]
# convert to timedelta where it's possible
df['timedelta'] = pd.to_timedelta(df['duration'].astype(str), errors='coerce')
# where you have datetime, some special treatment is needed...
df.loc[m, 'timedelta'] = df.loc[m, 'duration'].apply(lambda t: pd.Timestamp(str(t)) - pd.Timestamp('1899-12-31'))
df['timedelta']
0 0 days 12:30:00
1 1 days 00:30:00
Name: timedelta, dtype: timedelta64[ns]
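The same idea can also be written without the round trip through strings, by converting each cell according to its type. This is only a sketch: the helper name to_duration and the constant EXCEL_ZERO are made up for illustration, and the reference date 1899-12-31 is simply the one used in the snippet above.

```python
import datetime
import pandas as pd

# Reference date taken from the answer above (assumption: it matches your file)
EXCEL_ZERO = datetime.datetime(1899, 12, 31)

def to_duration(value):
    """Convert a datetime.datetime or datetime.time cell to a Timedelta."""
    if isinstance(value, datetime.datetime):
        # Durations of 24 h or more arrive as a date relative to EXCEL_ZERO
        return pd.Timedelta(value - EXCEL_ZERO)
    if isinstance(value, datetime.time):
        # Durations under 24 h arrive as a plain time of day
        return pd.Timedelta(hours=value.hour, minutes=value.minute,
                            seconds=value.second, microseconds=value.microsecond)
    return pd.NaT  # anything else (text, empty cells) becomes NaT

df = pd.DataFrame({'duration': [datetime.time(2, 0),
                                datetime.datetime(1900, 1, 3, 8, 0),
                                datetime.time(1, 0),
                                '-']})
df['timedelta'] = pd.to_timedelta(df['duration'].apply(to_duration))
print(df)
```

With the data from the question, 1900-01-03 08:00:00 comes out as 3 days 08:00:00, i.e. 80 hours.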

IIUC, use pd.to_timedelta:
Set up an MRE:
df = pd.DataFrame({'duration': ['43:24:57', '22:12:52', '-', '78:41:33']})
print(df)
# Output
duration
0 43:24:57
1 22:12:52
2 -
3 78:41:33
df['duration'] = pd.to_timedelta(df['duration'], errors='coerce')
print(df)
# Output
duration
0 1 days 19:24:57
1 0 days 22:12:52
2 NaT
3 3 days 06:41:33
Update
@MrFuppes' Excel file is exactly what I have in my column 'duration'
Try:
df['duration'] = np.where(df['duration'].apply(len) == 8,
'1899-12-31 ' + df['duration'], df['duration'])
df['duration'] = pd.to_datetime(df['duration'], errors='coerce') \
- pd.Timestamp('1899-12-31')
print(df)
# Output (with a slightly modified version of @MrFuppes' example)
duration
0 0 days 12:30:00
1 1 days 00:30:00
2 NaT

Related

Get the first 2 items from a column to make another Column Pandas

I have a table with a time column, as shown below. The time interval is in hours.
ID TimeInterval Temperature
1 00:00:00 27
2 01:00:00 26
3 02:00:00 24
4 03:00:00 24
5 04:00:00 25
I tried to use the time interval for plotting. However, I got an error:
float() argument must be a string or a number, not 'datetime.time'
So I want to extract the first two digits of the TimeInterval column and put them into a new column.
Any ideas how to extract them?
You can use strftime with %H for hours:
# Create test data
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'TimeInterval': ['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00'], 'Temperature': [27, 26, 24, 24, 25]})
# Parse the strings as datetimes using the format %H:%M:%S
df['TimeInterval'] = pd.to_datetime(df['TimeInterval'], format='%H:%M:%S')
# Create a new column
df['new'] = df['TimeInterval'].dt.strftime('%H')
output:
ID TimeInterval Temperature new
1 1900-01-01 00:00:00 27 00
2 1900-01-01 01:00:00 26 01
3 1900-01-01 02:00:00 24 02
4 1900-01-01 03:00:00 24 03
5 1900-01-01 04:00:00 25 04
If the "TimeInterval" column contains strings, you can select the first 2 characters and convert them to integers:
df["new"] = df["TimeInterval"].str[:2].astype(int)
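But that would be an ugly solution, because it depends on the exact string layout.
Here is a better solution, which parses the column and reads the hour from the datetime: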
# test data
df = pd.DataFrame([{'TimeInterval':'00:00:00'},
{'TimeInterval':'01:00:00'}])
# cast it to a datetime object
df['TimeInterval'] = pd.to_datetime(df['TimeInterval'], format='%H:%M:%S')
# select the hours
df['hours'] = df['TimeInterval'].dt.hour
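The error message in the question suggests that the column may actually hold datetime.time objects rather than strings. In that case neither .str nor pd.to_datetime with a format applies directly, but each object already exposes its hour. A small sketch under that assumption (the test data is made up to mirror the table in the question):

```python
import datetime
import pandas as pd

# Assumed input: the column holds datetime.time objects, not strings
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'TimeInterval': [datetime.time(h, 0) for h in range(5)],
                   'Temperature': [27, 26, 24, 24, 25]})

# Read the hour attribute of every datetime.time object
df['hour'] = df['TimeInterval'].map(lambda t: t.hour)
print(df)
```

The resulting integer column can be passed to a plotting function as the x axis.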

Adding a datetime column in pandas dataframe from minute values

I have a data frame with a time column holding minutes from 0-1339, meaning 1440 minutes of a day. I want to add a datetime column representing the day 2021-3-21, including hh and mm, like this: 1980-03-01 11:00. I tried the following code:
from datetime import datetime, timedelta
date = datetime.date(2021, 3, 21)
days = date - datetime.date(1900, 1, 1)
df['datetime'] = pd.to_datetime(df['time'],format='%H:%M:%S:%f') + pd.to_timedelta(days, unit='d')
But I get this error: descriptor 'date' requires a 'datetime.datetime' object but received a 'int'
Is there another way to solve this problem, or a way to fix this code?
>>df
time
0
1
2
3
..
1339
I want to convert these minutes to a format like 1980-03-01 11:00, using the date 2021-3-21 and converting the minutes to the hh:mm part. The dataframe should look like this:
>df
datetime time
2021-3-21 00:00 0
2021-3-21 00:01 1
2021-3-21 00:02 2
...
How can I format my data in this way?
Let's use pd.to_timedelta instead to turn the minutes in time into durations, then add a Timestamp:
df['datetime'] = (
pd.Timestamp('2021-3-21') + pd.to_timedelta(df['time'], unit='m')
)
df.head():
time datetime
0 0 2021-03-21 00:00:00
1 1 2021-03-21 00:01:00
2 2 2021-03-21 00:02:00
3 3 2021-03-21 00:03:00
4 4 2021-03-21 00:04:00
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': np.arange(0, 1440)})
df['datetime'] = (
pd.Timestamp('2021-3-21') + pd.to_timedelta(df['time'], unit='m')
)
print(df)
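If the goal is the display format from the question (date plus hh:mm, without seconds), the datetime column can additionally be rendered as text. A sketch, with the column name datetime_str chosen only for illustration; note that the result is a string column, so keep the datetime column for any further time arithmetic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': np.arange(0, 1440)})

# Minutes since midnight -> timestamps on the chosen day
df['datetime'] = pd.Timestamp('2021-03-21') + pd.to_timedelta(df['time'], unit='min')

# Text rendering without seconds (strings, not datetimes)
df['datetime_str'] = df['datetime'].dt.strftime('%Y-%m-%d %H:%M')
print(df.head())
```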

Pandas datetime values messed up after saving df to excel and then reading back into a df

jan_21=[datetime(2021,1,1) + timedelta(hours=i) for i in range(5)]
jan_21
[datetime.datetime(2021, 1, 1, 0, 0),
datetime.datetime(2021, 1, 1, 1, 0),
datetime.datetime(2021, 1, 1, 2, 0),
datetime.datetime(2021, 1, 1, 3, 0),
datetime.datetime(2021, 1, 1, 4, 0)]
prices = np.random.randint(1,100,size=(5,))
prices
[46 23 13 26 52]
df = pd.DataFrame({'datetime':jan_21, 'price':prices})
df
datetime price
0 2021-01-01 00:00:00 83
1 2021-01-01 01:00:00 60
2 2021-01-01 02:00:00 29
3 2021-01-01 03:00:00 97
4 2021-01-01 04:00:00 67
All good so far; this is how I expected the dataframe and datetime values to be displayed. The problem comes when I save the dataframe to an Excel file and then read it back into a dataframe: the datetime values get messed up.
df.to_excel('price_data.xlsx', index=False)
new_df = pd.read_excel('price_data.xlsx')
new_df
datetime price
0 2021-01-01 00:00:00.000000 83
1 2021-01-01 00:59:59.999999 60
2 2021-01-01 02:00:00.000001 29
3 2021-01-01 03:00:00.000000 97
4 2021-01-01 03:59:59.999999 67
I'd like df == new_df to evaluate to True
Given the likely cause of the issue (see sophros' answer), one way to circumvent the problem, at least superficially, is to convert the cells of df["datetime"] to strings before writing the Excel file and then convert the strings back to datetime after new_df has been created:
df["datetime"] = df["datetime"].dt.strftime("%m/%d/%Y, %H:%M:%S")
df.to_excel('price_data.xlsx', index=False)
new_df = pd.read_excel('price_data.xlsx')
new_df["datetime"] = pd.to_datetime(new_df["datetime"], format="%m/%d/%Y, %H:%M:%S")
The differences in the time part (00:59:59.999999, 02:00:00.000001 and 03:59:59.999999) are most likely caused by the slightly different binary representations of date/time types in Excel and in Python or pandas.
Time is frequently stored as a float, but the systems differ in where the zero point lies (e.g. year 1 AD, or 1970 as in Linux; good explanation here). The conversion may therefore lose some of the least significant digits of the date/time, and there is not much you can do about it other than round the values or use approximate comparisons, as with any float.
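The rounding approach mentioned above could look roughly like this. To keep the sketch self-contained it does not write an Excel file; the noisy values are copied from the question's output, and it assumes the original data has at most one-second resolution:

```python
import pandas as pd

# Values as originally written to Excel
original = pd.Series(pd.to_datetime(['2021-01-01 00:00:00',
                                     '2021-01-01 01:00:00',
                                     '2021-01-01 02:00:00',
                                     '2021-01-01 03:00:00',
                                     '2021-01-01 04:00:00']))

# Values as read back, including the microsecond noise from the question
read_back = pd.Series(pd.to_datetime(['2021-01-01 00:00:00.000000',
                                      '2021-01-01 00:59:59.999999',
                                      '2021-01-01 02:00:00.000001',
                                      '2021-01-01 03:00:00.000000',
                                      '2021-01-01 03:59:59.999999']))

# Round to whole seconds to discard the float conversion noise
restored = read_back.dt.round('s')
print((restored == original).all())
```

In the real workflow the same call would be applied to new_df['datetime'] right after pd.read_excel.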

Check time range in pandas dataframe

We have a pandas dataframe with start_date and end_date columns (the input format is given below). I need to check whether a given input time range falls between start_date and end_date.
For example, if the time range is 09:30-10:30, the output should be the first row (student1), and if the time range is 16:00-17:30, the output should be the second row (student2). How can I achieve this?
Input dataframe:
name start_date end_date
0 student1 2020-08-30 09:00:00 2020-08-30 10:00:00
1 student2 2020-08-30 15:00:00 2020-08-30 18:00:00
2 student3 2020-08-30 11:00:00 2020-08-30 12:30:00
Supposing your date columns have a datetime dtype:
from datetime import datetime
start_time = datetime(2020, 1, 1, 9, 30)
end_time = datetime(2020, 1, 1, 10, 30)
df['start_date'].dt.time.le(end_time.time()) & df['end_date'].dt.time.ge(start_time.time())
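The expression above yields a boolean mask that is True where the row's interval overlaps the requested range; indexing the dataframe with it returns the matching rows. A self-contained sketch using the data from the question (the helper name rows_overlapping is made up for illustration, and only the time of day is compared, so the dates themselves are ignored):

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    'name': ['student1', 'student2', 'student3'],
    'start_date': pd.to_datetime(['2020-08-30 09:00:00',
                                  '2020-08-30 15:00:00',
                                  '2020-08-30 11:00:00']),
    'end_date': pd.to_datetime(['2020-08-30 10:00:00',
                                '2020-08-30 18:00:00',
                                '2020-08-30 12:30:00']),
})

def rows_overlapping(frame, start, end):
    """Return the rows whose [start_date, end_date] time of day overlaps [start, end]."""
    mask = (frame['start_date'].dt.time.le(end)
            & frame['end_date'].dt.time.ge(start))
    return frame[mask]

print(rows_overlapping(df, datetime.time(9, 30), datetime.time(10, 30)))
print(rows_overlapping(df, datetime.time(16, 0), datetime.time(17, 30)))
```

If the requested range must lie completely inside a row's interval rather than merely overlap it, the two comparisons would need to be swapped to start_date <= start and end_date >= end; with the question's first example that stricter test would no longer match student1.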

How to create a Pandas column for datetime from year / month/ day / hour / minute / second?

I am trying to construct a datetime column in Pandas that represents multiple columns describing the year, month, day, etc. Most of the other answers I can find on this topic involve processing data in the opposite direction (from datetime to integer hour, for instance).
df = pd.DataFrame()
df['year'] = [2019, 2019, 2019, 2019, 2019, 2019]
df['month'] = [8, 8, 8, 8, 8, 8]
df['day'] = [1, 1, 1, 1, 1, 1]
df['hour'] = [10,10,11,11,12,12]
df['minute'] = [15,45,20,40,10,50]
df['second'] = [0, 1, 5, 10, 10, 11]
Goal:
df['datetime_val'] =
0 2019-08-01 10:15:00
1 2019-08-01 10:45:01
2 2019-08-01 11:20:05
3 2019-08-01 11:40:10
4 2019-08-01 12:10:10
5 2019-08-01 12:50:11
Name: datetime_val, dtype: datetime64[ns]
In the example above, how could I rapidly create a datetime column representing the constituent time information? I could easily do this with .apply() and a helper function but I envision performing this operation for millions of rows. I would love something inbuilt / vectorized. Thanks!
IIUC, to_datetime can take a dataframe directly, as long as the columns are named the way yours are (year, month, day, hour, minute, second):
pd.to_datetime(df)
0 2019-08-01 10:15:00
1 2019-08-01 10:45:01
2 2019-08-01 11:20:05
3 2019-08-01 11:40:10
4 2019-08-01 12:10:10
5 2019-08-01 12:50:11
dtype: datetime64[ns]
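If the real dataframe has additional columns besides the date parts, passing the whole frame to pd.to_datetime would fail, so one option is to select the relevant columns first. A sketch in which the extra column reading is hypothetical and only there to illustrate the selection:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019] * 6,
    'month': [8] * 6,
    'day': [1] * 6,
    'hour': [10, 10, 11, 11, 12, 12],
    'minute': [15, 45, 20, 40, 10, 50],
    'second': [0, 1, 5, 10, 10, 11],
    'reading': [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],  # hypothetical extra column
})

# Only hand the date-part columns to to_datetime
parts = ['year', 'month', 'day', 'hour', 'minute', 'second']
df['datetime_val'] = pd.to_datetime(df[parts])
print(df[['datetime_val', 'reading']])
```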
After reading through this comparison of string concatenation methods for pandas dataframes, it looks like you could benefit from using df.assign:
df.assign(datetime_val=[f"{str(year)}-{str(month)}-{str(day)} {str(hour)}:{str(minute)}:{str(second)}" for year, month, day, hour, minute, second in zip(df['year'], df['month'], df['day'], df['hour'], df['minute'], df['second'])])
EDIT2:
However, as Andy L. pointed out below, my method does not return datetime64 objects. In fact, method 3 becomes incredibly slow when the strings are swapped out for datetime objects. The method 1 vs method 2 comparison is still valid, though.
EDIT:
I did some testing to compare the three methods presented here.
You may convert the whole df to str, use agg to join the strings of each row, and pass the matching format parameter to pd.to_datetime:
df = df.astype(str)
pd.to_datetime(df.agg('-'.join, axis=1), format='%Y-%m-%d-%H-%M-%S')
Out[170]:
0 2019-08-01 10:15:00
1 2019-08-01 10:45:01
2 2019-08-01 11:20:05
3 2019-08-01 11:40:10
4 2019-08-01 12:10:10
5 2019-08-01 12:50:11
dtype: datetime64[ns]
