OK, so I am trying to subtract the next time from the previous time in a dataframe column called local_time, as indicated by this code. I have also tried this using a list comprehension.
next_df = df.shift(-1)
def time_between(df):
    return datetime.combine(date.today(), next_df['Local Time']) - datetime.combine(date.today(), df['Local Time'])
df['time_diff'] = df.apply(time_between, axis=1)
However, I receive this error when trying to subtract:
return datetime.combine(date.today(), next_df['Local Time']) - datetime.combine(date.today(), df['Local Time'])
TypeError: combine() argument 2 must be datetime.time, not Series
You might check whether pd.DataFrame.diff will work with your datetime data. Assuming your data types are correct, date arithmetic should work fine.
Otherwise, you need to do vectorized calculations using whole columns as the operands in your arithmetic, rather than applying a function row by row. Also use the dt accessors native to pandas, like pandas.Series.dt.date.
Instead of date.today(), you can use df['today'] = date.today() and df['today'].dt.date + df['Local Time'].dt.time; I'm about 90% sure that will yield a datetime column. If so, you could then use df.diff() pretty easily.
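For what it's worth, here is a minimal sketch of that vectorized idea, assuming the 'Local Time' column holds datetime.time objects as in the question; converting the times to Timedelta lets the whole subtraction happen in one shifted operation instead of a row-by-row apply:
import datetime
import pandas as pd

# toy data standing in for the question's dataframe
df = pd.DataFrame({'Local Time': [datetime.time(12, 0, 0),
                                  datetime.time(12, 0, 30),
                                  datetime.time(12, 1, 45)]})

# 'HH:MM:SS' strings parse cleanly into Timedelta
as_timedelta = pd.to_timedelta(df['Local Time'].astype(str))

# next time minus current time, fully vectorized
df['time_diff'] = as_timedelta.shift(-1) - as_timedelta
print(df)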
I have a column called Time in a dataframe that looks like this:
599359 12:32:25
326816 17:55:22
326815 17:55:22
358789 12:48:25
361553 12:06:45
...
814512 21:22:07
268266 18:57:31
659699 14:28:20
659698 14:28:20
268179 17:48:53
Name: Time, Length: 546967, dtype: object
And right now it is an object dtype. I've tried the following to convert it to a datetime:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time
And I understand that the .dt.time accessor is needed to prevent the year and month from being added, but I believe this is causing the dtype to revert to object.
Any workarounds? I know I could do
df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True)
but I have over 500,000 rows and this is taking forever.
When you do this bit: df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc=True).dt.time, you're converting the 'Time' column to the pandas dtype object, and each of those "objects" is a Python datetime.time.
The pandas datetime64 dtype is a different type from Python's datetime.datetime objects, and it does not support bare time objects (i.e. you can't have pandas treat the column as a datetime without providing a date). This is why the dtype is changing to object.
In the case of your second approach, df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc=True), something slightly different is happening. Here you're applying pd.to_datetime to each scalar element of the 'Time' series. Take a look at the return types of the function in the docs, but basically in this case the time values in your df are being converted to Timestamp objects dated 1 January 1900 (i.e. a default date is added).
So: pandas is behaving correctly. If you only want the times, then it's fine to keep the datetime.time objects in the column, but to operate on them you'll probably be relying on many [slow] df.apply calls. Alternatively, just keep the default date of 1900-01-01; then you can add/subtract the datetime64 columns and get the speed advantage of pandas, and strip off the date when you're done with it.
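A minimal sketch of that last suggestion, assuming a 'Time' column of 'HH:MM:SS' strings as in the question:
import pandas as pd

df = pd.DataFrame({'Time': ['12:32:25', '17:55:22', '14:28:20']})

# datetime64[ns] column; pandas fills in the default date 1900-01-01
df['Time_dt'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce')

# fast, vectorized arithmetic while the column is still datetime64
df['diff'] = df['Time_dt'].diff()

# strip the dummy date off at the end if you only want the times back
df['Time_only'] = df['Time_dt'].dt.time
print(df.dtypes)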
I have a column of string values in pandas as follows:
2022-07-01 00:00:00+00:00
I want to compare it to a couple of dates as follows:
month_start_date = datetime.date(start_year, start_month, 1)
month_end_date = datetime.date(start_year, start_month, calendar.monthrange(start_year, start_month)[1])
df = df[(df[date] >= month_start_date) and (df[date] <= month_end_date)]
How do I convert the string value to datetime.date?
I have tried pd.to_datetime(df['date']); it says can't compare datetime to date.
I have tried pd.to_datetime(df['date']).dt.date; it says .dt can only be used with datetime-like values (did you mean 'at'?).
I have also tried to normalize it, but that brings more errors about aware and naive timezones.
Also tried .astype('datetime64[ns]')
None of it is working
UPDATE
Turns out none of the above are working because half the data is in this format: 2022-07-01 00:00:00+00:00
And the rest is in this format: 2022-07-01
Here is how I am getting around this issue:
for index, row in df_uscis.iterrows():
df_uscis.loc[index, 'date'] = datetime.datetime.strptime(row['date'].split(' ')[0], "%Y-%m-%d").date()
Is there a simpler and faster way of doing this? I tried to make a new column with the date values only, but I'm not sure how to do that.
From your update, if you only need to turn the values from string to date objects, you can try:
df['date'] = pd.to_datetime(df['date'].str.split(' ').str[0])
df['date'] = df['date'].dt.date
Also, try to avoid using iterrows, as it is really slow and there's usually a better way to achieve what you're trying to accomplish. If you really do need to iterate through a DataFrame, try using the df.itertuples() method instead.
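As a rough sketch of how the filter from the question could then work (assuming the month_start_date and month_end_date variables defined there, with illustrative values for start_year and start_month), note that boolean Series must be combined with & rather than and:
import calendar
import datetime
import pandas as pd

df = pd.DataFrame({'date': ['2022-07-01 00:00:00+00:00', '2022-07-15', '2022-08-02']})

start_year, start_month = 2022, 7  # example values, not from the question
month_start_date = datetime.date(start_year, start_month, 1)
month_end_date = datetime.date(start_year, start_month,
                               calendar.monthrange(start_year, start_month)[1])

# drop any time/offset portion, then parse and keep plain date objects
df['date'] = pd.to_datetime(df['date'].str.split(' ').str[0]).dt.date

# element-wise boolean Series must be combined with &, not `and`
mask = (df['date'] >= month_start_date) & (df['date'] <= month_end_date)
df = df[mask]
print(df)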
I have a data file with timestamps that look like this:
It gets loaded into pandas with a column name of "Time". I am trying to create two new datetime64 type columns, one with the date and one with the time (hour). I have explored a few solutions to this problem on StackOverflow but am still having issues. Quick note, I need the final columns to not be objects so I can use pandas and numpy functionality.
I load the dataframe and create two new columns like so:
df = pd.read_csv('C:\\Users\\...\\xyz.csv')
df['Date'] = pd.to_datetime(df['Time']).dt.date
df['Hour'] = pd.to_datetime(df['Time']).dt.time
This works but the Date and Hour columns are now objects.
I run this to convert the date to my desired datetime64 data type and it works:
df['Date'] = pd.to_datetime(df['Date'])
However, when I try to use this same code on the Hour column, I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
I did some digging and found the following code which runs:
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S')
However the actual output includes a generic date like so:
When I try to run code referencing the Hour column like so:
HourVarb = '15:00:00'
df['Test'] = np.where(df['Hour']==HourVarb,1,np.nan)
It runs but doesn't produce the result I want.
Perhaps my HourVarb variable is the wrong format for the numpy code? Alternatively, the 1/1/1900 is causing problems and the format '%H:%M:%S' needs to change? My end goal is to be able to reference the hour and the date to filter out specific date/hour combinations. Please help.
One note, when I change the HourVarb to '1/1/1900 15:00:00' the code above works as intended, but I'd still like to understand if there is a cleaner way that removes the date. Thanks
I'm not sure I understand the problem with the 'object' datatypes of these columns.
I loaded the data you provided this way:
df = pd.read_csv('xyz.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
print(df.dtypes)
And I get these data types:
Time datetime64[ns]
Date object
Hour object
The fact that Date and Hour are object types should not be a problem. The underlying data is a datetime type:
print(type(df.Date.iloc[0]))
print(type(df.Hour.iloc[0]))
<class 'datetime.date'>
<class 'datetime.time'>
This means you can use these columns as such. For example:
print(df['Date'] + pd.Timedelta('1D'))
What are you trying to do that is requiring the column dtype to be a Pandas dtype?
UPDATE
Here is how you achieve the last part of your question:
from datetime import datetime, time
hourVarb = datetime.strptime("15:00:00", '%H:%M:%S').time()
# or hourVarb = time(15, 0)
df['Test'] = df['Hour'] == hourVarb
print(df['Test'])
0 True
1 False
2 False
3 False
Name: Test, dtype: bool
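And, for the stated end goal of filtering out specific date/hour combinations, a short sketch along the same lines (the example timestamps and the wanted_date / wanted_hour names are illustrative, not from the question):
from datetime import date, time
import pandas as pd

df = pd.DataFrame({'Time': pd.to_datetime(['2021-03-01 15:00:00',
                                           '2021-03-01 16:00:00',
                                           '2021-03-02 15:00:00'])})
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time

wanted_date = date(2021, 3, 1)
wanted_hour = time(15, 0)

# combine both conditions with & to keep (or drop) specific date/hour pairs
filtered = df[(df['Date'] == wanted_date) & (df['Hour'] == wanted_hour)]
print(filtered)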
I believe I have tried every answer I could find on this page. The problem I am still having is converting a column to date format, and deleting the time from it.
This is my data frame; "mixed datetimes and integers in passed array" is the error I am getting with most of the methods I tried. I also failed converting it to a string -> date. Not sure how to approach this; any help would be appreciated.
This way you can convert to date and can also add other conditions:
import datetime
def convert_to_date(x):
    # parse the string and keep only the date part
    return datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S').date()
df.loc[:, 'Date'] = df['Date'].apply(convert_to_date)
# or, equivalently, with a lambda
df.loc[:, 'Date'] = df['Date'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S').date())
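If the column is large, a vectorized alternative may be worth trying; this is only a sketch, assuming the same '%Y-%m-%d %H:%M:%S' strings as above:
import pandas as pd

df = pd.DataFrame({'Date': ['2021-01-05 09:30:00', '2021-01-06 17:45:00']})

parsed = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S')

df['Date'] = parsed.dt.date            # datetime.date objects (object dtype)
df['Date_dt'] = parsed.dt.normalize()  # or keep datetime64 with time set to midnight
print(df.dtypes)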
How do I create a datetime index "foo" to use with raw data series?
(For example, 'foo' would be "as of" every 15 seconds and 'foo2' every 30 seconds.) If the raw series can be inserted into a 'base' dataframe, I would like to use 'foo' to recast the dataframe.
If I wanted to combine df "foo" and df "foo2", what would be the memory hit?
Would it be better to fill the foo index with the raw data series?
EDIT:
after import pandas, datetime.timedelta stops working
It's very hard for me to understand what you're asking; an illustration of exactly what you're looking for, with example data, would help make things more clear.
I think what you should do:
rng = DateRange(start, end, offset=datetools.Second(15))
to create the date range. To put data in a DataFrame indexed by that, you should add the columns and reindex them to the date range above using method='ffill':
df = DataFrame(index=rng)
df[colname] = series.reindex(df.index, method='ffill')
Per datetime.timedelta: datetime.datetime is part of the pandas namespace (as pandas.datetime), so if you did from pandas import *, then any import datetime you had done before that would be masked by the datetime.datetime reference inside the pandas namespace.
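A tiny illustration of that masking problem (only relevant to older pandas versions that still exposed pandas.datetime):
import datetime                 # here `datetime` is the module
# from pandas import *         # in old pandas this rebinds `datetime` to the
#                              # datetime.datetime class, so datetime.timedelta breaks

# the safe pattern: avoid star imports and keep explicit names
import pandas as pd
print(datetime.timedelta(seconds=15))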
Since Wes' answer, I think pandas.DateRange is no longer present in pandas. I'm on pandas version 0.22.0.
I used pandas.DatetimeIndex instead, e.g.:
import datetime
import pandas as pd
start = datetime.datetime.now()
times = pd.DatetimeIndex(freq='2s', start=start, periods=10)
or alternatively
start = datetime.datetime.now()
end = start + datetime.timedelta(hours=1)
times = pd.DatetimeIndex(freq='2s', start=start, end=end)
As of version 0.24:
Creating a DatetimeIndex based on start, periods, and end has been deprecated in favor of date_range().
Using date_range() is similar to using DatetimeIndex():
start = datetime.datetime.now()
end = start + datetime.timedelta(hours=1)
times = pd.date_range(freq='2s', start=start, end=end)
times is a DatetimeIndex with 1801 elements, spaced at an interval of 2 seconds.
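Tying this back to the original question, here is a sketch of the reindex/ffill step with the modern API (the series values, timestamps, and the 'value' column name are illustrative, not from the question):
import pandas as pd

# a raw, irregularly-timestamped series standing in for the real data
raw = pd.Series([1.0, 2.0, 3.0],
                index=pd.to_datetime(['2021-01-01 00:00:03',
                                      '2021-01-01 00:00:20',
                                      '2021-01-01 00:00:41']))

# "as of" every 15 seconds
foo = pd.date_range(start='2021-01-01 00:00:00',
                    end='2021-01-01 00:01:00', freq='15s')

df = pd.DataFrame(index=foo)
df['value'] = raw.reindex(df.index, method='ffill')
print(df)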