Separating Date and Time in Pandas - python

I have a data file with a column of timestamps that gets loaded into pandas under the column name "Time". I am trying to create two new datetime64-type columns, one with the date and one with the time (hour). I have explored a few solutions to this problem on StackOverflow but am still having issues. Quick note: I need the final columns to not be objects so I can use pandas and numpy functionality.
I load the dataframe and create two new columns like so:
df = pd.read_csv('C:\\Users\\...\\xyz.csv')
df['Date'] = pd.to_datetime(df['Time']).dt.date
df['Hour'] = pd.to_datetime(df['Time']).dt.time
This works but the Date and Hour columns are now objects.
I run this to convert the date to my desired datetime64 data type and it works:
df['Date'] = pd.to_datetime(df['Date'])
However, when I try to use this same code on the Hour column, I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
I did some digging and found the following code which runs:
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S')
However, the actual output now includes a generic 1/1/1900 date in front of each time.
When I try to run code referencing the Hour column like so:
HourVarb = '15:00:00'
df['Test'] = np.where(df['Hour']==HourVarb,1,np.nan)
It runs but doesn't produce the result I want.
Perhaps my HourVarb variable is in the wrong format for the numpy code? Alternatively, the 1/1/1900 is causing problems and the format %H:%M:%S needs to change? My end goal is to be able to reference the hour and the date to filter out specific date/hour combinations. Please help.
One note, when I change the HourVarb to '1/1/1900 15:00:00' the code above works as intended, but I'd still like to understand if there is a cleaner way that removes the date. Thanks

I'm not sure I understand the problem with the 'object' datatypes of these columns.
I loaded the data you provided this way:
df = pd.read_csv('xyz.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
print(df.dtypes)
And I get these data types:
Time datetime64[ns]
Date object
Hour object
The fact that Date and Hour are object types should not be a problem. The underlying data is a datetime type:
print(type(df.Date.iloc[0]))
print(type(df.Hour.iloc[0]))
<class 'datetime.date'>
<class 'datetime.time'>
This means you can use these columns as such. For example:
print(df['Date'] + pd.Timedelta('1D'))
What are you trying to do that is requiring the column dtype to be a Pandas dtype?
UPDATE
Here is how you achieve the last part of your question:
from datetime import datetime, time
hourVarb = datetime.strptime("15:00:00", '%H:%M:%S').time()
# or hourVarb = time(15, 0)
df['Test'] = df['Hour'] == hourVarb
print(df['Test'])
0 True
1 False
2 False
3 False
Name: Test, dtype: bool
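If you do want both pieces in datetime64-friendly columns, so that comparisons stay vectorized and no dummy 1/1/1900 date is involved, one option is a normalized date column plus a plain integer hour. A minimal sketch, assuming df['Time'] parses with pd.to_datetime (the filter values are placeholders):
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.normalize()   # datetime64[ns], time zeroed out
df['Hour'] = df['Time'].dt.hour          # plain integer 0-23
# filter a specific date/hour combination (placeholder values)
mask = (df['Date'] == pd.Timestamp('2020-01-01')) & (df['Hour'] == 15)
filtered = df[mask]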

Related

Convert object-type hours:minutes:seconds column to datetime type in Pandas

I have a column called Time in a dataframe that looks like this:
599359 12:32:25
326816 17:55:22
326815 17:55:22
358789 12:48:25
361553 12:06:45
...
814512 21:22:07
268266 18:57:31
659699 14:28:20
659698 14:28:20
268179 17:48:53
Name: Time, Length: 546967, dtype: object
And right now it is an object dtype. I've tried the following to convert it to a datetime:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time
And I understand that the .dt.time accessor is needed to prevent the year and month from being added, but I believe this is what causes the dtype to revert to object.
Any workarounds? I know I could do
df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True)
but I have over 500,000 rows and this is taking forever.
When you do this bit: df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc=True).dt.time, you're converting the 'Time' column to object dtype... and each "object" is the Python type datetime.time.
Pandas' datetime64 dtype is a different type from Python's datetime.datetime objects, and it does not support bare time objects (i.e. you can't have pandas treat the column as a datetime without providing a date). That is why the dtype changes to object.
In your second approach, df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc=True), something slightly different happens: pd.to_datetime is applied to each scalar element of the 'Time' series. Take a look at the return types of the function in the docs, but basically the time values in your df are converted to Timestamps on the 1st of January 1900 (i.e. a default date is added).
So: pandas is behaving correctly. If you only want the times, it's fine to keep the datetime.time objects in the column, but to operate on them you'll probably be relying on many [slow] df.apply calls. Alternatively, keep the default date of 1900-01-01, so you can add/subtract the datetime64 columns and get the speed advantage of pandas, and just strip off the date when you're done with it.
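A rough sketch of that last suggestion, assuming the same Time column (the 17:00 cutoff is only an illustration):
# parse as full timestamps; pandas fills in the default date 1900-01-01
times = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce')
# vectorized comparisons and arithmetic work while the dummy date is attached
late = df[times > pd.Timestamp('1900-01-01 17:00:00')]
# strip the date back off at the end (object dtype of datetime.time again)
df['Time'] = times.dt.time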

selecting a df row by month formatted with (lambda x: datetime.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z'))

I'm having some issues with geopandas and pandas datetime objects; I kept getting the error
pandas Invalid field type <class 'pandas._libs.tslibs.timedeltas.Timedelta'>
when I try to save it using gpd.to_file(). Apparently this is a known issue between pandas and geopandas date types, so I used
df.DATE = df.DATE.apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z'))
to get a datetime object I could manipulate without getting the aforementioned error when I save the results. Due to that change, my selection by month
months = [4]
for month in months:
    df = df[[(pd.DatetimeIndex(df.DATE).month == month)]]
no longer works, throwing a value error.
ValueError: Item wrong length 1 instead of 108700.
I tried dropping the pd.DatetimeIndex, but this throws a Series attribute error
AttributeError: 'Series' object has no attribute 'month'
and
df = df[(df.DATE.month == month)]
gives me the same error.
I know it converted over to a datetime object because print(df.dtypes) shows DATE datetime64[ns, UTC] and
for index, row in df.iterrows():
    print(row.DATE.month)
prints the month as an integer to the terminal.
Without going back to pd.Datetime how can I fix my select statement for the month?
The statement df.DATE returns a Series object. That doesn't have a .month attribute. The dates inside the Series do, which is why row.DATE.month works. Try something like:
filter = [x.month == month for x in df.DATE]
df_filtered = df[filter]
Before that, I'm not sure what you were trying to accomplish with pd.DatetimeIndex(df.DATE).month == month, but a similar fix should take care of it.
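For what it's worth, the ValueError in the pd.DatetimeIndex version comes from the extra pair of square brackets, which wrap the length-108700 boolean mask inside a one-element list. A sketch of both fixes, assuming the same df and months as in the question:
months = [4]
for month in months:
    # single brackets: the boolean mask is used directly
    df = df[pd.DatetimeIndex(df.DATE).month == month]
    # equivalent, since DATE is already datetime64[ns, UTC]:
    # df = df[df.DATE.dt.month == month]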

conversion of any date to datetime in pandas dataframe

I want to define a function that takes in a pandas dataframe and iterates through its columns to find any date or timestamp fields that may be stored as objects. I want to convert those fields to DateTime. In short, I want to clean my different dataframes, which can come from different sources and have different timestamps, and store them. Can you guys point me in the right direction?
def dateConversion(flattenedObject):
    for col in flattenedObject.columns:
        flattenedObject[col] = pd.to_datetime(flattenedObject[col])
something like this.. ?
def test_datetime(dataframe):
    for col in dataframe.select_dtypes('object').columns:
        try:
            dataframe[col] = pd.to_datetime(dataframe[col])
            print(f'Converted {col}')
        except ValueError:
            print(f"{col} does not conform to standard datetime format")
Test
df = pd.DataFrame({'date_test': ['01-01-2020'],
                   'string_test': ['not a date']})
print(test_datetime(df))
Converted date_test
string_test does not conform to standard datetime format
print(df.dtypes)
date_test datetime64[ns]
string_test object
dtype: object

DataFrame does not allow Timestamps conversion for resampling

I have a 12 millions entries csv file that I imported as dataframe with pandas that looks like this.
      pair                     time       open      close
0  AUD/JPY  20170102 00:00:08.238  83.774002  84.626999
1  AUD/JPY  20170102 00:00:08.352  83.774002  84.626999
2  AUD/JPY  20170102 00:00:13.662  84.184998  84.324997
3  AUD/JPY  20170102 00:00:13.783  84.184998  84.324997
The time column is a string, but I need a datetime object in order to downsample the dataframe and get OHLC values. The df.resample function requires a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex). I tried
df['time'] = pd.to_datetime(df['time'])
but this creates Timestamps, and for some reason I cannot convert the Timestamps into datetime objects.
time = df['time'].dt.to_pydatetime()
df['time'] = time
This works when creating a separate array and assigning the resulting list, but as soon as I incorporate it into the dataframe it is converted back into Timestamps automatically. It does not work even when creating a new dataframe with dtype='object' and then adding the datetime list as before.
A workaround would be to convert each row individually, but given the size of the dataframe that would take ages. Any suggestions?
EDIT: with
time = pd.DataFrame(dtype = 'datetime64')
time = pd.to_datetime(df['time'])
time = time.dt.to_pydatetime()
new = pd.DataFrame({'pair': df['pair'],'time': pd.Series(time, dtype='object'), 'open': df['open'], 'close': df['close']}, dtype ='object')
I am now able to receive a datetime object when calling new['time'][0], however
new['time'].resample('5T')
still raises the error: "Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'"
EDIT: Ok, so apparently I just had to set the timestamp as index of the dataframe and then resample applies without issues.
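A minimal sketch of that fix, assuming pd.to_datetime can parse the time column (the 5-minute bars and the use of the close prices are just examples):
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')                 # resample needs a DatetimeIndex
bars = df['close'].resample('5T').ohlc()  # 5-minute OHLC bars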
Can you try:
df['time'] = pd.to_datetime(df['time'], format='%Y%m%d %H:%M:%S.%f')
df['timeconvert'] = df['time'].dt.date
Ok, so apparently I just had to set the timestamp as index of the dataframe and then resample applies without issues. There is no need to bother with timestamp conversion or anything else, thanks anyway for the reply.

Pandas series of timestamps casting to string

I'm new to pandas and am curious if there is something wrong with my approach to splitting a date column into separate hour, minute, seconds, etc columns.
I have a column in my dataframe (df) that contains only dates ('dates'). I convert this to a series of timestamps using the below line of code:
dt = to_datetime(df['dates'], format='%m/%d/%Y %H:%M:%S.%f', errors='ignore')
When I iterate through dt using a for loop I get all of the dates parsed properly and can call out separate properties of the timestamp without an issue. For example:
for i in [d.year for d in dt]:
    print(i)  # Correctly prints 2017 for all dates

for i in [d for d in dt]:
    print(type(i))  # Prints <class 'pandas._libs.tslib.Timestamp'>
But when I attempt to add a new column to my dataframe using the below line of code:
df['year'] = Series([d.year for d in dt], index=df.index)
I get an attribute error:
AttributeError: 'str' object has no attribute 'year'
Some example date strings are: '09/22/2017 19:23:12.993', '09/01/2017 02:47:11.593'.
Does anyone know why python appears to be interpreting d as a string in the final example but not in the first two?
