DataFrame does not allow Timestamps conversion for resampling - python

I have a 12-million-row CSV file that I imported into a pandas DataFrame; it looks like this.
pair time open close
0 AUD/JPY 20170102 00:00:08.238 83.774002 84.626999
1 AUD/JPY 20170102 00:00:08.352 83.774002 84.626999
2 AUD/JPY 20170102 00:00:13.662 84.184998 84.324997
3 AUD/JPY 20170102 00:00:13.783 84.184998 84.324997
The time column is a string, but I need datetime values in order to downsample the dataframe and get OHLC values. The df.resample function requires a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex). I tried
df['time'] = pd.to_datetime(df['time'])
but this creates Timestamps, and for some reason I cannot convert the Timestamps into datetime objects.
time = df['time'].dt.to_pydatetime()
df['time'] = time
This works when I build a separate array and keep the resulting list on its own, but as soon as I assign it back into the dataframe it is automatically converted back into Timestamps. It does not work even when I create a new dataframe with dtype = 'object' and then add the datetime list as before.
A workaround would be to convert each row individually, but given the size of the dataframe that would take ages. Any suggestions?
EDIT: with
time = pd.DataFrame(dtype = 'datetime64')
time = pd.to_datetime(df['time'])
time = time.dt.to_pydatetime()
new = pd.DataFrame({'pair': df['pair'],'time': pd.Series(time, dtype='object'), 'open': df['open'], 'close': df['close']}, dtype ='object')
I am now able to get a datetime object when calling new['time'][0]; however,
new['time'].resample('5T')
still raises the error: "Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'"
EDIT: OK, so apparently I just had to set the timestamp column as the index of the dataframe, and then resample works without issues.
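For reference, a minimal sketch of that fix (the format string and the choice of the close column for the OHLC aggregation are assumptions based on the sample rows above):
import pandas as pd
# made-up reconstruction of the sample rows shown above
df = pd.DataFrame({
    'pair': ['AUD/JPY'] * 4,
    'time': ['20170102 00:00:08.238', '20170102 00:00:08.352',
             '20170102 00:00:13.662', '20170102 00:00:13.783'],
    'open': [83.774002, 83.774002, 84.184998, 84.184998],
    'close': [84.626999, 84.626999, 84.324997, 84.324997],
})
# parse the string column (format inferred from the sample) and move it to the index
df['time'] = pd.to_datetime(df['time'], format='%Y%m%d %H:%M:%S.%f')
df = df.set_index('time')
# downsample the close prices into 5-minute OHLC bars
ohlc = df['close'].resample('5T').ohlc()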

can you try:
import pandas as pd
df['time'] = pd.to_datetime(df['time'], format='%Y%m%d %H:%M:%S.%f')
df['timeconvert'] = df['time'].dt.date

OK, so apparently I just had to set the timestamp column as the index of the dataframe, and then resample works without issues. There is no need to bother with the timestamp conversion or anything else; thanks anyway for the reply.

Related

Convert object-type hours:minutes:seconds column to datetime type in Pandas

I have a column called Time in a dataframe that looks like this:
599359 12:32:25
326816 17:55:22
326815 17:55:22
358789 12:48:25
361553 12:06:45
...
814512 21:22:07
268266 18:57:31
659699 14:28:20
659698 14:28:20
268179 17:48:53
Name: Time, Length: 546967, dtype: object
And right now it is an object dtype. I've tried the following to convert it to a datetime:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time
And I understand that the .dt.time accessor is needed to prevent the year and month from being added, but I believe this is what causes the dtype to revert to object.
Any workarounds? I know I could do
df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True)
but I have over 500,000 rows and this is taking forever.
When you do this bit: df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time, you're converting the 'Time' column to have the pandas dtype object... and that "object" is the Python type datetime.time.
The pandas datetime64 dtype is a different type than Python's datetime.datetime objects, and pandas' datetime64 does not support time-only values (i.e. you can't have pandas treat the column as a datetime without providing the year). This is why the dtype is changing to object.
In the case of your second approach, df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True), something slightly different is happening. In this case you're applying pd.to_datetime to each scalar element of the 'Time' series. Take a look at the return types of the function in the docs, but basically the time values in your df are being converted to pandas Timestamp objects on the 1st of January 1900 (i.e. a default date is added).
So: pandas is behaving correctly. If you only want the times, then it's okay to use the datetime.time objects in the column. But to operate on them you'll probably be relying on many [slow] df.apply calls. Alternatively, just keep the default date of 1900-01-01; then you can add/subtract the datetime columns and get the speed advantage of pandas, and strip off the date when you're done with it.
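A hedged sketch of that second option (sample values made up from the question's data):
import pandas as pd
# made-up sample of time-only strings like the ones in the question
s = pd.Series(['12:32:25', '17:55:22', '12:48:25'], name='Time')
# keep the full datetime64 column (default date 1900-01-01) so arithmetic stays vectorized
t = pd.to_datetime(s, format='%H:%M:%S', errors='coerce')
# vectorized arithmetic works on the datetime64 values...
elapsed = t - t.min()
# ...and the date can be stripped off at the very end if only times are needed
times_only = t.dt.time  # dtype becomes object (datetime.time values)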

convert a pandas column of string dates to compare with datetime.date

I have a column of string values in pandas as follows:
2022-07-01 00:00:00+00:00
I want to compare it to a couple of dates as follows:
month_start_date = datetime.date(start_year, start_month, 1)
month_end_date = datetime.date(start_year, start_month, calendar.monthrange(start_year, start_month)[1])
df = df[(df[date] >= month_start_date) & (df[date] <= month_end_date)]
How do I convert the string value to datetime.date?
I have tried to use pd.to_datetime(df['date']); it says can't compare datetime to date.
Tried to use pd.to_datetime(df['date']).dt.date; it says .dt can only be used for datetime-like values (did you mean 'at'?).
Also tried to normalize it, but that brings more errors with timezones (aware and naive timezones).
Also tried .astype('datetime64[ns]').
None of it is working.
UPDATE
Turns out none of the above are working because half the data is in this format: 2022-07-01 00:00:00+00:00
And the rest is in this format: 2022-07-01
Here is how I am getting around this issue:
for index, row in df_uscis.iterrows():
    df_uscis.loc[index, 'date'] = datetime.datetime.strptime(row['date'].split(' ')[0], "%Y-%m-%d").date()
Is there a simpler and faster way of doing this? I tried to make a new column with the date values only, but I am not sure how to do that.
From your update, if you only need to turn the values from string to date objects, you can try:
df['date'] = pd.to_datetime(df['date'].str.split(' ').str[0])
df['date'] = df['date'].dt.date
Also, try to avoid using iterrows, as it is really slow and there is usually a better way to achieve what you're trying to accomplish; but if you really need to iterate through a DataFrame, try using the df.itertuples() method.
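For illustration, a hedged sketch that ties this back to the original month-range comparison (sample values and the 'date' column name are made up to match the question; the string slice keeps only the 'YYYY-MM-DD' part, so both formats parse):
import calendar
import datetime
import pandas as pd
# made-up sample covering both string formats from the question
df = pd.DataFrame({'date': ['2022-07-01 00:00:00+00:00', '2022-07-15', '2022-08-02']})
# keep only the 'YYYY-MM-DD' prefix, then convert to datetime.date objects
df['date'] = pd.to_datetime(df['date'].str.slice(0, 10)).dt.date
start_year, start_month = 2022, 7
month_start_date = datetime.date(start_year, start_month, 1)
month_end_date = datetime.date(start_year, start_month,
                               calendar.monthrange(start_year, start_month)[1])
# use & (element-wise) rather than `and` when combining boolean Series
df = df[(df['date'] >= month_start_date) & (df['date'] <= month_end_date)]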

Separating Date and Time in Pandas

I have a data file with timestamps. It gets loaded into pandas with a column name of "Time". I am trying to create two new datetime64-type columns, one with the date and one with the time (hour). I have explored a few solutions to this problem on StackOverflow but am still having issues. Quick note: I need the final columns to not be objects so I can use pandas and numpy functionality.
I load the dataframe and create two new columns like so:
df = pd.read_csv('C:\\Users\\...\\xyz.csv')
df['Date'] = pd.to_datetime(df['Time']).dt.date
df['Hour'] = pd.to_datetime(df['Time']).dt.time
This works but the Date and Hour columns are now objects.
I run this to convert the date to my desired datetime64 data type and it works:
df['Date'] = pd.to_datetime(df['Date'])
However, when I try to use this same code on the Hour column, I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
I did some digging and found the following code which runs:
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S')
However, the actual output then includes a generic date (1/1/1900) in front of each time.
When I try to run code referencing the Hour column like so:
HourVarb = '15:00:00'
df['Test'] = np.where(df['Hour']==HourVarb,1,np.nan)
It runs but doesn't produce the result I want.
Perhaps my HourVarb variable is the wrong format for the numpy code? Alternatively, is the 1/1/1900 causing problems, and does the format %H:%M:%S need to change? My end goal is to be able to reference the hour and the date to filter out specific date/hour combinations. Please help.
One note, when I change the HourVarb to '1/1/1900 15:00:00' the code above works as intended, but I'd still like to understand if there is a cleaner way that removes the date. Thanks
I'm not sure I understand the problem with the 'object' datatypes of these columns.
I loaded the data you provided this way:
df = pd.read_csv('xyz.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
print(df.dtypes)
And I get these data types:
Time datetime64[ns]
Date object
Hour object
The fact that Date and Hour are object types should not be a problem. The underlying data is a datetime type:
print(type(df.Date.iloc[0]))
print(type(df.Hour.iloc[0]))
<class 'datetime.date'>
<class 'datetime.time'>
This means you can use these columns as such. For example:
print(df['Date'] + pd.Timedelta('1D'))
What are you trying to do that requires the column dtype to be a pandas dtype?
UPDATE
Here is how you achieve the last part of your question:
from datetime import datetime, time
hourVarb = datetime.strptime("15:00:00", '%H:%M:%S').time()
# or hourVarb = time(15, 0)
df['Test'] = df['Hour'] == hourVarb
print(df['Test'])
0 True
1 False
2 False
3 False
Name: Test, dtype: bool
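And, for the stated end goal of filtering specific date/hour combinations, a hedged sketch building on the Date/Hour columns above (sample data made up):
from datetime import date, time
import pandas as pd
# made-up sample data with a parsed Time column
df = pd.DataFrame({'Time': pd.to_datetime(['2021-03-01 15:00:00',
                                           '2021-03-01 16:00:00',
                                           '2021-03-02 15:00:00'])})
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
# boolean mask for one specific date/hour combination
mask = (df['Date'] == date(2021, 3, 1)) & (df['Hour'] == time(15, 0))
kept = df[mask]      # keep only the matching rows
dropped = df[~mask]  # or filter those rows out instead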

Changing type of Data Frame cells without loop

Sometimes, data is not in the format we wish it to be. Python offers ways to deal with this (such as int() and str()), but solutions for DataFrames are not trivial.
For instance, let us generate a DataFrame of 5 datetime observations:
import pandas as pd
from datetime import datetime
datelist = pd.date_range(datetime.today(), periods=5).tolist()
df = pd.DataFrame(datelist, columns=['A'])
Our goal is to convert this datetime data into the date format.
First, we may try
df['A'] = df['A'].datetime.date()
For which we will get an AttributeError: 'Series' object has no attribute 'date'. An option, then, would be to create a loop to change each cell at a time, but, according to the Pandas documentation, we should never modify something we are iterating over (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html). How could we, then, solve this problem?
You can use .dt accessor for datetimelike properties of the Series values:
df.A.dt.date
# 0 2021-02-23
# 1 2021-02-24
# 2 2021-02-25
# 3 2021-02-26
# 4 2021-02-27
# Name: A, dtype: object
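If a proper datetime64 column (rather than Python date objects) is what's actually needed downstream, dt.normalize() may be worth considering instead; a hedged sketch, not part of the original answer:
import pandas as pd
from datetime import datetime
datelist = pd.date_range(datetime.today(), periods=5).tolist()
df = pd.DataFrame(datelist, columns=['A'])
df['A_date'] = df['A'].dt.date             # object dtype holding datetime.date values
df['A_midnight'] = df['A'].dt.normalize()  # stays datetime64[ns], time set to 00:00:00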

Converting Pandas Timestamp to just the time (looking for something faster than .apply)

So if I have a timestamp in pandas as such:
Timestamp('2014-11-07 00:05:00')
How can I create a new column that just has the 'time' component?
So I want
00:05:00
Currently, I'm using .apply as shown below, but this is slow (my dataframe is a couple million rows), and I'm looking for a faster way.
df['time'] = df['date_time'].apply(lambda x: x.time())
Instead of .apply, I tried using .astype(time), as I noticed .astype operations can be faster than .apply, but that apparently doesn't work on timestamps (AttributeError: 'Timestamp' object has no attribute 'astype')... any ideas?
You want .dt.time; see the docs for some more examples of things under the .dt accessor.
df['date_time'].dt.time
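For example, assigning it back as a new column (a quick sketch with made-up sample data):
import pandas as pd
df = pd.DataFrame({'date_time': pd.to_datetime(['2014-11-07 00:05:00',
                                                '2014-11-07 00:10:00'])})
# vectorized equivalent of the .apply(lambda x: x.time()) approach above
df['time'] = df['date_time'].dt.time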
There are two dataframes, df1 and df2, having a date column and a time column respectively. The following code snippets are useful for converting the data types and comparing the values.
type(df1['date'].iloc[0]), type(df2['time'].iloc[0])
>>> (datetime.date, pandas._libs.tslibs.timestamps.Timestamp)
type(df1['date'].iloc[0]), type(df2['time'].iloc[0].date())
>>> (datetime.date, datetime.date)
df1['date'].iloc[0] == df2['time'].iloc[0].date()
>>> False
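A hedged sketch of the same comparison done on whole columns rather than single scalars (made-up data; assumes df1 and df2 share an aligned index):
import pandas as pd
from datetime import date
df1 = pd.DataFrame({'date': [date(2021, 1, 1), date(2021, 1, 2)]})
df2 = pd.DataFrame({'time': pd.to_datetime(['2021-01-01 09:30:00',
                                            '2021-01-03 10:00:00'])})
# element-wise, index-aligned comparison of date objects
matches = df1['date'] == df2['time'].dt.date
# 0     True
# 1    False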
