Converting numpy64 objects to Pandas datetime - python

Question is pretty self-explanatory. I am finding that pd.to_datetime isn't changing anything about the object type, and using pd.Timestamp() directly is bombing out.
Before this is marked a duplicate of Converting between datetime, Timestamp and datetime64: I am struggling to change an entire column of a dataframe, not just one datetime object. Perhaps that was in the article, but I didn't see it in the top answer.
I will add that my error occurs when I try to get unique values from the dataframe's column. Is using unique converting the dtype to something unwanted?

The method you mentioned, pandas.to_datetime(), works on scalars, list-likes and Series (and on a DataFrame whose columns hold the date components, such as year, month and day), so to convert an entire column:
dataFrame['column_date_converted'] = pd.to_datetime(dataFrame['column_to_convert'])

Related

How to remove the time from the datetime in Python? [duplicate]

I use pandas.to_datetime to parse the dates in my data. Pandas by default represents the dates with datetime64[ns] even though the dates are all daily only.
I wonder whether there is an elegant/clever way to convert the dates to datetime.date or datetime64[D] so that, when I write the data to CSV, the dates are not appended with 00:00:00. I know I can convert the type manually element-by-element:
[dt.to_datetime().date() for dt in df.dates]
But this is really slow since I have many rows and it sort of defeats the purpose of using pandas.to_datetime. Is there a way to convert the dtype of the entire column at once? Or alternatively, does pandas.to_datetime support a precision specification so that I can get rid of the time part while working with daily data?
Since version 0.15.0 this can now be easily done using .dt to access just the date component:
df['just_date'] = df['dates'].dt.date
The above returns a column of object dtype holding datetime.date values. If you want to keep a datetime64 dtype, you can normalize the time component to midnight, which sets all the times to 00:00:00:
df['normalised_date'] = df['dates'].dt.normalize()
This keeps the dtype as datetime64, but the display shows just the date value.
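To make the difference between the two results concrete, here is a small sketch with made-up timestamps. It only compares the resulting dtypes and values.

```python
import datetime
import pandas as pd

# Hypothetical data with a time-of-day component.
df = pd.DataFrame({"dates": pd.to_datetime(["2015-01-08 22:44:09", "2015-01-09 03:10:00"])})

# .dt.date yields Python datetime.date objects, stored in an object column.
df["just_date"] = df["dates"].dt.date

# .dt.normalize() stays in datetime64 and resets the time to midnight.
df["normalised_date"] = df["dates"].dt.normalize()

print(df.dtypes)
```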
pandas: .dt accessor
pandas.Series.dt
Simple Solution:
df['date_only'] = df['date_time_column'].dt.date
While I upvoted EdChum's answer, which is the most direct answer to the question the OP posed, it does not really solve the performance problem (it still relies on python datetime objects, and hence any operation on them will be not vectorized - that is, it will be slow).
A better performing alternative is to use df['dates'].dt.floor('d'). Strictly speaking, it does not "keep only date part", since it just sets the time to 00:00:00. But it does work as desired by the OP when, for instance:
printing to screen
saving to csv
using the column to groupby
... and it is much more efficient, since the operation is vectorized.
EDIT: in fact, the answer the OP would have preferred is probably "recent versions of pandas do not write the time to csv if it is 00:00:00 for all observations".
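A short sketch of the floor approach used as a groupby key. The column names and values are invented for illustration, and the uppercase "D" alias is used here as an assumption that it is the safer spelling across pandas versions.

```python
import pandas as pd

# Hypothetical data: two observations on one day, one on the next.
df = pd.DataFrame({
    "dates": pd.to_datetime(["2015-01-08 22:44:09", "2015-01-08 01:00:00", "2015-01-09 12:30:00"]),
    "value": [1, 2, 3],
})

# Vectorized: floor every timestamp to the start of its day (dtype stays datetime64).
df["day"] = df["dates"].dt.floor("D")

# The floored column can be used directly as a grouping key.
daily = df.groupby("day")["value"].sum()
print(daily)
```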
Pandas v0.13+: Use to_csv with date_format parameter
Avoid, where possible, converting your datetime64[ns] series to an object dtype series of datetime.date objects. The latter, often constructed using pd.Series.dt.date, is stored as an array of pointers and is inefficient relative to a pure NumPy-based series.
Since your concern is format when writing to CSV, just use the date_format parameter of to_csv. For example:
df.to_csv(filename, date_format='%Y-%m-%d')
See Python's strftime directives for formatting conventions.
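As a quick illustration of date_format, the sketch below writes to a string instead of a file (to_csv returns the CSV text when no path is given). The data is made up.

```python
import pandas as pd

# Hypothetical data with a time-of-day component.
df = pd.DataFrame({
    "dates": pd.to_datetime(["2015-01-08 22:44:09", "2015-01-09 03:10:00"]),
    "value": [1, 2],
})

# The column keeps its datetime64 dtype; only the written text is formatted.
csv_text = df.to_csv(index=False, date_format="%Y-%m-%d")
print(csv_text)
```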
This is a simple way to extract the date:
import pandas as pd
d='2015-01-08 22:44:09'
date=pd.to_datetime(d).date()
print(date)
Pandas DatetimeIndex and Series have a method called normalize that does exactly what you want.
You can read more about it in this answer.
It can be used as ser.dt.normalize()
Just giving a more up to date answer in case someone sees this old post.
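Converting with utc=True and then calling .dt.tz_localize(None) removes the timezone component while keeping a datetime64[ns] data type; adding .dt.normalize() also resets the time to midnight. (Passing utc=False on its own does not do this: it is already the default of pd.to_datetime and leaves timezone information in place.)
pd.to_datetime(df['Date'], utc=True).dt.tz_localize(None)
You will then be able to save it to Excel without getting the error "ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel."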
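A minimal sketch of making a timezone-aware column timezone-naive, which is what the Excel error message asks for. It assumes the input strings carry a UTC offset; the column name and values are illustrative only.

```python
import pandas as pd

# Hypothetical input: ISO strings with an explicit UTC offset.
df = pd.DataFrame({"Date": ["2020-08-19T09:12:57+00:00", "2020-08-20T18:30:00+00:00"]})

# Parse as timezone-aware UTC timestamps.
aware = pd.to_datetime(df["Date"], utc=True)

# Drop the timezone information; the wall-clock values are kept as-is.
naive = aware.dt.tz_localize(None)

# Optionally reset the time to midnight to keep only the date part.
date_only = naive.dt.normalize()
print(naive.dtype, date_only.iloc[0])
```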
df['Column'] = df['Column'].dt.strftime('%m/%d/%Y')
This will give you just the dates, with no time, in your desired format. Change the format string '%m/%d/%Y' as needed. Note that this changes the data type of the column to 'object' (the values become strings).
If you want just the dates, without the time, in YYYY-MM-DD format, use:
df['Column'] = pd.to_datetime(df['Column']).dt.date
The datatype will be 'object'.
For 'datetime64' datatype, use:
df['Column'] = pd.to_datetime(df['Column']).dt.normalize()
Converting to datetime64[D]:
df.dates.values.astype('M8[D]')
Though re-assigning that to a DataFrame col will revert it back to [ns].
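For reference, a small sketch of the NumPy-side conversion with made-up data. It only inspects the resulting array and does not assign it back to the DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a time-of-day component.
df = pd.DataFrame({"dates": pd.to_datetime(["2015-01-08 22:44:09", "2015-01-09 03:10:00"])})

# Pull out the underlying NumPy array and truncate it to day precision.
days = df["dates"].to_numpy().astype("datetime64[D]")
print(days.dtype)
print(days)
```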
If you wanted actual datetime.date:
import datetime
import numpy as np
dt = pd.DatetimeIndex(df.dates)
dates = np.array([datetime.date(*date_tuple) for date_tuple in zip(dt.year, dt.month, dt.day)])
I wanted to change the type for a set of columns in a data frame and then remove the time, keeping only the day. round(), floor() and ceil() all work:
df[date_columns] = df[date_columns].apply(pd.to_datetime)
df[date_columns] = df[date_columns].apply(lambda t: t.dt.floor('d'))
On tables of >1000000 rows I've found that these are both fast, with floor just slightly faster:
df['mydate'] = df.index.floor('d')
or
df['mydate'] = df.index.normalize()
If your index has timezones and you don't want those in the result, do:
df['mydate'] = df.index.tz_localize(None).floor('d')
df.index.date is many times slower; to_datetime() is even worse. Both have the further disadvantage that the results cannot be saved to an hdf store as it does not support type datetime.date.
Note that I've used the index as the date source here; if your source is another column, you would need to add .dt, e.g. df.mycol.dt.floor('d')
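A runnable sketch of the index-based variant, assuming a small frame with a UTC DatetimeIndex (the data is invented). It strips the timezone first and then floors to the day.

```python
import pandas as pd

# Hypothetical frame indexed by timezone-aware timestamps.
idx = pd.DatetimeIndex(
    ["2020-08-19 09:12:57", "2020-08-19 23:59:59", "2020-08-20 00:00:01"], tz="UTC"
)
df = pd.DataFrame({"value": [1, 2, 3]}, index=idx)

# Index methods are called directly (no .dt); the result is assigned positionally.
df["mydate"] = df.index.tz_localize(None).floor("D")
print(df)
```

For a regular column instead of the index, the same calls go through the .dt accessor, for example df["col"].dt.tz_localize(None).dt.floor("D").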
This worked for me on a UTC timestamp (2020-08-19T09:12:57.945888):
for di, i in enumerate(df['YourColumnName']):
    # Use .loc on the row label to avoid chained assignment
    df.loc[df.index[di], 'YourColumnName'] = pd.Timestamp(i)
If the column is not already in datetime format:
df['DTformat'] = pd.to_datetime(df['col'])
Once it's in datetime format you can convert the entire column to date only like this:
df['DateOnly'] = df['DTformat'].apply(lambda x: x.date())

Handling timestamps with timezones in Pandas and Rpy2

I'm trying to understand how to add a row that contains a timestamp to a Pandas dataframe that has a column with a data type of datetime64[ns, UTC]. Unfortunately, when I add a row, the column datatype changes to object, which ends up breaking conversion to an R data frame via Rpy2.
Here are the interesting lines of code where I'm seeing the problem, with debug printing statements around it whose output I'll share as well. The variable observation is a simple python list whose first value is a timestamp. Code:
print('A: df.dtypes[0] = {}'.format(str(df.dtypes[0])))
print('observation[0].type = {}, observation[0].tzname() = {}'.format(str(type(observation[0])), observation[0].tzname()))
df.loc[len(df)] = observation
print('B: df.dtypes[0] = {}'.format(str(df.dtypes[0])))
Here is the output of the above code snippet:
A: df.dtypes[0] = datetime64[ns, UTC]
observation[0].type = <class 'datetime.datetime'>, observation[0].tzname() = UTC
B: df.dtypes[0] = object
What I'm observing is that the datatype of the column is being changed when I append the row. As far as I can tell, Pandas is adding the timestamp as an instance of . The rpy2 pandas2ri module seems to be unable to convert values of that class.
I've so far been unable to find an approach that lets me append a row to the data frame and preserve the column type for the timestamp column. Suggestions would be welcome.
==========================
Update
I've been able to work around the problem in a hacky way. I create a one-row temporary dataframe from the list of values, then set the types on the columns for this one-row dataframe. Then I append the row from this temporary dataframe to the one I'm working on. This is the only approach I was able to identify that preserves the column type of the dataframe I'm appending to. It's almost enough to make me pine for a strongly typed language.
I'd prefer a more elegant solution, so I'm leaving this open in case anyone can suggest one.
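The workaround described in the update can be sketched roughly as follows. The column names, values and the use of pd.concat for the append step are assumptions for illustration; the original code is not shown in the post.

```python
from datetime import datetime, timezone
import pandas as pd

# Hypothetical frame with a timezone-aware timestamp column.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2020-08-19T09:12:57Z"], utc=True),
    "value": [1.0],
})
dtype_before = df["ts"].dtype

# The new observation: a plain Python list whose first value is a timestamp.
observation = [datetime(2020, 8, 20, 10, 0, tzinfo=timezone.utc), 2.5]

# Build a one-row frame, set the timestamp column's type explicitly, then append it.
row = pd.DataFrame([observation], columns=df.columns)
row["ts"] = pd.to_datetime(row["ts"], utc=True)
df = pd.concat([df, row], ignore_index=True)

print(dtype_before, "->", df["ts"].dtype)
```

Because both frames carry a timezone-aware datetime column before concatenation, the timestamp column does not fall back to object.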
Check this post for an answer, especially the answer by Wes McKinney:
Converting between datetime, Timestamp and datetime64

Pandas timestamps in ISO format cause Exasol error when importing

When using pyexasol's import_from_pandas(df) for a DataFrame, df, which has a datetime column, Exasol (6.2) throws an error because it can't parse the ISO-formatted string representation of the dataframe column. Specifically, the "+00:00" final characters are unparsable by Exasol. My current workaround is to turn all pandas datetime columns into string columns, but that can cost a lot of time.
What's the right way to import datetime columns from Pandas dataframes into an existing Exasol table with a TIMESTAMP column type?
PyEXASOL creator is here.
You may use import_params dictionary argument to pass additional parameters to pandas.to_csv() method which is used internally. One of such parameters is date_format. Just pass the right format compatible with Exasol.
I'll consider adding this parameter by default.
Hope it helps!
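To see what date_format does on the pandas side, here is a small sketch that writes a timezone-aware column without the "+00:00" suffix. The format string and column names are assumptions, so check which timestamp format your Exasol table accepts; how the parameter is handed to pyexasol is not shown here.

```python
import pandas as pd

# Hypothetical frame with a timezone-aware (UTC) timestamp column.
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2020-08-19T09:12:57.945888+00:00"], utc=True),
    "value": [1],
})

# An explicit strftime-style format leaves out the UTC offset in the written text.
csv_text = df.to_csv(index=False, header=False, date_format="%Y-%m-%d %H:%M:%S.%f")
print(csv_text)
```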
