I use pandas.to_datetime to parse the dates in my data. Pandas by default represents the dates with datetime64[ns] even though the dates are all daily only.
I wonder whether there is an elegant/clever way to convert the dates to datetime.date or datetime64[D] so that, when I write the data to CSV, the dates are not appended with 00:00:00. I know I can convert the type manually element-by-element:
[dt.to_datetime().date() for dt in df.dates]
But this is really slow since I have many rows and it sort of defeats the purpose of using pandas.to_datetime. Is there a way to convert the dtype of the entire column at once? Or alternatively, does pandas.to_datetime support a precision specification so that I can get rid of the time part while working with daily data?
Since version 0.15.0 this can now be easily done using .dt to access just the date component:
df['just_date'] = df['dates'].dt.date
The above returns an object-dtype column of datetime.date values. If you want to keep a datetime64 dtype, you can instead normalize the time component to midnight, which sets all the times to 00:00:00:
df['normalised_date'] = df['dates'].dt.normalize()
This keeps the dtype as datetime64, but the display shows just the date value.
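As a rough illustration of the difference between the two approaches, here is a minimal sketch. The sample frame and its values are made up for the example; only .dt.date and .dt.normalize() come from the answer.

```python
import datetime

import pandas as pd

# Hypothetical sample data, just to show the resulting dtypes.
df = pd.DataFrame(
    {"dates": pd.to_datetime(["2015-01-08 22:44:09", "2015-01-09 03:10:00"])}
)

# .dt.date yields Python datetime.date objects stored in an object column.
just_date = df["dates"].dt.date

# .dt.normalize() keeps a datetime64 column and resets the time to midnight.
normalised = df["dates"].dt.normalize()

print(just_date.dtype)
print(normalised.dtype)
```

The object column is convenient for display, while the normalized column stays usable with the vectorized datetime machinery.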
pandas: .dt accessor
pandas.Series.dt
Simple Solution:
df['date_only'] = df['date_time_column'].dt.date
While I upvoted EdChum's answer, which is the most direct answer to the question the OP posed, it does not really solve the performance problem (it still relies on Python datetime objects, and hence any operation on them will not be vectorized - that is, it will be slow).
A better performing alternative is to use df['dates'].dt.floor('d'). Strictly speaking, it does not "keep only date part", since it just sets the time to 00:00:00. But it does work as desired by the OP when, for instance:
printing to screen
saving to csv
using the column to groupby
... and it is much more efficient, since the operation is vectorized.
EDIT: in fact, the answer the OP would have preferred is probably "recent versions of pandas do not write the time to csv if it is 00:00:00 for all observations".
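A small sketch of the floor approach used as a groupby key. The column names and values are invented for the example, and the frequency is spelled "D" here because newer pandas versions prefer the uppercase alias.

```python
import pandas as pd

# Hypothetical sample data: two rows on the same day, one on the next.
df = pd.DataFrame(
    {
        "dates": pd.to_datetime(
            ["2015-01-08 22:44:09", "2015-01-08 03:10:00", "2015-01-09 12:00:00"]
        ),
        "value": [1, 2, 3],
    }
)

# Vectorized: every timestamp is truncated to midnight of its own day.
df["day"] = df["dates"].dt.floor("D")

# The truncated column works directly as a grouping key.
daily_totals = df.groupby("day")["value"].sum()
print(daily_totals)
```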
Pandas v0.13+: Use to_csv with date_format parameter
Avoid, where possible, converting your datetime64[ns] series to an object dtype series of datetime.date objects. The latter, often constructed using pd.Series.dt.date, is stored as an array of pointers and is inefficient relative to a pure NumPy-based series.
Since your concern is format when writing to CSV, just use the date_format parameter of to_csv. For example:
df.to_csv(filename, date_format='%Y-%m-%d')
See Python's strftime directives for formatting conventions.
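For instance, a minimal sketch that writes to an in-memory buffer instead of a file (the frame and the buffer are assumptions made for the example):

```python
import io

import pandas as pd

# Hypothetical sample data with a time component.
df = pd.DataFrame(
    {
        "dates": pd.to_datetime(["2015-01-08 22:44:09", "2015-01-09 03:10:00"]),
        "value": [1, 2],
    }
)

# The column stays datetime64 in memory; only the CSV text is formatted.
buf = io.StringIO()
df.to_csv(buf, index=False, date_format="%Y-%m-%d")
csv_text = buf.getvalue()
print(csv_text)
```

The DataFrame itself is left untouched, so later datetime operations remain fast.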
This is a simple way to extract the date:
import pandas as pd
d='2015-01-08 22:44:09'
date=pd.to_datetime(d).date()
print(date)
Pandas DatetimeIndex and Series have a method called normalize that does exactly what you want.
You can read more about it in this answer.
It can be used as ser.dt.normalize()
Just giving a more up to date answer in case someone sees this old post.
Note that utc=False is already the default for pd.to_datetime and does not by itself remove an existing timezone. To drop the timezone component and keep a timezone-naive datetime64 column, strip it explicitly:
pd.to_datetime(df['Date']).dt.tz_localize(None)
You will be able to save it in excel without getting the error "ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel."
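One way to get timezone-naive values before an Excel export is sketched below. It assumes a UTC-aware column and uses .dt.tz_localize(None), which drops the timezone while keeping the wall-clock time; the sample values are invented.

```python
import pandas as pd

# Hypothetical timezone-aware column (UTC).
s = pd.Series(
    pd.to_datetime(
        ["2020-08-19T09:12:57+00:00", "2020-08-20T23:30:00+00:00"], utc=True
    )
)

# Drop the timezone, then reset the time of day to midnight.
naive_days = s.dt.tz_localize(None).dt.normalize()
print(naive_days)
```

If the timestamps should first be expressed in another zone, .dt.tz_convert(...) can be applied before the timezone is removed.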
df['Column'] = df['Column'].dt.strftime('%m/%d/%Y')
This will give you just the dates, with no time, in your desired format. You can change the format string ('%m/%d/%Y') according to your needs. It will change the data type of the column to 'object'.
If you want just the dates, without the time, in YYYY-MM-DD format, use:
df['Column'] = pd.to_datetime(df['Column']).dt.date
The datatype will be 'object'.
For 'datetime64' datatype, use:
df['Column'] = pd.to_datetime(df['Column']).dt.normalize()
Converting to datetime64[D]:
df.dates.values.astype('M8[D]')
Though re-assigning that to a DataFrame col will revert it back to [ns].
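A short sketch of the NumPy route, with made-up sample data. The result is a plain NumPy array rather than a pandas column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"dates": pd.to_datetime(["2015-01-08 22:44:09", "2015-01-09 03:10:00"])}
)

# Casting to day resolution truncates the time of day.
days = df["dates"].to_numpy().astype("datetime64[D]")
print(days.dtype)
print(days)
```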
If you wanted actual datetime.date:
import datetime
import numpy as np
dt = pd.DatetimeIndex(df.dates)
dates = np.array([datetime.date(*date_tuple) for date_tuple in zip(dt.year, dt.month, dt.day)])
I wanted to be able to change the type for a set of columns in a data frame and then remove the time, keeping the day. round(), floor(), and ceil() all remove the time, but only floor() is guaranteed to keep the same day:
df[date_columns] = df[date_columns].apply(pd.to_datetime)
df[date_columns] = df[date_columns].apply(lambda t: t.dt.floor('d'))
On tables of >1000000 rows I've found that these are both fast, with floor just slightly faster:
df['mydate'] = df.index.floor('d')
or
df['mydate'] = df.index.normalize()
If your index has timezones and you don't want those in the result, do:
df['mydate'] = df.index.tz_localize(None).floor('d')
df.index.date is many times slower; to_datetime() is even worse. Both have the further disadvantage that the results cannot be saved to an hdf store as it does not support type datetime.date.
Note that I've used the index as the date source here; if your source is another column, you would need to add .dt, e.g. df.mycol.dt.floor('d')
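To make the index variant concrete, here is a minimal sketch with an invented two-row frame and a UTC index; the frequency is written as "D" because newer pandas versions prefer the uppercase alias.

```python
import pandas as pd

# Hypothetical frame with a timezone-aware DatetimeIndex.
idx = pd.DatetimeIndex(["2020-08-19 09:12:57", "2020-08-20 23:30:00"], tz="UTC")
df = pd.DataFrame({"value": [1, 2]}, index=idx)

# Strip the timezone first, then truncate to the day.
naive_idx = df.index.tz_localize(None)
df["mydate"] = naive_idx.floor("D")

# normalize() gives the same result as floor("D") on a naive index.
same = (naive_idx.normalize() == naive_idx.floor("D")).all()
print(df)
```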
This worked for me on UTC Timestamp (2020-08-19T09:12:57.945888)
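for di, i in enumerate(df['YourColumnName']):
    # write by position; chained assignment (df[col][di] = ...) may not update df
    df.iloc[di, df.columns.get_loc('YourColumnName')] = pd.Timestamp(i)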
If the column is not already in datetime format:
df['DTformat'] = pd.to_datetime(df['col'])
Once it's in datetime format you can convert the entire column to date only like this:
df['DateOnly'] = df['DTformat'].apply(lambda x: x.date())
Related
I have a column that may contain both date and datetime objects and I want to convert all of them to date.
If I do df['DT_REFERENCE'].dt.date it will throw an error on dates.
The best solution I could come up with is
df['DT_REFERENCE'].astype('datetime64[ns]').dt.date
but I was wondering if there is a better approach.
Use to_datetime:
pd.to_datetime(df['DT_REFERENCE']).dt.date
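A quick sketch with a column that mixes both object types. The sample values are invented; the column name follows the question.

```python
import datetime

import pandas as pd

# Hypothetical column holding one date and one datetime object.
df = pd.DataFrame(
    {
        "DT_REFERENCE": [
            datetime.date(2021, 3, 1),
            datetime.datetime(2021, 3, 2, 14, 30),
        ]
    }
)

# to_datetime unifies both kinds of value; .dt.date then drops the time.
df["DT_REFERENCE"] = pd.to_datetime(df["DT_REFERENCE"]).dt.date
print(df)
```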
One thing that I'm finding hard with the pandas/numpy combo is dealing with dates. My dataframe time series indices are often DatetimeIndex objects containing Timestamps but sometimes seem to be something else (e.g. datetime.date or numpy.datetime64).
Is there a generic way to check if a particular object is a date, i.e. any of the known date variable types? Or is that a function I should look to create myself?
Thanks!
I use this function to convert a series to a consistent datetime object in pandas / numpy. It works with both scalars and series.
import pandas as pd
x = '2018-12-11'
pd.to_datetime(x) # Timestamp('2018-12-11 00:00:00')
import datetime
if isinstance(yourVariable, datetime.datetime):
    print("it's a date")
I would try converting the string representation of what I suspect to be a datetime into a datetime object, using the parse function from dateutil.parser.
https://chrisalbon.com/python/basics/strings_to_datetime/
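A generic check could look roughly like the sketch below. The helper name is_date_like is made up for this example and is not a pandas or NumPy function; it relies on pd.Timestamp and datetime.datetime both being subclasses of datetime.date.

```python
import datetime

import numpy as np
import pandas as pd


def is_date_like(obj):
    """Return True for date, datetime, pandas Timestamp and numpy datetime64 scalars."""
    # datetime.datetime and pd.Timestamp are subclasses of datetime.date,
    # so a single isinstance check covers all three.
    return isinstance(obj, (datetime.date, np.datetime64))


print(is_date_like(pd.Timestamp("2018-12-11")))
print(is_date_like("2018-12-11"))
```

Strings are deliberately not treated as dates here; they would need to be parsed first, for example with pd.to_datetime.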
I am using pandas to import data dfST = read_csv( ... , parse_dates={'timestamp':[date]})
In my csv, date is in the format YYYY/MM/DD, which is all I need - there is no time. I have several data sets that I need to compare for membership. When I convert these 'timestamp' values to a string, sometimes I get something like this:
'1977-07-31T00:00:00.000000000Z'
which I understand is a datetime including milliseconds and a timezone. Is there any way to suppress the addition of the extraneous time on import? If not, I need to exclude it somehow.
dfST.timestamp[1]
Out[138]: Timestamp('1977-07-31 00:00:00')
I have tried formatting it, which seemed to work until I called the formatted values:
dfSTdate=pd.to_datetime(dfST.timestamp, format="%Y-%m-%d")
dfSTdate.head()
Out[123]:
0 1977-07-31
1 1977-07-31
Name: timestamp, dtype: datetime64[ns]
But no... when I test the value of this I also get the time:
dfSTdate[1]
Out[124]: Timestamp('1977-07-31 00:00:00')
When I convert this to an array, the time is included with the millisecond and the timezone, which really messes my comparisons up.
test97=np.array(dfSTdate)
test97[1]
Out[136]: numpy.datetime64('1977-07-30T20:00:00.000000000-0400')
How can I get rid of the time?!?
Ultimately I wish to compare membership among data sets using numpy.in1d with date as a string ('YYYY-MM-DD') as one part of the comparison
This is due to the way datetime values are stored in pandas: using the numpy datetime64[ns] dtype. So datetime values are always stored at nanosecond resolution. Even if you only have a date, this will be converted to a timestamp with a zero time of nanosecond resolution. This is just due to the implementation in pandas.
The issues you have with printing the values and getting unexpected results are just due to how these objects are printed in the Python console (their representation), not their actual value.
If you print a single value, you get the Timestamp representation of pandas:
Timestamp('1977-07-31 00:00:00')
So you get the seconds here as well, just because this is the default representation.
If you convert it to an array, and then print it, you get the standard numpy representation:
numpy.datetime64('1977-07-30T20:00:00.000000000-0400')
This is indeed a very misleading representation. Because numpy will, just for printing it in the console, convert it to your local timezone. But this doesn't change your actual value, it's just weird printing.
That is the background, now to answer your question, how do I get rid of the time?
That depends on your goal. Do you really want to convert it to a string? Or do you just not like the repr?
If you just want to work with the datetime values, you don't need to get rid of it.
If you want to convert it to strings, you can apply strftime (df['timestamp'].apply(lambda x: x.strftime('%Y-%m-%d'))). Or, if the goal is to write it as strings to csv, use the date_format keyword in to_csv.
If you really want a 'date', you can use the datetime.date type (standard Python type) in a DataFrame column. You can convert your existing column to this with: pd.DatetimeIndex(dfST['timestamp']).date. But personally I don't think this has many advantages.