I have about 800,000 rows of data in a dataframe, and one column, df['Date'], is a date-and-time string in the format 'YYYY-MM-DD HH:MM:SS.fff' with no timezone information. However, I know the timestamps are in the America/New_York timezone and they need to be converted to CET. I currently have two methods to get the job done:
method 1 (very slow for sure):
df['Date'].apply(lambda x: timezone('America/New_York')
                 .localize(datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
                 .astimezone(timezone('CET')))
method 2:
df.index = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S.%f')
df.index = df.index.tz_localize('America/New_York').tz_convert('CET')
I am just wondering whether there are any better ways to do it, or any potential pitfalls of the methods I listed. Thanks!
Also, I would like to shift all timestamps by a fixed amount of time, such as 1 ms (timedelta(0, 0, 1000)). How can I implement that using method 2?
Method 2 is definitely the best way of doing this.
However, it occurs to me that you are converting these dates after you have loaded the data.
It is much faster to parse dates while loading the file than to change them afterwards (not to mention cleaner).
If your data is loaded from a CSV file using the pandas.read_csv() function, for instance, you can use the parse_dates= and date_parser= options.
You can try it out directly with your lambda function as the date_parser= and just set parse_dates= to a list of your date columns, like this:
pd.read_csv('myfile.csv', parse_dates=['Date'],
            date_parser=lambda x: timezone('America/New_York')
            .localize(datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
            .astimezone(timezone('CET')))
This should work and will probably be the fastest.
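As for the follow-up about shifting every timestamp by a fixed offset such as 1 ms, here is a minimal sketch of how that could be combined with method 2 (the column name follows the question; the 1 ms value is just the example given):

import pandas as pd

# parse the naive strings, then localize to New York and convert to CET
idx = pd.DatetimeIndex(pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S.%f'))
idx = idx.tz_localize('America/New_York').tz_convert('CET')

# shift every timestamp by a fixed 1 ms offset in one vectorized operation
df.index = idx + pd.Timedelta(milliseconds=1)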
Related
I am using a Jupyter notebook and reading a .csv file with pandas read_csv. The following code takes a really long time (more than 10 minutes). The dataset has 70,821 entries.
[Input]
df = pd.read_csv("file.csv", parse_dates=[0])
df.head()
[Output]
created_at field1 field2
2022-09-16T22:53:19+07:00 100.0 NaN
2022-09-16T22:54:46+07:00 100.0 NaN
2022-09-16T22:56:14+07:00 100.0 NaN
2022-09-16T23:02:01+07:00 100.0 NaN
2022-09-16T23:03:28+07:00 100.0 NaN
If I just use parse_dates=True it does not detect the date and keeps the column as object.
If I read the dataset without parse_dates it goes much faster, as I would expect (~1 second).
When I then use a separate line of code to parse the date, like
df["created_at"]=pd.to_datetime(df['created_at'])
it goes faster than using parse_dates in read_csv but still takes a couple of minutes (around 3 minutes).
Using the following
df["created_at"]=pd.to_datetime(df['created_at'], format="%Y-%m-%d %H:%M:%S%z")
or with the T in the format string
df["created_at"]=pd.to_datetime(df['created_at'], format="%Y-%m-%dT%H:%M:%S%z")
or
df["created_at"]=pd.to_datetime(df['created_at'], infer_datetime_format=True)
does not increase the speed (still around 3 minutes).
So my questions are the following:
1. Why is parsing the date directly with read_csv the slowest way?
2. Why is the pd.to_datetime approach faster?
3. Why does something like format="%Y-%m-%d %H:%M:%S%z" or infer_datetime_format=True not speed up the process?
And finally, how do I do this in a better way? If I read the file once and parse it to datetime, what would be the best way to write it back to a CSV file so I don't have to go through this process all over again? I assume I have to write a function and manually change every entry to a better-formatted date?
Can someone help me figure out why these different approaches take such different amounts of time, and how I can speed things up, e.g. with something like what I tried in 3.?
Thanks a lot.
EDIT:
I tried manually adjusting the date format to see where it causes trouble. It turns out that when I delete +07:00 from the date string, parsing is fast (~500 ms).
Under the following link I uploaded two csv files. example1 is the file with the problematic datetime format. In example2_no_timezone I deleted the +07:00 part from every entry, which makes the parsing fast again (expected behaviour).
Folder with two example datasets
The questions above sadly remain:
Why is pandas not able to read the original date string in a reasonable time?
Why is to_datetime faster (but still too slow with the original dataset)?
How do I fix this without changing the format in the original dataset (e.g., by means of to_datetime and providing format=...)?
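One thing worth trying, purely as a suggestion on my part (the original poster did not test this): when the strings all carry a fixed UTC offset such as +07:00, asking pandas for a single UTC-aware result often keeps the parsing on the fast, vectorized path; you can then convert back to the offset you want. A minimal sketch, assuming the file and column names from the question:

import pandas as pd

# parse in one vectorized pass as UTC, then convert to a fixed +07:00 zone
df = pd.read_csv("file.csv")
df["created_at"] = pd.to_datetime(
    df["created_at"], format="%Y-%m-%dT%H:%M:%S%z", utc=True
).dt.tz_convert("Asia/Bangkok")  # Asia/Bangkok is a permanent UTC+07:00 zone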
I use pandas.to_datetime to parse the dates in my data. By default, pandas represents the dates as datetime64[ns], even though the dates are daily only.
I wonder whether there is an elegant/clever way to convert the dates to datetime.date or datetime64[D] so that, when I write the data to CSV, the dates are not appended with 00:00:00. I know I can convert the type manually element-by-element:
[dt.to_datetime().date() for dt in df.dates]
But this is really slow since I have many rows and it sort of defeats the purpose of using pandas.to_datetime. Is there a way to convert the dtype of the entire column at once? Or alternatively, does pandas.to_datetime support a precision specification so that I can get rid of the time part while working with daily data?
Since version 0.15.0 this can now be easily done using .dt to access just the date component:
df['just_date'] = df['dates'].dt.date
The above returns an object-dtype column of datetime.date values. If you want to keep a datetime64 dtype instead, you can normalize the time component to midnight, which sets all the values to 00:00:00:
df['normalised_date'] = df['dates'].dt.normalize()
This keeps the dtype as datetime64, but the display shows just the date value.
pandas: .dt accessor
pandas.Series.dt
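A quick illustration of the difference in dtypes between the two approaches (the column name and values are just an example):

import pandas as pd

df = pd.DataFrame({'dates': pd.to_datetime(['2015-01-08 22:44:09',
                                            '2015-01-09 05:30:00'])})

print(df['dates'].dt.date.dtype)         # object (python datetime.date values)
print(df['dates'].dt.normalize().dtype)  # datetime64[ns]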
Simple Solution:
df['date_only'] = df['date_time_column'].dt.date
While I upvoted EdChum's answer, which is the most direct answer to the question the OP posed, it does not really solve the performance problem (it still relies on Python datetime objects, so any operation on them will not be vectorized - that is, it will be slow).
A better-performing alternative is df['dates'].dt.floor('d'). Strictly speaking, it does not "keep only the date part": it just sets the time to 00:00:00. But it does work as the OP desires when, for instance:
printing to screen
saving to csv
using the column to groupby
... and it is much more efficient, since the operation is vectorized.
EDIT: in fact, the answer the OP would probably have preferred is "recent versions of pandas do not write the time to csv if it is 00:00:00 for all observations".
Pandas v0.13+: Use to_csv with date_format parameter
Avoid, where possible, converting your datetime64[ns] series to an object dtype series of datetime.date objects. The latter, often constructed using pd.Series.dt.date, is stored as an array of pointers and is inefficient relative to a pure NumPy-based series.
Since your concern is format when writing to CSV, just use the date_format parameter of to_csv. For example:
df.to_csv(filename, date_format='%Y-%m-%d')
See Python's strftime directives for formatting conventions.
This is a simple way to extract the date:
import pandas as pd
d='2015-01-08 22:44:09'
date=pd.to_datetime(d).date()
print(date)
Pandas DatetimeIndex and Series have a method called normalize that does exactly what you want.
You can read more about it in this answer.
It can be used as ser.dt.normalize()
Just giving a more up-to-date answer in case someone sees this old post.
If your datetimes carry a timezone, stripping it with tz_localize(None) keeps only the local wall time in a plain datetime64[ns] data type:
pd.to_datetime(df['Date']).dt.tz_localize(None)
You will then be able to save it in Excel without getting the error "ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel."
df['Column'] = df['Column'].dt.strftime('%m/%d/%Y')
This will give you just the date, with NO time, in your desired format. You can change the format to suit your needs (here '%m/%d/%Y'); note that it changes the data type of the column to 'object'.
If you want just the date (and no time) in YYYY-MM-DD format, use:
df['Column'] = pd.to_datetime(df['Column']).dt.date
The datatype will be 'object'.
For 'datetime64' datatype, use:
df['Column'] = pd.to_datetime(df['Column']).dt.normalize()
Converting to datetime64[D]:
df.dates.values.astype('M8[D]')
Though re-assigning that to a DataFrame col will revert it back to [ns].
If you wanted actual datetime.date:
import datetime
import numpy as np
dt = pd.DatetimeIndex(df.dates)
dates = np.array([datetime.date(*date_tuple) for date_tuple in zip(dt.year, dt.month, dt.day)])
I wanted to be able to change the type for a set of columns in a data frame and then remove the time, keeping only the day. round(), floor() and ceil() all work:
df[date_columns] = df[date_columns].apply(pd.to_datetime)
df[date_columns] = df[date_columns].apply(lambda t: t.dt.floor('d'))
On tables of >1000000 rows I've found that these are both fast, with floor just slightly faster:
df['mydate'] = df.index.floor('d')
or
df['mydate'] = df.index.normalize()
If your index has timezones and you don't want those in the result, do:
df['mydate'] = df.index.tz_localize(None).floor('d')
df.index.date is many times slower; to_datetime() is even worse. Both have the further disadvantage that the results cannot be saved to an hdf store as it does not support type datetime.date.
Note that I've used the index as the date source here; if your source is another column, you would need to add .dt, e.g. df.mycol.dt.floor('d')
This worked for me on a UTC timestamp (2020-08-19T09:12:57.945888):
# convert the whole column to pandas Timestamps in one vectorized call
# (same result as looping over the values with pd.Timestamp, but faster
# and without chained-assignment problems)
df['YourColumnName'] = pd.to_datetime(df['YourColumnName'])
If the column is not already in datetime format:
df['DTformat'] = pd.to_datetime(df['col'])
Once it's in datetime format you can convert the entire column to date only like this:
df['DateOnly'] = df['DTformat'].apply(lambda x: x.date())
When pandas loads data, from a csv, for example, it runs infer_objects() (or something that does the same thing). infer_objects tries to determine an appropriate dtype for each column loaded. It sometimes (always?) does not infer datetime columns.
In order to do dynamic downstream analysis, I need the dtypes to be assigned automatically. I want to convert object columns to datetimes dynamically. I really like using infer_datetime_format=True within pd.to_datetime(), because I don't always know the format the dates will arrive in. But this won't work long-term because it is too eager:
it won't throw any errors on int or float columns (because those technically could be dates). Alternatively, if I only try to convert columns with a dtype of 'object', I can get it to fail noisily with errors='raise'. But for many of the datetime columns I deal with, I prefer to coerce format errors (errors='coerce') rather than prevent the column from being converted at all.
So has anyone found a good way to detect columns that are realistically dates?
You can try passing a customized date_parser:
pd.read_csv('file.csv', parse_dates=[0,1,2,3],
date_parser=lambda x: pd.to_datetime(x, errors='coerce'))
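If the goal is instead to decide dynamically which object columns are realistically dates, a heuristic sketch along the lines discussed in the question (the helper name and the 90% threshold are my own choices, not from the question) might look like this:

import pandas as pd

def convert_likely_datetime_columns(df, min_parsed=0.9):
    # only consider object-dtype columns, so ints and floats are never touched
    for col in df.select_dtypes(include='object').columns:
        parsed = pd.to_datetime(df[col], errors='coerce')
        non_null = df[col].notna().sum()
        # treat the column as a date column only if most values parse cleanly
        if non_null and parsed.notna().sum() / non_null >= min_parsed:
            df[col] = parsed
    return df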