Pandas: datetime conversion from dtype object - python

I am working on a timeseries dataset which looks like this:
DateTime SomeVariable
0 01/01 01:00:00 0.24244
1 01/01 02:00:00 0.84141
2 01/01 03:00:00 0.14144
3 01/01 04:00:00 0.74443
4 01/01 05:00:00 0.99999
The dates have no year. Initially the dtype of the DateTime column is object, and I am trying to convert it to the pandas datetime format. Because my data has no year, calling:
df['DateTime'] = pd.to_datetime(df.DateTime)
I am getting the error OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 01:00:00
I understand why I am getting the error (the values are not in a format pandas accepts), but what I want to know is how to change the dtype from object to pandas datetime without having a year in my dates. I would appreciate any hints.
EDIT 1:
I have since learned that I can't do it without a year in the data, so this is how I am trying to change the dtype:
df = pd.read_csv(some file location)
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'], format='%y%d/%m %H:%M:%S')
df.head()
On doing that, I am getting:
ValueError: time data '2018/ 01/01 01:00:00' doesn't match format specified.
EDIT 2:
Changing the format to '%Y/%m/%d %H:%M:%S'.
My data is hourly data, so it goes till 24h. I have only provided the demo data till 5h.
I was getting the space on adding the year to the DateTime. In order to remove that, this is what I did:
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'][1:], format='%Y/%m/%d %H:%M:%S')
I am getting the following error for that:
ValueError: time data '2018/ 01/01 02:00:00' doesn't match format specified
On changing the format to '%y/%m/%d %H:%M:%S' with the same code, this is the error I get:
ValueError: time data '2018/ 01/01 02:00:00' does not match format '%y/%m/%d %H:%M:%S' (match)
The problem is because of the gap after the year but I am not able to get rid of it.
EDIT 3:
I am able to get rid of the space after adding the year, however I am still not able to change the dtype.
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'].str.strip(), format='%Y/%m/%d %H:%M:%S')
ValueError: time data '2018/01/01 01:00:00' doesn't match format specified
I noticed that there are 2 spaces between the date and the time in the error, however adding 2 spaces in the format doesn't help.
EDIT 4 (Solution):
I removed all the repeated whitespace, but the format still did not match. The real problem was the time format: the hours in my data run from 1 to 24, while pandas supports 0 to 23. I simply changed the time 24:00:00 to 00:00:00 and it works perfectly now.
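Note that mapping 24:00:00 to 00:00:00 keeps the reading on the same calendar date. If hour 24 is meant as the end of the day (an assumption about this dataset, not something stated in the question), those rows should land on the next date instead. A minimal sketch of that variant, using hypothetical sample rows, the hand-supplied year 2018, and the month/day order from the format in EDIT 2:

```python
import pandas as pd

# Hypothetical sample; hours run 01..24 as described in EDIT 4.
df = pd.DataFrame(
    {"DateTime": ["01/01  23:00:00", "01/01  24:00:00", "01/02  01:00:00"]}
)

# Collapse runs of whitespace to a single space and trim the ends.
cleaned = df["DateTime"].str.split().str.join(" ")

# Remember which rows use hour 24, then rewrite them as hour 00.
is_hour_24 = cleaned.str.contains(" 24:", regex=False)
cleaned = cleaned.str.replace(" 24:", " 00:", regex=False)

# ASSUMPTION: year 2018 and month/day order, as in the format from EDIT 2.
parsed = pd.to_datetime("2018/" + cleaned, format="%Y/%m/%d %H:%M:%S")

# ASSUMPTION: hour 24 marks the end of the day, so shift those rows by one day.
df["DateTime"] = parsed + pd.to_timedelta(is_hour_24.astype(int), unit="D")
print(df)
```

If hour 24 should stay on the same date, drop the final shift and keep only the replacement step.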

This is not possible. A datetime object must have a year.
What you can do is ensure all years are aligned for your data.
For example, to convert to datetime while setting year to 2018:
df = pd.DataFrame({'DateTime': ['01/01 01:00:00', '01/01 02:00:00', '01/01 03:00:00',
'01/01 04:00:00', '01/01 05:00:00']})
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'], format='%Y/%m/%d %H:%M:%S')
print(df)
DateTime
0 2018-01-01 01:00:00
1 2018-01-01 02:00:00
2 2018-01-01 03:00:00
3 2018-01-01 04:00:00
4 2018-01-01 05:00:00

# Remove spaces. Keep in mind that this removes all spaces, including the one between date and time.
df['DateTime'] = df['DateTime'].str.replace(" ", "")
# I'm assuming the year does not matter and that 01/01 is in day/month order.
# With no %Y in the format, the parsed year defaults to 1900.
df['DateTime'] = pd.to_datetime(df['DateTime'], format='%d/%m%H:%M:%S')

Related

After changing column type to string, hours, minutes and seconds are missing from the date

I have a dataframe loaded from a file containing a time series and values:
datetime value_a
0 2019-08-19 00:00:00 194.32000000
1 2019-08-20 00:00:00 202.24000000
2 2019-08-21 00:00:00 196.55000000
3 2019-08-22 00:00:00 187.45000000
4 2019-08-23 00:00:00 190.36000000
After I try to convert first column to string, the hours minutes and seconds vanish.
datetime value_a
0 2019-08-19 194.32000000
1 2019-08-20 202.24000000
2 2019-08-21 196.55000000
3 2019-08-22 187.45000000
4 2019-08-23 190.36000000
Code snippet:
df['datetime'] = df['datetime'].astype(str)
I kinda need the format %Y-%m-%d %H:%M:%S, because we are using it later.
What is wrong?
NOTE: I initially thought that the issue occurred during the conversion from object to datetime; however, thanks to user @SomeDude, I discovered that I am losing the h/m/s during the conversion to string.
It seems the problem can be fixed by using a different conversion method with an explicit format:
df['datetime'] = df['datetime'].dt.strftime("%Y-%m-%d %H:%M:%S")
This works.
You're saying "I don't like the default format".
Ok. So be explicit, include HMS in it when you re-format.
>>> df = pd.DataFrame([dict(datetime='2019-08-19 00:00:00', value_a=194.32)])
>>> df['datetime'] = pd.to_datetime(df.datetime)
>>>
>>> df['datetime'] = df.datetime.dt.strftime("%Y-%m-%d %H:%M:%S")
>>> df
datetime value_a
0 2019-08-19 00:00:00 194.32

Pandas Date Formatting (With Optional Milliseconds)

I'm getting data from an API and putting it into a Pandas DataFrame. The date column needs formatting into date/time, which I am doing. However the API sometimes returns dates without milliseconds which doesn't match the format pattern. This results in an error:
time data '2020-07-30T15:57:37Z' does not match format '%Y-%m-%dT%H:%M:%S.%fZ' (match)
In this example, how can I format the date column to date/time, so all dates are formatted with milliseconds?
import pandas as pd
dates = {
'date': ['2020-07-30T15:57:37Z', '2020-07-30T15:57:37.1Z']
}
df = pd.DataFrame(dates)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ')
print(df)
Parse the column once with a format that includes milliseconds and once with a format that does not. Use errors='coerce' so that values which don't match return NaT instead of raising a ValueError.
with_miliseconds = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ',errors='coerce')
without_miliseconds = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%SZ',errors='coerce')
the results would be something like this:
with milliseconds:
0 NaT
1 2020-07-30 15:57:37.100
Name: date, dtype: datetime64[ns]
without milliseconds:
0 2020-07-30 15:57:37
1 NaT
Name: date, dtype: datetime64[ns]
Then you can fill the NaT values of one Series with the values of the other, because they complement each other.
with_miliseconds.fillna(without_miliseconds)
0 2020-07-30 15:57:37.000
1 2020-07-30 15:57:37.100
Name: date, dtype: datetime64[ns]
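The two-pass idea generalizes to any number of candidate formats. A minimal sketch, assuming a hypothetical helper named parse_with_fallback (not part of pandas) that tries the formats in order and keeps the first successful parse for each row:

```python
import pandas as pd


def parse_with_fallback(series, formats):
    """Try each format in turn; keep the first successful parse per row.

    Hypothetical helper for illustration. Rows that match none of the
    formats stay NaT.
    """
    result = None
    for fmt in formats:
        # errors="coerce" turns non-matching rows into NaT instead of raising.
        parsed = pd.to_datetime(series, format=fmt, errors="coerce")
        # Only fill rows that are still NaT from the earlier formats.
        result = parsed if result is None else result.fillna(parsed)
    return result


dates = pd.Series(["2020-07-30T15:57:37Z", "2020-07-30T15:57:37.1Z"])
parsed_dates = parse_with_fallback(
    dates,
    ["%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"],
)
print(parsed_dates)
```

Because the Z is matched as a literal character here, the result is timezone-naive, the same as in the output above.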
To have a consistent format in your output DataFrame, you could run a regex replacement on all values without milliseconds before converting to a DataFrame.
import re
dates = {'date': [re.sub(r'Z', '.0Z', date) if '.' not in date else date for date in dates['date']]}
Since only the dates containing a . have milliseconds, we can run the replacement on the others.
After that, everything else is the same as in your code.
Output:
date
0 2020-07-30 15:57:37.000
1 2020-07-30 15:57:37.100
As your date strings look like standard ISO 8601, you can simply omit the format parameter. The parser will take into account that milliseconds are optional.
import pandas as pd
dates = {
'date': ['2020-07-30T15:57:37Z', '2020-07-30T15:57:37.1Z']
}
df = pd.DataFrame(dates)
df['date'] = pd.to_datetime(df['date'])
print(df)
date
0 2020-07-30 15:57:37+00:00
1 2020-07-30 15:57:37.100000+00:00

Dates go crazy when applying pd.to_datetime

I have this situation in which I have a DataFrame with a string column with some values with this format:
DD/MM/YYYY
and some with this other one:
DD/MM/YYYY HH:Mi:SS
If I try to convert everything to datetime like this
df['COLUMN'] = pd.to_datetime(df['COLUMN'])
The rows without the HH:Mi:SS go crazy and the months are interpreted as days (and vice versa).
How could I avoid this and have a column with just the date format?
Example of column which goes crazy:
Before conversion:
DateTime
--------
02/07/2021
15/07/2021 18:16:00
After conversion:
DateTime
2021-02-07 (This is February!!)
2021-07-15 18:16:00
pandas to_datetime has a built-in parameter, dayfirst, to specify that the day comes first.
You can use it as:
df['COLUMN'] = pd.to_datetime(df['COLUMN'], dayfirst=True)
Check out the documentation for more info.
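To see what dayfirst changes, here is a small sketch on a single ambiguous string taken from the question. The variable names are only illustrative:

```python
import pandas as pd

raw = "02/07/2021"  # meant as 2 July 2021 (DD/MM/YYYY)

# Default behaviour: an ambiguous string is read month-first.
month_first = pd.to_datetime(raw)

# dayfirst=True tells the parser to read the first number as the day.
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first.date())  # 2021-02-07
print(day_first.date())    # 2021-07-02
```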
I believe the following achieves the desired output (may not be the fastest way)
import pandas as pd
df = pd.DataFrame({'date': ['15/07/2021 18:16:00', '02/07/2021']})
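# Parse date-only rows first, then fill the gaps with the date-and-time rows.
date_only = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce')
date_time = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M:%S', errors='coerce')
df['date'] = date_only.fillna(date_time)
print(df.head())
for date in df['date']:
    print(type(date))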
Output:
date
0 2021-07-15 18:16:00
1 2021-07-02 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Convert multiple format date into only one

I have a date column in df which needs to align the format.
The column has multiple formats at the moment such as 20201203, 05/06/20, 2019-09-15 00:00:1568480400.
My expected results need to be in YYYY-MM-DD format. I tried pd.to_datetime(df, format='%Y-%m-%d') before, but received two errors: "05/06/20 doesn't match format specified" and "second must be in 0..59".
I assume some initial code is needed to pre-process the data; is that true? Or is there a proper function for this? Please help.
Thank you so much, everyone.
Use the to_datetime function without the format parameter and let pandas infer the datetime format (since pandas 2.0, a column that mixes formats also needs format='mixed'):
df = pd.DataFrame({"date": ["20201203", "05/06/20", "2019-09-15 00:00:00.1568480400"]})
df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
>>> df
date
0 2020-12-03
1 2020-05-06
2 2019-09-15
Check your input data:
2019-09-15 00:00:1568480400 is not valid:
ParserError: second must be in 0..59: 2019-09-15 00:00:1568480400
Input
from dateutil.parser import parse
df=pd.DataFrame({
'Date':['20201203', '05/06/20', '2019-09-15 00:00:1568480400']
})
Two options
First
df.Date = df.Date.str.split(' ', n=1).str[0]  # Keep only the date part, dropping the malformed time in '2019-09-15 00:00:1568480400'.
df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m-%d")
Second
# parse() raises on the malformed time above, so strip it first (as in the first option).
for i in range(len(df['Date'])):
    df.loc[i, 'Date'] = parse(df.loc[i, 'Date'])
df['Date'] = pd.to_datetime(df['Date']).dt.strftime("%Y-%m-%d")
df
Output
Date
0 2020-12-03
1 2020-05-06
2 2019-09-15
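If you would rather not rely on format inference, you can name the expected layouts explicitly. This is a minimal sketch using only the standard library parser; the helper to_iso_date and the KNOWN_FORMATS tuple are hypothetical, and it assumes these three layouts are the only ones present and that 05/06/20 is month/day/year (matching the output above):

```python
from datetime import datetime

import pandas as pd

# ASSUMPTION: these are the only layouts in the column,
# and 05/06/20 means month/day/two-digit year.
KNOWN_FORMATS = ("%Y%m%d", "%m/%d/%y", "%Y-%m-%d")


def to_iso_date(text):
    """Return YYYY-MM-DD for the date part of text, or None if no format fits."""
    # Drop anything after the first space (the possibly malformed time part).
    date_part = text.split(" ", 1)[0]
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(date_part, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next layout
    return None


df = pd.DataFrame(
    {"date": ["20201203", "05/06/20", "2019-09-15 00:00:1568480400"]}
)
df["date"] = df["date"].map(to_iso_date)
print(df)
```

Values that match none of the listed layouts come back as None, so they are easy to find and inspect afterwards.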

Combine date and time; ValueError: hour must be in 0..23

I have the following df:
date time
2018-01-01 00:00:00 7:30:33
2017-01-01 00:00:00 7:30:33
I want to create a datetime column that should look like this:
2018-01-01 7:30:33
2017-01-01 7:30:33
To do this I use the following code:
df["datetime"] = pd.to_datetime(df['date'].apply(str)+' '+df['time'])
It works the majority of the time. However, in some parts of my df (I don't know which parts), I get the following error:
ValueError: hour must be in 0..23
What am I doing wrong and how can I fix this?
Convert date to datetime, and time to timedelta, and just sum 'em up.
pd.to_datetime(df.date) + pd.to_timedelta(df.time)
0 2018-01-01 07:30:33
1 2017-01-01 07:30:33
dtype: datetime64[ns]
If you're worried about invalid values, add the errors='coerce' argument to both functions to handle them appropriately.
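A small sketch of that approach with errors='coerce' on both calls. The sample rows are hypothetical: one has an hour of 24, which is one plausible cause of the "hour must be in 0..23" error, and one is not a time at all:

```python
import pandas as pd

# Hypothetical rows: a normal time, an hour of 24, and an unparseable value.
df = pd.DataFrame(
    {
        "date": ["2018-01-01 00:00:00", "2017-01-01 00:00:00", "2017-01-02 00:00:00"],
        "time": ["7:30:33", "24:10:00", "not a time"],
    }
)

# A timedelta may exceed 23 hours, so "24:10:00" rolls over into the next day.
# Values that cannot be parsed become NaT instead of raising.
df["datetime"] = pd.to_datetime(df["date"], errors="coerce") + pd.to_timedelta(
    df["time"], errors="coerce"
)
print(df["datetime"])
```

Rows where either part failed to parse end up as NaT, so df[df["datetime"].isna()] shows which inputs need attention.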
