I have a date column in df whose values need to be converted to a single format.
The column currently contains multiple formats, such as 20201203, 05/06/20, and 2019-09-15 00:00:1568480400.
The expected result is YYYY-MM-DD. I tried pd.to_datetime(df, format='%Y-%m-%d'), but received two errors: "05/06/20 doesn't match format specified" and "second must be in 0..59".
I assume the data needs some pre-processing first. Is that right, or is there a function that handles this directly? Please help.
Thank you so much, everyone.
Use the to_datetime function without the format parameter and let pandas infer the datetime format:
df = pd.DataFrame({"date": ["20201203", "05/06/20", "2019-09-15 00:00:00.1568480400"]})
df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
>>> df
date
0 2020-12-03
1 2020-05-06
2 2019-09-15
Check your input data:
2019-09-15 00:00:1568480400 is not valid:
ParserError: second must be in 0..59: 2019-09-15 00:00:1568480400
Input
from dateutil.parser import parse
df=pd.DataFrame({
'Date':['20201203', '05/06/20', '2019-09-15 00:00:1568480400']
})
Two options
First
df.Date = df.Date.str.split(' ', n=1).str[0]  # Keep only the date part; this drops the malformed time in '2019-09-15 00:00:1568480400'.
df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m-%d")
Second
for i in range(len(df['Date'])):
    # Assign through .loc to avoid chained assignment
    df.loc[i, 'Date'] = parse(df['Date'][i])
df['Date'] = pd.to_datetime(df['Date']).dt.strftime("%Y-%m-%d")
df
Output
Date
0 2020-12-03
1 2020-05-06
2 2019-09-15
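If format inference is not reliable for your data or pandas version, one alternative is to try each known format explicitly and combine the results. This is a minimal sketch, not a definitive solution. It assumes that 05/06/20 means month/day/year (matching the output above) and that everything after the first space can be discarded. The names date_part, formats and candidates are illustrative.

```python
import pandas as pd

raw = pd.Series(["20201203", "05/06/20", "2019-09-15 00:00:1568480400"])

# Keep only the text before the first space (drops the malformed time part).
date_part = raw.str.split(" ", n=1).str[0]

# Assumed formats, tried in order; values that do not match become NaT.
formats = ["%Y%m%d", "%m/%d/%y", "%Y-%m-%d"]
candidates = [pd.to_datetime(date_part, format=fmt, errors="coerce") for fmt in formats]

# Fill the gaps of the first attempt with the results of the later ones.
parsed = candidates[0]
for candidate in candidates[1:]:
    parsed = parsed.fillna(candidate)

result = parsed.dt.strftime("%Y-%m-%d")
print(result)
```

Any value that matches none of the listed formats stays NaT, which makes unexpected inputs easy to spot with parsed.isna().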
I'm getting data from an API and putting it into a Pandas DataFrame. The date column needs formatting into date/time, which I am doing. However the API sometimes returns dates without milliseconds which doesn't match the format pattern. This results in an error:
time data '2020-07-30T15:57:37Z' does not match format '%Y-%m-%dT%H:%M:%S.%fZ' (match)
In this example, how can I format the date column to date/time, so all dates are formatted with milliseconds?
import pandas as pd
dates = {
'date': ['2020-07-30T15:57:37Z', '2020-07-30T15:57:37.1Z']
}
df = pd.DataFrame(dates)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ')
print(df)
Parse the column twice: once with a format that includes milliseconds and once with a format that does not. Use errors='coerce' so that values that don't match the format return NaT instead of raising a ValueError.
with_miliseconds = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ',errors='coerce')
without_miliseconds = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%SZ',errors='coerce')
The results look something like this:
with milliseconds:
0 NaT
1 2020-07-30 15:57:37.100
Name: date, dtype: datetime64[ns]
without milliseconds:
0 2020-07-30 15:57:37
1 NaT
Name: date, dtype: datetime64[ns]
Then you can fill the NaTs of one Series with the values of the other, because they complement each other.
with_miliseconds.fillna(without_miliseconds)
0 2020-07-30 15:57:37.000
1 2020-07-30 15:57:37.100
Name: date, dtype: datetime64[ns]
To have a consistent format in your output DataFrame, you could run a regex replacement on all values without milliseconds before building the DataFrame (this requires import re).
dates = {'date': [re.sub(r'Z', '.0Z', date) if '.' not in date else date for date in dates['date']]}
Since only the dates containing a . have milliseconds, we can run the replacement on the others.
After that, everything else is the same as in your code.
Output:
date
0 2020-07-30 15:57:37.000
1 2020-07-30 15:57:37.100
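The same idea can be sketched with pandas string methods after the DataFrame has been built. This is only an illustration under the assumption that every value ends in Z and that a . appears only in front of fractional seconds. The names has_fraction, normalized and formatted are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2020-07-30T15:57:37Z", "2020-07-30T15:57:37.1Z"]})

# True where the string already carries fractional seconds.
has_fraction = df["date"].str.contains(".", regex=False)

# Keep those values as they are; add ".0" before the Z in the others.
normalized = df["date"].where(
    has_fraction, df["date"].str.replace("Z", ".0Z", regex=False)
)

df["date"] = pd.to_datetime(normalized, format="%Y-%m-%dT%H:%M:%S.%fZ")

# Render every value with exactly three fractional digits (milliseconds).
formatted = df["date"].dt.strftime("%Y-%m-%d %H:%M:%S.%f").str[:-3]
print(formatted)
```

The strftime step is only needed if you want strings; the datetime column itself stores full precision regardless of how it is displayed.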
Since your date strings appear to be standard ISO 8601, you can simply omit the format parameter. The parser treats milliseconds as optional.
import pandas as pd
dates = {
'date': ['2020-07-30T15:57:37Z', '2020-07-30T15:57:37.1Z']
}
df = pd.DataFrame(dates)
df['date'] = pd.to_datetime(df['date'])
print(df)
date
0 2020-07-30 15:57:37+00:00
1 2020-07-30 15:57:37.100000+00:00
How can I remove T00:00:00+05:30 after the year, month and date values in pandas? I tried converting the column to datetime, but it still shows the same result. I'm using pandas in Streamlit. I tried the code below:
df['Date'] = pd.to_datetime(df['Date'])
The output is the same, as shown below:
Date
2019-07-01T00:00:00+05:30
2019-07-01T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-03T00:00:00+05:30
2019-07-03T00:00:00+05:30
2019-07-04T00:00:00+05:30
2019-07-04T00:00:00+05:30
2019-07-05T00:00:00+05:30
Can anyone help me remove T00:00:00+05:30 from the rows above?
If I understand correctly, you want to keep only the date part.
Convert date strings to datetime
df = pd.DataFrame(
    columns=['date'],
    data=["2019-07-01T02:00:00+05:30", "2019-07-02T01:00:00+05:30"]
)
date
0 2019-07-01T02:00:00+05:30
1 2019-07-02T01:00:00+05:30
df['date'] = pd.to_datetime(df['date'])
date
0 2019-07-01 02:00:00+05:30
1 2019-07-02 01:00:00+05:30
Remove the timezone
df['date'] = df['date'].dt.tz_localize(None)
date
0 2019-07-01 02:00:00
1 2019-07-02 01:00:00
Keep the date only
df['date'] = df['date'].dt.date
0 2019-07-01
1 2019-07-02
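The three steps above can be combined into one runnable sketch. The column names date_only and date_text are illustrative and not part of the original question.

```python
import pandas as pd

df = pd.DataFrame(
    {"date": ["2019-07-01T02:00:00+05:30", "2019-07-02T01:00:00+05:30"]}
)

# Step 1: parse the ISO 8601 strings (the result is timezone-aware).
parsed = pd.to_datetime(df["date"])

# Step 2: drop the timezone while keeping the local wall-clock time.
local_naive = parsed.dt.tz_localize(None)

# Step 3: keep only the date, either as datetime.date objects or as text.
df["date_only"] = local_naive.dt.date
df["date_text"] = local_naive.dt.strftime("%Y-%m-%d")
print(df)
```

Both new columns have object dtype. If you still need datetime operations afterwards, local_naive.dt.normalize() keeps a datetime dtype with the time set to midnight.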
Don't bother with using apply to build Python date objects or with string manipulation. The former leaves you with an object-dtype column and the latter is slow. Just round to the day frequency using the library function.
>>> pd.Series([pd.Timestamp('2000-01-05 12:01')]).dt.round('D')
0 2000-01-06
dtype: datetime64[ns]
If you have a timezone-aware timestamp, convert it to UTC with the time zone removed, then round:
>>> pd.Series([pd.Timestamp('2019-07-01T00:00:00+05:30')]).dt.tz_convert(None) \
.dt.round('D')
0 2019-07-01
dtype: datetime64[ns]
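Note that rounding moves any time after noon to the following day, as the first example shows. If the goal is to keep the calendar date of each timestamp, flooring or normalizing may be closer to what you want. A small comparison, using made-up sample values:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2000-01-05 12:01", "2000-01-05 11:59"]))

rounded = s.dt.round("D")     # nearest midnight: 12:01 goes to the next day
floored = s.dt.floor("D")     # previous midnight: the date never changes
normalized = s.dt.normalize() # sets the time to midnight, same as floor here

print(pd.DataFrame({"rounded": rounded, "floored": floored, "normalized": normalized}))
```

All three results keep a datetime64 dtype, so they stay fast and support further datetime operations.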
If you want datetime.date objects instead of strings, you can also use .apply with datetime.datetime.fromisoformat (the resulting column has object dtype):
import pandas as pd
import datetime
df = pd.DataFrame(
{"date": [
"2019-07-01T00:00:00+05:30",
"2019-07-01T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-03T00:00:00+05:30",
"2019-07-03T00:00:00+05:30",
"2019-07-04T00:00:00+05:30",
"2019-07-04T00:00:00+05:30",
"2019-07-05T00:00:00+05:30"]})
df["date"] = df["date"].apply(lambda x: datetime.datetime.fromisoformat(x).date())
print(df)
I have this situation in which I have a DataFrame with a string column with some values with this format:
DD/MM/YYYY
and some with this other one:
DD/MM/YYYY HH:Mi:SS
If I try to convert everything to datetime like this
df['COLUMN'] = pd.to_datetime(df['COLUMN'])
The rows without the HH:Mi:SS part are parsed incorrectly: the months are interpreted as days (and vice versa).
How could I avoid this and get a column in date-only format?
Example of column which goes crazy:
Before conversion:
DateTime
--------
02/07/2021
15/07/2021 18:16:00
After conversion:
DateTime
2021-02-07 (This is February!!)
2021-07-15 18:16:00
pandas to_datetime has a built-in parameter, dayfirst, to specify that the day comes first.
You can use it as:
df['COLUMN'] = pd.to_datetime(df['COLUMN'], dayfirst=True)
Check out the documentation for more info.
I believe the following achieves the desired output (it may not be the fastest way):
import pandas as pd
df = pd.DataFrame({'date': ['15/07/2021 18:16:00', '02/07/2021']})
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce').fillna(pd.to_datetime(df['date'], format="%d/%m/%Y %H:%M:%S", errors="coerce"))
print(df.head())
for date in df['date']:
    print(type(date))
Output:
date
0 2021-07-15 18:16:00
1 2021-07-02 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
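Since the question asks for a date-only column, another option is to cut off the time part before parsing and then use a single explicit format. This is a sketch under the assumption that the date and time are always separated by a space and that the date is always DD/MM/YYYY. The names date_token and DATE are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"COLUMN": ["02/07/2021", "15/07/2021 18:16:00"]})

# Keep only the text before the first space, i.e. the DD/MM/YYYY part.
date_token = df["COLUMN"].str.split(" ", n=1).str[0]

# An explicit format removes any day/month ambiguity.
df["DATE"] = pd.to_datetime(date_token, format="%d/%m/%Y")
print(df)
```

With an explicit format, a value that is not DD/MM/YYYY raises an error instead of being silently reinterpreted, which is usually the safer behaviour for this kind of data.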
I have a date column with dtype object, and the values are formatted like 31-Mar-20. I tried to convert it with datetime.strptime into datetime64[D] with the format 2020-03-31, but nothing I have tried works, including several methods from other posts. Some of them do turn my column into datetime64, but with a time component that I don't want. I need it to be a datetime without the time part, formatted as 2020-03-31. This is my code:
dates = [datetime.datetime.strptime(ts,'%d-%b-%y').strftime('%Y-%m-%d')
         for ts in df['date']]
df['date']= pd.DataFrame({'date': dates})
df = df.sort_values(by=['date'])
This approach might work:
import pandas as pd
df = pd.DataFrame({'dates': ['20-Mar-2020', '21-Mar-2020', '22-Mar-2020']})
df
dates
0 20-Mar-2020
1 21-Mar-2020
2 22-Mar-2020
df['dates'] = pd.to_datetime(df['dates'], format='%d-%b-%Y').dt.date
df
dates
0 2020-03-20
1 2020-03-21
2 2020-03-22
df['date'] = pd.to_datetime(df['date'], format="%d-%b-%y")
This converts it to a datetime. When you look at df, it displays the values as 2020-03-31, like you want. However, these are all datetime objects, so if you extract one value with df['date'][0] you see Timestamp('2020-03-31 00:00:00').
If you want to convert them into dates, you can do:
df['date'] = [df_datetime.date() for df_datetime in df['date'] ]
There is probably a better way of doing this step.
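One possibly simpler way to do that last step is the .dt.date accessor, which avoids the Python-level loop. The sketch below uses made-up sample values in the 31-Mar-20 format and assumes an English locale for the %b month abbreviations. The column name date_only is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"date": ["31-Mar-20", "01-Feb-20", "15-Jan-21"]})

# Parse the two-digit-year format; the column becomes datetime64.
df["date"] = pd.to_datetime(df["date"], format="%d-%b-%y")

# Sorting on a datetime column orders the rows chronologically.
df = df.sort_values(by="date").reset_index(drop=True)

# Optional: datetime.date objects without a time component (object dtype).
df["date_only"] = df["date"].dt.date
print(df)
```

Sorting before converting to date objects keeps the sort on the fast datetime64 column.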
I am working on a timeseries dataset which looks like this:
DateTime SomeVariable
0 01/01 01:00:00 0.24244
1 01/01 02:00:00 0.84141
2 01/01 03:00:00 0.14144
3 01/01 04:00:00 0.74443
4 01/01 05:00:00 0.99999
The date has no year. Initially, the dtype of DateTime is object, and I am trying to change it to the pandas datetime format. Since the date in my data has no year, when I use:
df['DateTime'] = pd.to_datetime(df.DateTime)
I am getting the error OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 01:00:00
I understand why I am getting the error (the data is not in a format pandas accepts), but what I want to know is how I can change the dtype from object to the pandas datetime format without having a year in my date. I would appreciate any hints.
EDIT 1:
I have learned that I can't do it without a year in the data, so this is how I am trying to change the dtype:
df = pd.read_csv(some file location)
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'], format='%y%d/%m %H:%M:%S')
df.head()
On doing that, I am getting:
ValueError: time data '2018/ 01/01 01:00:00' doesn't match format specified.
EDIT 2:
Changing the format to '%Y/%m/%d %H:%M:%S'.
My data is hourly data, so it goes till 24h. I have only provided the demo data till 5h.
I was getting the space on adding the year to the DateTime. In order to remove that, this is what I did:
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'][1:], format='%Y/%m/%d %H:%M:%S')
I am getting the following error for that:
ValueError: time data '2018/ 01/01 02:00:00' doesn't match format specified
On changing the format to '%y/%m/%d %H:%M:%S' with the same code, this is the error I get:
ValueError: time data '2018/ 01/01 02:00:00' does not match format '%y/%m/%d %H:%M:%S' (match)
The problem is because of the gap after the year but I am not able to get rid of it.
EDIT 3:
I am able to get rid of the space after adding the year, however I am still not able to change the dtype.
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'].str.strip(), format='%Y/%m/%d %H:%M:%S')
ValueError: time data '2018/01/01 01:00:00' doesn't match format specified
I noticed that there are 2 spaces between the date and the time in the error, however adding 2 spaces in the format doesn't help.
EDIT 4 (Solution):
I removed all the repeated whitespace, but the format still did not match. The problem was the time format: the hours in my data ran from 1 to 24, while pandas supports 0 to 23. I simply changed 24:00:00 to 00:00:00 and it works perfectly now.
This is not possible. A datetime object must have a year.
What you can do is ensure all years are aligned for your data.
For example, to convert to datetime while setting year to 2018:
df = pd.DataFrame({'DateTime': ['01/01 01:00:00', '01/01 02:00:00', '01/01 03:00:00',
'01/01 04:00:00', '01/01 05:00:00']})
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'], format='%Y/%m/%d %H:%M:%S')
print(df)
DateTime
0 2018-01-01 01:00:00
1 2018-01-01 02:00:00
2 2018-01-01 03:00:00
3 2018-01-01 04:00:00
4 2018-01-01 05:00:00
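For the 24:00:00 values mentioned in the question's final edit, the sketch below cleans the whitespace, replaces 24:00:00 with 00:00:00 and then shifts those rows forward by one day. The one-day shift is an assumption: it treats 24:00:00 as the end of the given day, which the question does not confirm. The sample values and the names cleaned and is_midnight are illustrative.

```python
import pandas as pd

raw = pd.Series(["01/01  23:00:00", "01/01  24:00:00", "01/02  01:00:00"])

# Collapse repeated whitespace so the strings match a single format.
cleaned = raw.str.strip().str.replace(r"\s+", " ", regex=True)

# Remember which rows use hour 24, then rewrite them as hour 00.
is_midnight = cleaned.str.endswith("24:00:00")
cleaned = cleaned.str.replace("24:00:00", "00:00:00", regex=False)

# Prepend the assumed year and parse as year/month/day.
parsed = pd.to_datetime("2018/" + cleaned, format="%Y/%m/%d %H:%M:%S")

# Assumption: 24:00:00 means midnight at the END of that day.
parsed = parsed + pd.to_timedelta(is_midnight.astype(int), unit="D")
print(parsed)
```

Without the shift, the 24:00:00 rows would land at the start of the same day and appear before the 01:00:00 reading when sorted.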
# Remove spaces. Keep in mind that this removes all spaces, including the one between date and time.
df['DateTime'] = df['DateTime'].str.replace(" ", "")
# I'm assuming year does not matter and that 01/01 is in the format day/month.
df['DateTime'] = pd.to_datetime(df['DateTime'], format='%d/%m%H:%M:%S')