I have a DataFrame that contains strings which should be converted to datetime in order to sort the DataFrame. The strings are received from syslogs and look like the ones below:
date
Mar 16 03:40:24.411
Mar 16 03:40:25.415
Mar 16 03:40:28.532
Mar 16 03:40:30.539
Mar 14 03:20:30.337
Mar 14 03:20:31.340
Mar 14 03:20:37.415
I tried to convert it with pandas.to_datetime(), but I received the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-03-16 03:40:24
I may need the nanoseconds as well.
It is necessary to specify the format of the string; see the strftime format reference. There is no year in the data, so the output year defaults to 1900:
df['date'] = pd.to_datetime(df['date'], format='%b %d %H:%M:%S.%f')
print (df)
date
0 1900-03-16 03:40:24.411
1 1900-03-16 03:40:25.415
2 1900-03-16 03:40:28.532
3 1900-03-16 03:40:30.539
4 1900-03-14 03:20:30.337
5 1900-03-14 03:20:31.340
6 1900-03-14 03:20:37.415
You can add some year to column and then parse it like:
df['date'] = pd.to_datetime('2020 ' + df['date'], format='%Y %b %d %H:%M:%S.%f')
print (df)
date
0 2020-03-16 03:40:24.411
1 2020-03-16 03:40:25.415
2 2020-03-16 03:40:28.532
3 2020-03-16 03:40:30.539
4 2020-03-14 03:20:30.337
5 2020-03-14 03:20:31.340
6 2020-03-14 03:20:37.415
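Since the original goal was to sort the DataFrame, once the column is a real datetime you can sort on it directly. A minimal sketch using the same year-prefix trick (the sample rows are a subset of the question's data):

```python
import pandas as pd

df = pd.DataFrame({'date': ['Mar 16 03:40:24.411', 'Mar 14 03:20:30.337',
                            'Mar 16 03:40:25.415']})
# Prepend a year so parsing does not default to 1900, then sort chronologically.
df['date'] = pd.to_datetime('2020 ' + df['date'], format='%Y %b %d %H:%M:%S.%f')
df = df.sort_values('date').reset_index(drop=True)
print(df)
```

The millisecond part parsed by %f is preserved, so rows one second apart sort correctly.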
The best way is to use pandas.to_datetime as mentioned above. If you are not familiar with date string formatting, you can get away with using a date-parser library, for example dateutil:
# python -m pip install --user python-dateutil
from dateutil import parser
import pandas as pd
df = pd.DataFrame({'dates': ['Mar 16 03:40:24.411', 'Mar 16 03:40:25.415', 'Mar 16 03:40:28.532']})
# parse it
df['dates'] = df['dates'].apply(parser.parse)
print(df)
The dateutil parser will add the current year to your dates.
Vectorizing
# using numpy.vectorize
import numpy as np
df['dates'] = np.vectorize(parser.parse)(df['dates'])
Note:
This is not optimal for large datasets and should be used only when pd.to_datetime is not able to parse the date.
Related
I have a Pandas dataframe df that looks as follows:
df = pd.DataFrame({'timestamp' : ['Wednesday, Apr 4/04/22 at 17:02',
'Saturday, Apr 4/23/22 at 15:45'],
'foo' : [1, 2]
})
df
timestamp foo
0 Wednesday, Apr 4/04/22 at 17:02 1
1 Saturday, Apr 4/23/22 at 15:45 2
I'm trying to convert the timestamp column to a datetime object so that I can add a day_of_week column.
My attempt:
df['timestamp'] = pd.to_datetime(df['timestamp'],
format='%A, %b %-m/%-d/%y at %H:%M')
df['day_of_week'] = df['timestamp'].dt.day_name()
The error is:
ValueError: '-' is a bad directive in format '%A, %b %-m/%-d/%y at %H:%M'
Any assistance would be greatly appreciated. Thanks!
Just use the format without the -:
df['timestamp'] = pd.to_datetime(df['timestamp'],
format='%A, %b %m/%d/%y at %H:%M')
df['day_of_week'] = df['timestamp'].dt.day_name()
NB. to_datetime is quite flexible on the provided data, note how the incorrect day of week was just ignored.
output:
timestamp foo day_of_week
0 2022-04-04 17:02:00 1 Monday
1 2022-04-23 15:45:00 2 Saturday
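If you want to flag rows where the spelled-out weekday does not match the parsed date (the "Wednesday" above actually parses to a Monday), one sketch is to compare the parsed day name back against the original string:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['Wednesday, Apr 4/04/22 at 17:02',
                                 'Saturday, Apr 4/23/22 at 15:45']})
parsed = pd.to_datetime(df['timestamp'], format='%A, %b %m/%d/%y at %H:%M')
# A row is consistent only if the original string starts with the parsed day name.
consistent = [s.startswith(d) for s, d in zip(df['timestamp'], parsed.dt.day_name())]
print(consistent)  # the first row claims Wednesday but the date is a Monday
```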
I have a column called "date" which is an object, and it has many different date formats like dd.m.yy, dd.mm.yyyy, dd/mm/yyyy, dd/mm, m/d/yyyy, etc., as below. Obviously, simply using df['date'] = pd.to_datetime(df['date']) will not work. I wonder, for messy date values like that, is there any way to standardize and convert the dates into one single format?
date
17.2.22 # means Feb 17 2022
23.02.22 # means Feb 23 2022
17/02/2022 # means Feb 17 2022
18.2.22 # means Feb 18 2022
2/22/2022 # means Feb 22 2022
3/1/2022 # means March 1 2022
<more messy different format>
Coerce the dates to datetime and allow invalid entries to be turned into nulls. Also, allow pandas to infer the format. Code below:
df['date'] = pd.to_datetime(df['date'], errors='coerce',infer_datetime_format=True)
date
0 2022-02-17
1 2022-02-23
2 2022-02-17
3 2022-02-18
4 2022-02-22
5 2022-03-01
Based on wwnde's solution, the following works in my real dataset -
df['date'].fillna('',inplace=True)
df['date'] = df['date'].astype('str')
df['date new'] = df['date'].str.replace('.', '/', regex=False)  # '.' is a regex metacharacter, so disable regex
df['date new'] = pd.to_datetime(df['date new'],
errors='coerce',infer_datetime_format=True)
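Note that infer_datetime_format is deprecated in pandas 2.x (per-element inference became the default there). An alternative that avoids guessing entirely is to try each known format explicitly and combine the results. A sketch, assuming the formats listed in the question; the order of the slash formats (US style before day-first) is an assumption taken from the question's "3/1/2022 means March 1" example:

```python
import pandas as pd

dates = pd.Series(['17.2.22', '23.02.22', '17/02/2022', '2/22/2022', '3/1/2022'])
# Candidate formats, tried in order; earlier formats take precedence.
formats = ['%d.%m.%y', '%m/%d/%Y', '%d/%m/%Y']
result = pd.Series(pd.NaT, index=dates.index)
for fmt in formats:
    parsed = pd.to_datetime(dates, format=fmt, errors='coerce')
    result = result.fillna(parsed)  # keep the first format that matched each row
print(result)
```

Ambiguous slash dates (e.g. 5/6/2022) get the US interpretation under this ordering, so adjust the precedence to whatever your data actually means.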
I am trying to convert a column with a real mix of date formats. I have tried a few things on SO but still have not got a working solution. I have tried changing the column to 'string', and also tried converting the floats to int.
data
date
1 43076.0
2 43077
3 07 Dec 2017
4 2021-12-22 00:00:00
code to try and fix the Excel serial dates and the '07 Dec 2017' style:
d = ['43076.0', '43077', '07 Dec 2017', '2021-12-22 00:00:00']
df = pd.DataFrame(d, columns=['date'])
date1 = pd.to_datetime(df['date'], errors='coerce', format='%d %b %Y')
date2 = pd.to_datetime(df['date'], errors='coerce', unit='D', origin='1899-12-30')
frame_clean[col] = date2.fillna(date1)
error
Name: StartDate, Length: 16189, dtype: object' is not compatible with origin='1899-12-30'; it must be numeric with a unit specified
I like this solution rather than using apply, as apply is too slow. But I am struggling to get it working.
Edit
Breaking down @FObersteiner's solution for better understanding.
convert the simple dates
df['datetime'] = pd.to_datetime(df['date'], errors='coerce')
0 NaT
1 NaT
2 2017-12-07
3 2021-12-22
isolate the numeric rows
m = pd.to_numeric(df['date'], errors='coerce').notna()
m
0 True
1 True
2 False
3 False
convert numeric rows to floats
df['date'][m].astype(float)
0 43076.0
1 43077.0
convert numeric rows to floats and then dt objects
pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')
0 2017-12-07
1 2017-12-08
pull it all together and bring back the simple date rows
df.loc[m, 'datetime'] = pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')
print(df)
For given example, use a mask to convert numeric and non-numeric data separately:
import pandas as pd
df = pd.DataFrame({'date':['43076.0', '43077', '07 Dec 2017', '2021-12-22 00:00:00']})
df['datetime'] = pd.to_datetime(df['date'], errors='coerce')
m = pd.to_numeric(df['date'], errors='coerce').notna()
df.loc[m, 'datetime'] = pd.to_datetime(df['date'][m].astype(float), errors='coerce', unit='D', origin='1899-12-30')
print(df)
date datetime
0 43076.0 2017-12-07
1 43077 2017-12-08
2 07 Dec 2017 2017-12-07
3 2021-12-22 00:00:00 2021-12-22
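The 1899-12-30 origin looks odd at first (Excel's epoch is nominally 1900-01-01), but it compensates for Excel counting from day 1 rather than day 0 and for Excel's fictitious 1900-02-29 leap day. A quick sanity check of the arithmetic against the output above:

```python
import pandas as pd

# Excel serial 43076 should land on 2017-12-07 with the 1899-12-30 origin.
ts = pd.to_datetime(43076, unit='D', origin='1899-12-30')
print(ts)  # 2017-12-07 00:00:00
```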
I am fetching data from one of the file which has date stored as
20 March
Using pandas I want to convert to 20/03/2020
I tried using strftime and to_datetime with errors, but I am still not able to convert it.
Moreover, when I group by date, the date column is sorted as text, like: 1 January, 1 February, 1 March, then 2 January, 2 February, 2 March.
How do I resolve this?
import pandas as pd

def to_datetime_(dt):
    return pd.to_datetime(dt + " 2020")

to always get a pandas timestamp with year 2020.
If year is always 2020 then use the following code:
df = pd.DataFrame({'date':['20 March','22 March']})
df['date_new'] = pd.to_datetime(df['date'], format='%d %B')
If this shows year as 1900 then:
df['date_new'] = df['date_new'].mask(df['date_new'].dt.year == 1900, df['date_new'] + pd.offsets.DateOffset(year = 2020))
print(df)
date date_new
0 20 March 2020-03-20
1 22 March 2020-03-22
Further you can convert the date format as required.
Do,
import pandas as pd
import datetime

df = pd.DataFrame({
    'dates': ['1 January', '2 January', '10 March', '1 April']
})
df['dates'] = df['dates'].map(lambda x: datetime.datetime.strptime(x, "%d %B").replace(year=2020))
# Output
dates
0 2020-01-01
1 2020-01-02
2 2020-03-10
3 2020-04-01
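One caveat with the strptime approach above: a missing year defaults to 1900, which is not a leap year, so "29 February" raises ValueError before replace(year=2020) ever runs. A sketch that avoids this by including the target year in the string being parsed (the helper name is illustrative):

```python
import datetime

def parse_day_month(s, year=2020):
    # strptime defaults missing years to 1900, a non-leap year, so
    # "29 February" would raise; parse with the target year included instead.
    return datetime.datetime.strptime(f"{s} {year}", "%d %B %Y")

print(parse_day_month("29 February"))  # 2020-02-29 00:00:00
```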
My data has date variable with two different date formats
Date
01 Jan 2019
02 Feb 2019
01-12-2019
23-01-2019
11-04-2019
22-05-2019
I want to convert this string into date(YYYY-mm-dd)
Date
2019-01-01
2019-02-01
2019-12-01
2019-01-23
2019-04-11
2019-05-22
I have tried following things, but I am looking for better approach
df['Date'] = np.where(df['Date'].str.contains('-'), pd.to_datetime(df['Date'], format='%d-%m-%Y'), pd.to_datetime(df['Date'], format='%d %b %Y'))
Working solution for me
df['Date_1']= np.where(df['Date'].str.contains('-'),df['Date'],np.nan)
df['Date_2']= np.where(df['Date'].str.contains('-'),np.nan,df['Date'])
df['Date_new'] = np.where(df['Date'].str.contains('-'),pd.to_datetime(df['Date_1'], format = '%d-%m-%Y'),pd.to_datetime(df['Date_2'], format = '%d %b %Y'))
Just use the option dayfirst=True
pd.to_datetime(df.Date, dayfirst=True)
Out[353]:
0 2019-01-01
1 2019-02-02
2 2019-12-01
3 2019-01-23
4 2019-04-11
5 2019-05-22
Name: Date, dtype: datetime64[ns]
My suggestion:
Define a conversion function as follows:
import datetime as dt
def conv_date(x):
try:
res = pd.to_datetime(dt.datetime.strptime(x, "%d %b %Y"))
except ValueError:
res = pd.to_datetime(dt.datetime.strptime(x, "%d-%m-%Y"))
return res
Now get the new date column as folows:
df['Date_new'] = df['Date'].apply(lambda x: conv_date(x))
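The try/except pattern generalizes naturally to any number of candidate formats; a sketch (the function name and format list are illustrative):

```python
import pandas as pd

def conv_date_multi(x, formats=('%d %b %Y', '%d-%m-%Y')):
    # Try each candidate format in order; raise if none matches.
    for fmt in formats:
        try:
            return pd.to_datetime(x, format=fmt)
        except ValueError:
            continue
    raise ValueError(f'no known format matches {x!r}')

print(conv_date_multi('01 Jan 2019'))  # 2019-01-01 00:00:00
print(conv_date_multi('23-01-2019'))  # 2019-01-23 00:00:00
```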
You can get your desired result with the help of the apply and to_datetime methods of pandas, as given below:
import pandas as pd
def change(value):
return pd.to_datetime(value)
df = pd.DataFrame(data = {'date':['01 jan 2019']})
df['date'] = df['date'].apply(change)
df
I hope it may help you.
This works simply as expected -
import pandas as pd
a = pd.DataFrame({
'Date' : ['01 Jan 2019',
'02 Feb 2019',
'01-12-2019',
'23-01-2019',
'11-04-2019',
'22-05-2019']
})
a['Date'] = a['Date'].apply(lambda date: pd.to_datetime(date, dayfirst=True))
print(a)