Date column comes in with different formats when loading a CSV with pandas

I'm having trouble loading dates from a CSV file. The dates are essentially correct, but they seem to flip between YYYY-DD-MM and YYYY-MM-DD, apparently depending on whether the day within the date is below the 10th or not. When I look at the CSV in Excel, however, the dates all have the same format, which weirdly is 01/04/19 (DD/MM/YY), so completely different from what pandas loads.
Here is how the Date column is coming in:
2019-01-04
2019-08-04
2019-04-15
2019-04-22
2019-04-29
2019-06-05
2019-05-13
2019-05-20
2019-05-27
2019-03-06
2019-10-06
2019-06-17
2019-06-24
2019-01-07
2019-08-07
I've tried parsing the date when loading the csv at the beginning and tried things like df['Date'] = pd.to_datetime(df['Date']) but nothing appears to work. Has anyone seen anything like this before?

You can pass an explicit format to pd.to_datetime, e.g. df['Date'] = pd.to_datetime(df['DateS'], format='%Y-%m-%d'),
as follows:
df = pd.DataFrame({
'DateS' : ['2019-01-04', '2019-08-04']
})
df['Date'] = pd.to_datetime(df['DateS'], format='%Y-%m-%d')
df
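Since Excel shows the values as DD/MM/YY, the flip most likely comes from pandas guessing month-first whenever both readings are valid dates. A minimal sketch of parsing the raw strings with an explicit day-first format instead. The inline CSV text and the Value column are made up for illustration, and it assumes the file really stores the dates as DD/MM/YY text:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: dates stored as DD/MM/YY text.
csv_text = "Date,Value\n01/04/19,1\n08/04/19,2\n15/04/19,3\n"

# Load the column as plain strings first, without letting pandas guess.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Date": str})

# An explicit format removes the day/month ambiguity for every row.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

print(df)
```

If the format is less regular, read_csv also accepts parse_dates=['Date'] together with dayfirst=True, which asks the parser to prefer the day-first reading.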

Related

How to convert a column of datetime with different format to a specific one?

Hi I am not an expert in python and I am still a beginner in using pandas and working with data.
I have a df with a column timestamp. The datetime in the column are as shown below:
2021-09-07 16:36:14 UTC
2021-09-04 15:31:44 UTC
2021-07-15 06:49:47.320081 UTC
2021-09-07 14:55:55.353145 UTC
I would like to keep only the date and time, without the UTC text at the end and without the decimals after the seconds, and then save the dataframe to a csv file. Basically I want the column in this format:
2021-09-07 16:36:14
2021-09-04 15:31:44
2021-07-15 06:49:47
2021-09-07 14:55:55
I tried with these two functions:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S %Z', errors='coerce')
df['timestamp'] = df['timestamp'].dt.strftime('%Y-%m-%d %H:%M:%S')
This fixes half of the problem. The datetimes without decimals after the seconds are converted, but the ones with decimals come out empty, as in the example below:
2021-09-07 16:36:14
2021-09-04 15:31:44
Please can someone help me with this problem?
Try extracting the part of the field you want.
df['timestamp'] = pd.to_datetime(df['timestamp'].str[:19])
print(df)
print(df.dtypes)
timestamp
0 2021-09-07 16:36:14
1 2021-09-04 15:31:44
2 2021-07-15 06:49:47
3 2021-09-07 14:55:55
timestamp datetime64[ns]
dtype: object
You can take the first 19 characters:
df['timestamp'] = pd.to_datetime(df['timestamp'].str[:19])
print(df)
# Output
timestamp
0 2021-09-07 16:36:14
1 2021-09-04 15:31:44
2 2021-07-15 06:49:47
3 2021-09-07 14:55:55
If you want to keep the timezone information (UTC), you can remove only the microsecond part:
df['timestamp'] = pd.to_datetime(df['timestamp'].str.replace(r'\.\d+', '', regex=True))
print(df)
# Output
timestamp
0 2021-09-07 16:36:14+00:00
1 2021-09-04 15:31:44+00:00
2 2021-07-15 06:49:47+00:00
3 2021-09-07 14:55:55+00:00
Try dateutil's parser, as it can take many different formats as input:
from dateutil import parser
# For a whole column: df['timestamp'] = df['timestamp'].apply(parser.parse)
date = parser.parse("2021-07-15 06:49:47.320081 UTC")
print(date)
2021-07-15 06:49:47.320081+00:00
Or, to get the output without the decimals and the timezone:
# For a whole column:
# df['timestamp'] = df['timestamp'].apply(lambda s: parser.parse(s).strftime("%Y-%m-%d %H:%M:%S"))
print(date.strftime("%Y-%m-%d %H:%M:%S"))
2021-07-15 06:49:47
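Putting the pieces together, here is a minimal sketch that strips both the optional decimals and the UTC suffix with one regex, parses with an explicit format, and writes the result as CSV. The sample frame mirrors the question; the in-memory buffer is only there to keep the example self-contained and would be a file path in practice:

```python
import io

import pandas as pd

df = pd.DataFrame({"timestamp": [
    "2021-09-07 16:36:14 UTC",
    "2021-09-04 15:31:44 UTC",
    "2021-07-15 06:49:47.320081 UTC",
    "2021-09-07 14:55:55.353145 UTC",
]})

# Drop an optional ".<digits>" fraction plus the trailing " UTC" text.
cleaned = df["timestamp"].str.replace(r"(\.\d+)? UTC$", "", regex=True)

# Every string now has the same layout, so one explicit format fits all rows.
df["timestamp"] = pd.to_datetime(cleaned, format="%Y-%m-%d %H:%M:%S")

# Write to an in-memory buffer here; pass a file path to save to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that this truncates the fractional seconds rather than rounding them, which matches the output asked for in the question.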

why to_datetime() doesn't work when converting string to datetime pandas [duplicate]

I am trying to convert my column in a df into a time series. The dataset goes from March 23rd 2015-August 17th 2019 and the dataset looks like this:
time 1day_active_users
0 2015-03-23 00:00:00-04:00 19687.0
1 2015-03-24 00:00:00-04:00 19437.0
I am trying to convert the time column into a datetime series but it returns the column as an object. Here is the code:
data = pd.read_csv(data_path)
data.set_index('time', inplace=True)
data.index= pd.to_datetime(data.index)
data.index.dtype
data.index.dtype returns dtype('O'). I assume this is why when I try to index an element in time, it returns an error. For example, when I run this:
data.loc['2015']
It gives me this error
KeyError: '2015'
Any help or feedback would be appreciated. Thank you.
As commented, the problem might be due to the column containing different UTC offsets. Try passing utc=True to pd.to_datetime:
df['time'] = pd.to_datetime(df['time'],utc=True)
df['time']
Test Data
time 1day_active_users
0 2015-03-23 00:00:00-04:00 19687.0
1 2015-03-24 00:00:00-05:00 19437.0
Output:
0 2015-03-23 04:00:00+00:00
1 2015-03-24 05:00:00+00:00
Name: time, dtype: datetime64[ns, UTC]
And then:
df.set_index('time', inplace=True)
df.loc['2015']
gives
1day_active_users
time
2015-03-23 04:00:00+00:00 19687.0
2015-03-24 05:00:00+00:00 19437.0
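The same steps as one self-contained sketch, using the two test rows from this answer (the mixed -04:00 and -05:00 offsets are the assumed cause of the object dtype):

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2015-03-23 00:00:00-04:00", "2015-03-24 00:00:00-05:00"],
    "1day_active_users": [19687.0, 19437.0],
})

# utc=True converts every offset to UTC, so the column gets a single
# timezone-aware datetime dtype instead of falling back to object.
df["time"] = pd.to_datetime(df["time"], utc=True)
df = df.set_index("time")

# Partial string indexing works once the index is a DatetimeIndex.
subset = df.loc["2015"]
print(subset)
```

If local wall-clock times are needed afterwards, the index can be converted back from UTC with tz_convert.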

Dates go crazy when applying pd.to_datetime

I have this situation in which I have a DataFrame with a string column with some values with this format:
DD/MM/YYYY
and some with this other one:
DD/MM/YYYY HH:Mi:SS
If I try to convert everything to datetime like this
df['COLUMN'] = pd.to_datetime(df['COLUMN'])
The rows without the HH:Mi:SS go crazy and the months are interpreted as days (and vice versa).
How can I avoid this and end up with a column that holds just dates?
Example of column which goes crazy:
Before conversion:
DateTime
--------
02/07/2021
15/07/2021 18:16:00
After conversion:
DateTime
2021-02-07 (This is February!!)
2021-07-15 18:16:00
Pandas to_datetime has a built-in parameter to specify that the day comes first, i.e. dayfirst.
You can use it as:
df['COLUMN'] = pd.to_datetime(df['COLUMN'], dayfirst=True)
Check out the documentation for more info.
I believe the following achieves the desired output (may not be the fastest way)
import pandas as pd
df = pd.DataFrame({'date': ['15/07/2021 18:16:00', '02/07/2021']})
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce').fillna(pd.to_datetime(df['date'], format="%d/%m/%Y %H:%M:%S", errors="coerce"))
print(df.head())
for date in df['date']:
    print(type(date))
Output:
date
0 2021-07-15 18:16:00
1 2021-07-02 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
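Another option is to make the strings uniform before parsing, so that a single explicit format covers the whole column. This is only a sketch under the assumption that the column holds exactly the two layouts from the question; the helper column date_only is a made-up name:

```python
import pandas as pd

df = pd.DataFrame({"COLUMN": ["02/07/2021", "15/07/2021 18:16:00"]})

raw = df["COLUMN"].str.strip()

# Rows without a time part get midnight appended; rows with one are kept.
padded = raw.where(raw.str.contains(":"), raw + " 00:00:00")

# One explicit day-first format now matches every row.
df["COLUMN"] = pd.to_datetime(padded, format="%d/%m/%Y %H:%M:%S")

# Drop the time of day if only the date matters (dtype stays datetime64).
df["date_only"] = df["COLUMN"].dt.normalize()

print(df)
```

Because nothing is left for pandas to guess, the result should not depend on how a given pandas version infers formats for mixed columns.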

Convert multiple format date into only one

I have a date column in df which needs to align the format.
The column has multiple formats at the moment such as 20201203, 05/06/20, 2019-09-15 00:00:1568480400.
My expected results need to be in YYYY-MM-DD format. I tried pd.to_datetime(df, format='%Y-%m-%d') before, but then received two errors: "05/06/20 doesn't match format specified" and "second must be in 0..59".
I assume some initial code is needed to pre-process the data. Is that right, or is there a proper function for this? Please help.
Thank you so much, everyone.
Use to_datetime function without format parameter. Let Pandas infer the datetime format:
df = pd.DataFrame({"date": ["20201203", "05/06/20", "2019-09-15 00:00:00.1568480400"]})
df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
>>> df
date
0 2020-12-03
1 2020-05-06
2 2019-09-15
Check your input data:
2019-09-15 00:00:1568480400 is not valid:
ParserError: second must be in 0..59: 2019-09-15 00:00:1568480400
Input
from dateutil.parser import parse
df=pd.DataFrame({
'Date':['20201203', '05/06/20', '2019-09-15 00:00:1568480400']
})
Two options
First
df['Date'] = df['Date'].str.split(' ', n=1).str[0]  # keep only the date part, in case values like '2019-09-15 00:00:1568480400' really occur in your df
df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m-%d")
Second
# Parse each value with dateutil; parse() raises on invalid values such as '2019-09-15 00:00:1568480400'.
df['Date'] = pd.to_datetime(df['Date'].apply(parse)).dt.strftime("%Y-%m-%d")
df
Output
Date
0 2020-12-03
1 2020-05-06
2 2019-09-15
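If you would rather not rely on inference, a hedged alternative is to try a short list of explicit formats in order and keep the first one that parses each value. The helper parse_with_formats is a made-up name, and reading 05/06/20 as month-first is an assumption chosen to match the output above:

```python
import pandas as pd


def parse_with_formats(values, formats):
    """Try each format in turn; later formats only fill rows still unparsed."""
    result = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        result = result.fillna(pd.to_datetime(values, format=fmt, errors="coerce"))
    return result


df = pd.DataFrame({
    "Date": ["20201203", "05/06/20", "2019-09-15 00:00:1568480400"]
})

# Keep only the date part so the invalid time portion cannot break parsing.
date_part = df["Date"].str.split(" ", n=1).str[0]

# Assumed formats: compact YYYYMMDD, month-first MM/DD/YY, and ISO YYYY-MM-DD.
parsed = parse_with_formats(date_part, ["%Y%m%d", "%m/%d/%y", "%Y-%m-%d"])
df["Date"] = parsed.dt.strftime("%Y-%m-%d")

print(df)
```

Values that match none of the listed formats end up as NaT, which makes unexpected inputs easy to spot.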

Cannot remove timestamp in datetime

I have a date column with dtype: object and values in the format 31-Mar-20. I tried to convert it with datetime.strptime into datetime64[D] with the format 2020-03-31, but whatever I try does not work. I have tried some methods from this and this. They do turn my column into datetime64, but with a timestamp in it, which I don't want. I need it to be a datetime without the timestamp, in the format 2020-03-31. This is my code:
dates = [datetime.datetime.strptime(ts, '%d-%b-%y').strftime('%Y-%m-%d')
         for ts in df['date']]
df['date']= pd.DataFrame({'date': dates})
df = df.sort_values(by=['date'])
This approach might work -
import pandas as pd
df = pd.DataFrame({'dates': ['20-Mar-2020', '21-Mar-2020', '22-Mar-2020']})
df
dates
0 20-Mar-2020
1 21-Mar-2020
2 22-Mar-2020
df['dates'] = pd.to_datetime(df['dates'], format='%d-%b-%Y').dt.date
df
dates
0 2020-03-20
1 2020-03-21
2 2020-03-22
df['date'] = pd.to_datetime(df['date'], format="%d-%b-%y")
This converts the column to datetime. When you look at df, the values display as 2020-03-31 like you want. However, they are all datetime objects, so if you extract a single value with df['date'][0] you see Timestamp('2020-03-31 00:00:00').
If you want to convert them into dates, you can do
df['date'] = [df_datetime.date() for df_datetime in df['date']]
The vectorized equivalent of this step is df['date'] = df['date'].dt.date.
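To make the trade-off concrete, here is a small sketch with made-up sample values in the question's 31-Mar-20 layout. It keeps a real datetime column for sorting and derives the display forms from it; date_str and date_obj are illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({"date": ["31-Mar-20", "01-Feb-20", "15-Jan-20"]})

# %y is the two-digit year, %b the abbreviated month name.
df["date"] = pd.to_datetime(df["date"], format="%d-%b-%y")

# Sorting on the datetime column orders rows chronologically.
df = df.sort_values(by="date")

# Plain strings in YYYY-MM-DD form, e.g. for writing to a file.
df["date_str"] = df["date"].dt.strftime("%Y-%m-%d")

# Python datetime.date objects; this column has dtype object.
df["date_obj"] = df["date"].dt.date

print(df)
print(df.dtypes)
```

Both derived columns give up the datetime64 dtype, so the conversion is usually best done as the last step before display or export.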
