Converting Python dataframe column to date format - python

I have a column in a dataframe that I want to convert to a date. The values of the column are either DDMONYYYY or DD Month YYYY 00:00:00.000 GMT. For example, one row in the dataframe could have the value 31DEC2002 and the next row could have 31 December 2015 00:00:00.000 GMT. I think this is why I get an error when trying to convert the column to a date using pd.to_datetime or datetime.strptime.
Anyone got any ideas? I'd be very grateful for any help/pointers.

For me, to_datetime works with utc=True, which converts all values to UTC, and errors='coerce', which converts unparseable values to NaT (missing datetime):
import pandas as pd
df = pd.DataFrame({'date':['31DEC2002','31 December 2015 00:00:00.000 GMT','.']})
df['date'] = pd.to_datetime(df['date'], utc=True, errors='coerce')
print(df)
date
0 2002-12-31 00:00:00+00:00
1 2015-12-31 00:00:00+00:00
2 NaT
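If automatic parsing does not handle both layouts (pandas 2.0 and later infer a single format from the first value, so rows in the other layout can come back as NaT), each layout can be parsed with its own explicit format and the results merged. This is a minimal sketch, assuming the two layouts from the question are the only ones present; the format strings and the names compact and verbose are my own choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'date': ['31DEC2002',
                            '31 December 2015 00:00:00.000 GMT',
                            '.']})

# Pass 1: compact layout such as 31DEC2002; everything else becomes NaT.
compact = pd.to_datetime(df['date'], format='%d%b%Y',
                         errors='coerce', utc=True)

# Pass 2: long layout; the trailing " GMT" is removed before parsing,
# and utc=True marks the result as UTC.
without_suffix = df['date'].str.replace(' GMT', '', regex=False)
verbose = pd.to_datetime(without_suffix, format='%d %B %Y %H:%M:%S.%f',
                         errors='coerce', utc=True)

# Use the first pass where it succeeded, otherwise the second pass.
df['date'] = compact.fillna(verbose)
print(df)
```

Rows that match neither layout, such as '.', stay NaT, so they are easy to find afterwards with df['date'].isna().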

Related

why to_datetime() doesn't work when converting string to datetime pandas [duplicate]

I am trying to convert my column in a df into a time series. The dataset goes from March 23rd 2015-August 17th 2019 and the dataset looks like this:
time 1day_active_users
0 2015-03-23 00:00:00-04:00 19687.0
1 2015-03-24 00:00:00-04:00 19437.0
I am trying to convert the time column into a datetime series but it returns the column as an object. Here is the code:
data = pd.read_csv(data_path)
data.set_index('time', inplace=True)
data.index= pd.to_datetime(data.index)
data.index.dtype
data.index.dtype returns dtype('O'). I assume this is why when I try to index an element in time, it returns an error. For example, when I run this:
data.loc['2015']
It gives me this error
KeyError: '2015'
Any help or feedback would be appreciated. Thank you.
As commented, the problem might be due to the different timezones. Try passing utc=True to pd.to_datetime:
df['time'] = pd.to_datetime(df['time'],utc=True)
df['time']
Test Data
time 1day_active_users
0 2015-03-23 00:00:00-04:00 19687.0
1 2015-03-24 00:00:00-05:00 19437.0
Output:
0 2015-03-23 04:00:00+00:00
1 2015-03-24 05:00:00+00:00
Name: time, dtype: datetime64[ns, UTC]
And then:
df.set_index('time', inplace=True)
df.loc['2015']
gives
1day_active_users
time
2015-03-23 04:00:00+00:00 19687.0
2015-03-24 05:00:00+00:00 19437.0
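Once the column is in UTC, it can be shifted back to a local zone with tz_convert before indexing by year. This is a minimal sketch, assuming the data was recorded in US Eastern time; the zone name 'America/New_York' is my assumption, since the question does not state it.

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['2015-03-23 00:00:00-04:00', '2015-03-24 00:00:00-05:00'],
    '1day_active_users': [19687.0, 19437.0],
})

# Mixed offsets are normalised to UTC so the column gets a datetime dtype.
df['time'] = pd.to_datetime(df['time'], utc=True)

# Assumption: the timestamps belong to the US Eastern zone.
df['time'] = df['time'].dt.tz_convert('America/New_York')

# With a DatetimeIndex, partial string indexing by year works.
df = df.set_index('time')
print(df.loc['2015'])
```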

How to remove hours, minutes, seconds and UTC offset from pandas date column? I'm running with streamlit and pandas

How to remove T00:00:00+05:30 after year, month and date values in pandas? I tried converting the column into datetime but also it's showing the same results, I'm using pandas in streamlit. I tried the below code
df['Date'] = pd.to_datetime(df['Date'])
The output is same as below :
Date
2019-07-01T00:00:00+05:30
2019-07-01T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-02T00:00:00+05:30
2019-07-03T00:00:00+05:30
2019-07-03T00:00:00+05:30
2019-07-04T00:00:00+05:30
2019-07-04T00:00:00+05:30
2019-07-05T00:00:00+05:30
Can anyone help me how to remove T00:00:00+05:30 from the above rows?
If I understand correctly, you want to keep only the date part.
Convert date strings to datetime
df = pd.DataFrame(
    columns=['date'],
    data=["2019-07-01T02:00:00+05:30", "2019-07-02T01:00:00+05:30"]
)
date
0 2019-07-01T02:00:00+05:30
1 2019-07-02T01:00:00+05:30
df['date'] = pd.to_datetime(df['date'])
date
0 2019-07-01 02:00:00+05:30
1 2019-07-02 01:00:00+05:30
Remove the timezone
df['date'] = df['date'].dt.tz_localize(None)
date
0 2019-07-01 02:00:00
1 2019-07-02 01:00:00
Keep the date only
df['date'] = df['date'].dt.date
0 2019-07-01
1 2019-07-02
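Don't bother with apply to Python dates or string changes. The former will leave you with an object type column and the latter is slow. Just round to the day frequency using the library function. Note that round goes to the nearest day, so a time after noon lands on the following day, as in the example below; use dt.floor('D') instead if you want to drop the time without changing the date.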
>>> pd.Series([pd.Timestamp('2000-01-05 12:01')]).dt.round('D')
0 2000-01-06
dtype: datetime64[ns]
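If you have a timezone aware timestamp, convert to UTC with no time zone then round. Be aware that the conversion to UTC can change the calendar date: below, 00:00 at +05:30 becomes 18:30 on the previous day in UTC, and it only returns to 2019-07-01 because round goes to the nearest day: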
>>> pd.Series([pd.Timestamp('2019-07-01T00:00:00+05:30')]).dt.tz_convert(None) \
.dt.round('D')
0 2019-07-01
dtype: datetime64[ns]
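To keep the local calendar date and still end up with a datetime column, one option is to drop the offset with tz_localize(None) and then truncate with dt.normalize(). This is a minimal sketch on made-up sample values, not the code from the answers here; it assumes every row carries the same offset.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2019-07-01T02:00:00+05:30',
                              '2019-07-02T13:00:00+05:30']))

# Drop the offset but keep the local wall-clock time,
# then set the time of day to midnight.
dates = s.dt.tz_localize(None).dt.normalize()
print(dates)
```

Unlike round('D'), normalize never moves a value to the next day, so the 13:00 row stays on 2019-07-02.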
If you want datetime.date objects instead of strings, you can also build them directly from the ISO strings with .apply (on a column that is already datetime, .dt.date gives the same result):
import pandas as pd
import datetime
df = pd.DataFrame(
{"date": [
"2019-07-01T00:00:00+05:30",
"2019-07-01T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-02T00:00:00+05:30",
"2019-07-03T00:00:00+05:30",
"2019-07-03T00:00:00+05:30",
"2019-07-04T00:00:00+05:30",
"2019-07-04T00:00:00+05:30",
"2019-07-05T00:00:00+05:30"]})
df["date"] = df["date"].apply(lambda x: datetime.datetime.fromisoformat(x).date())
print(df)

How to extract multiple parts of values of a single column?

I have a date column of the format YYYY-MM-DD. I want to slice only the year and month from it. But I don't want the "-" as I have to later convert it into an integer to feed into my linear regression model.
Its current datatype is "object".
Dataframe :-
date open close high low
0 2019-10-08 56.46 56.10 57.02 56.08
1 2019-10-09 56.76 56.76 56.95 56.41
2 2019-10-10 56.98 57.52 57.61 56.83
3 2019-10-11 58.24 59.05 59.41 58.08
4 2019-10-14 58.73 58.97 59.53 58.67
You can use pd.to_datetime to convert the date column to datetime and then use pd.Series.dt.strftime.
s = pd.to_datetime(df['date'])
df['date'] = s.dt.strftime("%Y%m") # 2019-10-08 would give '201910'
# or
# df['date'] = s.dt.strftime("%y%m") # 2019-10-08 would give '1910'
Here date is your date column. Note that this version keeps the hyphen, e.g. '2019-10':
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].apply(lambda x: x.strftime('%Y-%m'))
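Since the question asks for an integer, the year and month can also be combined arithmetically, which skips the string step entirely. This is a minimal sketch; the column name yyyymm and the sample rows are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2019-10-08', '2019-10-09', '2019-11-01']})

s = pd.to_datetime(df['date'])

# Arithmetic on the year and month components yields integers directly,
# e.g. 2019 * 100 + 10 = 201910.
df['yyyymm'] = s.dt.year * 100 + s.dt.month
print(df)
```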

Pandas - Different time formats in the same column

I have a Dataframe that has dates stored in different formats in the same column as shown below:
date
1-10-2018
2-10-2018
3-Oct-2018
4-10-2018
Is there any way I could make all of them have the same format?
Use to_datetime with an explicit format and errors='coerce', so that values which do not match the format become NaT. Then use combine_first to fill the missing values in date1 from the date2 Series.
date1 = pd.to_datetime(df['date'], format='%d-%m-%Y', errors='coerce')
date2 = pd.to_datetime(df['date'], format='%d-%b-%Y', errors='coerce')
df['date'] = date1.combine_first(date2)
print (df)
date
0 2018-10-01
1 2018-10-02
2 2018-10-03
3 2018-10-04
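With more than two layouts, the same idea can be written as a loop over candidate formats. This is a minimal sketch; parse_with_formats is a hypothetical helper name, not a pandas function, and it assumes the list of formats is not empty.

```python
import pandas as pd

def parse_with_formats(values, formats):
    """Try each format in order and keep the first successful parse per row."""
    result = None
    for fmt in formats:
        parsed = pd.to_datetime(values, format=fmt, errors='coerce')
        # Earlier formats take priority; later ones only fill the gaps.
        result = parsed if result is None else result.combine_first(parsed)
    return result

df = pd.DataFrame({'date': ['1-10-2018', '2-10-2018', '3-Oct-2018', '4-10-2018']})
df['date'] = parse_with_formats(df['date'], ['%d-%m-%Y', '%d-%b-%Y'])
print(df)
```

The order of the formats matters when a string could match more than one of them, so put the most trusted layout first.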

Pandas: datetime conversion from dtype object

I am working on a timeseries dataset which looks like this:
DateTime SomeVariable
0 01/01 01:00:00 0.24244
1 01/01 02:00:00 0.84141
2 01/01 03:00:00 0.14144
3 01/01 04:00:00 0.74443
4 01/01 05:00:00 0.99999
The date is without year. Initially, the dtype of the DateTime is object and I am trying to change it to pandas datetime format. Since the date in my data is without year, on using:
df['DateTime'] = pd.to_datetime(df.DateTime)
I am getting the error OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 01:00:00
I understand why I am getting the error (as it's not according to the pandas acceptable format), but what I want to know is how I can change the dtype from object to pandas datetime format without having year in my date. I would appreciate the hints.
EDIT 1:
Since, I got to know that I can't do it without having year in the data. So this is how I am trying to change the dtype:
df = pd.read_csv(some file location)
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'], format='%y%d/%m %H:%M:%S')
df.head()
On doing that, I am getting:
ValueError: time data '2018/ 01/01 01:00:00' doesn't match format specified.
EDIT 2:
Changing the format to '%Y/%m/%d %H:%M:%S'.
My data is hourly data, so it goes till 24h. I have only provided the demo data till 5h.
I was getting the space on adding the year to the DateTime. In order to remove that, this is what I did:
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'][1:], format='%Y/%m/%d %H:%M:%S')
I am getting the following error for that:
ValueError: time data '2018/ 01/01 02:00:00' doesn't match format specified
On changing the format to '%y/%m/%d %H:%M:%S' with the same code, this is the error I get:
ValueError: time data '2018/ 01/01 02:00:00' does not match format '%y/%m/%d %H:%M:%S' (match)
The problem is because of the gap after the year but I am not able to get rid of it.
EDIT 3:
I am able to get rid of the space after adding the year, however I am still not able to change the dtype.
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'].str.strip(), format='%Y/%m/%d %H:%M:%S')
ValueError: time data '2018/01/01 01:00:00' doesn't match format specified
I noticed that there are 2 spaces between the date and the time in the error, however adding 2 spaces in the format doesn't help.
EDIT 4 (Solution):
Removed all the multiple whitespaces. Still the format was not matching. The problem was because of the time format. The hours were from 1-24 in my data and pandas supports 0-23. Simply changed the time 24:00:00 to 00:00:00 and it works perfectly now.
This is not possible. A datetime object must have a year.
What you can do is ensure all years are aligned for your data.
For example, to convert to datetime while setting year to 2018:
df = pd.DataFrame({'DateTime': ['01/01 01:00:00', '01/01 02:00:00', '01/01 03:00:00',
'01/01 04:00:00', '01/01 05:00:00']})
df['DateTime'] = pd.to_datetime('2018/'+df['DateTime'], format='%Y/%m/%d %H:%M:%S')
print(df)
DateTime
0 2018-01-01 01:00:00
1 2018-01-01 02:00:00
2 2018-01-01 03:00:00
3 2018-01-01 04:00:00
4 2018-01-01 05:00:00
# Remove spaces. Keep in mind this will remove all spaces.
df['DateTime'] = df['DateTime'].str.replace(" ", "")
# I'm assuming year does not matter (without a year in the format, the parsed dates default to 1900) and that 01/01 is in the format day/month.
df['DateTime'] = pd.to_datetime(df['DateTime'], format='%d/%m%H:%M:%S')
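One caveat about the 24:00:00 fix described in EDIT 4: replacing 24:00:00 with 00:00:00 on its own keeps the row on the same calendar day, whereas hour 24 normally means midnight at the start of the next day. This is a minimal sketch that shifts those rows forward, assuming month/day order and the year 2018 as in the question; the sample rows and variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'DateTime': ['01/01  23:00:00',
                                '01/01  24:00:00',
                                '01/02  01:00:00']})

# Collapse repeated whitespace and trim the ends.
raw = df['DateTime'].str.split().str.join(' ')

# Remember which rows use hour 24, then rewrite them as hour 0.
is_24 = raw.str.endswith('24:00:00')
fixed = raw.str.replace('24:00:00', '00:00:00', regex=False)

# Assumption: every row belongs to 2018.
parsed = pd.to_datetime('2018/' + fixed, format='%Y/%m/%d %H:%M:%S')

# Move the rewritten rows forward by one day.
df['DateTime'] = parsed + pd.to_timedelta(is_24.astype(int), unit='D')
print(df)
```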
