conversion of any date to datetime in pandas dataframe - python

I want to define a function that takes a pandas DataFrame and iterates through its columns to find any date or timestamp fields stored as object dtype, then converts those fields to datetime. In short, I want to clean DataFrames that come from different sources with different timestamp formats, and then store them. Can you point me in the right direction?
def dateConversion(df):
    for col in df.columns:
        df[col] = pd.to_datetime(df[col])

Something like this?
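import pandas as pd

def test_datetime(dataframe):
    # Only object (string-like) columns are candidates for conversion.
    for col in dataframe.select_dtypes('object').columns:
        try:
            dataframe[col] = pd.to_datetime(dataframe[col])
            print(f'Converted {col}')
        except ValueError:
            # pd.to_datetime raises ValueError when the values cannot be parsed.
            print(f"{col} does not conform to standard datetime format")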
Test:
df = pd.DataFrame({'date_test': ['01-01-2020'],
                   'string_test': ['not a date']})
test_datetime(df)
Converted date_test
string_test does not conform to standard datetime format
print(df.dtypes)
date_test      datetime64[ns]
string_test            object
dtype: object
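A variation on the answer above, offered as a minimal sketch rather than a definitive implementation. Instead of catching ValueError, it parses with errors='coerce' and keeps the result only when enough values parsed, and it returns a copy instead of modifying the input. The function name convert_date_columns and the min_parsed threshold are hypothetical, and the sketch assumes a column should count as a date column when at least that fraction of its non-null values can be parsed.

```python
import warnings

import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def convert_date_columns(df, min_parsed=0.9):
    """Return a copy of df with date-like text columns converted to datetime.

    convert_date_columns and min_parsed are illustrative names, not pandas API.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        # Only text-like columns are candidates; numeric columns are skipped.
        if not (is_object_dtype(series) or is_string_dtype(series)):
            continue
        non_null = series.notna().sum()
        if non_null == 0:
            continue
        with warnings.catch_warnings():
            # Silence format-inference warnings for columns that are not dates.
            warnings.simplefilter("ignore")
            parsed = pd.to_datetime(series, errors="coerce")
        # Keep the conversion only if most non-null values parsed successfully.
        if parsed.notna().sum() / non_null >= min_parsed:
            out[col] = parsed
    return out


raw = pd.DataFrame({
    "date_test": ["2020-01-01", "2021-02-15"],
    "string_test": ["not a date", "still not a date"],
    "number_test": [1, 2],
})
cleaned = convert_date_columns(raw)
print(cleaned.dtypes)
```

The threshold is a judgment call: a value of 1.0 demands that every non-null entry parses, while a lower value tolerates a few malformed entries, which then become NaT.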

Related

Convert object-type hours:minutes:seconds column to datetime type in Pandas

I have a column called Time in a dataframe that looks like this:
599359 12:32:25
326816 17:55:22
326815 17:55:22
358789 12:48:25
361553 12:06:45
...
814512 21:22:07
268266 18:57:31
659699 14:28:20
659698 14:28:20
268179 17:48:53
Name: Time, Length: 546967, dtype: object
And right now it is an object dtype. I've tried the following to convert it to a datetime:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time
And I understand that the .dt.time methods are needed to prevent the Year and Month from being added, but I believe this is causing the dtype to revert to an object.
Any workarounds? I know I could do
df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True)
but I have over 500,000 rows and this is taking forever.
When you do df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S', errors='coerce', utc = True).dt.time, the 'Time' column ends up with the pandas dtype object, and each "object" in it is a Python datetime.time instance.
The pandas datetime dtype (datetime64[ns]) is a different type from Python's datetime.datetime objects, and it does not support time-only values (i.e. pandas can't treat the column as a datetime without a year, month and day). This is why the dtype is changing to object.
In the case of your second approach, df['Time'] = df['Time'].apply(pd.to_datetime, format='%H:%M:%S', errors='coerce', utc = True), something slightly different is happening. Here you're applying pd.to_datetime to each scalar element of the 'Time' series. Take a look at the return types of the function in the docs, but basically the time values in your df are being converted to pandas Timestamp objects dated 1 January 1900 (i.e. a default date is added).
So: pandas is behaving correctly. If you only want the times, then it's okay to keep datetime.time objects in the column, but to operate on them you'll probably be relying on many [slow] df.apply calls. Alternatively, keep the default date of 1900-01-01; then you can add and subtract the datetime columns and get the speed advantage of pandas. Strip off the date when you're done with it.
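A small sketch of the trade-off described above, using made-up sample values. The first part follows the answer (keep the 1900-01-01 date, strip it at the end). The pd.to_timedelta part is an alternative that the answer does not mention: it treats each value as a duration since midnight, which also keeps a native pandas dtype.

```python
import pandas as pd

times = pd.Series(["12:32:25", "17:55:22", "21:22:07"], name="Time")

# Option from the answer: keep the default 1900-01-01 date so the column
# stays a datetime dtype and supports fast vectorised arithmetic.
as_datetime = pd.to_datetime(times, format="%H:%M:%S", errors="coerce")
elapsed = as_datetime - as_datetime.iloc[0]  # timedelta result, no apply needed

# Strip the date only at the end; this yields datetime.time objects (object dtype).
time_only = as_datetime.dt.time

# Alternative (not from the answer): treat the values as durations since midnight.
as_timedelta = pd.to_timedelta(times)

print(as_datetime.dtype, time_only.dtype, as_timedelta.dtype)
```

Whether the timedelta route fits depends on what you do next: it sorts, compares and subtracts quickly, but it prints as a duration rather than as a clock time.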

how to change date datatype from object to int64 without changing it's values

I have a column in my dataframe containing dates in 1/6/2023 (m/d/yyyy) format. The column's dtype is object, but I want to convert it to int64. I have tried the following code, but it drastically changes the date values:
df = df.astype({'date':'int'})
Is there an alternative?
Convert the values to datetimes, then to strings (here in YYYYMMDD format), and finally to integers:
print(df)
       date
0  1/6/2023
df['date'] = pd.to_datetime(df['date'], dayfirst=True).dt.strftime('%Y%m%d').astype(int)
print(df)
       date
0  20230601
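Note that dayfirst=True reads 1/6/2023 as 1 June, which is why the result above is 20230601. If the data really is month-first, as the question states, an explicit format avoids the ambiguity. The sketch below assumes every value follows m/d/yyyy; the second sample row is invented for illustration, and the arithmetic route is an alternative that is not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"date": ["1/6/2023", "12/25/2023"]})

# An explicit format removes the day/month ambiguity: month first, then day.
parsed = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Route 1: format as YYYYMMDD text, then cast to integer.
via_string = parsed.dt.strftime("%Y%m%d").astype("int64")

# Route 2: build the same integer arithmetically, skipping the string step.
via_arithmetic = (
    parsed.dt.year.astype("int64") * 10000
    + parsed.dt.month.astype("int64") * 100
    + parsed.dt.day.astype("int64")
)

print(via_string.tolist())      # [20230106, 20231225]
print(via_arithmetic.tolist())  # [20230106, 20231225]
```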

Separating Date and Time in Pandas

I have a data file with timestamps that look like this:
It gets loaded into pandas with a column name of "Time". I am trying to create two new datetime64 type columns, one with the date and one with the time (hour). I have explored a few solutions to this problem on StackOverflow but am still having issues. Quick note, I need the final columns to not be objects so I can use pandas and numpy functionality.
I load the dataframe and create two new columns like so:
df = pd.read_csv('C:\\Users\\...\\xyz.csv')
df['Date'] = pd.to_datetime(df['Time']).dt.date
df['Hour'] = pd.to_datetime(df['Time']).dt.time
This works but the Date and Hour columns are now objects.
I run this to convert the date to my desired datetime64 data type and it works:
df['Date'] = pd.to_datetime(df['Date'])
However, when I try to use this same code on the Hour column, I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
I did some digging and found the following code which runs:
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S')
However the actual output includes a generic date like so:
When I try to run code referencing the Hour column like so:
HourVarb = '15:00:00'
df['Test'] = np.where(df['Hour']==HourVarb,1,np.nan)
It runs but doesn't produce the result I want.
Perhaps my HourVarb variable is in the wrong format for the numpy code? Or is the 1/1/1900 date causing problems, so that the format %H:%M:%S needs to change? My end goal is to be able to reference the hour and the date to filter out specific date/hour combinations. Please help.
One note, when I change the HourVarb to '1/1/1900 15:00:00' the code above works as intended, but I'd still like to understand if there is a cleaner way that removes the date. Thanks
I'm not sure I understand the problem with the 'object' datatypes of these columns.
I loaded the data you provided this way:
df = pd.read_csv('xyz.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
print(df.dtypes)
And I get these data types:
Time    datetime64[ns]
Date            object
Hour            object
The fact that Date and Hour are object types should not be a problem. The underlying data is a datetime type:
print(type(df.Date.iloc[0]))
print(type(df.Hour.iloc[0]))
<class 'datetime.date'>
<class 'datetime.time'>
This means you can use these columns as such. For example:
print(df['Date'] + pd.Timedelta('1D'))
What are you trying to do that is requiring the column dtype to be a Pandas dtype?
UPDATE
Here is how you achieve the last part of your question:
from datetime import datetime, time
hourVarb = datetime.strptime("15:00:00", '%H:%M:%S').time()
# or hourVarb = time(15, 0)
df['Test'] = df['Hour'] == hourVarb
print(df['Test'])
0     True
1    False
2    False
3    False
Name: Test, dtype: bool
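If non-object columns are a hard requirement, one possible approach (an alternative to the datetime.time comparison above, not something the answer proposes) is to keep the date as a midnight-normalised datetime column and the hour as an integer. The sample timestamps below are invented, since the original data is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2020-01-01 15:00:00",
        "2020-01-01 16:30:00",
        "2020-01-02 15:00:00",
    ])
})

# normalize() sets the time part to midnight but keeps the datetime dtype.
df["Date"] = df["Time"].dt.normalize()
# dt.hour gives a plain integer column.
df["Hour"] = df["Time"].dt.hour

# Filter on a specific date/hour combination without object columns.
mask = (df["Date"] == pd.Timestamp("2020-01-01")) & (df["Hour"] == 15)
print(df[mask])
```

This only matches on the hour. To match an exact time such as 15:00:00, compare df["Time"] against a full timestamp, or use the datetime.time comparison shown in the answer.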

Extract month data from pandas Dataframe

I originally have dates in string format.
I want to extract the month as a number from these dates.
df = pd.DataFrame({'Date':['2011/11/2', '2011/12/20', '2011/8/16']})
I convert them to a pandas datetime object.
df['Date'] = pd.to_datetime(df['Date'])
I then want to extract all the months.
When I try:
df.loc[0]["Date"].month
This works returning the correct value of 11.
But when I try to call multiple months it doesn't work?
df.loc[1:2]["Date"].month
AttributeError: 'Series' object has no attribute 'month'
df.loc[0]["Date"] returns a scalar: pd.Timestamp objects have a month attribute, which is what you are accessing.
df.loc[1:2]["Date"] returns a series: pd.Series objects do not have a month attribute, they do have a dt.month attribute if df['Date'] is a datetime series.
In addition, don't use chained indexing. You can use:
df.loc[0, 'Date'].month for a scalar
df.loc[1:2, 'Date'].dt.month for a series
These are different accessors. For a Series of datetimes, use pandas.Series.dt.month; for a scalar, use the month attribute of pandas.Timestamp. For an index, use pandas.DatetimeIndex.month directly; there is no .dt on an index.
So you need:
#Series
df.loc[1:2, "Date"].dt.month
#scalar
df.loc[0, 'Date'].month
#DatetimeIndex
df.set_index('Date').index.month
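The three cases can be checked side by side with the data from the question. This is only a sketch. The explicit format argument is an addition here to make the parsing unambiguous, and for the index case the month is read from the DataFrame's index, since month is an attribute of the DatetimeIndex and not of the DataFrame itself.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2011/11/2", "2011/12/20", "2011/8/16"]})
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# Scalar: a Timestamp has a .month attribute.
scalar_month = df.loc[0, "Date"].month

# Series: go through the .dt accessor.
series_months = df.loc[1:2, "Date"].dt.month

# DatetimeIndex: .month is available directly on the index, with no .dt.
index_months = df.set_index("Date").index.month

print(scalar_month)            # 11
print(series_months.tolist())  # [12, 8]
print(index_months.tolist())   # [11, 12, 8]
```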

pandas timestamp series to string?

I am new to Python (coming from R), and I am trying to understand how I can convert a timestamp series in a pandas dataframe (in my case this is called df['timestamp']) into what I would call a string vector in R. Is this possible? How would this be done?
I tried df['timestamp'].apply('str'), but this seems to simply put the entire column df['timestamp'] into one long string. I'm looking to convert each element into a string and preserve the structure, so that it's still a vector (or maybe this is called an array?).
Consider the dataframe df
df = pd.DataFrame(dict(timestamp=pd.to_datetime(['2000-01-01'])))
df
   timestamp
0 2000-01-01
Use the datetime accessor dt to access the strftime method. You can pass a format string to strftime and it will return a formatted string. When used with the dt accessor you will get a series of strings.
df.timestamp.dt.strftime('%Y-%m-%d')
0    2000-01-01
Name: timestamp, dtype: object
Visit strftime.org for a handy set of format strings.
Use astype
>>> import pandas as pd
>>> df = pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))
>>> df.astype(str)
0    2009-07-31
1    2010-01-10
2           NaT
dtype: object
This returns a Series of strings, one per element.
Following on from VinceP's answer, to convert a datetime Series in-place do the following:
df['Column_name']=df['Column_name'].astype(str)
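To compare the two approaches from the answers, here is a short sketch with invented sample timestamps. Both return one string per element and keep the Series structure. strftime gives control over the format, while astype(str) uses the default string representation. Missing values (NaT) may be rendered differently by the two methods, so check that case on your own pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2000-01-01 12:30:00", "2000-01-02 08:15:00"])
})

# Explicit format via the .dt accessor: one formatted string per row.
formatted = df["timestamp"].dt.strftime("%Y-%m-%d %H:%M")

# Default representation: astype(str) also converts element by element.
default = df["timestamp"].astype(str)

print(formatted.tolist())  # ['2000-01-01 12:30', '2000-01-02 08:15']
print(default.tolist())
```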
