Pandas series of timestamps casting to string - python

I'm new to pandas and am curious whether there is something wrong with my approach to splitting a date column into separate hour, minute, second, etc. columns.
I have a column in my dataframe (df) that contains only dates ('dates'). I convert this to a series of timestamps using the below line of code:
dt = to_datetime(df['dates'], format='%m/%d/%Y %H:%M:%S.%f', errors='ignore')
When I iterate through dt using a for loop I get all of the dates parsed properly and can call out separate properties of the timestamp without an issue. For example:
for i in [d.year for d in dt]:
    print(i)  # Correctly prints 2017 for all dates
for i in [d for d in dt]:
    print(type(i))  # Prints <class 'pandas._libs.tslib.Timestamp'>
But when I attempt to add a new column to my dataframe using the below line of code:
df['year'] = Series([d.year for d in dt], index=df.index)
I get an attribute error:
AttributeError: 'str' object has no attribute 'year'
Some example date strings are: '09/22/2017 19:23:12.993', '09/01/2017 02:47:11.593'.
Does anyone know why Python appears to interpret d as a string in the final example but not in the first two?
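One likely explanation, assuming some rows in the full file fail to parse: with errors='ignore', to_datetime hands back the original input unchanged (still strings) whenever parsing fails, instead of raising. A minimal sketch using errors='coerce' to expose the offending rows (the 'not a date' row is a made-up bad value for illustration):

```python
import pandas as pd

df = pd.DataFrame({'dates': ['09/22/2017 19:23:12.993',
                             '09/01/2017 02:47:11.593',
                             'not a date']})  # last row: deliberate bad value

# errors='coerce' turns only the unparseable values into NaT,
# rather than silently returning the whole input as strings
# (which is what errors='ignore' does on any failure)
dt = pd.to_datetime(df['dates'], format='%m/%d/%Y %H:%M:%S.%f',
                    errors='coerce')

bad = df.loc[dt.isna(), 'dates']  # the rows that failed to parse
print(bad)
```

If bad is non-empty, that would explain why dt is a Series of strings in one run and Timestamps in another.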


Selecting a df row by month formatted with (lambda x: datetime.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z'))

I'm having some issues with geopandas and pandas datetime objects. I kept getting the error
pandas Invalid field type <class 'pandas._libs.tslibs.timedeltas.Timedelta'>
when I tried to save with gpd.to_file(). Apparently this is a known issue between pandas and geopandas date types, so I used
df.DATE = df.DATE.apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z'))
to get a datetime object I could manipulate without hitting that error on save. Due to that change, my selection by month
months = [4]
for month in months:
    df = df[[(pd.DatetimeIndex(df.DATE).month == month)]]
no longer works, throwing a value error.
ValueError: Item wrong length 1 instead of 108700.
I tried dropping the pd.DatetimeIndex but this throws a dataframe series error
AttributeError: 'Series' object has no attribute 'month'
and
df = df[(df.DATE.month == month)]
gives me the same error.
I know it converted over to a datetime object because print(df.dtypes) shows DATE datetime64[ns, UTC] and
for index, row in df.iterrows():
    print(row.DATE.month)
prints the month as an integer to the terminal.
Without going back to pd.Datetime, how can I fix my select statement for the month?
The statement df.DATE returns a Series object, and a Series doesn't have a .month attribute. The individual dates inside the Series do, which is why row.DATE.month works. Try something like:
mask = [x.month == month for x in df.DATE]
df_filtered = df[mask]
(Using mask rather than filter also avoids shadowing the built-in filter.) Beyond that, I'm not sure what you were trying to accomplish with pd.DatetimeIndex(df.DATE).month == month, but a similar fix should take care of it.
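Since DATE is already a datetime64[ns, UTC] column, a vectorized .dt.month comparison also works and avoids the Python loop entirely. A minimal sketch with hypothetical sample dates:

```python
import pandas as pd

# Hypothetical frame with a tz-aware DATE column, like the asker's
df = pd.DataFrame({'DATE': pd.to_datetime(
    ['2021-04-01T12:00:00+00:00', '2021-05-02T08:30:00+00:00'])})

month = 4
# .dt.month is element-wise, so the boolean mask has the same
# length as the frame -- no extra brackets, no loop needed
df_filtered = df[df.DATE.dt.month == month]
```

The original ValueError came from the doubled brackets: df[[mask]] wraps the mask in a one-element list, which pandas tries to use as a column selector of length 1.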

Separating Date and Time in Pandas

I have a data file with timestamps that look like this:
It gets loaded into pandas with a column name of "Time". I am trying to create two new datetime64 type columns, one with the date and one with the time (hour). I have explored a few solutions to this problem on StackOverflow but am still having issues. Quick note, I need the final columns to not be objects so I can use pandas and numpy functionality.
I load the dataframe and create two new columns like so:
df = pd.read_csv('C:\\Users\\...\\xyz.csv')
df['Date'] = pd.to_datetime(df['Time']).dt.date
df['Hour'] = pd.to_datetime(df['Time']).dt.time
This works but the Date and Hour columns are now objects.
I run this to convert the date to my desired datetime64 data type and it works:
df['Date'] = pd.to_datetime(df['Date'])
However, when I try to use this same code on the Hour column, I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
I did some digging and found the following code which runs:
df['Hour'] = pd.to_datetime(df['Hour'], format='%H:%M:%S')
However, the actual output includes a generic date of 1/1/1900 prepended to each time.
When I try to run code referencing the Hour column like so:
HourVarb = '15:00:00'
df['Test'] = np.where(df['Hour'] == HourVarb, 1, np.nan)
It runs but doesn't produce the result I want.
Perhaps my HourVarb variable is the wrong format for the numpy code? Alternatively, the 1/1/1900 is causing problems and the format %H:%M:%S needs to change? My end goal is to be able to reference the hour and the date to filter out specific date/hour combinations. Please help.
One note: when I change HourVarb to '1/1/1900 15:00:00', the code above works as intended, but I'd still like to understand if there is a cleaner way that removes the date. Thanks.
I'm not sure I understand the problem with the 'object' datatypes of these columns.
I loaded the data you provided this way:
df = pd.read_csv('xyz.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time
print(df.dtypes)
And I get these data types:
Time datetime64[ns]
Date object
Hour object
The fact that Date and Hour are object types should not be a problem. The underlying data is a datetime type:
print(type(df.Date.iloc[0]))
print(type(df.Hour.iloc[0]))
<class 'datetime.date'>
<class 'datetime.time'>
This means you can use these columns as such. For example:
print(df['Date'] + pd.Timedelta('1D'))
What are you trying to do that is requiring the column dtype to be a Pandas dtype?
UPDATE
Here is how you achieve the last part of your question:
from datetime import datetime, time
hourVarb = datetime.strptime("15:00:00", '%H:%M:%S').time()
# or hourVarb = time(15, 0)
df['Test'] = df['Hour'] == hourVarb
print(df['Test'])
0     True
1    False
2    False
3    False
Name: Test, dtype: bool
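For the stated end goal of filtering specific date/hour combinations, the same idea extends to both object columns at once. A sketch with made-up timestamps (the xyz.csv data isn't shown):

```python
from datetime import date, time

import pandas as pd

df = pd.DataFrame({'Time': pd.to_datetime(
    ['2020-01-01 15:00:00', '2020-01-01 16:00:00', '2020-01-02 15:00:00'])})
df['Date'] = df['Time'].dt.date
df['Hour'] = df['Time'].dt.time

# Both columns hold plain datetime.date / datetime.time objects,
# so compare against the matching stdlib types
wanted = df[(df['Date'] == date(2020, 1, 1)) & (df['Hour'] == time(15, 0))]
```

No 1/1/1900 placeholder ever enters the picture this way, because the comparison stays in datetime.time rather than round-tripping through to_datetime.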

How to pop out the error-causing date records using pandas?

I have a dataframe like as shown below
df = pd.DataFrame({'date': ['45:42.7','11/1/2012 0:00','20/1/2012 2:48','15/1/2012 0:00',np.nan]})
I would like to convert the date column to type datetime.
So, I tried the below
df['date'] = pd.to_datetime(df['date'])
I get the below error
ValueError: hour must be in 0..23
As we can see from the sample dataframe, the NaN is not causing this error; the culprit is the first record, 45:42.7.
The raw Excel file displays this value as 45:42.7 when I open the file, but when I double-click the cell it correctly displays the actual date.
How can I filter the dataframe to pop-out the first record as output (which is the error causing record)?
I expect my output to be like shown in sample dataframe below
df = pd.DataFrame({'error_date': ['45:42.7']})
First, if you need to see the wrong values, convert to datetimes and filter the rows that come back missing:
print(df[pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M',errors='coerce').isna()])
The NaN itself is no problem; you need to specify the column format, and with the errors='coerce' parameter the rows that don't match it become NaT:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M',errors='coerce')
print (df)
                 date
0                 NaT
1 2012-01-11 00:00:00
2 2012-01-20 02:48:00
3 2012-01-15 00:00:00
4                 NaT
The error is caused by a value like 24:00, since 24 is not a valid hour.
Testing with (note the change in the second entry to 24:00):
df = pd.DataFrame({'date': ['6/3/2012 8:57','11/1/2012 24:00','20/1/2012 2:48','15/1/2012 0:00',np.nan]})
We receive the same error as in your big dataframe. Going through with a for loop may be a bit slower, but this way we can catch the offending rows.
wrong_datetime_list = []
for index, value in enumerate(df['date']):
    try:
        df.loc[index, 'date'] = pd.to_datetime(df.loc[index, 'date'])
    except ValueError:  # catch only parse errors, not everything
        wrong_datetime_list.append((index, value))
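The loop works, but a vectorized alternative that also produces the asked-for error_date output is to combine errors='coerce' with a notna check on the original column. A sketch using the question's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['45:42.7', '11/1/2012 0:00', '20/1/2012 2:48',
                            '15/1/2012 0:00', np.nan]})

parsed = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M', errors='coerce')

# An error row is one that was not missing to begin with
# but still failed to parse (came back NaT)
error_dates = (df.loc[parsed.isna() & df['date'].notna(), ['date']]
                 .rename(columns={'date': 'error_date'}))
print(error_dates)
```

The notna() guard is what distinguishes a genuine parse failure ('45:42.7') from an input that was already NaN.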

Iterate through defined Datetime index range in Pandas Dataframe

I can see from here how to iterate through a list of dates from a datetime index. However, I would like to define the range of dates using:
my_df['Some_Column'].first_valid_index()
and
my_df['Some_Column'].last_valid_index()
My attempt looks like this:
for today_index, values in range(my_df['Some_Column'].first_valid_index(), my_df['Some_Column'].last_valid_index()):
    print(today_index)
However I get the following error:
TypeError: 'Timestamp' object cannot be interpreted as an integer
How do I inform the loop to restrict to those specific dates?
I think you need date_range:
s = my_df['Some_Column'].first_valid_index()
e = my_df['Some_Column'].last_valid_index()
r = pd.date_range(s, e)
And to iterate over it in a for loop:
for val in r:
    print(val)
If you need to select the corresponding rows of the DataFrame:
df1 = my_df.loc[s:e]
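Putting the pieces together, a runnable sketch with made-up data (a NaN at each end so the valid range is strictly inside the index):

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame(
    {'Some_Column': [np.nan, 1.0, 2.0, 3.0, np.nan]},
    index=pd.date_range('2021-01-01', periods=5))

s = my_df['Some_Column'].first_valid_index()  # first non-NaN label
e = my_df['Some_Column'].last_valid_index()   # last non-NaN label

for day in pd.date_range(s, e):  # day-by-day Timestamps, inclusive
    print(day.date())

df1 = my_df.loc[s:e]  # or slice the frame directly; .loc is label-inclusive
```

The original TypeError arose because range() needs integers, while first_valid_index() returns a Timestamp label; date_range accepts Timestamps directly.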

Extract month data from pandas Dataframe

I originally have dates in string format.
I want to extract the month as a number from these dates.
df = pd.DataFrame({'Date':['2011/11/2', '2011/12/20', '2011/8/16']})
I convert them to a pandas datetime object.
df['Date'] = pd.to_datetime(df['Date'])
I then want to extract all the months.
When I try:
df.loc[0]["Date"].month
This works returning the correct value of 11.
But when I try to get the month for multiple rows, it doesn't work:
df.loc[1:2]["Date"].month
AttributeError: 'Series' object has no attribute 'month'
df.loc[0]["Date"] returns a scalar: pd.Timestamp objects have a month attribute, which is what you are accessing.
df.loc[1:2]["Date"] returns a series: pd.Series objects do not have a month attribute, they do have a dt.month attribute if df['Date'] is a datetime series.
In addition, don't use chained indexing. You can use:
df.loc[0, 'Date'].month for a scalar
df.loc[1:2, 'Date'].dt.month for a series
These are different accessors: pandas.Series.dt.month for a Series of datetimes, Timestamp.month for a scalar, and pandas.DatetimeIndex.month for a DatetimeIndex (no .dt needed there).
So need:
# Series
df.loc[1:2, 'Date'].dt.month
# scalar
df.loc[0, 'Date'].month
# DatetimeIndex
df.set_index('Date').index.month
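All three accessors on the question's own sample data, as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2011/11/2', '2011/12/20', '2011/8/16']})
df['Date'] = pd.to_datetime(df['Date'])

scalar_month = df.loc[0, 'Date'].month           # Timestamp.month
series_months = df.loc[1:2, 'Date'].dt.month     # Series needs .dt
index_months = df.set_index('Date').index.month  # DatetimeIndex, no .dt

print(scalar_month, series_months.tolist(), list(index_months))
```

Note that .loc[1:2] is label-based and therefore inclusive of row 2, so the Series result covers both remaining dates.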
