How to replace dates out of range with valid dates in pandas - python

I've got a data set with 10,000 entries; one variable among others is the birthday. All the entries are unique. I noticed that about 200 entries have 1/1/1900 as their birthday. The next most frequent date has a frequency of only 4, and that date also doesn't make any sense in this data set. I reckon 1/1/1900 was used as a placeholder since the birthday couldn't be left empty. Long story short, I want to replace the dates of these entries with valid dates using the backfill method.
I changed the column with the birthday to a datetime object:
df['Client Birthdate'] = pd.to_datetime(df['Client Birthdate'], yearfirst=True)
I then tried to use:
timestamp = pd.Timestamp(year=1900, month=1, day=1)
df['Client Birthdate'] = df['Client Birthdate'].replace(to_replace=timestamp, method='bfill')
However, df['Client Birthdate'].describe() still gave me this as output:
count 10000
unique 7897
top 1900-01-01 00:00:00
freq 198
first 1900-01-01 00:00:00
last 1999-12-30 00:00:00
Name: Client Birthdate, dtype: object
So I tried using:
df['Client Birthdate'] = df['Client Birthdate'].replace(to_replace=timestamp, value=False)
df['Client Birthdate'] = df['Client Birthdate'].fillna(method='bfill')
which gave me:
count 10000
unique 7897
top False
freq 198
Name: Client Birthdate, dtype: object
I have no idea why replace/fillna doesn't work; are they not compatible with datetime objects?
Is there also a way to replace all dates 'out of range', let's say birthdays before 1920 and after 2001, with valid dates?

I also tried replace and I think the problem comes from how to_replace matches values (including regexes); in any case, you can solve it with:
df.loc[df["Client Birthdate"].eq(timestamp), "Client Birthdate"] = pd.NaT
df["Client Birthdate"] = df["Client Birthdate"].bfill()
I assigned NaT (not a time) where "Client Birthdate" is equal to the timestamp variable and then used bfill on the series.
As for your second problem, you can use Series.between to define a range of acceptable dates (between_time only filters by time of day, not by date). Then, if anything falls outside the range, you can fill the values or replace them with something more sensible, as sketched below.
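For example, a minimal sketch of that idea (assuming the column is already datetime64; the 1920/2001 bounds come from your question, and where/bfill is just one possible fill strategy):
# mark birthdays inside the acceptable range; everything outside becomes NaT
valid = df['Client Birthdate'].between('1920-01-01', '2001-12-31')
# backfill the NaT gaps, same as with the placeholder dates above
df['Client Birthdate'] = df['Client Birthdate'].where(valid).bfill()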

I tried to make a simple dataframe:
df_dict = {
    'Client Birthdate': '1/1/1900'
}
df = pd.DataFrame(df_dict, index=range(len(df_dict)))
Calling df:
Client Birthdate
0 1/1/1900
Then, I used infer_datetime_format within pd.to_datetime():
df['Client Birthdate'] = pd.to_datetime(df['Client Birthdate'], infer_datetime_format=True)
Output of calling df again:
Client Birthdate
0 1900-01-01
And, dtypes:
Client Birthdate datetime64[ns]
dtype: object
However, to get the hour-minute-second-microsecond result into your column, you have to know and set the format using strftime(). Here's a simple example:
pd.to_datetime(df['Client Birthdate'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d %H:%M:%S.%f')
Output:
0 1900-01-01 00:00:00.000000
Name: Client Birthdate, dtype: object
Finally, to update your dates, just subset the dataframe and set it equal to the date that you want. This example uses .loc[] since pandas will probably throw a SettingWithCopyWarning otherwise.
df.loc[df['Client Birthdate'] == '1/1/1900', :] = timestamp

Related

How to convert Pandas Series of strings to Pandas datetime with non-standard formats that contain dates before 1970

I have a column of dates in the following format:
Jan-85
Apr-99
Nov-01
Feb-65
Apr-57
Dec-19
May-65
I want to convert this to a pandas datetime object.
The following syntax works to convert them:
pd.to_datetime(temp, format='%b-%y')
where temp is the pd.Series object of dates. The glaring issue here of course is that dates that are prior to 1970 are being wrongly converted to 20xx.
I tried updating the function call with the following parameter:
pd.to_datetime(temp, format='%b-%y', origin='1950-01-01')
However, I am getting the error:
Name: temp, Length: 42537, dtype: object' is not compatible with origin='1950-01-01'; it must be numeric with a unit specified
I tried specifying a unit as it said, but I got a different error citing that the unit cannot be specified alongside a format.
Any ideas how to fix this?
Just @DudeWah's logic, but improving upon the code:
def days_of_future_past(date, chk_y=pd.Timestamp.today().year):
    return date.replace(year=date.year - 100) if date.year > chk_y else date

temp = pd.to_datetime(temp, format='%b-%y').map(days_of_future_past)
Output:
>>> temp
0 1985-01-01
1 1999-04-01
2 2001-11-01
3 1965-02-01
4 1957-04-01
5 2019-12-01
6 1965-05-01
Name: date, dtype: datetime64[ns]
Gonna go ahead and answer my own question so others can use this solution if they come across this same issue. Not the greatest, but it gets the job done. It should work until 2069, so hopefully pandas will have a better solution to this by then lol
Perhaps someone else will post a better solution.
def wrong_date_preprocess(data):
    """Correct date issues with pre-1970 dates in whacky mon-yy format."""
    df1 = data.copy()
    dates = df1['date_column_of_interest']
    # use particular datetime format with data; ex: jan-91
    dates = pd.to_datetime(dates, format='%b-%y')
    # look at wrongly defined python dates (pre 1970) and get indices
    date_dummy = dates[dates > pd.Timestamp.today().floor('D')]
    idx = list(date_dummy.index)
    # fix wrong dates by offsetting back 100 years those that defaulted to > 2069
    dummy2 = date_dummy.apply(lambda x: x.replace(year=x.year - 100)).to_list()
    dates.loc[idx] = dummy2
    df1['date_column_of_interest'] = dates
    return df1
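A hypothetical call, assuming your dataframe actually has the date_column_of_interest column used inside the function:
df_fixed = wrong_date_preprocess(df)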

Convert Pandas column into same date time format

I have a file where the date and time are in mixed formats as per below:
Ref_ID Date_Time
5.645217e 2020-12-02 16:23:15
5.587422e 2019-02-25 18:33:24
What I'm trying to do is convert the dates into a standard format so that I can further analyse my dataset.
Expected Outcome:
Ref_ID Date_Time
5.645217e 2020-02-12 16:23:15
5.587422e 2019-02-25 18:33:24
So far I've tried a few things like a pandas to_datetime conversion and converting the date using strptime, but none has worked.
# Did not work
data["Date_Time"] = pd.to_datetime(data["Date_Time"], errors="coerce")
# Also Did not work
data["Date_Time"] = data["Date_Time"].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%y'))
I've also searched this site for a solution but haven't found one yet.
You could try using str.split to extract the day and month and use some boolean testing.
This may be a bit confusing with all the variables, but all we are doing is creating new series and dataframes to manipulate the day and month of your original date-time column:
# create a new dataframe with Date_Time split by whitespace, so date and time are separate
s = df['Date_Time'].str.split(r'\s', expand=True)
# split the date into its own dataframe
m = s[0].str.split('-', expand=True).astype(int)
# use conditional logic to figure out which column is the month and which is the day
m['possible_month'] = np.where(m[1].ge(12), m[2], m[1])
m['possible_day'] = np.where(m[1].ge(12), m[1], m[2])
# concat this back into your first split to re-create a proper datetime
s[0] = m[0].astype(str).str.cat(
    [m['possible_month'].astype(str), m['possible_day'].astype(str)], '-')
df['fixed_date'] = pd.to_datetime(s[0].str.cat(s[1], ' '),
                                  format='%Y-%m-%d %H:%M:%S')
print(df)
Ref_ID Date_Time fixed_date
0 5.645217e 2020-12-02 16:23:15 2020-02-12 16:23:15
1 5.587422e 2019-02-25 18:33:24 2019-02-25 18:33:24
print(df.dtypes)
Ref_ID object
Date_Time object
fixed_date datetime64[ns]
dtype: object

KeyError: Timestamp when converting date in column to date

Trying to convert the date (type=datetime) of a complete column into a date to use in a condition later on. The following error keeps showing up:
KeyError: Timestamp('2010-05-04 10:15:55')
Tried multiple things but I'm currently stuck with the code below.
for d in df.column:
    pd.to_datetime(df.column[d]).apply(lambda x: x.date())
Also, how do I format the column so I can use it in a statement as follows:
df = df[df.column > 2015-05-28]
Just adding an answer in case anyone else ends up here:
First, let's create a dataframe with some dates, change the dtype to string, and convert it back. The errors='ignore' argument will leave any non-datetime values in your column untouched, so if you had 'John Smith' in row x it would remain; in the same vein, if you changed it to errors='coerce', it would turn 'John Smith' into NaT (a not-a-time value).
# Create date range with frequency of a day
rng = pd.date_range(start='01/01/18', end='01/01/19', freq='D')
#pass this into a dataframe
df = pd.DataFrame({'Date' : rng})
print(df.dtypes)
Date datetime64[ns]
# okay, let's cast this into a str so we can convert it back
df['Date'] = df['Date'].astype(str)
print(df.dtypes)
Date object
# now let's convert it back
df['Date'] = pd.to_datetime(df.Date, errors='ignore')
print(df.dtypes)
Date datetime64[ns]
# okay, let's slice the data frame for your desired date
print(df.loc[df.Date > '2018-12-29'])
Date
363 2018-12-30
364 2018-12-31
365 2019-01-01
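As a quick sketch of the errors='coerce' behaviour mentioned above (the 'John Smith' row is just a placeholder non-date value):
s = pd.Series(['2018-01-01', 'John Smith'])
print(pd.to_datetime(s, errors='coerce'))
0   2018-01-01
1          NaT
dtype: datetime64[ns]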
The answer as provided by @Datanovice:
pd.to_datetime(df['your column'], errors='ignore')
Then inspect the dtype; it should be a datetime. If so, just do:
df.loc[df['your column'] > 'your-date']

How to perform logical tests on time values in a pandas dataframe

I have an excel sheet where one column contains a time field, where the values are the time of day entered as four digits: i.e. 0845, 1630, 1000.
I've read this into a pandas dataframe for analysis, one piece of which is labeling each time as day or evening. To do this, I first changed the datatype and format:
# Get start time as time
df['START_TIME'] = pd.to_datetime(df['START_TIME'],format='%H%M').dt.time
Which gets the values looking like:
08:45:00
16:30:00
10:00:00
The new dtype is object.
When I try to perform a logical test on that field, i.e.
# Create indicator of whether course begins before or after 4:00 PM
df['DAY COURSE INDICATOR'] = df['START_TIME'] < '16:00:00'
I get a Type Error:
TypeError: '<' not supported between instances of 'datetime.time' and 'str'
or a syntax error if I remove the quotes.
What is the best way to create that indicator, and how do I work with stand-alone time values? Or am I better off just leaving them as integers?
You can't compare a datetime.time and a str but you certainly can compare a datetime.time and a datetime.time:
import datetime
df['DAY COURSE INDICATOR'] = df['START_TIME'] < datetime.time(16, 0)
You can do exactly what you did in the first place:
pd.to_datetime(df['START_TIME'], format='%H:%M:%S') < pd.to_datetime('16:00:00', format='%H:%M:%S')
Example:
df = pd.DataFrame({'START_TIME': ['08:45:00']})
>>> pd.to_datetime(df['START_TIME'], format='%H:%M:%S') < pd.to_datetime('16:00:00', format='%H:%M:%S')
0 True
Name: START_TIME, dtype: bool

How do I change the Date but not the Time of a Timestamp within a dataframe column?

Python 3.6.0
I am importing a file with Unix timestamps.
I’m converting them to Pandas datetime and rounding to 10 minutes (12:00, 12:10, 12:20,…)
The data is collected from within a specified time period, but from different dates.
For our analysis, we want to change all dates to the same dates before doing a resampling.
At present we have a reduce_to_date that is the target for all dates.
current_date = pd.to_datetime('2017-04-05') #This will later be dynamic
reduce_to_date = current_date - pd.DateOffset(days=7)
I’ve tried to find an easy way to change the date in a series without changing the time.
I was trying to avoid lengthy conversions with .strftime().
One method that I've almost settled is to add the reduce_to_date and df['Timestamp'] difference to df['Timestamp']. However, I was trying to use the .date() function and that only works on a single element, not on the series.
GOOD!
passed_df['Timestamp'][0] = passed_df['Timestamp'][0] + (reduce_to_date.date() - passed_df['Timestamp'][0].date())
NOT GOOD
passed_df['Timestamp'][:] = passed_df['Timestamp'][:] + (reduce_to_date.date() - passed_df['Timestamp'][:].date())
AttributeError: 'Series' object has no attribute 'date'
I can use a loop:
x = 1
for line in passed_df['Timestamp']:
    passed_df['Timestamp'][x] = line + (reduce_to_date.date() - line.date())
    x += 1
But this throws a warning:
C:\Users\elx65i5\Documents\Lightweight Logging\newmain.py:60: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The goal is to have all dates the same, but leave the original time.
If we can simply specify the replacement date, that’s great.
If we can use mathematics and change each date according to a time delta, equally as great.
Can we accomplish this in a vectorized fashion without using .strftime() or a lengthy procedure?
If I understand correctly, you can simply subtract an offset
passed_df['Timestamp'] -= pd.offsets.Day(7)
demo
passed_df = pd.DataFrame(dict(
    Timestamp=pd.to_datetime(['2017-04-05 15:21:03', '2017-04-05 19:10:52'])
))
# Make sure your `Timestamp` column is datetime.
# Mine is because I constructed it that way.
# Use
# passed_df['Timestamp'] = pd.to_datetime(passed_df['Timestamp'])
passed_df['Timestamp'] -= pd.offsets.Day(7)
print(passed_df)
Timestamp
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
using strftime
Though this is not ideal, I wanted to make the point that you absolutely can use strftime. When your column is datetime, you can use strftime via the .dt accessor as dt.strftime. You can create a dynamic column where you specify the target date like this:
pd.to_datetime(passed_df.Timestamp.dt.strftime('{} %H:%M:%S'.format('2017-03-29')))
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
Name: Timestamp, dtype: datetime64[ns]
I think you need to convert df['Timestamp'].dt.date back with to_datetime, because the output of date is a Python date object, not a pandas datetime object:
df = pd.DataFrame({'Timestamp': pd.to_datetime(['2017-04-05 15:21:03', '2017-04-05 19:10:52'])})
print(df)
Timestamp
0 2017-04-05 15:21:03
1 2017-04-05 19:10:52
current_date = pd.to_datetime('2017-04-05')
reduce_to_date = current_date - pd.DateOffset(days=7)
df['Timestamp'] = df['Timestamp'] - pd.to_datetime(df['Timestamp'].dt.date) + reduce_to_date
print(df)
Timestamp
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
