I'm trying to convert a whole column of datetimes (dtype=datetime) into dates, to use in a condition later on. The following error keeps showing up:
KeyError: Timestamp('2010-05-04 10:15:55')
I've tried multiple things, but I'm currently stuck with the code below.
for d in df.column:
    pd.to_datetime(df.column[d]).apply(lambda x: x.date())
Also, how do I format the column so I can use it in a statement as follows:
df = df[df.column > 2015-05-28]
Just adding an answer in case anyone else ends up here:
Firstly, let's create a dataframe with some dates, change the dtype to string, and convert it back. The errors='ignore' argument will leave any non-datetime values in your column untouched, so if you had 'John Smith' in row x it would remain; in the same vein, errors='coerce' would change 'John Smith' into NaT (not-a-time).
# Create date range with frequency of a day
rng = pd.date_range(start='01/01/18', end='01/01/19', freq='D')
# pass this into a dataframe
df = pd.DataFrame({'Date' : rng})
print(df.dtypes)
Date datetime64[ns]
# okay, let's cast this to str so we can convert it back
df['Date'] = df['Date'].astype(str)
print(df.dtypes)
Date object
# now let's convert it back
df['Date'] = pd.to_datetime(df.Date,errors='ignore')
print(df.dtypes)
Date datetime64[ns]
# okay, let's slice the dataframe for your desired date
print(df.loc[df.Date > '2018-12-29'])
Date
363 2018-12-30
364 2018-12-31
365 2019-01-01
The answer, as provided by @Datanovice:
pd.to_datetime(df['your column'],errors='ignore')
then inspect the dtype; it should be datetime. If so, just do:
df.loc[df['your column'] > 'your-date']
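For reference, here is a minimal sketch (my example, not from the original post) of the errors='ignore' vs errors='coerce' behaviour described at the start of this answer:
import pandas as pd

s = pd.Series(['2018-01-01', 'John Smith'])
# 'ignore': parsing fails, so the input is returned unchanged (still object dtype)
print(pd.to_datetime(s, errors='ignore'))
# 'coerce': unparseable values become NaT
print(pd.to_datetime(s, errors='coerce'))
# 0   2018-01-01
# 1          NaT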
I have a time column with the format XXXHHMMSS where XXX is the Day of Year. I also have a year column. I want to merge both these columns into one date time object.
Previously I had split XXX off into a new column, but this was making things more complicated.
I've converted the two columns to strings
points['UTC_TIME'] = points['UTC_TIME'].astype(str)
points['YEAR_'] = points['YEAR_'].astype(str)
Then I have the following line:
points['Time'] = pd.to_datetime(points['YEAR_'] * 1000 + points['UTC_TIME'], format='%Y%j%H%M%S')
I'm getting a ValueError: time data '137084552' does not match format '%Y%j%H%M%S' (match)
Here is a photo of my columns and a link to the data
It works fine for me if you combine both columns as strings, e.g.:
import pandas as pd
df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})
pd.to_datetime(df['YEAR_'].astype(str) + df['UTC_TIME'].astype(str).str.zfill(9),
               format="%Y%j%H%M%S")
# 0 2002-04-09 08:25:52
# 1 2002-05-15 08:25:52
# 2 2002-05-26 22:10:12
# dtype: datetime64[ns]
Note: since %j expects a zero-padded day of year, you might need to zero-fill; see the first row in the example above.
I have a dataframe with a time column as strings, and I need to convert it to a timestamp with only h:m:sec.ms. Here is an example:
import pandas as pd
df = pd.DataFrame({'time': ['02:21:18.110']})
df.time = pd.to_datetime(df.time, format="%H:%M:%S.%f")
df # I get 1900-01-01 02:21:18.110
Without the format flag, I get the current day, 2020-12-16. How can I get the stamp without the year-month-day part, which seemingly is always included? Thanks!
If you need to process the values later with some datetime-like methods, it is better to convert them to timedeltas with to_timedelta instead of times:
df['time'] = pd.to_timedelta(df['time'])
print (df)
time
0 0 days 02:21:18.110000
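As a small follow-up sketch (my addition), the timedelta column keeps the .dt accessor, so later processing stays vectorized:
# e.g. total seconds elapsed
print(df['time'].dt.total_seconds())
# 0    8478.11
# Name: time, dtype: float64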
You need this:
df = pd.DataFrame({'time': ['02:21:18.110']})
df['time'] = pd.to_datetime(df['time']).dt.time
In [1023]: df
Out[1023]:
time
0 02:21:18.110000
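One caveat with this approach (my note, not from the original answer): .dt.time returns plain Python datetime.time objects, so the column dtype becomes object and vectorized datetime operations are no longer available:
print(df['time'].dtype)           # object
print(type(df['time'].iloc[0]))   # <class 'datetime.time'>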
I have a file where the date and time are in mixed formats as per below:
Ref_ID Date_Time
5.645217e 2020-12-02 16:23:15
5.587422e 2019-02-25 18:33:24
What I'm trying to do is convert the dates into a standard format so that I can further analyse my dataset.
Expected Outcome:
Ref_ID Date_Time
5.645217e 2020-02-12 16:23:15
5.587422e 2019-02-25 18:33:24
So far I've tried a few things like Pandas to_datetime conversion and converting the date using strptime but none has worked so far.
# Did not work
data["Date_Time"] = pd.to_datetime(data["Date_Time"], errors="coerce")
# Also Did not work
data["Date_Time"] = data["Date_Time"].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%y'))
I've also searched this site for a solution but haven't found one yet.
You could try using str.split to extract the day and month, plus some boolean testing.
This may be a bit confusing with all the variables, but all we are doing is creating new Series and DataFrames to manipulate the day and month of your original date-time column.
import numpy as np
import pandas as pd

# create a new dataframe with the column split on whitespace, so date and time are separated
s = df['Date_Time'].str.split(r'\s', expand=True)
# split the date into its own dataframe of year/month/day parts
m = s[0].str.split('-', expand=True).astype(int)
# use conditional logic to figure out which column is the month and which is the day
m['possible_month'] = np.where(m[1].ge(12), m[2], m[1])
m['possible_day'] = np.where(m[1].ge(12), m[1], m[2])
# concat this back onto the first split to re-create a proper datetime
s[0] = m[0].astype(str).str.cat([m['possible_month'].astype(str),
                                 m['possible_day'].astype(str)], '-')
df['fixed_date'] = pd.to_datetime(s[0].str.cat(s[1].astype(str), ' '),
                                  format='%Y-%m-%d %H:%M:%S')
print(df)
Ref_ID Date_Time fixed_date
0 5.645217e 2020-12-02 16:23:15 2020-02-12 16:23:15
1 5.587422e 2019-02-25 18:33:24 2019-02-25 18:33:24
print(df.dtypes)
Ref_ID object
Date_Time object
fixed_date datetime64[ns]
dtype: object
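An alternative sketch of the same heuristic (my version, reusing the answer's assumption that a middle field of 12 or more must be a day): parse everything strictly, then reparse only the ambiguous rows with day and month swapped. It assumes zero-padded YYYY-MM-DD strings as in the sample:
import pandas as pd

s = df['Date_Time']
parsed = pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S', errors='coerce')
# middle field too big to be a month -> day and month are swapped
swapped = s.str.slice(5, 7).astype(int).ge(12)
df['fixed_date'] = parsed.mask(swapped,
                               pd.to_datetime(s[swapped], format='%Y-%d-%m %H:%M:%S'))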
I've got a data set with 10,000 entries, one variable among others is the birthday. All the entries are unique. I noticed that about 200 entries have 1/1/1900 as birthday. The next frequent date only has a frequency of 4 and the date also doesn't make any sense in this data set. I reckon 1/1/1900 was used as a placeholder since the birthday couldn't be left empty. Long story short, I want to replace the dates of these entries with valid dates using the backfill method.
I changed the column with the birthday to a datetime object:
df['Client Birthdate'] = pd.to_datetime(df['Client Birthdate'], yearfirst=True)
I then tried to use:
timestamp = pd.Timestamp(year=1900, month=1, day=1)
df['Client Birthdate'] = df['Client Birthdate'].replace(to_replace=timestamp, method='bfill')
However, df['Client Birthdate'].describe() still gave me this as output:
count 10000
unique 7897
top 1900-01-01 00:00:00
freq 198
first 1900-01-01 00:00:00
last 1999-12-30 00:00:00
Name: Client Birthdate, dtype: object
So I tried using:
df['Client Birthdate'] = df['Client Birthdate'].replace(to_replace=timestamp, value=False)
df['Client Birthdate'] = df['Client Birthdate'].fillna(method='bfill')
which gave me:
count 10000
unique 7897
top False
freq 198
Name: Client Birthdate, dtype: object
I have no idea why replace/fillna doesn't work; are they not compatible with datetime objects?
Is there also a way to replace all dates 'out-of-range', let's say birthdays before 1920 and after 2001 with valid dates?
I also tried replace, and I think the problem comes from the regex matching. In any case, you can solve it with:
df.loc[df["Client Birthdate"].eq(timestamp), "Client Birthdate"] = np.nan
df["Client Birthdate"] = df["Client Birthdate"].bfill()
I assigned NaT (not a time) where "Client Birthdate" equals the timestamp variable, and then used bfill on the series.
As for your second problem, you can use pandas' between to define a range of acceptable dates. Then, if anything falls outside the range, you can fill the values or replace them with something more sensible.
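A minimal sketch of that second part (my code, assuming the column is already datetime64), using Series.between to keep only birthdays in the acceptable window:
valid = df['Client Birthdate'].between('1920-01-01', '2001-12-31')
df.loc[~valid, 'Client Birthdate'] = pd.NaT   # also catches the 1900-01-01 placeholder
df['Client Birthdate'] = df['Client Birthdate'].bfill()
Note that bfill leaves NaT in any trailing invalid rows, so you may want a final ffill as a fallback.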
I tried to make a simple dataframe:
df_dict = {
    'Client Birthdate': '1/1/1900'
}
df = pd.DataFrame(df_dict, index=[i for i in range(len(df_dict))])
Calling df:
Client Birthdate
0 1/1/1900
Then I used infer_datetime_format within pd.to_datetime():
df['Client Birthdate'] = pd.to_datetime(df['Client Birthdate'], infer_datetime_format=True)
Output of calling df again:
Client Birthdate
0 1900-01-01
And, dtypes:
Client Birthdate datetime64[ns]
dtype: object
However, to get the hour-minute-second-microsecond result into your column, you have to know and set the format using strftime(). Here's a simple example:
pd.to_datetime(df['Client Birthdate'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d %H:%M:%S.%f')
Output:
0 1900-01-01 00:00:00.000000
Name: Client Birthdate, dtype: object
Finally, to update your dates, just subsection the dataframe and set it equal to the date that you want. This example uses .loc, since pandas will probably throw a SettingWithCopyWarning otherwise.
df.loc[df['Client Birthdate'] == '1/1/1900', 'Client Birthdate'] = timestamp
I have my data in the following format:
final.head(5)
(Head of the data, displaying sales for each month from May 2015)
I want to add the last day of the month for each record, and want an output like this:
transactionDate sale_price_after_promo
05/31/2015 30393.8
06/30/2015 24345.68
07/31/2015 26688.91
08/31/2015 46626.1
09/30/2015 27933.84
10/31/2015 76087.55
I tried this
pd.Series(pd.DatetimeIndex(start=final.start_time, end=final.end_time, freq='M')).to_frame('transactionDate')
But I'm getting an error:
'DataFrame' object has no attribute 'start_time'
Create a PeriodIndex and then convert it with to_timestamp:
df = pd.DataFrame({'transactionDate':['2015-05','2015-06','2015-07']})
df['date'] = pd.PeriodIndex(df['transactionDate'], freq='M').to_timestamp(how='end')
print (df)
transactionDate date
0 2015-05 2015-05-31
1 2015-06 2015-06-30
2 2015-07 2015-07-31
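If you prefer plain timestamps end to end, an equivalent sketch (my suggestion) uses a MonthEnd(0) offset, which rolls a date forward to the end of its month:
df['date'] = pd.to_datetime(df['transactionDate']) + pd.offsets.MonthEnd(0)
# 2015-05 -> 2015-05-01 -> 2015-05-31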
I am attempting to dynamically convert all date columns to YYYY-MM-DD format in a dataframe that comes from read_csv. The columns are below.
input
empno,ename,hiredate,report_date,end_date
1,sreenu,17-Jun-2021,18/06/2021,May-22
output
empno,ename,hiredate,report_date,end_date
1,sreenu,2021-06-17,2021-06-18,2022-05-31
The rules are:
if the date is MMM-YY or MM-YYYY (May-22 or 05-2022), then use the last day of that month in YYYY-MM-DD format (2022-05-31)
otherwise, it should be YYYY-MM-DD
Now I want to create a method/function to identify all date-typed columns in the dataframe and then convert them to YYYY-MM-DD (or a user-specified) format.
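No answer was posted for this one; a minimal sketch under the stated rules might look like the following. The function name, the explicit column list, the input file name, and the try-each-format approach are my assumptions:
import pandas as pd

def normalize_dates(df, date_cols):
    # date_cols: the columns to treat as dates; detecting them automatically
    # could be done by test-parsing each object column, omitted here for brevity
    out = df.copy()
    for col in date_cols:
        s = out[col].astype(str)
        # rule 1: MMM-YY or MM-YYYY (e.g. 'May-22', '05-2022') -> last day of that month
        month_year = pd.to_datetime(s, format='%b-%y', errors='coerce')
        month_year = month_year.fillna(pd.to_datetime(s, format='%m-%Y', errors='coerce'))
        month_year = month_year + pd.offsets.MonthEnd(0)
        # rule 2: everything else as a regular day-first date
        full_date = pd.to_datetime(s, dayfirst=True, errors='coerce')
        out[col] = month_year.fillna(full_date).dt.strftime('%Y-%m-%d')
    return out

df = pd.read_csv('input.csv')   # hypothetical file name
print(normalize_dates(df, ['hiredate', 'report_date', 'end_date']))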