Pandas - adding new date column based on parsing other date column - python

I am trying to take a Pandas dataframe, parse a column that represents dates and add a new column to the dataframe with a simple mm/dd/yyyy format.
Here is the data and libraries:
import pandas as pd
import datetime
from dateutil.parser import parse
df = pd.DataFrame([['row1', 'Tue Jun 16 19:05:44 UTC 2020', 'record1'], ['row2', 'Tue Jun 16 17:10:02 UTC 2020', 'record2'], ['row3', 'Fri Jun 12 17:52:37 UTC 2020', 'record3']], columns=["row", "checkin", "record"])
From picking bits and pieces from around here I crafted this line to parse and add the new column of data:
df['NewDate'] = df.apply(lambda row: datetime.date.strftime(parse(df['checkin']), "%m/%d/%Y"), axis = 1)
But I get the error below when I run it. Can anyone suggest a fix or an easier way to do this? It seems like it should be simpler and more Pythonic than what I am finding.
TypeError: ('Parser must be a string or character stream, not Series', 'occurred at index 0')
Thanks for any help you can offer.

You could do this without apply:
df['newDate'] = pd.to_datetime(df.checkin).dt.strftime("%m/%d/%Y")
    row                       checkin   record     newDate
0  row1  Tue Jun 16 19:05:44 UTC 2020  record1  06/16/2020
1  row2  Tue Jun 16 17:10:02 UTC 2020  record2  06/16/2020
2  row3  Fri Jun 12 17:52:37 UTC 2020  record3  06/12/2020

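Just change df['checkin'] to row['checkin'], as below. With axis=1 the lambda receives one row at a time, so row['checkin'] is a single string, whereas df['checkin'] is the whole column (a Series), which parse cannot handle: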
df['NewDate'] = df.apply(lambda row: datetime.date.strftime(parse(row['checkin']), "%m/%d/%Y"), axis = 1)
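
If the layout of the input strings is known up front, passing it explicitly avoids any format guessing. The following is a minimal sketch, assuming every checkin value follows exactly the layout shown in the question (English day and month abbreviations, and the literal text UTC); the name fmt is only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["row1", "Tue Jun 16 19:05:44 UTC 2020", "record1"],
        ["row2", "Tue Jun 16 17:10:02 UTC 2020", "record2"],
        ["row3", "Fri Jun 12 17:52:37 UTC 2020", "record3"],
    ],
    columns=["row", "checkin", "record"],
)

# Assumed layout of the checkin strings; "UTC" is matched as literal text.
fmt = "%a %b %d %H:%M:%S UTC %Y"

# Parse the whole column at once, then format it as mm/dd/yyyy strings.
parsed = pd.to_datetime(df["checkin"], format=fmt)
df["NewDate"] = parsed.dt.strftime("%m/%d/%Y")
```

Because "UTC" is treated as plain text here, the parsed values are timezone-naive, which is enough when only the calendar date is needed.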

Related

How to categorise datetime in to a new column based on the date range in python

I have a column called datetime of type datetime64[ns]; for example, a value looks like 2019-10-27 06:00:00. I would like to create a new column called waves that assigns each value in the datetime column to a category based on the date interval it falls in. For example:
Before covid: 16th of Nov 2019 until 28th of Feb 2020
First wave: 1st of Mar 2020 until 15th of Jun 2020
Between waves: 16th of Jun 2020 until 30th of Sep 2020
Second wave: 1st of Oct 2020 until 15th of Jan 2021
How do I achieve this in Python, maybe using a loop?
My dataset called df looks like this:
  provider       fid       pid             datetime
0  CHE-223  2bfc9a62  2f43d557  2021-09-26T23:18:00
1  CHE-223  fff669e9  295b82e2  2021-08-13T09:10:00
2  CHE-223  8693e564  9df9c555  2021-11-05T20:03:00
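
One possible approach, sketched under a few assumptions: a short loop over the wave definitions that applies a boolean mask per interval. The sample dates below are made up for illustration, the label "Outside range" for unmatched rows is my own choice, and each interval uses an exclusive upper bound set to the day after the stated end date so that times during the last day are included. Note that this puts 29 Feb 2020 into "Before covid", which the question leaves unspecified.

```python
import pandas as pd

# Hypothetical sample data, not the asker's real dataset.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2019-12-01 06:00:00",
        "2020-02-29 10:00:00",
        "2020-04-10 12:30:00",
        "2020-07-04 09:00:00",
        "2020-11-20 18:45:00",
        "2021-09-26 23:18:00",
    ])
})

# (label, inclusive start, exclusive end)
waves = [
    ("Before covid", "2019-11-16", "2020-03-01"),
    ("First wave", "2020-03-01", "2020-06-16"),
    ("Between waves", "2020-06-16", "2020-10-01"),
    ("Second wave", "2020-10-01", "2021-01-16"),
]

# Default label for rows that fall outside every interval.
df["waves"] = "Outside range"
for label, start, end in waves:
    mask = (df["datetime"] >= pd.Timestamp(start)) & (df["datetime"] < pd.Timestamp(end))
    df.loc[mask, "waves"] = label
```

The loop runs once per interval, not once per row, so it stays fast on large frames. The three rows shown in the question all fall after 15 Jan 2021, so they would keep the default label.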

Is there a way to convert dates (with different formats) into a standardized format in Python?

I have a column called "date" which has dtype object and contains very different date formats, such as dd.m.yy, dd.mm.yyyy, dd/mm/yyyy, dd/mm, m/d/yyyy etc., as shown below. Simply using df['date'] = pd.to_datetime(df['date']) does not work. For messy date values like these, is there any way to standardize them and convert them into one single format?
date
17.2.22 # means Feb 17 2022
23.02.22 # means Feb 23 2022
17/02/2022 # means Feb 17 2022
18.2.22 # means Feb 18 2022
2/22/2022 # means Feb 22 2022
3/1/2022 # means March 1 2022
<more messy different format>
Coerce the dates to datetime, letting invalid entries become nulls (NaT), and let pandas infer the format:
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
date
0 2022-02-17
1 2022-02-23
2 2022-02-17
3 2022-02-18
4 2022-02-22
5 2022-03-01
Based on wwnde's solution, the following works on my real dataset:
df['date'] = df['date'].fillna('').astype('str')
# regex=False so that '.' is treated as a literal dot, not as "any character"
df['date new'] = df['date'].str.replace('.', '/', regex=False)
df['date new'] = pd.to_datetime(df['date new'], errors='coerce', infer_datetime_format=True)
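
The infer_datetime_format argument is deprecated as of pandas 2.0, so on newer versions an alternative is to try a list of explicit formats in turn and keep the first one that parses. This is only a sketch: the candidate formats and their order are my assumptions, chosen so that ambiguous values such as 3/1/2022 are read month-first, which matches the output above.

```python
import pandas as pd

raw = pd.Series(["17.2.22", "23.02.22", "17/02/2022", "18.2.22", "2/22/2022", "3/1/2022"])

# Candidate formats, tried in order; earlier entries win for ambiguous values.
formats = ["%m/%d/%Y", "%d/%m/%Y", "%d.%m.%y"]

# Parse with the first format; values that do not match become NaT.
parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    # Fill only the still-missing values using the next candidate format.
    parsed = parsed.fillna(pd.to_datetime(raw, format=fmt, errors="coerce"))

standardized = parsed.dt.strftime("%Y-%m-%d")
```

Anything that matches none of the formats stays NaT, which makes the leftover problem rows easy to find with parsed.isna().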

Sorting the date column on a pandas dataframe that is not in date time format and has format mmm dd, yyyy

Hi all, I am working with a pandas dataframe that contains a date column. I would like to sort this column by date in ascending order, meaning that the most recent date is at the bottom of the dataframe. The problem I am running into is that the date column displays the dates in the following format:
"Nov 3, 2020"
How can I sort these dates? The advice I have found online is to convert the column to datetime, sort it, and then convert it back to this format. Is there a simpler way to do this? I have tried this:
new_df.sort_values(by=["Date"],ascending=True)
where new_df is the dataframe, but this does not seem to work.
Any ideas on how I can do this? I essentially want the output to look something like:
Date
----
Oct 31, 2020
Nov 1, 2020
Nov 12,2020
.
.
.
I would reformat the date column first, then convert to datetime, and then sort:
dates = ['Nov 1, 2020','Nov 12,2020','Oct 31, 2020']
df = pd.DataFrame({'Date':dates, 'Col1':[2,3,1]})
# Date Col1
# 0 Nov 1, 2020 2
# 1 Nov 12,2020 3
# 2 Oct 31, 2020 1
df['Date'] = pd.to_datetime(df['Date'].apply(lambda x: "-".join(x.replace(',',' ').split())))
df = df.sort_values('Date')
# Date Col1
# 2 2020-10-31 1
# 0 2020-11-01 2
# 1 2020-11-12 3
# and if you want to get the dates back in their original format
df['Date'] = df['Date'].apply(lambda x: "{} {}, {}".format(x.month_name()[:3],x.day,x.year))
# Date Col1
# 2 Oct 31, 2020 1
# 0 Nov 1, 2020 2
# 1 Nov 12, 2020 3
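Alternatively, pass a key function so that the sort uses the parsed dates while the column keeps its original strings (the key parameter requires pandas 1.1 or later):
df.sort_values(by = "Date", key = pd.to_datetime)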
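
A small self-contained sketch of sorting with a key function follows. It assumes every value follows the same "Mon d, yyyy" layout with a space after the comma (unlike the "Nov 12,2020" entry above, which would first need normalizing), and it passes an explicit format instead of relying on inference:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["Nov 1, 2020", "Nov 12, 2020", "Oct 31, 2020"],
    "Col1": [2, 3, 1],
})

# The key function receives the whole column as a Series and must return
# a Series of sortable values; the Date column itself stays as strings.
sorted_df = df.sort_values(
    by="Date",
    key=lambda s: pd.to_datetime(s, format="%b %d, %Y"),
)
```

Since only the sort key is converted, there is no need to format the dates back afterwards.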

Calculating Elapsed Days From Pandas Dataframe Strings

I have a Pandas dataframe that stores travel dates of people. I'd like to add a column that shows the length of the stay. To do this, each string needs to be parsed, converted to datetimes, and the two dates subtracted. Pandas seems to be treating the datetime conversion as applying to the whole Series rather than to individual strings, as I get TypeError: must be string, not Series. I'd like to do this without looping, as the actual dataset is quite large, but I need a bit of help.
import pandas as pd
from datetime import datetime
df = pd.DataFrame(data=[['Bob', '12 Mar 2015 - 31 Mar 2015'], ['Jessica', '27 Mar 2015 - 31 Mar 2015']], columns=['Names', 'Day of Visit'])
df['Length of Stay'] = (datetime.strptime(df['Day of Visit'][:11], '%d %b %Y') - datetime.strptime(df['Day of Visit'][-11:], '%d %b %Y')).days + 1
print df
Desired Output:
     Names               Day of Visit  Length of Stay
0      Bob  12 Mar 2015 - 31 Mar 2015              20
1  Jessica  27 Mar 2015 - 31 Mar 2015               5
Use Series.str.extract to split the Day of Visit column into two separate columns.
Then use pd.to_datetime to parse the columns as dates.
Computing the Length of Stay can then be done by subtracting the date columns and adding 1:
import pandas as pd
df = pd.DataFrame(data=[['Bob', '12 Mar 2015 - 31 Mar 2015'], ['Jessica', '27 Mar 2015 - 31 Mar 2015']], columns=['Names', 'Day of Visit'])
tmp = df['Day of Visit'].str.extract(r'([^-]+)-(.*)', expand=True).apply(pd.to_datetime)
df['Length of Stay'] = (tmp[1] - tmp[0]).dt.days + 1
print(df)
yields
     Names               Day of Visit  Length of Stay
0      Bob  12 Mar 2015 - 31 Mar 2015              20
1  Jessica  27 Mar 2015 - 31 Mar 2015               5
The regex pattern ([^-]+)-(.*) means:
(       # start group #1
  [     # begin character class
    ^-  # any character except a literal minus sign `-`
  ]     # end character class
  +     # match 1-or-more characters from the character class
)       # end group #1
-       # match a literal minus sign
(       # start group #2
  .*    # match 0-or-more of any character
)       # end group #2
.str.extract returns a DataFrame with the matching text from groups #1 and #2 in columns.
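
As a variation on the same idea, the column can also be split on the literal " - " separator instead of using a regex. A minimal sketch, assuming every value contains exactly one " - " and both dates use the "dd Mon yyyy" layout:

```python
import pandas as pd

df = pd.DataFrame(
    data=[["Bob", "12 Mar 2015 - 31 Mar 2015"], ["Jessica", "27 Mar 2015 - 31 Mar 2015"]],
    columns=["Names", "Day of Visit"],
)

# Split into two columns (0 = start, 1 = end) on the " - " separator.
parts = df["Day of Visit"].str.split(" - ", expand=True)
start = pd.to_datetime(parts[0], format="%d %b %Y")
end = pd.to_datetime(parts[1], format="%d %b %Y")

# Inclusive day count: both the first and the last day are counted.
df["Length of Stay"] = (end - start).dt.days + 1
```

Both the split and the date parsing are column-wise operations, so no Python-level loop over the rows is involved.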
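Solution using apply:
from datetime import datetime
def length_of_stay(x):
    start, end = [datetime.strptime(d, '%d %b %Y') for d in x.split(' - ')]
    # inclusive day count, matching the desired output (20 and 5)
    return (end - start).days + 1
df['Length of Stay'] = df['Day of Visit'].apply(length_of_stay)
print(df)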

PYTHON (Jython) how to get DATE TIME value in string - all before specific string?

I have a string which contains a timestamp:
Wed Apr 24 14:39:49 CEST 2013
I have similar values in many records (with CEST 2012, 2013, 2014, ...), so I want to extract everything before CEST, plus the year that follows CEST. I also want to drop the day-of-week information.
For example, I want results like:
2013 Apr 24 14:39:49
2013 Apr 26 14:39:49
What methods should I use?
Thank you
from dateutil import parser
dt = parser.parse('Wed Apr 24 14:39:49 CEST 2013')
dt is a datetime object you can use/format any way you want. For example:
dt.strftime('%Y %b %d %H:%M:%S')
# returns '2013 Apr 24 14:39:49'
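
If dateutil is not available (for instance on an older Jython install), plain string splitting can do the reordering without parsing the date at all. A minimal sketch, assuming every record has exactly six whitespace-separated fields in the order shown in the question; reorder_timestamp is a hypothetical helper name:

```python
def reorder_timestamp(ts):
    # Fields: weekday, month, day, time, timezone, year
    _weekday, month, day, time_part, _tz, year = ts.split()
    # Keep year, month, day and time; drop the weekday and the timezone.
    return " ".join([year, month, day, time_part])


result = reorder_timestamp("Wed Apr 24 14:39:49 CEST 2013")
# result == '2013 Apr 24 14:39:49'
```

This does no validation: a record with a different number of fields raises ValueError at the unpacking step, which at least makes malformed rows visible.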
