Pandas - adding new date column based on parsing other date column

I am trying to take a pandas DataFrame, parse a column of date strings, and add a new column to the DataFrame in a simple mm/dd/yyyy format.
Here are the data and libraries:
import pandas as pd
import datetime
from dateutil.parser import parse
df = pd.DataFrame([['row1', 'Tue Jun 16 19:05:44 UTC 2020', 'record1'], ['row2', 'Tue Jun 16 17:10:02 UTC 2020', 'record2'], ['row3', 'Fri Jun 12 17:52:37 UTC 2020', 'record3']], columns=["row", "checkin", "record"])
By picking up bits and pieces from around here, I crafted this line to parse the dates and add the new column:
df['NewDate'] = df.apply(lambda row: datetime.date.strftime(parse(df['checkin']), "%m/%d/%Y"), axis = 1)
But I get the error below when I run it. Can anyone suggest a fix or an easier way to do this? It seems like it should be simpler and more Pythonic than what I am finding.
TypeError: ('Parser must be a string or character stream, not Series', 'occurred at index 0')
Thanks for any help you can offer.

You could do so without apply:
df['newDate'] = pd.to_datetime(df.checkin).dt.strftime("%m/%d/%Y")
    row                       checkin   record     newDate
0  row1  Tue Jun 16 19:05:44 UTC 2020  record1  06/16/2020
1  row2  Tue Jun 16 17:10:02 UTC 2020  record2  06/16/2020
2  row3  Fri Jun 12 17:52:37 UTC 2020  record3  06/12/2020
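If every input string shares the layout shown in the question, the format can also be spelled out rather than inferred. The sketch below is one way to do that. The format string is an assumption based on the three sample rows (in particular, it treats UTC as literal text), so it would need adjusting for data in other time zones.

```python
import pandas as pd

df = pd.DataFrame(
    [['row1', 'Tue Jun 16 19:05:44 UTC 2020', 'record1'],
     ['row2', 'Tue Jun 16 17:10:02 UTC 2020', 'record2'],
     ['row3', 'Fri Jun 12 17:52:37 UTC 2020', 'record3']],
    columns=["row", "checkin", "record"])

# Assumed layout: weekday, month name, day, time, the literal text "UTC", year
checkin_format = "%a %b %d %H:%M:%S UTC %Y"

# Parse the whole column at once, then render it as mm/dd/yyyy strings
parsed = pd.to_datetime(df["checkin"], format=checkin_format)
df["newDate"] = parsed.dt.strftime("%m/%d/%Y")
```

An explicit format makes a string that does not match raise an error instead of being parsed by guesswork.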

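Inside the lambda, df['checkin'] is the whole column (a Series), which is why parse raises the TypeError; row['checkin'] is the single string for the current row. Just change df['checkin'] to row['checkin'], as below: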
df['NewDate'] = df.apply(lambda row: datetime.date.strftime(parse(row['checkin']), "%m/%d/%Y"), axis = 1)
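Since only one column is involved, a variation on this fix is to call apply on the column itself, so the function receives each string directly and axis=1 is not needed. A minimal sketch using the question's sample data (the helper name to_us_date is made up for illustration):

```python
import pandas as pd
from dateutil.parser import parse

df = pd.DataFrame(
    [['row1', 'Tue Jun 16 19:05:44 UTC 2020', 'record1'],
     ['row2', 'Tue Jun 16 17:10:02 UTC 2020', 'record2'],
     ['row3', 'Fri Jun 12 17:52:37 UTC 2020', 'record3']],
    columns=["row", "checkin", "record"])

def to_us_date(text):
    # parse() receives one string here, not the whole Series
    return parse(text).strftime("%m/%d/%Y")

# Series.apply calls the function once per element of the column
df["NewDate"] = df["checkin"].apply(to_us_date)
```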

Related

How to categorise datetime in to a new column based on the date range in python

I have a column called datetime of type datetime64[ns]; for example, a value looks like 2019-10-27 06:00:00. I would like to create a new column called waves that maps the dates in the datetime column to categorical values. For example:
Before covid: 16th of Nov 2019 until 28th of Feb 2020
First wave: 1st of Mar 2020 until 15th of Jun 2020
Between waves: 16th of Jun 2020 until 30th of Sep 2020
Second wave: 1st of Oct 2020 until 15th of Jan 2021
How do I achieve this in Python, maybe using a loop?
My dataset called df looks like this:
provider fid pid datetime
0 CHE-223 2bfc9a62 2f43d557 2021-09-26T23:18:00
1 CHE-223 fff669e9 295b82e2 2021-08-13T09:10:00
2 CHE-223 8693e564 9df9c555 2021-11-05T20:03:00
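One way to approach this without an explicit loop is pd.cut, which bins values into labeled intervals. The sketch below is only an illustration. The bin edges come from the ranges listed in the question, each interval is assumed to include its start date and to run up to (but not include) the next start date, and the sample rows are made up. Dates outside all four ranges are left as missing values.

```python
import pandas as pd

# Made-up sample rows, one outside the ranges and three inside
df = pd.DataFrame({"datetime": pd.to_datetime([
    "2019-10-27 06:00:00",
    "2020-02-28 23:00:00",
    "2020-04-01 12:00:00",
    "2021-01-15 08:30:00",
])})

# Start of each period, plus the day after the last period ends
edges = pd.to_datetime(
    ["2019-11-16", "2020-03-01", "2020-06-16", "2020-10-01", "2021-01-16"])
labels = ["Before covid", "First wave", "Between waves", "Second wave"]

# right=False makes each bin [start, next_start), so a full end day is included
df["waves"] = pd.cut(df["datetime"], bins=edges, labels=labels, right=False)
```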

Is there a way to convert date (with different format) into a standardized format in python?

I have a column called "date" of dtype object that holds dates in very different formats, such as dd.m.yy, dd.mm.yyyy, dd/mm/yyyy, dd/mm, m/d/yyyy, etc., as shown below. Simply using df['date'] = pd.to_datetime(df['date']) does not work. For messy date values like these, is there any way to standardize and convert the dates into a single format?
date
17.2.22 # means Feb 17 2022
23.02.22 # means Feb 23 2022
17/02/2022 # means Feb 17 2022
18.2.22 # means Feb 18 2022
2/22/2022 # means Feb 22 2022
3/1/2022 # means March 1 2022
<more messy different format>

Coerce the dates to datetime and allow invalid entries to be turned into nulls. Also, allow pandas to infer the format. Code below:
df['date'] = pd.to_datetime(df['date'], errors='coerce',infer_datetime_format=True)
date
0 2022-02-17
1 2022-02-23
2 2022-02-17
3 2022-02-18
4 2022-02-22
5 2022-03-01

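Based on wwnde's solution, the following works on my real dataset:
# Replace missing values with empty strings and make sure every entry is a str
df['date'] = df['date'].fillna('').astype(str)
# regex=False so that '.' is replaced literally rather than treated as a regex wildcard
df['date new'] = df['date'].str.replace('.', '/', regex=False)
df['date new'] = pd.to_datetime(df['date new'], errors='coerce', infer_datetime_format=True)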
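As far as I am aware, infer_datetime_format is deprecated in pandas 2.0 and later, so an alternative worth considering is to try a short list of explicit formats in order. The sketch below is an assumption-laden illustration, not part of the answers above: the helper name parse_with_formats and the format list are made up, and the order of the formats decides how an ambiguous value such as 3/1/2022 is read (here month-first wins, matching the meanings stated in the question).

```python
import pandas as pd

def parse_with_formats(values, formats):
    """Try each format in order; keep the first successful parse per value."""
    result = None
    for fmt in formats:
        # Values that do not match this format become NaT instead of raising
        parsed = pd.to_datetime(values, format=fmt, errors="coerce")
        # Only fill the entries that earlier formats could not parse
        result = parsed if result is None else result.fillna(parsed)
    return result

dates = pd.Series(
    ["17.2.22", "23.02.22", "17/02/2022", "18.2.22", "2/22/2022", "3/1/2022"])

# Assumed priority: dotted day-first, then slashed month-first, then slashed day-first
formats = ["%d.%m.%y", "%m/%d/%Y", "%d/%m/%Y"]
standardized = parse_with_formats(dates, formats)
```

Anything that matches none of the formats stays NaT, which makes the leftover values easy to inspect with standardized.isna().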

Sorting the date column on a pandas dataframe that is not in date time format and has format mmm dd, yyyy

Hi all, I am working with a pandas DataFrame that contains a date column. I would like to sort this column by date in ascending order, meaning that the most recent date is at the bottom of the DataFrame. The problem I am running into is that the date column displays the dates in the following format:
"Nov 3, 2020"
How can I sort these dates? The advice I have found online is to convert the column to a datetime format, sort it, and then change it back to this format. Is there a simpler way to do this? I have tried this:
new_df.sort_values(by=["Date"],ascending=True)
where new_df is the DataFrame, but this does not seem to work.
Any ideas on how I can do this? I essentially want the output to look something like this:
Date
----
Oct 31, 2020
Nov 1, 2020
Nov 12,2020
.
.
.

I would reformat the date column first, then convert to datetime, and then sort:
import pandas as pd
dates = ['Nov 1, 2020','Nov 12,2020','Oct 31, 2020']
df = pd.DataFrame({'Date':dates, 'Col1':[2,3,1]})
# Date Col1
# 0 Nov 1, 2020 2
# 1 Nov 12,2020 3
# 2 Oct 31, 2020 1
df['Date'] = pd.to_datetime(df['Date'].apply(lambda x: "-".join(x.replace(',',' ').split())))
df = df.sort_values('Date')
# Date Col1
# 2 2020-10-31 1
# 0 2020-11-01 2
# 1 2020-11-12 3
# and if you want to get the dates back in their original format
df['Date'] = df['Date'].apply(lambda x: "{} {}, {}".format(x.month_name()[:3],x.day,x.year))
# Date Col1
# 2 Oct 31, 2020 1
# 0 Nov 1, 2020 2
# 1 Nov 12, 2020 3

df.sort_values(by = "Date", key = pd.to_datetime)
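The key argument of sort_values (available since pandas 1.1) is applied to the column only for the purpose of ordering, so the original strings are kept as they are. A small sketch of that idea, assuming every value follows the "Nov 3, 2020" layout exactly; the helper name as_datetime and the sample data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["Nov 1, 2020", "Nov 12, 2020", "Oct 31, 2020"],
    "Col1": [2, 3, 1],
})

def as_datetime(column):
    # Assumed layout: abbreviated month name, day, comma, four-digit year
    return pd.to_datetime(column, format="%b %d, %Y")

# The key is used for comparison only; the Date column stays as strings
sorted_df = df.sort_values(by="Date", key=as_datetime)
```

sort_values returns a new DataFrame, so the result has to be assigned (or the call made with inplace=True) for the new order to be kept.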

Calculating Elapsed Days From Pandas Dataframe Strings

I have a pandas DataFrame that stores the travel dates of people. I'd like to add a column that shows the length of the stay. To do this, each string needs to be parsed, converted to datetimes, and subtracted. Pandas seems to be treating the datetime conversion as a whole Series and not as individual strings, as I get TypeError: must be string, not Series. I'd like to do this without looping, as the actual dataset is quite large, but I need a bit of help.
import pandas as pd
from datetime import datetime
df = pd.DataFrame(data=[['Bob', '12 Mar 2015 - 31 Mar 2015'], ['Jessica', '27 Mar 2015 - 31 Mar 2015']], columns=['Names', 'Day of Visit'])
df['Length of Stay'] = (datetime.strptime(df['Day of Visit'][:11], '%d %b %Y') - datetime.strptime(df['Day of Visit'][-11:], '%d %b %Y')).days + 1
print df
Desired Output:
Names Day of Visit Length of Stay
0 Bob 12 Mar 2015 - 31 Mar 2015 20
1 Jessica 27 Mar 2015 - 31 Mar 2015 5

Use Series.str.extract to split the Day of Visit column into two separate columns.
Then use pd.to_datetime to parse the columns as dates.
Computing the Length of Stay can then be done by subtracting the date columns and adding 1:
import pandas as pd
df = pd.DataFrame(data=[['Bob', '12 Mar 2015 - 31 Mar 2015'], ['Jessica', '27 Mar 2015 - 31 Mar 2015']], columns=['Names', 'Day of Visit'])
tmp = df['Day of Visit'].str.extract(r'([^-]+)-(.*)', expand=True).apply(pd.to_datetime)
df['Length of Stay'] = (tmp[1] - tmp[0]).dt.days + 1
print(df)
yields
Names Day of Visit Length of Stay
0 Bob 12 Mar 2015 - 31 Mar 2015 20
1 Jessica 27 Mar 2015 - 31 Mar 2015 5
The regex pattern ([^-]+)-(.*) means
( # start group #1
[ # begin character class
^- # any character except a literal minus sign `-`
] # end character class
+ # match 1-or-more characters from the character class
) # end group #1
- # match a literal minus sign
( # start group #2
.* # match 0-or-more of any character
) # end group #2
.str.extract returns a DataFrame with the matching text from groups #1 and #2 in columns.
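If the separator is always the exact text " - ", the same split can be sketched without a regular expression by using Series.str.split. This is a variation on the answer above, not a replacement for it, and it assumes every row has exactly one separator and dates in the "12 Mar 2015" layout:

```python
import pandas as pd

df = pd.DataFrame(
    data=[['Bob', '12 Mar 2015 - 31 Mar 2015'],
          ['Jessica', '27 Mar 2015 - 31 Mar 2015']],
    columns=['Names', 'Day of Visit'])

# expand=True returns a DataFrame with one column per piece (0 = start, 1 = end)
parts = df['Day of Visit'].str.split(' - ', expand=True)
start = pd.to_datetime(parts[0], format='%d %b %Y')
end = pd.to_datetime(parts[1], format='%d %b %Y')

# Add 1 so that both the arrival day and the departure day are counted
df['Length of Stay'] = (end - start).dt.days + 1
```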

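Solution:
from datetime import datetime
def length_of_stay(x):
    # Split '12 Mar 2015 - 31 Mar 2015' into its start and end dates
    start, end = [datetime.strptime(d, '%d %b %Y') for d in x.split(' - ')]
    # Count both the first and the last day of the stay
    return (end - start).days + 1
df['Length of Stay'] = df['Day of Visit'].apply(length_of_stay)
print(df)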

PYTHON (Jython) how to get DATE TIME value in string - all before specific string?

I have a string which contains a timestamp:
Wed Apr 24 14:39:49 CEST 2013
Of course, I have similar values in many records (with CEST 2012, 2013, 2014, ...), so I want to get everything before CEST, plus the year that follows CEST. I also want to drop the day-of-week information.
For example, I want these results:
2013 Apr 24 14:39:49
2013 Apr 26 14:39:49
What methods should I use?
Thank you

from dateutil import parser
dt = parser.parse('Wed Apr 24 14:39:49 CEST 2013')
dt is a datetime object you can use/format any way you want. For example:
dt.strftime('%Y %b %d %H:%M:%S')
# returns '2013 Apr 24 14:39:49'
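If dateutil is not available (which may be the case in some Jython setups), the same rearrangement can be sketched with plain string operations. This assumes every record has exactly six whitespace-separated fields in the order shown in the question; the function name reorder_timestamp is made up for illustration:

```python
def reorder_timestamp(raw):
    # Expected fields: weekday, month, day, time, time zone, year
    weekday, month, day, time_part, tz, year = raw.split()
    # Drop the weekday and time zone, and move the year to the front
    return "{0} {1} {2} {3}".format(year, month, day, time_part)

result = reorder_timestamp('Wed Apr 24 14:39:49 CEST 2013')
```

Unlike the dateutil version, this does no validation of the date itself; a record with a different number of fields raises a ValueError at the unpacking step.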
