Pandas DF convert date string to date year and month [duplicate] - python

Hi all, I have a column in a dataframe that looks like:
print(df['Date']):
29-Nov-16
4-Dec-16
1-Oct-16
30-Nov-19
30-Jun-20
28-Apr-16
24-May-16
And I am trying to get an output that looks like:
print(df):
Date Month Year
29-Nov-16 Nov 2016
4-Dec-16 Dec 2016
1-Oct-16 Oct 2016
30-Nov-19 Nov 2019
30-Jun-20 Jun 2020
28-Apr-16 Apr 2016
24-May-16 May 2016
I have tried the following:
df['Month'] = pd.datetime(df['Date']).month
df['Year'] = pd.datetime(df['Date']).year
but am getting a TypeError: cannot convert the series to <class 'int'>
Any ideas or references to help out?
Thanks!

Use strftime and str.split and assign them to new columns
df_final = df.assign(**pd.to_datetime(df['Date']).dt.strftime('%b-%Y')
.str.split('-', expand=True)
.set_axis(['Month','Year'], axis=1))
Out[32]:
Date Month Year
0 29-Nov-16 Nov 2016
1 4-Dec-16 Dec 2016
2 1-Oct-16 Oct 2016
3 30-Nov-19 Nov 2019
4 30-Jun-20 Jun 2020
5 28-Apr-16 Apr 2016
6 24-May-16 May 2016

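You are missing the .dt accessor, and the parsing function is pd.to_datetime, not pd.datetime.
Try this:
df['Month'] = pd.to_datetime(df['Date']).dt.month
df['Year'] = pd.to_datetime(df['Date']).dt.year
Note that .dt.month returns the month number (11 for Nov); use .dt.strftime('%b') if you want the abbreviated name.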
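A minimal sketch of the whole conversion, assuming the strings always follow the day-month-year layout shown in the question and use English month abbreviations. The explicit format string is an assumption about the data, not something the question states:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["29-Nov-16", "4-Dec-16", "1-Oct-16", "30-Nov-19"]})

# Parse once with an explicit format: day, abbreviated month name, 2-digit year.
# Assumes English month abbreviations (the %b directive is locale-dependent).
parsed = pd.to_datetime(df["Date"], format="%d-%b-%y")

df["Month"] = parsed.dt.strftime("%b")  # abbreviated month name, e.g. "Nov"
df["Year"] = parsed.dt.year             # four-digit year as an integer

print(df)
```

Parsing once and reusing the result avoids converting the column twice, and leaves the original Date strings untouched.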

Related

Combine month name and year in a column pandas python

df
Year Month Name Avg
2015 Jan 12
2015 Feb 13.4
2015 Mar 10
...................
2019 Jan 11
2019 Feb 11
Code
df['Month Name-Year']= pd.to_datetime(df['Month Name'].astype(str)+df['Year'].astype(str),format='%b%Y')
In the dataframe, df, the groupby output avg is on keys month name and year. So month name and year are actually multilevel indices. I want to create a third column Month Name Year so that I can do some operation (create plots etc) using the data.
The output I am getting using the code is as below:
Year Month Name Avg Month Name-Year
2015 Jan 12 2015-01-01
2015 Feb 13.4 2015-02-01
2015 Mar 10 2015-03-01
...................
2019 Nov 11 2019-11-01
2019 Dec 11 2019-12-01
and so on.
The output I want is 2015-Jan, 2015-Feb etc in Month Name-Year column...or I want 2015-01, 2015-02...2019-11, 2019-12 etc (only year and month, no days).
Please help
One solution is to convert to datetimes and then change the format with Series.dt.to_period or Series.dt.strftime:
df['Month Name-Year']=pd.to_datetime(df['Month Name']+df['Year'].astype(str),format='%b%Y')
#for month periods
df['Month Name-Year1'] = df['Month Name-Year'].dt.to_period('M')
#for 2010-02 format
df['Month Name-Year2'] = df['Month Name-Year'].dt.strftime('%Y-%m')
The simplest solution skips the datetime conversion: convert the years to strings and join them to the month names with -:
#format 2010-Feb
df['Month Name-Year3'] = df['Year'].astype(str) + '-' + df['Month Name']
...which gives the same result as converting to datetimes and then formatting them as custom strings:
#format 2010-Feb
df['Month Name-Year31'] = df['Month Name-Year'].dt.strftime('%Y-%b')
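Since the question says Year and Month Name are index levels coming out of a groupby, they may need to become ordinary columns first. A rough sketch, assuming a small made-up sample and a hypothetical result column named `Year-Month`:

```python
import pandas as pd

# Hypothetical stand-in for the groupby result: a Series indexed by
# (Year, Month Name), as described in the question.
avg = pd.Series(
    [12, 13.4, 11],
    index=pd.MultiIndex.from_tuples(
        [(2015, "Jan"), (2015, "Feb"), (2019, "Dec")],
        names=["Year", "Month Name"],
    ),
    name="Avg",
)

# Turn the index levels into regular columns so they can be combined.
df = avg.reset_index()

# Plain string join, no datetime conversion needed: 2015-Jan
df["Year-Month"] = df["Year"].astype(str) + "-" + df["Month Name"]

# Via datetimes, for a numeric month or a real monthly period.
dates = pd.to_datetime(df["Month Name"] + df["Year"].astype(str), format="%b%Y")
df["Year-Month (numeric)"] = dates.dt.strftime("%Y-%m")  # 2015-01
df["Period"] = dates.dt.to_period("M")                   # monthly Period

print(df)
```

The string versions are enough for labels, while the period column sorts chronologically, which tends to matter for plots.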

Pivot and rename Pandas dataframe

I have a dataframe in the format
Date Datediff Cumulative_sum
01 January 2019 1 5
02 January 2019 1 7
02 January 2019 2 15
01 January 2019 2 8
01 January 2019 3 13
and I want to pivot the column Datediff from the dataframe such that the end result looks like
Index Day-1 Day-2 Day-3
01 January 2019 5 8 13
02 January 2019 7 15
I have used the pivot_table command such that
pt = pd.pivot_table(df, index = "Date",
                    columns = "Datediff",
                    values = "Cumulative_sum") \
       .reset_index() \
       .set_index("Date")
which returns the pivoted table
1 2 3
01 January 2019 5 8 13
02 January 2019 7 15
And I can then rename the columns using the loop
for column in pt:
    pt.rename(columns = {column : "Day-" + str(column)}, inplace = True)
which returns exactly what I want. However, I was wondering if there is a faster way to rename the columns when pivoting and get rid of the loop altogether.
Use DataFrame.add_prefix on the pivoted table:
pt.add_prefix('Day-')
In your solution:
pt = (pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum")
.add_prefix('Day-'))
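For reference, a self-contained sketch of the pivot-and-prefix chain, using the sample rows from the question. The second variant, with a renaming function, is just an alternative for cases where a plain prefix is not flexible enough:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01 January 2019", "02 January 2019", "02 January 2019",
             "01 January 2019", "01 January 2019"],
    "Datediff": [1, 1, 2, 2, 3],
    "Cumulative_sum": [5, 7, 15, 8, 13],
})

# Pivot and rename in one chain; add_prefix prepends the string to every column label.
pt = (df.pivot_table(index="Date", columns="Datediff", values="Cumulative_sum")
        .add_prefix("Day-"))

# Equivalent renaming with a function applied to each column label.
pt_alt = (df.pivot_table(index="Date", columns="Datediff", values="Cumulative_sum")
            .rename(columns=lambda c: f"Day-{c}"))

print(pt)
```

Note that pivot_table aggregates with the mean by default, so the values come back as floats, and the missing Day-3 entry for 02 January 2019 is NaN.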

Return dataframe with range of dates

I need a Python function that returns a Pandas DataFrame with a range of dates, year and month only, for example from November 2016 to March 2017, with this as the result:
year month
2016 11
2016 12
2017 01
2017 02
2017 03
My dates are in string format Y-m (from = '2016-11', to = '2017-03'). I'm not sure whether to convert them to datetime type or to split them into two separate integer values.
Any ideas on how to achieve it properly?
Are you looking for something like this?
pd.date_range('November 2016', 'April 2017', freq = 'M')
You get
DatetimeIndex(['2016-11-30', '2016-12-31', '2017-01-31', '2017-02-28',
'2017-03-31'],
dtype='datetime64[ns]', freq='M')
To get a dataframe:
index = pd.date_range('November 2016', 'April 2017', freq = 'M')
df = pd.DataFrame(index = index)
pd.Series(pd.date_range('2016-11', '2017-4', freq='M').strftime('%Y-%m')) \
.str.split('-', expand=True) \
.rename(columns={0: 'year', 1: 'month'})
year month
0 2016 11
1 2016 12
2 2017 01
3 2017 02
4 2017 03
You can use a combination of pd.to_datetime and pd.date_range.
import pandas as pd
start = 'November 2016'
end = 'March 2017'
s = pd.Series(pd.date_range(*(pd.to_datetime([start, end]) \
+ pd.offsets.MonthEnd()), freq='1M'))
Construct a dataframe using the .dt accessor attributes.
df = pd.DataFrame({'year' : s.dt.year, 'month' : s.dt.month})
df
month year
0 11 2016
1 12 2016
2 1 2017
3 2 2017
4 3 2017
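Another possible approach works on monthly periods rather than month-end timestamps, which sidesteps the end-date adjustment. This is only a sketch: `month_range` is a hypothetical helper name, and it assumes the inputs are 'YYYY-MM' strings as in the question:

```python
import pandas as pd

def month_range(start, end):
    """Return one row per month from start to end, both inclusive.

    start and end are 'YYYY-MM' strings (an assumption based on the question).
    """
    periods = pd.period_range(start=start, end=end, freq="M")
    return pd.DataFrame({"year": periods.year, "month": periods.month})

out = month_range("2016-11", "2017-03")

# Optional: zero-padded month strings, matching the layout in the question.
out["month_padded"] = out["month"].map("{:02d}".format)

print(out)
```

Keeping year and month as integers makes later filtering and sorting straightforward; the padded column is only for display.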

Cleaning inconsistent date formatting in pandas dataframe

I have a very large dataframe in which one column, ['date'], holds datetimes that are still stored as strings, formatted as below. The time is sometimes displayed as hh:mm:ss and sometimes as h:mm:ss (for hours 9 and earlier).
Tue Mar 1 9:23:58 2016
Tue Mar 1 9:29:04 2016
Tue Mar 1 9:42:22 2016
Tue Mar 1 09:43:50 2016
pd.to_datetime() won't work when I'm trying to convert the string into datetime format so I was hoping to find some help in getting 0's in front of the time where missing.
Any help is greatly appreciated!
import pandas as pd
date_stngs = ('Tue Mar 1 9:23:58 2016','Tue Mar 1 9:29:04 2016','Tue Mar 1 9:42:22 2016','Tue Mar 1 09:43:50 2016')
a = pd.Series([pd.to_datetime(date) for date in date_stngs])
print(a)
output
0 2016-03-01 09:23:58
1 2016-03-01 09:29:04
2 2016-03-01 09:42:22
3 2016-03-01 09:43:50
time = df[0].str.split(' ').str.get(3).str.split('').str.get(0).str.strip().str[:8]
year = df[0].str.split('--').str.get(0).str[-5:].str.strip()
daynmonth = df[0].str[:10].str.strip()
df_1['date'] = daynmonth + ' ' +year + ' ' + time
df_1['date'] = pd.to_datetime(df_1['date'])
Found this to work myself when rearranging the order
Assuming you have a one-column DataFrame with strings as above and the column name is 0, the following splits the strings on whitespace, takes the fourth token (index 3, the time) and zero-fills it to eight characters with zfill.
Assuming starting df
0
0 Tue Mar 1 9:23:58 2016
1 Tue Mar 1 9:29:04 2016
2 Tue Mar 1 9:42:22 2016
3 Tue Mar 1 09:43:50 2016
df1 = df[0].str.split(expand=True)
df1[3] = df1[3].str.zfill(8)
pd.to_datetime(df1.apply(lambda x: ' '.join(x.tolist()), axis=1))
Output
0 2016-03-01 09:23:58
1 2016-03-01 09:29:04
2 2016-03-01 09:42:22
3 2016-03-01 09:43:50
dtype: datetime64[ns]
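If the layout is consistent apart from the missing leading zero, an explicit format string may be enough on its own, because the %H directive accepts both one-digit and two-digit hours. A minimal sketch, assuming English day and month abbreviations and the exact field order shown in the question:

```python
import pandas as pd

s = pd.Series([
    "Tue Mar 1 9:23:58 2016",
    "Tue Mar 1 9:29:04 2016",
    "Tue Mar 1 9:42:22 2016",
    "Tue Mar 1 09:43:50 2016",
])

# Weekday name, month name, day, time, year. Assumes every row follows this layout.
parsed = pd.to_datetime(s, format="%a %b %d %H:%M:%S %Y")

print(parsed)
```

On a very large column, a fixed format also tends to parse faster than letting pandas infer the format for each value.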

Elegant way to sum over duplicate MultiIndex values

I have a DataFrame with a many-levelled MultiIndex.
I know that there are duplicates in the MultiIndex (because I don't care about a distinction that the underlying database does care about).
I want to sum over these duplicates:
>>> x = pd.DataFrame({'month':['Sep', 'Sep', 'Oct', 'Oct'], 'day':['Mon', 'Mon', 'Mon', 'Tue'], 'sales':[1,2,3,4]})
>>> x
   day month  sales
0  Mon   Sep      1
1  Mon   Sep      2
2  Mon   Oct      3
3  Tue   Oct      4
>>> x = x.set_index(['day', 'month'])
           sales
day month
Mon Sep        1
    Sep        2
    Oct        3
Tue Oct        4
To give me
day month
Mon Sep        3
    Oct        3
Tue Oct        4
Buried deep in this SO answer to a similar question is the suggestion:
df.groupby(level=df.index.names).sum()
But this seems to me to fail the 'readability counts' criterion of good Python code.
Does anyone know of a more human-readable way?
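For reference, a small sketch of the groupby suggestion applied to the example above, with the level names spelled out explicitly instead of passed as df.index.names. Whether that reads better is a matter of taste, and this is not claimed to be the preferred idiom:

```python
import pandas as pd

x = pd.DataFrame({
    "month": ["Sep", "Sep", "Oct", "Oct"],
    "day": ["Mon", "Mon", "Mon", "Tue"],
    "sales": [1, 2, 3, 4],
}).set_index(["day", "month"])

# Group on the named index levels and sum the rows that share the same key.
summed = x.groupby(level=["day", "month"]).sum()

print(summed)
```

The result is sorted by the group keys by default, so the row order may differ from the original frame.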
