Pandas extract week of year and year from date - python

I ran into this scenario and don't know how to solve it.
I have a data frame where I am adding "week_of_year" and "year" columns based on the "date" column, and that part works fine.
import pandas as pd
df = pd.DataFrame({'date': ['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].apply(lambda x: x.weekofyear)
df['year'] = df['date'].apply(lambda x: x.year)
print(df)
Current Output
date week_of_year year
0 2018-12-31 1 2018
1 2019-01-01 1 2019
2 2019-12-31 1 2019
3 2020-01-01 1 2020
Expected Output
What I expect: the last day of 2018 and the last day of 2019 each fall in week 1 of the following year (2019 and 2020 respectively). So when the week is 1 but the date belongs to the previous calendar year, I want the year column to show the year that week belongs to, as below.
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020

Try:
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].dt.isocalendar().week  # Series.dt.weekofyear was deprecated in pandas 1.1 and removed in 2.0
df['year']=(df['date']+pd.to_timedelta(6-df['date'].dt.weekday, unit='d')).dt.year
Outputs:
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
A few things: generally avoid .apply(...) here.
For datetime columns you can work with the date components through the df[col].dt accessor.
To get the last day of the week, add 6 - weekday days to the date, where weekday runs from 0 (Monday) to 6 (Sunday).
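If the goal is the ISO week together with the year that week belongs to, both can be read from a single call. This is a minimal sketch, assuming pandas 1.1 or later, where Series.dt.isocalendar() is available. The 2021-01-01 row is an extra date added here for illustration and is not part of the question's data; it shows a boundary where the Sunday-based year from the answer above and the ISO year differ.

```python
import pandas as pd

# The question's four dates plus 2021-01-01 (added here for illustration only).
df = pd.DataFrame({'date': pd.to_datetime(
    ['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01', '2021-01-01'])})

# isocalendar() returns a DataFrame with 'year', 'week' and 'day' columns.
iso = df['date'].dt.isocalendar()
df['week_of_year'] = iso['week'].astype(int)
df['iso_year'] = iso['year'].astype(int)

# Year of the Sunday that ends each week, as in the answer above.
days_to_sunday = pd.to_timedelta(6 - df['date'].dt.weekday, unit='D')
df['sunday_year'] = (df['date'] + days_to_sunday).dt.year

print(df)
```

For the question's four dates the two year columns agree. They differ for 2021-01-01, which is ISO week 53 of 2020, so it is worth checking which convention your downstream grouping expects.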

TLDR CODE
To get the week number as a series
df['DATE'].dt.isocalendar().week
To store the week in a new column, use the same call and assign the returned series to the column:
df['WEEK'] = df['DATE'].dt.isocalendar().week
TLDR EXPLANATION
Use pd.Series.dt.isocalendar().week to get the ISO week number for a given datetime series.
Note:
The "DATE" column must be stored as a datetime column.
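The note above can be sketched as a small guard: convert the column only if it is not already a datetime, then take the week. The sample values and the column name "DATE" are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: dates stored as plain strings.
df = pd.DataFrame({'DATE': ['2018-12-31', '2019-06-15']})

# Convert only when the column is not already a datetime dtype.
if not pd.api.types.is_datetime64_any_dtype(df['DATE']):
    df['DATE'] = pd.to_datetime(df['DATE'])

df['WEEK'] = df['DATE'].dt.isocalendar().week
print(df)
```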

Related

Pandas groupby month output is incorrect [duplicate]

My dataset has dates in the European format, and I'm struggling to get them parsed correctly by pd.to_datetime: for every date whose day is 12 or less, the month and day are swapped.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Pass an explicit format:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
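As a rough illustration of why an explicit format helps: every string is read as day/month/year, and with errors='coerce' anything that does not match becomes NaT instead of being silently reinterpreted. The sample values, including the deliberately malformed last entry, are made up for this sketch.

```python
import pandas as pd

# Illustrative values; the last entry is deliberately malformed.
raw = pd.Series(['31/03/2018', '01/04/2018', '28/02/2018', 'not a date'])

# Every value must match day/month/4-digit-year; mismatches become NaT.
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
print(parsed)
```

Dropping errors='coerce' makes the same call raise on the malformed entry, which may be preferable when bad rows should stop the pipeline.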
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
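The same day/month/year split can also be done without .apply, using the vectorized string methods. This is a sketch of an alternative under the same assumption as above, namely that every value is day/month/year separated by slashes.

```python
import pandas as pd

df = pd.DataFrame(
    {'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)

# Split once for the whole column, then name the pieces explicitly.
date_parts = df['Date'].str.split('/', expand=True).astype(int)
date_parts.columns = ['day', 'month', 'year']

# to_datetime assembles dates from 'year', 'month' and 'day' columns.
df['Date'] = pd.to_datetime(date_parts)
print(df)
```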

Incorrect order of M/D in datetimes

I have a date column in my csv file
This is my Date column data
14/3/18
28/3/18
9/4/2018
How can I make every year come out as 2018?
I have tried this:
df['DateTime'] = pd.to_datetime(df['Date'])
print (df['DateTime'])
but it returns
1 2018-03-14
2 2018-03-28
3 2018-09-04
In the last row, 09 became the month, but 04 is supposed to be the month.
Add the parameter dayfirst=True:
df['DateTime'] = pd.to_datetime(df['Date'], dayfirst=True)
print (df)
Date DateTime
0 14/3/18 2018-03-14
1 28/3/18 2018-03-28
2 9/4/2018 2018-04-09
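Because the column mixes two-digit and four-digit years, a parser that expects one consistent format for the whole column may reject it, depending on the pandas version. A version-independent sketch is to normalise the year first and then build the dates from their parts. The rule that two-digit years mean 20xx is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['14/3/18', '28/3/18', '9/4/2018']})

# Split day/month/year and convert to integers.
parts = df['Date'].str.split('/', expand=True).astype(int)
parts.columns = ['day', 'month', 'year']

# Assumption for this sketch: any two-digit year belongs to the 2000s.
parts.loc[parts['year'] < 100, 'year'] += 2000

df['DateTime'] = pd.to_datetime(parts)
print(df)
```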
You can use .dt.strftime to control how the dates are displayed (note that it returns strings, not datetimes):
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.strftime("%Y-%d-%m")
Output:
0 2018-14-03
1 2018-28-03
2 2018-04-09
Name: A, dtype: object

Date and Month Mixup in Pandas

In Pandas, I am trying to format a date column from String to proper date so that I can export it to ElasticSearch. However, date and month are getting mixed up. I have given an example below.
df = pd.DataFrame({'Date':['12/03/2020 0:00', '11/02/2019 0:00', '10/01/2020 0:00'],
'Event':['Music', 'Poetry', 'Theatre'],
'Cost':[10000, 5000, 15000]})
Date is entered in dd/mm/YYYY format.
df['Date1'] = df['Date'].astype('datetime64[ns]')
df['Year'] = pd.DatetimeIndex(df['Date']).year
df['Month'] = pd.DatetimeIndex(df['Date1']).month
df['Day'] = pd.DatetimeIndex(df['Date1']).day
df
This results in the following data frame, where the day and month are interchanged. The year is extracted correctly.
Date Event Cost Date1 Year Month Day
0 12/03/2020 0:00 Music 10000 2020-12-03 2020 12 3
1 11/02/2019 0:00 Poetry 5000 2019-11-02 2019 11 2
2 10/01/2020 0:00 Theatre 15000 2020-10-01 2020 10 1
Can someone provide inputs on how to format the date column in an appropriate way? Thanks
You'll want to use pd.to_datetime() to convert the data to real datetimes first. Because the strings are dd/mm/YYYY, pass dayfirst=True; the default parser reads them month-first, which is exactly the mix-up described in the question:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
With that, the example data parses as intended:
>>> df
Date Event Cost
0 2020-03-12 Music 10000
1 2019-02-11 Poetry 5000
2 2020-01-10 Theatre 15000
If you need the separate d/m/y columns, you can access the series' dt property instead of converting via a DatetimeIndex:
>>> df['Year'] = df['Date'].dt.year
>>> # ... etc ...
>>> df
Date Event Cost Year
0 2020-03-12 Music 10000 2020
1 2019-02-11 Poetry 5000 2019
2 2020-01-10 Theatre 15000 2020
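Since the question states that the dates are entered as dd/mm/YYYY, an explicit format removes any guessing about which field is the day. The sketch below assumes every value looks like the three samples in the question (day/month/year followed by H:MM). The 'DateISO' column name and the ISO string layout are illustrative choices for export, not something the question specifies.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['12/03/2020 0:00', '11/02/2019 0:00', '10/01/2020 0:00'],
                   'Event': ['Music', 'Poetry', 'Theatre'],
                   'Cost': [10000, 5000, 15000]})

# Day first, then month, then 4-digit year, then hour:minute.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y %H:%M')

df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Hypothetical export column: ISO 8601 strings are unambiguous about field order.
df['DateISO'] = df['Date'].dt.strftime('%Y-%m-%dT%H:%M:%S')
print(df)
```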

Plot frequency of dates in interval occurred in pandas dataframe

I have a dataframe with dates from 1970 to 2018, and I want to plot the frequency of occurrences from 2016 to 2017.
In[95]: df['last_payout'].dtypes
Out[95]: dtype('<M8[ns]')
The data is stored in this format:
In[96]: df['last_payout'].head
Out[96]: <bound method NDFrame.head of 0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
I plot this by year using groupby and count:
In[97]: df['last_payout'].groupby(df['last_payout'].dt.year).count().plot(kind="bar")
I want to get this plot between specific dates. I tried df['last_payout'].dt.year > 2016, but I got this:
How do I get the plot for specific date range?
I think you need to filter with between and boolean indexing first:
rng = pd.date_range('2015-04-03', periods=10, freq='7M')
df = pd.DataFrame({'last_payout': rng})
print (df)
last_payout
0 2015-04-30
1 2015-11-30
2 2016-06-30
3 2017-01-31
4 2017-08-31
5 2018-03-31
6 2018-10-31
7 2019-05-31
8 2019-12-31
9 2020-07-31
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.groupby(df['last_payout'].dt.year)
.count()
.plot(kind="bar")
)
Alternative solution:
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.dt.year
.value_counts()
.sort_index()
.plot(kind="bar")
)
EDIT: To count by month within each year, convert the datetimes to monthly periods with to_period:
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.dt.to_period('M')
.value_counts()
.sort_index()
.plot(kind="bar")
)
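One thing to watch with the monthly version is that months with no rows are simply absent from the result, so the bars are not evenly spaced in time. A possible sketch, using made-up dates, reindexes the counts over every month in the window so that empty months show up as zero. The plot call is commented out because it needs matplotlib.

```python
import pandas as pd

# Illustrative dates only; substitute your own 'last_payout' column.
df = pd.DataFrame({'last_payout': pd.to_datetime(
    ['2015-11-30', '2016-06-30', '2017-01-31', '2017-08-31', '2018-03-31'])})

in_range = df['last_payout'].dt.year.between(2016, 2017)
monthly = (df.loc[in_range, 'last_payout']
           .dt.to_period('M')
           .value_counts()
           .sort_index())

# Reindex over every month in the window so empty months get a count of 0.
all_months = pd.period_range('2016-01', '2017-12', freq='M')
monthly = monthly.reindex(all_months, fill_value=0)

# monthly.plot(kind="bar")  # requires matplotlib
print(monthly)
```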
Note that
df['last_payout'].dt.year > 2016
just returns a boolean series, so plotting it will indeed show a bar chart of how many dates satisfy the condition and how many do not.
Try first creating a relevant df:
relevant_df = df[(df['last_payout'].dt.year > 2016) & (df['last_payout'].dt.year <= 2017)]
(Use strict or non-strict inequalities depending on what you want, of course.)
Then plot it:
relevant_df['last_payout'].groupby(relevant_df['last_payout'].dt.year).count().plot(kind="bar")
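When the window does not line up with whole years, the same boolean-mask idea works directly on the timestamps. This is a sketch with made-up dates and a half-open interval (start included, end excluded); the bounds are assumptions chosen to cover 2016 and 2017.

```python
import pandas as pd

# Illustrative dates only; substitute your own 'last_payout' column.
df = pd.DataFrame({'last_payout': pd.to_datetime(
    ['2015-04-30', '2015-11-30', '2016-06-30',
     '2017-01-31', '2017-08-31', '2018-03-31'])})

start = pd.Timestamp('2016-01-01')   # inclusive
end = pd.Timestamp('2018-01-01')     # exclusive

mask = (df['last_payout'] >= start) & (df['last_payout'] < end)
counts = df.loc[mask, 'last_payout'].dt.year.value_counts().sort_index()

# counts.plot(kind="bar")  # requires matplotlib
print(counts)
```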

Convert String to Date [With Year and Quarter]

I have a pandas dataframe, where one column contains a string for the year and quarter in the following format:
2015Q1
My Question:
How do I convert this into two columns, one for the year and one for the quarter?
You can use split, then cast column year to int and if necessary add Q to column q:
df = pd.DataFrame({'date':['2015Q1','2015Q2']})
print (df)
date
0 2015Q1
1 2015Q2
df[['year','q']] = df.date.str.split('Q', expand=True)
df.year = df.year.astype(int)
df.q = 'Q' + df.q
print (df)
date year q
0 2015Q1 2015 Q1
1 2015Q2 2015 Q2
Also you can use Period:
df['date'] = pd.to_datetime(df.date).dt.to_period('Q')
df['year'] = df['date'].dt.year
df['quarter'] = df['date'].dt.quarter
print (df)
date year quarter
0 2015Q1 2015 1
1 2015Q2 2015 2
You could also construct a DatetimeIndex and call year and quarter on it.
df.index = pd.to_datetime(df.date)
df['year'] = df.index.year
df['quarter'] = df.index.quarter
date year quarter
date
2015-01-01 2015Q1 2015 1
2015-04-01 2015Q2 2015 2
Note that you don't even need dedicated year and quarter columns if you have a DatetimeIndex; you could group on it directly, for example: df.groupby(df.index.quarter)
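Another way to read strings such as 2015Q1 is to build a quarterly PeriodIndex from them directly and take the year and quarter from it. This is a minimal sketch; the 'quarter_start' column is an illustrative addition showing how to get a real timestamp for each quarter.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2015Q1', '2015Q2']})

# Parse the 'YYYYQn' strings as quarterly periods.
periods = pd.PeriodIndex(df['date'], freq='Q')

df['year'] = periods.year
df['quarter'] = periods.quarter
# Illustrative extra column: first day of each quarter as a timestamp.
df['quarter_start'] = periods.to_timestamp()
print(df)
```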
