Group Pandas DF for seasonal analysis [duplicate] - python

Hello and thanks in advance for any help. I have a simple dataframe with two columns. I did not set an index explicitly, but I believe a dataframe gets an integer index that I see along the left side of the output. Question below:
df = pandas.DataFrame(res)
df.columns = ['date', 'pb']
df['date'] = pandas.to_datetime(df['date'])
df.dtypes
date datetime64[ns]
pb float64
dtype: object
date pb
0 2016-04-01 24199.933333
1 2016-03-01 23860.870968
2 2016-02-01 23862.275862
3 2016-01-01 25049.193548
4 2015-12-01 24882.419355
5 2015-11-01 24577.000000
I would like to pivot the dataframe so that I have years across the top (columns): 2016, 2015, etc
and a row for each month: 1 - 12.

Using the .dt accessor you can create columns for year and month and then pivot on those:
import pandas as pd
df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month
pd.pivot_table(df, index='Month', columns='Year', values='pb', aggfunc='sum')
Alternatively, if you don't want those extra columns, you can pass the accessors directly:
pd.pivot_table(df, index=df['date'].dt.month, columns=df['date'].dt.year,
               values='pb', aggfunc='sum')
With my dummy dataset that produces:
Year 2013 2014 2015 2016
date
1 92924.0 102072.0 134660.0 132464.0
2 79935.0 82145.0 118234.0 147523.0
3 86878.0 94959.0 130520.0 138325.0
4 80267.0 89394.0 120739.0 129002.0
5 79283.0 91205.0 118904.0 125878.0
6 77828.0 89884.0 112488.0 121953.0
7 78839.0 94407.0 113124.0 NaN
8 79885.0 97513.0 116771.0 NaN
9 79455.0 99555.0 114833.0 NaN
10 77616.0 98764.0 115872.0 NaN
11 75043.0 95756.0 107123.0 NaN
12 81996.0 102637.0 114952.0 NaN
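To make the pivot easy to try out, here is a minimal self-contained sketch. The five rows of data are made up for illustration (they are not the asker's data), and the string 'sum' is used as the aggregation function.

```python
import pandas as pd

# Made-up sample: three months of 2015, two months of 2016
df = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01',
                            '2016-01-01', '2016-02-01']),
    'pb': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Helper columns derived with the .dt accessor
df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month

# Months become rows, years become columns
table = df.pivot_table(index='Month', columns='Year', values='pb', aggfunc='sum')
print(table)
```

Month/year combinations with no data (March 2016 here) come out as NaN, which matches the NaN cells in the output above.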

Using unstack instead of pivot:
df = pd.DataFrame(
    dict(date=pd.date_range('2013-01-01', periods=42, freq='M'),
         pb=np.random.rand(42)))
df.set_index([df.date.dt.month, df.date.dt.year]).pb.unstack()

Related

Issue with converting a pandas column from int64 to datetime64

I'm trying to convert a column of Year values from int64 to datetime64 in pandas. The column currently looks like
Year
2003
2003
2003
2003
2003
...
2021
2021
2021
2021
2021
However the data type listed when I use dataset['Year'].dtypes is int64.
That's after I used pd.to_datetime(dataset.Year, format='%Y') to convert the column from int64 to datetime64. How do I get around this?
pd.to_datetime() returns a new Series and leaves the original column untouched, so you have to assign its result, for example to df['date']. Once you do that, you will see the values converted from integers to datetimes.
df = pd.DataFrame({'Year': [2000,2000,2000,2000,2000,2000]})
df['date'] = pd.to_datetime(df['Year'], format="%Y")
df
The output should be:
Year date
0 2000 2000-01-01
1 2000 2000-01-01
2 2000 2000-01-01
3 2000 2000-01-01
4 2000 2000-01-01
5 2000 2000-01-01
So essentially all you are missing is df['date'] = pd.to_datetime(df['Year'], format="%Y") from your code and it should be working fine with respect to converting.
Note that pd.to_datetime() does not return just the year (which, as far as I understood from your question, is what you wanted); as the output above shows, it returns full dates. For more detail on what pd.to_datetime() returns, see the documentation.
I hope this helps.
You should be able to convert from an integer:
df = pd.DataFrame({'Year': [2003, 2022]})
df['datetime'] = pd.to_datetime(df['Year'], format='%Y')
print(df)
Output:
Year datetime
0 2003 2003-01-01
1 2022 2022-01-01
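The point both answers make, that the result has to be assigned back, can be checked by looking at the dtype before and after. This is a small sketch; the two-row `dataset` frame is made up to mirror the question.

```python
import pandas as pd

dataset = pd.DataFrame({'Year': [2003, 2021]})

# The call alone builds a new Series and throws it away
pd.to_datetime(dataset['Year'], format='%Y')
dtype_before = dataset['Year'].dtype   # column is unchanged

# Assigning the result is what actually changes the column
dataset['Year'] = pd.to_datetime(dataset['Year'], format='%Y')
dtype_after = dataset['Year'].dtype

print(dtype_before, dtype_after)
```

If only the year number is needed later, `dataset['Year'].dt.year` gets it back from the datetime column.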

Pandas groupby month output is incorrect [duplicate]

My dataset has dates in the European format, and I'm struggling to convert it into the correct format before I pass it through a pd.to_datetime, so for all day < 12, my month and day switch.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Pass an explicit format:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
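A quick way to convince yourself that an explicit format removes the ambiguity is to parse a few dd/mm/yyyy strings and inspect the day and month. The sample strings below are made up; the second one is ambiguous on purpose.

```python
import pandas as pd

# '01/04/2018' could be read as 4 January or 1 April without a format
raw = pd.Series(['31/03/2018', '01/04/2018', '28/02/2018'])

# With format='%d/%m/%Y' the first field is always the day
parsed = pd.to_datetime(raw, format='%d/%m/%Y')

print(parsed.dt.day.tolist())    # day of each parsed date
print(parsed.dt.month.tolist())  # month of each parsed date
```

As a side note, the `dayfirst` argument of `pd.read_csv` only matters for columns that are actually parsed as dates (via `parse_dates`), so on its own it does not change a plain string column.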
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30

How to show the average sales for each year within ten years for a specific city in Pandas?

What would be the correct way to show what was the average sales volume in Carlisle city for each year between
2010-2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({'Date': ['01/09/2009','01/10/2009','01/11/2009','01/12/2009','01/01/2010','01/02/2010','01/03/2010','01/04/2010','01/05/2010','01/06/2010','01/07/2010','01/08/2010','01/09/2010','01/10/2010','01/11/2010','01/12/2010','01/01/2011','01/02/2011'],
'RegionName': ['Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle'],
'SalesVolume': [118,137,122,132,83,81,105,114,110,106,137,130,129,121,129,100,84,62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(lambda x:
'{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I try to run this code, it doesn't filter the 'Date' column to only calculate the average SalesVolume for the years beginning in '01/01/2010' and ending at '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me to answer this question correctly?
This is the result I've got
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
& (df["RegionName"] == "Carlisle")] \
.groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-01-01 112.083333
2011-01-01 73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
Going further:
The only difference from @nocibambi's answer is the groupby parameter, in particular the freq argument of pd.Grouper. Imagine your accounting year starts on 1 September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for the full list of freq aliases and anchoring suffixes.
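If you would rather not depend on a frequency alias (I believe the spelling of the yearly aliases has changed in recent pandas releases), the same accounting-year grouping can be sketched by computing the fiscal year by hand. This is a swapped-in technique, not the Grouper approach shown above, and the `fiscal_year` name is just an illustrative choice.

```python
import pandas as pd

# Same quarterly sales as in the example above
df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-09-01', '2010-12-01', '2011-03-01', '2011-06-01',
                            '2011-09-01', '2011-12-01', '2012-03-01', '2012-06-01']),
    'Sales': [1, 2, 3, 4, 5, 6, 7, 8],
})

# Assumption: the accounting year starts on 1 September, so months
# January to August belong to the fiscal year that began the previous September
year = df['Date'].dt.year
fiscal_year = year.where(df['Date'].dt.month >= 9, year - 1).rename('fiscal_year')

fy_mean = df.groupby(fiscal_year)['Sales'].mean()
print(fy_mean)
```

The groups are labelled by the calendar year in which each accounting year starts (2010 and 2011), and the means match the 2.5 and 6.5 shown above.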
You can access year with the datetime accessor:
df[
(df["RegionName"] == "Carlisle")
& (df["Date"].dt.year >= 2010)
& (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64
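Here is a self-contained sketch of the whole pipeline, from day-first strings to yearly means, on a small subset of the sample data from the question. Parsing with an explicit format is my addition; the answers above assume the Date column is already a datetime.

```python
import pandas as pd

# A few rows taken from the question's sample frame
df = pd.DataFrame({
    'Date': ['01/12/2009', '01/01/2010', '01/02/2010', '01/01/2011', '01/02/2011'],
    'RegionName': ['Carlisle'] * 5,
    'SalesVolume': [132, 83, 81, 84, 62],
})

# The dates are day-first, so spell the format out
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['Year'] = df['Date'].dt.year

# Keep Carlisle rows from 2010 to 2020 inclusive, then average per year
mask = (df['RegionName'] == 'Carlisle') & df['Year'].between(2010, 2020)
yearly_mean = df.loc[mask].groupby('Year')['SalesVolume'].mean()
print(yearly_mean)
```

Selecting `['SalesVolume']` before `.mean()` is what keeps the other columns out of the result.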

how to convert monthly data to weekly data keeping the other columns constant

I have a data frame as follows.
pd.DataFrame({'Date':['2020-08-01','2020-08-01','2020-09-01'],'value':[10,12,9],'item':['a','d','b']})
I want to convert this to weekly data keeping all the columns apart from the Date column constant.
Expected output
pd.DataFrame({'Date':['2020-08-01','2020-08-08','2020-08-15','2020-08-22','2020-08-29','2020-08-01','2020-08-08','2020-08-15','2020-08-22','2020-08-29','2020-09-01','2020-09-08','2020-09-15','2020-09-22','2020-09-29'],
'value':[10,10,10,10,10,12,12,12,12,12,9,9,9,9,9],'item':['a','a','a','a','a','d','d','d','d','d','b','b','b','b','b']})
It should be able to convert any month data to weekly data. Date in the input data frame is always the first day of that month.
How do I make this happen?
Thanks in advance.
Since the desired new datetime index is irregular (re-starts at the 1st of each month), an iterative creation of the index is an option:
df = pd.DataFrame({'Date':['2020-08-01','2020-09-01'],'value':[10,9],'item':['a','b']})
df = df.set_index(pd.to_datetime(df['Date'])).drop(columns='Date')
dti = pd.to_datetime([])  # start with an empty datetime index
for month in df.index:  # for each month, add a 7-day-step datetime index to the previous
    dti = dti.union(pd.date_range(month, month+pd.DateOffset(months=1), freq='7d'))
# just reindex and forward-fill, no resampling needed
df = df.reindex(dti).ffill()
df
value item
2020-08-01 10.0 a
2020-08-08 10.0 a
2020-08-15 10.0 a
2020-08-22 10.0 a
2020-08-29 10.0 a
2020-09-01 9.0 b
2020-09-08 9.0 b
2020-09-15 9.0 b
2020-09-22 9.0 b
2020-09-29 9.0 b
I added one more date to your data and then used resample:
df = pd.DataFrame({'Date':['2020-08-01', '2020-09-01'],'value':[10, 9],'item':['a', 'b']})
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df = df.resample('W').ffill().reset_index()
print(df)
Date value item
0 2020-08-02 10 a
1 2020-08-09 10 a
2 2020-08-16 10 a
3 2020-08-23 10 a
4 2020-08-30 10 a
5 2020-09-06 9 b
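Neither answer above keeps both rows that share 2020-08-01, so here is another possible sketch that works row by row: build the list of weekly dates for each row and then explode it. The helper name `weekly_dates` is hypothetical, and it assumes, as the question states, that every Date is the first day of a month.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-08-01', '2020-08-01', '2020-09-01'],
                   'value': [10, 12, 9],
                   'item': ['a', 'd', 'b']})
df['Date'] = pd.to_datetime(df['Date'])

def weekly_dates(start):
    """Dates every 7 days from `start` up to the end of that month."""
    month_end = start + pd.offsets.MonthEnd(0)
    return list(pd.date_range(start, month_end, freq='7D'))

# One list of dates per row, then one output row per date
weekly = (df.assign(Date=[weekly_dates(d) for d in df['Date']])
            .explode('Date')
            .reset_index(drop=True))
weekly['Date'] = pd.to_datetime(weekly['Date'])
print(weekly)
```

Because each input row is expanded independently, duplicate month dates are not a problem and the other columns are simply repeated.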

Drop rows after particular year Pandas

I have a column in my dataframe that has years in the following format:
2018-19
2017-18
The years are object data type. I want to change the type of this column to datetime, then drop all rows before 1979-80. However, I tried to do that and I got formatting errors. What is the correct, or better way, of doing this?
BOS['Season'] = pd.to_datetime(BOS['Season'], format = '%Y%y')
I am quite new to Python, so I would appreciate it if you could tell me what I am doing wrong. Thanks!
I think the simplest approach here is to compare the years separately, e.g. the part before the '-':
print (BOS)
Season
0 1979-80
1 2018-19
2 2017-18
df = BOS[BOS['Season'].str.split('-').str[0].astype(int) < 2017]
print (df)
Season
0 1979-80
Details:
First the values are split into lists by Series.str.split, and then the first element of each list is selected:
print (BOS['Season'].str.split('-'))
0 [1979, 80]
1 [2018, 19]
2 [2017, 18]
Name: Season, dtype: object
print (BOS['Season'].str.split('-').str[0])
0 1979
1 2018
2 2017
Name: Season, dtype: object
Or convert both years to separate columns:
BOS['start'] = pd.to_datetime(BOS['Season'].str.split('-').str[0], format='%Y').dt.year
BOS['end'] = BOS['start'] + 1
print (BOS)
Season start end
0 1979-80 1979 1980
1 2018-19 2018 2019
2 2017-18 2017 2018
I would use the .str.slice accessor of the Series to select the part of the date I wish to keep and pass it to pd.to_datetime(). After that, selecting with .loc[] and a boolean mask becomes easy.
import pandas as pd
data = {
'date' : ['2016-17', '2017-18', '2018-19', '2019-20']
}
df = pd.DataFrame(data)
print(df)
# date
# 0 2016-17
# 1 2017-18
# 2 2018-19
# 3 2019-20
df['date'] = pd.to_datetime(df['date'].str.slice(0, 4), format='%Y')
print(df)
# date
# 0 2016-01-01
# 1 2017-01-01
# 2 2018-01-01
# 3 2019-01-01
df = df.loc[ df['date'].dt.year < 2018 ]
print(df)
# date
# 0 2016-01-01
# 1 2017-01-01
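Applied to the original request (drop every season before 1979-80), the same idea could look like the sketch below. The sample seasons are made up, with '1978-79' added so that there is something to drop, and the comparison is done on the start year as a plain integer rather than a datetime.

```python
import pandas as pd

BOS = pd.DataFrame({'Season': ['1978-79', '1979-80', '2017-18', '2018-19']})

# The first four characters are the season's start year
start_year = BOS['Season'].str[:4].astype(int)

# Keep 1979-80 and everything after it
kept = BOS[start_year >= 1979].reset_index(drop=True)
print(kept)
```

Leaving Season as a string and filtering on the derived integer avoids having to squeeze a two-year label into a single datetime.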
