Plot frequency of dates in an interval in a pandas dataframe - python

I have a dataframe with dates from 1970 to 2018, and I want to plot the frequency of occurrences from 2016 to 2017.
In[95]: df['last_payout'].dtypes
Out[95]: dtype('<M8[ns]')
The data is stored in this format:
In[96]: df['last_payout'].head
Out[96]: <bound method NDFrame.head of 0 1970-01-01
1 1970-01-01
2 1970-01-01
3 1970-01-01
4 1970-01-01
I plot this by year using groupby and count:
In[97]: df['last_payout'].groupby(df['last_payout'].dt.year).count().plot(kind="bar")
I want to restrict this plot to a specific date range. I tried df['last_payout'].dt.year > 2016, but the resulting plot was not what I wanted.
How do I get the plot for a specific date range?

I think you need to filter with between and boolean indexing first:
rng = pd.date_range('2015-04-03', periods=10, freq='7M')
df = pd.DataFrame({'last_payout': rng})
print (df)
last_payout
0 2015-04-30
1 2015-11-30
2 2016-06-30
3 2017-01-31
4 2017-08-31
5 2018-03-31
6 2018-10-31
7 2019-05-31
8 2019-12-31
9 2020-07-31
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.groupby(df['last_payout'].dt.year)
.count()
.plot(kind="bar")
)
Alternative solution:
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.dt.year
.value_counts()
.sort_index()
.plot(kind="bar")
)
EDIT: For months with years, convert the datetimes to month periods with to_period:
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.dt.to_period('M')
.value_counts()
.sort_index()
.plot(kind="bar")
)
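To see what to_period produces, here is a small sketch with a few illustrative dates:
import pandas as pd

s = pd.Series(pd.to_datetime(['2016-06-30', '2017-01-31', '2017-08-31']))
print(s.dt.to_period('M'))
# 0    2016-06
# 1    2017-01
# 2    2017-08
# dtype: period[M]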

Note that
df['last_payout'].dt.year > 2016
just returns a boolean series, so plotting it will indeed show a bar chart of the number of dates for which the condition is True or False.
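A small sketch of that boolean series (with made-up dates):
import pandas as pd

s = pd.Series(pd.to_datetime(['2015-06-01', '2017-03-15', '2018-01-01']))
print(s.dt.year > 2016)
# 0    False
# 1     True
# 2     True
# dtype: bool
Plotting counts of this series only tells you how many dates satisfy the condition; it does not filter the data.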
Try first creating a relevant df:
relevant_df = df[(df['last_payout'].dt.year >= 2016) & (df['last_payout'].dt.year <= 2017)]
(use strict or non-strict inequalities depending on what you want, of course.)
then performing the plot on it:
relevant_df['last_payout'].groupby(relevant_df['last_payout'].dt.year).count().plot(kind="bar")

Related

Issue with converting a pandas column from int64 to datetime64

I'm trying to convert a column of Year values from int64 to datetime64 in pandas. The column currently looks like
Year
2003
2003
2003
2003
2003
...
2021
2021
2021
2021
2021
However, the data type listed when I use dataset['Year'].dtypes is still int64.
That's after I used pd.to_datetime(dataset.Year, format='%Y') to convert the column from int64 to datetime64. How do I get around this?
You have to assign pd.to_datetime(df['Year'], format="%Y") to df['date']. Once you have done that, you should see the column converted from integer to datetime.
df = pd.DataFrame({'Year': [2000,2000,2000,2000,2000,2000]})
df['date'] = pd.to_datetime(df['Year'], format="%Y")
df
The output should be:
Year date
0 2000 2000-01-01
1 2000 2000-01-01
2 2000 2000-01-01
3 2000 2000-01-01
4 2000 2000-01-01
5 2000 2000-01-01
So essentially all you were missing is the assignment df['date'] = pd.to_datetime(df['Year'], format="%Y"); add that and the conversion works fine.
Note that pd.to_datetime() will not return just the year (as far as I understood from your question, you wanted the year); for more detail on what .to_datetime() returns, see the documentation.
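For example, a minimal sketch showing that the year is easy to pull back out of the converted column via the .dt accessor:
import pandas as pd

df = pd.DataFrame({'Year': [2003, 2021]})
df['date'] = pd.to_datetime(df['Year'], format='%Y')  # datetime64[ns], Jan 1 of each year
print(df['date'].dt.year)  # 2003, 2021 as integers again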
I hope this helps.
You should be able to convert from an integer:
df = pd.DataFrame({'Year': [2003, 2022]})
df['datetime'] = pd.to_datetime(df['Year'], format='%Y')
print(df)
Output:
Year datetime
0 2003 2003-01-01
1 2022 2022-01-01

Group Pandas DF for seasonal analysis [duplicate]

Hello and thanks in advance for any help. I have a simple dataframe with two columns. I did not set an index explicitly, but I believe a dataframe gets an integer index that I see along the left side of the output. Question below:
df = pandas.DataFrame(res)
df.columns = ['date', 'pb']
df['date'] = pandas.to_datetime(df['date'])
df.dtypes
date datetime64[ns]
pb float64
dtype: object
date pb
0 2016-04-01 24199.933333
1 2016-03-01 23860.870968
2 2016-02-01 23862.275862
3 2016-01-01 25049.193548
4 2015-12-01 24882.419355
5 2015-11-01 24577.000000
I would like to pivot the dataframe so that I have years across the top (columns): 2016, 2015, etc
and a row for each month: 1 - 12.
Using the .dt accessor you can create columns for year and month and then pivot on those:
import numpy as np

df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month
pd.pivot_table(df, index='Month', columns='Year', values='pb', aggfunc=np.sum)
Alternately if you don't want those other columns you can do:
pd.pivot_table(df, index=df['date'].dt.month, columns=df['date'].dt.year,
               values='pb', aggfunc=np.sum)
With my dummy dataset that produces:
Year 2013 2014 2015 2016
date
1 92924.0 102072.0 134660.0 132464.0
2 79935.0 82145.0 118234.0 147523.0
3 86878.0 94959.0 130520.0 138325.0
4 80267.0 89394.0 120739.0 129002.0
5 79283.0 91205.0 118904.0 125878.0
6 77828.0 89884.0 112488.0 121953.0
7 78839.0 94407.0 113124.0 NaN
8 79885.0 97513.0 116771.0 NaN
9 79455.0 99555.0 114833.0 NaN
10 77616.0 98764.0 115872.0 NaN
11 75043.0 95756.0 107123.0 NaN
12 81996.0 102637.0 114952.0 NaN
Using unstack instead of pivot
import numpy as np
import pandas as pd

df = pd.DataFrame(
    dict(date=pd.date_range('2013-01-01', periods=42, freq='M'),
         pb=np.random.rand(42)))
df.set_index([df.date.dt.month, df.date.dt.year]).pb.unstack()
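A sketch of the same idea with named index levels, so the rows and columns are labelled month and year (the level names here are my own choice):
out = (df.set_index([df.date.dt.month.rename('month'),
                     df.date.dt.year.rename('year')])
         .pb.unstack())
print(out)  # months 1-12 as rows, years as columns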

Pandas groupby month output is incorrect [duplicate]

My dataset has dates in the European format, and I'm struggling to convert it into the correct format before I pass it through pd.to_datetime, so for all days < 12, the month and day are switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted at dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add the format parameter:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
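Applied to the sample dates from the question, a minimal runnable sketch:
import pandas as pd

df = pd.DataFrame({'Date': ['31/03/2018', '30/04/2018', '28/02/2018']})
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
print(df['Date'])
# 0   2018-03-31
# 1   2018-04-30
# 2   2018-02-28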
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
    {'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30

How to show the average sales for each year within ten years for a specific city in Pandas?

What would be the correct way to show the average sales volume in Carlisle for each year between 2010 and 2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({
    'Date': ['01/09/2009', '01/10/2009', '01/11/2009', '01/12/2009',
             '01/01/2010', '01/02/2010', '01/03/2010', '01/04/2010',
             '01/05/2010', '01/06/2010', '01/07/2010', '01/08/2010',
             '01/09/2010', '01/10/2010', '01/11/2010', '01/12/2010',
             '01/01/2011', '01/02/2011'],
    'RegionName': ['Carlisle'] * 18,
    'SalesVolume': [118, 137, 122, 132, 83, 81, 105, 114, 110,
                    106, 137, 130, 129, 121, 129, 100, 84, 62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(
    lambda x: '{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
sales_vol = carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I try to run this code, it doesn't filter the 'Date' column to only calculate the average SalesVolume for the years beginning at '01/01/2010' and ending at '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me to answer this question correctly?
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
& (df["RegionName"] == "Carlisle")] \
.groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-01-01 112.083333
2011-01-01 73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
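If you prefer plain year labels instead of timestamps, one option (a sketch, binding the result above to a name of my own choosing) is to replace the DatetimeIndex with its year attribute:
result = (df.loc[df["Date"].dt.year.between(2010, 2020)
                 & (df["RegionName"] == "Carlisle")]
            .groupby(pd.Grouper(key="Date", freq="Y"))["SalesVolume"]
            .mean())
result.index = result.index.year  # 2010, 2011, ...
print(result)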
Going further: the only difference from @nocibambi's answer is the groupby parameter, particularly the freq argument of pd.Grouper. Imagine your accounting year starts on the 1st of September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for all values of the freq aliases and anchoring suffixes.
You can access year with the datetime accessor:
df[
    (df["RegionName"] == "Carlisle")
    & (df["Date"].dt.year >= 2010)
    & (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64

Pandas extract week of year and year from date

I ran into this scenario and don't know how to solve it.
I have a data frame where I am adding "week_of_year" and "year" columns based on the "date" column, and the code itself runs fine.
import pandas as pd
df = pd.DataFrame({'date': ['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].apply(lambda x: x.weekofyear)
df['year'] = df['date'].apply(lambda x: x.year)
print(df)
Current Output
date week_of_year year
0 2018-12-31 1 2018
1 2019-01-01 1 2019
2 2019-12-31 1 2019
3 2020-01-01 1 2020
Expected Output
What I am expecting: for 2018 and 2019, the last date falls in the first week of the new year (2019 and 2020 respectively). So where the week is 1 but the date belongs to the previous calendar year, I want the year column to track the year the week belongs to, as in the expected output.
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
Try:
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].dt.weekofyear
df['year']=(df['date']+pd.to_timedelta(6-df['date'].dt.weekday, unit='d')).dt.year
Outputs:
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
A few things: generally avoid .apply(..).
For datetime columns you can interact with the dates directly through the df[col].dt accessor.
Then, to get the last day of the week, add 6 - weekday days to the date, where weekday runs from 0 (Monday) to 6 (Sunday).
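A small sketch of that last step on the question's dates:
import pandas as pd

dates = pd.Series(pd.to_datetime(['2018-12-31', '2019-01-01', '2019-12-31']))
# weekday: 0 = Monday ... 6 = Sunday, so adding 6 - weekday days lands on the Sunday ending the week
print((dates + pd.to_timedelta(6 - dates.dt.weekday, unit='d')).dt.year)
# 0    2019
# 1    2019
# 2    2020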
TLDR CODE
To get the week number as a series:
df['DATE'].dt.isocalendar().week
To set a new column to the week, use the same function and assign the returned series to a column:
df['WEEK'] = df['DATE'].dt.isocalendar().week
TLDR EXPLANATION
Use Series.dt.isocalendar().week to get the week for a given datetime series.
Note:
column "DATE" must be stored as a datetime column
