Pandas groupby month output is incorrect [duplicate] - python

My dataset has dates in the European format, and I'm struggling to convert them into the correct format before I pass them through pd.to_datetime: for every day < 12, my month and day get switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc, dayfirst=True)
df['Date'] = pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yyyy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 # <-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct

Add an explicit format:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
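A quick check on the sample from the question (a minimal sketch):
import pandas as pd
s = pd.Series(['31/03/2018', '30/04/2018', '28/02/2018'])
print(pd.to_datetime(s, format='%d/%m/%Y'))
# 0   2018-03-31
# 1   2018-04-30
# 2   2018-02-28
# dtype: datetime64[ns]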

You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
    {'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
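A vectorized variant of the same idea that avoids the row-wise .apply (a sketch on the same sample data): pd.to_datetime also accepts a DataFrame with 'year', 'month' and 'day' columns, so splitting with expand=True is enough.
import pandas as pd
df = pd.DataFrame({'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']})
parts = df['Date'].str.split('/', expand=True).astype(int)  # three integer columns
parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(parts)  # assembles dates from the named columns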

Related

How to show the average sales for each year within ten years for a specific city in Pandas?

What would be the correct way to show the average sales volume in Carlisle for each year between 2010 and 2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({'Date': ['01/09/2009', '01/10/2009', '01/11/2009', '01/12/2009', '01/01/2010', '01/02/2010',
                            '01/03/2010', '01/04/2010', '01/05/2010', '01/06/2010', '01/07/2010', '01/08/2010',
                            '01/09/2010', '01/10/2010', '01/11/2010', '01/12/2010', '01/01/2011', '01/02/2011'],
                   'RegionName': ['Carlisle'] * 18,
                   'SalesVolume': [118, 137, 122, 132, 83, 81, 105, 114, 110, 106, 137, 130, 129, 121, 129, 100, 84, 62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(lambda x:
    '{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I try to run this code, it doesn't filter the 'Date' column to only calculate the average SalesVolume for the years beginning at '01/01/2010' and ending at '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me to answer this question correctly?
This is the result I've got
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
& (df["RegionName"] == "Carlisle")] \
.groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-12-31    112.083333
2011-12-31     73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
Going further:
The only difference from the answer of @nocibambi below is the groupby parameter, particularly the freq argument of pd.Grouper. Imagine your accounting year starts on the 1st of September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for all the freq aliases and anchoring suffixes.
You can access year with the datetime accessor:
df[
    (df["RegionName"] == "Carlisle")
    & (df["Date"].dt.year >= 2010)
    & (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64
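Since the question already imports matplotlib, the result can be plotted directly (a sketch; result is a hypothetical name for the Series returned above):
import matplotlib.pyplot as plt
result.plot(kind='bar', title='Carlisle mean SalesVolume per year')
plt.show()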

How do I condense a pandas data frame where the rows are the months and I'm trying to condense them into years?

So I have a dataframe
https://docs.google.com/spreadsheets/d/19ssG8bvkZKVDR6V5yU9fZVRJbJNfTTEYmWqLwmDwBa0/edit#gid=0
This is the output that my code gives.
Here is the code:
from yahoofinancials import YahooFinancials
import pandas as pd
import datetime as datetime
df = pd.read_excel('C:/Users/User/Downloads/Div Tickers.xlsx', sheet_name='Sheet1')
tickers_list = df['Ticker'].tolist()
data = pd.DataFrame(columns=tickers_list)
yahoo_financials_ecommerce = YahooFinancials(data)
ecommerce_income_statement_data = yahoo_financials_ecommerce.get_financial_stmts('annual', 'income')
data = ecommerce_income_statement_data['incomeStatementHistory']
df_dict = dict()
for ticker in tickers_list:
    df_dict[ticker] = pd.concat([pd.DataFrame(data[ticker][x]) for x in range(len(data[ticker]))],
                                sort=False, join='outer', axis=1)
df = pd.concat(df_dict, sort=True)
df_l = pd.DataFrame(df.stack())
df_l.reset_index(inplace=True)
df_l.columns = ['ticker', 'financials', 'date', 'value']
df_w = df_l.pivot_table(index=['date.year', 'financials'], columns='ticker', values='value')
export_excel = df_w.to_excel(r'C:/Users/User/Downloads/Income Statement Histories.xlsx', sheet_name="Sheet1", index= True)
How would I go about condensing the months into years so that the data is comparable Year-over-Year?
IIUC, you need to melt, then use groupby on your date column to group by year.
#df['date'] = pd.to_datetime(df['date'])
df = pd.melt(df,id_vars=['date','financials'],var_name='ticker')
df.groupby([df['date'].dt.year,df['financials'],df['ticker']])['value'].sum().unstack()
ticker AEM AGI ALB \
date financials
2016 costOfRevenue 1.030000e+09 309000000.0 1.710000e+09
discontinuedOperations 0.000000e+00 0.0 2.020000e+08
ebit 3.360000e+08 21300000.0 5.370000e+08
grossProfit 1.110000e+09 173000000.0 9.700000e+08
incomeBeforeTax 2.680000e+08 -7600000.0 5.750000e+08
... ... ... ...
2019 researchDevelopment 0.000000e+00 0.0 5.828700e+07
sellingGeneralAdministrative 1.210000e+08 19800000.0 4.390000e+08
totalOperatingExpenses 1.650000e+09 557000000.0 2.830000e+09
totalOtherIncomeExpenseNet -1.000000e+08 2900000.0 -6.900000e+07
totalRevenue 2.490000e+09 683000000.0 3.590000e+09
Not sure since you didn't give us any data, but you can change a datetime column to year with the following code. The first bit is just generating some sample data:
import pandas as pd
from datetime import datetime, timedelta
from random import randint
df = pd.DataFrame({
    'dates': [datetime.today() - timedelta(randint(0, 1000)) for _ in range(50)]
})
print(df.head())
dates
0 2019-09-02 21:01:46.702300
1 2019-11-03 21:01:46.702329
2 2019-04-01 21:01:46.702338
3 2019-03-04 21:01:46.702345
4 2019-03-28 21:01:46.702351
The part that matters:
df.dates.dt.to_period('Y')
0    2018
1    2018
2    2019
3    2018
4    2019
5    2020
(The exact values depend on the random sample.)
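To actually condense months into years, group on that period (a sketch; value is a hypothetical numeric column added for the demonstration):
df['value'] = range(len(df))
df.groupby(df.dates.dt.to_period('Y'))['value'].sum()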

Pandas extract week of year and year from date

I ran into this scenario and don't know how to solve it.
I have a data frame where I am adding "week_of_year" and "year" columns based on the "date" column, which is working fine.
import pandas as pd
df = pd.DataFrame({'date': ['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].apply(lambda x: x.weekofyear)
df['year'] = df['date'].apply(lambda x: x.year)
print(df)
Current Output
date week_of_year year
0 2018-12-31 1 2018
1 2019-01-01 1 2019
2 2019-12-31 1 2019
3 2020-01-01 1 2020
Expected Output
What I am expecting: for 2018 and 2019, the last date falls in the first week of the new year (2019 and 2020 respectively). I want to add logic so that where the week is 1 but the date belongs to the previous calendar year, the year column tracks that, as in the expected output.
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
Try:
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].dt.weekofyear
df['year'] = (df['date'] + pd.to_timedelta(6 - df['date'].dt.weekday, unit='d')).dt.year
Outputs:
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
A few things: generally avoid .apply(...).
For datetime columns you can work with dates directly through the df[col].dt accessor.
Then, to get the last day of the week, add 6 - weekday days to the date, where weekday runs from 0 (Monday) to 6 (Sunday).
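A worked instance of that arithmetic on the first row of the sample (a minimal sketch):
import pandas as pd
d = pd.to_datetime(pd.Series(['2018-12-31']))           # a Monday, weekday 0
print(d + pd.to_timedelta(6 - d.dt.weekday, unit='d'))  # 2019-01-06, whose year is 2019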
TLDR CODE
To get the week number as a series:
df['DATE'].dt.isocalendar().week
To set a new column to the week use same function and set series returned to a column:
df['WEEK'] = df['DATE'].dt.isocalendar().week
TLDR EXPLANATION
Use Series.dt.isocalendar().week to get the week for a given series object.
Note:
column "DATE" must be stored as a datetime column

Drop rows after particular year Pandas

I have a column in my dataframe that has years in the following format:
2018-19
2017-18
The years are object data type. I want to change the type of this column to datetime, then drop all rows before 1979-80. However, I tried to do that and I got formatting errors. What is the correct, or better way, of doing this?
BOS['Season'] = pd.to_datetime(BOS['Season'], format = '%Y%y')
I am quite new to Python, so I could appreciate it if you can tell me what I am doing wrong. Thanks!
I think the simplest approach here is to compare the years separately, e.g. keep rows where the starting year is before 2017:
print (BOS)
Season
0 1979-80
1 2018-19
2 2017-18
df = BOS[BOS['Season'].str.split('-').str[0].astype(int) < 2017]
print (df)
Season
0 1979-80
Details:
First, the values are split into lists by Series.str.split, and then the first element of each list is selected:
print (BOS['Season'].str.split('-'))
0 [1979, 80]
1 [2018, 19]
2 [2017, 18]
Name: Season, dtype: object
print (BOS['Season'].str.split('-').str[0])
0 1979
1 2018
2 2017
Name: Season, dtype: object
Or convert both years to separate columns:
BOS['start'] = pd.to_datetime(BOS['Season'].str.split('-').str[0], format='%Y').dt.year
BOS['end'] = BOS['start'] + 1
print (BOS)
Season start end
0 1979-80 1979 1980
1 2018-19 2018 2019
2 2017-18 2017 2018
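With those columns, the drop the question asks for becomes a plain numeric comparison (a sketch; keeps the 1979-80 season and later):
BOS = BOS[BOS['start'] >= 1979]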
I would use the .str.slice accessor of Series to select the part of the date I wish to keep and pass it into the pd.to_datetime() function. Then the selection with .loc[] and a boolean mask becomes easy.
import pandas as pd
data = {
    'date': ['2016-17', '2017-18', '2018-19', '2019-20']
}
df = pd.DataFrame(data)
print(df)
# date
# 0 2016-17
# 1 2017-18
# 2 2018-19
# 3 2019-20
df['date'] = pd.to_datetime(df['date'].str.slice(0, 4), format='%Y')
print(df)
# date
# 0 2016-01-01
# 1 2017-01-01
# 2 2018-01-01
# 3 2019-01-01
df = df.loc[ df['date'].dt.year < 2018 ]
print(df)
# date
# 0 2016-01-01
# 1 2017-01-01

Sorting Python data frame according to dates

How do I sort a pandas data frame according to dates in the format that can be seen in the image? The output I want is the same data frame, but with January 2013 and the corresponding amount at index 0, February 2013 at index 1, and so on.
import pandas as pd
df = pd.DataFrame({'Amount': ['54241.25', '54008.83', '54008.82'],
                   'Date': ['05/01/2015', '05/01/2017', '06/01/2017']})
df['Date'] = pd.to_datetime(df.Date)
df.sort_values('Date', inplace=True)
You just need to convert your Date column to a datetime, then you can sort the dataframe by that column
import pandas as pd
df = pd.DataFrame({'Date': ['05-2016', '05-2017', '06-2017', '01-2017', '02-2017'],
                   'Amount': [2, 5, 6, 3, 2]})
df['Date'] = pd.to_datetime(df['Date'], format='%m-%Y')
df = df.sort_values('Date').reset_index(drop=True)
Which gives:
Date Amount
0 2016-05-01 2
1 2017-01-01 3
2 2017-02-01 2
3 2017-05-01 5
4 2017-06-01 6
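If you want the month-name display the question describes ("January 2013" style) back after sorting, format the column as a final step (a sketch):
df['Date'] = df['Date'].dt.strftime('%B %Y')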
