I have a DataFrame representing customer check-ins (visits) to restaurants; year is simply the year in which a check-in happened.
What I want to do is add a column average_checkin to my initial DataFrame df that represents the average number of visits a restaurant gets per year.
import pandas as pd
import numpy as np

data = {
'restaurant_id': ['--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw','--1UhMGODdWsrMastO9DZw','--1UhMGODdWsrMastO9DZw','--1UhMGODdWsrMastO9DZw','--1UhMGODdWsrMastO9DZw','--6MefnULPED_I942VcFNA','--6MefnULPED_I942VcFNA','--6MefnULPED_I942VcFNA','--6MefnULPED_I942VcFNA'],
'year': ['2016','2016','2016','2016','2017','2017','2011','2011','2012','2012'],
}
df = pd.DataFrame(data, columns=['restaurant_id', 'year'])
# here I count the total number of check-ins a restaurant had
d = df.groupby('restaurant_id')['year'].count().to_dict()
df['nb_checkin'] = df['restaurant_id'].map(d)
mean_checkin = df.groupby(['restaurant_id', 'year']).agg({'nb_checkin': [np.mean]})
mean_checkin.columns = ['mean_checkin']
mean_checkin.reset_index()
# the values in mean_checkin make no sense
# I need to merge it with df to add that new column
I am still new to the pandas library; I tried something like the above, but my results make no sense. Is there something wrong with my syntax? If any clarification is needed, please ask.
The average number of visits per year can be calculated as the total number of visits a restaurant has, divided by the number of unique years you have data for.
grouped = df.groupby(["restaurant_id"])
avg_annual_visits = grouped["year"].count() / grouped["year"].nunique()
avg_annual_visits = avg_annual_visits.rename("avg_annual_visits")
print(avg_annual_visits)
restaurant_id
--1UhMGODdWsrMastO9DZw 3.0
--6MefnULPED_I942VcFNA 2.0
Name: avg_annual_visits, dtype: float64
Then if you wanted to merge it back to your original data:
df = df.merge(avg_annual_visits, left_on="restaurant_id", right_index=True)
print(df)
restaurant_id year avg_annual_visits
0 --1UhMGODdWsrMastO9DZw 2016 3.0
1 --1UhMGODdWsrMastO9DZw 2016 3.0
2 --1UhMGODdWsrMastO9DZw 2016 3.0
3 --1UhMGODdWsrMastO9DZw 2016 3.0
4 --1UhMGODdWsrMastO9DZw 2017 3.0
5 --1UhMGODdWsrMastO9DZw 2017 3.0
6 --6MefnULPED_I942VcFNA 2011 2.0
7 --6MefnULPED_I942VcFNA 2011 2.0
8 --6MefnULPED_I942VcFNA 2012 2.0
9 --6MefnULPED_I942VcFNA 2012 2.0
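If you'd rather skip the merge, the same column can be attached in one step with groupby.transform, which broadcasts the per-group result back onto every row. A sketch, using shortened ids standing in for the restaurant ids above:

```python
import pandas as pd

# shortened ids standing in for the restaurant ids above
df = pd.DataFrame({
    'restaurant_id': ['a'] * 6 + ['b'] * 4,
    'year': ['2016'] * 4 + ['2017'] * 2 + ['2011'] * 2 + ['2012'] * 2,
})

# count of rows per restaurant divided by its number of distinct years,
# broadcast back onto every row
g = df.groupby('restaurant_id')['year']
df['avg_annual_visits'] = g.transform('count') / g.transform('nunique')
```

This gives the same 3.0 and 2.0 values per restaurant without building a separate Series first.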
df: (DataFrame)
Open High Close Volume
2020/1/1 1 2 3 323232
2020/1/2 2 3 4 321321
....
2020/12/31 4 5 6 123213
....
2021
The output I need is (Graph No. 1):
Open High Close Volume Year_Sum_Volume
2020/1/1 1 2 3 323232 (323232 + 321321 +....+ 123213)
2020/1/2 2 3 4 321321 (323232 + 321321 +....+ 123213)
....
2020/12/31 4 5 6 123213 (323232 + 321321 +....+ 123213)
....
2021 (x+x+x.....x)
I want the sum of Volume within each year (Year_Sum_Volume is the total volume for that year).
This is the code I tried for calculating the sum of volume in each year, but how can I add this data
back to the daily data? I want to add Year_Sum_Volume to df, like in Graph No. 1:
df.resample('Y', on='Date')['Volume'].sum()
Thank you for answering.
I believe groupby.sum() and merge should be your friends
import pandas as pd
df = pd.DataFrame({"date":['2021-12-30', '2021-12-31', '2022-01-01'], "a":[1,2.1,3.2]})
df.date = pd.to_datetime(df.date)
df["year"] = df.date.dt.year
df_sums = df.groupby("year")[["a"]].sum().rename(columns={"a": "a_sum"})
df = df.merge(df_sums, right_index=True, left_on="year")
which gives:
        date    a  year  a_sum
0 2021-12-30  1.0  2021    3.1
1 2021-12-31  2.1  2021    3.1
2 2022-01-01  3.2  2022    3.2
Based on your output, Year_Sum_Volume is the same value for every row and can be calculated using df['Volume'].sum().
Then you join a column built by repeating that value:
df.join(pd.DataFrame({'Year_Sum_Volume': [your_sum_val] * len(df)}))
Try the code below (after converting the date column with pd.to_datetime):
df.assign(Year_Sum_Volume = df.groupby(df['date'].dt.year)['a'].transform('sum'))
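Applied to the question's Date/Volume column names (an assumption here, along with the made-up sample values), the transform approach looks like this:

```python
import pandas as pd

# hypothetical daily data with the question's Date/Volume column names
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-12-31', '2021-01-01']),
    'Volume': [323232, 321321, 123213, 111111],
})

# transform broadcasts each year's total back onto its daily rows
df['Year_Sum_Volume'] = df.groupby(df['Date'].dt.year)['Volume'].transform('sum')
```

Every 2020 row gets the 2020 total, and every 2021 row gets the 2021 total, matching Graph No. 1.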
I am new to pandas and trying to figure out how to calculate the percentage change (difference) between two years, given that sometimes there is no previous year.
I am given a dataframe as follows:
company date amount
1 Company 1 2020 3
2 Company 1 2021 1
3 COMPANY2 2020 7
4 Company 3 2020 4
5 Company 3 2021 4
.. ... ... ...
766 Company N 2021 9
765 Company N 2020 1
767 Company XYZ 2021 3
768 Company X 2021 3
769 Company Z 2020 2
I wrote something like this:
for company in unique(df2.company):
    company_df = df2[df2.company == company]
    company_df.sort_values(by="date")
    company_df_year = company_df.amount.tolist()
    company_df_year.pop()
    company_df_year.insert(0, 0)
    company_df["value_year_before"] = company_df_year
    if any in company_df.value_year_before == None:
        company_df["diff"] = 0
    else:
        company_df["diff"] = (company_df.amount - company_df.value_year_before) / company_df.value_year_before
    df2["ratio"] = company_df["diff"]
But I keep getting NaN.
Where did I make a mistake?
The main issue is that you are overwriting company_df in each iteration of the loop and only keeping the last one.
However, normally when using Pandas if you are starting to use a for loop then you are doing something wrong and there is an easier way to accomplish the goal. Here you could use groupby and pct_change to compute the ratio of each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
Groupby keeps the order of the rows within each group, so we sort beforehand to ensure the dates are in the correct order, and fillna replaces any NaNs with 0.
Result:
company date amount ratio
3 COMPANY2 2020 7 0.000000
1 Company 1 2020 3 0.000000
2 Company 1 2021 1 -0.666667
4 Company 3 2020 4 0.000000
5 Company 3 2021 4 0.000000
765 Company N 2020 1 0.000000
766 Company N 2021 9 8.000000
768 Company X 2021 3 0.000000
767 Company XYZ 2021 3 0.000000
769 Company Z 2020 2 0.000000
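As a minimal, self-contained sketch of the approach above (sample values taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Company 1', 'Company 1', 'COMPANY2', 'Company 3', 'Company 3'],
    'date': [2020, 2021, 2020, 2020, 2021],
    'amount': [3, 1, 7, 4, 4],
})

# sort so pct_change compares each year with the previous one per company
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change().fillna(0.0)
```

Companies with only one year of data (like COMPANY2) simply get a ratio of 0.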
Apply an anonymous function that calculates the percentage change and returns it when there is more than one value. Use:
df = pd.DataFrame({'company': [1,1,3], 'date':[2020,2021,2020], 'amount': [4,5,7]})
df.groupby('company')['amount'].apply(lambda x: (x.iloc[1] - x.iloc[0]) / x.iloc[0] if len(x) > 1 else 'not enough values')
Output:
company
1                 0.25
3    not enough values
Name: amount, dtype: object
I have a df:
company year revenues
0 company 1 2019 1,425,000,000
1 company 1 2018 1,576,000,000
2 company 1 2017 1,615,000,000
3 company 1 2016 1,498,000,000
4 company 1 2015 1,569,000,000
5 company 2 2019 nan
6 company 2 2018 1,061,757,075
7 company 2 2017 nan
8 company 2 2016 573,414,893
9 company 2 2015 599,402,347
I would like to fill the NaN values in a specific order: linearly interpolate first, then forward fill, then backward fill. I currently have:
f_2_impute = [x for x in cl_data.columns if cl_data[x].dtypes != 'O' and 'total' not in x and 'year' not in x]
def ffbf(x):
return x.ffill().bfill()
group_with = ['company']
for x in cl_data[f_2_impute]:
cl_data[x] = cl_data.groupby(group_with)[x].apply(lambda fill_it: ffbf(fill_it))
which performs ffill() and bfill(). Ideally I want a function that first tries to linearly interpolate the missing values, then forward fills and then backward fills them.
Any quick way of achieving this? Thank you in advance.
I believe you first need to convert the column to floats. Either do it while reading the file:
df = pd.read_csv(file, thousands=',')
Or:
df['revenues'] = df['revenues'].replace(',','', regex=True).astype(float)
and then add DataFrame.interpolate:
def ffbf(x):
return x.interpolate().ffill().bfill()
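Putting both pieces together, a runnable sketch (with made-up revenue figures standing in for the question's data) might look like:

```python
import pandas as pd
import numpy as np

# made-up revenue figures standing in for the question's data
df = pd.DataFrame({
    'company': ['company 1'] * 5 + ['company 2'] * 5,
    'year': [2019, 2018, 2017, 2016, 2015] * 2,
    'revenues': [1425.0, 1576.0, 1615.0, 1498.0, 1569.0,
                 np.nan, 1061.0, np.nan, 573.0, 599.0],
})

def ffbf(x):
    # interpolate between known values first, then fill the leading/trailing gaps
    return x.interpolate().ffill().bfill()

# apply the fill chain within each company group
df['revenues'] = df.groupby('company')['revenues'].transform(ffbf)
```

Interior gaps get interpolated (the NaN between 1061 and 573 becomes 817), while a leading NaN, which interpolate cannot reach, is picked up by bfill.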
I have a DataFrame representing customer check-ins (visits) to restaurants; year is simply the year in which a check-in happened.
What I want to do is add a column std_checkin to my initial DataFrame df that represents the standard deviation of visits per year. So, I need to calculate the standard deviation of the total visits per year.
import pandas as pd

data = {
'restaurant_id': ['--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw','--1UhMGODdWsrMastO9DZw','--1UhMGODdWsrMastO9DZw','--1UhMGODdWsrMastO9DZw','--1UhMGODdWsrMastO9DZw','--6MefnULPED_I942VcFNA','--6MefnULPED_I942VcFNA','--6MefnULPED_I942VcFNA','--6MefnULPED_I942VcFNA'],
'year': ['2016','2016','2016','2016','2017','2017','2011','2011','2012','2012'],
}
df = pd.DataFrame(data, columns=['restaurant_id', 'year'])
# total number of checkins per restaurant
d = df.groupby('restaurant_id')['year'].count().to_dict()
df['nb_checkin'] = df['restaurant_id'].map(d)
grouped = df.groupby(["restaurant_id"])
avg_annual_visits = grouped["year"].count() / grouped["year"].nunique()
avg_annual_visits = avg_annual_visits.rename("avg_annual_visits")
df = df.merge(avg_annual_visits, left_on="restaurant_id", right_index=True)
df.head(10)
From here, I'm not sure how to write what I want with pandas. If any clarification is needed, please ask.
Thank you!
I think you want to do:
counts = df.groupby('restaurant_id')['year'].value_counts()
counts.groupby(level='restaurant_id').std()
Output for counts, which is total visit per restaurant per year:
restaurant_id year
--1UhMGODdWsrMastO9DZw 2016 4
2017 2
--6MefnULPED_I942VcFNA 2011 2
2012 2
Name: year, dtype: int64
And output for std
restaurant_id
--1UhMGODdWsrMastO9DZw 1.414214
--6MefnULPED_I942VcFNA 0.000000
Name: year, dtype: float64
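If you then want that standard deviation attached back to df as the std_checkin column the question asks for, one way (a sketch with shortened ids standing in for the restaurant ids) is to map the per-restaurant result onto the rows:

```python
import pandas as pd

# shortened ids standing in for the restaurant ids above
df = pd.DataFrame({
    'restaurant_id': ['a'] * 6 + ['b'] * 4,
    'year': ['2016'] * 4 + ['2017'] * 2 + ['2011'] * 2 + ['2012'] * 2,
})

# visits per restaurant per year, then the std of those counts per restaurant
counts = df.groupby('restaurant_id')['year'].value_counts()
std = counts.groupby(level='restaurant_id').std()

# broadcast the per-restaurant std back onto every row
df['std_checkin'] = df['restaurant_id'].map(std)
```

Each row of a restaurant ends up carrying that restaurant's standard deviation (about 1.414 for the first id, 0.0 for the second).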
In an Excel sheet with columns Rainfall / Year / Month, I want to sum rainfall data per year. That is, for instance, for the year 2000, from month 1 to 12, summing all the Rainfall cells into a new one.
I tried using pandas in Python but cannot manage (just started coding). How can I proceed? Any help is welcome, thanks!
Here the head of the data (which has been downloaded):
rainfall (mm) \tyear month country iso3 iso2
0 120.54000 1990 1 ECU NaN NaN
1 231.15652 1990 2 ECU NaN NaN
2 136.62088 1990 3 ECU NaN NaN
3 203.47653 1990 4 ECU NaN NaN
4 164.20956 1990 5 ECU NaN NaN
Use groupby and aggregate with sum if you need the sum for every year:
df = df.groupby('\tyear')['rainfall (mm)'].sum()
But if you need only one value:
df.loc[df['\tyear'] == 2000, 'rainfall (mm)'].sum()
If you just want the year 2000, use
df[df['\tyear'] == 2000]['rainfall (mm)'].sum()
Otherwise, jezrael's answer is nice because it sums rainfall (mm) for each distinct value of \tyear.
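Both variants in a runnable sketch (the 1990 values and the literal tab in the '\tyear' column name are taken from the head of the data shown above; the 2000 figure is made up):

```python
import pandas as pd

# small stand-in for the downloaded data; note the literal tab
# in the '\tyear' column name, matching the head shown above
df = pd.DataFrame({
    'rainfall (mm)': [120.54000, 231.15652, 136.62088, 50.0],
    '\tyear': [1990, 1990, 1990, 2000],
    'month': [1, 2, 3, 1],
})

# total rainfall for every year
per_year = df.groupby('\tyear')['rainfall (mm)'].sum()

# total rainfall for the year 2000 only
year_2000 = df.loc[df['\tyear'] == 2000, 'rainfall (mm)'].sum()
```

per_year is a Series indexed by year, so per_year[2000] gives the same number as the filtered sum.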