Pandas: calculate the std of total column value per "year" - python

I have a data frame representing customers' checkins (visits) to restaurants. year is simply the year in which a checkin at a restaurant happened.
What I want to do is add a column std_checkin to my initial DataFrame df that represents the standard deviation of visits per year. So I need to calculate the standard deviation of the total visits per year.
import pandas as pd

data = {
    'restaurant_id': ['--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA'],
    'year': ['2016', '2016', '2016', '2016', '2017', '2017', '2011', '2011', '2012', '2012'],
}
df = pd.DataFrame(data, columns=['restaurant_id', 'year'])
# total number of checkins per restaurant
d = df.groupby('restaurant_id')['year'].count().to_dict()
df['nb_checkin'] = df['restaurant_id'].map(d)
grouped = df.groupby(["restaurant_id"])
avg_annual_visits = grouped["year"].count() / grouped["year"].nunique()
avg_annual_visits = avg_annual_visits.rename("avg_annual_visits")
df = df.merge(avg_annual_visits, left_on="restaurant_id", right_index=True)
df.head(10)
From here, I'm not sure how to write what I want with pandas. If any clarification is needed, please ask.
Thank you!

I think you want to do:
counts = df.groupby('restaurant_id')['year'].value_counts()
counts.std(level='restaurant_id')
Output for counts, which is the total number of visits per restaurant per year:
restaurant_id           year
--1UhMGODdWsrMastO9DZw  2016    4
                        2017    2
--6MefnULPED_I942VcFNA  2011    2
                        2012    2
Name: year, dtype: int64
And the output for the std:
restaurant_id
--1UhMGODdWsrMastO9DZw 1.414214
--6MefnULPED_I942VcFNA 0.000000
Name: year, dtype: float64
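Note that Series.std(level=...) was removed in pandas 2.0; on current versions the equivalent is a groupby on the index level. A minimal sketch that also maps the result back onto df as the std_checkin column the question asks for:
counts = df.groupby('restaurant_id')['year'].value_counts()
std_per_restaurant = counts.groupby(level='restaurant_id').std()
df['std_checkin'] = df['restaurant_id'].map(std_per_restaurant)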

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but found none that point me in the right direction; hopefully someone here can help. I have a stock price data set with a frequency of month start. I am trying to get an output where the calendar years are the column names and the day and month are the index (there will only be 12 rows since the data is monthly). The rows will be filled with the stock prices corresponding to the year and month. Unfortunately I have no code, since I have looked at for loops, groupby, etc. but can't figure this one out.
You might want to split the date into month and year and apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
)
output:
year   2003  2004
month
1         0     2
2         0     3
3         0     4
12        1     0
Used input:
df = pd.DataFrame({'Close': [1, 2, 3, 4]},
                  index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
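If you also want the day in the index (the question asks for month and day), a minimal variant of the sketch above formats the index as mm-dd strings; monthday is a hypothetical column name:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, monthday=s.strftime('%m-%d'))
       .pivot_table(index='monthday', columns='year', values='Close', fill_value=0)
)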
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd

# Test DataFrame
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
                   'Close': [6.661, 7.053, 6.625, 8.999]})

# Split each date string into a list of the form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))

# Expand the date-list column into two separate columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)

# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
       Close
Year    2003   2004
Date
01-01    NaN  7.053
02-01    NaN  6.625
12-01  6.661  8.999
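Note that pivot leaves NaN where a year has no value for a given day; if you prefer zeros, as in the first answer's fill_value=0, you can fill after pivoting:
df = df.pivot(columns='Year', index='Date').fillna(0)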

Find the missing month in a given date range, then add that missing date to the data with the same records as the last given date

I have a statement of accounts with a Unique ID, disbursed date, payment date, and balance amount.
Date range for the data below = disbursed date to May-2022
Example data:
Unique  Disbursed date  payment date  balance amount
123     2022-Jan-13     2022-Jan-27           10,000
123     2022-Jan-13     2022-Feb-28            5,000
123     2022-Jan-13     2022-Apr-29            2,000
First I want to group by payment date (the last day of each month) and, as the aggregation function, instead of sum or mean I want to carry forward the balance reflected on the last day of the month.
As you can see, March is missing from the records; here I want to add a new record for March with the same balance as in Feb-22, i.e. 5,000, and the date for the new record should be the last day of Mar-22.
Since the date range runs until 2022-May, I also want to add another new record for May-22 with the same balance as in the previous month (Apr-22), i.e. 2,000, and the date for the new record should be the last day of May-22.
Note: I have multiple unique ids like 123, 456, 789, etc.
I tried the code below to find the missing months:
for i in df['date']:
    pd.date_range(i, '2020-11-28').difference(df.index)
    print(i)
but it gives the missing dates day-wise. I want to find the missing month, rather than the missing days, for each unique id.
You can use:
# generate the needed month ends
idx = pd.date_range('2022-01', '2022-06', freq='M')

out = (df
       # compute the month end for the existing data
       .assign(month_end=pd.to_datetime(df['payment date'])
                           .sub(pd.Timedelta('1d'))
                           .add(pd.offsets.MonthEnd()))
       .set_index(['Unique', 'month_end'])
       # reindex with the missing ID/month-end combinations
       .reindex(pd.MultiIndex.from_product([df['Unique'].unique(), idx],
                                           names=['Unique', 'idx']))
       .reset_index()
       # fill the missing payment dates in the correct format
       .assign(**{'payment date': lambda d:
                  d['payment date'].fillna(d['idx'].dt.strftime('%Y-%b-%d'))})
       # forward-fill the data per ID
       .groupby('Unique').ffill()
)
output:
Unique idx Disbursed date payment date balance amount
0 123 2022-01-31 2022-Jan-13 2022-Jan-27 10,000
1 123 2022-02-28 2022-Jan-13 2022-Feb-28 5,000
2 123 2022-03-31 2022-Jan-13 2022-Mar-31 5,000
3 123 2022-04-30 2022-Jan-13 2022-Apr-29 2,000
4 123 2022-05-31 2022-Jan-13 2022-May-31 2,000
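As an aside, if you only need to list the missing months per id rather than build the filled frame, a minimal sketch using PeriodIndex (assuming the Jan-May 2022 range from the question; note also that on pandas >= 2.2 the month-end alias for date_range is 'ME' rather than 'M'):
months = pd.to_datetime(df['payment date']).dt.to_period('M')
full = pd.period_range('2022-01', '2022-05', freq='M')  # 'M' is still the alias for periods

for uid, grp in months.groupby(df['Unique']):
    print(uid, full.difference(pd.PeriodIndex(grp)))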

Pandas: calculate mean of Dataframe column values per "year"

I have a data frame representing customers' checkins (visits) to restaurants. year is simply the year in which a checkin at a restaurant happened.
What I want to do is add a column average_checkin to my initial DataFrame df that represents the average number of visits to a restaurant per year.
import numpy as np
import pandas as pd

data = {
    'restaurant_id': ['--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA'],
    'year': ['2016', '2016', '2016', '2016', '2017', '2017', '2011', '2011', '2012', '2012'],
}
df = pd.DataFrame(data, columns=['restaurant_id', 'year'])

# here I count the total number of checkins each restaurant had
d = df.groupby('restaurant_id')['year'].count().to_dict()
df['nb_checkin'] = df['restaurant_id'].map(d)

mean_checkin = df.groupby(['restaurant_id', 'year']).agg({'nb_checkin': [np.mean]})
mean_checkin.columns = ['mean_checkin']
mean_checkin.reset_index()
# the values in mean_checkin make no sense
# I need to merge it with df to add that new column
I am still new to the pandas library; I tried something like this, but my results make no sense. Is there something wrong with my syntax? If any clarification is needed, please ask.
The average number of visits per year can be calculated as the total number of visits a restaurant has, divided by the number of unique years you have data for.
grouped = df.groupby(["restaurant_id"])
avg_annual_visits = grouped["year"].count() / grouped["year"].nunique()
avg_annual_visits = avg_annual_visits.rename("avg_annual_visits")
print(avg_annual_visits)
restaurant_id
--1UhMGODdWsrMastO9DZw 3.0
--6MefnULPED_I942VcFNA 2.0
Name: avg_annual_visits, dtype: float64
Then if you wanted to merge it back to your original data:
df = df.merge(avg_annual_visits, left_on="restaurant_id", right_index=True)
print(df)
restaurant_id year avg_annual_visits
0 --1UhMGODdWsrMastO9DZw 2016 3.0
1 --1UhMGODdWsrMastO9DZw 2016 3.0
2 --1UhMGODdWsrMastO9DZw 2016 3.0
3 --1UhMGODdWsrMastO9DZw 2016 3.0
4 --1UhMGODdWsrMastO9DZw 2017 3.0
5 --1UhMGODdWsrMastO9DZw 2017 3.0
6 --6MefnULPED_I942VcFNA 2011 2.0
7 --6MefnULPED_I942VcFNA 2011 2.0
8 --6MefnULPED_I942VcFNA 2012 2.0
9 --6MefnULPED_I942VcFNA 2012 2.0
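Alternatively, transform adds the column in one step without building a separate Series and merging; a sketch, using average_checkin as the column name from the question:
g = df.groupby('restaurant_id')['year']
df['average_checkin'] = g.transform('count') / g.transform('nunique')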

How to group a dataframe by year or week, and how to combine two datasets in Python

Dataset 1: Sales Representative ID, Customer ID, Order Date, Revenue
Dataset 2: Manager ID, Sales Representative ID, Create Date, Termination Date
Given the two datasets above, where Dataset 1 represents daily revenue data related to a customer and the sales representative associated with that customer, and Dataset 2 maps each sales representative to the manager id associated with them at that particular point in time (Create Date is when a new association is created and Termination Date is when the association is terminated):
I have to calculate year-, month-, week- and day-wise revenue for each manager id for every date.
Output dataset: Order Date, Year/Month/Week/Day, Manager ID, Total Revenue
I am confused by two things here: how to combine these two datasets, and how to get the revenue week-, year- and day-wise; I don't know of any way in pandas to group by them like that. Please help.
dataset1 = {
    'srid': [1, 2, 3, 1, 5],
    'custid': [11, 12, 43, 12, 34],
    'orderdate': ["1/2/2019", "1/2/2019", "2/2/2019", "1/2/2019", "1/2/2019"],
    'Rev': [100, 101, 102, 103, 17]
}
dataset2 = {
    'manid': [101, 102, 103, 104, 105],
    'srid': [1, 2, 1, 3, 5],
    'CreateDate': ["1/1/2019", "1/1/2019", "3/1/2019", "1/1/2019", "1/1/2019"],
    'TerminationDate': ["2/1/2019", "3/1/2019", "5/1/2019", "2/1/2019", "2/1/2019"]
}
Try this:
df1 = pd.DataFrame(dataset1)
df2 = pd.DataFrame(dataset2)
df = df1.merge(df2, on=['srid'])
df['orderdate'] = pd.to_datetime(df['orderdate'])
df['CreateDate'] = pd.to_datetime(df['CreateDate'])
df['TerminationDate'] = pd.to_datetime(df['TerminationDate'])
# Daily
df_d = df.groupby(by=['manid', pd.Grouper(key='orderdate', freq='D')]).agg({'Rev': 'sum'})
# Monthly
df_m = df.groupby(by=['manid', pd.Grouper(key='orderdate', freq='M')]).agg({'Rev': 'sum'})
# Weekly
df_w = df.groupby(by=['manid', pd.Grouper(key='orderdate', freq='W')]).agg({'Rev': 'sum'})
# Yearly
df_y = df.groupby(by=['manid', pd.Grouper(key='orderdate', freq='Y')]).agg({'Rev': 'sum'})
print(df_y)
Rev
manid orderdate
101 2019-12-31 203
102 2019-12-31 101
103 2019-12-31 203
104 2019-12-31 102
105 2019-12-31 17
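One caveat with the plain merge: a sales representative can map to several managers over time (srid 1 maps to both 101 and 103 here), so each order is counted for every manager it merges with. If only the manager active on the order date should receive the revenue, you could filter the merged frame by the association window before grouping (a sketch, assuming that is the intended semantics):
df = df[(df['orderdate'] >= df['CreateDate']) & (df['orderdate'] <= df['TerminationDate'])]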
Great answer above. You can also easily use df.resample(rule='MS').sum() for different time intervals (M: month end, MS: month start, D: day, and so on).
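Note that resample needs a DatetimeIndex (or an on= column); combined with the manager grouping, a sketch might look like:
df_m = df.set_index('orderdate').groupby('manid')['Rev'].resample('MS').sum()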

Convert and Assign Pandas Series to a dataframe to create CSV

I've got order data with SKUs inside and would like to find out how often each SKU has been bought per month over the last 3 years.
for row in df_skus.iterrows():
    df_filtered = df_orders.loc[df_orders['item_sku'] == row[1]['sku']]
    # Keep only the wanted columns:
    df_filtered = df_filtered[['txn_id', 'date', 'item_sku']].copy()
    # Group by year and month:
    df_result = df_filtered['date'].groupby([df_filtered.date.dt.year, df_filtered.date.dt.month]).agg('count')
    print(df_result)
    print(type(df_result))
The (shortened) result looks good so far:
date date
2017 3 1
Name: date, dtype: int64
date date
2017 2 1
3 6
4 1
6 1
Name: date, dtype: int64
Now, I'd like to create a CSV which looks like that:
SKU 2017-01 2017-02 2017-03
17 0 0 1
18 0 1 3
Is it possible to simply 'convert' my data into the desired structure?
I do these kinds of calculations all the time, and this seems to be the fastest approach.
import pandas as pd

df_orders = df_orders[df_orders["item_sku"].isin(df_skus["sku"])]
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date", freq="M")]).size()
monthly_sales = monthly_sales.unstack(0)
monthly_sales.to_csv("my_csv.csv")
The first line filters to the SKUs you want.
The second line groups and counts the number of sales per SKU per month.
The next line unstacks the multi-index into the wide format you want.
The last line exports to CSV.
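The unstack(0) call puts the SKUs in the columns and the months in the rows; to match the layout sketched in the question (SKUs as rows, year-month columns), one possible variant unstacks the date level instead and formats the column labels:
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date", freq="M")]).size()
wide = monthly_sales.unstack(1, fill_value=0)   # months become the columns
wide.columns = wide.columns.strftime("%Y-%m")   # e.g. 2017-01, 2017-02, ...
wide.to_csv("my_csv.csv")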
