How should I find mean quarterly sales by store using pandas - python

I have the below dataframe
Jan Feb Mar Apr May Jun July Aug Sep Oct Nov Dec
store_id
S_1 8.0 20.0 13.0 21.0 17.0 20.0 24.0 17.0 16.0 9.0 7.0 6.0
S_10 14.0 23.0 20.0 11.0 12.0 13.0 19.0 6.0 5.0 22.0 17.0 16.0
and I want to calculate the mean of each store per quarter:
Q1 Q2 Q3 Q4
store_id
S_1 13.67 19.33 19.0 7.33
S_10 19.0 12.0 10.0 18.33
How can this be achieved?

Convert the column names to quarters with DatetimeIndex.quarter and aggregate; this also works correctly if the order of the columns changes:
#if necessary
df = df.rename(columns={'July':'Jul'})
df = (df.groupby(pd.to_datetime(df.columns, format='%b').quarter, axis=1)
.mean()
.add_prefix('Q')
.round(2))
print(df)
Q1 Q2 Q3 Q4
store_id
S_1 13.67 19.33 19.0 7.33
S_10 19.00 12.00 10.0 18.33

Assuming you have the columns in order, use groupby on axis=1:
import numpy as np
out = df.groupby([np.arange(df.shape[1])//3+1], axis=1).mean().add_prefix('Q')
output:
Q1 Q2 Q3 Q4
store_id
S_1 13.666667 19.333333 19.0 7.333333
S_10 19.000000 12.000000 10.0 18.333333
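Both snippets above pass axis=1 to groupby, which recent pandas releases (2.1 and later, as far as I know) flag as deprecated. A minimal sketch of the same quarterly mean that transposes instead, assuming the month columns are already three-letter abbreviations (the frame below is rebuilt by hand from the question's data):

```python
import pandas as pd

# Sample data copied from the question, with 'July' already renamed to 'Jul'.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df = pd.DataFrame(
    [[8.0, 20.0, 13.0, 21.0, 17.0, 20.0, 24.0, 17.0, 16.0, 9.0, 7.0, 6.0],
     [14.0, 23.0, 20.0, 11.0, 12.0, 13.0, 19.0, 6.0, 5.0, 22.0, 17.0, 16.0]],
    index=pd.Index(["S_1", "S_10"], name="store_id"),
    columns=months,
)

# Quarter number (1-4) for each month column, in column order.
quarters = pd.to_datetime(df.columns, format="%b").quarter

# Transpose so months become rows, group the rows by quarter, transpose back.
out = df.T.groupby(quarters).mean().T.add_prefix("Q").round(2)
print(out)
```

The double transpose is only there to avoid the axis argument; the grouping key is the same one the first answer uses.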

Related

How to concatenate variable string data to a row in a dataframe based on numeric value

I have a pandas dataframe result, looks like this:
Weekday Day Store1 Store2 Store3 Store4 Store5
0 Mon 6 0.0 0.0 0.0 0.0 0.0
1 Tue 7 42.0 33.0 23.0 42.0 21.0
2 Wed 8 43.0 29.0 13.0 33.0 22.0
3 Thu 9 45.0 24.0 20.0 29.0 18.0
4 Fri 10 48.0 21.0 22.0 37.0 22.0
5 Sat 11 34.0 22.0 23.0 34.0 18.0
0 Mon 13 39.0 21.0 21.0 25.0 21.0
1 Tue 14 39.0 20.0 18.0 0.0 19.0
2 Wed 15 46.0 26.0 18.0 31.0 24.0
3 Thu 16 38.0 21.0 15.0 45.0 29.0
4 Fri 17 42.0 21.0 21.0 41.0 20.0
5 Sat 18 40.0 25.0 15.0 36.0 19.0
0 Mon 20 39.0 22.0 23.0 36.0 19.0
1 Tue 21 31.0 18.0 16.0 35.0 23.0
2 Wed 22 33.0 25.0 17.0 39.0 22.0
3 Thu 23 34.0 24.0 19.0 18.0 27.0
4 Fri 24 33.0 18.0 24.0 43.0 24.0
5 Sat 25 38.0 22.0 20.0 40.0 12.0
0 Mon 27 41.0 21.0 18.0 31.0 23.0
1 Tue 28 32.0 21.0 14.0 23.0 14.0
2 Wed 29 33.0 18.0 15.0 19.0 23.0
3 Thu 30 36.0 21.0 21.0 23.0 18.0
4 Fri 1 40.0 30.0 24.0 38.0 23.0
5 Sat 2 40.0 19.0 22.0 38.0 21.0
Notice how Day goes from 6 to 30, then back to 1 and 2. In this example, it refers to September 6, 2021 through October 2, 2021.
I currently have the variables PrimaryMonth = September and SecondaryMonth = October.
I know that I can do result['Month'] = 'September', but that sets every Month value to September. I'd like to find a way, if possible, to iterate through the rows so that the final rows with Day 1 and 2 show October in the new Month column.
Is it possible to use a for loop or some other iteration to accomplish this? I was initially brainstorming some pseudocode:
#for row in result:
# while Day <= 31
#concat PrimaryMonth
#else concat SecondaryMonth
You can kind of get an idea of where I want to go with this.
Many things are easier if you use proper date formats...
date_str = 'Monday, September 6, 2021 - Saturday, October 2, 2021'
new_index = pd.date_range(*map(pd.to_datetime, date_str.split(' - ')))
dates = pd.DataFrame(index=new_index)
dates['day'] = dates.index.day
dates.columns = ['Day']
df = pd.merge(dates, df, 'outer')
df.index = dates.index
df['month'] = df.index.month_name()
print(df.dropna())
Output:
Day Weekday Store1 Store2 Store3 Store4 Store5 month
2021-09-06 6 Mon 0.0 0.0 0.0 0.0 0.0 September
2021-09-07 7 Tue 42.0 33.0 23.0 42.0 21.0 September
2021-09-08 8 Wed 43.0 29.0 13.0 33.0 22.0 September
2021-09-09 9 Thu 45.0 24.0 20.0 29.0 18.0 September
2021-09-10 10 Fri 48.0 21.0 22.0 37.0 22.0 September
2021-09-11 11 Sat 34.0 22.0 23.0 34.0 18.0 September
2021-09-13 13 Mon 39.0 21.0 21.0 25.0 21.0 September
2021-09-14 14 Tue 39.0 20.0 18.0 0.0 19.0 September
2021-09-15 15 Wed 46.0 26.0 18.0 31.0 24.0 September
2021-09-16 16 Thu 38.0 21.0 15.0 45.0 29.0 September
2021-09-17 17 Fri 42.0 21.0 21.0 41.0 20.0 September
2021-09-18 18 Sat 40.0 25.0 15.0 36.0 19.0 September
2021-09-20 20 Mon 39.0 22.0 23.0 36.0 19.0 September
2021-09-21 21 Tue 31.0 18.0 16.0 35.0 23.0 September
2021-09-22 22 Wed 33.0 25.0 17.0 39.0 22.0 September
2021-09-23 23 Thu 34.0 24.0 19.0 18.0 27.0 September
2021-09-24 24 Fri 33.0 18.0 24.0 43.0 24.0 September
2021-09-25 25 Sat 38.0 22.0 20.0 40.0 12.0 September
2021-09-27 27 Mon 41.0 21.0 18.0 31.0 23.0 September
2021-09-28 28 Tue 32.0 21.0 14.0 23.0 14.0 September
2021-09-29 29 Wed 33.0 18.0 15.0 19.0 23.0 September
2021-09-30 30 Thu 36.0 21.0 21.0 23.0 18.0 September
2021-10-01 1 Fri 40.0 30.0 24.0 38.0 23.0 October
2021-10-02 2 Sat 40.0 19.0 22.0 38.0 21.0 October
And no, no matter what you do, a for-loop is probably the wrong answer when it comes to pandas.
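If only the two month names are available rather than full dates, a loop-free alternative is to look for the point where Day drops. This is a rough sketch, assuming the rows are in chronological order and span at most two months; the small result frame and the lowercase variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's `result`, keeping only the Day column.
result = pd.DataFrame({"Day": [27, 28, 29, 30, 1, 2]})
primary_month, secondary_month = "September", "October"

# A negative difference marks the rollover; cumsum keeps the flag set from there on.
rolled_over = result["Day"].diff().lt(0).cumsum().gt(0)

# Rows before the rollover get the first month, the rest get the second.
result["Month"] = np.where(rolled_over, secondary_month, primary_month)
print(result)
```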

How do I manipulate a Dataframe with Pivot_Table in Python

I have spent much time on this but I am nowhere closer to a solution.
I have a dataframe which outputs as
RegionID AreaID Year Jan Feb Mar Apr May Jun
0 20.0 1.0 2020.0 1174.0 1056.0 1051.0 1107.0 1097.0 1118.0
1 19.0 2.0 2020.0 460.0 451.0 421.0 421.0 420.0 457.0
2 20.0 3.0 2020.0 2723.0 2594.0 2590.0 2399.0 2377.0 2331.0
3 21.0 4.0 2020.0 863.0 859.0 813.0 785.0 757.0 765.0
4 19.0 5.0 2020.0 4037.0 3942.0 4069.0 3844.0 3567.0 3721.0
5 19.0 6.0 2020.0 1695.0 1577.0 1531.0 1614.0 1671.0 1693.0
6 18.0 7.0 2020.0 1757.0 1505.0 1445.0 1514.0 1406.0 1444.0
7 18.0 8.0 2020.0 832.0 721.0 747.0 852.0 885.0 872.0
8 18.0 9.0 2020.0 2538.0 2000.0 2026.0 1981.0 1987.0 1949.0
9 21.0 10.0 2020.0 1145.0 1235.0 1114.0 1161.0 1150.0 1189.0
10 20.0 11.0 2020.0 551.0 497.0 503.0 472.0 505.0 532.0
11 19.0 12.0 2020.0 1664.0 1526.0 1389.0 1373.0 1384.0 1404.0
12 21.0 13.0 2020.0 381.0 351.0 299.0 286.0 297.0 319.0
13 21.0 14.0 2020.0 1733.0 1627.0 1567.0 1561.0 1498.0 1511.0
14 18.0 15.0 2020.0 1257.0 1257.0 1160.0 1172.0 1124.0 1113.0
I want to pivot this data so that I have a month combined field like below
RegionID AreaID Year Month Amount
20.0 1.0 2020 Jan 1174
20.0 1.0 2020 Feb 1056
20.0 1.0 2020 Mar 1051
Can this be done using pandas? I have been trying with pivot_table but I can't get it to work.
I hope I've understood your question well. You can .set_index() and then .stack():
print(
df.set_index(["RegionID", "AreaID", "Year"])
.stack()
.reset_index()
.rename(columns={"level_3": "Month", 0: "Amount"})
)
Prints:
RegionID AreaID Year Month Amount
0 20.0 1.0 2020.0 Jan 1174.0
1 20.0 1.0 2020.0 Feb 1056.0
2 20.0 1.0 2020.0 Mar 1051.0
3 20.0 1.0 2020.0 Apr 1107.0
4 20.0 1.0 2020.0 May 1097.0
5 20.0 1.0 2020.0 Jun 1118.0
6 19.0 2.0 2020.0 Jan 460.0
7 19.0 2.0 2020.0 Feb 451.0
8 19.0 2.0 2020.0 Mar 421.0
9 19.0 2.0 2020.0 Apr 421.0
10 19.0 2.0 2020.0 May 420.0
11 19.0 2.0 2020.0 Jun 457.0
...
Or:
print(
df.melt(
["RegionID", "AreaID", "Year"], var_name="Month", value_name="Amount"
)
)
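Note that melt emits all the Jan rows first, then all the Feb rows, and so on, whereas the stack version keeps each original row's months together. A minimal sketch of one way to get the row-wise order from melt, assuming a two-row excerpt of the question's data (ignore_index=False needs pandas 1.1 or later):

```python
import pandas as pd

# Two rows and two months taken from the question's frame.
df = pd.DataFrame({
    "RegionID": [20.0, 19.0],
    "AreaID": [1.0, 2.0],
    "Year": [2020.0, 2020.0],
    "Jan": [1174.0, 460.0],
    "Feb": [1056.0, 451.0],
})
id_cols = ["RegionID", "AreaID", "Year"]

long_df = (
    # Keep the original row labels so they can be used for sorting.
    df.melt(id_cols, var_name="Month", value_name="Amount", ignore_index=False)
    # A stable sort regroups by original row while preserving month order.
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(long_df)
```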

How can I remove a column name/label from a pivot table and drop the remaining column names to the index name level?

I have a pivot table that uses CategoricalDtype so I can get the month names in order. How can I drop the column name/label "Month" and then move the month abbreviations to the same level as "Year"?
...
.pivot_table(index='Year',columns='Month',values='UpClose',aggfunc=np.sum))
Current output:
Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Total
Year
1997 12.0 8.0 8.0 12.0 11.0 12.0 14.0 10.0 10.0 10.0 10.0 9.0 126.0
1998 10.0 12.0 14.0 12.0 9.0 11.0 10.0 8.0 11.0 10.0 10.0 12.0 129.0
Desired output:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Total
1997 12.0 8.0 8.0 12.0 11.0 12.0 14.0 10.0 10.0 10.0 10.0 9.0 126.0
1998 10.0 12.0 14.0 12.0 9.0 11.0 10.0 8.0 11.0 10.0 10.0 12.0 129.0
If I use data.columns.name = None, it removes the "Month" label, but it doesn't drop the month abbreviations to the same level as "Year".
You need to replace the column name, along the lines of the answer to "Renaming columns in dataframe w.r.t another specific column":
# replace the Month with year
df = df.rename(columns={"Month":"Year"})
# drop first column
df = df.iloc[1:].reset_index(drop=True)
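In the printed pivot table, "Month" is the name of the columns axis rather than a column label, and "Year" is the name of the index, so rename(columns=...) may not have anything to act on. A hedged sketch of another route: clear the axis name and turn the index into an ordinary column. The small table below is built by hand to mimic the question's output and is not the asker's actual data:

```python
import pandas as pd

# Hand-built stand-in for the pivot table: Year as index, Month as columns-axis name.
table = pd.DataFrame(
    {"Jan": [12.0, 10.0], "Feb": [8.0, 12.0]},
    index=pd.Index([1997, 1998], name="Year"),
)
table.columns.name = "Month"

# Drop the 'Month' axis label, then move Year out of the index into a column.
flat = table.rename_axis(columns=None).reset_index()
print(flat)
```

If the columns are a CategoricalIndex, as in the question, reset_index may refuse to insert the new "Year" label on some pandas versions; casting first with table.columns = table.columns.astype(str) is one possible workaround.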

Pandas df.pivot_table - aggfunc = sum not producing desired output

Say I have a data frame, sega_df:
MONTH Character Rings Chili Dogs Emeralds
0 Jun 2017 Sonic 25.0 10.0 6.0
5 Jun 2017 Sonic 19.0 15.0 0.0
8 Jun 2017 Shadow 4.0 1.0 0.0
9 Jun 2017 Shadow 23.0 1.0 0.0
12 Jun 2017 Knuckles 9.0 3.0 1.0
13 Jun 2017 Tails 10.0 6.0 0.0
22 Jul 2017 Sonic 5.0 20.0 0.0
23 Jul 2017 Shadow 3.0 3.0 7.0
24 Jul 2017 Knuckles 9.0 4.0 0.0
27 Jul 2017 Knuckles 11.0 2.0 0.0
28 Jul 2017 Tails 12.0 3.0 0.0
29 Jul 2017 Tails 12.0 5.0 0.0
My pivot_table command gives me a table output of each character by row against each month, but the values are a series of random NaN or 0. The 0s are because there is more data with 0s in later months and I only posted the first few rows. The data types of the values in the three columns (Rings, Chili Dogs, and Emeralds) are numpy.float64, so I'm also curious if that affects it, or if it's how I define aggfunc.
My values argument and pivot_table command are as follows:
values = list(sega_df.columns.values)
test = pd.pivot_table(data = sega_df, values = values, index = 'Character', columns = 'MONTH', aggfunc='sum')
Here is my desired pivot_table output, -- with the sum of the three columns per character per month (eg. Sonic for month of June is [25 + 10 + 6 + 19 + 15 + 0] = 75.0):
MONTH Jun 2017 Jul 2017
Character
0 Sonic 75.0 25.0
1 Shadow 29.0 18.0
2 Knuckles 13.0 26.0
3 Tails 16.0 32.0
You just need a groupby sum, then a sum across the columns with axis=1, then unstack:
df.groupby(['Character','MONTH']).sum().sum(1).unstack()
Out[953]:
MONTH Jul 2017 Jun 2017
Character
Knuckles 26.0 13.0
Shadow 13.0 29.0
Sonic 25.0 75.0
Tails 32.0 16.0
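pivot_table aggregates each column listed in values separately and never adds them together, so it can still be used if the per-row total is computed first. A small sketch under that assumption, using a few rows copied from the question (the Total column name is made up here):

```python
import pandas as pd

# A subset of the question's rows.
sega_df = pd.DataFrame({
    "MONTH": ["Jun 2017", "Jun 2017", "Jul 2017", "Jul 2017"],
    "Character": ["Sonic", "Sonic", "Sonic", "Shadow"],
    "Rings": [25.0, 19.0, 5.0, 3.0],
    "Chili Dogs": [10.0, 15.0, 20.0, 3.0],
    "Emeralds": [6.0, 0.0, 0.0, 7.0],
})
item_cols = ["Rings", "Chili Dogs", "Emeralds"]

# Add the three item columns per row, then pivot on that single column.
totals = sega_df.assign(Total=sega_df[item_cols].sum(axis=1))
out = totals.pivot_table(index="Character", columns="MONTH",
                         values="Total", aggfunc="sum")
print(out)
```

Combinations with no rows (Shadow in Jun 2017 in this subset) come out as NaN; passing fill_value=0 to pivot_table is one way to change that.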

How to calculate counts on pandas pivot_table

I have data something like this
import random
import pandas as pd
jobs = ['Agriculture', 'Crafts', 'Labor', 'Professional']
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(300)]
})
And I want a simple table showing the count of jobs in each region.
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len))
Output is
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 13.0 23.0 17.0 18.0 8.0 79.0
Crafts 16.0 13.0 18.0 19.0 14.0 80.0
Labor 15.0 11.0 19.0 11.0 14.0 70.0
Professional 22.0 17.0 16.0 7.0 9.0 71.0
All 66.0 64.0 70.0 55.0 45.0 300.0
I assume "MaritalStatus" is showing up in the output because that is the column that the count is being calculated on. How do I get Pandas to calculate based on the Region-JobCategory count and ignore extraneous columns in the dataframe?
Added in edit ---
I am looking for a table with margin values to be output. The values in the table I show are what I want, but I don't want MaritalStatus to be what is counted. If there is a NaN in that column, e.g. change the column definition to
'MaritalStatus':[random.choice(['Not Married', 'Married'])
for i in range(299)].append(np.NaN)
This is the output (both with and without values = 'MaritalStatus',)
MaritalStatus
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 16.0 14.0 16.0 NaN
Crafts 25.0 17.0 15.0 14.0 16.0 NaN
Labor 14.0 16.0 8.0 17.0 15.0 NaN
Professional 13.0 14.0 14.0 13.0 13.0 NaN
All NaN NaN NaN NaN NaN 0.0
You can fill the NaN values with 0 and then take the len, i.e.
import numpy as np
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
# list.append returns None, so concatenate the NaN instead
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(299)] + [np.nan]})
df = df.fillna(0)
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
values='MaritalStatus',
aggfunc=len))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 19.0 17.0 13.0 20.0 9.0 78.0
Crafts 17.0 14.0 9.0 11.0 16.0 67.0
Labor 10.0 17.0 15.0 19.0 11.0 72.0
Professional 11.0 14.0 19.0 19.0 20.0 83.0
All 57.0 62.0 56.0 69.0 56.0 300.0
If you cut the dataframe down to just the columns that are to be part of the final index, counting rows works without having to refer to another column.
pd.pivot_table(df[['JobCategory', 'Region']],
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len)
Output is the same as in the question except the line with "MaritalStatus" is not present.
The len aggregation function counts the number of times a value of MaritalStatus appears for a particular combination of JobCategory and Region. Thus you're counting the number of JobCategory - Region instances, which I guess is what you're expecting.
EDIT
We can assign a key value to each record and count (or take the size of) that value.
df = pd.DataFrame({
'JobCategory':[random.choice(jobs) for i in range(300)],
'Region':[random.randint(1,5) for i in range(300)],
'MaritalStatus':[random.choice(['Not Married', 'Married']) for i in range(299)] + [np.nan]})
print(pd.pivot_table(df.assign(key=1),
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len,
values='key'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 16.0 14.0 13.0 16.0 16.0 75.0
Crafts 14.0 9.0 17.0 22.0 13.0 75.0
Labor 11.0 18.0 20.0 10.0 16.0 75.0
Professional 16.0 14.0 15.0 14.0 16.0 75.0
All 57.0 55.0 65.0 62.0 61.0 300.0
You could add MaritalStatus as the values parameter, and this would eliminate that extra level in the column index. With aggfunc=len, it really doesn't matter what you select as the values parameter; it is going to return a count of 1 for every row in that aggregation.
So, try:
print(pd.pivot_table(df,
index='JobCategory',
columns='Region',
margins=True,
aggfunc=len,
values='MaritalStatus'))
Output:
Region 1 2 3 4 5 All
JobCategory
Agriculture 10.0 18.0 10.0 15.0 19.0 72.0
Crafts 11.0 13.0 17.0 11.0 22.0 74.0
Labor 12.0 10.0 18.0 16.0 12.0 68.0
Professional 21.0 16.0 20.0 13.0 16.0 86.0
All 54.0 57.0 65.0 55.0 69.0 300.0
Option 2
Use groupby and size:
df.groupby(['JobCategory','Region']).size()
Output:
JobCategory Region
Agriculture 1 10
2 18
3 10
4 15
5 19
Crafts 1 11
2 13
3 17
4 11
5 22
Labor 1 12
2 10
3 18
4 16
5 12
Professional 1 21
2 16
3 20
4 13
5 16
dtype: int64
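Another option, not mentioned in the answers above, is pd.crosstab, which counts co-occurrences directly and does not need a values column. A short sketch with data generated as in the question (the seed is only there to make the run repeatable):

```python
import random
import pandas as pd

random.seed(0)  # arbitrary seed, for repeatability only
jobs = ['Agriculture', 'Crafts', 'Labor', 'Professional']
df = pd.DataFrame({
    'JobCategory': [random.choice(jobs) for _ in range(300)],
    'Region': [random.randint(1, 5) for _ in range(300)],
    'MaritalStatus': [random.choice(['Not Married', 'Married']) for _ in range(300)],
})

# Count rows per JobCategory/Region pair; margins=True adds the 'All' row and column.
counts = pd.crosstab(df['JobCategory'], df['Region'], margins=True)
print(counts)
```

Because only the two key columns are passed in, missing values elsewhere in the frame (such as in MaritalStatus) do not affect the counts.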
