Adding a subtotal column to a multilevel column table - python

This is my dataframe after pivoting:
Country      London                  Shanghai
PriceRange  100-200 200-300 300-400 100-200 200-300 300-400
Code
A                 1       1       1       2       2       2
B                10      10      10      20      20      20
Is it possible to add columns after every country to achieve the following:
Country      London                           Shanghai                          All
PriceRange  100-200 200-300 300-400 SubTotal 100-200 200-300 300-400 SubTotal 100-200 200-300 300-400 SubTotal
Code
A                 1       1       1        3       2       2       2        6       3       3       3        9
B                10      10      10       30      20      20      20       60      30      30      30       90
I know I can use margins=True, however that just adds a final grand total.
Are there any options that I can use to achieve this? Thanks.

Let us use sum with join (the demo below runs on a similar toy frame whose top-level columns are A and B):
s = df.sum(level=0, axis=1)   # on newer pandas: df.groupby(level=0, axis=1).sum()
s.columns = pd.MultiIndex.from_product([list(s), ['subgroup']])
df = df.join(s).sort_index(level=0, axis=1).assign(Group=df.sum(axis=1))
df
      A                   B                  Group
      1   2   3 subgroup  1   2   3 subgroup
Code
A     1   1   1        3  2   2   2        6      9
B    10  10  10       30 20  20  20       60     90
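
Applied to the frame from the question itself, with the SubTotal and All labels the question asks for, a minimal self-contained sketch might look as follows (the label names and the groupby(level=..., axis=1) spelling are assumptions of mine; axis=1 grouping is deprecated in very recent pandas):
import pandas as pd

# rebuild the pivoted frame from the question
cols = pd.MultiIndex.from_product(
    [['London', 'Shanghai'], ['100-200', '200-300', '300-400']],
    names=['Country', 'PriceRange'])
df = pd.DataFrame([[1, 1, 1, 2, 2, 2], [10, 10, 10, 20, 20, 20]],
                  index=pd.Index(['A', 'B'], name='Code'), columns=cols)

# per-country SubTotal across the PriceRange level
sub = df.groupby(level='Country', axis=1).sum()
sub.columns = pd.MultiIndex.from_product([sub.columns, ['SubTotal']])

# 'All' block: totals over both countries per price range, plus a grand total
allblk = df.groupby(level='PriceRange', axis=1).sum()
allblk['SubTotal'] = allblk.sum(axis=1)
allblk.columns = pd.MultiIndex.from_product([['All'], allblk.columns])

out = df.join(sub).sort_index(level=0, axis=1).join(allblk)
print(out)
The sort works here because 'SubTotal' happens to sort after the numeric range labels; with other labels you may need to order the columns explicitly.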


Shrink the dataset by taking mean or median

Assuming I have the following dataframe df:
Number Apples
1 40
2 50
3 60
4 70
5 80
6 90
7 100
8 110
9 120
I want to shrink this dataset and create a dataframe df2 with only 3 observations: the average of rows 1, 2, 3 as the first row, 4, 5, 6 as the second, and 7, 8, 9 as the third.
The end result will be the following:
Number Apples
1 50
2 80
3 110
This is a simpler approach and should run much faster than a groupby; [2::3] keeps every third row, starting at index 2, the first complete window:
df[['Apples']].rolling(3).mean()[2::3]
   Apples
2    50.0
5    80.0
8   110.0
You can also do:
n = 3
s = df.groupby((df.Number - 1) // n).Apples.mean()
Number
0     50.0
1     80.0
2    110.0
Name: Apples, dtype: float64
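
To get exactly the expected shape, with a Number column renumbered from 1, a small self-contained follow-up sketch reusing the s from the answer above:
import pandas as pd

df = pd.DataFrame({'Number': range(1, 10),
                   'Apples': range(40, 130, 10)})

n = 3
s = df.groupby((df.Number - 1) // n).Apples.mean()

# reshape the grouped means into the expected two-column frame
df2 = s.reset_index(drop=True).to_frame()
df2.insert(0, 'Number', range(1, len(df2) + 1))
print(df2)
#    Number  Apples
# 0       1    50.0
# 1       2    80.0
# 2       3   110.0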

Split up the total of a value when merging dataframes with rows that contain id multiple times

I have two dataframes that I would like to merge. The first dataframe contains a customer id and a column with a value. The second dataframe contains the customer id and a purchase id. When merging, I would like to split up the total value in the first dataframe based on how many times the customer id is present in the second dataframe, and attribute to every row the correct share of the total value.
Example: customer id 1 has a total value of 3000 but has bought products two times in its lifetime; the value 3000 should then be split when merging so that each row gets 1500.
First dataframe:
import pandas as pd
df_first = pd.DataFrame({'customer_id': [1,2,3,4,5], 'value': [3000,4000,5000,6000,7000]})
df_first.head()
Out[1]:
customer_id value
0 1 3000
1 2 4000
2 3 5000
3 4 6000
4 5 7000
Second dataframe:
df_second = pd.DataFrame({'customer_id': [1,2,3,4,5,1,2,3,4,5], 'purchase_id': [11,12,13,14,15,21,22,23,24,25]})
df_second.head(10)
Out[2]:
customer_id purchase_id
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 1 21
6 2 22
7 3 23
8 4 24
9 5 25
Expected output when merging:
Out[3]:
customer_id value purchase_id
0 1 1500 11
1 1 1500 21
2 2 2000 12
3 2 2000 22
4 3 2500 13
5 3 2500 23
6 4 3000 14
7 4 3000 24
8 5 3500 15
9 5 3500 25
Use DataFrame.merge with a left join on values sorted by customer_id, then divide value by the size of each customer's group, obtained with Series.map plus Series.value_counts:
df = df_second.sort_values('customer_id').merge(df_first, on='customer_id', how='left')
df['value'] /= df['customer_id'].map(df['customer_id'].value_counts())
# alternative:
# df['value'] /= df.groupby('customer_id')['customer_id'].transform('size')
print(df)
customer_id purchase_id value
0 1 11 1500.0
1 1 21 1500.0
2 2 12 2000.0
3 2 22 2000.0
4 3 13 2500.0
5 3 23 2500.0
6 4 14 3000.0
7 4 24 3000.0
8 5 15 3500.0
9 5 25 3500.0
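
An equivalent route, sketched assuming the df_first and df_second defined in the question and an even split of each total, is to divide in df_first before merging, since the per-customer counts live in df_second:
# purchases per customer, taken from df_second
counts = df_second['customer_id'].value_counts()

# split the totals first, then merge as before
df_first['value'] = df_first['value'] / df_first['customer_id'].map(counts)
out = df_second.sort_values('customer_id').merge(df_first, on='customer_id', how='left')
print(out)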

How to groupby and aggregate dynamic columns in pandas

I have the following dataframe in pandas:
code tank nozzle_1 nozzle_2 nozzle_var nozzle_sale
123 1 1 1 10 10
123 1 2 2 12 10
123 2 1 1 10 10
123 2 2 2 12 10
123 1 1 1 10 10
123 2 2 2 12 10
Now, I want to generate a cumulative sum of all the columns, grouped over tank, and take out the last observation. The nozzle_1 and nozzle_2 columns are dynamic; there could be nozzle_3, nozzle_4, ..., nozzle_n, and so on. I am doing the following in pandas to get the cumsum:
## cumsum of the dynamic nozzle_1 and nozzle_2 columns, per tank
cols = df.columns[df.columns.str.contains(pat=r'nozzle_\d+$', regex=True)]
df = df.assign(**df.groupby('tank')[cols].agg(['cumsum'])
                   .pipe(lambda x: x.set_axis(x.columns.map('_'.join), axis=1)))
## nozzle_sale_cumsum is a static column
df['nozzle_sale_cumsum'] = df.groupby('tank')['nozzle_sale'].cumsum()
From the above code I get the following cumsum columns:
tank nozzle_1 nozzle_2 nozzle_var nozzle_1_cumsum nozzle_2_cumsum nozzle_sale_cumsum
1 1 1 10 1 1 10
1 2 2 12 3 3 20
2 1 1 10 1 1 10
2 2 2 12 3 3 20
1 1 1 10 4 4 30
2 2 2 12 5 5 30
Now, I want to get the last value of each of the 3 cumsum columns, grouping over tank. I can do it with the following code, but it hard-codes the column names:
final_df = df.groupby('tank').agg({'nozzle_1_cumsum': 'last',
                                   'nozzle_2_cumsum': 'last',
                                   'nozzle_sale_cumsum': 'last'}).reset_index()
The problem with the above code is that nozzle_1_cumsum and nozzle_2_cumsum are hard coded, which will not hold in general. How can I do this in pandas with dynamic columns?
How about:
df.filter(regex='_cumsum').groupby(df['tank']).last()
Output:
nozzle_1_cumsum nozzle_2_cumsum nozzle_sale_cumsum
tank
1 4 4 30
2 5 5 30
You can also replace df.filter(...) by, e.g., df.iloc[:,-3:] or df[col_names].
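
Putting both steps together, a minimal end-to-end sketch on the question's data (the plain per-group cumsum().add_suffix(...) spelling below is my substitution for the agg(['cumsum']) pipeline above; the final reset_index is an assumption about the desired shape):
import pandas as pd

df = pd.DataFrame({'code': [123] * 6,
                   'tank': [1, 1, 2, 2, 1, 2],
                   'nozzle_1': [1, 2, 1, 2, 1, 2],
                   'nozzle_2': [1, 2, 1, 2, 1, 2],
                   'nozzle_var': [10, 12, 10, 12, 10, 12],
                   'nozzle_sale': [10, 10, 10, 10, 10, 10]})

# cumsum of every nozzle_<digits> column plus the static sale column, per tank
cols = list(df.columns[df.columns.str.contains(r'nozzle_\d+$')]) + ['nozzle_sale']
cumsums = df.groupby('tank')[cols].cumsum().add_suffix('_cumsum')
df = df.join(cumsums)

# last cumulative value per tank, without naming the columns
final_df = df.filter(regex='_cumsum$').groupby(df['tank']).last().reset_index()
print(final_df)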

How can we fill the empty values in the column?

I have table A with 3 columns. The val column has some empty values. The question is: is there any way to fill the empty values based on the previous value, using Python? For example, Alex and John should take the value 20 and Sam should take the value 100.
A = [ id  name    val
      1   jack     10
      2   mec      20
      3   alex
      4   john
      5   sam     250
      6   tom     100
      7   sam
      8   hellen  300 ]
You can read the data into a pandas DataFrame and use the built-in fillna() to solve this. For example,
df = # your data, with the blanks read in as NaN
df.fillna(method='pad')   # equivalently: df.ffill()
would return a dataframe like:
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
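As a runnable sketch, assuming the blanks arrive as NaN (names and values copied from the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': range(1, 9),
                   'name': ['jack', 'mec', 'alex', 'john', 'sam', 'tom', 'sam', 'hellen'],
                   'val': [10, 20, np.nan, np.nan, 250, 100, np.nan, 300]})

# forward-fill: each NaN takes the last value seen above it
df['val'] = df['val'].ffill()
print(df)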

Groupby top n records based on another column

Below is an example of the data set I am working with. I am trying to do a group by on Location, keeping only the top 3 KG records for each location.
Location KG Dollars
BKK 7 2
BKK 5 3
BKK 4 2
BKK 3 3
BKK 2 2
HKG 8 3
HKG 6 2
HKG 4 3
HKG 3 2
HKG 2 3
The output would look like below: grouped on Location, summing both KG and Dollars over the top 3 KG records for each location.
Location KG Dollars
BKK 16 7
HKG 18 8
I've tried different kinds of groupbys, but I'm having trouble restricting each group to only its top n KG records.
You could do
In [610]: df.groupby('Location').apply(
     ...:     lambda x: x.nlargest(3, 'KG')[['KG', 'Dollars']].sum())
Out[610]:
          KG  Dollars
Location
BKK       16        7
HKG       18        8
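
An alternative without apply, sketched on the same data: keep the 3 largest-KG rows within each Location first, then aggregate the survivors:
import pandas as pd

df = pd.DataFrame({'Location': ['BKK'] * 5 + ['HKG'] * 5,
                   'KG': [7, 5, 4, 3, 2, 8, 6, 4, 3, 2],
                   'Dollars': [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]})

# keep the 3 largest-KG rows within each Location, then sum what survives
top3 = df.sort_values('KG', ascending=False).groupby('Location').head(3)
print(top3.groupby('Location')[['KG', 'Dollars']].sum())
#           KG  Dollars
# Location
# BKK       16        7
# HKG       18        8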
