Shrink the dataset by taking mean or median - python

Assuming I have the following dataframe df:
Number Apples
1 40
2 50
3 60
4 70
5 80
6 90
7 100
8 110
9 120
I want to shrink this dataset and create a dataframe df2 with only 3 observations: the average of rows 1, 2, 3 becomes the first row, the average of rows 4, 5, 6 the second row, and the average of rows 7, 8, 9 the third row.
The end result should be the following:
Number Apples
1 50
2 80
3 110

This is a simpler approach and is likely to run faster than a groupby (it assumes the number of rows is a multiple of 3):
df.rolling(3).mean()[2::3]
apples
2 50.0
5 80.0
8 110.0

You can group on (Number-1)//n, which puts every n consecutive numbers in the same group:
n=3
s=df.groupby((df.Number-1)//n).Apples.mean()
s
Number
0 50.0
1 80.0
2 110.0
Name: Apples, dtype: float64
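If the Number column is not a clean 1..N sequence, the same grouping can be built from row positions instead. Below is a minimal sketch, assuming the rows are already in the desired order. The helper name shrink and its agg parameter are made up for illustration, and agg can be 'mean' or 'median' as in the title of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Number": range(1, 10), "Apples": range(40, 130, 10)})

def shrink(frame, n, agg="mean"):
    """Hypothetical helper: collapse every n consecutive rows into one row."""
    # Row positions 0,1,2 -> block 0; 3,4,5 -> block 1; and so on.
    blocks = np.arange(len(frame)) // n
    out = frame.groupby(blocks)["Apples"].agg(agg).reset_index(drop=True)
    # Renumber the blocks from 1 to match the desired output.
    return pd.DataFrame({"Number": np.arange(1, len(out) + 1), "Apples": out})

df2 = shrink(df, 3)
df2_median = shrink(df, 3, agg="median")
print(df2)
```

If the number of rows is not a multiple of n, the last block simply aggregates fewer rows.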

Related

Map two pandas dataframe and add a column to the first dataframe

I have posted two sample dataframes below. I would like to look up each value of a column in the first dataframe in the index of the second dataframe, and place the matching values back into the first dataframe as a new column, as shown below.
A = np.array([0,1,1,3,5,2,5,4,2,0])
B = np.array([55,75,86,98,100,111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()
Below is the first dataframe, df1:
data
0 0
1 1
2 1
3 3
4 5
5 2
6 5
7 4
8 2
9 0
And below is the second dataframe, df2:
values_for_replacement
0 55
1 75
2 86
3 98
4 100
5 111
Below is the output needed (mapped with respect to the index of df2):
data new_data
0 0 55
1 1 75
2 1 75
3 3 98
4 5 111
5 2 86
6 5 111
7 4 100
8 2 86
9 0 55
I would like to know how to achieve this using pandas functions such as map.
Many thanks in advance.
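One possible way to do this, sketched with the sample data from the question: Series.map accepts another Series and looks each value up in that Series' index, which is the lookup described above.

```python
import numpy as np
import pandas as pd

A = np.array([0, 1, 1, 3, 5, 2, 5, 4, 2, 0])
B = np.array([55, 75, 86, 98, 100, 111])
df1 = pd.Series(A, name="data").to_frame()
df2 = pd.Series(B, name="values_for_replacement").to_frame()

# Each value in df1["data"] is used as a label into the index of df2.
df1["new_data"] = df1["data"].map(df2["values_for_replacement"])
print(df1)
```

Values of data that do not appear in the index of df2 would come back as NaN.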

How to change value of column with different percentage values depending on the categories of other column

Say I have a data frame as:
df
cat income
1 10
2 20
2 50
3 60
1 20
I want to apply a fixed percentage increase per category, according to this scheme:
If cat == 1: income * 1.1 (10% increase)
If cat == 2: income * 1.2 (20% increase)
If cat == 3: income * 1.3 (30% increase)
Then I need to append the increased column to the above data frame, as below:
df
cat income increased_income
1 10 11
2 20 24
2 50 60
3 60 78
1 20 22
How can I achieve the above using pandas?
Try:
cat_map={1:1.1, 2:1.2, 3:1.3}
df["increased_income"]=df["income"].mul(df["cat"].map(cat_map))
Outputs:
cat income increased_income
0 1 10 11.0
1 2 20 24.0
2 2 50 60.0
3 3 60 78.0
4 1 20 22.0
Because the increase here happens to be 10% times cat, we can first create the factors by dividing cat by 10 and adding 1, and then multiply these factors by your income:
df['increased_income'] = df['income'].mul(df['cat'].div(10).add(1))
# or with basic Python operators
# df['increased_income'] = df['income'] * (df['cat'] / 10 + 1)
cat income increased_income
0 1 10 11.0
1 2 20 24.0
2 2 50 60.0
3 3 60 78.0
4 1 20 22.0
Build a boolean condition for each category
a=df['cat']==1
b=df['cat']==2
c=df['cat']==3
Then apply nested np.where (rows that match no category keep their income unchanged)
df['increased_income']=np.where(a, df['income']*1.1, np.where(b, df['income']*1.2, np.where(c, df['income']*1.3, df['income'])))
df
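As an alternative to nesting np.where calls, np.select takes a list of conditions and a matching list of choices. This is a different technique from the one in the answer above, shown here only as a sketch on the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat": [1, 2, 2, 3, 1], "income": [10, 20, 50, 60, 20]})

conditions = [df["cat"] == 1, df["cat"] == 2, df["cat"] == 3]
factors = [1.1, 1.2, 1.3]

# Rows matching none of the conditions get the default factor 1.0,
# so their income is left unchanged.
df["increased_income"] = df["income"] * np.select(conditions, factors, default=1.0)
print(df)
```

The results are floats, so values such as 11.0 may carry a tiny floating-point error.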

Adding a subtotal column to a multilevel column table

This is my dataframe after pivoting:
Country London Shanghai
PriceRange 100-200 200-300 300-400 100-200 200-300 300-400
Code
A 1 1 1 2 2 2
B 10 10 10 20 20 20
Is it possible to add columns after every country to achieve the following:
Country London Shanghai All
PriceRange 100-200 200-300 300-400 SubTotal 100-200 200-300 300-400 SubTotal 100-200 200-300 300-400 SubTotal
Code
A 1 1 1 3 2 2 2 6 3 3 3 9
B 10 10 10 30 20 20 20 60 30 30 30 90
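I know I can use margins=True, but that only adds a final grand total.
Are there any options that I can use to achieve this? Thanks.
You can use sum with join:
# df.sum(level=0, axis=1) was removed in pandas 2.0; grouping the transposed frame gives the same per-country sums
s=df.T.groupby(level=0).sum().T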
s.columns=pd.MultiIndex.from_product([list(s),['subgroup']])
df=df.join(s).sort_index(level=0,axis=1).assign(Group=df.sum(axis=1))
df
A B Group
1 2 3 subgroup 1 2 3 subgroup
Code
A 1 1 1 3 2 2 2 6 9
B 10 10 10 30 20 20 20 60 90
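The exact layout asked for in the question (a SubTotal after each country, plus an All block) can also be assembled with concat. The sketch below rebuilds the pivoted frame from the question by hand; the variable names all_block, wide, sub and result are made up for illustration.

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["London", "Shanghai"], ["100-200", "200-300", "300-400"]],
    names=["Country", "PriceRange"],
)
df = pd.DataFrame(
    [[1, 1, 1, 2, 2, 2], [10, 10, 10, 20, 20, 20]],
    index=pd.Index(["A", "B"], name="Code"),
    columns=cols,
)

# Sum across countries for each price range and file it under an "All" country.
all_block = df.T.groupby(level="PriceRange").sum().T
all_block.columns = pd.MultiIndex.from_product(
    [["All"], all_block.columns], names=cols.names
)
wide = pd.concat([df, all_block], axis=1)

# One SubTotal column per top-level country (including "All").
sub = wide.T.groupby(level="Country", sort=False).sum().T
sub.columns = pd.MultiIndex.from_product(
    [sub.columns, ["SubTotal"]], names=cols.names
)

# Reassemble country by country so each SubTotal follows its own price ranges.
countries = wide.columns.get_level_values("Country").unique()
result = pd.concat(
    [pd.concat([wide[[c]], sub[[c]]], axis=1) for c in countries], axis=1
)
print(result)
```

Building the frame piece by piece avoids relying on sort_index, which would order the columns alphabetically rather than in the order shown in the question.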

How do I update pandas dataframes with calculations done group-wise?

Take the following table:
df = pd.DataFrame({'a':[1,1,2,2], 'b':[1,2,3,4], 'c':[10,20,30,40]})
print(df.to_string())
a b c
0 1 1 10
1 1 2 20
2 2 3 30
3 2 4 40
I would like the following result:
result = pd.DataFrame({'a':[1,1,2,2], 'b':[1,2,3,4], 'c':[10,20,30,40], 'group_avg':[13.5,13.5,31.5,31.5]})
print(result.to_string())
a b c group_avg
0 1 1 10 13.5
1 1 2 20 13.5
2 2 3 30 31.5
3 2 4 40 31.5
That is, group_avg is computed by doing c-b and then taking the average group-wise by grouping on a.
Is there a nice way of doing this, or do I have to go the roundabout way of creating a new difference column, grouping by a, getting the average, then joining the result on the original table?
What if I want to apply an arbitrary function which takes 2 series, but I want to apply it group-wise?
Try using assign to create a temporary column of c-b, then groupby with transform:
df['group_avg'] = df.assign(avg = df.c - df.b)\
.groupby('a')['avg'].transform('mean')
Output:
a b c group_avg
0 1 1 10 13.5
1 1 2 20 13.5
2 2 3 30 31.5
3 2 4 40 31.5
Due to the linear nature of the mean, the mean of the difference is the same as the difference of the mean. So we can use the mean after a groupby then subtract.
df.join(df.groupby('a').mean().eval('c - b').rename('avg'), on='a')
a b c avg
0 1 1 10 13.5
1 1 2 20 13.5
2 2 3 30 31.5
3 2 4 40 31.5
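For the follow-up about an arbitrary function that takes two series, one possible approach is to call the function once per group and map the per-group result back onto the rows. This is only a sketch: mean_gap is a hypothetical stand-in for whatever two-series function is needed, and it assumes the function returns one scalar per group.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4], "c": [10, 20, 30, 40]})

def mean_gap(x, y):
    """Hypothetical example function: any callable taking two Series works."""
    return (y - x).mean()

# Call the function once per group, passing that group's b and c columns.
per_group = pd.Series(
    {key: mean_gap(g["b"], g["c"]) for key, g in df.groupby("a")}
)

# Broadcast each group's scalar back onto the original rows.
df["group_avg"] = df["a"].map(per_group)
print(df)
```

Unlike the mean-of-differences shortcut, this does not depend on the function being linear.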

How can we fill the empty values in the column?

I have table A with 3 columns. The column val has some empty values. Is there any way to fill the empty values based on the previous value using Python? For example, alex and john should take the value 20, and the second sam should take the value 100.
A = [ id name val
1 jack 10
2 mec 20
3 alex
4 john
5 sam 250
6 tom 100
7 sam
8 hellen 300]
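You can load the data into a pandas DataFrame and use the built-in forward fill to solve your problem. For example,
# df is a DataFrame holding your data, with the empty values stored as NaN
df.ffill()  # equivalent to the older df.fillna(method='pad'), which is deprecated in recent pandas versions
would return a dataframe like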
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
You can refer to the pandas documentation for fillna and ffill for more information.
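A self-contained sketch of the forward fill on the table from the question, assuming the empty cells are read in as NaN (if they arrive as empty strings, they would need to be converted to NaN first):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 9),
    "name": ["jack", "mec", "alex", "john", "sam", "tom", "sam", "hellen"],
    "val": [10, 20, np.nan, np.nan, 250, 100, np.nan, 300],
})

# Forward fill: each missing val takes the last non-missing value above it.
df["val"] = df["val"].ffill()
print(df)
```

Because the column contained NaN, val is stored as float, so the filled values print as 20.0 and 100.0.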
