Group by custom aggregation function - python

After performing a groupby on two columns (id and category) using the mean aggregation function over a column (col3), I have something like this:
              col3
id  category  mean
345 A           12
    B            2
    C            3
    D            4
    Total       21
What I would like to do is to add a new column called percentage in which I calculate the percentage of each category over the category Total.
This should be done separately for every id.
The result should be something like this:
              col3
id  category  mean  percentage
345 A           12        0.57
    B            2        0.09
    C            3        0.14
    D            4        0.19
    Total       21        1
Obviously I want to do that for every id, which is the first column I grouped by. Any suggestions on how to do that?

Use get_level_values to filter your df, then use div:
# sum of the non-Total rows for each id
s = df[df.index.get_level_values(level=1) != 'Total'].sum(level=0)
df['percentage'] = df.div(s, level=0, axis=1)
df
Out[422]:
              mean  percentage
id  category
345 A           12    0.571429
    B            2    0.095238
    C            3    0.142857
    D            4    0.190476
    Total       21    1.000000
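On recent pandas versions DataFrame.sum(level=0) has been removed, so the same idea can be written with groupby and transform. A minimal sketch, assuming a MultiIndexed frame with a single mean column like the output above:
import pandas as pd

# assumed input: MultiIndex (id, category) with a 'mean' column, including the 'Total' rows
df = pd.DataFrame(
    {'mean': [12, 2, 3, 4, 21]},
    index=pd.MultiIndex.from_product([[345], ['A', 'B', 'C', 'D', 'Total']],
                                     names=['id', 'category']))

# per-id total taken from the non-'Total' rows, broadcast back to every row of that id
totals = (df['mean']
          .where(df.index.get_level_values('category') != 'Total')
          .groupby(level='id')
          .transform('sum'))

df['percentage'] = df['mean'] / totals
print(df)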

That's my suggestion:
df['mean'] = df['mean'] / df['mean'].sum()

Related

Iterate over unique values in first column while counting other columns within the value of the first column

I have some data that looks like this:
agent external id,product_group,commission_rate,weeks_held
3000-29,Collection,0.85,1
3000-29,Collection,0.85,2
3000-29,Return,0.85,1
3000-12,Collection,0.85,1
3000-12,Collection,0.85,2
3000-12,Return,0.85,1
3000-34,Collection,0.8,2
3000-34,Collection,0.8,2
3000-34,Return,0.8,1
3022-29,Collection,0.75,1
3022-29,Collection,0.75,2
3022-29,Return,0.75,1
I'm trying to create a dataframe that, for each agent external id, gives me the following as separate columns:
The agent external id
The count() of each product group
The sum() of weeks held by each product group
The output I'm looking for is:
agent_external_id,count_collection,sum_collection_weeks_held,count_return,sum_return_weeks_held
3000-29,2,3,0,0
3000-29,0,0,1,1
3000-12,2,3,0,0
3000-12,0,0,1,1
...
I've tried all types of groupby() but I'm not able to structure the data in the way that I want.
Is this what you would like? Please give a sample output and someone will help.
df.groupby(['agent_external_id','product_group']).agg(['sum','count'])
                                 commission_rate       weeks_held
                                             sum count        sum count
agent_external_id product_group
3000-12           Collection                1.70     2          3     2
                  Return                    0.85     1          1     1
3000-29           Collection                1.70     2          3     2
                  Return                    0.85     1          1     1
3000-34           Collection                1.60     2          4     2
                  Return                    0.80     1          1     1
3022-29           Collection                1.50     2          3     2
                  Return                    0.75     1          1     1
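If the goal is one row per agent with the per-product-group counts and summed weeks as columns, as the question's expected output suggests, a sketch along these lines may help. The flattened column names (count_collection, sum_collection, ...) are assumptions based on the question, and the sample frame is hand-built rather than read from the CSV:
import pandas as pd

# assumed sample matching the first two agents of the question's data
df = pd.DataFrame({
    'agent_external_id': ['3000-29', '3000-29', '3000-29', '3000-12', '3000-12', '3000-12'],
    'product_group':     ['Collection', 'Collection', 'Return', 'Collection', 'Collection', 'Return'],
    'commission_rate':   [0.85, 0.85, 0.85, 0.85, 0.85, 0.85],
    'weeks_held':        [1, 2, 1, 1, 2, 1]})

wide = (df.groupby(['agent_external_id', 'product_group'])['weeks_held']
          .agg(['count', 'sum'])
          .unstack('product_group', fill_value=0))

# flatten the (statistic, product_group) column MultiIndex into single names
wide.columns = [f'{stat}_{group.lower()}' for stat, group in wide.columns]
print(wide.reset_index())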

How to fill the missing values of a column with the mean of a specific class of another column?

I share a part of my big dataframe here to ask my question. In the Age column there are two missing values that are the first two rows. The way I intend to fill them is based on the following steps:
1. Calculate the mean of Age for each group. (Assume the mean value of Age in group A is X.)
2. Iterate through the Age column to detect the null values (which belong to the first two rows).
3. Return the Group value of each null Age (which is 'A').
4. Fill those null values of Age with the mean Age of their corresponding group (the first two rows belong to A, so fill their null Age values with X).
I know how to do step 1, I can use data.groupby('Group')['Age'].mean() but don't know how to proceed to the end of step 4.
Thanks.
Use:
df['Age'] = (df['Age'].fillna(df.groupby('Group')['Age'].transform('mean'))
                      .astype(int))
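For context, a minimal runnable sketch of the fillna + transform approach above, with made-up Group/Age values since the question's full frame isn't shown:
import numpy as np
import pandas as pd

# hypothetical data: the first two ages are missing and both belong to group 'A'
df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B'],
                   'Age':   [np.nan, np.nan, 30, 40, 50]})

# each missing Age is replaced by the mean Age of its own group
df['Age'] = df['Age'].fillna(df.groupby('Group')['Age'].transform('mean'))
print(df)   # the two NaNs become 30.0, the mean of group 'A'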
I'm guessing you're looking for something like this:
import numpy as np  # needed for np.where and np.isnan

df['Age'] = df.groupby(['Name'])['Age'].transform(lambda x: np.where(np.isnan(x), x.mean(), x))
Assuming your data looks like this (I didn't copy the whole dataframe)
  Name   Age
0    a   NaN
1    a   NaN
2    b  15.0
3    d  50.0
4    d  45.0
5    a   8.0
6    a   7.0
7    a   8.0
you would run:
df['Age'] = df.groupby(['Name'])['Age'].transform(lambda x: np.where(np.isnan(x), x.mean(), x))
and get:
  Name        Age
0    a   7.666667   ---> the mean of group 'a'
1    a   7.666667
2    b  15.000000
3    d  50.000000
4    d  45.000000
5    a   8.000000
6    a   7.000000
7    a   8.000000

Python - Pivot and create histograms from Pandas column, with missing values

Having the following Data Frame:
   name  value  count  total_count
0     A      0      1           20
1     A      1      2           20
2     A      2      2           20
3     A      3      2           20
4     A      4      3           20
5     A      5      3           20
6     A      6      2           20
7     A      7      2           20
8     A      8      2           20
9     A      9      1           20
----------------------------------
10    B      0     10           75
11    B      5     30           75
12    B      6     20           75
13    B      8     10           75
14    B      9      5           75
I would like to pivot the data, grouping the rows by name and then creating columns from the value and count columns aggregated into bins.
Explanation: I have 10 possible values, range 0-9, and not all values are present in each group. In the above example group B is missing the values 1, 2, 3, 4 and 7. I would like to create a histogram with 5 bins, ignore the missing values, and calculate the percentage of count for each bin, so the result will look like this:
  name       0-1  2-3  4-5  6-7       8-9
0    A  0.150000  0.2  0.3  0.2  0.150000
1    B  0.133333  0.0  0.4  0.4  0.066667
For example, for bin 0-1 of group A the calculation is the sum of count for the values 0 and 1 (1+2), divided by the total_count of group A:
  name              0-1
0    A  (1+2)/20 = 0.15
I was looking into the hist method and this StackOverflow question, but I'm still struggling to figure out the right approach.
Use pd.cut to bin your feature, then use df.groupby().count() and the .unstack() method to get the dataframe you are looking for. During the groupby you can use any aggregation function (.sum(), .count(), etc.) to get the results you want. The code below shows an example.
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})

df['number_bin'] = pd.cut(df['number'], bins=np.arange(0, 10))

# Option 1: sums
df.groupby(['number_bin', 'name'])['value'].sum().unstack(0)

# Option 2: counts
df.groupby(['number_bin', 'name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.
To get the exact result, you could try this:
bins = range(10)
res = df.groupby('name')['count'].sum()
intervals = pd.cut(df.value, bins=bins, include_lowest=True)

df1 = (df.groupby([intervals, 'name'])['count'].sum() / res).unstack(0)
df1.columns = df1.columns.astype(str)                        # convert the cols to string
df1.columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']  # rename the cols

cols = ['a', 'b', 'd', 'f', 'h']
df1 = df1.add(df1.iloc[:, 1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
             a    b    d         f     h
name
A     0.150000  0.2  0.3  0.200000  0.15
B     0.133333  NaN  0.4  0.266667  0.20
You can replace the NaN values afterwards with df1.fillna(0.0).
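A more direct variant, sketched here under the assumption that the frame has exactly the name/value/count/total_count columns from the question: cut value into the five target bins up front, sum the counts per (name, bin), and divide by each group's total_count.
import pandas as pd

# assumed input matching the question's frame
df = pd.DataFrame({
    'name':  ['A'] * 10 + ['B'] * 5,
    'value': list(range(10)) + [0, 5, 6, 8, 9],
    'count': [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    'total_count': [20] * 10 + [75] * 5})

# bin the 0-9 values into five fixed bins, labelled like the desired output
labels = ['0-1', '2-3', '4-5', '6-7', '8-9']
df['bin'] = pd.cut(df['value'], bins=[0, 2, 4, 6, 8, 10], right=False, labels=labels)

# sum the counts per (name, bin), reshape to wide, and normalise by each group's total
totals = df.groupby('name')['total_count'].first()
result = (df.groupby(['name', 'bin'], observed=False)['count'].sum()
            .unstack('bin', fill_value=0)
            .div(totals, axis=0))
print(result)
Missing bins come out as 0 rather than NaN here, because every category produced by the cut is kept.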

Reduce the dataframe rows and lookup

Need help with the following please.
Suppose we have a dataframe:
import pandas as pd

dictionary = {'Category': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b'],
              'val1': [11, 13, 14, 17, 18, 21, 22, 25, 2, 8, 9, 13, 15, 16, 19],
              'val2': [1, 0, 5, 1, 4, 3, 5, 9, 4, 1, 5, 2, 4, 0, 3]}
df = pd.DataFrame(dictionary)
'val1' is always increasing within the same value of 'Category', i.e. the first and last rows of a category are the min and max values of that category. There are too many rows per category, and I want to make a new dataframe that includes the min and max values of each category and contains equally spaced rows, e.g. 5 rows (including min and max), from each category.
I think numpy's linspace should be used to create an array of values for each category (e.g. linspace(min, max, 5)), and then something similar to Excel's LOOKUP function should be used to get the closest values of 'val1' from df.
Or maybe there are some other better ways...
Many thanks for the help.
Is this what you need? With groupby and reindex:
import numpy as np

l = []
for _, x in df.groupby('Category'):
    x.index = x['val1']
    y = x.reindex(np.linspace(x['val1'].min(), x['val1'].max(), 5), method='nearest')
    l.append(y)
pd.concat(l)
Out[330]:
      Category  val1  val2
val1
11.00        a    11     1
14.50        a    14     5
18.00        a    18     4
21.50        a    22     5
25.00        a    25     9
2.00         b     2     4
6.25         b     8     1
10.50        b     9     5
14.75        b    15     4
19.00        b    19     3
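A loop-free sketch of the same idea, assuming (as the question states) that val1 is increasing within each category. The helper pick_rows is hypothetical, and ties (a grid point exactly halfway between two val1 values) may resolve differently than reindex(method='nearest'):
import numpy as np
import pandas as pd

def pick_rows(g, n=5):
    # equally spaced grid between this category's min and max val1
    grid = np.linspace(g['val1'].min(), g['val1'].max(), n)
    # for every grid point, the position of the closest val1 in this group
    nearest = np.abs(g['val1'].to_numpy()[:, None] - grid).argmin(axis=0)
    return g.iloc[nearest]

result = df.groupby('Category', group_keys=False).apply(pick_rows)
print(result)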

Pandas duplicate rows to unique rows with weight

I'm trying to merge rows in a dataframe where I have different inputs for one ID, so I'd like to have a single row for each ID with a weight.
My dataframe looks like this:
ID    A  B  C  D  weight
 1  0.5  2  a  1    1.00
 2  0.3  3  b  2    0.35
 2  0.6  5  c  3    0.55
 3  0.4  2  d  4    0.90
I would need it to merge the A and B columns for ID=2 into a weighted average (0.3*0.35 + 0.6*0.55 for A, 3*0.35 + 5*0.55 for B). For column C I'd need to choose the value associated with the highest weight (C=c for ID=2), for column D the maximum value (D=3 in this case), and the final weight should be the sum of all weights (0.35+0.55). Basically, I need to apply a different rule to each column when collapsing duplicate IDs, and I haven't found how to do this.
I'm using Python, and I believe pandas is the best tool for this, but I'm just a beginner here, so I'll listen to and try anything you suggest!
Thanks a lot!
import pandas as pd

a = pd.read_clipboard()

def agg_func(x):
    # weight A and B before summing, so their sums are the weighted averages
    x.A = x.A * x.weight
    x.B = x.B * x.weight
    # C from the row with the highest weight, maximum D, highest weight
    return pd.Series([x.A.sum(), x.B.sum(), x.C[x.weight.idxmax()], x.D.max(), x.weight.max()],
                     index=x.columns[1:])

print(a.groupby('ID').apply(agg_func))
        A    B  C  D  weight
ID
1   0.500  2.0  a  1    1.00
2   0.435  3.8  c  3    0.55
3   0.360  1.8  d  4    0.90
This should do the job; see http://pandas.pydata.org/pandas-docs/stable/groupby.html to learn more.
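Note that the question asks for the final weight to be the sum of the group's weights (0.35 + 0.55 = 0.90 for ID 2), whereas the snippet above keeps the maximum. A variant sketch that follows the question's stated rules, with the sample data hard-coded instead of read from the clipboard:
import pandas as pd

# the question's sample data
df = pd.DataFrame({
    'ID':     [1, 2, 2, 3],
    'A':      [0.5, 0.3, 0.6, 0.4],
    'B':      [2, 3, 5, 2],
    'C':      ['a', 'b', 'c', 'd'],
    'D':      [1, 2, 3, 4],
    'weight': [1.0, 0.35, 0.55, 0.9]})

def combine(g):
    return pd.Series({
        'A': (g['A'] * g['weight']).sum(),      # weighted value of A
        'B': (g['B'] * g['weight']).sum(),      # weighted value of B
        'C': g.loc[g['weight'].idxmax(), 'C'],  # C from the row with the highest weight
        'D': g['D'].max(),                      # maximum D
        'weight': g['weight'].sum()})           # total weight, as the question asks

print(df.groupby('ID').apply(combine))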
