Pandas duplicate rows to unique rows with weight - python

I'm trying to merge rows in a dataframe where I have different inputs for one ID, so I'd like to end up with a single row for each ID, with a weight.
My dataframe looks like this:
ID    A  B  C  D  weight
1   0.5  2  a  1  1.0
2   0.3  3  b  2  0.35
2   0.6  5  c  3  0.55
3   0.4  2  d  4  0.9
I would need the A and B columns for ID=2 merged into a weighted average (0.3*0.35 + 0.6*0.55 for A, 3*0.35 + 5*0.55 for B). For column C I'd need to choose the value associated with the highest weight (C=c for ID=2), for column D the maximum value (D=3 in this case), and the final weight should be the sum of all weights (0.35 + 0.55). Basically, I need to apply a different rule to each column when collapsing duplicate IDs, and I haven't found how to do this.
I'm using Python, and I believe pandas is the best tool for this, but I'm just a beginner here, so I'll listen to and try anything you suggest!
Thanks a lot!

import pandas as pd

a = pd.read_clipboard()

def agg_func(x):
    # Pre-multiply A and B by the weight so a plain sum gives the weighted value.
    x.A = x.A * x.weight
    x.B = x.B * x.weight
    return pd.Series([x.A.sum(), x.B.sum(),
                      x.C[x.weight.idxmax()],  # C from the highest-weighted row
                      x.D.max(),               # max of D
                      x.weight.sum()],         # final weight = sum of weights
                     index=x.columns[1:])

print(a.groupby('ID').apply(agg_func))
         A    B  C  D  weight
ID
1    0.500  2.0  a  1    1.00
2    0.435  3.8  c  3    0.90
3    0.360  1.8  d  4    0.90
This should do the job. Check http://pandas.pydata.org/pandas-docs/stable/groupby.html to learn more.
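A possible alternative, sketched under the assumption that the same toy frame is in a: precompute the weighted columns, sort so that the highest-weighted row comes last in each group, and then give each column its own aggregation rule with agg:

import pandas as pd

a = pd.DataFrame({'ID': [1, 2, 2, 3],
                  'A': [0.5, 0.3, 0.6, 0.4],
                  'B': [2, 3, 5, 2],
                  'C': list('abcd'),
                  'D': [1, 2, 3, 4],
                  'weight': [1.0, 0.35, 0.55, 0.9]})

# Pre-multiply A and B so plain sums give the weighted values.
b = a.assign(A=a.A * a.weight, B=a.B * a.weight)
# After sorting by weight, 'last' picks C from the highest-weighted row.
print(b.sort_values('weight')
       .groupby('ID')
       .agg({'A': 'sum', 'B': 'sum', 'C': 'last', 'D': 'max', 'weight': 'sum'}))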

Related

Iterate over unique values in first column while counting other columns within the value of the first column

I have some data that looks like this:
agent external id,product_group,commission_rate,weeks_held
3000-29,Collection,0.85,1
3000-29,Collection,0.85,2
3000-29,Return,0.85,1
3000-12,Collection,0.85,1
3000-12,Collection,0.85,2
3000-12,Return,0.85,1
3000-34,Collection,0.8,2
3000-34,Collection,0.8,2
3000-34,Return,0.8,1
3022-29,Collection,0.75,1
3022-29,Collection,0.75,2
3022-29,Return,0.75,1
I'm trying to create a dataframe that, for each agent external id, gives me the following as separate columns:
The agent external id
The count() of each product group
The sum() of weeks held by each product group
The output I'm looking for is:
agent_external_id,count_collection,sum_collection_weeks_held,count_return,sum_return_weeks_held
3000-29,2,3,0,0
3000-29,0,0,1,1
3000-12,2,3,0,0
3000-12,0,0,1,1
...
I've tried all types of groupby() but I'm not able to structure the data in the way that I want.
Is this what you would like? Please give a sample output and someone will help.
df.groupby(['agent_external_id','product_group']).agg(['sum','count'])
commission_rate weeks_held
sum count sum count
agent_external_id product_group
3000-12 Collection 1.70 2 3 2
Return 0.85 1 1 1
3000-29 Collection 1.70 2 3 2
Return 0.85 1 1 1
3000-34 Collection 1.60 2 4 2
Return 0.80 1 1 1
3022-29 Collection 1.50 2 3 2
Return 0.75 1 1 1
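To get the one-row-per-agent layout from the question, a rough sketch, assuming the data is reconstructed as below (the column names like count_Collection are my own, not from the question): unstack the product group into the columns and flatten the resulting MultiIndex.

import pandas as pd

# Hypothetical reconstruction of the question's data.
df = pd.DataFrame({
    'agent_external_id': ['3000-29'] * 3 + ['3000-12'] * 3
                         + ['3000-34'] * 3 + ['3022-29'] * 3,
    'product_group': ['Collection', 'Collection', 'Return'] * 4,
    'weeks_held': [1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1],
})

wide = (df.groupby(['agent_external_id', 'product_group'])['weeks_held']
          .agg(['count', 'sum'])
          .unstack('product_group', fill_value=0))
# Flatten the (stat, group) column MultiIndex into e.g. count_Collection.
wide.columns = [f'{stat}_{group}' for stat, group in wide.columns]
print(wide.reset_index())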

How to take the maximum value of a row if it has repeated values elsewhere in the column and return a new matrix?

I have created a matrix by concatenating two arrays as column vectors, so I have something like the following:
ErrKappa  error
1         0.5
2         0.76
2         0.5
3         0.15
4         0.5
4         0.9
2         0.5
3         0.5
I then need to output another matrix which has just the maximum error for each repeated value, so the new one will look like the following:
ErrKappa  error
1         0.5
2         0.76
3         0.5
4         0.9
Please note that ErrKappa doesn't need to be in order; it just so happens that it appears sorted in this toy example. Any help is massively appreciated. Thanks!
import pandas as pd

EK = [1, 2, 2, 3, 4, 4, 2, 3]
er = [0.5, 0.76, 0.5, 0.15, 0.5, 0.9, 0.5, 0.5]

df = pd.DataFrame({'ErrorKappa': EK, 'error': er})
print(df.groupby('ErrorKappa').max())
error
ErrorKappa
1 0.50
2 0.76
3 0.50
4 0.90
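If the result is needed back as a NumPy array (the "matrix" the question mentions), a minimal sketch, assuming that's the goal:

# Bring ErrorKappa out of the index, then convert to a plain 2-column array.
matrix = df.groupby('ErrorKappa').max().reset_index().to_numpy()
print(matrix)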

Filter dataframe rows based on column values percentile

I have a pandas dataframe df like this:
ID  Weight  A
a   0.15    1
a   0.25    3
a   0.02    2
a   0.07    3
b   0.01    1
b   0.025   5
b   0.07    7
b   0.06    4
b   0.12    2
I want to remove rows based on the ID column and the cumulative share of the Weight column. For example, df['ID'] == 'a' has four rows; if I want to keep at least 80% of the total weight (the threshold can vary), I keep only the rows with weights 0.15 and 0.25 (81.6% of the total). As soon as adding a weight crosses 80%, the rest of the rows with the same ID are removed.
After the operation, df will become like this:
ID  Weight  A
a   0.15    1
a   0.25    3
b   0.07    7
b   0.06    4
b   0.12    2
Assuming you want the highest weights per ID, keeping rows only until the cumulative weight share just crosses the 0.8 threshold:
(df.sort_values('Weight', ascending=False)
   .groupby('ID', group_keys=False)
   .apply(lambda g: g[(g.Weight.cumsum() / g.Weight.sum())
                      .lt(0.8)
                      .shift(fill_value=True)]))
ID Weight A
1 a 0.25 3
0 a 0.15 1
8 b 0.12 2
6 b 0.07 7
7 b 0.06 4
We first sort Weight in descending order, then group by ID and calculate the cumulative weight share, which we compare with the threshold. Note that we shift the condition so that we also keep the first row that crosses the threshold, as indicated in the question.
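Since the question says the threshold can vary, a minimal sketch wrapping the same logic in a function (keep_by_weight is an assumed name, and df is the frame from the question):

def keep_by_weight(df, threshold=0.8):
    # Keep rows per ID until the cumulative weight share crosses the threshold.
    def top_rows(g):
        share = g.Weight.cumsum() / g.Weight.sum()
        return g[share.lt(threshold).shift(fill_value=True)]
    return (df.sort_values('Weight', ascending=False)
              .groupby('ID', group_keys=False)
              .apply(top_rows))

print(keep_by_weight(df, threshold=0.8))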

Choosing values from pandas column with the lowest value

I'm reading a df from csv that has 2 columns showing the prices of various items. In some cases the price is a single int/float, but in other cases it could be a range of space-separated ints/floats, or a mixture of ints/floats with strings.
example df:
item prices
------ ---------------------------
a 2
b 3.5
c 5
d 0.04
e 1 8 3 4 2
f 0.04 0.04 0.01
g Normal: 4.56Premium: 4.75
What I'm looking for is a nice pythonic way to get the prices column to display the lowest possible int/float value for every item, e.g.
item prices
------ --------
a 2
b 3.5
c 5
d 0.04
e 1
f 0.01
g 4.56
The only way I could think of to solve this for items e and f would be to split the value with str.split(" ") and map the output to int or float, but this seems messy since not all values are the same type. And I don't even know how I would get the lowest value for item g.
Any help would be appreciated.
Use Series.str.extractall to get the integers and floats, convert them to float, and take the minimum per original row:
# extractall returns one row per match, indexed by (original row, match
# number); grouping on level 0 takes the minimum per original row.
df['prices'] = (df['prices'].str.extractall(r'(\d+\.\d+|\d+)')[0]
                            .astype(float)
                            .groupby(level=0)
                            .min())
print(df)
item prices
0 a 2.00
1 b 3.50
2 c 5.00
3 d 0.04
4 e 1.00
5 f 0.01
6 g 4.56
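.str.extractall only works on string data, so if the csv parse leaves some prices numeric, cast first. A minimal sketch of that variant, under the assumption of mixed dtypes:

# Cast everything to string so the .str accessor works on mixed dtypes.
df['prices'] = (df['prices'].astype(str)
                            .str.extractall(r'(\d+\.\d+|\d+)')[0]
                            .astype(float)
                            .groupby(level=0)
                            .min())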

Group by custom aggregation function python

After performing a groupby on two columns (id and category) with the mean aggregation function over a column (col3), I have something like this:
              col3
id  category  mean
345 A           12
    B            2
    C            3
    D            4
    Total       21
What I would like to do is to add a new column called percentage in which I calculate the percentage of each category over the category Total.
This should be done separately for every id.
The result should be something like this:
              col3
id  category  mean  percentage
345 A           12        0.57
    B            2        0.09
    C            3        0.14
    D            4        0.19
    Total       21        1
Obviously I want to do that for every id, that is, the first column on which I have done the groupby. Any suggestion on how to do that?
Use get_level_values to filter the Total rows out of df, then use div:
# Per-id sum of the means, excluding the Total rows.
s = df[df.index.get_level_values(level=1) != 'Total'].groupby(level=0).sum()
# Divide each row by its id's total, aligning on the first index level.
df['percentage'] = df['mean'].div(s['mean'], level=0)
df
Out[422]:
              mean  percentage
id  category
345 A           12    0.571429
    B            2    0.095238
    C            3    0.142857
    D            4    0.190476
    Total       21    1.000000
That's my suggestion, assuming a single id and that the Total row has been dropped first (otherwise the denominator double-counts):
df['percentage'] = df['mean'] / df['mean'].sum()
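A hedged per-id variant of the same idea, using each id's own Total row as the denominator (level=1 assumes category is the second index level):

# Each id's Total mean, as a Series indexed by id.
totals = df['mean'].xs('Total', level=1)
# Divide, aligning on the first index level (the id).
df['percentage'] = df['mean'].div(totals, level=0)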
