Greedy most diverse subset of pandas dataframe - python

This is my dataset:
import pandas as pd
import itertools
A = ['A','B','C']
M = ['1','2','3']
F = ['plus','minus','square']
df = pd.DataFrame(list(itertools.product(A,M,F)), columns=['A','M','F'])
print(df)
A M F
0 A 1 plus
1 A 1 minus
2 A 1 square
3 A 2 plus
4 A 2 minus
5 A 2 square
I want to get the top-n rows (a subset) of that dataframe that are maximally diverse.
To compute diversity, I use 1 - Jaccard similarity.
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
Using dataframe operations, I can take the Cartesian product of the dataframe with itself using apply, compute the diversity value of each pair, and find each row's most diverse partner with df.idxmax(axis=1). But this way I have to compute all pairwise diversity values first, which is not efficient.
0 1 2 3 4 5 6 7 8 9 10
0 0.0 1.0 0.8 0.5 0.5 0.8 0.5 1.0 0.8 0.8 0.8
1 0.0 0.0 1.0 0.8 1.0 0.8 1.0 0.8 0.8 0.8 0.8
2 0.0 0.0 0.0 1.0 0.5 1.0 0.5 0.8 0.8 1.0 1.0
3 0.0 0.0 0.0 0.0 0.8 0.8 0.8 0.8 0.5 0.8 0.5
4 0.0 0.0 0.0 0.0 0.0 0.8 0.8 1.0 0.5 1.0 0.8
df.idxmax(axis=1).sample(4)
5 6
2 3
0 1
8 9
dtype: int64
I want to implement this algorithm, but somehow I did not understand lines 6 and 7.
How is the argmax computed here? And why does line 10 return Sk when Sk is never initialized inside the loop?
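For reference, here is a minimal sketch of the greedy approach, assuming the algorithm in question is greedy farthest-point (max-min) selection. Sk is initialized with a single seed row before the loop, which is why the loop itself never initializes it, and the argmax picks the remaining row whose minimum diversity to the already-selected rows is largest. Treating each row as a set of its three values is an assumption.
import pandas as pd
import itertools

A = ['A', 'B', 'C']
M = ['1', '2', '3']
F = ['plus', 'minus', 'square']
df = pd.DataFrame(list(itertools.product(A, M, F)), columns=['A', 'M', 'F'])

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def diversity(df, i, j):
    # 1 - Jaccard similarity, treating each row as a set of its values
    return 1 - jaccard(set(df.loc[i]), set(df.loc[j]))

def greedy_diverse_subset(df, n, seed=0):
    selected = [seed]                  # Sk starts as a single seed row
    remaining = set(df.index) - {seed}
    while len(selected) < n:
        # argmax: the remaining row whose minimum distance to Sk is largest
        best = max(remaining,
                   key=lambda r: min(diversity(df, r, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return df.loc[selected]

print(greedy_diverse_subset(df, 4))
This only computes distances between candidates and the rows already selected, rather than the full pairwise matrix.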

Pandas dataframe add rows with NaN in one column based on values in other columns

For example, a dataframe has three columns: x, y, z. The x and y columns each take 3 different values at 0.5 intervals. They are coordinates, so they map to each other and there should be 3*3 = 9 rows with z values, but the actual dataframe has only, say, 7 rows. How can I add the two missing rows with NaN in the z column? Example input and output are below. Thank you!
Input:
DataFrame:
x y z
0 -0.5 0 5
1 -0.5 -0.5 10
2 0 -0.5 7
3 0 0.5 6
4 0 0 12
5 0.5 0 8
6 0.5 0.5 2
Output:
DataFrame:
x y z
0 -0.5 0 5
1 -0.5 -0.5 10
2 0 -0.5 7
3 0 0.5 6
4 0 0 12
5 0.5 0 8
6 0.5 0.5 2
7 -0.5 0.5 NaN // missing row
8 0.5 -0.5 NaN // missing row
One option is with complete from pyjanitor, to add the missing rows based on the combinations of x and y:
# pip install pyjanitor
import pandas as pd
import janitor
df.complete('x', 'y')
x y z
0 -0.5 0.0 5.0
1 -0.5 -0.5 10.0
2 -0.5 0.5 NaN
3 0.0 0.0 12.0
4 0.0 -0.5 7.0
5 0.0 0.5 6.0
6 0.5 0.0 8.0
7 0.5 -0.5 NaN
8 0.5 0.5 2.0
complete is just an efficient helper (a wrapper around pandas functions); if your data has no duplicates that would break pivot, you can use pivot directly:
df.pivot(index='x', columns='y', values='z').stack(dropna=False).rename('z').reset_index()
x y z
0 -0.5 -0.5 10.0
1 -0.5 0.0 5.0
2 -0.5 0.5 NaN
3 0.0 -0.5 7.0
4 0.0 0.0 12.0
5 0.0 0.5 6.0
6 0.5 -0.5 NaN
7 0.5 0.0 8.0
8 0.5 0.5 2.0
You can also add a single missing row by hand (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
import numpy as np
new_row = {"x": -0.5, "y": 0.5, "z": np.nan}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
I don't know exactly how you calculate the "x" and "y" values, but I guess your question is directed towards the NaN values in the "z" column.
You can build the Cartesian product of [-0.5, 0, 0.5] with itertools.product:
import itertools
lst = pd.concat([df['x'], df['y']]).unique().tolist()
p = list(itertools.product(lst, repeat=2))
print(p)
[(-0.5, -0.5), (-0.5, 0.0), (-0.5, 0.5), (0.0, -0.5), (0.0, 0.0), (0.0, 0.5), (0.5, -0.5), (0.5, 0.0), (0.5, 0.5)]
Then fill the missing index:
out = df.set_index(['x', 'y']).reindex(p).reset_index()
print(out)
x y z
0 -0.5 -0.5 10.0
1 -0.5 0.0 5.0
2 -0.5 0.5 NaN
3 0.0 -0.5 7.0
4 0.0 0.0 12.0
5 0.0 0.5 6.0
6 0.5 -0.5 NaN
7 0.5 0.0 8.0
8 0.5 0.5 2.0
You can make a "Default" xy DataFrame, and then outer merge to it~
import itertools
xy = pd.DataFrame(itertools.product([-0.5, 0, 0.5], repeat=2), columns=['x', 'y'])
df.merge(xy, how='outer')
Output:
x y z
0 -0.5 0.0 5.0
1 -0.5 -0.5 10.0
2 0.0 -0.5 7.0
3 0.0 0.5 6.0
4 0.0 0.0 12.0
5 0.5 0.0 8.0
6 0.5 0.5 2.0
7 -0.5 0.5 NaN
8 0.5 -0.5 NaN
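For reference, the same grid-and-reindex idea can be written with plain pandas via MultiIndex.from_product; a minimal sketch, assuming the full grid is the product of each column's unique values and that the existing (x, y) pairs are unique:
# build the full x/y grid and reindex onto it
full = pd.MultiIndex.from_product(
    [df['x'].unique(), df['y'].unique()], names=['x', 'y'])
out = df.set_index(['x', 'y']).reindex(full).reset_index()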

Create Max and Min column values from a single column value pandas

I have a dataframe like the one below and I need to create two columns out of the base column.
Input
Kg
0.5
0.5
1
1
1
2
2
5
5
5
Expected Output
Kg_From Kg_To
0 0.5
0 0.5
0.5 1
0.5 1
0.5 1
1 2
1 2
2 5
2 5
2 5
How can this be done in pandas?
Assuming your kg column is sorted:
s = df["Kg"].unique()
df["Kg_from"] = df["Kg"].map({k:v for k,v in zip(s[1:], s)}).fillna(0)
print (df)
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0
import numpy as np

# get the unique values and the count of each value in the Kg column
val, counts = np.unique(df.Kg, return_counts=True)
# shift forward by 1 and replace the first value with 0
val = np.roll(val, 1)
val[0] = 0
# repeat each shifted value according to the counts generated earlier
df['Kg_from'] = np.repeat(val, counts)
df
Kg Kg_from
0 0.5 0.0
1 0.5 0.0
2 1.0 0.5
3 1.0 0.5
4 1.0 0.5
5 2.0 1.0
6 2.0 1.0
7 5.0 2.0
8 5.0 2.0
9 5.0 2.0
Use zip and dict to build a mapping from the unique sorted values (np.unique) to the same values shifted one position (np.insert prepends a 0), and create the new column with DataFrame.insert:
df = df.rename(columns={'Kg':'Kg_To'})
a = np.unique(df["Kg_To"])
df.insert(0, 'Kg_from', df['Kg_To'].map(dict(zip(a, np.insert(a, 0, 0)))))
print (df)
Kg_from Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0
Code:
kgs = df.Kg.unique()
lower = [0] + list(kgs[:-1])
kg_dict = {k:v for v,k in zip(lower,kgs)}
# new dataframe
new_df = pd.DataFrame({
    'Kg_From': df['Kg'].map(kg_dict),
    'Kg_To': df['Kg']
})
# or if you want new columns:
df['Kg_from'] = df['Kg'].map(kg_dict)
Output:
Kg_From Kg_To
0 0.0 0.5
1 0.0 0.5
2 0.5 1.0
3 0.5 1.0
4 0.5 1.0
5 1.0 2.0
6 1.0 2.0
7 2.0 5.0
8 2.0 5.0
9 2.0 5.0
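For completeness, the same mapping can also be built with Series.shift; a minimal sketch, again assuming the Kg column is sorted:
# map each unique value to its predecessor (0 for the smallest)
u = pd.Series(sorted(df['Kg'].unique()))
mapping = dict(zip(u, u.shift(fill_value=0)))
df['Kg_From'] = df['Kg'].map(mapping)
df = df.rename(columns={'Kg': 'Kg_To'})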

Find values from other dataframe and assign to original dataframe

Having input dataframe:
x_1 x_2
0 0.0 0.0
1 1.0 0.0
2 2.0 0.2
3 2.5 1.5
4 1.5 2.0
5 -2.0 -2.0
and additional dataframe as follows:
index x_1_x x_2_x x_1_y x_2_y value dist dist_rank
0 0 0.0 0.0 0.1 0.1 5.0 0.141421 2.0
4 0 0.0 0.0 1.5 1.0 -2.0 1.802776 3.0
5 0 0.0 0.0 0.0 0.0 3.0 0.000000 1.0
9 1 1.0 0.0 0.1 0.1 5.0 0.905539 1.0
11 1 1.0 0.0 2.0 0.4 3.0 1.077033 3.0
14 1 1.0 0.0 0.0 0.0 3.0 1.000000 2.0
18 2 2.0 0.2 0.1 0.1 5.0 1.902630 3.0
20 2 2.0 0.2 2.0 0.4 3.0 0.200000 1.0
22 2 2.0 0.2 1.5 1.0 -2.0 0.943398 2.0
29 3 2.5 1.5 2.0 0.4 3.0 1.208305 3.0
30 3 2.5 1.5 2.5 2.5 4.0 1.000000 1.0
31 3 2.5 1.5 1.5 1.0 -2.0 1.118034 2.0
38 4 1.5 2.0 2.0 0.4 3.0 1.676305 3.0
39 4 1.5 2.0 2.5 2.5 4.0 1.118034 2.0
40 4 1.5 2.0 1.5 1.0 -2.0 1.000000 1.0
45 5 -2.0 -2.0 0.1 0.1 5.0 2.969848 2.0
46 5 -2.0 -2.0 1.0 -2.0 6.0 3.000000 3.0
50 5 -2.0 -2.0 0.0 0.0 3.0 2.828427 1.0
I want to create new columns in the input dataframe, based on the additional dataframe, with respect to dist_rank: for each row it should extract x_1_y, x_2_y and value, keyed by index and dist_rank.
I tried the following lines:
df['value_dist_rank1'] = result.loc[result['dist_rank']==1.0, 'value']
df['value_dist_rank1'] = result[result['dist_rank']==1.0]['value']
but both gave the same output:
x_1 x_2 value_dist_rank1
0 0.0 0.0 NaN
1 1.0 0.0 NaN
2 2.0 0.2 NaN
3 2.5 1.5 NaN
4 1.5 2.0 NaN
5 -2.0 -2.0 3.0
Here is a way to do it :
(For the sake of clarity I consider the input df as df1 and the additional df as df2)
# First we group df2 by index to get all the column information for each index on one line
df2 = df2.groupby('index').agg(lambda x: list(x)).reset_index()
# Then we expand each list into three columns, since there are always three rows per index
columns = ['dist_rank', 'value', 'x_1_y', 'x_2_y']
column_to_add = ['value', 'x_1_y', 'x_2_y']
for index, row in df2.iterrows():
    for i in range(3):
        column_names = ["{}_dist_rank{}".format(x, row.dist_rank[i])[:-2] for x in column_to_add]
        values = [row[x][i] for x in column_to_add]
        for column, value in zip(column_names, values):
            df2.loc[index, column] = value
# We drop the columns that are no longer useful:
df2.drop(columns=columns + ['dist', 'x_1_x', 'x_2_x'], inplace=True)
# Finally we merge the modified df with our initial dataframe:
result = df1.merge(df2, left_index=True, right_on='index', how='left')
Output :
x_1 x_2 index value_dist_rank2 x_1_y_dist_rank2 x_2_y_dist_rank2 \
0 0.0 0.0 0 5.0 0.1 0.1
1 1.0 0.0 1 3.0 0.0 0.0
2 2.0 0.2 2 -2.0 1.5 1.0
3 2.5 1.5 3 -2.0 1.5 1.0
4 1.5 2.0 4 4.0 2.5 2.5
5 -2.0 -2.0 5 5.0 0.1 0.1
value_dist_rank3 x_1_y_dist_rank3 x_2_y_dist_rank3 value_dist_rank1 \
0 -2.0 1.5 1.0 3.0
1 3.0 2.0 0.4 5.0
2 5.0 0.1 0.1 3.0
3 3.0 2.0 0.4 4.0
4 3.0 2.0 0.4 -2.0
5 6.0 1.0 -2.0 3.0
x_1_y_dist_rank1 x_2_y_dist_rank1
0 0.0 0.0
1 0.1 0.1
2 2.0 0.4
3 2.5 2.5
4 1.5 1.0
5 0.0 0.0
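For reference, the same reshape can be done without the explicit loop; a sketch using pivot_table, starting from the original additional dataframe (df2 before the groupby above) and assuming each (index, dist_rank) pair occurs exactly once:
# one column per (value column, dist_rank) pair
wide = df2.pivot_table(index='index', columns='dist_rank',
                       values=['value', 'x_1_y', 'x_2_y'])
wide.columns = ['{}_dist_rank{}'.format(c, int(r)) for c, r in wide.columns]
result = df1.join(wide)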

How to compare two values of a column if the values in another column in the same rows match

I have a dataframe and I want to see the percentage of wins (0 = lose; 1 = win) for the team with the higher number of wardskilled.
matchid team win wardskilled
0 10 1 0.0 8.0
1 10 2 1.0 10.0
2 11 1 0.0 8.0
3 11 2 1.0 8.0
4 12 1 0.0 2.0
5 12 2 1.0 5.0
6 13 1 0.0 5.0
7 13 2 1.0 5.0
8 14 1 0.0 1.0
9 14 2 1.0 1.0
10 15 1 1.0 3.0
11 15 2 0.0 1.0
.. .. .. .. ..
Since I'm a newbie to Python, I have absolutely no idea how to start.
I would love to create something like:
Teams with more wardskilled Teams with less wardskilled
win % %
lose % %
I would appreciate any kind of help.
One approach is to compare a team's wardskilled with the mean of the two teams in the match:
import numpy as np

means = df.groupby('matchid').wardskilled.transform('mean')
df['more_skilled'] = np.sign(df.wardskilled.sub(means))
(df.groupby('win')
.more_skilled
.value_counts(normalize=True)
.unstack('more_skilled', fill_value=0)
)
Output
more_skilled -1.0 0.0 1.0
win
0.0 0.5 0.5 0.0
1.0 0.0 0.5 0.5
If every 'matchid' has 2 teams, you can use rank to determine whether a team has the higher, lower, or tied 'wardskilled'. Group by this and calculate the average win:
s = df.groupby('matchid').wardskilled.rank().map({1: 'Less', 1.5: 'Tied', 2: 'More'})
df.groupby(s).win.mean()
#wardskilled
#More 1.0
#Less 0.0
#Tied 0.5
#Name: win, dtype: float64
Having the two columns is redundant, but if you must:
res = df.groupby(s).win.mean().to_frame('win_per')
res['loss_per'] = 1-res['win_per']
# win_per loss_per
#wardskilled
#More 1.0 0.0
#Less 0.0 1.0
#Tied 0.5 0.5
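If you want the exact layout sketched in the question (teams as columns, win/lose as rows), you can transpose res and scale to percentages; the labels here are hypothetical:
layout = (res.loc[['More', 'Less']].T * 100).round(1)
layout.index = ['win %', 'lose %']   # hypothetical row labels matching the question
print(layout)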

Pandas creating new dataframe from several group by operations

I have a pandas dataframe
test = pd.DataFrame({'d':[1,1,1,2,2,3,3], 'id':[1,2,3,1,2,2,3], 'v1':[10, 20, 15, 35, 5, 10, 30], 'v2':[3, 4, 1, 6, 0, 2, 0], 'w1':[0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2], 'w2':[0.8, 0.1, 0.2, 0.3, 0.1, 0.1, 0.0]})
d id v1 v2 w1 w2
0 1 1 10 3 0.10 0.80
1 1 2 20 4 0.30 0.10
2 1 3 15 1 0.20 0.20
3 2 1 35 6 0.10 0.30
4 2 2 5 0 0.40 0.10
5 3 2 10 2 0.30 0.10
6 3 3 30 0 0.20 0.00
and I would like to get some weighted values by group like
test['w1v1'] = test['w1'] * test['v1']
test['w1v2'] = test['w1'] * test['v2']
test['w2v1'] = test['w2'] * test['v1']
test['w2v2'] = test['w2'] * test['v2']
How can I get the result nicely into a dataframe? Something that looks like
test.groupby('id').sum()['w1v1'] / test.groupby('id').sum()['w1']
id
1 22.50
2 11.00
3 22.50
but includes columns for each weighted value, so like
id w1v1 w1v2 w2v1 w2v2
1 22.50 ... ... ...
2 11.00 ... ... ...
3 22.50 ... ... ...
Any ideas how I can achieve this quickly and easily?
Use:
cols = ['w1v1','w2v1','w1v2','w2v2']
test1 = (test[['w1', 'w2', 'w1', 'w2']] * test[['v1', 'v1', 'v2', 'v2']].values)
test1.columns = cols
print (test1)
w1v1 w2v1 w1v2 w2v2
0 1.0 8.0 0.3 2.4
1 6.0 2.0 1.2 0.4
2 3.0 3.0 0.2 0.2
3 3.5 10.5 0.6 1.8
4 2.0 0.5 0.0 0.0
5 3.0 1.0 0.6 0.2
6 6.0 0.0 0.0 0.0
df = test.join(test1).groupby('id').sum()
df1 = df[cols] / df[['w1', 'w2', 'w1', 'w2']].values
print (df1)
w1v1 w2v1 w1v2 w2v2
id
1 22.5 16.818182 4.5 3.818182
2 11.0 11.666667 1.8 2.000000
3 22.5 15.000000 0.5 1.000000
Another more dynamic solution with MultiIndex DataFrames:
a = ['v1', 'v2']
b = ['w1', 'w2']
mux = pd.MultiIndex.from_product([a,b])
df1 = test.set_index('id').drop('d', axis=1)
v = df1.reindex(columns=mux, level=0)
w = df1.reindex(columns=mux, level=1)
print (v)
v1 v2
w1 w2 w1 w2
id
1 10 10 3 3
2 20 20 4 4
3 15 15 1 1
1 35 35 6 6
2 5 5 0 0
2 10 10 2 2
3 30 30 0 0
print (w)
v1 v2
w1 w2 w1 w2
id
1 0.1 0.8 0.1 0.8
2 0.3 0.1 0.3 0.1
3 0.2 0.2 0.2 0.2
1 0.1 0.3 0.1 0.3
2 0.4 0.1 0.4 0.1
2 0.3 0.1 0.3 0.1
3 0.2 0.0 0.2 0.0
df = w * v
print (df)
v1 v2
w1 w2 w1 w2
id
1 1.0 8.0 0.3 2.4
2 6.0 2.0 1.2 0.4
3 3.0 3.0 0.2 0.2
1 3.5 10.5 0.6 1.8
2 2.0 0.5 0.0 0.0
2 3.0 1.0 0.6 0.2
3 6.0 0.0 0.0 0.0
df1 = df.groupby('id').sum() / w.groupby('id').sum()
#flatten MultiIndex columns
df1.columns = ['{0[1]}{0[0]}'.format(x) for x in df1.columns]
print (df1)
w1v1 w2v1 w1v2 w2v2
id
1 22.5 16.818182 4.5 3.818182
2 11.0 11.666667 1.8 2.000000
3 22.5 15.000000 0.5 1.000000
If you are fine with MultiIndex columns, you can use groupby + dot:
test.groupby('id').apply(
lambda g: g.filter(like='v').T.dot(g.filter(like='w')/g.filter(like='w').sum()).stack()
)
# v1 v2
# w1 w2 w1 w2
#id
#1 22.5 16.818182 4.5 3.818182
#2 11.0 11.666667 1.8 2.000000
#3 22.5 15.000000 0.5 1.000000
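All of these answers implement the same weighted average, sum(w * v) / sum(w) per id; for comparison, a direct sketch of that formula:
out = pd.DataFrame({
    '{}{}'.format(w, v): test.groupby('id').apply(
        lambda g, v=v, w=w: (g[w] * g[v]).sum() / g[w].sum())
    for v in ['v1', 'v2'] for w in ['w1', 'w2']
})
print(out)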
