Issue applying a function to a pandas DataFrame - python

Hi, I have the following function to decide the winner:
import numpy as np

def winner(T1, T2, S1, S2, PS1, PS2):
    if S1 > S2:
        return T1
    elif S2 > S1:
        return T2
    else:
        # print('Winner will be decided via penalty shoot out')
        Ninit = 5
        Ts1 = np.sum(np.random.random(size=Ninit)) * PS1
        Ts2 = np.sum(np.random.random(size=Ninit)) * PS2
        if Ts1 > Ts2:
            return T1
        elif Ts2 > Ts1:
            return T2
        else:
            return 'Draw'
And I have the following data frame:
df = pd.DataFrame()
df['Team1'] = ['A', 'B', 'C', 'D', 'E', 'F']
df['Score1'] = [1, 2, 3, 1, 2, 4]
df['Team2'] = ['U', 'V', 'W', 'X', 'Y', 'Z']
df['Score2'] = [2, 2, 2, 2, 3, 3]
df['Match'] = df['Team1'] + ' Vs ' + df['Team2']
df['Match_no'] = [1, 2, 3, 4, 5, 6]
df['P1'] = [0.8, 0.7, 0.6, 0.9, 0.75, 0.77]
df['P2'] = [0.75, 0.75, 0.65, 0.78, 0.79, 0.85]
I want to create a new column containing the winner of each match.
To decide the winner of each match I used the function winner above. I tested the function with arbitrary scalar inputs and it works. But when I used the DataFrame columns,
as follows:
df['Winner']= winner(df.Team1,df.Team2,df.Score1,df.Score2,df.P1,df.P2)
it showed me the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Can anyone advise why there is an error?
Thanks
Zep.

Your function isn't set up to take pandas.Series as inputs: when S1 and S2 are whole columns, if S1 > S2: asks Python for a single True/False from a boolean Series, which is exactly the ambiguity the error message complains about.
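A minimal standalone sketch of that ambiguity (not part of your data):
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([2, 2, 2])
if s1 > s2:  # raises ValueError: The truth value of a Series is ambiguous
    print('s1 wins')
One way around it is to call the function once per row: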
df['Winner'] = [
winner(*t) for t in zip(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2)]
df
Team1 Score1 Team2 Score2 Match Match_no P1 P2 Winner
0 A 1 U 2 A Vs U 1 0.80 0.75 U
1 B 2 V 2 B Vs V 2 0.70 0.75 V
2 C 3 W 2 C Vs W 3 0.60 0.65 C
3 D 1 X 2 D Vs X 4 0.90 0.78 X
4 E 2 Y 3 E Vs Y 5 0.75 0.79 Y
5 F 4 Z 3 F Vs Z 6 0.77 0.85 F
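If you prefer apply over a list comprehension, an equivalent row-wise call (usually no faster, just a stylistic alternative) would be:
df['Winner'] = df.apply(
    lambda r: winner(r.Team1, r.Team2, r.Score1, r.Score2, r.P1, r.P2), axis=1)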
Another way to go about it
def winner(T1, T2, S1, S2, PS1, PS2):
    ninit = 5
    Ts1 = np.random.rand(ninit).sum() * PS1
    Ts2 = np.random.rand(ninit).sum() * PS2
    a = np.select(
        [S1 > S2, S2 > S1, Ts1 > Ts2, Ts2 > Ts1],
        [T1, T2, T1, T2], 'DRAW')
    return a
df.assign(Winner=winner(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2))
Team1 Score1 Team2 Score2 Match Match_no P1 P2 Winner
0 A 1 U 2 A Vs U 1 0.80 0.75 U
1 B 2 V 2 B Vs V 2 0.70 0.75 B
2 C 3 W 2 C Vs W 3 0.60 0.65 C
3 D 1 X 2 D Vs X 4 0.90 0.78 X
4 E 2 Y 3 E Vs Y 5 0.75 0.79 Y
5 F 4 Z 3 F Vs Z 6 0.77 0.85 F
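Note that np.select returns the choice for the first condition that is true, so the regular-time score comparison always takes precedence over the simulated shoot-out. Also, this vectorised version draws a single random shoot-out sample that is shared by every row; if each match should get its own draw, a sketch of a per-row variant (winner_per_row is a hypothetical name, same np.select logic) is:
def winner_per_row(T1, T2, S1, S2, PS1, PS2):
    ninit = 5
    n = len(S1)
    # one independent shoot-out draw per match instead of a single shared draw
    Ts1 = np.random.rand(n, ninit).sum(axis=1) * PS1
    Ts2 = np.random.rand(n, ninit).sum(axis=1) * PS2
    return np.select([S1 > S2, S2 > S1, Ts1 > Ts2, Ts2 > Ts1],
                     [T1, T2, T1, T2], 'DRAW')

df.assign(Winner=winner_per_row(df.Team1, df.Team2, df.Score1, df.Score2, df.P1, df.P2))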

Related

Adding one extra column in a python file depending on the value of another column

I have a file, say temp.rule, which has m rows and n columns, where each row looks like att1,att2,att3,...,attN,class,fitness. Suppose my file looks something like this:
A,B,C,1,0.67
D,E,F,1,0.84
P,Q,R,2,0.77
S,T,U,2,0.51
G,H,I,1,0.45
J,K,L,1,0.82
M,N,O,2,0.28
V,W,X,2,0.41
Y,Z,A,2,0.51
In the first row, A,B,C are the attributes, 1 is the class and 0.67 is the fitness. Now I want to sort the rows by fitness within each class and assign a rank, so that afterwards my file looks something like:
P,Q,R,2,0.77,5
S,T,U,2,0.51,3.5
Y,Z,A,2,0.51,3.5
V,W,X,2,0.41,2
M,N,O,2,0.28,1
D,E,F,1,0.84,4
J,K,L,1,0.82,3
A,B,C,1,0.67,2
G,H,I,1,0.45,1
Within class 2 there are 5 rows, so they are sorted by fitness and ranked 1 to 5; likewise class 1 has 4 rows, so they are sorted by fitness and ranked 1 to 4. I have done the sorting part but I am unable to assign the rank like this. I have also created a dictionary to keep a count of how many rows belong to class 1, class 2 and so on. The 3.5 appears because, in case of a tie, I want the tied rows to get the average of the consecutive ranks.
Below I am giving my try:
rule_file_name = 'temp.rule'
rule_fp = open(rule_file_name)
rule_fit_val = []
for line in rule_fp.readlines():
    rule_fit_val.append(line.replace("\n", "").split(","))

def convert_fitness_to_float(lst):
    return lst[:-1] + [float(lst[-1])]

rule_fit_val = [convert_fitness_to_float(i) for i in rule_fit_val]
rule_fit_val = sorted(rule_fit_val, key=lambda x: x[-2:], reverse=True)
item_list = []
for i in rule_fit_val:
    i = list(map(str, i))
    s = ','.join(i).replace("\n", "")
    item_list.append(s)
print(*item_list, sep='\n')
with open("check_sorted_fitness.rule", "w") as outfile:
    outfile.write("\n".join(item_list))
list1 = []
for i in rule_fit_val:
    list1.append(i[-2])
freq = {}
for items in list1:
    freq[items] = list1.count(items)
my_dict_new = {k: v for k, v in freq.items()}
print(my_dict_new)
Please help me out with how I can assign the rank like that.
Consider using the pandas module; then you can get something like this:
import pandas as pd
df = pd.read_csv('temp.rule', names=['att1','att2','att3','class','fitness'])
#-----------------^^^^^^^^^ your file ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ column headers
>>> df
'''
att1 att2 att3 class fitness
0 A B C 1 0.67
1 D E F 1 0.84
2 P Q R 2 0.77
3 S T U 2 0.51
4 G H I 1 0.45
5 J K L 1 0.82
6 M N O 2 0.28
7 V W X 2 0.41
8 Y Z A 2 0.51
'''
out = (df.assign(rank=df.groupby('class')['fitness']
                        .transform(lambda x: x.rank()))
         .sort_values(['class', 'fitness'], ascending=False))
>>> out
'''
att1 att2 att3 class fitness rank
2 P Q R 2 0.77 5.0
3 S T U 2 0.51 3.5
8 Y Z A 2 0.51 3.5
7 V W X 2 0.41 2.0
6 M N O 2 0.28 1.0
1 D E F 1 0.84 4.0
5 J K L 1 0.82 3.0
0 A B C 1 0.67 2.0
4 G H I 1 0.45 1.0
'''
out.to_csv('out.rule', header=False, index=False)
#-----------^^^^^^^^ new file
out.rule now contains:
'''
P,Q,R,2,0.77,5.0
S,T,U,2,0.51,3.5
Y,Z,A,2,0.51,3.5
V,W,X,2,0.41,2.0
M,N,O,2,0.28,1.0
D,E,F,1,0.84,4.0
J,K,L,1,0.82,3.0
A,B,C,1,0.67,2.0
G,H,I,1,0.45,1.0
'''
UPD
Now it does not matter how many columns are in your file, as long as the last two columns are 'class' and 'fitness' respectively:
import pandas as pd
df = pd.read_csv('temp.rule', header=None)
df = df.rename(columns={df.columns[-1]:'fitness',df.columns[-2]:'class'})
out = (df.assign(rank=df.groupby('class')['fitness']
                        .transform(lambda x: x.rank()))
         .sort_values(['class', 'fitness'], ascending=False))
out.to_csv('out.rule',header=False,index=False)
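The 3.5 values in the expected output are exactly what pandas' default tie handling, rank(method='average'), produces: tied values share the average of the ranks they would otherwise occupy. A quick standalone check:
import pandas as pd

s = pd.Series([0.77, 0.51, 0.51, 0.41, 0.28])
print(s.rank())
# 0    5.0
# 1    3.5
# 2    3.5
# 3    2.0
# 4    1.0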

Calculating % for value in column based on condition or value

I have the table below and I want to get, for each type, the percentage of rows whose Seconds value is 10 or more. What is an efficient, modular way to do that? I would normally just filter for each type and then divide, but I wanted to know if there is a better way to calculate the percentage of values in each Type that are >= 10 seconds.
Thanks
Type | Seconds
A 23
V 10
V 10
A 7
B 1
V 10
B 72
A 11
V 19
V 3
expected output:
type %
A .67
V .80
B .50
A slightly more efficient option is to create a boolean mask of Seconds.ge(10) and use groupby.mean() on the mask:
df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%')
# Type %
# 0 A 0.666667
# 1 B 0.500000
# 2 V 0.800000
For comparison, the three approaches in this thread can be wrapped as functions:
mask_groupby_mean = lambda df: df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%')
groupby_apply = lambda df: df.groupby('Type').Seconds.apply(lambda x: (x.ge(10).sum() / len(x)) * 100).reset_index(name='%')
set_index_mean = lambda df: df.set_index('Type').ge(10).mean(level=0).rename(columns={'Seconds': '%'}).reset_index()
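A rough timing sketch, assuming the three functions above are defined (note that set_index_mean relies on mean(level=0), which was removed in pandas 2.0, so this comparison only runs on older pandas):
import timeit
import numpy as np
import pandas as pd

big = pd.DataFrame({'Type': np.random.choice(list('ABV'), 100_000),
                    'Seconds': np.random.randint(0, 60, 100_000)})

for name, fn in [('mask_groupby_mean', mask_groupby_mean),
                 ('groupby_apply', groupby_apply),
                 ('set_index_mean', set_index_mean)]:
    secs = timeit.timeit(lambda: fn(big), number=10) / 10
    print(f'{name}: {secs:.4f} s per call')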
You can use .groupby:
x = (
df.groupby("Type")["Seconds"]
.apply(lambda x: (x.ge(10).sum() / len(x)) * 100)
.reset_index(name="%")
)
print(x)
Prints:
Type %
0 A 66.666667
1 B 50.000000
2 V 80.000000
Another option: set_index + ge, then mean on level=0:
new_df = (
df.set_index('Type')['Seconds'].ge(10).mean(level=0)
.round(2)
.reset_index(name='%')
)
new_df:
Type %
0 A 0.67
1 V 0.80
2 B 0.50
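A small caveat: Series.mean(level=0) is deprecated and was removed in pandas 2.0; on current pandas the equivalent is a groupby on the index level:
new_df = (
    df.set_index('Type')['Seconds'].ge(10)
      .groupby(level=0).mean()
      .round(2)
      .reset_index(name='%')
)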

How to build an efficient function to calculate a specific element's percentage in a deeply nested dictionary?

I have a DataFrame as below:
source =
HM IM Ratio
A B 50%
A C 20%
A D 30%
E B 40%
E C 20%
E F 40%
H C 50%
H E 10%
H G 40%
G B 80%
G D 10%
J B 10%
J H 80%
J X 5%
J E 5%
I want to know for each item in 'HM' column, what's its percentage of total "C", for instance:
total C% in 'H' = 50%(C) + 10%(E) * 20%(C) = 52%
I built a function using recursion, shown below:
root = ['C']
BPB = []
BPB_ratio = {}

def spB(mat, root, ratio, level, lay):
    items = source.loc[source['HM'] == mat, 'IM'].tolist()
    for item in items:
        items_item = source.loc[source['HM'] == item, 'IM'].tolist()
        item_ratio = source.loc[(source['HM'] == mat) & (source['IM'] == item), 'Ratio'].tolist()[0]
        BPB.append([level, item, ratio * item_ratio])
        if item in root:
            BPB_ratio[level] =+ ratio * item_ratio
            continue
        if len(items_item) == 0:
            continue
        else:
            nlevel = level + 1
            spB(item, root, ratio * item_ratio, nlevel, lay)
    if lay == 0:
        return sum(BPB_ratio.values())
    else:
        return BPB_ratio[lay]

for ss in list(set(source['HM'].tolist())):
    percent = spB(ss, root, 1, 0, 0)
    print(BPB_ratio)
It gives me correct results; however, it is far too slow. I have a source DataFrame with nearly 60,000 rows, and traversing the entire DataFrame takes an extremely long time. I wonder whether there are better solutions than using recursion?
I would try to use merge on the dataframe instead of using recursion.
First I would define a function that computes paths with one intermediate step from your dataframe:
def onestep(df):
    df2 = df.merge(df, left_on='IM', right_on='HM')
    df2['Ratio'] = df2['Ratio_x'] * df2['Ratio_y']  # compute resulting ratio
    # only keep relevant columns and rename them
    df2 = df2[['HM_x', 'IM_y', 'Ratio']].rename(
        columns={'HM_x': 'HM', 'IM_y': 'IM'})
    # sum up paths with same origin and destination
    return df2.groupby(['HM', 'IM']).sum().reset_index()
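One practical note: if Ratio is stored as percent strings like '50%' (as in the question's table), it has to be converted to numeric fractions before onestep can multiply it; a small sketch, assuming the column really holds such strings:
# hypothetical preprocessing: turn '50%'-style strings into floats (0.50)
df['Ratio'] = df['Ratio'].str.rstrip('%').astype(float) / 100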
With your sample, we can see:
>>> onestep(df)
HM IM Ratio
0 H B 0.36
1 H C 0.02
2 H D 0.04
3 H F 0.04
4 J B 0.02
5 J C 0.41
6 J E 0.08
7 J F 0.02
8 J G 0.32
We correctly get H->C (through E) at 2%
Then I would try to iterate on onestep until the resulting dataframe is empty (or a maximum depth is reached), and finally combine everything:
dfs = [df]
temp = df
n = 10  # give up at depth 10 (adapt it to your actual use case)
for i in range(n):
    temp = onestep(temp)
    if len(temp) == 0:  # break when the df is empty
        break
    dfs.append(temp)
else:
    # we gave up before exploring all the paths: warn user
    print(f"BEWARE: exiting after {n} steps")

resul = pd.concat(dfs, ignore_index=True).groupby(
    ['HM', 'IM']).sum().reset_index()
With your sample data it gives (iteration at step 2 gave an empty dataframe):
HM IM Ratio
0 A B 0.50
1 A C 0.20
2 A D 0.30
3 E B 0.40
4 E C 0.20
5 E F 0.40
6 G B 0.80
7 G D 0.10
8 H B 0.36
9 H C 0.52
10 H D 0.04
11 H E 0.10
12 H F 0.04
13 H G 0.40
14 J B 0.12
15 J C 0.41
16 J E 0.13
17 J F 0.02
18 J G 0.32
19 J H 0.80
20 J X 0.05
And we correctly find H->C at 52%
I cannot be sure of the real efficiency on a large dataframe, because it will depend on the actual graph complexity...

Sampling pandas DF to match a second DF within error

Suppose I have two DFs, say df1,df2 as follows:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[0,1,100],[1,1.1,120],[2,0.8,102]],columns=['id','a','b'])
df2 = pd.DataFrame([[0,0.5,110],[1,1.05,94],[2,0.96,145],[3,0.86,112],[4,1.3,97]],
columns=['id','a','b'])
print(df1)
id a b
0 0 1.0 100
1 1 1.1 120
2 2 0.8 102
print(df2)
id a b
0 0 0.50 110
1 1 1.05 94
2 2 0.96 145
3 3 0.86 112
4 4 1.30 97
Now, suppose I choose some interval sizes da, db. For each row in df1, I want to pick a random row from df2 such that abs(a1-a2) < da and abs(b1-b2) < db. What I am currently doing is very brute force:
da = 0.2
db = 25
df2_list = []
nbad = 0
for rid, row in df1.iterrows():
    ca = row['a']
    cb = row['b']
    c_df2 = df2[(np.abs(df2['a'] - ca) < da) &
                (np.abs(df2['b'] - cb) < db)]
    if len(c_df2) == 0:
        nbad += 1
        continue
    c_df2 = c_df2.sample()
    df2_list.append(c_df2['id'].values[0])
matched_df = df2[df2['id'].isin(df2_list)]
print(matched_df)
id a b
1 1 1.05 94
3 3 0.86 112
4 4 1.30 97
However, for my real purpose, where my DF is really big, this is very slow.
Is there a faster way to achieve this result?
Here's a solution:
da = 0.2
db = 25
res = pd.merge(df1.assign(dummy = 1), df2.assign(dummy = 1), on = "dummy").drop("dummy", axis = 1)
res = res[(np.abs(res.a_x - res.a_y) < da) & (np.abs(res.b_x - res.b_y) < db)]
res = res.groupby("id_x").apply(lambda x: x.sample(1))[["id_y", "a_y", "b_y"]]
res.index = res.index.droplevel(1)
print(res)
The output is:
id_y a_y b_y
id_x
0 1 1.05 94
1 4 1.30 97
2 3 0.86 112
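As a side note, on pandas 1.2+ the dummy-column trick can be replaced by an explicit cross join; the rest of the code stays the same:
# pandas >= 1.2: cross join instead of a dummy key (overlapping columns still get _x/_y suffixes)
res = pd.merge(df1, df2, how="cross")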

How do I calculate moving average with customized weight in pandas?

I have a dataframe that contains two columns, a: [1,2,3,4,5] and b: [1,0.4,0.3,0.5,0.2]. How can I make a column c such that:
c[0] = 1
c[i] = c[i-1]*b[i] + a[i]*(1-b[i])
so that c = [1, 1.6, 2.58, 3.29, 4.658]
Calculation:
1 = 1
1*0.4+2*0.6 = 1.6
1.6*0.3+3*0.7 = 2.58
2.58*0.5+4*0.5 = 3.29
3.29*0.2+5*0.8 = 4.658
?
I can't see a way to vectorise your recursive algorithm. However, you can use numba to optimize your current logic. This should be preferable to a regular loop.
import numpy as np
import pandas as pd
from numba import jit

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0.4, 0.3, 0.5, 0.2]})

@jit(nopython=True)
def foo(a, b):
    c = np.zeros(a.shape)
    c[0] = 1
    for i in range(1, c.shape[0]):
        c[i] = c[i-1] * b[i] + a[i] * (1 - b[i])
    return c

df['c'] = foo(df['a'].values, df['b'].values)
print(df)
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
There could be a smarter way, but here's my attempt:
import pandas as pd

a = [1, 2, 3, 4, 5]
b = [1, 0.4, 0.3, 0.5, 0.2]
df = pd.DataFrame({'a': a, 'b': b})
for i in range(len(df)):
    if i == 0:
        df.loc[i, 'c'] = 1
    else:
        df.loc[i, 'c'] = df.loc[i-1, 'c'] * df.loc[i, 'b'] + df.loc[i, 'a'] * (1 - df.loc[i, 'b'])
Output:
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
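A loop-free way to write the same recurrence (still sequential under the hood, and the initial argument needs Python 3.8+) is itertools.accumulate; a sketch:
from itertools import accumulate
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0.4, 0.3, 0.5, 0.2]})
a = df['a'].to_numpy()
b = df['b'].to_numpy()
# start from c[0] = 1 and fold c[i] = c[i-1]*b[i] + a[i]*(1-b[i]) over the remaining rows
df['c'] = list(accumulate(zip(a[1:], b[1:]),
                          lambda c, ab: c * ab[1] + ab[0] * (1 - ab[1]),
                          initial=1.0))
print(df)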
