I have a file, say temp.rule, with m rows and n columns, where each row looks like att1,att2,att3,...,attN,class,fitness. Suppose my file looks like this:
A,B,C,1,0.67
D,E,F,1,0.84
P,Q,R,2,0.77
S,T,U,2,0.51
G,H,I,1,0.45
J,K,L,1,0.82
M,N,O,2,0.28
V,W,X,2,0.41
Y,Z,A,2,0.51
Here, for the 1st row, A,B,C are the attributes, 1 is the class, and 0.67 is the fitness. Now I want to sort the rows by fitness within each class and assign a rank. Afterwards my file should look like this:
P,Q,R,2,0.77,5
S,T,U,2,0.51,3.5
Y,Z,A,2,0.51,3.5
V,W,X,2,0.41,2
M,N,O,2,0.28,1
D,E,F,1,0.84,4
J,K,L,1,0.82,3
A,B,C,1,0.67,2
G,H,I,1,0.45,1
Within class 2 there are 5 rows, so they are sorted by fitness and assigned ranks 1 to 5; the same goes for class 1, which has 4 rows, so its ranks run from 1 to 4. The 3.5 appears because, in case of a tie, I want to take the average of the consecutive ranks. I have done the sorting part, but I am unable to assign the ranks like this. I have also created a dictionary that counts how many rows belong to class 1, class 2, and so on.
Here is my attempt:
rule_file_name = 'temp.rule'
rule_fp = open(rule_file_name)
rule_fit_val = []
for line in rule_fp.readlines():
    rule_fit_val.append(line.replace("\n", "").split(","))

def convert_fitness_to_float(lst):
    return lst[:-1] + [float(lst[-1])]

rule_fit_val = [convert_fitness_to_float(i) for i in rule_fit_val]
rule_fit_val = sorted(rule_fit_val, key=lambda x: x[-2:], reverse=True)

item_list = []
for i in rule_fit_val:
    i = list(map(str, i))
    s = ','.join(i).replace("\n", "")
    item_list.append(s)
print(*item_list, sep='\n')

with open("check_sorted_fitness.rule", "w") as outfile:
    outfile.write("\n".join(item_list))

list1 = []
for i in rule_fit_val:
    list1.append(i[-2])

freq = {}
for items in list1:
    freq[items] = list1.count(items)

my_dict_new = {k: v for k, v in freq.items()}
print(my_dict_new)
Please help me figure out how to assign ranks like that.
Consider using the pandas module; then you can do something like this:
import pandas as pd
# 'temp.rule' is your input file; names= provides the column headers
df = pd.read_csv('temp.rule', names=['att1','att2','att3','class','fitness'])
>>> df
'''
att1 att2 att3 class fitness
0 A B C 1 0.67
1 D E F 1 0.84
2 P Q R 2 0.77
3 S T U 2 0.51
4 G H I 1 0.45
5 J K L 1 0.82
6 M N O 2 0.28
7 V W X 2 0.41
8 Y Z A 2 0.51
'''
out = (df.assign(rank=df.groupby('class')['fitness'].transform(lambda x: x.rank()))
         .sort_values(['class', 'fitness'], ascending=False))
>>> out
'''
att1 att2 att3 class fitness rank
2 P Q R 2 0.77 5.0
3 S T U 2 0.51 3.5
8 Y Z A 2 0.51 3.5
7 V W X 2 0.41 2.0
6 M N O 2 0.28 1.0
1 D E F 1 0.84 4.0
5 J K L 1 0.82 3.0
0 A B C 1 0.67 2.0
4 G H I 1 0.45 1.0
'''
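The 3.5 values come from pandas' default tie handling: rank() uses method='average', so the two rows tied at fitness 0.51 share the average of ranks 3 and 4. If you ever want a different tie rule, you can pass another method, for example:

df.groupby('class')['fitness'].rank(method='min')   # ties would both get rank 3 instead of 3.5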
out.to_csv('out.rule', header=False, index=False)   # 'out.rule' is the new file
Contents of out.rule:
'''
P,Q,R,2,0.77,5.0
S,T,U,2,0.51,3.5
Y,Z,A,2,0.51,3.5
V,W,X,2,0.41,2.0
M,N,O,2,0.28,1.0
D,E,F,1,0.84,4.0
J,K,L,1,0.82,3.0
A,B,C,1,0.67,2.0
G,H,I,1,0.45,1.0
'''
Update:
Now it does not matter how many columns your file has, as long as the last two columns are 'class' and 'fitness' respectively:
import pandas as pd

df = pd.read_csv('temp.rule', header=None)
df = df.rename(columns={df.columns[-1]: 'fitness', df.columns[-2]: 'class'})
out = (df.assign(rank=df.groupby('class')['fitness'].transform(lambda x: x.rank()))
         .sort_values(['class', 'fitness'], ascending=False))
out.to_csv('out.rule', header=False, index=False)
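If you prefer to stay with the plain-Python approach from the question, here is a minimal sketch of the same averaged-rank logic. It assumes the rows are already parsed into rule_fit_val with the fitness converted to float, as in the question's code; ranks are assigned from the lowest fitness up, and you can re-sort the result afterwards however you like:

from itertools import groupby

rows = sorted(rule_fit_val, key=lambda r: (r[-2], r[-1]))     # sort by class, then fitness
ranked = []
for _, class_rows in groupby(rows, key=lambda r: r[-2]):      # one group per class
    class_rows = list(class_rows)
    rank = 1
    for _, ties in groupby(class_rows, key=lambda r: r[-1]):  # rows tied on fitness
        ties = list(ties)
        avg_rank = rank + (len(ties) - 1) / 2                 # average of the consecutive ranks
        for r in ties:
            ranked.append(r + [avg_rank])
        rank += len(ties)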
I have the table below and I want to get, for each type, the percentage of rows that are >= 10 seconds. What is an efficient, modular way to do that? I would normally just filter for each type and then divide, but I wanted to know if there is a better way to calculate the percentage of rows in each Type that are >= 10 seconds.
Thanks
Type  Seconds
A     23
V     10
V     10
A     7
B     1
V     10
B     72
A     11
V     19
V     3
expected output:
type %
A .67
V .80
B .50
A slightly more efficient option is to create a boolean mask with Seconds.ge(10) and call groupby.mean() on the mask; since the mean of a boolean Series is the fraction of True values, this directly gives the share of rows at or above 10 seconds for each Type:
df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%')
# Type %
# 0 A 0.666667
# 1 B 0.500000
# 2 V 0.800000
For comparison, the three approaches in these answers can be wrapped up as functions:
mask_groupby_mean = lambda df: df.Seconds.ge(10).groupby(df.Type).mean().reset_index(name='%')
groupby_apply = lambda df: df.groupby('Type').Seconds.apply(lambda x: (x.ge(10).sum() / len(x)) * 100).reset_index(name='%')
set_index_mean = lambda df: df.set_index('Type').ge(10).mean(level=0).rename(columns={'Seconds': '%'}).reset_index()
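To actually compare them on your own data, a simple timing harness (results will depend on your data and machine) might look like:

import timeit

for name, fn in [('mask_groupby_mean', mask_groupby_mean),
                 ('groupby_apply', groupby_apply),
                 ('set_index_mean', set_index_mean)]:
    per_call = timeit.timeit(lambda: fn(df), number=100) / 100
    print(f'{name}: {per_call:.6f} s per call')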
You can use .groupby:
x = (
    df.groupby("Type")["Seconds"]
    .apply(lambda x: (x.ge(10).sum() / len(x)) * 100)
    .reset_index(name="%")
)
print(x)
Prints:
Type %
0 A 66.666667
1 B 50.000000
2 V 80.000000
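If you want fractions matching the expected output (0.67 rather than 66.67), take the mean of the boolean mask directly and drop the * 100:

x = df.groupby("Type")["Seconds"].apply(lambda x: x.ge(10).mean()).reset_index(name="%")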
Another option is set_index + ge, then mean on level=0:
new_df = (
    df.set_index('Type')['Seconds'].ge(10).mean(level=0)
      .round(2)
      .reset_index(name='%')
)
new_df:
Type %
0 A 0.67
1 V 0.80
2 B 0.50
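Note that the level= argument of mean() was deprecated in later pandas releases and removed in pandas 2.0; on those versions the same idea can be written with an explicit groupby on the index level:

new_df = (
    df.set_index('Type')['Seconds'].ge(10)
      .groupby(level=0).mean()
      .round(2)
      .reset_index(name='%')
)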
I have a DataFrame as below:
source =
HM IM Ratio
A B 50%
A C 20%
A D 30%
E B 40%
E C 20%
E F 40%
H C 50%
H E 10%
H G 40%
G B 80%
G D 10%
J B 10%
J H 80%
J X 5%
J E 5%
I want to know, for each item in the 'HM' column, its total percentage of "C", with indirect paths multiplied through. For instance:
total C% for 'H' = 50% (direct C) + 10% (E) * 20% (C within E) = 52%
I built a function using recursion, shown below:
root = ['C']
BPB = []
BPB_ratio = {}

def spB(mat, root, ratio, level, lay):
    items = source.loc[source['HM'] == mat, 'IM'].tolist()
    for item in items:
        items_item = source.loc[source['HM'] == item, 'IM'].tolist()
        item_ratio = source.loc[(source['HM'] == mat) & (source['IM'] == item), 'Ratio'].tolist()[0]
        BPB.append([level, item, ratio * item_ratio])
        if item in root:
            BPB_ratio[level] =+ ratio * item_ratio
            continue
        if len(items_item) == 0:
            continue
        else:
            nlevel = level + 1
            spB(item, root, ratio * item_ratio, nlevel, lay)
    if lay == 0:
        return sum(BPB_ratio.values())
    else:
        return BPB_ratio[lay]

for ss in list(set(source['HM'].tolist())):
    percent = spB(ss, root, 1, 0, 0)
    print(BPB_ratio)
It gives me correct results; however, it is far too slow. I have a source DataFrame with nearly 60,000 rows, and traversing the entire DataFrame this way takes an extremely long time. Are there better solutions than recursion?
I would try to use merge on the dataframe instead of using recursion.
First I would define a function that computes paths with one intermediate step from your dataframe:
def onestep(df):
    df2 = df.merge(df, left_on='IM', right_on='HM')
    df2['Ratio'] = df2['Ratio_x'] * df2['Ratio_y']  # compute resulting ratio
    # only keep relevant columns and rename them
    df2 = df2[['HM_x', 'IM_y', 'Ratio']].rename(
        columns={'HM_x': 'HM', 'IM_y': 'IM'})
    # sum up paths with same origin and destination
    return df2.groupby(['HM', 'IM']).sum().reset_index()
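One caveat: the multiplication assumes Ratio is already numeric (a fraction between 0 and 1). If the column still holds strings like '50%' as in the sample, you could convert it first, for example:

df['Ratio'] = df['Ratio'].str.rstrip('%').astype(float) / 100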
With your sample, we can see:
>>> onestep(df)
HM IM Ratio
0 H B 0.36
1 H C 0.02
2 H D 0.04
3 H F 0.04
4 J B 0.02
5 J C 0.41
6 J E 0.08
7 J F 0.02
8 J G 0.32
We correctly get H->C (through E) at 2%
Then I would try to iterate on onestep until the resulting dataframe is empty (or a maximum depth is reached), and finally combine everything:
dfs = [df]
temp = df
n = 10  # give up at depth 10 (adapt it to your actual use case)
for i in range(n):
    temp = onestep(temp)
    if len(temp) == 0:  # break when the df is empty
        break
    dfs.append(temp)
else:
    # we gave up before exploring all the paths: warn the user
    print(f"BEWARE: exiting after {n} steps")

resul = pd.concat(dfs, ignore_index=True).groupby(
    ['HM', 'IM']).sum().reset_index()
With your sample data it gives (iteration at step 2 gave an empty dataframe):
HM IM Ratio
0 A B 0.50
1 A C 0.20
2 A D 0.30
3 E B 0.40
4 E C 0.20
5 E F 0.40
6 G B 0.80
7 G D 0.10
8 H B 0.36
9 H C 0.52
10 H D 0.04
11 H E 0.10
12 H F 0.04
13 H G 0.40
14 J B 0.12
15 J C 0.41
16 J E 0.13
17 J F 0.02
18 J G 0.32
19 J H 0.80
20 J X 0.05
And we correctly find H->C at 52%
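To read off the answer to the original question (the total share of 'C' for each HM), you can then filter the combined result:

c_share = resul.loc[resul['IM'] == 'C', ['HM', 'Ratio']]
print(c_share)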
I cannot be sure of the real efficiency on a large dataframe, because it will depend on the actual graph complexity.
Suppose I have two DataFrames, df1 and df2, as follows:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[0,1,100],[1,1.1,120],[2,0.8,102]],columns=['id','a','b'])
df2 = pd.DataFrame([[0,0.5,110],[1,1.05,94],[2,0.96,145],[3,0.86,112],[4,1.3,97]],
                   columns=['id','a','b'])
print(df1)
id a b
0 0 1.0 100
1 1 1.1 120
2 2 0.8 102
print(df2)
id a b
0 0 0.50 110
1 1 1.05 94
2 2 0.96 145
3 3 0.86 112
4 4 1.30 97
Now, suppose I choose some interval sizes da, db. I want, for each row in df1, to pick a random row from df2 such that abs(a1-a2) < da and abs(b1-b2) < db. What I am currently doing is very brute-force:
da = 0.2
db = 25
df2_list=[]
nbad = 0
for rid, row in df1.iterrows():
    ca = row['a']
    cb = row['b']
    c_df2 = df2[np.abs(df2['a'] - ca) < da]\
               [np.abs(df2['b'] - cb) < db]
    if len(c_df2) == 0:
        nbad += 1
        continue
    c_df2 = c_df2.sample()
    df2_list.append(c_df2['id'].values[0])
matched_df = df2[df2['id'].isin(df2_list)]
print(matched_df)
id a b
1 1 1.05 94
3 3 0.86 112
4 4 1.30 97
However, for my real purpose, where my DF is really big, this is very slow.
Is there a faster way to achieve this result?
Here's a solution:
da = 0.2
db = 25

# cross join df1 with df2 via a dummy key, keep only pairs within the tolerances,
# then sample one matching df2 row per df1 row
res = pd.merge(df1.assign(dummy=1), df2.assign(dummy=1), on="dummy").drop("dummy", axis=1)
res = res[(np.abs(res.a_x - res.a_y) < da) & (np.abs(res.b_x - res.b_y) < db)]
res = res.groupby("id_x").apply(lambda x: x.sample(1))[["id_y", "a_y", "b_y"]]
res.index = res.index.droplevel(1)
print(res)
The output is:
id_y a_y b_y
id_x
0 1 1.05 94
1 4 1.30 97
2 3 0.86 112
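As a side note, on pandas 1.2 and later the dummy-column trick can be replaced by a built-in cross join; overlapping column names get the same _x/_y suffixes by default:

res = df1.merge(df2, how="cross")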
I have a dataframe that contains two columns, a: [1,2,3,4,5] and b: [1,0.4,0.3,0.5,0.2]. How can I make a column c such that:
c[0] = 1
c[i] = c[i-1]*b[i]+a[i]*(1-b[i])
so that c = [1, 1.6, 2.58, 3.29, 4.658]?
Calculation:
1 = 1
1*0.4+2*0.6 = 1.6
1.6*0.3+3*0.7 = 2.58
2.58*0.5+4*0.5 = 3.29
3.29*0.2+5*0.8 = 4.658
I can't see a way to vectorise your recursive algorithm. However, you can use numba to optimize your current logic. This should be preferable to a regular loop.
import numpy as np
import pandas as pd
from numba import jit

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0.4, 0.3, 0.5, 0.2]})

@jit(nopython=True)
def foo(a, b):
    c = np.zeros(a.shape)
    c[0] = 1
    for i in range(1, c.shape[0]):
        c[i] = c[i-1] * b[i] + a[i] * (1 - b[i])
    return c

df['c'] = foo(df['a'].values, df['b'].values)
print(df)
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
There could be a smarter way, but here's my attempt:
import pandas as pd

a = [1, 2, 3, 4, 5]
b = [1, 0.4, 0.3, 0.5, 0.2]
df = pd.DataFrame({'a': a, 'b': b})

for i in range(len(df)):
    if i == 0:
        df.loc[i, 'c'] = 1
    else:
        df.loc[i, 'c'] = df.loc[i-1, 'c'] * df.loc[i, 'b'] + df.loc[i, 'a'] * (1 - df.loc[i, 'b'])
Output:
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
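If you would rather avoid the numba dependency, a middle ground is to run the same loop over plain NumPy arrays and assign the column once at the end, which avoids the repeated row-by-row .loc writes:

import numpy as np

a = df['a'].to_numpy()
b = df['b'].to_numpy()
c = np.empty(len(df))
c[0] = 1
for i in range(1, len(df)):
    c[i] = c[i-1] * b[i] + a[i] * (1 - b[i])
df['c'] = c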