Calculating smallest within trio distance - python

I have a pandas dataframe similar to the one below:
Output var1 var2 var3
1 0.487981 0.297929 0.214090
1 0.945660 0.031666 0.022674
2 0.119845 0.828661 0.051495
2 0.095186 0.852232 0.052582
3 0.059520 0.053307 0.887173
3 0.091049 0.342226 0.566725
3 0.119295 0.414376 0.466329
... ... ... ...
Basically, I have 3 columns (propensity score values) and one output (treatment). I want to calculate the within-trio distance to find trios of outputs with the smallest within-trio distance.
The experiment is taken from the paper: "Matching by Propensity Score in Cohort Studies with Three Treatment Groups", Rassen et al. From their explanation it looks like calculating the perimeter of a triangle, but I am not sure.
I think that at this GitHub link: https://github.com/bwh-dope/pharmacoepi_toolbox/blob/master/src/org/drugepi/match/MatchDistanceCalculator.java there is Java code doing this stuff more or less, but I am not sure on how to use it. I use Python, so I have two options: try to adapt this previous code or write something else.
My idea is that var1, var2 and var3 can be treated as spatial x, y, z coordinates, so each row is a point in space.
I found a function that calculates the distance between 2 points:
#found here https://stackoverflow.com/questions/68938033/min-distance-between-point-cloud-xyz-points-in-python
import itertools
import numpy as np

distance = lambda p1, p2: np.sqrt(np.sum((p1 - p2) ** 2, axis=0))

def min_distance(cloud):
    pairs = itertools.combinations(cloud, 2)
    return min(distance(p1, p2) for p1, p2 in pairs)

def get_points(filename):
    with open(filename, 'r') as file:
        rows = np.genfromtxt(file, delimiter=',', skip_header=True)
    return rows

filename = 'cloud.csv'
cloud = get_points(filename)
min_dist = min_distance(cloud)
However, I want a distance for 3 points, so I think I need to sum the three pairwise distances (X-Y, X-Z and Y-Z), but I am not sure this procedure is right.
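If the within-trio distance really is the triangle perimeter, a minimal sketch (my assumption, using the first row of each treatment group from the example data) would be:
import numpy as np

# Within-trio distance as the perimeter of the triangle formed by three
# propensity-score vectors (assumption based on the Rassen et al. description).
def trio_distance(p1, p2, p3):
    d = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
    return d(p1, p2) + d(p1, p3) + d(p2, p3)

print(trio_distance([0.487981, 0.297929, 0.214090],
                    [0.119845, 0.828661, 0.051495],
                    [0.059520, 0.053307, 0.887173]))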

Finally, I tried my own solution, which I think is correct but perhaps too computationally expensive.
I created my 3 datasets according to the Output value: dataset1 = dataset[dataset["Output"]==1], and the same for Output=2 and Output=3.
This is my distance function:
def Euclidean_Dist(df1, df2):
    return np.linalg.norm(df1 - df2)
My variables:
tripletta_for = []
tripletta_tot_wr = []
p_inf = float('inf')
counter = 1
These are the steps used to compute the within-trio distance. I hope they are correct.
'''
i[0] = index
i[1] = treatment prop1
i[1][0] = treatment
i[1][1] = prop
'''
# I want to compute the distance between i[1][1], j[1][1] and k[1][1]
for i in dataset1.iterrows():
    minimum_distance = p_inf
    print(counter)
    counter = counter + 1
    for j in dataset2.iterrows():
        dist12 = Euclidean_Dist(i[1][1], j[1][1])
        for k in dataset3.iterrows():
            dist13 = Euclidean_Dist(i[1][1], k[1][1])
            dist23 = Euclidean_Dist(j[1][1], k[1][1])
            somma = dist12 + dist13 + dist23
            if somma < minimum_distance:
                minimum_distance = somma
                tripletta_for = i[0], j[0], k[0]
                #print(tripletta_for)
    dataset2.drop(index=tripletta_for[1], inplace=True)
    dataset3.drop(tripletta_for[2], inplace=True)
    #print(len(dataset3))
    tripletta_tot_wr.append(tripletta_for)
    #print(tripletta_tot_wr)
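For reference, one possible way to reduce the cost of the nested loops (only a sketch, not the paper's algorithm, and assuming the dataset1/dataset2/dataset3 splits and the var1..var3 columns from above) would be to precompute the pairwise distance matrices with scipy and search them with NumPy broadcasting:
import numpy as np
from scipy.spatial.distance import cdist

cols = ['var1', 'var2', 'var3']
d12 = cdist(dataset1[cols].values, dataset2[cols].values)  # shape (n1, n2)
d13 = cdist(dataset1[cols].values, dataset3[cols].values)  # shape (n1, n3)
d23 = cdist(dataset2[cols].values, dataset3[cols].values)  # shape (n2, n3)

# For a fixed row i of dataset1 the trio perimeter is d12[i, j] + d13[i, k] + d23[j, k];
# broadcasting builds the whole (n2, n3) grid at once instead of two nested loops.
i = 0
perimeters = d12[i][:, None] + d13[i][None, :] + d23
j, k = np.unravel_index(np.argmin(perimeters), perimeters.shape)
print(j, k, perimeters[j, k])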

Related

Generate binary outcome dummy data based on probability of items and its feature

I want to generate synthetic data from scratch that is binary-outcome sequence data (0/1). My data has the following properties:
For the sake of an example, let's say there are only 3 items in the sequence, namely A, B and C.
So the data is:
It is sequence-based data, so items A, B, C occur in an order
Items A, B, C have features S, T, U, V, X, Y, Z, etc. (these features need to have some effect on generating outcome 1; think of them as feature importance)
The probability of conversion when A, B or C is encountered in the data is user defined (I want control such that if A occurs in any part of the sequence, the overall probability of conversion to outcome 1 is, let's say, 2%; more below)
Items can repeat in a sequence, so a sequence can be like C->C->A, etc.
Given the probability of conversion for each item when it occurs in the data (e.g. whenever A is encountered in the sequence the probability of outcome 1 is about 2%, when B occurs it is 2.6%, and so on; just an example), I want to generate data randomly. The generated data should look something like this:
ID Sequence Feature Outcome
1 A->B X 0
2 C->C->B Y 1
3 A->B X 1
4 A Z 0
5 A->B->A Z 0
6 C->C Y 1
and so on
When generating this data, I want to have control over:
The conversion probability of A, B and C: essentially defining that when A occurs the probability of conversion is, let's say, 2%, for B it is 4% and for C it is 3.6%.
The number of converted sequences for each sequence length (for example, sequences can have a max length of 3, and for length-3 sequences I want at least 100000 data points with outcome 1)
Control over how many items I can include (so A, B, C and D: 4 items instead of 3)
The total number of data points, if possible
Is there any simple way to generate this data while keeping all these parameters in mind?
import pandas as pd
import itertools
import numpy as np
import random

alphabets = ['A', 'B', 'C']
combinations = []
for i in range(1, len(alphabets) + 1):
    combinations.append(['->'.join(seq) for seq in itertools.product(alphabets, repeat=i)])
combinations = sum(combinations, [])
weights = np.random.normal(100, 30, len(combinations))
weights /= sum(weights)
weights = weights.tolist()
#weights=np.random.dirichlet(np.ones(len(combinations))*1000.,size=1)
'''n = len(combinations)
weights = [random.random() for _ in range(n)]
sum_weights = sum(weights)
weights = [w/sum_weights for w in weights]'''
df = pd.DataFrame(random.choices(population=combinations, weights=weights, k=1000000),
                  columns=['sequence'])
# -
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
plt.hist(weights, bins = 20)
plt.show()
distribution=df.groupby('sequence').agg({'sequence':'count'}).rename(columns={'sequence':'Total_Numbers'}).reset_index()
plt.hist(distribution.Total_Numbers)
plt.show()
# + tags=[]
from tqdm import tqdm

A = 0.2
B = 0.8
C = 0.1
count_AAA = count_AA = count_A = 0
count_BBB = count_BB = count_B = 0
count_CCC = count_CC = count_C = 0
for i in tqdm(range(0, len(df))):
    if df.sequence[i] == 'A->A->A':
        count_AAA += 1
    if 'A->A' in df.sequence[i]:
        count_AA += 1
    if 'A' in df.sequence[i]:
        count_A += 1
    if df.sequence[i] == 'B->B->B':
        count_BBB += 1
    if 'B->B' in df.sequence[i]:
        count_BB += 1
    if 'B' in df.sequence[i]:
        count_B += 1
    if df.sequence[i] == 'C->C->C':
        count_CCC += 1
    if 'C->C' in df.sequence[i]:
        count_CC += 1
    if 'C' in df.sequence[i]:
        count_C += 1
bi_AAA = np.random.binomial(1, A * 0.9, count_AAA)
bi_AA = np.random.binomial(1, A * 0.5, count_AA)
bi_A = np.random.binomial(1, A * 0.1, count_A)
bi_BBB = np.random.binomial(1, B * 0.9, count_BBB)
bi_BB = np.random.binomial(1, B * 0.5, count_BB)
bi_B = np.random.binomial(1, B * 0.1, count_B)
bi_CCC = np.random.binomial(1, C * 0.9, count_CCC)
bi_CC = np.random.binomial(1, C * 0.5, count_CC)
bi_C = np.random.binomial(1, C * 0.15, count_C)
# -
bi_BBB.sum()/count_BBB
# + tags=[]
AAA = AA = A = BBB = BB = B = CCC = CC = C = 0  # A, B, C are reused here as positional counters
for i in tqdm(range(0, len(df))):
    if df.sequence[i] == 'A->A->A':
        df.at[i, 'Outcome_AAA'] = bi_AAA[AAA]
        AAA += 1
    if 'A->A' in df.sequence[i]:
        df.at[i, 'Outcome_AA'] = bi_AA[AA]
        AA += 1
    if 'A' in df.sequence[i]:
        df.at[i, 'Outcome_A'] = bi_A[A]
        A += 1
    if df.sequence[i] == 'B->B->B':
        df.at[i, 'Outcome_BBB'] = bi_BBB[BBB]
        BBB += 1
    if 'B->B' in df.sequence[i]:
        df.at[i, 'Outcome_BB'] = bi_BB[BB]
        BB += 1
    if 'B' in df.sequence[i]:
        df.at[i, 'Outcome_B'] = bi_B[B]
        B += 1
    if df.sequence[i] == 'C->C->C':
        df.at[i, 'Outcome_CCC'] = bi_CCC[CCC]
        CCC += 1
    if 'C->C' in df.sequence[i]:
        df.at[i, 'Outcome_CC'] = bi_CC[CC]
        CC += 1
    if 'C' in df.sequence[i]:
        df.at[i, 'Outcome_C'] = bi_C[C]
        C += 1
df = df.fillna(0)
df['Outcome'] = df.apply(lambda x: 1 if x.Outcome_AAA + x.Outcome_BBB + x.Outcome_CCC +
                                        x.Outcome_AA + x.Outcome_BB + x.Outcome_CC +
                                        x.Outcome_A + x.Outcome_B + x.Outcome_C > 0 else 0, 1)
dataset=df[['sequence','Outcome']]
Although it may not be the most elegant method, you can achieve this using a for loop. For each row, split that row's Sequence entry into a list of events using .split(). You can find the count of each element using .count(). You can find the length using len(), and the average/total outcome using np.sum() and np.mean(). Try using this code as a starting point:
df['Outcome'] = 0
for i, j in df.iterrows():
    list_of_events = j['Sequence'].split('->')
    # do your calculations on list_of_events here
    print(len(list_of_events))
    print(list_of_events.count("A"))
    my_calculation_for_outcome = list_of_events.count("B") * 0.02
    df.loc[i, 'Outcome'] = my_calculation_for_outcome
You may want to look here for ensuring the Outcome column has a given number of True values: A fast way to find the largest N elements in a numpy array
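For example, if you need a fixed number of rows with Outcome == 1, one option in the spirit of that link is to compute a per-row score and flag only the N largest; the scores below are just random placeholders for whatever per-sequence probability you end up computing:
import numpy as np

N = 100000                                   # desired number of rows with Outcome == 1
scores = np.random.rand(len(df))             # placeholder per-row conversion scores
top_idx = np.argpartition(scores, -N)[-N:]   # positions of the N largest scores

df['Outcome'] = 0
df.iloc[top_idx, df.columns.get_loc('Outcome')] = 1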

adding new pandas df column based on operations row-wise

I have a Dataframe like this:
Interesting genre_1 probabilities
1 no Empty 0.251306
2 yes Empty 0.042043
3 no Alternative 5.871099
4 yes Alternative 5.723896
5 no Blues 0.027028
6 yes Blues 0.120248
7 no Children's 0.207213
8 yes Children's 0.426679
9 no Classical 0.306316
10 yes Classical 1.044135
I would like to compute the Gini index within the same category based on the Interesting column. After that, I would like to add that value in a new pandas column.
This is the function to get the Gini index:
#Gini Function
#a and b are the quantities of each class
def gini(a, b):
    a1 = (a/(a+b))**2
    b1 = (b/(a+b))**2
    return 1 - (a1 + b1)
EDIT: Sorry, I had an error in my final desired dataframe. Being interesting or not matters when it comes to choosing prob(A) and prob(B), but the Gini score will be the same, because it measures how much impurity we get when classifying a song as interesting or not. So if the probabilities are around 50/50, the Gini score will reach its maximum (0.5), because it is equally likely to be mistaken when choosing interesting or not.
So for the first two rows, the Gini index will be:
a=no; b=Empty -> gini(0.251306, 0.042043)= 0.245559831601612
a=yes; b=Empty -> gini(0.042043, 0.251306)= 0.245559831601612
Then I would like to get something like:
Interesting genre_1 probabilities GINI INDEX
1 no Empty 0.251306 0.245559831601612
2 yes Empty 0.042043 0.245559831601612
3 no Alternative 5.871099 0.4999194135183881
4 yes Alternative 5.723896 0.4999194135183881
5 no Blues 0.027028 ..
6 yes Blues 0.120248
7 no Children's 0.207213
8 yes Children's 0.426679
9 no Classical 0.306316 ..
10 yes Classical 1.044135 ..
Ok, I think I know what you mean. The code below does not care whether the Interesting value is 'yes' or 'no'. But what you want is to calculate the Gini coefficient in two different ways for each row, based on the Interesting value of that row. So if Interesting == 'no', the result is 0.5, because a == b. But if Interesting is 'yes', you need to use a = probabilities[i] and b = probabilities[i+1]. So skip this section and see the updated code below.
import pandas as pd

df = pd.read_csv('df.txt', delim_whitespace=True)
probs = df['probabilities']

def ROLLING_GINI(probabilities):
    a1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    b1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    res = 1 - (a1 + b1)
    yield res
    for i in range(len(probabilities)-1):
        a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
        b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
        res = 1 - (a1 + b1)
        yield res

df['GINI'] = [val for val in ROLLING_GINI(probs)]
print(df)
This is where the real trouble starts, because if I understand your idea correctly, you cannot calculate the last Gini value; your dataframe won't allow it. The important bit is that the last Interesting value in your dataframe is 'yes'. That means I have to use a = probabilities[i] and b = probabilities[i+1], but your dataframe doesn't have a row 11: you have 10 rows, and on row i == 10 you would need a probability in row 11 to calculate a Gini coefficient. So for your idea to work, the last Interesting value MUST be 'no'; otherwise you will always get an index error.
Here's the code anyway:
import pandas as pd

df = pd.read_csv('df.txt', delim_whitespace=True)

def ROLLING_GINI(dataframe):
    probabilities = dataframe['probabilities']
    how_to_calculate = dataframe['Interesting']
    for i in range(len(dataframe)-1):
        if how_to_calculate[i] == 'yes':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
            b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
            res = 1 - (a1 + b1)
            yield res
        elif how_to_calculate[i] == 'no':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res

GINI = [val for val in ROLLING_GINI(df)]
print('All GINI coefficients: %s' % GINI)
print('Length of all calculatable GINI coefficients: %s' % len(GINI))
print('Number of rows in the dataframe: %s' % len(df))
print('The last Interesting value is: %s' % df.iloc[-1, 0])
EDIT NUMBER THREE (sorry for the late realization):
It does work if I apply the indexing correctly. The problem was that I was reaching for the next probability instead of the previous one. So it's a = probabilities[i-1] and b = probabilities[i]:
import pandas as pd

df = pd.read_csv('df.txt', delim_whitespace=True)

def ROLLING_GINI(dataframe):
    probabilities = dataframe['probabilities']
    how_to_calculate = dataframe['Interesting']
    for i in range(len(dataframe)):
        if how_to_calculate[i] == 'yes':
            a1 = (probabilities[i-1]/(probabilities[i-1]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i-1]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res
        elif how_to_calculate[i] == 'no':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res

GINI = [val for val in ROLLING_GINI(df)]
print('All GINI coefficients: %s' % GINI)
print('Length of all calculatable GINI coefficients: %s' % len(GINI))
print('Number of rows in the dataframe: %s' % len(df))
print('The last Interesting value is: %s' % df.iloc[-1, 0])
I am not sure how the Interesting column plays into all of this, but I highly recommend that you make the new column by using numpy.where(). The syntax would be something like:
import numpy as np
df['GINI INDEX'] = np.where(__condition__,__what to do if true__,__what to do if false__)
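If each genre_1 value has exactly one 'yes' and one 'no' row, as in your example, a sketch that reuses your gini() function and broadcasts the per-genre value back to both rows could look like this (my assumption about the layout, not a tested answer):
def genre_gini(probs):
    # assumes exactly two rows per genre_1 group: one 'yes' and one 'no'
    a, b = probs.iloc[0], probs.iloc[1]
    return gini(a, b)

df['GINI INDEX'] = df.groupby('genre_1')['probabilities'].transform(genre_gini)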

Optimizing grid connections with python

I have the following situation:
(1) I have a large grid. By some conditions I want to further observe specific points/cells in this grid. Each cell has an ID and coordinates X, Y separately. So in this case let's observe one cell only, marked C on the image, which is located on the edge of the grid. By some formula I can get all the neighbouring cells of the first order (marked 1 on the image) and the second order (marked 2 on the image).
(2) With a further condition I identify some cells among the neighbouring cells; they are marked in orange on the second image. What I want to do is connect all orange cells with each other by optimizing the distances and taking into account only min() distances. My first attempt was to consider cells only by calculating the distances to cells of the lower order, so when looking at cells in neighbour order 2, I'm looking at the cells in order 1 only. The solution of connections is presented in image 2, but it's not optimal, since the ideal solution would compare the distances of all cells, not only the cells of the lower neighbour order. By doing that, I'm getting the situation presented in image 3. And the problem is that the cells are then of course not connected to the centre. What to do?
The current code is:
CO - list of centre points.
data - dataframe of all IDs with X, Y values
CO_list = CO['ID'].tolist()

neighbor100 = []
for p in CO_list:
    d = get_neighbors100k2(p, len(data))  # function that finds the IDs of neighbours of the first order
    neighbor100.append(d)
neighbor200 = []
for p in CO_list:
    d = get_neighbors200k2(p, len(data))  # function that finds the IDs of neighbours of the second order
    neighbor200.append(d)

flat100 = []
for i in neighbor100:
    for j in i:
        flat100.append(j)
flat200 = []
for i in neighbor200:
    for j in i:
        flat200.append(j)
neighbors100 = flat100
neighbors200 = flat200

data_sosedi100 = data.iloc[flat100,].reset_index(drop=True)
data_sosedi200 = data.iloc[flat200,].reset_index(drop=True)

dist200 = []
for b in flat200:
    d = (pd.DataFrame(((data_sosedi100['X'] - data.iloc[b,]['X'])**2
                       + (data_sosedi100['Y'] - data.iloc[b,]['Y'])**2)**0.5)).sum(1)
    dist200.append(d.min())
data_sosedi200['dist'] = dist200
data_sosedi200['id'] = None
for e in CO_list:
    data_sosedi200.loc[data_sosedi200['FID_2'].isin(get_neighbors200k2(e, len(data))), 'id'] = e
Do you have any suggestions on how to optimize this a bit further? I hope I presented the whole picture. If needed, I'll clarify further. If you see a part of the code where I could further optimize this loop, I'd be very grateful!
I defined the points manually to work with:
import numpy as np
from operator import itemgetter, attrgetter

nodes = [[-2,1], [-2,0], [-1,0], [0,0], [1,1], [2,1], [2,0], [1,2], [2,2]]
center = [0,0]

def find_neighbor(node):
    n = []
    for i in range(-1, 2):
        for j in range(-1, 2):
            if not (i == 0 and j == 0):
                n.append([node[0]+i, node[1]+j])
    return [N for N in n if N in nodes]

def distance_to_center(node):
    return np.sqrt(node[0]**2 + node[1]**2)

def distance_between_two_nodes(node1, node2):
    return np.sqrt((node1[0]-node2[0])**2 + (node1[1]-node2[1])**2)

def next_node_closest_to_center(node):
    min = distance_to_center(node)
    next_node = node
    for n in find_neighbor(node):
        if distance_to_center(n) < min:
            min = distance_to_center(n)
            next_node = n
    return next_node, min

def get_path_to_center(node):
    node_path = [node]
    distance = 0.
    while node != center:
        new_node = next_node_closest_to_center(node)[0]
        distance += distance_between_two_nodes(node, new_node)
        node_path.append(new_node)
        node = new_node
    return node_path, distance

def furthest_nodes_from_center(nodes):
    max = 0.
    for n in nodes:
        if get_path_to_center(n)[1] > max:
            furthest_nodes_pathwise = []
            max = get_path_to_center(n)[1]
            furthest_nodes_pathwise.append(n)
        elif get_path_to_center(n)[1] == max:
            furthest_nodes_pathwise.append(n)
    return furthest_nodes_pathwise

def farthest_node_from_center(nodes):
    max = 0.
    farthest_node = center
    for n in nodes:
        if distance_to_center(n) > max:
            max = distance_to_center(n)
            farthest_node = n
    return farthest_node

def closest_node_to_center(nodes):
    min = distance_to_center(farthest_node_from_center(nodes))
    closest_node = farthest_node_from_center(nodes)
    for n in nodes:
        if distance_to_center(n) < min:
            min = distance_to_center(n)
            closest_node = n
    return closest_node

def closest_node_center_with_furthest_distance(node_selection):
    if len(node_selection) == 1:
        return node_selection[0]
    else:
        return closest_node_to_center(node_selection)

print(closest_node_center_with_furthest_distance(furthest_nodes_from_center(nodes)))
Output:
[2, 0]
[Finished in 0.266s]
By running it on all nodes I can now determine that the furthest node path-wise, but still closest to the center distance-wise, is [2,0] and not [2,2]. So we start from there. To find the one on the other side, just split the data like I said into negative and positive x values; if you run it over a list of only the negative-x cells you will get [-2,1].
Now that you have your 2 starting cells [2,0] and [-2,1], I will leave you to figure out the algorithm to navigate to the center passing by all cells, using the steps in my comments (you can now skip step 1 because this is the answer posted).
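As a small usage sketch of the helpers above (assuming the nodes and center defined earlier), you can already trace the greedy path from each starting cell to the centre:
# Trace the path of each starting cell towards the centre and print its total length.
for start in ([2, 0], [-2, 1]):
    path, total = get_path_to_center(start)
    print(start, '->', path, 'total distance:', round(total, 3))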

Python- Selecting pairs of objects from a data frame

I have a data frame that contains information about the positions of various objects, and a unique index for each object (index in this case is not related to the data frame). Here is some example data:
ind pos
x y z
-1.0 7.0 0.0 21 [-2.76788330078, 217.786453247, 26.6822681427]
0.0 22 [-7.23852539062, 217.274139404, 26.6758270264]
0.0 1.0 152 [-0.868591308594, 2.48404550552, 48.4036369324]
6.0 2.0 427 [-0.304443359375, 182.772140503, 79.4475860596]
The actual data frame is quite long. I have written a function that takes two vectors as inputs and outputs the distance between them:
import numpy as N

def dist(a, b):
    diff = N.array(a) - N.array(b)
    d = N.sqrt(N.dot(diff, diff))
    return d
and a function that, given two arrays, will output all the unique combinations of elements between these arrays:
def getPairs(a, b):
    if N.array_equal(a, b):
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(i+1, len(b))]
    else:
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(len(b))]
    return pairs
I want to take my data frame and find all the pairs of elements whose distance is less than some value, say 30. For the pairs that meet this requirement, I also need to store the distance I calculated in some other data frame. Here is my attempt at solving this, but it turned out to be extremely slow.
pairs = [getPairs(list(group.ind), list(boxes.get_group((name[0]+i, name[1]+j, name[2]+k)).ind)) \
         for i in [0,1] for j in [0,1] for k in [0,1] if name[0]+i != 34 and name[1]+j != 34 and name[2]+k != 34]
pairs = list(itertools.chain(*pairs))
subInfo = pandas.DataFrame()
subInfo['pairs'] = pairs
subInfo['r'] = subInfo.pairs.apply(lambda x: dist(df_yz.query('ind == @x[0]').pos[0], df_yz.query('ind == @x[1]').pos[0]))
Don't worry about what I'm iterating over in this for loop; it works for the system I'm dealing with and isn't where I'm getting slowed down. The step where I use .query() is where the major jam happens.
The output I am looking for is something like:
pair distance
(21, 22) 22.59
(21, 152) 15.01
(22, 427) 19.22
I made the distances up, and the pair list would be much longer, but that's the basic idea.
Took me a while, but here are three possible solutions. Hope they are self-explanatory. Written in Python 3.x in a Jupyter Notebook. One remark: if your coordinates are world coordinates, you may want to use the Haversine distance (great-circle distance) instead of the Euclidean distance, which is a straight line.
First, create your data
import pandas as pd
import numpy as np

values = [
    { 'x':-1.0, 'y':7.0, 'z':0.0, 'ind':21, 'pos':[-2.76788330078, 217.786453247, 26.6822681427] },
    { 'z':0.0, 'ind':22, 'pos':[-7.23852539062, 217.274139404, 26.6758270264] },
    { 'y':0.0, 'z':1.0, 'ind':152, 'pos':[-0.868591308594, 2.48404550552, 48.4036369324] },
    { 'y':6.0, 'z':2.0, 'ind':427, 'pos':[-0.304443359375, 182.772140503, 79.4475860596] }
]

def dist(a, b):
    """
    Calculates the Euclidean distance between two 3D-vectors.
    """
    diff = np.array(a) - np.array(b)
    d = np.sqrt(np.dot(diff, diff))
    return d

df_initial = pd.DataFrame(values)
The following three solutions will generate this output:
pairs distance
1 (21, 22) 4.499905
3 (21, 427) 63.373886
7 (22, 427) 63.429709
The first solution is based on a full join of the data with itself. The downside is that it may exceed your memory if the dataset is huge. The advantages are the easy readability of the code and that it only uses Pandas:
#%%time
df = df_initial.copy()
# join data with itself, each line will contain two geo-positions
df['tmp'] = 1
df = df.merge(df, on='tmp', suffixes=['1', '2']).drop('tmp', axis=1)
# remove rows with similar index
df = df[df['ind1'] != df['ind2']]
# calculate distance for all
df['distance'] = df.apply(lambda row: dist(row['pos1'], row['pos2']), axis=1)
# filter only those within a specific distance
df = df[df['distance'] < 70]
# combine original indices into a tuple
df['pairs'] = list(zip(df['ind1'], df['ind2']))
# select columns of interest
df = df[['pairs', 'distance']]
def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y
# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
The second solution tries to avoid the memory issue of the first version by iterating over the original data line by line and calculating the distance between the current line and the original data, keeping only values that satisfy the distance constraint. I was expecting bad performance, but it wasn't bad at all (see the summary at the end).
#%%time
df = df_initial.copy()

results = list()
for index, row1 in df.iterrows():
    # calculate distance between current coordinate and all original rows in the data
    df['distance'] = df.apply(lambda row2: dist(row1['pos'], row2['pos']), axis=1)
    # filter only those within a specific distance and drop rows with same index as current coordinate
    df_tmp = df[(df['distance'] < 70) & (df['ind'] != row1['ind'])].copy()
    # prepare final data
    df_tmp['ind2'] = row1['ind']
    df_tmp['pairs'] = list(zip(df_tmp['ind'], df_tmp['ind2']))
    # remember data
    results.append(df_tmp)

# combine all into one dataframe
df = pd.concat(results)
# select columns of interest
df = df[['pairs', 'distance']]

def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y

# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
The third solution is based on spatial operations using the KDTree from Scipy.
#%%time
from scipy import spatial

tree = spatial.KDTree(list(df_initial['pos']))
# calculate distances (returns a sparse matrix)
distances = tree.sparse_distance_matrix(tree, max_distance=70)
# convert to a Coordinate (coo) representation of the Compressed-Sparse-Column (csc) matrix.
coo = distances.tocoo(copy=False)

def get_cell_value(idx: int, column: str = 'ind'):
    return df_initial.iloc[idx][column]

def extract_indices(row):
    idx1, idx2, distance = row
    return get_cell_value(int(idx1)), get_cell_value(int(idx2))

df = pd.DataFrame({'idx1': coo.row, 'idx2': coo.col, 'distance': coo.data})
df['pairs'] = df.apply(extract_indices, axis=1)
# select columns of interest
df = df[['pairs', 'distance']]

def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y

# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)
# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)
# print result
df
So what about performance? If you just want to know which rows of your original data are within the desired distance, then the KDTree version (third version) is super fast: it took just 4 ms to generate the sparse matrix. But since I then used the indices from that matrix to extract the data from the original data, the performance dropped. Of course this should be tested on your full dataset.
version 1: 93.4 ms
version 2: 42.2 ms
version 3: 52.3 ms (4 ms)
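If you only need the pairs within the threshold (and not the whole sparse matrix), KDTree.query_pairs might be an even leaner variant of version 3; this is just a sketch I did not benchmark here, reusing df_initial and dist from above:
from scipy import spatial

tree = spatial.KDTree(list(df_initial['pos']))
# set of (i, j) positional index pairs whose distance is below the threshold
pairs = tree.query_pairs(r=70)

rows = [(tuple(sorted((df_initial.iloc[i]['ind'], df_initial.iloc[j]['ind']))),
         dist(df_initial.iloc[i]['pos'], df_initial.iloc[j]['pos']))
        for i, j in pairs]
df_pairs = pd.DataFrame(rows, columns=['pairs', 'distance'])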

python genetic algorithm found optimum solution

Let me explain: I am trying to develop a program to optimize a system based on the parameters it receives. My program will have to vary these parameters to try to find the best possible combination.
Here is some code to simplify my problem:
parameters=[["toto1","toto2","toto3"],["tutu1","tutu2","tutu3"],["titi1","titi2","titi3"],["tata1","tata2","tata3"]]
def MySysteme(param1,param2,param3,param4):
result=0
for i in range(0,len(param1)):
result+=ord(param1[i])
for i in range(0,len(param2)):
result+=ord(param1[i])
for i in range(0,len(param3)):
result+=ord(param1[i])
for i in range(0,len(param4)):
result+=ord(param1[i])
return result
print(MySysteme(parameters[0][0],parameters[1][2],parameters[2][2],parameters[3][0]))
print(MySysteme(parameters[1][0],parameters[2][2],parameters[3][2],parameters[0][0]))
print(MySysteme(parameters[3][1],parameters[1][2],parameters[2][2],parameters[0][0]))
#how to find the highest value?
I am trying to find the highest number without naively testing all the parameter combinations, hence the use of a genetic algorithm. One parameter is a list contained in the list parameters; the contents of that list are the variants of my parameter.
Note that in my function / my system, the same parameter should not appear twice; for example, this should not happen: print(MySysteme(parameters[1][0], parameters[1][0])) or this: print(MySysteme(parameters[2][1], parameters[2][0])).
On the other hand, the number of parameters is between 1 and 4 (there can be 1, 2, 3 or 4 parameters).
To solve my problem, here is how I map the data:
Individual: a parameter variant, which carries a name ("toto1", "tata3", "toto2 = 12", etc.)
Population: the set of the parameter variants
Fitness: the result of the function given the parameters
A circuit: a set of parameters
But unlike the travelling salesman, I have no starting data, i.e. I do not have GPS coordinates, and this is where I am stuck in solving my problem.
Can anyone help me?
edit:
I have been looking at some examples of how I could find the points at which a function achieves its maximum using a genetic algorithm approach in Python. I looked at this tutorial:
https://lethain.com/genetic-algorithms-cool-name-damn-simple/
My objective is to find the smallest number for the "MySysteme" function.
I wrote new code:
Let me re-explain my problem more simply. I have put together more complete, clearer code with a genetic algorithm.
from random import randint, random
from operator import add
from functools import reduce

parameters = [["toto123","toto27","toto3000"], ["tu","tut","tutu378694245"], ["t","choicezaert","titi3=78965"], ["blabla","2","conjoncture_is_enable"]]

def individual(length, min, max):
    return [randint(min, max) for x in range(length)]

def population(count, length, min, max):
    return [individual(length, min, max) for x in range(count)]

def fitness(individual, target):
    sum = reduce(add, individual, 0)
    return abs(target - sum)

def grade(pop, target):
    individu_number_parameters = randint(1, len(parameters)-1)
    for j in range(0, individu_number_parameters):
        position = randint(1, len(parameters)-1)
        parameter = parameters[position]
        if isinstance(parameter, list):
            parameter = parameters[position][randint(1, len(parameters[position])-1)]
    result = 0
    for i in range(0, len(parameter)):
        result += ord(parameter[i])
    return result

def evolve(pop, target, retain=0.2, random_select=0.05, mutate=0.01):
    graded = [(fitness(x, target), x) for x in pop]
    graded = [x[1] for x in sorted(graded)]
    retain_length = int(len(graded)*retain)
    parents = graded[:retain_length]
    for individual in graded[retain_length:]:
        if random_select > random():
            parents.append(individual)
    for individual in parents:
        if mutate > random():
            pos_to_mutate = randint(0, len(individual)-1)
            individual[pos_to_mutate] = randint(
                min(individual), max(individual))
    parents_length = len(parents)
    desired_length = len(pop) - parents_length
    children = []
    while len(children) < desired_length:
        male = randint(0, parents_length-1)
        female = randint(0, parents_length-1)
        if male != female:
            male = parents[male]
            female = parents[female]
            half = int(len(male) / 2)
            child = male[:half] + female[half:]
            children.append(child)
    parents.extend(children)
    return parents

target = 0
p_count = 100
i_length = 6
i_min = 0
i_max = 100
p = population(p_count, i_length, i_min, i_max)
fitness_history = [grade(p, target), ]
for i in range(1000):
    p = evolve(p, target)
    fitness_history.append(grade(p, target))

for datum in fitness_history:
    print(datum)
print(len(fitness_history))
I updated with new code. What I want: my program should find the smallest number.
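It might help to make the encoding explicit. As a sketch of one possible encoding (an assumption on my part, not your existing code): an individual picks between 1 and 4 distinct parameter slots and one variant per chosen slot, which automatically respects the "no parameter twice" constraint, and its score is the same character-code sum that MySysteme computes:
import random

def random_individual(parameters):
    # choose 1 to len(parameters) distinct slots, one variant per slot
    n = random.randint(1, len(parameters))
    slots = random.sample(range(len(parameters)), n)
    return [(s, random.randrange(len(parameters[s]))) for s in slots]

def individual_score(individual, parameters):
    # sum of the character codes of the chosen variants (same idea as MySysteme)
    return sum(ord(c) for s, v in individual for c in parameters[s][v])

ind = random_individual(parameters)
print(ind, individual_score(ind, parameters))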
