To analyze the text, we transform it into a list P1 of words. Then we apply the Bigram methods and get a list X of couples of words (ai,bi) such that ai and bi occur one after another in P1 quite a lot of times. How to get in Python 3 a list P2 from P1 so that every two items ai and bi if they go one after another in P1 and (ai,bi ) from X would be replaced by one element ai_bi?
My ultimate goal is to prepare the text as a list of words for analysis in Word2Vec.
I have my own code and it works but I think it will be slow on big texts.
import nltk
from nltk.collocations import *
import re
import gensim
bigram_measures = nltk.collocations.BigramAssocMeasures()
sentences=["Total internal reflection ! is the;phenomenon",
"Abrasive flow machining :is an ? ( interior surface finishing process)",
"Technical Data[of Electrical Discharge wire cutting and] Cutting Machine",
"The greenhouse effect. is the process by which, radiation from a {planet atmosphere warms }the planet surface",
"Absolute zero!is the lowest limit ;of the thermodynamic temperature scale:",
"The term greenhouse effect ?is mentioned (a lot)",
"[An interesting] effect known as total internal reflection.",
"effect on impact energies ,Electrical discharge wire cutting of ADI",
"{Absolute zero represents} the coldest possible temperature",
"total internal reflection at an air water interface",
"What is Electrical Discharge wire cutting Machining and how does it work",
"Colder than Absolute Zero",
"A Mathematical Model for Electrical Discharge Wire Cutting Machine Parameters"]
P1=[]
for f in sentences:
f1=gensim.utils.simple_preprocess (f.lower())
P1.extend(f1)
print("First 100 items from P1")
print(P1[:100])
# bigram
finder = BigramCollocationFinder.from_words(P1)
# filter only bigrams that appear 2+ times
finder.apply_freq_filter(2)
# return the all bi-grams with the highest PMI
X=finder.nbest(bigram_measures.pmi, 10000)
print()
print("Number of bigrams= ",len(X))
print("10 first bigrams with the highest PMI")
print(X[:10])
# replace ai and bi which are one after another in P1 and (ai,bi) in X =>> with ai_bi
P2=[]
n=len(P1)
i=0
while i<n:
P2.append(P1[i])
if i<n-2:
for c in X:
if c[0]==P1[i] and c[1]==P1[i+1]:
P2[len(P2)-1]=c[0]+"_"+c[1]
i+=1 # skip second item of couple from X
break
i+=1
print()
print( "first 50 items from P2 - results")
print(P2[:50])
I guess you are looking for something like this.
P2 = []
prev = P1[0]
for this in P1[1:]:
P2.append(prev + "_" + this)
prev = this
This implements a simple sliding window where the previous token is pasted next to the current token.
Related
Inspired by some projects, I have decided to work on a calculator project based on Python.
Essentially, I have 5 teams in a fantasy league, with points assigned to these teams based on their current standings. Teams A-E.
Assuming the league has 10 more matches to be played, my main aim is to calculate the probability that a team makes it to the top 3 in their league given the matches have a 33.3% of either going:
A win to the team (which adds 2 points to the winning team)
A lose to the team (which awards 0 points to the losing team)
A draw (which awards 1 point to both teams)
This also in turn means there will be 3^10 outcomes for the 10 matches to be played.
For each of these 3^10 scenarios, I will also compute how the final points table will look, and from there, I will be able to sort and figure out which are the top 3 teams in the fantasy league.
I've worked halfway through the project, as shown:
Points = { "A":12, "B":14, "C":8, "D":12, "E":6} #The current standings
RemainingMatches = [
A:B
B:D
C:E
A:E
D:C
B:D
C:D
A:E
C:E
D:C
]
n=len(RemainingMatches) # Number of matches left
combinations = pow(3,n) # Number of possible scenarios left assumes each game has 3 outcomes
print( "Number of remaining matches = %d" % n )
print( "Number of possible scenarios = %d" % combinations )
for i in range(0,combinations)
...
for i in range(0,n)
I am currently wondering how do I get these possible combinations to match a certain scenario? For example, when i = 0, it points to the first matchup where A wins, B losses. Hence, Points[A] = Points[A] + 2 etc. I know there will be a nested loop since I have to consider the other matches too. But how do I exactly map each scenario, and then nest it?
Apologies if I am being unclear here, struggling with this part at the moment.
Thinking Process:
3 outcomes per game.
for i to combinations:
#When i =1, means A win/B lost?
#When i =2, means B win/A lost?
#When i =3, means A/B drew?
for i to n:
#Go to next match?
Not exactly sure what is the logic flow for this kind of scenario. Thank you.
Here is a different way to write the code. If you knew in advance the outcome of each of the ten remaining games, you could compute which teams would finish in top three. This is done with the play_out function, which takes current standings, remaining games, and the known outcomes of future games.
Now all that remains is to loop over all possible future outcomes and tally the winning probabilities. This is done in the simulate_league function that takes in current standings and remaining games, and returns a probability that a given team finishes in top 3.
There may be situations where two teams are tied for the third place. In cases like this, the code below allows for four teams or more to be in "top 3". To use a different rule, you can change the top3scores = nlargest(3, set(pts.values())) line.
In terms of useful Python functions, itertools.product is used to generate all possible win/draw/loss outcomes, and heapq.nlargest is used to find the largest 3 scores out of a bunch. The collections.Counter class is used to count the number of possibilities in which a given team finishes in top 3.
from itertools import product
from heapq import nlargest
from collections import Counter
Points = {"A":12, "B":14, "C":8, "D":12, "E":6} # The current standings
RemainingMatches = ["A:B","B:D","C:E","A:E","D:C","B:D","C:D","A:E","C:E","D:C"]
# reformat remaining matches
RemainingMatches = [tuple(s.split(":")) for s in RemainingMatches]
def play_out(points, remaining_matches, winloss_schedule):
pts = points.copy()
# simulate remaining games given predetermine win/loss/draw outcomes
# given by winloss_schedule
for (team_a, team_b), outcome in zip(remaining_matches, winloss_schedule):
if outcome == -1:
pts[team_a] += 2
elif outcome == 0:
pts[team_a] += 1
pts[team_b] += 1
else:
pts[team_b] += 2
# top 3 scores (allows for ties)
top3scores = nlargest(3, set(pts.values()))
return {team: score in top3scores for team, score in pts.items()}
def simulate_league(points, remaining_matches):
top3counter = Counter()
for winloss_schedule in product([-1, 0, 1], repeat=len(remaining_matches)):
top3counter += play_out(points, remaining_matches, winloss_schedule)
total_possibilities = 3 ** len(remaining_matches)
return {team: top3count / total_possibilities
for team, top3count in top3counter.items()}
# simulate_league(Points, RemainingMatches)
# {'A': 0.9293637487510372,
# 'B': 0.9962573455943369,
# 'C': 0.5461057765584515,
# 'D': 0.975088485833799,
# 'E': 0.15439719554945894}
I have a list of 12 cities connected to each other without exception. The only thing of concern is travel time. The name of each city is here. The distance matrix (representing travel time in minutes) between city pairs is here.
How can I find out how many cities I can visited given a certain travel budget (say 800 minutes) from a city of origin (it can be any of the 12).
You can't visit the same city twice during the trip and you don't need to worry about returning to your origin. I can't go above my travel budget.
import numpy as np
from scipy.spatial import distance_matrix
from sklearn.cluster import AgglomerativeClustering
def find_cities(dist, budget): # dist: a 12x12 matrix of travel time in minutes between city pairs; budget: max travel time allowed for the trip (in mins)
assert len(dist) == 12 # there are 12 cities to visit and each one has a pairwise cost with all other 11 citis
clusters = [] # list of cluster labels from 1..n where n is no of cities to be visited
dists = [0] + [row[1:] for row in dist] # start-to-start costs have been excluded from the distances array which only contains intercity distances
linkage = 'complete' # complete linkage used here because we want an optimal solution i.e., finding minimum number of clusters required
ac = AgglomerativeClustering(affinity='precomputed', linkage=linkage, compute_full_tree=True) # affinity must be precomputed or function otherwise it will use euclidean distance by default !!!
# compute full tree ensures that I get all possible clustesr even if they don't make up entire population! This is needed so that I can determine how many clusters need to be created given my budget constraints below
Z = ac.fit_predict(dists).tolist() # AgglomerativeClustering.fit_predict returns list of cluster labels for each city
while budget >= min(dists): # while my budget is greater than the minimum intercity travel cost, i.e., I can still visit another city
if len(set(Z)) > 1: # at least 2 clusters are needed to form a valid tour so continue only when there are more than one cluster left in Z
c1 = np.argmax([sum([i==j for j in Z]) for i in set(Z)]) # find which clustes has max no of cities associated with it and that will be the next destination (cities within this cluster have same label as their parent cluster!) # numpy argmax returns index of maximum value along an axis; here we want to know which group has most elements!
c2 = [j for j,val in enumerate(Z) if val == Z[c1]][0] # retrieve first element from the group whose parent is 'cluster' returned by previous line
clusters += [c2 + 1] ## add new destination found into our trip plan/list "clusters" after converting its city id back into integer starting from 1 instead of 0 like array indices do!!
dists += [dist[c1][c2]] ## update total distance travelled so far based on newly added destination ... note: distances between two adjacent cities always equals zero because they fall under same cluster
budget -= dists[-1] ## update travel budget by subtracting the cost of newly added destination from our total budget
else: break # when there is only one city left in Z, then stop! it's either a single city or two cities are connected which means cost between them will always be zero!
return clusters # this is a list of integers where each integer represents the id of city that needs to be visited next!
def main():
with open('uk12_dist.txt','r') as f: ## read travel time matrix between cities from file ## note: 'with' keyword ensures file will be closed automatically after reading or writing operation done within its block!!!
dist = [[int(num) for num in line.split()] for line in f] ## convert text data into array/list of lists using list comprehension; also ensure all data are converted into int before use!
with open('uk12_name.txt','r') as f: ## read names of 12 cities from file ## note: 'with' keyword ensures file will be closed automatically after reading or writing operation done within its block!!!
name = [line[:-1].lower().replace(" ","") for line in f] ## remove newline character and any leading/trailing spaces, then convert all characters to lowercase; also make sure there's no space between first and last name (which results in empty string!) otherwise won't match later when searching distances!!
budget = 800 # max travel budget allowed (in mins) i.e., 8 hrs travelling at 60 mins per km which means I can cover about 800 kms on a full tank!
print(find_cities(dist,budget), "\n") ## print(out list of city ids to visit next!
print("Total distance travelled: ", sum(dist[i][j] for i, j in enumerate([0]+find_cities(dist,budget))), "\n" ) # calculate total cost/distance travelled so far by adding up all distances between cities visited so far - note index '0' has been added at start because 0-2 is same as 2-0 and it's not included in find_cities() output above !
while True:
try: ## this ensures that correct input from user will be obtained only when required!!
budget = int(raw_input("\nEnter your travel budget (in minutes): ")) # get new travel budget from user and convert into integer before use!!!
if budget <= 800: break ## stop asking for valid input only when the value entered by user isn't greater than 800 mins or 8 hrs !!
except ValueError: ## catch exception raised due to invalid data type; continue asking until a valid number is given by user!!
pass
print(name[find_cities(dist,budget)[1]],"->",name[find_cities(dist,budget)[2]],"-> ...",name[find_cities(dist,budget)[-1]] )## print out the city names of cities to visit next!
return None
if __name__ == '__main__': main()
Goal -
I am trying to implement a genetic algorithm to optimise the fitness of a
species of creatures in a simulated two-dimensional world. The world contains edible foods, placed at random, and a population of monsters (your basic zombies). I need the algorithm to find behaviours that keep the creatures well fed and not dead.
What i have done -
So i start off by generating a 11x9 2d array in numpy, this is filled with random floats between 0 and 1. I then use np.matmul to go through each row of the array and multiply all of the random weights by all of the percepts (w1+p1*w2+p2....w9+p9) = a1.
This first generation is run and I then evaluate the fitness of each creature using (energy + (time of death * 100)). From this I build a list of creatures who performed above the average fitness. I then take the best of these "elite" creatures and put them back into the next population. For the remaining space I use a crossover function which takes two randomly selected "elite" creatures and mixes their genes. I have tested two different crossover functions one which does a two point crossover on each row and one which takes a row from each parent until the new child has a complete chromosome. My issue is that the creatures just don't really seem to be learning, at 75 turns I will only get 1 survivor every so often.
I am fully aware this might not be enough to go off but I am truly stuck on this and cannot figure out how to get these creatures to learn even though I think I am implementing the correct procedures. Occasionally I will get a 3-4 survivors rather than 1 or 2 but it appears to occur completely randomly, doesn't seem like there is much learning happening.
Below is the main section of code, it includes everything I have done but none of the provided code for the simulation
#!/usr/bin/env python
from cosc343world import Creature, World
import numpy as np
import time
import matplotlib.pyplot as plt
import random
import itertools
# You can change this number to specify how many generations creatures are going to evolve over.
numGenerations = 2000
# You can change this number to specify how many turns there are in the simulation of the world for a given generation.
numTurns = 75
# You can change this number to change the world type. You have two choices - world 1 or 2 (described in
# the assignment 2 pdf document).
worldType=2
# You can change this number to modify the world size.
gridSize=24
# You can set this mode to True to have the same initial conditions for each simulation in each generation - good
# for development, when you want to have some determinism in how the world runs from generation to generation.
repeatableMode=False
# This is a class implementing you creature a.k.a MyCreature. It extends the basic Creature, which provides the
# basic functionality of the creature for the world simulation. Your job is to implement the AgentFunction
# that controls creature's behaviour by producing actions in response to percepts.
class MyCreature(Creature):
# Initialisation function. This is where your creature
# should be initialised with a chromosome in a random state. You need to decide the format of your
# chromosome and the model that it's going to parametrise.
#
# Input: numPercepts - the size of the percepts list that the creature will receive in each turn
# numActions - the size of the actions list that the creature must create on each turn
def __init__(self, numPercepts, numActions):
# Place your initialisation code here. Ideally this should set up the creature's chromosome
# and set it to some random state.
#self.chromosome = np.random.uniform(0, 10, size=numActions)
self.chromosome = np.random.rand(11,9)
self.fitness = 0
#print(self.chromosome[1][1].size)
# Do not remove this line at the end - it calls the constructors of the parent class.
Creature.__init__(self)
# This is the implementation of the agent function, which will be invoked on every turn of the simulation,
# giving your creature a chance to perform an action. You need to implement a model here that takes its parameters
# from the chromosome and produces a set of actions from the provided percepts.
#
# Input: percepts - a list of percepts
# numAction - the size of the actions list that needs to be returned
def AgentFunction(self, percepts, numActions):
# At the moment the percepts are ignored and the actions is a list of random numbers. You need to
# replace this with some model that maps percepts to actions. The model
# should be parametrised by the chromosome.
#actions = np.random.uniform(0, 0, size=numActions)
actions = np.matmul(self.chromosome, percepts)
return actions.tolist()
# This function is called after every simulation, passing a list of the old population of creatures, whose fitness
# you need to evaluate and whose chromosomes you can use to create new creatures.
#
# Input: old_population - list of objects of MyCreature type that participated in the last simulation. You
# can query the state of the creatures by using some built-in methods as well as any methods
# you decide to add to MyCreature class. The length of the list is the size of
# the population. You need to generate a new population of the same size. Creatures from
# old population can be used in the new population - simulation will reset them to their
# starting state (not dead, new health, etc.).
#
# Returns: a list of MyCreature objects of the same length as the old_population.
def selection(old_population, fitnessScore):
elite_creatures = []
for individual in old_population:
if individual.fitness > fitnessScore:
elite_creatures.append(individual)
elite_creatures.sort(key=lambda x: x.fitness, reverse=True)
return elite_creatures
def crossOver(creature1, creature2):
child1 = MyCreature(11, 9)
child2 = MyCreature(11, 9)
child1_chromosome = []
child2_chromosome = []
#print("parent1", creature1.chromosome)
#print("parent2", creature2.chromosome)
for row in range(11):
chromosome1 = creature1.chromosome[row]
chromosome2 = creature2.chromosome[row]
index1 = random.randint(1, 9 - 2)
index2 = random.randint(1, 9 - 2)
if index2 >= index1:
index2 += 1
else: # Swap the two cx points
index1, index2 = index2, index1
child1_chromosome.append(np.concatenate([chromosome1[:index1],chromosome2[index1:index2],chromosome1[index2:]]))
child2_chromosome.append(np.concatenate([chromosome2[:index1],chromosome1[index1:index2],chromosome2[index2:]]))
child1.chromosome = child1_chromosome
child2.chromosome = child2_chromosome
#print("child1", child1_chromosome)
return(child1, child2)
def crossOverRows(creature1, creature2):
child = MyCreature(11, 9)
child_chromosome = np.empty([11,9])
i = 0
while i < 11:
if i != 10:
child_chromosome[i] = creature1.chromosome[i]
child_chromosome[i+1] = creature2.chromosome[i+1]
else:
child_chromosome[i] = creature1.chromosome[i]
i += 2
child.chromosome = child_chromosome
return child
# print("parent1", creature1.chromosome[:3])
# print("parent2", creature2.chromosome[:3])
# print("crossover rows ", child_chromosome[:3])
def newPopulation(old_population):
global numTurns
nSurvivors = 0
avgLifeTime = 0
fitnessScore = 0
fitnessScores = []
# For each individual you can extract the following information left over
# from the evaluation. This will allow you to figure out how well an individual did in the
# simulation of the world: whether the creature is dead or not, how much
# energy did the creature have a the end of simulation (0 if dead), the tick number
# indicating the time of creature's death (if dead). You should use this information to build
# a fitness function that scores how the individual did in the simulation.
for individual in old_population:
# You can read the creature's energy at the end of the simulation - it will be 0 if creature is dead.
energy = individual.getEnergy()
# This method tells you if the creature died during the simulation
dead = individual.isDead()
# If the creature is dead, you can get its time of death (in units of turns)
if dead:
timeOfDeath = individual.timeOfDeath()
avgLifeTime += timeOfDeath
else:
nSurvivors += 1
avgLifeTime += numTurns
if individual.isDead() == False:
timeOfDeath = numTurns
individual.fitness = energy + (timeOfDeath * 100)
fitnessScores.append(individual.fitness)
fitnessScore += individual.fitness
#print("fitnessscore", individual.fitness, "energy", energy, "time of death", timeOfDeath, "is dead", individual.isDead())
fitnessScore = fitnessScore / len(old_population)
eliteCreatures = selection(old_population, fitnessScore)
print(len(eliteCreatures))
newSet = []
for i in range(int(len(eliteCreatures)/2)):
if eliteCreatures[i].isDead() == False:
newSet.append(eliteCreatures[i])
print(len(newSet), " elites added to pop")
remainingRequired = w.maxNumCreatures() - len(newSet)
i = 1
while i in range(int(remainingRequired)):
newSet.append(crossOver(eliteCreatures[i], eliteCreatures[i-1])[0])
if i >= (len(eliteCreatures)-2):
i = 1
i += 1
remainingRequired = w.maxNumCreatures() - len(newSet)
# Here are some statistics, which you may or may not find useful
avgLifeTime = float(avgLifeTime)/float(len(population))
print("Simulation stats:")
print(" Survivors : %d out of %d" % (nSurvivors, len(population)))
print(" Average Fitness Score :", fitnessScore)
print(" Avg life time: %.1f turns" % avgLifeTime)
# The information gathered above should allow you to build a fitness function that evaluates fitness of
# every creature. You should show the average fitness, but also use the fitness for selecting parents and
# spawning then new creatures.
# Based on the fitness you should select individuals for reproduction and create a
# new population. At the moment this is not done, and the same population with the same number
# of individuals is returned for the next generation.
new_population = newSet
return new_population
# Pygame window sometime doesn't spawn unless Matplotlib figure is not created, so best to keep the following two
# calls here. You might also want to use matplotlib for plotting average fitness over generations.
plt.close('all')
fh=plt.figure()
# Create the world. The worldType specifies the type of world to use (there are two types to chose from);
# gridSize specifies the size of the world, repeatable parameter allows you to run the simulation in exactly same way.
w = World(worldType=worldType, gridSize=gridSize, repeatable=repeatableMode)
#Get the number of creatures in the world
numCreatures = w.maxNumCreatures()
#Get the number of creature percepts
numCreaturePercepts = w.numCreaturePercepts()
#Get the number of creature actions
numCreatureActions = w.numCreatureActions()
# Create a list of initial creatures - instantiations of the MyCreature class that you implemented
population = list()
for i in range(numCreatures):
c = MyCreature(numCreaturePercepts, numCreatureActions)
population.append(c)
# Pass the first population to the world simulator
w.setNextGeneration(population)
# Runs the simulation to evaluate the first population
w.evaluate(numTurns)
# Show the visualisation of the initial creature behaviour (you can change the speed of the animation to 'slow',
# 'normal' or 'fast')
w.show_simulation(titleStr='Initial population', speed='normal')
for i in range(numGenerations):
print("\nGeneration %d:" % (i+1))
# Create a new population from the old one
population = newPopulation(population)
# Pass the new population to the world simulator
w.setNextGeneration(population)
# Run the simulation again to evaluate the next population
w.evaluate(numTurns)
# Show the visualisation of the final generation (you can change the speed of the animation to 'slow', 'normal' or
# 'fast')
if i==numGenerations-1:
w.show_simulation(titleStr='Final population', speed='normal')
def email_matcher(emails_file, names_file):
matches = {}
with open(names_file, 'r') as names:
for i in names:
with open(emails_file, 'r') as emails:
first = i[:(i.index(' '))]
pattern2 = i[0]
last = i[::-1].strip()
last = last[0:(last.index(' '))][::-1]
for j in emails:
if re.search(first,j):
matches[i] = j
elif re.search(last,j):
matches[i] = j
else:
matches[i] = 'nothing found'
return matches
pass
This is my code so far, i know it does not work and i get the thing to be no matches found. The goal is to look through all the emails for the best matching email to a name. I have no idea how to make the pattern for regex, i tried looking at the documentation but idk the exact thing to do. What i want to do is check different things in the most accurate order
1 - check if first name last name and middle name are in email
2- check if first name and last name are in email
3 - check if first name last initial
4 - check if first initial last name
5 - check if first name
6 - check if last name
Would it be multiple searches throughout the email with like 6 different regex searches, or is there a way to do one search on every email and see if it hits any of the groups in the pattern
Right now in my code I just have a first name and last name searching that gets none right at all.
Adding emails
Mary Williams - mary.williams#gmail.com
Charles Deanna West - charles.west#yahoo.com
Jacob Jessica Andrews - jandrews#hotmail.com
Javier Daisy Sparks - javier.sparks#gmail.com
Paula A. Graham - graham#gmail.com ( could not find the best matching one, none had paula. there are multiple paulas and grahams in the names list as well)
Jasmine Sherman - jherman#hotmail.com
Matthew Foster - matthew.foster#gmail.com
Ernest Michael Bowman - ernest.bowman#gmail.com
Chad Hernandez - hernandez#gmail.com
So i just looked through all of these and it seems the pattern is firstinitiallastname, firstname.lastname, or lastname#email. The thing is though there are a shit ton of names and even more emails so I dont know the general case. But I feel like it would suffice if i looked for firstname.lastname#email, then firstinitiallastname#email,then middleinitallastname#email, and then the worst case would just be lastname#email?
Here's a way that you can do it without regex but by using a fuzzy matching system called Levenshtein.
First, separate the email from the domain so that the #something.com is in a different column.
Next, it sounds like you're describing an algorithm for fuzzy matching called Levenshtein distance. You can use a module designed for this, or perhaps write a custom one:
import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
""" levenshtein_ratio_and_distance:
Calculates levenshtein distance between two strings.
If ratio_calc = True, the function computes the
levenshtein distance ratio of similarity between two strings
For all i and j, distance[i,j] will contain the Levenshtein
distance between the first i characters of s and the
first j characters of t
"""
# Initialize matrix of zeros
rows = len(s)+1
cols = len(t)+1
distance = np.zeros((rows,cols),dtype = int)
# Populate matrix of zeros with the indeces of each character of both strings
for i in range(1, rows):
for k in range(1,cols):
distance[i][0] = i
distance[0][k] = k
# Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions
for col in range(1, cols):
for row in range(1, rows):
if s[row-1] == t[col-1]:
cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
else:
# In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
# the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
if ratio_calc == True:
cost = 2
else:
cost = 1
distance[row][col] = min(distance[row-1][col] + 1, # Cost of deletions
distance[row][col-1] + 1, # Cost of insertions
distance[row-1][col-1] + cost) # Cost of substitutions
if ratio_calc == True:
# Computation of the Levenshtein Distance Ratio
Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
return Ratio
else:
# print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
# insertions and/or substitutions
# This is the minimum number of edits needed to convert string a to string b
return "The strings are {} edits away".format(distance[row][col])
Now you can get a numerical value for how similar they are. You'll still need to establish a cutoff as to what number is acceptable to you.
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
There are other similarity algorithms other than Levenshtein. You might try Jaro-Winkler, or perhaps Trigram.
I got this code from: https://www.datacamp.com/community/tutorials/fuzzy-string-python
Ok I figured out that the pattern works for everything
I have an xml file like the following:
<edge from="0/0" to="0/1" speed="10"/>
<edge from="0/0" to="1/0" speed="10"/>
<edge from="0/1" to="0/0" speed="10"/>
<edge from="0/1" to="0/2" speed="10"/>
...
Note, that there exist pairs of from-to and vice versa. (In the example above only the pair ("0/0","0/1") and ("0/1","0/0") is visible, however there is a partner for every entry.) Also, note that those pairs are not ordered.
The file describes edges within a SUMO network simulation. I want to assign new speeds randomly to the different streets. However, every <edge> entry only describes one direction(lane) of a street. Hence, I need to find its "partner".
The following code distributes the speed values lane-wise only:
import xml.dom.minidom as dom
import random
edgexml = dom.parse("plain.edg.xml")
MAX_SPEED_OPTIONS = ["8","9","10"]
for edge in edgexml.getElementsByTagName("edge"):
x = random.randint(0,2)
edge.setAttribute("speed", MAX_SPEED_OPTIONS[x])
Is there a simple (pythonic) way to maybe gather those pairs in tuples and then assign the same value to both?
If you know a better way to solve my problem using SUMO tools, I'd be happy too. However I'm still interested in how I can solve the given abstract list problem in python as it is not just a simple zip like in related questions.
Well, you can walk the list of edges and nest another iteration over all edges to search for possible partners. Since this is of quadratic complexity, we can even reduce calculation time by only walking over not yet visited edges in the nested run.
Solution
(for a detailed description, scroll down)
import xml.dom.minidom as dom
import random
edgexml = dom.parse('sampledata/tmp.xml')
MSO = [8, 9, 10]
edge_groups = []
passed = []
for idx, edge in enumerate(edgexml.getElementsByTagName('edge')):
if edge in passed:
continue
partners = []
for partner in edgexml.getElementsByTagName('edge')[idx:]:
if partner.getAttribute('from') == edge.getAttribute('to') \
and partner.getAttribute('to') == edge.getAttribute('from'):
partners.append(partner)
edge_groups.append([edge] + partners)
passed.extend([edge] + partners)
for e in edge_groups:
print('NEW EDGE GROUP')
x = random.choice(MSO)
for p in e:
p.setAttribute('speed', x)
print(' E from "%s" to "%s" at "%s"' % (p.getAttribute('from'), p.getAttribute('to'), x))
Yields the output:
NEW EDGE GROUP
E from "0/0" to "0/1" at "8"
E from "0/1" to "0/0" at "8"
NEW EDGE GROUP
E from "0/0" to "1/0" at "10"
NEW EDGE GROUP
E from "0/1" to "0/2" at "9"
Detailed description
edge_groups = []
passed = []
Initialize the result structure edge_groups, which will be a list of lists holding partnered edges in groups. The additional list passed will help us to avoid redundant edges in our result.
for idx, edge in enumerate(edgexml.getElementsByTagName('edge')):
Start iterating over the list of all edges. I use enumerate here to obtain the index at the same time, because our nested iteration will only iterate over a sub-list starting at the current index to reduce complexity.
if edge in passed:
continue
Stop, if we have visited this edge at any point in time before. This does only happen if the edge has been recognized as a partner of another list before (due to index-based sublisting). If it has been taken as the partner of another list, we can omit it with no doubt.
partners = []
for partner in edgexml.getElementsByTagName('edge')[idx:]:
if partner.getAttribute('from') == edge.getAttribute('to') \
and partner.getAttribute('to') == edge.getAttribute('from'):
partners.append(partner)
Initialize helper list to store identified partner edges. Then, walk through all edges in the remaining list starting from the current index. I.e. do not iterate over edges that have already been passed in the outer iteration. If the potential partner is an actual partner (from/to matches), then append it to our partners list.
edge_groups.append([edge] + partners)
passed.extend([edge] + partners)
The nested iteration has passed and partners holds all identified partners for the current edge. Push them into one list and append it to the result variable edge_groups. Since it is unneccessarily complex to check against the 2-level list edge_groups to see whether we have already traversed an edge in the next run, we will additionally keep a list of already used nodes and call it passed.
for e in edge_groups:
print('NEW EDGE GROUP')
x = random.choice(MSO)
for p in e:
p.setAttribute('speed', x)
print(' E from "%s" to "%s" at "%s"' % (p.getAttribute('from'), p.getAttribute('to'), x))
Finally, we walk over all groups of edges in our result edge_groups, randomly draw a speed from MSO (hint: use random.choice() to randomly choose from a list), and assign it to all edges in this group.