movie_dataset = {'Avatar': [0.01940156245995175, 0.4812286689419795, 0.9213483146067416], "Pirates of the Caribbean: At World's End": [0.02455894456664483, 0.45051194539249145, 0.898876404494382], 'Spectre': [0.02005646812429373, 0.378839590443686, 0.9887640449438202], ... }
movie_ratings = {'Avatar': 7.9, "Pirates of the Caribbean: At World's End": 7.1, 'Spectre': 6.8, ...}
def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance
def predict(unknown, dataset, movie_ratings, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[0:k]
    total_rating = 0
    for i in neighbors[1]:
        total_rating += movie_ratings[i]  # <----- Why is this an error?
    return total_rating / len(neighbors)  # <----- Why can I not divide by total rating
    #total_rating = 0
    #for i in neighbors:
    #    title = neighbors[1]
    #    total_rating += movie_ratings[title]  <----- Why is this not an error?
    #return total_rating / len(neighbors)
print(movie_dataset["Life of Pi"])
print(movie_ratings["Life of Pi"])
print(predict([0.016, 0.300, 1.022], movie_dataset, movie_ratings, 5))
Two questions here. First, why is this an error?
for i in neighbors[1]:
    total_rating += movie_ratings[i]
It seems to be the same as
for i in neighbors:
    title = neighbors[1]
    total_rating += movie_ratings[title]
Second, why can I not divide by len(total_rating)?
Second question first, because it's more straightforward:
Second, why can I not divide by len(total_rating)?
You're trying to compute an average, right? So you want the sum of the ratings divided by the number of ratings?
Okay. So, you're trying to figure out how many ratings there are. What's the rule that tells you that? It seems like you're expecting to count up the ratings from where they are stored. Where are they stored? It is not total_rating; that's where you stored the numerical sum. Where did the ratings come from? They came from looking up the names of movies in the movie_ratings. So the ratings were not actually stored at all; there is nothing to measure the len of. Right? Well, not quite. What is the rule that determines the ratings we are adding up? We are looking them up in the movie_ratings by title. So how many of them are there? As many as there are titles. Where were the titles stored? They were paired up with distances in the neighbors. So there are as many titles as there are neighbors (whatever "neighbor" is supposed to mean here; I don't really understand why you called it that). So that is what you want the len() of.
Onward to fixing the summation.
total_rating = 0
for i in neighbors[1]:
    total_rating += movie_ratings[i]
First, this computes neighbors[1], which will be one of the [distance_to_point, title] pairs that was .appended to the list (assuming there are at least two such values, to make the [1] index valid).
Then, the loop iterates over that two-element list, so it would run twice: the first time, i is equal to the distance value, and the second time it is equal to the title. The error occurs on the first pass: the distance is a float, not a movie title, so movie_ratings[i] raises a KeyError.
total_rating = 0
for i in neighbors:
    title = neighbors[1]
    total_rating += movie_ratings[title]
This loop makes i take on each of the pairs as a value. The title = neighbors[1] is broken; now we ignore the i value completely and instead always use a specific pair, and also we try to use the pair (which is a list) as a title (we need a string). If this code actually ran, the dictionary lookup would fail with a TypeError, because a list is unhashable and cannot be used as a key.
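To see both failures concretely, here is a small sketch. The two-entry neighbors list and the shortened ratings dict are made-up stand-ins for the real data (values rounded from the question), so treat it as an illustration rather than the original program.

```python
# Stand-in data (assumed for illustration), shaped like the question's
movie_ratings = {"Spectre": 6.8, "Avatar": 7.9}
neighbors = [[0.0857, "Spectre"], [0.2073, "Avatar"]]

# First loop: iterating over neighbors[1] walks the inside of ONE pair
inner_values = list(neighbors[1])
print(inner_values)  # [0.2073, 'Avatar']

errors = []
try:
    movie_ratings[inner_values[0]]  # the distance is not a key in the dict
except KeyError as exc:
    errors.append(type(exc).__name__)

# Second loop: neighbors[1] is a whole pair (a list), and lists are unhashable
try:
    movie_ratings[neighbors[1]]
except TypeError as exc:
    errors.append(type(exc).__name__)

print(errors)  # ['KeyError', 'TypeError']
```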
What you presumably wanted is:
total_rating = 0
for neighbor in neighbors:
    title = neighbor[1]
    total_rating += movie_ratings[title]
Notice I use a clearer name for the loop variable, to avoid confusion. neighbor is one of the values from the neighbors list, i.e., one of the distance-title pairs. From that, we can get the title, and then from the ratings data and the title, we can get the rating.
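I can make it clearer, by using sequence unpacking:
total_rating = 0
for neighbor in neighbors:
    dist, title = neighbor
    total_rating += movie_ratings[title]
Instead of having to understand the reason for a [1] index, now we label each part of the neighbor value, and then use the one that's relevant for our purpose. (I call the first part dist rather than distance because assigning to the name distance inside predict would make it a local variable there and break the earlier call to the distance function.)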
I can make it simpler, by doing the unpacking right away:
total_rating = 0
for dist, title in neighbors:
    total_rating += movie_ratings[title]
I can make it more elegant, by not trying to explain to Python how to do sums, and just telling it what to sum:
total_rating = sum(movie_ratings[title] for dist, title in neighbors)
This uses a generator expression along with the built-in sum function, which does exactly what it sounds like.
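Putting the pieces together, a complete version might look like the sketch below. It reuses only the three movies visible in the question's data, and k=2 is an arbitrary choice for the demo; the compact distance and the sorted(...) call are my own rewording of the question's loops, not a required change.

```python
def distance(movie1, movie2):
    """Euclidean distance between two equal-length feature lists."""
    return sum((a - b) ** 2 for a, b in zip(movie1, movie2)) ** 0.5

def predict(unknown, dataset, movie_ratings, k):
    # Pair every title with its distance to the unknown point, closest first
    distances = sorted(
        (distance(features, unknown), title) for title, features in dataset.items()
    )
    neighbors = distances[:k]  # the k closest (distance, title) pairs
    total_rating = sum(movie_ratings[title] for _dist, title in neighbors)
    return total_rating / len(neighbors)

# Only the three entries shown in the question; the real dataset is larger
movie_dataset = {
    "Avatar": [0.01940156245995175, 0.4812286689419795, 0.9213483146067416],
    "Pirates of the Caribbean: At World's End": [0.02455894456664483, 0.45051194539249145, 0.898876404494382],
    "Spectre": [0.02005646812429373, 0.378839590443686, 0.9887640449438202],
}
movie_ratings = {
    "Avatar": 7.9,
    "Pirates of the Caribbean: At World's End": 7.1,
    "Spectre": 6.8,
}

prediction = predict([0.016, 0.300, 1.022], movie_dataset, movie_ratings, 2)
print(prediction)  # roughly 6.95, the mean of Spectre (6.8) and Pirates (7.1)
```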
distances is generated in the form:
[
    [0.08565491616637051, 'Spectre'],
    [0.1946446017955758, "Pirates of the Caribbean: At World's End"],
    [0.20733104650812437, 'Avatar']
]
which is what neighbors is derived from, and the names are in position 1 of each list.
neighbors[1] would just retrieve [0.1946446017955758, "Pirates of the Caribbean: At World's End"], i.e. a single element, which doesn't look like what you want. Iterating over it would try to use 0.19... and then Pirates... as keys in the movie_ratings dict.
I'm guessing you want this, to average the ratings of the k closest movies:
for dist, name in neighbors:
    total_rating += movie_ratings[name]
return total_rating / len(neighbors)
I'm a newbie programmer working on an idea for a small game. I wanted my play space to be a grid for various reasons. Without a lot of good reason, I decided to create a class of GridSquare objects, each object having properties like size, an index to describe what (x,y) coordinates they represented, and some flags to determine if the grid squares were on land or empty space, for example. My grid is a dictionary of these objects, where each GridSquare is a key. The values in the dictionary are going to be various objects in the play space, so that I can easily look up which objects are on each grid square.
Just describing this I feel like a complete lunatic. Please bear in mind that I've only been at this a week.
My problem appears when I try to change the GridSquare objects. For example, I want to use a list to generate the land on each level. So I iterate over the list, and for each value I look through my grid squares using a for loop until I find one with the right index, and flip the GridSquare.land property. But I found that this caused a runtime error, since I was changing keys in a dictionary I was looping through. OK.
Now what I'm trying to do is to create a list of the keys I want to change. For each item in my level-generating list, I go through all the GridSquares in my grid dictionary until I find the one with the index I'm looking for, then I append that GridSquare to a list of old GridSquares that need updating. I then make another copy of the GridSquare, with some properties changed, in a list of altered GridSquares. Finally, I delete any keys from my grid dictionary which match my list of "old" GridSquares, and then add all of the altered ones into my grid dictionary.
The problem is that when I delete keys from my grid dictionary which match my list of "old" keys, I run into KeyErrors. I can't understand what is happening to my keys before I can delete them. Using try/except, I can see that it's only a small number of the keys, and which ones fail seems to vary arbitrarily when I change parts of my code.
I would appreciate any insight into this behaviour.
Here is code for anyone still reading:
aspect_ratio = (4, 3)
screen_size = (1280, 720)
#defining a class of objects called GridSquares
class GridSquare:
    def __init__(self, x, y):
        self.index = (x, y)
        self.land = 0
#creates a dictionary of grid squares which I hope will function as a grid......
grid = {}
for x_index in range(1, (aspect_ratio[0] + 1)):
    for y_index in range (1, (aspect_ratio[1] + 1)):
        new_square = GridSquare(x_index, y_index)
        grid[new_square] = None
#these are lists to hold changes I need to make to the dictionary of grid squares
grid_changes = []
old_gridsquares = []
#this unwieldy list is meant to be used to generate a level. Numbers represent land, spaces are empty space.
for number_no, number in enumerate(["1", "1", "1", "1",
                                    " ", " ", " ", " ",
                                    "1", "1", "1", "1"]):
    #makes grid squares land if they are designated as such in the list
    for gridsquare in grid.keys():
        #this if statement is meant to convert each letter's position in the list into an index like the grid squares have.
        if gridsquare.index == ((number_no + 1) % (aspect_ratio[0]), ((number_no + 1) // (aspect_ratio[0] + 1)) + 1):
            #create a list of squares that need to be updated, and a list of squares to be deleted
            old_gridsquares.append(gridsquare)
            flagged_gridsquare = GridSquare((number_no + 1) % (aspect_ratio[0]), ((number_no + 1) // (aspect_ratio[0] + 1)) + 1)
            flagged_gridsquare.land = 1
            #this part is meant to set the flag for the gridsquare that indicates if it is on the far side or the near side,
            #if it is land
            if number == "1":
                flagged_gridsquare.near = 1
            grid_changes.append(flagged_gridsquare)
#deletes from grid any items with a key that matches the old squares, and adds updated versions.
for old_gridsquare in old_gridsquares:
    try:
        del grid[old_gridsquare]
    except KeyError:
        print(old_gridsquare.index)
        print(old_gridsquare.land)
for grid_change in grid_changes:
    grid[grid_change] = None
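One possible explanation, offered as a sketch rather than a confirmed diagnosis: the index expression in the if statement can produce the same (x, y) pair for two different list positions. If that happens, the same GridSquare is appended to old_gridsquares twice, and the second del on it raises a KeyError because the key is already gone. The snippet below only recomputes that expression for the 12 list positions and reports repeats; the names indices and duplicates are mine.

```python
from collections import Counter

aspect_ratio = (4, 3)

# Recompute the question's index expression for each position in the level list
indices = [
    ((n + 1) % aspect_ratio[0], ((n + 1) // (aspect_ratio[0] + 1)) + 1)
    for n in range(12)
]

# Any index produced more than once would be queued for deletion more than once
duplicates = [idx for idx, count in Counter(indices).items() if count > 1]
print(duplicates)  # [(1, 2)]
```

The same list also contains pairs with x equal to 0, which never match a square because the grid's x values start at 1, so the mapping itself may be worth revisiting.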
Inspired by some projects, I have decided to work on a calculator project based on Python.
Essentially, I have 5 teams in a fantasy league, with points assigned to these teams based on their current standings. Teams A-E.
Assuming the league has 10 more matches to be played, my main aim is to calculate the probability that a team makes it to the top 3 in their league, given that each match has a 33.3% chance of ending in each of three ways:
A win for the team (which adds 2 points to the winning team)
A loss for the team (which awards 0 points to the losing team)
A draw (which awards 1 point to both teams)
This also in turn means there will be 3^10 outcomes for the 10 matches to be played.
For each of these 3^10 scenarios, I will also compute how the final points table will look, and from there, I will be able to sort and figure out which are the top 3 teams in the fantasy league.
I've worked halfway through the project, as shown:
Points = {"A": 12, "B": 14, "C": 8, "D": 12, "E": 6}  # The current standings
RemainingMatches = [
    "A:B",
    "B:D",
    "C:E",
    "A:E",
    "D:C",
    "B:D",
    "C:D",
    "A:E",
    "C:E",
    "D:C",
]
n = len(RemainingMatches)  # Number of matches left
combinations = pow(3, n)  # Number of possible scenarios left, assuming each game has 3 outcomes
print("Number of remaining matches = %d" % n)
print("Number of possible scenarios = %d" % combinations)
for i in range(0, combinations):
    ...
    for i in range(0, n):
        ...
I am currently wondering how I get these possible combinations to match a certain scenario. For example, when i = 0, it points to the first matchup where A wins and B loses. Hence, Points[A] = Points[A] + 2 etc. I know there will be a nested loop since I have to consider the other matches too. But how exactly do I map each scenario, and then nest it?
Apologies if I am being unclear here, struggling with this part at the moment.
Thinking Process:
3 outcomes per game.
for i to combinations:
#When i =1, means A win/B lost?
#When i =2, means B win/A lost?
#When i =3, means A/B drew?
for i to n:
#Go to next match?
Not exactly sure what is the logic flow for this kind of scenario. Thank you.
Here is a different way to write the code. If you knew in advance the outcome of each of the ten remaining games, you could compute which teams would finish in top three. This is done with the play_out function, which takes current standings, remaining games, and the known outcomes of future games.
Now all that remains is to loop over all possible future outcomes and tally the winning probabilities. This is done in the simulate_league function that takes in current standings and remaining games, and returns a probability that a given team finishes in top 3.
There may be situations where two teams are tied for the third place. In cases like this, the code below allows for four teams or more to be in "top 3". To use a different rule, you can change the top3scores = nlargest(3, set(pts.values())) line.
In terms of useful Python functions, itertools.product is used to generate all possible win/draw/loss outcomes, and heapq.nlargest is used to find the largest 3 scores out of a bunch. The collections.Counter class is used to count the number of possibilities in which a given team finishes in top 3.
from itertools import product
from heapq import nlargest
from collections import Counter
Points = {"A":12, "B":14, "C":8, "D":12, "E":6} # The current standings
RemainingMatches = ["A:B","B:D","C:E","A:E","D:C","B:D","C:D","A:E","C:E","D:C"]
# reformat remaining matches
RemainingMatches = [tuple(s.split(":")) for s in RemainingMatches]
def play_out(points, remaining_matches, winloss_schedule):
    pts = points.copy()
    # simulate remaining games given predetermined win/loss/draw outcomes
    # given by winloss_schedule
    for (team_a, team_b), outcome in zip(remaining_matches, winloss_schedule):
        if outcome == -1:
            pts[team_a] += 2
        elif outcome == 0:
            pts[team_a] += 1
            pts[team_b] += 1
        else:
            pts[team_b] += 2
    # top 3 scores (allows for ties)
    top3scores = nlargest(3, set(pts.values()))
    return {team: score in top3scores for team, score in pts.items()}
def simulate_league(points, remaining_matches):
    top3counter = Counter()
    for winloss_schedule in product([-1, 0, 1], repeat=len(remaining_matches)):
        top3counter += play_out(points, remaining_matches, winloss_schedule)
    total_possibilities = 3 ** len(remaining_matches)
    return {team: top3count / total_possibilities
            for team, top3count in top3counter.items()}
# simulate_league(Points, RemainingMatches)
# {'A': 0.9293637487510372,
#  'B': 0.9962573455943369,
#  'C': 0.5461057765584515,
#  'D': 0.975088485833799,
#  'E': 0.15439719554945894}
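To connect this back to the loop index i from the question: each integer from 0 to 3^n - 1 can be read as an n-digit base-3 number, with one digit per match. The sketch below decodes i that way and checks that it lines up with what itertools.product yields. The helper name decode_scenario is my own, and the digit-to-outcome mapping (0 is -1, 1 is 0, 2 is 1) is just one convention chosen to match the -1/0/1 coding above.

```python
from itertools import product

def decode_scenario(i, n):
    """Turn scenario number i into n outcomes in {-1, 0, 1} via its base-3 digits."""
    outcomes = []
    for _ in range(n):
        i, digit = divmod(i, 3)      # peel off the least significant base-3 digit
        outcomes.append(digit - 1)   # shift 0/1/2 to -1/0/1
    # Reverse so the first match is the most significant digit, as in product()
    return tuple(reversed(outcomes))

n_matches = 3  # small n so the whole table is easy to inspect
scenarios_match = all(
    decode_scenario(i, n_matches) == schedule
    for i, schedule in enumerate(product([-1, 0, 1], repeat=n_matches))
)
print(decode_scenario(0, n_matches))  # (-1, -1, -1)
print(scenarios_match)  # True
```

So a single loop over range(3 ** n) plus this decoding does the same job as the nested loops, though product is usually the less error-prone way to write it.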
Currently attempting the Travelling Salesman Problem with a simulated annealing solution. All points are stored in a dictionary with the point name as the key and coordinates as the value. I'm having trouble writing a for loop (the path_tour function) that goes through every key in a given path (a randomly shuffled dictionary of locations), calculates the distances, and adds them up to return a total length. The current function below raises a KeyError, and I can't figure out why.
import math
#Calculate distances between points
def point_distance(point_a, point_b):
    return math.sqrt((point_a[0] - point_b[0])**2 + (point_a[1] - point_b[1])**2)
def get_distance_matrix(points):
    distance_matrix = {}
    for location_a in points:
        distance_matrix[location_a] = {}
        for location_b in points:
            distance_matrix[location_a][location_b] = point_distance(
                points[location_a], points[location_b])
    return distance_matrix
#Calculate length of path
def path_tour(tour):
    path_length = 0
    distances = get_distance_matrix(tour)
    for key, value in tour.items():
        path_length += distances[key][key[1:]]
    return path_length
how the get_distance_matrix is called
example of a path
error message
As you can see from the error, it was trying to look up the key "tudent Information Desk". I assume the location name was "Student Information Desk", so key[1:] removed the first letter. That's obviously not the correct location to look up.
I guess you want the distance from the current to the next location in the tour. That would be something like
locations = list(tour.keys())
for first, second in zip(locations[:-1], locations[1:]):
    path_length += distances[first][second]
I don't get why your tour is a dictionary, though. Shouldn't it be a list to begin with? I know that dictionaries keep their insertion order in Python 3.7 and later, but this usage seems counter-intuitive.
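For what it's worth, here is how that could look with the tour kept as a list of names and the coordinates kept in a separate dictionary. This is only a sketch: apart from "Student Information Desk", the location names and all coordinates are invented, and it measures an open path (it does not return to the start), which you may want to change for a real TSP tour.

```python
import math

def point_distance(point_a, point_b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])

def path_length(tour, points):
    """Sum the distances between consecutive stops of a tour (a list of names)."""
    return sum(
        point_distance(points[first], points[second])
        for first, second in zip(tour, tour[1:])
    )

# Hypothetical coordinates, for illustration only
points = {
    "Student Information Desk": (0, 0),
    "Library": (3, 4),
    "Cafeteria": (3, 0),
}
tour = ["Student Information Desk", "Library", "Cafeteria"]

total = path_length(tour, points)
print(total)  # 9.0 (5.0 for the first leg plus 4.0 for the second)
```

Shuffling the tour for simulated annealing then becomes random.shuffle on the list, with no need to rebuild a dictionary each time.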
import re
def email_matcher(emails_file, names_file):
    matches = {}
    with open(names_file, 'r') as names:
        for i in names:
            with open(emails_file, 'r') as emails:
                first = i[:(i.index(' '))]
                pattern2 = i[0]
                last = i[::-1].strip()
                last = last[0:(last.index(' '))][::-1]
                for j in emails:
                    if re.search(first,j):
                        matches[i] = j
                    elif re.search(last,j):
                        matches[i] = j
                    else:
                        matches[i] = 'nothing found'
    return matches
    pass
This is my code so far. I know it does not work: every name comes back as 'nothing found'. The goal is to look through all the emails for the best-matching email for each name. I have no idea how to build the regex pattern; I tried looking at the documentation but I don't know exactly what to do. What I want to do is check different things, from most to least accurate:
1 - check if first name, last name and middle name are in email
2 - check if first name and last name are in email
3 - check if first name and last initial
4 - check if first initial and last name
5 - check if first name
6 - check if last name
Would it be multiple searches throughout the email with like 6 different regex searches, or is there a way to do one search on every email and see if it hits any of the groups in the pattern
Right now in my code I just have a first name and last name searching that gets none right at all.
Adding emails
Mary Williams - mary.williams#gmail.com
Charles Deanna West - charles.west#yahoo.com
Jacob Jessica Andrews - jandrews#hotmail.com
Javier Daisy Sparks - javier.sparks#gmail.com
Paula A. Graham - graham#gmail.com ( could not find the best matching one, none had paula. there are multiple paulas and grahams in the names list as well)
Jasmine Sherman - jherman#hotmail.com
Matthew Foster - matthew.foster#gmail.com
Ernest Michael Bowman - ernest.bowman#gmail.com
Chad Hernandez - hernandez#gmail.com
So I just looked through all of these and it seems the pattern is firstinitiallastname, firstname.lastname, or lastname#email. The thing is, though, there are a huge number of names and even more emails, so I don't know the general case. But I feel like it would suffice if I looked for firstname.lastname#email, then firstinitiallastname#email, then middleinitiallastname#email, and then the worst case would just be lastname#email?
Here's a way that you can do it without regex, using a fuzzy-matching measure called Levenshtein distance.
First, separate the local part of each email from the domain so that the #something.com is in a different column.
Next, compare each name against the local parts. Levenshtein distance counts the single-character edits needed to turn one string into another. You can use a module designed for this, or perhaps write a custom one:
import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)
    # Populate the first column and first row with the indices of each character of both strings
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k
    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                     distance[row][col-1] + 1,      # Cost of insertions
                                     distance[row-1][col-1] + cost) # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s)+len(t)) - distance[-1][-1]) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[-1][-1])
Now you can get a numerical value for how similar they are. You'll still need to establish a cutoff as to what number is acceptable to you.
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
There are similarity algorithms other than Levenshtein. You might try Jaro-Winkler, or perhaps trigram similarity.
I got this code from: https://www.datacamp.com/community/tutorials/fuzzy-string-python
Ok I figured out that the pattern works for everything
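For anyone landing here later, the priority order described in the question (firstname.lastname, then first initial plus last name, then middle initial plus last name, then last name alone) can be checked without regex at all. The sketch below is my reading of that order, not the asker's final code: the function names are made up, the sample emails are a few of those listed above, and it assumes the part before the # (or @) is what gets compared.

```python
import re

def candidate_local_parts(full_name):
    """Build candidate email prefixes for a name, most specific first."""
    parts = [p.strip(".").lower() for p in full_name.split()]
    first, last = parts[0], parts[-1]
    middles = parts[1:-1]
    candidates = [f"{first}.{last}", f"{first[0]}{last}"]
    candidates += [f"{m[0]}{last}" for m in middles]  # middle initial + last name
    candidates.append(last)
    return candidates

def match_email(full_name, emails):
    """Return the first email whose prefix equals a candidate, or None."""
    # Index emails by the text before the separator ('#' in the sample data)
    by_local = {re.split(r"[@#]", e, maxsplit=1)[0].lower(): e for e in emails}
    for candidate in candidate_local_parts(full_name):
        if candidate in by_local:
            return by_local[candidate]
    return None

emails = [
    "mary.williams#gmail.com",
    "jandrews#hotmail.com",
    "graham#gmail.com",
    "hernandez#gmail.com",
]
print(match_email("Mary Williams", emails))          # mary.williams#gmail.com
print(match_email("Jacob Jessica Andrews", emails))  # jandrews#hotmail.com
print(match_email("Paula A. Graham", emails))        # graham#gmail.com
```

Because the candidates are tried in order, a more specific match always wins over a bare last name. Several people sharing a last name would still collide on the final fallback, so that case needs a tie-breaking rule of your own.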
I have an xml file like the following:
<edge from="0/0" to="0/1" speed="10"/>
<edge from="0/0" to="1/0" speed="10"/>
<edge from="0/1" to="0/0" speed="10"/>
<edge from="0/1" to="0/2" speed="10"/>
...
Note, that there exist pairs of from-to and vice versa. (In the example above only the pair ("0/0","0/1") and ("0/1","0/0") is visible, however there is a partner for every entry.) Also, note that those pairs are not ordered.
The file describes edges within a SUMO network simulation. I want to assign new speeds randomly to the different streets. However, every <edge> entry only describes one direction(lane) of a street. Hence, I need to find its "partner".
The following code distributes the speed values lane-wise only:
import xml.dom.minidom as dom
import random
edgexml = dom.parse("plain.edg.xml")
MAX_SPEED_OPTIONS = ["8","9","10"]
for edge in edgexml.getElementsByTagName("edge"):
    x = random.randint(0,2)
    edge.setAttribute("speed", MAX_SPEED_OPTIONS[x])
Is there a simple (pythonic) way to maybe gather those pairs in tuples and then assign the same value to both?
If you know a better way to solve my problem using SUMO tools, I'd be happy too. However I'm still interested in how I can solve the given abstract list problem in python as it is not just a simple zip like in related questions.
Well, you can walk the list of edges and nest another iteration over all edges to search for possible partners. Since this is of quadratic complexity, we can even reduce calculation time by only walking over not yet visited edges in the nested run.
Solution
(for a detailed description, scroll down)
import xml.dom.minidom as dom
import random
edgexml = dom.parse('sampledata/tmp.xml')
MSO = [8, 9, 10]
edge_groups = []
passed = []
for idx, edge in enumerate(edgexml.getElementsByTagName('edge')):
    if edge in passed:
        continue
    partners = []
    for partner in edgexml.getElementsByTagName('edge')[idx:]:
        if partner.getAttribute('from') == edge.getAttribute('to') \
                and partner.getAttribute('to') == edge.getAttribute('from'):
            partners.append(partner)
    edge_groups.append([edge] + partners)
    passed.extend([edge] + partners)
for e in edge_groups:
    print('NEW EDGE GROUP')
    x = str(random.choice(MSO))  # attribute values must be strings to serialize
    for p in e:
        p.setAttribute('speed', x)
        print(' E from "%s" to "%s" at "%s"' % (p.getAttribute('from'), p.getAttribute('to'), x))
Yields the output:
NEW EDGE GROUP
E from "0/0" to "0/1" at "8"
E from "0/1" to "0/0" at "8"
NEW EDGE GROUP
E from "0/0" to "1/0" at "10"
NEW EDGE GROUP
E from "0/1" to "0/2" at "9"
Detailed description
edge_groups = []
passed = []
Initialize the result structure edge_groups, which will be a list of lists holding partnered edges in groups. The additional list passed will help us to avoid redundant edges in our result.
for idx, edge in enumerate(edgexml.getElementsByTagName('edge')):
Start iterating over the list of all edges. I use enumerate here to obtain the index at the same time, because our nested iteration will only iterate over a sub-list starting at the current index to reduce complexity.
if edge in passed:
    continue
Skip this edge if we have visited it before. This only happens if the edge has already been recognized as the partner of an earlier edge (edges before the current index are excluded by the index-based sublisting anyway). If it has been taken as the partner of another edge, we can safely omit it.
partners = []
for partner in edgexml.getElementsByTagName('edge')[idx:]:
    if partner.getAttribute('from') == edge.getAttribute('to') \
            and partner.getAttribute('to') == edge.getAttribute('from'):
        partners.append(partner)
Initialize helper list to store identified partner edges. Then, walk through all edges in the remaining list starting from the current index. I.e. do not iterate over edges that have already been passed in the outer iteration. If the potential partner is an actual partner (from/to matches), then append it to our partners list.
edge_groups.append([edge] + partners)
passed.extend([edge] + partners)
The nested iteration has passed and partners holds all identified partners for the current edge. Push them into one list and append it to the result variable edge_groups. Since it is unnecessarily complex to check against the 2-level list edge_groups to see whether we have already traversed an edge in the next run, we additionally keep a flat list of already used edges and call it passed.
for e in edge_groups:
    print('NEW EDGE GROUP')
    x = str(random.choice(MSO))  # attribute values must be strings to serialize
    for p in e:
        p.setAttribute('speed', x)
        print(' E from "%s" to "%s" at "%s"' % (p.getAttribute('from'), p.getAttribute('to'), x))
Finally, we walk over all groups of edges in our result edge_groups, randomly draw a speed from MSO (hint: use random.choice() to randomly choose from a list), and assign it to all edges in this group.
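If the quadratic search ever becomes a bottleneck, a different technique is to group edges in a single pass with a dictionary keyed on the unordered pair of endpoints. The sketch below is an alternative to the nested loop above, not a rewrite of it; the edges root element, the SAMPLE string and the function name are assumptions made so the snippet runs on its own.

```python
import random
import xml.dom.minidom as dom

# Sample data from the question, wrapped in an assumed root element
SAMPLE = """<edges>
<edge from="0/0" to="0/1" speed="10"/>
<edge from="0/0" to="1/0" speed="10"/>
<edge from="0/1" to="0/0" speed="10"/>
<edge from="0/1" to="0/2" speed="10"/>
</edges>"""

def assign_pair_speeds(doc, options):
    """Give both directions of a street the same randomly chosen speed."""
    groups = {}
    for edge in doc.getElementsByTagName("edge"):
        # frozenset ignores direction, so A->B and B->A share one key
        key = frozenset((edge.getAttribute("from"), edge.getAttribute("to")))
        groups.setdefault(key, []).append(edge)
    for members in groups.values():
        speed = random.choice(options)  # options are strings, as attributes require
        for edge in members:
            edge.setAttribute("speed", speed)
    return groups

doc = dom.parseString(SAMPLE)
groups = assign_pair_speeds(doc, ["8", "9", "10"])
print(len(groups))  # 3 streets found among the 4 directed edges
```

Each edge is visited once, so the cost grows linearly with the number of edges instead of quadratically.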