Match a name to the best-fitting email using regex - Python

def email_matcher(emails_file, names_file):
    matches = {}
    with open(names_file, 'r') as names:
        for i in names:
            with open(emails_file, 'r') as emails:
                first = i[:(i.index(' '))]
                pattern2 = i[0]
                last = i[::-1].strip()
                last = last[0:(last.index(' '))][::-1]
                for j in emails:
                    if re.search(first, j):
                        matches[i] = j
                    elif re.search(last, j):
                        matches[i] = j
                    else:
                        matches[i] = 'nothing found'
    return matches
This is my code so far. I know it does not work; every name ends up with 'nothing found'. The goal is to look through all the emails and find the best-matching email for each name. I have no idea how to build the regex pattern; I tried looking at the documentation, but I don't know exactly what to do. What I want is to check different things, in order from most to least specific:
1. Check if first name, last name, and middle name are in the email.
2. Check if first name and last name are in the email.
3. Check if first name and last initial are in the email.
4. Check if first initial and last name are in the email.
5. Check if first name is in the email.
6. Check if last name is in the email.
Would it be multiple searches through each email with six different regex patterns, or is there a way to do one search per email and see which group in the pattern it hits?
Right now my code only searches for the first name and the last name, and it gets none of them right.
Adding the emails:
Mary Williams - mary.williams@gmail.com
Charles Deanna West - charles.west@yahoo.com
Jacob Jessica Andrews - jandrews@hotmail.com
Javier Daisy Sparks - javier.sparks@gmail.com
Paula A. Graham - graham@gmail.com (could not find the best-matching one; none had "paula", and there are multiple Paulas and Grahams in the names list as well)
Jasmine Sherman - jherman@hotmail.com
Matthew Foster - matthew.foster@gmail.com
Ernest Michael Bowman - ernest.bowman@gmail.com
Chad Hernandez - hernandez@gmail.com
So I just looked through all of these, and it seems the patterns are firstinitiallastname, firstname.lastname, or lastname@domain. The thing is, there are a huge number of names and even more emails, so I don't know the general case. But I feel like it would suffice if I looked for firstname.lastname@, then firstinitiallastname@, then middleinitiallastname@, and then, as the worst case, just lastname@?

Here's a way you can do it without regex, using a fuzzy matching measure called Levenshtein distance.
First, separate the email from the domain, so that the @something.com part is in a different column.
Next, compute the Levenshtein distance between each name and each email's local part. You can use a module designed for this, or write a custom implementation:
import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings.
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t.
    """
    # Initialize matrix of zeros
    rows = len(s) + 1
    cols = len(t) + 1
    distance = np.zeros((rows, cols), dtype=int)

    # Populate matrix of zeros with the indices of each character of both strings
    for i in range(1, rows):
        for k in range(1, cols):
            distance[i][0] = i
            distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings in a given position [i,j], the cost is 0
            else:
                # To align the results with those of the Python Levenshtein package, the cost of a
                # substitution is 2 when calculating the ratio, and 1 when calculating plain distance.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                     distance[row][col-1] + 1,      # Cost of insertions
                                     distance[row-1][col-1] + cost) # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s) + len(t)) - distance[row][col]) / (len(s) + len(t))
        return Ratio
    else:
        # print(distance) # Uncomment to see the matrix showing how the algorithm computes
        # the cost of deletions, insertions and/or substitutions
        # This is the minimum number of edits needed to convert string s to string t
        return "The strings are {} edits away".format(distance[row][col])
Now you can get a numerical value for how similar they are. You'll still need to establish a cutoff as to what number is acceptable to you.
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
There are other similarity algorithms besides Levenshtein; you might try Jaro-Winkler or trigram similarity.
I got this code from: https://www.datacamp.com/community/tutorials/fuzzy-string-python

OK, I figured out that the pattern works for everything.


Google Kickstart 2014 Round D, "Sort a scrambled itinerary" - Do I need to provide the input in a ready-to-use array format?

Problem:
Once upon a day, Mary bought a one-way ticket from somewhere to somewhere with some flight transfers.
For example: SFO->DFW DFW->JFK JFK->MIA MIA->ORD.
Obviously, transfer flights at a city twice or more doesn't make any sense. So Mary will not do that.
Unfortunately, after she received the tickets, she messed them up and forgot their order.
Help Mary rearrange the tickets to make the tickets in correct order.
Input:
The first line contains the number of test cases T, after which T cases follow.
For each case, it starts with an integer N; N flight tickets follow.
Each flight ticket is given as 2 lines: the source on one line and the destination on the next.
Output:
For each test case, output one line containing "Case #x: itinerary", where x is the test case number (starting from 1) and the itinerary is a sorted list of flight tickets that represent the actual itinerary.
Each flight segment in the itinerary should be outputted as pair of source-destination airport codes.
Sample Input:
2
1
SFO
DFW
4
MIA
ORD
DFW
JFK
SFO
DFW
JFK
MIA

Sample Output:
Case #1: SFO-DFW
Case #2: SFO-DFW DFW-JFK JFK-MIA MIA-ORD
My question:
I am a beginner in the field of competitive programming. My question is how to interpret the given input in this case. How did Googlers program this input? When I write a function with a Python array as its argument, will this argument be in a ready-to-use array format or will I need to deal with the above mentioned T and N numbers in the input and then arrange airport strings in an array format to make it ready to be passed in the function's argument?
I have looked up at the following Google Kickstart's official Python solution to this problem and was confused how they simply pass the ticket_list argument in the function. Don't they need to clear the input from the numbers T and N and then arrange the airport strings into an array, as I have explained above?
Also, I could not understand how the attributes first and second can simply appear when no class has been instantiated. But I think this should be another question...
def print_itinerary(ticket_list):
    arrival_map = {}
    destination_map = {}
    for ticket in ticket_list:
        arrival_map[ticket.second] += 1
        destination_map[ticket.first] += 1
    current = FindStart(arrival_map)
    while current in destination_map:
        next = destination_map[current]
        print current + "-" + next
        current = next
You need to implement it yourself to read data from standard input and write results to standard output.
Sample code for reading from standard input and writing to standard output can be found in the coding section of the FAQ on the KickStart Web site.
If you write the solution to this problem in python, you can get T and N as follows.
T = int(input())
for t in range(1, T + 1):
    N = int(input())
    ...
Then if you want to get the source and destination of the flight ticket as a list, you can use the same input method to get them in the list.
ticket_list = [[input(), input()] for _ in range(N)]
# [['MIA', 'ORD'], ['DFW', 'JFK'], ['SFO', 'DFW'], ['JFK', 'MIA']]
If you want to use first and second, try a namedtuple.
from collections import namedtuple

Pair = namedtuple('Pair', ['first', 'second'])
ticket_list = [Pair(input(), input()) for _ in range(N)]

How can I exhaustively scrape the contents of a web directory?

TLDR:
How do I efficiently determine a sequence of queries that give back every result from a directory of names, given that responses to each query are limited to a small fraction of the number of entries in the whole directory?
Goal
I have been tasked with scraping all information from a number of university directories. These directories consist of faculty and staff members, and each person has information that I am interested in collecting (name, email address, title, department, etc.). For most directories, my goal is to get URLs which correspond to each member of the directory, so that information about that person can be gathered individually. Thus, for my purposes, getting a list of every name in the directory is sufficient.
For some of these directories, I am required to make a search query that then returns some results (others display all results at once). Usually, I am given the option to search by one (or several) fields, including first name, last name, and department, among others. Unfortunately, queries often have a maximum results limit which prevents me from simply searching A, B, C, etc.
How are queries interpreted?
Across the board, all queries are case-insensitive. Note that different directories interpret search queries differently. I have seen three interpretations (sketched in code after the list):
Assume the following toy directory: ["Abby", "Abraham", "Alb", "Babbage"]
1. Implicit following wildcard: results that start with the query are returned
In this case, searching "ab" would return "Abby" and "Abraham" but not "Babbage".
2. Implicit double wildcard: results that contain the query are returned
Here, searching "ab" would return "Abby", "Abraham", and "Babbage".
3. Fuzzy matching: results that contain or are close to the query are returned
In this case, searching "ab" would return all four names.
Algorithm
From these interpretations, I designed an algorithm which assumes its queries are treated with implicit following wildcards. I chose this interpretation because, when the same queries are interpreted as having an implicit double wildcard or with fuzzy matching, the results will be a superset of the expected interpretation. Thus, the same algorithm could be applied to all situations.
Potential caveat: with the double wildcard or fuzzy matching interpretations, there will be many more results for each query, requiring many more queries to cover the whole directory at the same number of maximum results.
Algorithm specification
The algorithm proceeds as follows:
1. Set the last name query to be a.
2. Make the last name query.
2a. If not over the results limit, save the results to the results set, increment the last character of the last name query (e.g. go from a to b, or from apple to applf), and return to step 2; note that this can induce a carrying process, such as from azzz to b (see the short demonstration after this list). If this increment overflows, the search is complete and you should skip to step 4. If over the results limit, continue to step 2b.
2b. Set the first name query to be a.
2c. Query the first name and the last name at the same time.
2d. If not over the results limit, save the results to the results set, increment the last character of the first name query, and return to step 2c. If this increment overflows (e.g. if a first name query of z is under the results limit), set the first name query to be blank and continue to step 3. If over the results limit, add an a to the end of the first name query and return to step 2c.
3. Add an a to the end of the last name query and return to step 2.
4. Return the set of results.
Python implementation
Here is the above algorithm, implemented in Python pseudocode. I included helper functions such as make_query(), increment(), and append_a().
import string

chars = string.ascii_lowercase
names = get_random_names(n=10000)  # assumed helper: returns a list of (first, last) tuples
results_limit = 25

def make_query(first="", last=""):
    print("Querying for the following:")
    print("first:", first)
    print("last:", last)
    results = set()
    for n in names:
        f, l = n
        if f.lower().startswith(first) and l.lower().startswith(last):
            results.add(n)
    if len(results) > results_limit:
        print("Too many results")
        print()
        return set(), True
    else:
        print("Success! This gave " + str(len(results)) + " results")
        print()
        return results, False

def increment(q):
    ql = [chars.index(c) for c in q]
    while ql[-1] == len(chars) - 1:
        del ql[-1]
        if len(ql) == 0:
            return "", True  # overflow: the query was all z's
    ql[-1] += 1
    return "".join([chars[i] for i in ql]), False

def append_a(q):
    ql = [chars.index(c) for c in q]
    ql.append(0)
    return "".join([chars[i] for i in ql])

def search_directory(field="last", fixed_last=None):
    all_results = set()
    query = "a"
    num_queries = 0
    while True:
        if field == "last":
            query_results, over_limit = make_query(last=query)
            num_queries += 1
        elif field == "first":
            query_results, over_limit = make_query(first=query, last=fixed_last)
            num_queries += 1
        if not over_limit:
            all_results = all_results.union(query_results)
            query, is_finished = increment(query)
            if is_finished:
                return all_results, num_queries
            continue
        elif over_limit and field == "last":
            first_name_results, first_num_queries = search_directory(field="first", fixed_last=query)
            num_queries += first_num_queries
            all_results = all_results.union(first_name_results)
        query = append_a(query)

results, num_queries = search_directory()
print(results)
print("Number of results:", len(results))
print("Number of entries in directory:", len(set(names)))
print("Accuracy:", str(len(results)/len(set(names))))
print("Number of queries:", num_queries)
print("Missed names:")
print(set(names) - set(results))
Search example
To help people understand this algorithm better, I am providing an example sequence of queries and responses. For brevity, assume the directory consists of the following names only (in (first, last) format):
[("aa", "bac"), ("aa", "bba"), ("aa", "aaa"), ("ab", "bc"), ("b", "bab"), ("ccc", "a")]
Additionally, the results limit will be two (searches can have at most two results). Finally, assume that our queries will be interpreted as having wildcards following them. Here are the queries the algorithm performs:
#  | Last name query | First name query | Response
1  | a               | -                | Under results limit
2  | b               | -                | Over results limit
3  | b               | a                | Over results limit
4  | b               | aa               | Under results limit
5  | b               | ab               | Under results limit
6  | b               | ac               | Under results limit
7  | b               | b                | Under results limit
8  | b               | c                | Under results limit
9  | ba              | -                | Under results limit
10 | bb              | -                | Under results limit
11 | bc              | -                | Under results limit
12 | c               | -                | Under results limit
And the results:
{('aa', 'bba'), ('b', 'bab'), ('aa', 'bac'), ('aa', 'aaa'), ('ab', 'bc'), ('ccc', 'a')}
Issues
With these 12 queries, I was able to get every name in the database. However, I have concerns about the efficiency of the algorithm. I tested it on random subsets of names from the 2015 Facebook leak, and I was able to achieve 100% completeness on databases of thousands of names. From what I can tell, it took as many as 18,000 queries to retrieve a database of 9,000 names and 240,000 queries to retrieve a database of 90,000 names.
This is not a desirable level of performance, given that many of the directories I need to run the algorithm on have on the order of 10,000 entries, and each query could take as much as a second or two. More problematically, when adapting this algorithm to use the double wildcard interpretation, it takes as many as 280,000 queries to recover an 8,000 entry database, which is clearly too many.
Is there a more efficient way for me to achieve full coverage, both in the case of the following wildcard and the double wildcard interpretation?
Problem restatement
How do I efficiently determine a sequence of queries that give back every result from a directory of names, given that responses to each query are limited to a small fraction of the number of entries in the whole directory?

How can I find the maximum number of cities that I can visit given a travel budget (in minutes) using a travel time matrix

I have a list of 12 cities, all connected to each other without exception. The only thing of concern is travel time. The name of each city is here. The distance matrix (representing travel time in minutes) between city pairs is here.
How can I find out how many cities I can visit, given a certain travel budget (say 800 minutes), from a city of origin (it can be any of the 12)?
You can't visit the same city twice during the trip, and you don't need to worry about returning to your origin. I can't go above my travel budget.
import numpy as np
from scipy.spatial import distance_matrix
from sklearn.cluster import AgglomerativeClustering

def find_cities(dist, budget):
    # dist: a 12x12 matrix of travel times in minutes between city pairs
    # budget: max travel time allowed for the trip (in minutes)
    assert len(dist) == 12  # 12 cities, each with a pairwise cost to the other 11
    clusters = []  # list of city ids to visit, numbered from 1
    dists = [0] + [row[1:] for row in dist]  # excludes start-to-start costs from the distances
    linkage = 'complete'  # complete linkage, aiming for the minimum number of clusters required
    # affinity must be 'precomputed', otherwise euclidean distance is used by default;
    # compute_full_tree ensures all possible clusters are produced, which is needed to
    # decide how many clusters the budget allows
    ac = AgglomerativeClustering(affinity='precomputed', linkage=linkage, compute_full_tree=True)
    Z = ac.fit_predict(dists).tolist()  # list of cluster labels for each city
    while budget >= min(dists):  # while the budget still covers the cheapest intercity hop
        if len(set(Z)) > 1:  # at least 2 clusters are needed to form a valid tour
            # find the cluster with the most cities; that will be the next destination
            c1 = np.argmax([sum([i == j for j in Z]) for i in set(Z)])
            # retrieve the first city belonging to that cluster
            c2 = [j for j, val in enumerate(Z) if val == Z[c1]][0]
            clusters += [c2 + 1]  # add the new destination, converting the 0-based index to a 1-based city id
            dists += [dist[c1][c2]]  # record the cost of the newly added leg
            budget -= dists[-1]  # subtract that cost from the remaining budget
        else:
            break  # only one cluster left, so stop
    return clusters  # list of the city ids to visit, in order

def main():
    # read the travel time matrix between cities from file
    with open('uk12_dist.txt', 'r') as f:
        dist = [[int(num) for num in line.split()] for line in f]
    # read the names of the 12 cities from file
    with open('uk12_name.txt', 'r') as f:
        name = [line[:-1].lower().replace(" ", "") for line in f]
    budget = 800  # max travel budget allowed (in minutes)
    print(find_cities(dist, budget), "\n")  # list of city ids to visit
    # total distance travelled, adding up the distances between the visited cities
    print("Total distance travelled:",
          sum(dist[i][j] for i, j in enumerate([0] + find_cities(dist, budget))), "\n")
    while True:
        try:
            budget = int(input("\nEnter your travel budget (in minutes): "))
            if budget <= 800:
                break  # keep asking until the value entered is no greater than 800 minutes
        except ValueError:
            pass  # invalid input; ask again
    trip = find_cities(dist, budget)
    print(name[trip[1]], "->", name[trip[2]], "-> ...", name[trip[-1]])
    return None

if __name__ == '__main__':
    main()

CS50 'DNA': Ways to speed up my Week 6 'dna.py' program?

So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow whenever run against the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately, this causes the check50 marking system to time out and return a negative result when testing with the large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say that's just from not yet knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness: I haven't profiled it or anything, but my first suspect would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). For each run of consecutive repeats found in the sequence, the number of repetitions equals the length of the run divided by the length of the current tandem repeat; we keep the longest such run. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC"), which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        # Take the longest run of consecutive repeats, since the sequence
        # may contain several separate runs of the same repeat
        longest = max((len(m.group()) for m in re.finditer(pattern, sequence)), default=0)
        count[short_tandem_repeat] = longest // len(short_tandem_repeat)

    try:
        person = next(person for person in people
                      if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())

Python: Why do I not need 2 variables when unpacking a dictionary?

movie_dataset = {'Avatar': [0.01940156245995175, 0.4812286689419795, 0.9213483146067416], "Pirates of the Caribbean: At World's End": [0.02455894456664483, 0.45051194539249145, 0.898876404494382], 'Spectre': [0.02005646812429373, 0.378839590443686, 0.9887640449438202], ... }
movie_ratings = {'Avatar': 7.9, "Pirates of the Caribbean: At World's End": 7.1, 'Spectre': 6.8, ...}
def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance

def predict(unknown, dataset, movie_ratings, k):
    distances = []
    # Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        # Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    # Taking only the k closest points
    neighbors = distances[0:k]
    total_rating = 0
    for i in neighbors[1]:
        total_rating += movie_ratings[i]  # <----- Why is this an error?
    return total_rating / len(neighbors)  # <----- Why can I not divide by total rating?
    # total_rating = 0
    # for i in neighbors:
    #     title = neighbors[1]
    #     total_rating += movie_ratings[title]  # <----- Why is this not an error?
    # return total_rating / len(neighbors)

print(movie_dataset["Life of Pi"])
print(movie_ratings["Life of Pi"])
print(predict([0.016, 0.300, 1.022], movie_dataset, movie_ratings, 5))
Two questions here. First, why is this an error?
for i in neighbors[1]:
    total_rating += movie_ratings[i]
It seems to be the same as
for i in neighbors:
    title = neighbors[1]
    total_rating += movie_ratings[title]
Second, why can I not divide by len(total_rating)?
Second question first, because it's more straightforward:
Second, why can I not divide by len(total_rating)?
You're trying to compute an average, right? So you want the sum of the ratings divided by the number of ratings?
Okay. So, you're trying to figure out how many ratings there are. What's the rule that tells you that? It seems like you're expecting to count up the ratings from where they are stored. Where are they stored? It is not total_rating; that's where you stored the numerical sum.

Where did the ratings come from? They came from looking up the names of movies in movie_ratings. So the ratings were not actually stored at all; there is nothing to measure the len of. Right? Well, not quite.

What is the rule that determines the ratings we are adding up? We are looking them up in movie_ratings by title. So how many of them are there? As many as there are titles. Where were the titles stored? They were paired up with distances in neighbors. So there are as many titles as there are neighbors (whatever "neighbor" is supposed to mean here; I don't really understand why you called it that). So that is what you want the len() of.
Onward to fixing the summation.
total_rating = 0
for i in neighbors[1]:
    total_rating += movie_ratings[i]
First, this computes neighbors[1], which will be one of the [distance_to_point, title] pairs that was .appended to the list (assuming there are at least two such values, to make the [1] index valid).
Then, the loop iterates over that two-element list, so it runs twice: the first time, i is equal to the distance value, and the second time it is equal to the title. An error occurs because the title is a string and you try to do math with it.
total_rating = 0
for i in neighbors:
    title = neighbors[1]
    total_rating += movie_ratings[title]
This loop makes i take on each of the pairs as a value. The title = neighbors[1] is broken; now we ignore the i value completely and instead always use a specific pair, and also we try to use the pair (which is a list) as a title (we need a string).
What you presumably wanted is:
total_rating = 0
for neighbor in neighbors:
    title = neighbor[1]
    total_rating += movie_ratings[title]
Notice I use a clearer name for the loop variable, to avoid confusion. neighbor is one of the values from the neighbors list, i.e., one of the distance-title pairs. From that, we can get the title, and then from the ratings data and the title, we can get the rating.
I can make it clearer, by using argument unpacking:
total_rating = 0
for neighbor in neighbors:
    distance, title = neighbor
    total_rating += movie_ratings[title]
Instead of having to understand the reason for a [1] index, now we label each part of the neighbor value, and then use the one that's relevant for our purpose.
I can make it simpler, by doing the unpacking right away:
total_rating = 0
for distance, title in neighbors:
    total_rating += movie_ratings[title]
I can make it more elegant, by not trying to explain to Python how to do sums, and just telling it what to sum:
total_rating = sum(movie_ratings[title] for distance, title in neighbors)
This uses a generator expression along with the built-in sum function, which does exactly what it sounds like.
distances is generated in the form:
[
    [0.08565491616637051, 'Spectre'],
    [0.1946446017955758, "Pirates of the Caribbean: At World's End"],
    [0.20733104650812437, 'Avatar']
]
which is what neighbors is derived from, and the names are in position 1 of each list.
neighbors[1] would just retrieve [0.1946446017955758, "Pirates of the Caribbean: At World's End"], a single element, which doesn't look like what you want. It would try to use 0.19... and Pirates... as keys in the dict movie_ratings.
I'm guessing you want this, to average the ratings of all the closest movies found by the extracted distance values from dataset:
for dist, name in neighbors:
    total_rating += movie_ratings[name]
return total_rating / len(neighbors)
