Binary search for name - python

I have a list of countries in a separate file (countries.txt), and I need to do a binary search to find a country, and for it to state the information given on it.
My file:
Afghanistan, 647500.0, 25500100
Albania, 28748.0, 2821977
Algeria, 2381740.0, 38700000
American Samoa, 199.0, 55519
Andorra, 468.0, 76246
Angola, 1246700.0, 20609294
If I wanted to find the area and population for Albania, and I put getCountry(Albania) in the shell, how would I get it to state the provided info?
I have this so far...
def getCountry(key):
start = "%s" #index
end = len("%s")-1 #index
while start<=end:
mid = (start + end) / 2
if '%s'[mid] == key: #found it!
return True
elif "%s"[mid] > key:
end = mid -1
else:
start = mid + 1
#end < start
return False

I would use a dictionary:
def get_countries(filename):
with open(filename) as f:
country_iter = (line.strip().split(',') for line in f)
return {
country: {"area": area, "population": population}
for country, area, population in country_iter
}
if __name__ == '__main__':
d = get_countries("countries.csv")
print(d)
If you really have your heart set on a binary search, it looks more like this:
def get_countries(filename):
with open(filename) as f:
return [line.strip().split(',') for line in f]
def get_country_by_name(countries, name):
lo, hi = 0, len(countries) - 1
while lo <= hi:
mid = lo + (hi - lo) // 2
country = countries[mid]
test_name = country[0]
if name > test_name:
lo = mid + 1
elif name < test_name:
hi = mid - 1
else:
return country
return countries[lo] if countries[lo][0] == name else None
if __name__ == '__main__':
a = get_countries("countries.csv")
print(a)
c = get_country_by_name(a, "Albania")
print(c)
But this is coding a binary search off the top of my head. If you don't have to code the binary search and can use a library routine instead, it looks like this:
from bisect import bisect_left
def get_country_by_name(countries, name):
country_names = [country[0] for country in countries]
i = bisect_left(country_names, name)
return countries[i]

Conquer this problem in steps.
Start with a sorted list, and implement a binary search on the list in a function.
Make sure it works for empty lists, lists of one item, etc.
Write a function to take an unsorted list, sort it and return the result on it from the first function.
Write a function that takes a list of tuples with a string as a key and other strings as data. It should sort the data on your key, and return what you want.
Write a function that reads a file and constructs data compatible with 4 and returns the selected item.
Pat yourself on the back for solving your more complex problem in digestible steps.
Note: This clearly is an assignment to learn how to implement an algorithm. If it were truly to find the information from a file, using a dictionary would simply be wrong. Correct would be to read each line until the country was found making a single comparison for on average one half of the entries in the file. No wasted storage, no wasted time comparing, or hashing.

As Ashwini suggested in his comment, you can use the dictionary in python. It will look something like this:
countries = {'Afghanistan': (647500.0, 25500100),
'Albania': (28748.0, 2821977),
'Algeria': (2381740.0, 38700000),
'American Samoa': (199.0, 55519),
'Andorra': (468.0, 76246),
'Angola': (1246700.0, 20609294)}
print countries['Angola'][0]
You can learn more about dictionary and tuple from this python documentation

the other answer is correct you should use a dictionary but since Im guessing this is an assignment the first thing you need is a list
with open("countries.txt") as f:
#filter(none,a_list) will remove all falsey values (empty strings/lists/etc)
#map(some_function,a_list) will apply a function to all elements in a list and return the results as a new list
#in this case the iterable we are handing in as a_list is an open file handle and we are spliting each line on ","
country_list = filter(None,map(lambda x:x.split(","),f))
then you just need to search through your ordered list like any other binary search
in order to do a binary search you do something like (recursive version)
def bin_search(a_sorted_list,target):
mid_pt = len(a_sorted_list) // 2
if target < a_sorted_list[mid_pt]:
return bin_search(a_sorted_list[:mid_pt], target)
elif target > a_sorted_list[mid_pt]:
return bin_search(a_sorted_list[mid_pt:], target)
elif target == a_sorted_list[mid_pt]:
return mid_pt
in your case you will need some minor modification

Related

Somewhere inside my loop it's not appending results to a list. Why?

So I have two files/dictionaries I want to compare, using a binary search implementation (yes, this is very obviously homework).
One file is
american-english
Amazon
Americana
Americanization
Civilization
And the other file is
british-english
Amazon
Americana
Americanisation
Civilisation
The code below should be pretty straight forward. Import files, compare them, return differences. However, somewhere near the bottom, where it says entry == found_difference: I feel as if the debugger skips right over, even though I can see the two variables in memory being different, and I only get the final element returned in the end. Where am I going wrong?
# File importer
def wordfile_to_list(filename):
"""Converts a list of words to a Python list"""
wordlist = []
with open(filename) as f:
for line in f:
wordlist.append(line.rstrip("\n"))
return wordlist
# Binary search algorithm
def binary_search(sorted_list, element):
"""Search for element in list using binary search. Assumes sorted list"""
matches = []
index_start = 0
index_end = len(sorted_list)
while (index_end - index_start) > 0:
index_current = (index_end - index_start) // 2 + index_start
if element == sorted_list[index_current]:
return True
elif element < sorted_list[index_current]:
index_end = index_current
elif element > sorted_list[index_current]:
index_start = index_current + 1
return element
# Check file differences using the binary search algorithm
def wordfile_differences_binarysearch(file_1, file_2):
"""Finds the differences between two plaintext lists,
using binary search algorithm, and returns them in a new list"""
wordlist_1 = wordfile_to_list(file_1)
wordlist_2 = wordfile_to_list(file_2)
matches = []
for entry in wordlist_1:
found_difference = binary_search(sorted_list=wordlist_2, element=entry)
if entry == found_difference:
pass
else:
matches.append(found_difference)
return matches
# Check if it works
differences = wordfile_differences_binarysearch(file_1="british-english", file_2="american-english")
print(differences)
You don't have an else suite for your if statement. Your if statement does nothing (it uses pass when the test is true, skipped otherwise).
You do have an else suite for the for loop:
for entry in wordlist_1:
# ...
else:
matches.append(found_difference)
A for loop can have an else suite as well; it is executed when a loop completes without a break statement. So when your for loop completes, the current value for found_difference is appended; so whatever was assigned last to that name.
Fix your indentation if the else suite was meant to be part of the if test:
for entry in wordlist_1:
found_difference = binary_search(sorted_list=wordlist_2, element=entry)
if entry == found_difference:
pass
else:
matches.append(found_difference)
However, you shouldn't use a pass statement there, just invert the test:
matches = []
for entry in wordlist_1:
found_difference = binary_search(sorted_list=wordlist_2, element=entry)
if entry != found_difference:
matches.append(found_difference)
Note that the variable name matches feels off here; you are appending words that are missing in the other list, not words that match. Perhaps missing is a better variable name here.
Note that your binary_search() function always returns element, the word you searched on. That'll always be equal to the element you passed in, so you can't use that to detect if a word differed! You need to unindent that last return line and return False instead:
def binary_search(sorted_list, element):
"""Search for element in list using binary search. Assumes sorted list"""
matches = []
index_start = 0
index_end = len(sorted_list)
while (index_end - index_start) > 0:
index_current = (index_end - index_start) // 2 + index_start
if element == sorted_list[index_current]:
return True
elif element < sorted_list[index_current]:
index_end = index_current
elif element > sorted_list[index_current]:
index_start = index_current + 1
return False
Now you can use a list comprehension in your wordfile_differences_binarysearch() loop:
[entry for entry in wordlist_1 if not binary_search(wordlist_2, entry)]
Last but not least, you don't have to re-invent the binary seach wheel, just use the bisect module:
from bisect import bisect_left
def binary_search(sorted_list, element):
return sorted_list[bisect(sorted_list, element)] == element
With sets
Binary search is used to improve efficiency of an algorithm, and decrease complexity from O(n) to O(log n).
Since the naive approach would be to check every word in wordlist1 for every word in wordlist2, the complexity would be O(n**2).
Using binary search would help to get O(n * log n), which is already much better.
Using sets, you could get O(n):
american = """Amazon
Americana
Americanization
Civilization"""
british = """Amazon
Americana
Americanisation
Civilisation"""
american = {line.strip() for line in american.split("\n")}
british = {line.strip() for line in british.split("\n")}
You could get the american words not present in the british dictionary:
print(american - british)
# {'Civilization', 'Americanization'}
You could get the british words not present in the american dictionary:
print(british - american)
# {'Civilisation', 'Americanisation'}
You could get the union of the two last sets. I.e. words that are present in exactly one dictionary:
print(american ^ british)
# {'Americanisation', 'Civilisation', 'Americanization', 'Civilization'}
This approach is faster and more concise than any binary search implementation. But if you really want to use it, as usual, you cannot go wrong with #MartijnPieters' answer.
With two iterators
Since you know the two lists are sorted, you could simply iterate in parallel over the two sorted lists and look for any difference:
american = """Amazon
Americana
Americanism
Americanization
Civilization"""
british = """Amazon
Americana
Americanisation
Americanism
Civilisation"""
american = [line.strip() for line in american.split("\n")]
british = [line.strip() for line in british.split("\n")]
n1, n2 = len(american), len(british)
i, j = 0, 0
while True:
try:
w1 = american[i]
w2 = british[j]
if w1 == w2:
i += 1
j += 1
elif w1 < w2:
print('%s is in american dict only' % w1)
i += 1
else:
print('%s is in british dict only' % w2)
j += 1
except IndexError:
break
for w1 in american[i:]:
print('%s is in american dict only' % w1)
for w2 in british[j:]:
print('%s is in british dict only' % w2)
It outputs:
Americanisation is in british dict only
Americanization is in american dict only
Civilisation is in british dict only
Civilization is in american dict only
It's O(n) as well.

Sorting from smallest to biggest

I would like to sort several points from smallest to biggest however.
I will wish to get this result:
Drogba 2 pts
Owen 4 pts
Henry 6 pts
However, my ranking seems to be reversed for now :-(
Henry 6 pts
Owen 4 pts
Drogba 2 pts
I think my problem is with my function Bubblesort ?
def Bubblesort(name, goal1, point):
swap = True
while swap:
swap = False
for i in range(len(name)-1):
if goal1[i+1] > goal1[i]:
goal1[i], goal1[i+1] = goal1[i+1], goal1[i]
name[i], name[i+1] = name[i+1], name[i]
point[i], point[i + 1] = point[i + 1], point[i]
swap = True
return name, goal1, point
def ranking(name, point):
for i in range(len(name)):
print(name[i], "\t" , point[i], " \t ")
name = ["Henry", "Owen", "Drogba"]
point = [0]*3
goal1 = [68, 52, 46]
gain = [6,4,2]
name, goal1, point = Bubblesort( name, goal1, point )
for i in range(len(name)):
point[i] += gain[i]
ranking (name, point)
In your code:
if goal1[i+1] > goal1[i]:
that checks if it is greater. You need to swap it if the next one is less, not greater.
Change that to:
if goal1[i+1] < goal1[i]:
A bunch of issues:
def Bubblesort - PEP8 says function names should be lowercase, ie def bubblesort
You are storing your data as a bunch of parallel lists; this makes it harder to work on and think about (and sort!). You should transpose your data so that instead of having a list of names, a list of points, a list of goals you have a list of players, each of whom has a name, points, goals.
def bubblesort(name, goal1, point): - should look like def bubblesort(items) because bubblesort does not need to know that it is getting names and goals and points and sorting on goals (specializing it that way keeps you from reusing the function later to sort other things). All it needs to know is that it is getting a list of items and that it can compare pairs of items using >, ie Item.__gt__ is defined.
Instead of using the default "native" sort order, Python sort functions usually let you pass an optional key function which allows you to tell it what to sort on - that is, sort on key(items[i]) > key(items[j]) instead of items[i] > items[j]. This is often more efficient and/or convenient than reshuffling your data to get the sort order you want.
for i in range(len(name)-1): - you are iterating more than needed. After each pass, the highest value in the remaining list gets pushed to the top (hence "bubble" sort, values rise to the top of the list like bubbles). You don't need to look at those top values again because you already know they are higher than any of the remaining values; after the nth pass, you can ignore the last n values.
actually, the situation is a bit better than that; you will often find runs of values which are already in sorted order. If you keep track of the highest index that actually got swapped, you don't need to go beyond that on your next pass.
So your sort function becomes
def bubblesort(items, *, key=None):
"""
Return items in sorted order
"""
# work on a copy of the list (don't destroy the original)
items = list(items)
# process key values - cache the result of key(item)
# so it doesn't have to be called repeatedly
keys = items if key is None else [key(item) for item in items]
# initialize the "last item to sort on the next pass" index
last_swap = len(items) - 1
# sort!
while last_swap:
ls = 0
for i in range(last_swap):
j = i + 1
if keys[i] > keys[j]:
# have to swap keys and items at the same time,
# because keys may be an alias for items
items[i], items[j], keys[i], keys[j] = items[j], items[i], keys[j], keys[i]
# made a swap - update the last_swap index
ls = i
last_swap = ls
return items
You may not be sure that this is actually correct, so let's test it:
from random import sample
def test_bubblesort(tries = 1000):
# example key function
key_fn = lambda item: (item[2], item[0], item[1])
for i in range(tries):
# create some sample data to sort
data = [sample("abcdefghijk", 3) for j in range(10)]
# no-key sort
assert bubblesort(data) == sorted(data), "Error: bubblesort({}) gives {}".format(data, bubblesort(data))
# keyed sort
assert bubblesort(data, key=key_fn) == sorted(data, key=key_fn), "Error: bubblesort({}, key) gives {}".format(data, bubblesort(data, key_fn))
test_bubblesort()
Now the rest of your code becomes
class Player:
def __init__(self, name, points, goals, gains):
self.name = name
self.points = points
self.goals = goals
self.gains = gains
players = [
Player("Henry", 0, 68, 6),
Player("Owen", 0, 52, 4),
Player("Drogba", 0, 46, 2)
]
# sort by goals
players = bubblesort(players, key = lambda player: player.goals)
# update points
for player in players:
player.points += player.gains
# show the result
for player in players:
print("{player.name:<10s} {player.points:>2d} pts".format(player=player))
which produces
Drogba 2 pts
Owen 4 pts
Henry 6 pts

different result from recursive and dynamic programming

Working on below problem,
Problem,
Given a m * n grids, and one is allowed to move up or right, find the different paths between two grid points.
I write a recursive version and a dynamic programming version, but they return different results, and any thoughts what is wrong?
Source code,
from collections import defaultdict
def move_up_right(remaining_right, remaining_up, prefix, result):
if remaining_up == 0 and remaining_right == 0:
result.append(''.join(prefix[:]))
return
if remaining_right > 0:
prefix.append('r')
move_up_right(remaining_right-1, remaining_up, prefix, result)
prefix.pop(-1)
if remaining_up > 0:
prefix.append('u')
move_up_right(remaining_right, remaining_up-1, prefix, result)
prefix.pop(-1)
def move_up_right_v2(remaining_right, remaining_up):
# key is a tuple (given remaining_right, given remaining_up),
# value is solutions in terms of list
dp = defaultdict(list)
dp[(0,1)].append('u')
dp[(1,0)].append('r')
for right in range(1, remaining_right+1):
for up in range(1, remaining_up+1):
for s in dp[(right-1,up)]:
dp[(right,up)].append(s+'r')
for s in dp[(right,up-1)]:
dp[(right,up)].append(s+'u')
return dp[(right, up)]
if __name__ == "__main__":
result = []
move_up_right(2,3,[],result)
print result
print '============'
print move_up_right_v2(2,3)
In version 2 you should be starting your for loops at 0 not at 1. By starting at 1 you are missing possible permutations where you traverse the bottom row or leftmost column first.
Change version 2 to:
def move_up_right_v2(remaining_right, remaining_up):
# key is a tuple (given remaining_right, given remaining_up),
# value is solutions in terms of list
dp = defaultdict(list)
dp[(0,1)].append('u')
dp[(1,0)].append('r')
for right in range(0, remaining_right+1):
for up in range(0, remaining_up+1):
for s in dp[(right-1,up)]:
dp[(right,up)].append(s+'r')
for s in dp[(right,up-1)]:
dp[(right,up)].append(s+'u')
return dp[(right, up)]
And then:
result = []
move_up_right(2,3,[],result)
set(move_up_right_v2(2,3)) == set(result)
True
And just for fun... another way to do it:
from itertools import permutations
list(map(''.join, set(permutations('r'*2+'u'*3, 5))))
The problem with the dynamic programming version is that it doesn't take into account the paths that start from more than one move up ('uu...') or more than one move right ('rr...').
Before executing the main loop you need to fill dp[(x,0)] for every x from 1 to remaining_right+1 and dp[(0,y)] for every y from 1 to remaining_up+1.
In other words, replace this:
dp[(0,1)].append('u')
dp[(1,0)].append('r')
with this:
for right in range(1, remaining_right+1):
dp[(right,0)].append('r'*right)
for up in range(1, remaining_up+1):
dp[(0,up)].append('u'*up)

Python: Concatenate similiar objects in List

I have a list containing strings as ['Country-Points'].
For example:
lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8', ...other countries...]
I want to calculate the average for each country without creating a new list. So the output would be (in the case above):
lst = ['Albania-7.5', 'Andorra-4.25', ...other countries...]
Would realy appreciate if anyone can help me with this.
EDIT:
this is what I've got so far. So, "data" is actually a dictionary, where the keys are countries and the values are list of other countries points' to this country (the one as Key). Again, I'm new at Python so I don't realy know all the built-in functions.
for key in self.data:
lst = []
index = 0
score = 0
cnt = 0
s = str(self.data[key][0]).split("-")[0]
for i in range(len(self.data[key])):
if s in self.data[key][i]:
a = str(self.data[key][i]).split("-")
score += int(float(a[1]))
cnt+=1
index+=1
if i+1 != len(self.data[key]) and not s in self.data[key][i+1]:
lst.append(s + "-" + str(float(score/cnt)))
s = str(self.data[key][index]).split("-")[0]
score = 0
self.data[key] = lst
itertools.groupby with a suitable key function can help:
import itertools
def get_country_name(item):
return item.split('-', 1)[0]
def get_country_value(item):
return float(item.split('-', 1)[1])
def country_avg_grouper(lst) :
for ctry, group in itertools.groupby(lst, key=get_country_name):
values = list(get_country_value(c) for c in group)
avg = sum(values)/len(values)
yield '{country}-{avg}'.format(country=ctry, avg=avg)
lst[:] = country_avg_grouper(lst)
The key here is that I wrote a function to do the change out of place and then I can easily make the substitution happen in place by using slice assignment.
I would probabkly do this with an intermediate dictionary.
def country(s):
return s.split('-')[0]
def value(s):
return float(s.split('-')[1])
def country_average(lst):
country_map = {}|
for point in lst:
c = country(pair)
v = value(pair)
old = country_map.get(c, (0, 0))
country_map[c] = (old[0]+v, old[1]+1)
return ['%s-%f' % (country, sum/count)
for (country, (sum, count)) in country_map.items()]
It tries hard to only traverse the original list only once, at the expense of quite a few tuple allocations.

Do dictionaries keep track of the point in time where a item was assigned?

I was coding a High Scores system where the user would enter a name and a score then the program would test if the score was greater than the lowest score in high_scores. If it was, the score would be written and the lowest score, deleted. Everything was working just fine, but i noticed something. The high_scores.txt file was like this:
PL1 50
PL2 50
PL3 50
PL4 50
PL5 50
PL1 was the first score added, PL2 was the second, PL3 the third and so on. Then I tried adding another score, higher than all the others (PL6 60) and what happened was that the program assigned PL1 as the lowest score. PL6 was added and PL1 was deleted. That was exactly the behavior I wanted but I don't understand how it happened. Do dictionaries keep track of the point in time where a item was assigned? Here's the code:
MAX_NUM_SCORES = 5
def getHighScores(scores_file):
"""Read scores from a file into a list."""
try:
cache_file = open(scores_file, 'r')
except (IOError, EOFError):
print("File is empty or does not exist.")
return []
else:
lines = cache_file.readlines()
high_scores = {}
for line in lines:
if len(high_scores) < MAX_NUM_SCORES:
name, score = line.split()
high_scores[name] = int(score)
else:
break
return high_scores
def writeScore(file_, name, new_score):
"""Write score to a file."""
if len(name) > 3:
name = name[0:3]
high_scores = getHighScores(file_)
if high_scores:
lowest_score = min(high_scores, key=high_scores.get)
if new_score > high_scores[lowest_score] or len(high_scores) < 5:
if len(high_scores) == 5:
del high_scores[lowest_score]
high_scores[name.upper()] = int(new_score)
else:
return 0
else:
high_scores[name.upper()] = int(new_score)
write_file = open(file_, 'w')
while high_scores:
highest_key = max(high_scores, key=high_scores.get)
line = highest_key + ' ' + str(high_scores[highest_key]) + '\n'
write_file.write(line)
del high_scores[highest_key]
return 1
def displayScores(file_):
"""Display scores from file."""
high_scores = getHighScores(file_)
print("HIGH SCORES")
if high_scores:
while high_scores:
highest_key = max(high_scores, key=high_scores.get)
print(highest_key, high_scores[highest_key])
del high_scores[highest_key]
else:
print("No scores yet.")
def resetScores(file_):
open(file_, "w").close()
No. The results you got were due to arbitrary choices internal to the dict implementation that you cannot depend on always happening. (There is a subclass of dict that does keep track of insertion order, though: collections.OrderedDict.) I believe that with the current implementation, if you switch the order of the PL1 and PL2 lines, PL1 will probably still be deleted.
As others noted, the order of items in the dictionary is "up to the implementation".
This answer is more a comment to your question, "how min() decides what score is the lowest?", but is much too long and format-y for a comment. :-)
The interesting thing is that both max and min can be used this way. The reason is that they (can) work on "iterables", and dictionaries are iterable:
for i in some_dict:
loops i over all the keys in the dictionary. In your case, the keys are the user names. Further, min and max allow passing a key argument to turn each candidate in the iterable into a value suitable for a binary comparison. Thus, min is pretty much equivalent to the following python code, which includes some tracing to show exactly how this works:
def like_min(iterable, key=None):
it = iter(iterable)
result = it.next()
if key is None:
min_val = result
else:
min_val = key(result)
print '** initially, result is', result, 'with min_val =', min_val
for candidate in it:
if key is None:
cmp_val = candidate
else:
cmp_val = key(candidate)
print '** new candidate:', candidate, 'with val =', cmp_val
if cmp_val < min_val:
print '** taking new candidate'
result = candidate
return result
If we run the above on a sample dictionary d, using d.get as our key:
d = {'p': 0, 'ayyy': 3, 'b': 5, 'elephant': -17}
m = like_min(d, key=d.get)
print 'like_min:', m
** initially, result is ayyy with min_val = 3
** new candidate: p with val = 0
** taking new candidate
** new candidate: b with val = 5
** new candidate: elephant with val = -17
** taking new candidate
like_min: elephant
we find that we get the key whose value is the smallest. Of course, if multiple values are equal, the choice of "smallest" depends on the dictionary iteration order (and also whether min actually uses < or <= internally).
(Also, the method you use to "sort" the high scores to print them out is O(n2): pick highest value, remove it from dictionary, repeat until empty. This traverses n items, then n-1, ... then 2, then 1 => n+(n-1)+...+2+1 steps = n(n+1)/2 = O(n2). Deleting the high one is also an expensive operation, although it should still come in at or under O(n2), I think. With n=5 this is not that bad (5 * 6 / 2 = 15), but ... not elegant. :-) )
This is pretty much what http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/ is about.
Short version: Get the treap module, which works like a sorted dictionary, and keep the keys in order. Or use the nest module to get the n greatest (or least) values automatically.
collections.OrderedDict is good for preserving insertion order, but not key order.

Categories

Resources