Two number sum : why don't anybody do it this way - python

I was looking for the solution to "two number sum problem" and I saw every body using two for loops
and another way I saw was using a hash table
def twoSumHashing(num_arr, pair_sum):
sums = []
hashTable = {}
for i in range(len(num_arr)):
complement = pair_sum - num_arr[i]
if complement in hashTable:
print("Pair with sum", pair_sum,"is: (", num_arr[i],",",complement,")")
hashTable[num_arr[i]] = num_arr[i]
# Driver Code
num_arr = [4, 5, 1, 8]
pair_sum = 9
# Calling function
twoSumHashing(num_arr, pair_sum)
But why don't nobody discuss about this solution
def two_num_sum(array, target):
for num in array:
match = target - num
if match in array:
return [match, num]
return "no result found"
when using a hash table we have to store values into the hash table. But here there is no need for that.
1)Does that affect the time complexity of the solution?
2)looking up a value in hash table is easy compared to array, but if the values are huge in number,
does storing them in a hash table take more space?

First of all, the second function you provide as a solution is not correct and does not return a complete list of answers.
Second, as a Pythonist, it's better to say dictionary instead of the hash table. A python dictionary is one of the implementations of a hash table.
Anyhow, regarding the other questions that you asked:
Using two for-loops is a brute-force approach and usually is not an optimized approach in real. Dictionaries are way faster than lists in python. So for the sake of time-complexity sure, dictionaries are the winner.
From the point of view of space complexity, using dictionaries for sure takes more memory allocation, but with current hardware, it is not an essential issue for billions of numbers. It depends on your situation, whether the speed is crucial to you or the amount of memory.

first function
uses O(n) time complexity as you iterate over n members in the array
uses O(n) space complexity as you can have one pair which is the first and the last, then in the worst case you can store up to n-1 numbers.
second function
uses O(n^2) time complexity as you iterate first on the array then uses in which uses __contains__ on list which is O(n) in worst case.
So the second function is like doing two loops to brute force the solution.
Another thing to point out in second function is that you don't return all the pairs but just the first pair you find.
Then you can try and fix it by iterating from the index of num+1 but you will have duplicates.
This is all comes down to a preference of what's more important - time complexity or space complexity-
this is one of many interview / preparation to interview question where you need to explain why you would use function two (if was working properly) over function one and vice versa.
Answers for your questions
1.when using a hash table we have to store values into the hash table. But here there is no need for that. 1)Does that affect the time complexity of the solution?
Yes now time complexity is O(n^2) which is worse
2)looking up a value in hash table is easy compared to array, but if the values are huge in number, does storing them in a hash table take more space?
In computers numbers are just representation of bits. Larger numbers can take up more space as they need more bits to represent them but storing them will be the same, no matter where you store.

Related

Inefficient code for removing duplicates from a list in Python - interpretation?

I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2,2,4,7,7,8]
unique_values_list = []
for i in some_values_list:
if i not in unique_values_list:
unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn´t it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10,20,30,20,10,50,60,40,80,50,40]
dup_items = set()
uniq_items = []
for x in a:
if x not in dup_items:
uniq_items.append(x)
dup_items.add(x)
print(dup_items)
This code also throws up an error TypeError: set() missing 1 required positional argument: 'items' (This is from a website for Python exercises with answers key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Lookup for an element in a list takes O(N) time (you can find an element in logarithmic time, but the list should be sorted, so not your case). So if you use the same list to keep unique elements and lookup newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). set is a hash-set in Python, so lookup in it should take O(1) on average. Thus, if you use an auxiliary set to keep track of unique elements already found, your whole algorithm will only take O(N) time on average, chances are good, one order better.
In most cases sets are faster than lists. One of this cases is when you look for an item using "in" keyword. The reason why sets are faster is that, they implement hashtable.
So, in short, if x not in dup_items in second code snippet works faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link
.
I think your code is also inefficient in a way that for each item in list you are searching in larger list. The second snippet looks for the item in smaller set. But that is not correct all the time. For example, if the list is all unique items, then it is the same.
Hope it clarifies.

Why are dict lookups always better than list lookups?

I was using a dictionary as a lookup table but I started to wonder if a list would be better for my application -- the amount of entries in my lookup table wasn't that big. I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
I decided to profile the alternatives but the results surprised me. List lookup was only better with a single element! See the following figure (log-log plot):
So here comes the question: Why do list lookups perform so poorly? What am I missing?
On a side question, something else that called my attention was a little "discontinuity" in the dict lookup time after approximately 1000 entries. I plotted the dict lookup time alone to show it.
p.s.1 I know about O(n) vs O(1) amortized time for arrays and hash tables, but it is usually the case that for a small number of elements iterating over an array is better than to use a hash table.
p.s.2 Here is the code I used to compare the dict and list lookup times:
import timeit
lengths = [2 ** i for i in xrange(15)]
list_time = []
dict_time = []
for l in lengths:
list_time.append(timeit.timeit('%i in d' % (l/2), 'd=range(%i)' % l))
dict_time.append(timeit.timeit('%i in d' % (l/2),
'd=dict.fromkeys(range(%i))' % l))
print l, list_time[-1], dict_time[-1]
p.s.3 Using Python 2.7.13
I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
Accessing a few array elements is cheap, sure, but computing == is surprisingly heavyweight in Python. See that spike in your second graph? That's the cost of computing == for two ints right there.
Your list lookups need to compute == a lot more than your dict lookups do.
Meanwhile, computing hashes might be a pretty heavyweight operation for a lot of objects, but for all ints involved here, they just hash to themselves. (-1 would hash to -2, and large integers (technically longs) would hash to smaller integers, but that doesn't apply here.)
Dict lookup isn't really that bad in Python, especially when your keys are just a consecutive range of ints. All ints here hash to themselves, and Python uses a custom open addressing scheme instead of chaining, so all your keys end up nearly as contiguous in memory as if you'd used a list (which is to say, the pointers to the keys end up in a contiguous range of PyDictEntrys). The lookup procedure is fast, and in your test cases, it always hits the right key on the first probe.
Okay, back to the spike in graph 2. The spike in the lookup times at 1024 entries in the second graph is because for all smaller sizes, the integers you were looking for were all <= 256, so they all fell within the range of CPython's small integer cache. The reference implementation of Python keeps canonical integer objects for all integers from -5 to 256, inclusive. For these integers, Python was able to use a quick pointer comparison to avoid going through the (surprisingly heavyweight) process of computing ==. For larger integers, the argument to in was no longer the same object as the matching integer in the dict, and Python had to go through the whole == process.
The short answer is that lists use linear search and dicts use amortized O(1) search.
In addition, dict searches can skip an equality test either when 1) hash values don't match or 2) when there is an identity match. Lists only benefit from the identity-implies equality optimization.
Back in 2008, I gave a talk on this subject where you'll find all the details: https://www.youtube.com/watch?v=hYUsssClE94
Roughly the logic for searching lists is:
for element in s:
if element is target:
# fast check for identity implies equality
return True
if element == target:
# slower check for actual equality
return True
return False
For dicts the logic is roughly:
h = hash(target)
for i in probe_sequence(h, len(table)):
element = key_table[i]
if element is UNUSED:
raise KeyError(target)
if element is target:
# fast path for identity implies equality
return value_table[i]
if h != h_table[i]:
# unequal hashes implies unequal keys
continue
if element == target:
# slower check for actual equality
return value_table[i]
Dictionary hash tables are typically between one-third and two-thirds full, so they tend to have few collisions (few trips around the loop shown above) regardless of size. Also, the hash value check prevents needless slow equality checks (the chance of a wasted equality check is about 1 in 2**64).
If your timing focuses on integers, there are some other effects at play as well. That hash of a int is the int itself, so hashing is very fast. Also, it means that if you're storing consecutive integers, there tend to be no collisions at all.
You say "accessing a few elements in an array is faster than computing a hash".
A simple hashing rule for strings might be just a sum (with a modulo in the end). This is a branchless operation that can compare favorably with character comparisons, especially when there is a long match on the prefix.

Python Set Slice Complexity

Assume that I have two lists named a and b of both size n, and I want to do the following slice setting operation with k < n
a[:k] = b[:k]
In the Python wiki's Time Complexity page it says that the complexity of slice setting is O(n+k) where k is the length of the slice. I just cannot understand why it is not just O(k) in the above situation.
I know that slicing returns a new list, so it is O(k), and I know that the list holds its data in a continuous way, so inserting an item in the middle would take O(n) time. But the above operation can easily be done in O(k) time. Am I missing something?
Furthermore, is there a documentation where I can find detailed information about such issues? Should I look into the CPython implementation?
Thanks.
O(n+k) is the average case, which includes having to grow or shrink the list to adjust for the number of elements inserted to replace the original slice.
Your case, where you replace the slice with an equal number of new elements, the implementation only takes O(k) steps. But given all possible combinations of number of elements inserted and deleted, the average case has to move the n remaining elements in the list up or down.
See the list_ass_slice function for the exact implementation.
You're right, if you want to know the exact details it's best to use the source. The CPython implementation of setting a slice is in listobject.c.
If I read it correctly, it will...
Count how many new elements you're inserting (or deleting!)
Shift the n existing elements of the list over enough places to make room for the new elements, taking O(n) time in the worst case (when every element of the list has to be shifted).
Copy over the new elements into the space that was just created, taking O(k) time.
That adds up to O(n+k).
Of course, your case is probably not that worst case: you're changing the last k elements of the list, so there might be no need for shifting at all, reducing the complexity to O(k) you expected. However, that is not true in general.

Optimize algorithm to compute orbit under a given action in python

My goal is to iterate through a set S of elements given a single element and an action G: S -> S that acts transitively on S (i.e., for any elt,elt' in S, there is a map f in G such that f(elt) = elt'). The action is finitely generated, so I can use that I can apply each generator to a given element.
The algorithm I use is:
def orbit(act,elt):
new_elements = [elt]
seen_elements = set([elt])
yield elt
while new_elements:
elt = new_elements.pop()
seen_elements.add(elt)
for f in act.gens():
elt_new = f(elt)
if elt_new not in seen_elements:
new_elements.append(elt_new)
seen_elements.add(elt_new)
yield elt_new
This algorithm seems to be well-suited and very generic. BUT it has one major and one minor slowdown in big computations that I would like to get rid of:
The major: seen_elements collects all the elements, and is thus too memory consuming, given that I do not need the actual elements anymore.
How can I achieve to not have all the elements stored in memory?
Very likely, this depends on what the elements are. So for me, these are short lists (<10 entries) of ints (each < 10^3). So first, is there a fast way to associate a (with high probability) unique integer to such a list? Does that save much memory? If so, should I put those into a dict to check the containment (in this case, first the hash equality test, and then an int equality test are done, right?), or how should I do that?
the minor: poping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
Thanks a lot for your suggestions!
So first, is there a fast way to associate a (with high probability) unique integer to such a list?
If the list entries all are in range(1, 1024), then sum(x << (i * 10) for i, x in enumerate(elt)) yields a unique integer.
Does that save much memory?
The short answer is yes. The long answer is that it's complicated to determine how much. Python's long integer representation uses (probably) 30-bit digits, so the digits will pack 3 to the 32-bit word instead of 1 (or 0.5 for 64-bit). There's some object overhead (8/16 bytes?), and then there's the question of how many of the list entries require separate objects, which is where the big win may lie.
If you can tolerate errors, then a Bloom filter would be a possibility.
the minor: popping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
I find that claim surprising. Have you measured?

How to tractably solve the assignment optimisation task

I'm working on a script that takes the elements from companies and pairs them up with the elements of people. The goal is to optimize the pairings such that the sum of all pair values is maximized (the value of each individual pairing is precomputed and stored in the dictionary ctrPairs).
They're all paired in a 1:1, each company has only one person and each person belongs to only one company, and the number of companies is equal to the number of people. I used a top-down approach with a memoization table (memDict) to avoid recomputing areas that have already been solved.
I believe that I could vastly improve the speed of what's going on here but I'm not really sure how. Areas I'm worried about are marked with #slow?, any advice would be appreciated (the script works for inputs of lists n<15 but it gets incredibly slow for n > ~15)
def getMaxCTR(companies, people):
if(memDict.has_key((companies,people))):
return memDict[(companies,people)] #here's where we return the memoized version if it exists
if(not len(companies) or not len(people)):
return 0
maxCTR = None
remainingCompanies = companies[1:len(companies)] #slow?
for p in people:
remainingPeople = list(people) #slow?
remainingPeople.remove(p) #slow?
ctr = ctrPairs[(companies[0],p)] + getMaxCTR(remainingCompanies,tuple(remainingPeople)) #recurse
if(ctr > maxCTR):
maxCTR = ctr
memDict[(companies,people)] = maxCTR
return maxCTR
To all those who wonder about the use of learning theory, this question is a good illustration. The right question is not about a "fast way to bounce between lists and tuples in python" — the reason for the slowness is something deeper.
What you're trying to solve here is known as the assignment problem: given two lists of n elements each and n×n values (the value of each pair), how to assign them so that the total "value" is maximized (or equivalently, minimized). There are several algorithms for this, such as the Hungarian algorithm (Python implementation), or you could solve it using more general min-cost flow algorithms, or even cast it as a linear program and use an LP solver. Most of these would have a running time of O(n3).
What your algorithm above does is to try each possible way of pairing them. (The memoisation only helps to avoid recomputing answers for pairs of subsets, but you're still looking at all pairs of subsets.) This approach is at least Ω(n222n). For n=16, n3 is 4096 and n222n is 1099511627776. There are constant factors in each algorithm of course, but see the difference? :-) (The approach in the question is still better than the naive O(n!), which would be much worse.) Use one of the O(n^3) algorithms, and I predict it should run in time for up to n=10000 or so, instead of just up to n=15.
"Premature optimization is the root of all evil", as Knuth said, but so is delayed/overdue optimization: you should first carefully consider an appropriate algorithm before implementing it, not pick a bad one and then wonder what parts of it are slow. :-) Even badly implementing a good algorithm in Python would be orders of magnitude faster than fixing all the "slow?" parts of the code above (e.g., by rewriting in C).
i see two issues here:
efficiency: you're recreating the same remainingPeople sublists for each company. it would be better to create all the remainingPeople and all the remainingCompanies once and then do all the combinations.
memoization: you're using tuples instead of lists to use them as dict keys for memoization; but tuple identity is order-sensitive. IOW: (1,2) != (2,1) you better use sets and frozensets for this: frozenset((1,2)) == frozenset((2,1))
This line:
remainingCompanies = companies[1:len(companies)]
Can be replaced with this line:
remainingCompanies = companies[1:]
For a very slight speed increase. That's the only improvement I see.
If you want to get a copy of a tuple as a list you can do
mylist = list(mytuple)

Categories

Resources