I've created a graph G with NetworkX, where my nodes are movies and actors, and there's an edge if an actor participated in a movie. I have a dictionary for all actors and one for all movies. I want to find the pair of movies that share the largest number of actors.
I thought that the solution could be something like this:
maximum = 0
pair = []
dict_pair_movies = {}
for actor in actors:
    list_movies = list(nx.all_neighbors(G, actor))
    for movie1 in list_movies:
        for movie2 in list_movies:
            if movie1 != movie2:
                dict_pair_movies[(movie1, movie2)] += 1
                if dict_pair_movies[(movie1, movie2)] > maximum:
                    maximum = dict_pair_movies[(movie1, movie2)]
                    pair = [movie1, movie2]
return pair
But this can't really work, because there are 2 million actors.
I tried the code on a smaller case to see whether it would work, but I ran into two problems:
The line dict_pair_movies[(movie1, movie2)] += 1 doesn't work (the key doesn't exist yet), but I could get the result I wanted with dict_pair_movies[(movie1, movie2)] = dict_pair_movies.get((movie1, movie2), 0) + 1.
I don't know how to specify that, if I have two films A and B, the combination "A,B" is the same as "B,A".
Instead, the algorithm creates two different keys.
I even tried something with nx.common_neighbors, which should give me the number of actors shared by two movies, but the problems were always the quadratic time and my inability to tell the algorithm to iterate only over distinct movies.
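For what it's worth, both sub-problems have standard idioms: collections.Counter handles the missing-key increment, and itertools.combinations over a sorted neighbor list yields each unordered pair exactly once under one canonical key. A minimal sketch on a toy adjacency dict (the data and names here are invented, not the real graph):

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for the actor -> movies adjacency (invented data).
actor_movies = {
    "actor1": ["A", "B", "C"],
    "actor2": ["A", "B"],
    "actor3": ["B", "C"],
}

pair_counts = Counter()
for neighbor_movies in actor_movies.values():
    # Sorting first means each unordered pair appears under one canonical
    # key, so ("A", "B") and ("B", "A") are never counted separately.
    for movie1, movie2 in combinations(sorted(neighbor_movies), 2):
        pair_counts[(movie1, movie2)] += 1  # Counter defaults missing keys to 0

best_pair, shared = pair_counts.most_common(1)[0]
```

Counter.most_common then gives a most-shared pair directly, with no separate maximum bookkeeping.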
EDIT: Maybe I've found the solution, but I can't check whether it's the right one. I thought the wise road to follow was nx.common_neighbors, so I could just iterate over two nodes. To make the algorithm fast enough, I tried to use the zip function with the list of movies and the set of movies.
movieList = list(movies.keys())
movieSet = set(movieList)

def question3():
    maximum = 0
    pair = []
    for node1, node2 in zip(movieList, movieSet):
        neighborsList = list(nx.common_neighbors(G, node1, node2))
        if len(neighborsList) > maximum:
            maximum = len(neighborsList)
            pair = [node1, node2]
    return pair
This algorithm gives me a result, but I can't really check whether it's correct. I know that the zip function, given two lists or sets of different lengths, truncates to the shortest one, but in this case movieList and movieSet have the same length, so it should work...
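One quick sanity check on the zip idea: zip over n movies produces only n pairs, each movie matched with an essentially arbitrary partner (set iteration order is unspecified), while there are n·(n-1)/2 distinct unordered pairs to examine. A toy example with invented names:

```python
from itertools import combinations

movie_list = ["A", "B", "C", "D"]
movie_set = set(movie_list)

# zip yields one pair per position: 4 pairs, partners in arbitrary set order.
zipped_pairs = list(zip(movie_list, movie_set))

# combinations yields every unordered pair exactly once: 6 pairs.
all_pairs = list(combinations(movie_list, 2))
```

So most candidate pairs are simply never looked at, which is why the result cannot be trusted.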
Based on my understanding of what each variable you're working with means, I think the following should work. It makes use of Joel's "movie size" heuristic.
Notably, the sorting step is O(n log(n)), so it has no impact on the overall O(n²) complexity.
def question3():
    def comp_func(pair):
        return len(pair[1])
    movie_list = sorted([(k, set(d)) for k, d in G.adjacency() if k in movies],
                        key=comp_func, reverse=True)
    maximum = 0
    pair = [None, None]
    for i, (movie_1, set_1) in enumerate(movie_list):
        if len(set_1) <= maximum:
            break
        for movie_2, set_2 in movie_list[i+1:]:
            if len(set_2) <= maximum:
                break
            m = len(set_1 & set_2)
            if m > maximum:
                maximum = m
                pair[:] = movie_1, movie_2
    return pair
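The size-ordered pruning idea above doesn't depend on NetworkX; a self-contained sketch of the same heuristic over a plain movie -> cast dict (toy data, invented names) may make it easier to verify:

```python
def most_shared_pair(movie_actors):
    """movie_actors: dict mapping movie -> set of actor names."""
    # Largest casts first: once a cast is no bigger than the best overlap
    # found so far, no remaining pair can beat it, so we can stop early.
    movie_list = sorted(movie_actors.items(), key=lambda kv: len(kv[1]),
                        reverse=True)
    maximum, pair = 0, (None, None)
    for i, (movie_1, set_1) in enumerate(movie_list):
        if len(set_1) <= maximum:
            break
        for movie_2, set_2 in movie_list[i + 1:]:
            if len(set_2) <= maximum:
                break
            overlap = len(set_1 & set_2)
            if overlap > maximum:
                maximum, pair = overlap, (movie_1, movie_2)
    return pair

casts = {"A": {"a1", "a2", "a3"}, "B": {"a1", "a2"}, "C": {"a3"}}
```

Here most_shared_pair(casts) finds ("A", "B"), which share two actors, without ever intersecting the small cast of "C".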
I am trying to solve this problem on HackerRank: https://www.hackerrank.com/challenges/climbing-the-leaderboard. The problem statement basically states that there are two sets of scores, one for the players and one for Alice, and we have to use dense ranking and display Alice's rank compared to the other players' scores. It gives me a time-out error on large test cases. I have already used the forum suggestions on HackerRank and was successful, but I am curious to know what specifically is wrong in my code. Here is my code:
class Dict(dict):
    def __init__(self):
        self = dict()
    def add(self, key, value):
        self[key] = value

def climbingLeaderboard(scores, alice):
    alice_rank = []
    for i in range(len(alice)):
        scores.append(alice[i])
        a = list(set(scores))
        a.sort(reverse=True)
        obj = Dict()
        b = 1
        for j in a:
            obj.add(j, b)
            b += 1
        if alice[i] in obj:
            alice_rank.append(obj[alice[i]])
        scores.remove(alice[i])
    return alice_rank
You have a couple of problems in your code but the most important one is the following.
...
scores.append(alice[i])
a=list(set(scores))
a.sort(reverse=True)
...
On each iteration you add Alice's score to scores and then sort scores. The cost here is already O(n·log(n)), where n is the number of elements in scores. Thus, your total time complexity becomes O(n²·log(n)). That's too much, because n can reach 200000, so your solution can require up to 200000 · 200000 · log(200000) operations.
Of course, there's another problem:
...
for j in a:
    obj.add(j, b)
    b += 1
...
But it's still not as bad as the previous one, since that loop's time complexity is only O(n).
There exists an O(n·log(n)) solution. I'll give you the overall idea so that you can easily implement it yourself.
If you recall that players with duplicate scores share the same position in the leaderboard, you can convert scores to an array without duplicates, sorted in decreasing order, once before your loop (note that list(set(scores)) alone does not preserve the order, so re-sort after deduplicating). The first position then corresponds to the highest score, the second to the second-highest score, and so on.
Given the step above, for each of Alice's scores you can find the position in the array at which the player's score is less than or equal to that score. The lookup takes O(log(n)) because the array is sorted. For instance, if the players' scores are 40, 30, 10 and Alice's score is 35, then the found position is 2 (here I consider that the first index starts from 1), since 30 occupies that position. This position is the ACTUAL position of Alice in the leaderboard, and so can be printed right away.
Another tip: you can use the bisect module to perform the binary search on the array.
So, the overall time complexity of the proposed solution is O(n·log(n)). It will pass all the test cases (I've tried it).
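A minimal sketch of that O(n·log(n)) idea with bisect might look like this (the function name and sample data below are my own, not from the original post):

```python
import bisect

def climbing_leaderboard(ranked, player):
    # Distinct scores in ascending order so bisect can search them.
    unique = sorted(set(ranked))
    n = len(unique)
    result = []
    for score in player:
        # Index of the first distinct score strictly greater than `score`;
        # everything from there on outranks Alice.
        idx = bisect.bisect_right(unique, score)
        # Dense rank = number of strictly greater distinct scores, plus one.
        result.append(n - idx + 1)
    return result
```

For ranked scores [100, 100, 50, 40, 40, 20, 10] and Alice's scores [5, 25, 50, 120], this yields [6, 4, 2, 1].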
Performing a repeated sort (a.sort(reverse=True)) consumes a lot of time. I had the same problem. If you read the question, you will find that the scores are given already sorted (in descending order). The trick is to exploit this inherent ordering of the input.
One more thing: your code's time complexity is O(n^2) due to the nested loop, whereas the forum solutions you spoke of may be doing it in O(n) (not sure).
I'm studying optimization problems and I got stuck on a homework problem. I have to write a brute-force algorithm that minimizes the number of spaceship trips. The problem is: aliens created a new type of cow, and now they want to transport the cows back with the minimum number of trips. Each trip can carry a maximum of 10 tons.
The exercise provided some things, like this algorithm to get all possible partitions of a list:
# From codereview.stackexchange.com
def partitions(set_):
    if not set_:
        yield []
        return
    for i in range(2**len(set_)//2):
        parts = [set(), set()]
        for item in set_:
            parts[i&1].add(item)
            i >>= 1
        for b in partitions(parts[1]):
            yield [parts[0]]+b

def get_partitions(set_):
    for partition in partitions(set_):
        yield [list(elt) for elt in partition]
The input is a dict of cows, like this one: cows = {'Jesse': 6,'Maybel': 3, 'Callie': 2, 'Maggie': 5}, with the key being the name of the cow and the value being the cow's weight in tons.
The output must be a list of lists, where each inner list represents a trip, like this one:
[['Jesse', 'Callie'], ['Maybel', 'Maggie']]
My question is: how can I implement this algorithm using get_partitions()? Is DFS a good way to solve this?
I tried many approaches already; the two I found on Stack Overflow that seemed closest to the answer were:
Getting all possible combinations using the get_partitions() function and selecting all that obey the limit = 10, as I saw here: Why is this brute force algorithm producing the incorrect result? — but it didn't work, because it returned an empty list.
Then I tried depth-first search with a few changes, as I saw here: How to find best solution from all brute force combinations? — but the lists still don't give the correct answer.
This was the closest I got to the correct answer. First I used get_partitions to generate all possible partitions, then I filtered the partitions into a list named possible, keeping only trips with weight <= 10 that had all the cows inside (to exclude those partitions with only one or two cow names).
def brute_force_cow_transport(cows, limit=10):
    """
    Finds the allocation of cows that minimizes the number of spaceship trips
    via brute force. The brute force algorithm should follow the following method:

    1. Enumerate all possible ways that the cows can be divided into separate trips
       Use the given get_partitions function in ps1_partition.py to help you!
    2. Select the allocation that minimizes the number of trips without making any trip
       that does not obey the weight limitation

    Does not mutate the given dictionary of cows.

    Parameters:
    cows - a dictionary of name (string), weight (int) pairs
    limit - weight limit of the spaceship (an int)

    Returns:
    A list of lists, with each inner list containing the names of cows
    transported on a particular trip and the overall list containing all the
    trips
    """
    possible_combinations = []
    for partition in get_partitions(cows.keys()):
        possible_combinations.append(partition)
    possible_combinations.sort(key=len)

    def _is_valid_trip(cows, trip):
        valid = False
        for cow_name in cows:
            if cow_name in trip:
                valid = True
            else:
                valid = False
        return valid

    possibles = []
    for partition in possible_combinations:
        trips = []
        for trip in partition:
            total = sum([cows.get(cow) for cow in trip])
            if total <= limit and _is_valid_trip(cows.keys(), trip):
                trips.append(trip)
        possibles.append(trips)

    all_possibilities = [possibility for possibility in possibles if possibility != []]
    return min(all_possibilities)
My test case for this still gives:
AssertionError: Lists differ: [['Callie', 'Maggie']] != [['Jesse', 'Callie'], ['Maybel', 'Maggie']]
First differing element 0:
['Callie', 'Maggie']
['Jesse', 'Callie']
Second list contains 1 additional elements.
First extra element 1:
['Maybel', 'Maggie']
- [['Callie', 'Maggie']]
+ [['Jesse', 'Callie'], ['Maybel', 'Maggie']]
----------------------------------------------------------------------
Ran 5 tests in 0.009s
FAILED (failures=1)
This was the closest I got to the correct answer. First I used get_partitions to generate all possible partitions, then I filtered the partitions into a list named possible, keeping only trips with weight <= 10 that had all the cows inside (to exclude those partitions with only one or two cow names).
This is the right idea, except for the last statement: by definition, a partition of a set includes every element of the set exactly once. The issue is that you are building the list from trips, not partitions. There is no need for this, since you are already generating the full set of partitions in possible_combinations; all you need to do is remove the partitions that contain a trip exceeding the weight limit, which leaves you with something like this:
def brute_force_cow_transport(cows, limit):
    ## Generate set of partitions
    possible_combinations = []
    for partition in get_partitions(cows.keys()):
        possible_combinations.append(partition)
    possible_combinations.sort(key=len)
    valid_combinations = possible_combinations[:]  ## or list.copy() if using python 3.3+
    ## Remove invalid partitions
    for partition in possible_combinations:
        for trip in partition:
            total = sum([cows.get(cow) for cow in trip])
            if total > limit:
                valid_combinations.remove(partition)
                break
    ## Return valid partition of minimum length
    return min(valid_combinations, key=len)
Here, since we are iterating over the partitions, we first make a copy of the partitions list so that we can remove partitions containing trips over the limit, and then return the list of minimum length as the solution. There are some simple ways to improve the performance of this, but they are left as an exercise for the reader.
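As a quick end-to-end check, the same filtering idea can be run against the question's sample data. The snippet below reproduces the partition generator from the question so it is self-contained; the condensed brute_force_cow_transport is equivalent to the version above, not a verbatim copy:

```python
def partitions(set_):
    if not set_:
        yield []
        return
    for i in range(2 ** len(set_) // 2):
        parts = [set(), set()]
        for item in set_:
            parts[i & 1].add(item)
            i >>= 1
        for b in partitions(parts[1]):
            yield [parts[0]] + b

def get_partitions(set_):
    for partition in partitions(set_):
        yield [list(elt) for elt in partition]

def brute_force_cow_transport(cows, limit=10):
    # Keep only partitions in which every trip obeys the weight limit,
    # then pick one with the fewest trips.
    valid = [p for p in get_partitions(set(cows))
             if all(sum(cows[c] for c in trip) <= limit for trip in p)]
    return min(valid, key=len)

cows = {'Jesse': 6, 'Maybel': 3, 'Callie': 2, 'Maggie': 5}
trips = brute_force_cow_transport(cows, 10)
```

With this data the minimum is two trips (16 tons in total cannot fit into one 10-ton trip); which particular two-trip split is returned depends on iteration order.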
I'm trying to write a brute-force algorithm that minimises the number of journeys of a herd of cows, subject to the conditions in the docstring.
def brute_force_cow_transport(cows, limit=10):
    """
    Finds the allocation of cows that minimizes the number of spaceship trips
    via brute force. The brute force algorithm should follow the following method:

    1. Enumerate all possible ways that the cows can be divided into separate trips
    2. Select the allocation that minimizes the number of trips without making any trip
       that does not obey the weight limitation

    Does not mutate the given dictionary of cows.

    Parameters:
    cows - a dictionary of name (string), weight (int) pairs
    limit - weight limit of the spaceship (an int)

    Returns:
    A list of lists, with each inner list containing the names of cows
    transported on a particular trip and the overall list containing all the
    trips
    """
    def weight(sub):
        sum = 0
        for e in sub:
            sum += cows[e]
        return sum

    valid_trips = []
    for part in list(get_partitions(cows)):
        if all(weight(sub) <= limit for sub in part):
            valid_trips.append(part)
    return min(valid_trips)
(The function get_partitions and the dictionary cows have been given in the question)
Where have I gone wrong? I've checked the weight function (which evaluates the weight of a given spaceship trip), so the problem must be in the last five lines. I've checked the code over and over, and it returns a sub-optimal answer:
[['Florence', 'Lola'],
['Maggie', 'Milkshake', 'Moo Moo'],
['Herman'],
['Oreo'],
['Millie'],
['Henrietta'],
['Betsy']]
The syntax is fine and no errors are produced, yet I get a sub-optimal (but valid) answer. Why is this?
The question here is:
How do I find the shortest sublist in a nested list?
To do this, change the last line to:
min(valid_trips, key=len)
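The reason this matters: min on a list of lists compares the lists lexicographically, element by element, rather than by length. A small demonstration with invented partitions:

```python
valid_trips = [
    [['Betsy'], ['Herman', 'Oreo']],    # 2 trips
    [['Betsy'], ['Herman'], ['Oreo']],  # 3 trips
]

# Lexicographic comparison: ['Betsy'] ties, then ['Herman'] sorts before
# ['Herman', 'Oreo'] (a prefix compares as "smaller"), so min() picks the
# 3-trip partition even though it uses more trips.
worst = min(valid_trips)

# key=len compares by number of trips instead, which is what we want.
best = min(valid_trips, key=len)
```

This is exactly why the original code can return a valid but sub-optimal allocation.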
I have a long list containing several thousand names that are all unique strings, but I would like to filter them to produce a shorter list so that if there are similar names only one is retained. For example, the original list could contain:
Mickey Mouse
Mickey M Mouse
Mickey M. Mouse
The new list would contain just one of them; it doesn't really matter which at this moment in time. It's possible to get a similarity score using the code below (where a and b are the strings being compared), so provided I pick an appropriate ratio, I have a way of making an include/exclude decision.
difflib.SequenceMatcher(None, a, b).ratio()
What I'm struggling to work out is how to populate the second list from the first one. I'm sure it's a trivial matter, but it's baffling my newbie brain.
I'd have thought something along the lines of this would have worked, but nothing ends up being populated in the second list:
for p in ppl1:
    for pp in ppl2:
        if difflib.SequenceMatcher(None, p, pp).ratio() <= 0.9:
            ppl2.append(p)
In fact, even if that did populate the list, it'd still be wrong. I guess it'd need to compare the name from the first list to all the names in the second list, keep track of the highest ratio scored, and then only add it if the highest ratio was less than the cutoff criterion.
Any guidance gratefully received!
I'm going to risk never getting an accept because this may be too advanced for you, but here's the optimal solution.
What you're trying to do is a variant of agglomerative clustering. A union-find algorithm can be used to solve this efficiently. From all pairs of distinct strings a and b, which can be generated using
def pairs(l):
    for i, a in enumerate(l):
        for j in range(i + 1, len(l)):
            yield (a, l[j])
you keep the pairs that have a similarity ratio at or above your threshold (note the direction: similar pairs are the ones you want to union):

similar = ((a, b) for a, b in pairs(ppl1)
           if difflib.SequenceMatcher(None, a, b).ratio() >= .9)
then union those in a disjoint-set forest. After that, you loop over the sets to get their representatives.
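A self-contained sketch of that pipeline, with a minimal disjoint-set forest (the helper names and the 0.9 threshold are illustrative, not from the original post):

```python
import difflib
from itertools import combinations

def dedupe_similar(names, threshold=0.9):
    # Disjoint-set forest: every name starts as its own root.
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Union every pair whose similarity meets the threshold.
    for a, b in combinations(names, 2):
        if difflib.SequenceMatcher(None, a, b).ratio() >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # One representative per cluster.
    return sorted({find(name) for name in names})

names = ["Mickey Mouse", "Mickey M Mouse", "Mickey M. Mouse", "Donald Duck"]
kept = dedupe_similar(names)
```

Note the transitivity: "Mickey Mouse" and "Mickey M. Mouse" score only about 0.89 against each other, but both score above 0.9 against "Mickey M Mouse", so union-find still places all three in one cluster; a pairwise filter alone would not.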
Firstly, you shouldn't modify a list while you're iterating over it.
One strategy would be to go through all pairs of names and, if a pair is too similar, keep only one of the two, then repeat until no two names are too similar. Of course, the result will depend on the initial order of the list, but if your data is sufficiently clustered and your similarity metric is sufficiently nice, it should produce what you're looking for.
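That order-dependent strategy can be sketched as a single pass that keeps a name only when it is not too similar to anything already kept (the 0.85 threshold here is an arbitrary choice for the toy data, not a recommendation):

```python
import difflib

def filter_similar(names, threshold=0.85):
    kept = []
    for name in names:
        # Keep the name only if every already-kept name is sufficiently
        # different; earlier names win, so input order matters.
        if all(difflib.SequenceMatcher(None, name, k).ratio() < threshold
               for k in kept):
            kept.append(name)
    return kept

names = ["Mickey Mouse", "Mickey M Mouse", "Mickey M. Mouse", "Donald Duck"]
```

With this data, filter_similar(names) keeps the first Mickey variant and drops the other two, while "Donald Duck" survives.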
I'm not sure whether this is the right place, but I have a question about an algorithm and I can't think of an efficient one, so I thought of sharing my problem statement. :)
To make what I'm trying to explain easier, let me create a hypothetical example.
Suppose I have a list of objects, each of which contains two things: let's say a product id and a price.
This is a long, long list, sort of like an inventory.
Over this I have defined three price segments: lowprice, midprice and highprice, and then k1, k2, k3, where k1, k2 and k3 are ratios.
The job is now: I have to gather products from this huge inventory in such a way that there are n1 products from the lowprice range, n2 products from the midprice range and n3 products from the highprice range, where n1:n2:n3 == k1:k2:k3.
Now, how do I achieve this efficiently?
Say I target a low price point of 100 dollars and I have to gather 20 products from this range; the mid price point is probably 500 dollars, and so on.
So I start with 100 dollars, then look for items between 90 and 100 and also between 100 and 110.
Let's say I find 5 products in the first low interval (90, 100) and 2 products in the first high interval (100, 110).
Then I go to the next low interval and the next high interval, and I keep doing this until I have the required number of products.
How do I do this? Also, there might be a case where the number of products in a particular price range is less than what I need (maybe the mid price point is 105 dollars...), so what should I do in that case?
Please pardon me if this is not the right platform; as you can tell from the question, this is more of a discussion question than the "I am getting this error" type of question.
Thanks
You are probably looking for a selection algorithm.
First find the n1'th smallest element; let it be e1. The lower-bound list is then all elements such that element <= e1.
Do the same for the other ranges.
pseudo code for lower bound list:
getLowerRange(list, n):
    e <- select(list, n)
    result <- []
    for each element in list:
        if element <= e:
            result.append(element)
    return result
Note that this solution fails if there are many "identical" items [result will be a bigger list], but finding those items and removing them from the result list is not hard.
Note that a selection algorithm is O(n), so this algorithm consumes time linear in the size of your list.
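Python's standard library has no linear-time select, but heapq.nsmallest is a close, simple stand-in for the pseudocode above (O(n·log k) rather than O(n); the function name and data are mine):

```python
import heapq

def get_lower_range(items, n, key=lambda x: x):
    # The n'th smallest value acts as the cutoff for the low segment.
    cutoff = key(heapq.nsmallest(n, items, key=key)[-1])
    # As in the pseudocode, ties at the cutoff can make the result longer
    # than n; duplicates would need extra trimming.
    return [item for item in items if key(item) <= cutoff]

prices = [120, 80, 95, 500, 450, 700, 60]
low_segment = get_lower_range(prices, 3)
```

The key argument lets the same helper work on (product_id, price) tuples by passing key=lambda p: p[1].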
Approach 1
If the assignment of products to the three price segments never changes, why not simply build 3 lists, one for the products in each price segment (assuming these sets are disjoint)?
Then you may pick from these lists randomly (either with or without replacement, as you like). The number of items to draw from each class is given by the ratios.
Approach 2
If the product-price-segment assignment is intended to be pre-specified, e.g. by passing corresponding price values for each segment on the function call, you may want to have the products sorted by price and use a binary search to select the m nearest neighbors (for example). The parameter m could be specified according to the ratios. If you specify a maximum distance, you may reject products that are outside the desired price range.
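A sketch of the binary-search step in Approach 2: find the insertion point of the target price with bisect, then widen a two-pointer window to collect the m price-nearest items (function and data are illustrative, not from the original post):

```python
import bisect

def m_nearest_prices(prices_sorted, target, m):
    """prices_sorted must be in ascending order."""
    i = bisect.bisect_left(prices_sorted, target)
    lo, hi = i - 1, i
    picked = []
    # Repeatedly take whichever neighbor is closer in price to the target.
    for _ in range(min(m, len(prices_sorted))):
        if lo < 0:
            picked.append(prices_sorted[hi])
            hi += 1
        elif hi >= len(prices_sorted):
            picked.append(prices_sorted[lo])
            lo -= 1
        elif target - prices_sorted[lo] <= prices_sorted[hi] - target:
            picked.append(prices_sorted[lo])
            lo -= 1
        else:
            picked.append(prices_sorted[hi])
            hi += 1
    return picked

prices = [60, 80, 95, 120, 450, 500, 700]
nearest = m_nearest_prices(prices, 100, 3)
```

A maximum-distance cutoff, as suggested above, would simply stop the loop once both neighbors fall outside the allowed band around the target.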
Approach 3
If the product-price-segment assignment needs to be determined autonomously, you could apply your clustering algorithm of choice, e.g., k-means, to assign your products to, say, k = 3 price segments. For the actual product selection you may proceed similarly as described above.
It seems like you should try a database solution rather than using a list. Check out sqlite; it's included in Python by default (the sqlite3 module).