I have this loop:
for s in sales:
    salezip = sales[s][1]
    salecount = sales[s][0]
    for d in deals:
        dealzip = deals[d][1]
        dealname = deals[d][0]
        for zips in ziplist:
            if salezip == zips[0] and dealzip == zips[1]:
                distance = zips[2]
                print "MATCH FOUND"
                if not salesdict.has_key(dealname):
                    salesdict[dealname] = [dealname,dealzip,salezip,salecount,distance]
                else:
                    salesdict[dealname][3] += salecount
And it is taking FOREVER to run. The sales dictionary has 13k entries, the deals dictionary has 1,000 entries, and the ziplist has 1.8M entries. It is obviously very slow when it hits the ziplist part; I have it set to print "MATCH FOUND" when it successfully finds a match, and it hasn't printed in over 20 minutes. What can I do to make this move quicker?
Purpose of the code:
Loop through sales data which contains the amount of apples sold and the location of the purchase, pull the location and quantity info. Then, loop through apple dealers, find their location and their name. Then, loop through the ziplist data which shows distance between zip codes, sorted in ascending order of distance. The second it finds a match of the sales zip and dealer zip, it adds them to a dictionary with all of their information.
Having ziplist as an actual list of (zip1, zip2, distance) is insane - you want a data structure where you can directly find the desired item, without having to loop through the entire data set.
A dictionary with (zip1, zip2) as the key, and the distance as the value, would be enormously faster. Note that you'd need to insert each distance under the key (zip2, zip1) as well, to handle lookups in the opposite direction. Alternatively, you could sort [zip1, zip2] into numeric order before using it as a key (both on insert and lookup), so that it doesn't matter which order they are specified in.
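For illustration, a minimal sketch of that lookup structure, assuming each ziplist row looks like (zip1, zip2, distance) as described in the question (the dict name is illustrative):

distance_by_pair = {}
for zip1, zip2, dist in ziplist:
    distance_by_pair[(zip1, zip2)] = dist
    distance_by_pair[(zip2, zip1)] = dist   # also cover lookups in the opposite direction

# Later, a direct lookup replaces the scan over 1.8M rows:
# distance = distance_by_pair.get((salezip, dealzip))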
The best thing you can do is to reorganize your code so that you don't have to loop so many times, and you don't have to do as many look-ups. It looks to me like you're looping over ziplist roughly 13 million times as often as you really need to (13k sales × 1,000 deals). Here are a couple ideas that might help:
First, create a way to quickly look up sale and deal information by zip rather than by name:
sale_by_zip = {sales[key][1]: sales[key] for key in sales}
deal_by_zip = {deals[key][1]: deals[key] for key in deals}
Then, make the iteration through the ziplist the only outer loop:
for zips in ziplist:
    salezip = zips[0]
    dealzip = zips[1]
    if salezip in sale_by_zip and dealzip in deal_by_zip:
        distance = zips[2]
        print "MATCH FOUND"
        dealname = deal_by_zip[dealzip][0]
        salecount = sale_by_zip[salezip][0]
        if not salesdict.has_key(dealname):
            salesdict[dealname] = [dealname,dealzip,salezip,salecount,distance]
        else:
            salesdict[dealname][3] += salecount
This should drastically reduce the amount of processing you need to do.
As others have noted, the structure of ziplist is also not the most well-suited to this problem. My suggestions assume ziplist is something you receive from an external source and cannot change the format of without making an extra pass over it. If you are building the ziplist yourself, however, consider something that would give you faster lookups.
The root of your problem is that you're processing the zip list multiple times - for every deal and then again for every sale.
One possibility is to reverse the order of your coding: start with the zips list, then the sales list, and finally the deals dictionary. If you're going to iterate through something multiple times, at least iterating through the smaller dictionary would be a lot faster.
If there aren't a lot of matches, using "in" for membership tests might also be quicker, such as if dealzip in zips:, and then processing from there.
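A rough sketch of this reordering, assuming the same sales/deals layout as in the question (the set names are illustrative):

sale_zips = {sales[s][1] for s in sales}    # sets give fast "in" membership tests
deal_zips = {deals[d][1] for d in deals}

for salezip, dealzip, distance in ziplist:  # single pass over the big list
    if salezip in sale_zips and dealzip in deal_zips:
        pass  # record the match here, as in the original code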
Sorry, this is likely a complete noob question; I'm new to Python and have been unable to implement any online suggestions such that they actually work. I need to decrease the run-time of the code for larger files, so I need to reduce the number of iterations I'm doing.
How do I modify the append_value function below to append only UNIQUE values to dict_obj, and remove the need for another series of iterations to do this later on.
EDIT: Sorry, here is an example input/output
Sample Input:
6
5 6
0 1
1 4
5 4
1 2
4 0
Sample Output:
1
4
I'm attempting to solve:
http://orac.amt.edu.au/cgi-bin/train/problem.pl?problemid=416
Output Result
input_file = open("listin.txt", "r")
output_file = open("listout.txt", "w")

ls = []
n = int(input_file.readline())
for i in range(n):
    a, b = input_file.readline().split()
    ls.append(int(a))
    ls.append(int(b))

def append_value(dict_obj, key, value):  # How to append only UNIQUE values to dict_obj?
    if key in dict_obj:
        if not isinstance(dict_obj[key], list):
            dict_obj[key] = [dict_obj[key]]
        dict_obj[key].append(value)
    else:
        dict_obj[key] = value

mx = []
ls.sort()
Dict = {}
for i in range(len(ls)):
    c = ls.count(ls[i])
    append_value(Dict, int(c), ls[i])
    mx.append(c)
x = max(mx)

lss = []
list_set = set(Dict[x])  # To remove the need for this
unique_list = (list(list_set))
for x in unique_list:
    lss.append(x)
lsss = sorted(lss)
for i in lsss:
    output_file.write(str(i) + "\n")

output_file.close()
input_file.close()
Thank you
The answer to your question, 'how to only append unique values to this container' is fairly simple: change it from a list to a set (as #ShadowRanger suggested in the comments). This isn't really a question about dictionaries, though; you're not appending values to 'dict_obj', only to a list stored in the dictionary.
Since the source you linked to shows this is a training problem for people newer to coding, you should know that changing the lists to sets might be a good idea, but it's not the cause of the performance issues.
The problem boils down to: given a file containing a list of integers, print the most common integer(s). Your current code iterates over the list, and for each index i, iterates over the entire list to count matches with ls[i] (this is the line c = ls.count(ls[i])).
Some operations are more expensive than others: calling count() is one of the more expensive operations on a Python list. It reads through the entire list every time it's called. This is an O(n) function, which is inside a length n loop, taking O(n^2) time. All of the set() filtering for non-unique elements takes O(n) time total (and is even quite fast in practice). Identifying linear-time functions hidden in loops like this is a frequent theme in optimization, but profiling your code would have identified this.
In general, you'll want to use something like the Counter class in Python's standard library for frequency counting. That kind of defeats the whole point of this training problem, though, which is to encourage you to improve on the brute-force algorithm for finding the most frequent element(s) in a list. One possible way to solve this problem is to read the description of Counter, and try to mimic its behavior yourself with a plain Python dictionary.
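As a nudge in that direction (a minimal sketch, not the full solution to the training problem), counting occurrences with a plain dictionary looks something like this:

counts = {}
for value in ls:
    counts[value] = counts.get(value, 0) + 1   # one pass, O(n) total
# counts now maps each integer to how many times it appears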
Answering the question you haven't asked: Your whole approach is overkill.
You don't need to worry about uniqueness; the question prompt guarantees that if you see 2 5, you'll never see 5 2, nor a repeat of 2 5
You don't even care who is friends with who, you just care how many friends an individual has
So don't even bother making the pairs. Just count how many times each player ID appears at all. If you see 2 5, that means 2 has one more friend and 5 has one more friend; it doesn't matter who they are friends with.
The entire problem can simplify down to a simple exercise in separating the player IDs and counting them all up (because each appearance means one more unique friend), then keeping only the ones with the highest counts.
A fairly idiomatic solution (reading from stdin and writing to stdout; tweaking it to open files is left as an exercise) would be something like:
import sys
from collections import Counter
from itertools import chain, islice

def main():
    numlines = int(next(sys.stdin))
    friend_pairs = map(str.split, islice(sys.stdin, numlines))  # Convert lines to friendship pairs
    counts = Counter(chain.from_iterable(friend_pairs))  # Flatten to friend mentions and count mentions to get friend count
    max_count = max(counts.values())  # Identify maximum friend count
    winners = [pid for pid, cnt in counts.items() if cnt == max_count]
    winners.sort(key=int)  # Sort winners numerically
    print(*winners, sep="\n")

if __name__ == '__main__':
    main()
Technically, it doesn't even require the use of islice nor storing to numlines (the line count at the beginning might be useful to low level languages to preallocate an array for results, but for Python, you can just read line by line until you run out), so the first two lines of main could simplify to:
next(sys.stdin)
friend_pairs = map(str.split, sys.stdin)
But either way, you don't need to uniquify friendships, nor preserve any knowledge of who is friends with whom to figure out who has the most friends, so save yourself some trouble and skip the unnecessary work.
If your intention is to have a list as each value in the dictionary, why not iterate over that list the same way you iterated over each key? Something like:
if key in dict_obj:
    # dict_obj[key] is assumed to already be a list
    for elem in dict_obj[key]:
        if elem == value:
            break                       # value already present, do nothing
    else:
        dict_obj[key].append(value)     # only append if no element matched
else:
    dict_obj[key] = [value]
I was attempting some python exercises and I hit the 5s timeout on one of the tests. The function is pre-populated with the parameters and I am tasked with writing code that is fast enough to run within the max timeframe of 5s.
There are N dishes in a row on a kaiten belt, with the ith dish being of type Di. Some dishes may be of the same type as one another. The N dishes will arrive in front of you, one after another in order, and for each one you'll eat it as long as it isn't the same type as any of the previous K dishes you've eaten. You eat very fast, so you can consume a dish before the next one gets to you. Any dishes you choose not to eat as they pass will be eaten by others.
Determine how many dishes you'll end up eating.
Issue
The code "works" but is not fast enough.
Code
The idea here is to add the D[i] entry if it is not in the pastDishes list (which can be of size K).
from typing import List
# Write any import statements here

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    numDishes = 0
    pastDishes = []
    i = 0
    while i < N:
        if D[i] not in pastDishes:
            numDishes += 1
            pastDishes.append(D[i])
            if len(pastDishes) > K:
                pastDishes.pop(0)
        i += 1
    return numDishes
Is there a more effective way?
After much trial and error, I have finally found a solution that is fast enough to pass the final case in the puzzle you are working on. My previous code was very neat and quick, but I have now found a module with a tool that makes this much faster. It's from collections, just as deque is, but it is called Counter.
This was my original code:
def getMaximumEatenDishCount(N: int, D: list, K: int) -> int:
    numDishes = lastMod = 0
    pastDishes = [0]*K
    for Dval in D:
        if Dval in pastDishes: continue
        pastDishes[lastMod] = Dval
        numDishes, lastMod = numDishes+1, (lastMod+1)%K
    return numDishes
I then implemented Counter like so:
from typing import List
# Write any import statements here
from collections import Counter

def getMaximumEatenDishCount(N: int, D: 'list[int]', K: int) -> int:
    eatCount = lastMod = 0
    pastDishes = [0]*K
    eatenCounts = Counter({0: K})
    for Dval in D:
        if Dval in eatenCounts: continue
        eatCount += 1
        eatenCounts[Dval] += 1
        val = pastDishes[lastMod]
        if eatenCounts[val] <= 1: eatenCounts.pop(val)
        else: eatenCounts[val] -= 1
        pastDishes[lastMod] = Dval
        lastMod = (lastMod+1)%K
    return eatCount
Which ended up working quite well. I'm sure you can make it less clunky, but this should work fine on its own.
Some explanation of what I am doing:
Typically while loops are actually marginally faster than a for loop, but since I would need to access the value at an index multiple times if I used one, I believe a for loop is actually better in this situation. You can see I also initialised the list to the maximum size it needs to be and am overwriting the values instead of popping and appending, which saves a lot of time. Additionally, as pointed out by #outis, another small improvement was made in my code by using the modulo operator in conjunction with the lastMod variable, which removes the need for an additional if statement. The Counter is essentially a special dict object that holds a hashable as the key and an int as the value. I use the fact that lastMod is an index to what would normally be accessed through list.pop(0) to access the object that needs to be either removed from or decremented in the counter.
Note that it is not considered 'pythonic' to assign multiple variables on one line, but I believe it adds a slight performance boost, which is why I have done it. This can be argued, though; see this post.
If anyone else is interested the problem that we were trying to solve, it can be found here: https://www.facebookrecruiting.com/portal/coding_puzzles/?puzzle=958513514962507
Can we use an appropriate data structure? If so:
Data structures
This seems like a job for an ordered set, which you have to shrink to a capacity restriction of K.
To meet that, if it exceeds the limit (len(ordered_set) > K), we have to remove the first n items, where n = len(ordered_set) - K. Ideally the removal performs in O(1).
However, since removal from a set happens in unordered fashion, we first transform it to a list: a list containing the unique elements in their order of appearance in the original sequence.
From that ordered list we can then remove the first n elements.
For example: the function lru returns the least-recently-used items for a sequence seq limited by capacity-limit k.
To obtain the length we can simply call len() on that LRU return value:
maximumEatenDishCount = len(lru(seq, k))
See also:
Does Python have an ordered set?
Fastest way to get sorted unique list in python?
Using set for uniqueness (up to Python 3.6)
def lru(seq, k):
    return list(set(seq))[:k]
Using dict for uniqueness (since Python 3.6)
Same mechanics as above, but using the preserved insertion order of dicts since 3.7:
using OrderedDict explicitly
from collections import OrderedDict
def lru(seq, k):
    return list(OrderedDict.fromkeys(seq).keys())[:k]
using dict factory-method:
def lru(seq, k):
    return list(dict.fromkeys(seq).keys())[:k]
using dict-comprehension:
def lru(seq, k):
    return list({i: 0 for i in seq}.keys())[:k]
See also:
The order of keys in dictionaries
Using ordered dictionary as ordered set
How do you remove duplicates from a list whilst preserving order?
Real Python: OrderedDict vs dict in Python: The Right Tool for the Job
As the problem is an exercise, exact solutions are not included. Instead, strategies are described.
There are at least a couple potential approaches:
Use a data structure that supports fast containment testing (a set in use, if not in name) limited to the K most recently eaten dishes. Fortunately, since dict preserves insertion order in newer Python versions and testing key containment is fast, it will fit the bill. dict requires that keys be hashable, but since the problem uses ints to represent dish types, that requirement is met.
With this approach, the algorithm in the question remains unchanged.
Rather than checking whether the next dish type is any of the last K dishes, check whether the last time the next dish was eaten is within K of the current plate count. If it is, skip the dish. If not, eat the dish (update both the record of when the next dish was last eaten and the current dish count). In terms of data structures, the program will need to keep a record of when any given dish type was last eaten (initialized to -K-1 to ensure that the first time a dish type is encountered it will be eaten; defaultdict can be very useful for this).
With this approach, the algorithm is slightly different. The code ends up being slightly shorter, as there's no shortening of the data structure storing information about the dishes as there is in the original algorithm.
There are two takeaways from the latter approach that might be applied when solving other problems:
More broadly, reframing a problem (such as from "the dish is in the last K dishes eaten" to "the dish was last eaten within K dishes of now") can result in a simpler approach.
Less broadly, sometimes it's more efficient to work with a flipped data structure, swapping keys/indices and values.
Approach & takeaway 2 both remind me of a substring search algorithm (the name escapes me) that uses a table of positions in the needle (the string to search for) of where each character first appears (for characters not in the string, the table has the length of the string); when a mismatch occurs, the algorithm uses the table to align the substring with the mismatching character, then starts checking at the start of the substring. It's not the most efficient string search algorithm, but it's simple and more efficient than the naive algorithm. It's similar to but simpler and less efficient than the skip search algorithm, which uses the positions of every occurrence of each character in the needle.
from typing import List
# Write any import statements here
from collections import deque, Counter

def getMaximumEatenDishCount(N: int, D: List[int], K: int) -> int:
    # Write your code here
    q = deque()
    cnt = 0
    dish_counter = Counter()
    for d in D:
        if dish_counter[d] == 0:
            cnt += 1
            q.append(d)
            dish_counter[d] += 1
            if len(q) == K + 1:
                remove = q.popleft()
                dish_counter[remove] -= 1
    return cnt
I am writing a script to calculate all the Euclidean distances between an X value and a lot of other values in a dictionary, obtaining a float that I then convert to a list. The problem is that I don't obtain a single list with all the outcomes, but many lists with only one element inside, my outcome.
My script for the moment is:
single_mineral = {}
for k in new_dict.keys():
    single_mineral = new_dict[k]
    Zeff = single_mineral["Zeff_norm"]
    rhoe = single_mineral["Rhoe_norm"]
    eucl_Zeff = (calculated_Zeff_norm, Zeff)
    eucl_rhoe = (calculated_rhoe_norm, rhoe)
    dst = [(distance.euclidean(eucl_Zeff, eucl_rhoe))]
    print(dst)
I obtain something like this:
[0.29205348037179407]
[0.23436642937625374]
[0.3835446564476642]
[0.11616594912309205]
[0.21792958584034935]
and they are not linked in any way (so I can't use itertools.chain).
I want to create a single list with all these values (the final goal is to sort them in ascending order, and for that reason I need only one list).
I guess the solution is a for loop, but I have no idea how to write it. I don't understand where it needs to run and how I can collect my outcomes, which are always assigned to "dst".
Please, help me! Thank you very much in advance!
If you want to get them all in one list then you need:
# before loop
dst = []

# loop
for k in new_dict.keys():
    # ... code ...
    #dst.append( [distance.euclidean(eucl_Zeff, eucl_rhoe)] )
    dst.append( distance.euclidean(eucl_Zeff, eucl_rhoe) )

# after loop
print(dst)
I have two lists (items, sales), and for each pair of item, sales elements between the two lists I have to call a function. I'm looking for a Pythonic way to avoid such redundant looping.
First Loop:
# Create item_sales_list
item_sales_list = list()
for item, sales in itertools.product(items, sales):
    if sales > 100:
        item_sales_list.append([item, sales])

result = some_func_1(item_sales_list)
Second Loop:
# Call a function with the result returned from first function (some_func_1)
for item, sales in itertools.product(items, sales):
    some_func_2(item, sales, result)
You can avoid the second call to itertools.product at least if you store the result in the list, adding the condition at the call site of some_func_1:
item_sales_list = list(itertools.product(items, sales))
result = some_func_1([el for el in item_sales_list if el[1] > 100])
for item, sales in item_sales_list:
    some_func_2(item, sales, result)
It is impossible to do it with one pass unless you can pass an incomplete version of result to some_func_2.
A solution, and a frame challenge.
First, to avoid calculating itertools.product() multiple times, you can calculate it once up-front and then use it for both loops:
item_product = list(itertools.product(items, sales))
item_sales_list = [[item, sales] for item, sales in item_product if sales > 100]
Second, there's actually no time disadvantage to looping twice: you're still doing basically the same amount of work (the same operations, the same number of times each), so it's still in the same complexity class. And in this case it's unavoidable, because you need the result of the first calculation (which requires going over the entire list) to do the second calculation.
result = some_func_1(item_sales_list)
for item, sales in item_product:
    some_func_2(item, sales, result)
If you can modify some_func_2() so that it doesn't need the entire item_sales_list in order to work, then you could load it into the same for loop and do them one after another. Without knowing how some_func_2() works, it's impossible to give any further advice.
I never actually thought I'd run into speed issues with Python, but I have. I'm trying to compare really big lists of dictionaries to each other based on the dictionary values. I compare two lists, with the first like so:
biglist1=[{'transaction':'somevalue', 'id':'somevalue', 'date':'somevalue' ...}, {'transaction':'somevalue', 'id':'somevalue', 'date':'somevalue' ...}, ...]
With 'somevalue' standing for a user-generated string, int or decimal. Now, the second list is pretty similar, except the id-values are always empty, as they have not been assigned yet.
biglist2=[{'transaction':'somevalue', 'id':'', 'date':'somevalue' ...}, {'transaction':'somevalue', 'id':'', 'date':'somevalue' ...}, ...]
So I want to get a list of the dictionaries in biglist2 that match the dictionaries in biglist1 for all other keys except id.
I've been doing
for item in biglist2:
    for transaction in biglist1:
        if item['transaction'] == transaction['transaction']:
            list_transactionnamematches.append(transaction)

for item in biglist2:
    for transaction in list_transactionnamematches:
        if item['date'] == transaction['date']:
            list_transactionnamematches.append(transaction)
... and so on, not comparing id values, until I get a final list of matches. Since the lists can be really big (around 3000+ items each), this takes quite some time for python to loop through.
I'm guessing this isn't really how this kind of comparison should be done. Any ideas?
Index on the fields you want to use for lookup. O(n+m)
matches = []
biglist1_indexed = {}

for item in biglist1:
    biglist1_indexed[(item["transaction"], item["date"])] = item

for item in biglist2:
    if (item["transaction"], item["date"]) in biglist1_indexed:
        matches.append(item)
This is probably thousands of times faster than what you're doing now.
What you want to do is to use correct data structures:
Create a dictionary mapping the tuple of all the other values (everything except id) of each item in the first list to that item's id.
Create two sets of those value tuples, one for each list, then use set operations to get the tuple set you want.
Use the dictionary from point 1 to assign ids to those tuples.
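A minimal sketch of this idea; the field names, and the assumption that 'transaction' and 'date' are the only non-id keys, are illustrative rather than part of the original answer:

KEYS = ("transaction", "date")                      # every field except 'id'

# 1. Map each non-id value tuple from biglist1 to its id.
id_by_values = {tuple(d[k] for k in KEYS): d["id"] for d in biglist1}

# 2. Build the value-tuple sets and intersect them.
tuples1 = set(id_by_values)
tuples2 = {tuple(d[k] for k in KEYS) for d in biglist2}
common = tuples1 & tuples2

# 3. Assign the known ids back to the matching items in biglist2.
for d in biglist2:
    values = tuple(d[k] for k in KEYS)
    if values in common:
        d["id"] = id_by_values[values]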
Forgive my rusty python syntax, it's been a while, so consider this partially pseudocode
import operator

sort_key = operator.itemgetter('date', 'transaction')
biglist1.sort(key=sort_key)
biglist2.sort(key=sort_key)

biglist3 = []
i1 = 0
i2 = 0
while i1 < len(biglist1) and i2 < len(biglist2):
    if (biglist1[i1]['date'], biglist1[i1]['transaction']) == (biglist2[i2]['date'], biglist2[i2]['transaction']):
        biglist3.append(biglist1[i1])
        i1 += 1
        i2 += 1
    elif (biglist1[i1]['date'], biglist1[i1]['transaction']) < (biglist2[i2]['date'], biglist2[i2]['transaction']):
        i1 += 1
    elif (biglist1[i1]['date'], biglist1[i1]['transaction']) > (biglist2[i2]['date'], biglist2[i2]['transaction']):
        i2 += 1
    else:
        print "this wont happen if i did the tuple comparison correctly"
This sorts both lists into the same order, by (date,transaction). Then it walks through them side by side, stepping through each looking for relatively adjacent matches. It assumes that (date,transaction) is unique, and that I am not completely off my rocker with regards to tuple sorting and comparison.
In O(m*n)...
for item in biglist2:
    for transaction in biglist1:
        if (item['transaction'] == transaction['transaction'] and
                item['date'] == transaction['date'] and
                item['foo'] == transaction['foo']):
            list_transactionnamematches.append(transaction)
The approach I would probably take to this is to make a very, very lightweight class with one instance variable and one method. The instance variable is a pointer to a dictionary; the method overrides the built-in special method __hash__(self), returning a value calculated from all the values in the dictionary except id.
From there the solution seems fairly obvious: create two initially empty dictionaries, N and M (for no-matches and matches). Loop over each list exactly once, and for each of the dictionaries representing a transaction (let's call it a Tx_dict), create an instance of the new class (a Tx_ptr). Then test for an item matching this Tx_ptr in N and M: if there is no matching item in N, insert the current Tx_ptr into N; if there is a matching item in N but no matching item in M, insert the current Tx_ptr into M with the Tx_ptr itself as the key and a list containing the Tx_ptr as the value; if there is a matching item in both N and M, append the current Tx_ptr to the value associated with that key in M.
After you've gone through every item once, your dictionary M will contain pointers to all the transactions which match other transactions, all neatly grouped together into lists for you.
Edit: Oops! Obviously, the correct action if there is a matching Tx_ptr in N but not in M is to insert a key-value pair into M with the current Tx_ptr as the key and as the value, a list of the current Tx_ptr and the Tx_ptr that was already in N.
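A minimal sketch of this wrapper-class idea; the class name, the key construction, and the grouping details are illustrative (and __eq__ is added because dictionary keys need it alongside __hash__):

class TxPtr:
    def __init__(self, tx_dict):
        self.tx = tx_dict  # pointer to the underlying transaction dict

    def _key(self):
        # every value except 'id', in a stable order
        return tuple(v for k, v in sorted(self.tx.items()) if k != 'id')

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        return self._key() == other._key()

N = {}  # first occurrence of each content key (no match seen yet)
M = {}  # content key -> list of all matching transactions
for tx in biglist1 + biglist2:
    ptr = TxPtr(tx)
    if ptr not in N:
        N[ptr] = ptr
    elif ptr not in M:
        M[ptr] = [N[ptr], ptr]   # per the edit above: keep both the earlier and the current Tx_ptr
    else:
        M[ptr].append(ptr)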
Have a look at Psyco. It's a Python compiler that can create very fast, optimized machine code from your source.
http://sourceforge.net/projects/psyco/
While this isn't a direct solution to your code's efficiency issues, it could still help speed things up without needing to write any new code. That said, I'd still highly recommend optimizing your code as much as possible and then using Psyco to squeeze even more speed out of it.
Part of their guide specifically talks about using it to speed up list, string, and numeric computation heavy functions.
http://psyco.sourceforge.net/psycoguide/node8.html
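For reference, typical usage (as I recall it from the Psyco docs; Python 2 only, since Psyco never supported Python 3) looks roughly like this:

import psyco
psyco.full()                 # JIT-compile every function in the program
# or, more selectively:
# psyco.bind(hot_function)   # compile just the hot spots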
I'm also a newbie. My code is structured in much the same way as his.
for A in biglist:
    for B in biglist:
        if (A.get('somekey') != B.get('somekey') and        # don't match to itself
                len(set(A.get('list')) - set(B.get('list'))) > 10):
            pass  # [do stuff...]
This takes hours to run through a list of 10000 dictionaries. Each dictionary contains lots of stuff but I could potentially pull out just the ids ('somekey') and lists ('list') and rewrite as a single dictionary of 10000 key:value pairs.
Question: how much faster would that be? And I assume this is faster than using a list of lists, right?
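To make the described restructuring concrete, a hypothetical sketch (using the same 'somekey' and 'list' keys as above; much of the speed-up would come from building each set only once rather than inside the inner loop):

# Slim the data down to one dict of id -> set of list items, built once.
lists_by_id = {d.get('somekey'): set(d.get('list')) for d in biglist}

for id_a, set_a in lists_by_id.items():
    for id_b, set_b in lists_by_id.items():
        if id_a != id_b and len(set_a - set_b) > 10:
            pass  # [do stuff...]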