I have a dictionary of form
myNestedDict=collections.defaultdict(dict)
with example data:
{'509582':{'509533': 65.499},
'509583':{'509534': -35.499},{'509568': -325.499},
'509584':{'509576': 0},{'509576': -1337} }
I am trying to find the n smallest innermost values and return both keys (outer and inner) associated with each of them.
So for this example if I am looking for 2 smallest: heapq.nsmallest(2, myNestedDict, key=???)
I would like to return the dictionary:
{'509584':{'509576': -1337},
'509583':{'509568': -325.499} }
Or, since I don't really need the value anymore, it could return a non-nested dictionary if that is easier:
{'509584':'509576',
'509583':'509568'}
As you can see, I can't figure out a proper key design for the heapq.nsmallest to sort on the innermost value. Any help hugely appreciated. Thanks
NOTE - this dictionary is many millions of records, so efficiency is important.
Edit: this is what I have that actually runs, but it sorts on the outer key only; I need to sort on the innermost value. Note that itemgetter is reportedly much faster than a lambda for a dictionary with millions of entries.
heapq.nsmallest( 2, myNestedDict.items(), key=itemgetter(0) )
I solved this by converting my nested dictionary into a list of tuples using list comprehension and then using heapq.nsmallest(2, myTupleList, key=itemgetter(2) ).
Performance is manageable because it only takes 0.5 sec to convert the dictionary to a list for a 1 million row example.
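For reference, a minimal sketch of that approach. It assumes each inner dictionary may hold several entries; the sample data is a tidied version of the example in the question, and the names triples and result are only for illustration.

```python
import heapq
from operator import itemgetter

# Tidied sample data (assumption: inner dicts may hold several entries).
my_nested_dict = {
    '509582': {'509533': 65.499},
    '509583': {'509534': -35.499, '509568': -325.499},
    '509584': {'509576': -1337},
}

# Flatten to (outer_key, inner_key, value) tuples.
triples = [(outer, inner, value)
           for outer, inner_dict in my_nested_dict.items()
           for inner, value in inner_dict.items()]

# Pick the 2 tuples with the smallest innermost value (tuple index 2).
smallest = heapq.nsmallest(2, triples, key=itemgetter(2))

# Drop the value and keep only outer -> inner.
result = {outer: inner for outer, inner, _ in smallest}
print(result)
```

Since heapq.nsmallest accepts any iterable, the list comprehension could probably be replaced by a generator expression to avoid holding the full list of tuples in memory.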
What would be the fastest, most efficient way to grab and map multiple values to one value. For a use case example, say you are multiplying two numbers and you want to remember if you have multiplied those numbers before. Instead of making a giant matrix of X by Y and filling it out, it would be nice to query a Dict to see if dict[2,3] = 6 or dict[3,2] = 6. This would be especially useful for more than 2 values.
I have seen an answer similar to what I'm asking here, but would this be O(n) time or O(1)?
print value for matching multiple key
for key in responses:
    if user_message in key:
        print(responses[key])
Thanks!
Seems like the easiest way to do this is to sort the values before putting them in the dict. Then sort the x,y... values before looking them up. And note that you need to use tuples to map into a dictionary (lists are mutable).
the_dict = {(2,3,4): 24, (4,5,6): 120}
nums = tuple(sorted([6,4,5]))
if nums in the_dict:
    print(the_dict[nums])
I was coding a Euler problem, and I ran into question that sparked my curiosity. I have two snippets of code. One is with lists the other uses dictionaries.
using lists:
n=100000
num=[]
suma=0
for i in range(n,1,-1):
    tmp=tuple(set([n for n in factors(i)]))
    if len(tmp) != 2: continue
    if tmp not in num:
        num.append(tmp)
        suma+=i
using dictionaries:
n=100000
num={}
suma=0
for i in range(n,1,-1):
    tmp=tuple(set([n for n in factors(i)]))
    if len(tmp) != 2: continue
    if tmp not in num:
        num[tmp]=i
        suma+=i
I am only concerned about performance. Why does the second example, using dictionaries, run so much faster than the first example with lists? The example with dictionaries runs almost thirty-fold faster!
I tested both snippets using n=1000000: the first ran in 1032 seconds and the second in just 3.3 seconds. Amazing!
In Python, the average time complexity of a dictionary key lookup is O(1), since they are implemented as hash tables. The time complexity of lookup in a list is O(n) on average. In your code, this makes a difference in the line if tmp not in num:, since in the list case, Python needs to search through the whole list to detect membership, whereas in the dict case it does not except for the absolute worst case.
For more details, check out TimeComplexity.
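To make the difference visible without timing anything, here is a small sketch that counts equality comparisons during a membership test. The Counted class is a hypothetical helper written for this illustration, and the exact counts reflect CPython's behaviour.

```python
class Counted:
    """Hypothetical wrapper that counts how often == is evaluated."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        Counted.comparisons += 1
        return self.value == other.value


items = [Counted(i) for i in range(1000)]
as_list = list(items)
as_dict = dict.fromkeys(items)
probe = Counted(999)  # equal to the last element, but a different object

Counted.comparisons = 0
found_in_list = probe in as_list      # walks the list element by element
list_comparisons = Counted.comparisons

Counted.comparisons = 0
found_in_dict = probe in as_dict      # hashes the probe, then checks the matching slot
dict_comparisons = Counted.comparisons

print(list_comparisons, dict_comparisons)
```

The list needs one comparison per element until it reaches the match, while the dict only compares against the entry whose hash matches.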
If it's about speed, you should not create any lists:
n = 100000
# Use a name other than "factors" for the generator so the factors() function is not shadowed.
pairs = ((frozenset(factors(i)), i) for i in range(2, n+1))
num = {k: v for k, v in pairs if len(k) == 2}
suma = sum(num.values())
I am almost positive that the "magic sauce" of the dictionary lies in the fact that it stores hashed key->value pairs.
In a list, you are dealing with an array, which means the membership test has to start at index 0 and walk through the records until it finds a match.
The dictionary can jump to the key->value pair in question on the first 'go-round' and return it, hence the speed.
Basically, testing for membership among hashed keys is a lot quicker than searching an entire list for a value, and the larger your list gets, the slower the list search will be. This isn't always the case, and there are scenarios where a list will be faster, but I believe this is the answer you're looking for.
In a list, the code if tmp not in num: is O(n), while it is O(1) on average in a dict.
Edit: The dict is based on hashing, so it is much quicker than a linear list search.
Thanks @user2357112 for pointing this out.
I apologize, as this must be a basic question about using dictionaries. I'm learning Python, and my objective is to compare two dictionaries and recover the key and value entries that are identical in both. I understand that order in dictionaries is not relevant the way it is when working with a list. But I adapted some code to compare my dictionaries and I just wanted to make sure that the order of the dictionaries does not matter.
The code I have written so far is:
def compare_dict(first,second):
    with open('Common_hits_python.txt', 'w') as file:
        for keyone in first:
            for keytwo in second:
                if keytwo == keyone:
                    if first[keyone] == second[keytwo]:
                        file.write(keyone + "\t" + first[keyone] + "\n")
Any recommendations would be appreciated. I apologize for the redundancy in the code above. If someone could confirm that comparing two dictionaries this way does not require the keys to be in the same order, that would be great. Other ways of writing the function would be really appreciated as well.
Since you loop over both dictionaries and compare all the combinations, no, order doesn't matter. Every key in one dictionary is compared with every key in the other dictionary, eventually.
It is not a very efficient way to test for matching keys, however. Testing if a key is present is as simple as keyone in second, no need to loop over all the keys in second here.
Better still, you can use set intersections instead:
for key, value in first.viewitems() & second.viewitems():
    # loops over all key - value pairs that match in both.
    file.write('{}\t{}\n'.format(key, value))
This uses dictionary view objects; if you are using Python 3, then you can use first.items() & second.items() as dictionaries there return dictionary views by default.
Using dict.viewitems() as a set only works if the values are hashable too, but since you are treating your values as strings when writing to the file I assumed they were.
If your values are not hashable, you'll need to validate that the values match, but you can still use views and intersect just the keys:
for key in first.viewkeys() & second.viewkeys():
    # loops over all keys that match in both.
    if first[key] == second[key]:
        file.write('{}\t{}\n'.format(key, first[key]))
Again, in Python 3, use first.keys() & second.keys() for the intersection of the two dictionaries by keys.
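As a rough Python 3 sketch of both variants, with the results returned instead of written to a file: the function names common_items and common_items_by_key, and the sample dictionaries, are made up for this illustration.

```python
def common_items(first, second):
    """Key/value pairs present in both dicts (values must be hashable)."""
    return dict(first.items() & second.items())


def common_items_by_key(first, second):
    """Same result, but also works when the values are not hashable."""
    return {key: first[key]
            for key in first.keys() & second.keys()
            if first[key] == second[key]}


first = {'a': '1', 'b': '2', 'c': '3'}
second = {'c': '3', 'b': 'other', 'a': '1', 'd': '4'}

print(common_items(first, second))

# Lists are unhashable, so only the key-based variant applies here.
print(common_items_by_key({'x': [1, 2], 'y': [3]}, {'y': [3], 'x': [9]}))
```

Neither function depends on the order in which the keys were inserted.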
Your way of doing it is valid. As you loop through both dictionaries, their order does not matter.
You could do this instead, to optimize your code.
for keyone in first:
    if keyone in second:  # returns true if keyone is present in second.
        if first[keyone] == second[keyone]:
            file.write(keyone + "\t" + first[keyone] + "\n")
The keys of a dictionary are effectively a set, and Python already has a built-in set type with an efficient intersection method. This will produce a set of keys that are common to both dictionaries:
dict0 = {...}
dict1 = {...}
set0 = set(dict0)
set1 = set(dict1)
keys = set0.intersection(set1)
Your goal is to build a dictionary out of these keys, which can be done with a dictionary comprehension. It will require a condition to keep out the keys that have unequal values in the two original dictionaries:
new_dict = {k: dict0[k] for k in keys if dict0[k] == dict1[k]}
Depending on your intended use for the new dictionary, you might want to copy or deepcopy the old dictionary's values into the new one.
Suppose I have some kind of dictionary structure like this (or another data structure representing the same thing).
d = {
42.123231:'X',
42.1432423:'Y',
45.3213213:'Z',
..etc
}
I want to create a function like this:
def f(n, d, e):
    '''Return a list with the values in dictionary d corresponding to the float n
    within (+/-) the float error term e'''
So if I called the function like this with the above dictionary:
f(42,d,2)
It would return
['X','Y']
However, while it is straightforward to write this function with a loop, I don't want something that goes through every value in the dictionary and checks it exhaustively. I want it to take advantage of the indexed structure somehow (or even a sorted list could be used) to make the search much faster.
A dictionary is the wrong data structure for this. Write a search tree.
A Python dictionary is a hash map implementation. Its keys cannot be traversed in sorted order as in a search tree, so you simply can't do this with a Python dictionary without actually checking all keys.
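If writing a search tree is more than you need, a sorted list of keys searched with the standard bisect module finds the range boundaries in logarithmic time as well. This is a sketch of that alternative, not of the tree itself; the helper name values_within and the precomputed sorted_keys argument are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

d = {42.123231: 'X', 42.1432423: 'Y', 45.3213213: 'Z'}

# Sort the keys once; repeat this only when the dictionary changes.
sorted_keys = sorted(d)


def values_within(n, d, e, sorted_keys):
    """Return the values whose keys lie in the closed range [n - e, n + e]."""
    lo = bisect_left(sorted_keys, n - e)    # first key >= n - e
    hi = bisect_right(sorted_keys, n + e)   # one past the last key <= n + e
    return [d[key] for key in sorted_keys[lo:hi]]


print(values_within(42, d, 2, sorted_keys))  # ['X', 'Y']
```

Only the keys inside the range are visited; the rest of the list is skipped by the two binary searches.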
Do not rely on a dictionary being sorted by its keys, because that is not guaranteed. Instead, sort it once into an OrderedDict:
from collections import OrderedDict
d_ordered = OrderedDict(sorted(d.items(), key =lambda i:i[0]))
Then filtering values is rather simple - and it will stop at the upper border
import itertools
values = [val for k, val in
          itertools.takewhile(lambda kv: kv[0] < upper, d_ordered.items())
          if k > lower]
Note that the sorting step matters: a plain dictionary does not keep its keys in sorted order, and any order you happen to observe is an implementation detail that may change.
I never actually thought I'd run into speed-issues with python, but I have. I'm trying to compare really big lists of dictionaries to each other based on the dictionary values. I compare two lists, with the first like so
biglist1=[{'transaction':'somevalue', 'id':'somevalue', 'date':'somevalue' ...}, {'transaction':'somevalue', 'id':'somevalue', 'date':'somevalue' ...}, ...]
With 'somevalue' standing for a user-generated string, int or decimal. Now, the second list is pretty similar, except the id-values are always empty, as they have not been assigned yet.
biglist2=[{'transaction':'somevalue', 'id':'', 'date':'somevalue' ...}, {'transaction':'somevalue', 'id':'', 'date':'somevalue' ...}, ...]
So I want to get a list of the dictionaries in biglist2 that match the dictionaries in biglist1 for all other keys except id.
I've been doing
for item in biglist2:
    for transaction in biglist1:
        if item['transaction'] == transaction['transaction']:
            list_transactionnamematches.append(transaction)
for item in biglist2:
    for transaction in list_transactionnamematches:
        if item['date'] == transaction['date']:
            list_transactionnamematches.append(transaction)
... and so on, not comparing id values, until I get a final list of matches. Since the lists can be really big (around 3000+ items each), this takes quite some time for python to loop through.
I'm guessing this isn't really how this kind of comparison should be done. Any ideas?
Index on the fields you want to use for lookup. O(n+m)
matches = []
biglist1_indexed = {}
for item in biglist1:
    biglist1_indexed[(item["transaction"], item["date"])] = item
for item in biglist2:
    if (item["transaction"], item["date"]) in biglist1_indexed:
        matches.append(item)
This is probably thousands of times faster than what you're doing now.
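If the match should cover every field except id rather than two hard-coded ones, the lookup key can be built from the record itself. The sketch below assumes all values are hashable; the helper key_without_id and the sample records are invented for illustration.

```python
def key_without_id(record):
    """Hashable key built from every field of the record except 'id'."""
    return tuple(sorted((k, v) for k, v in record.items() if k != 'id'))


biglist1 = [
    {'transaction': 'rent', 'id': '17', 'date': '2011-01-01'},
    {'transaction': 'food', 'id': '18', 'date': '2011-01-02'},
]
biglist2 = [
    {'transaction': 'rent', 'id': '', 'date': '2011-01-01'},
    {'transaction': 'fuel', 'id': '', 'date': '2011-01-03'},
]

# One pass over each list: O(n + m) instead of O(n * m).
indexed = {key_without_id(item): item for item in biglist1}
matches = [item for item in biglist2 if key_without_id(item) in indexed]

print(matches)
```

Sorting the (field, value) pairs makes the key independent of the order in which the fields were inserted into each dictionary.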
What you want to do is to use correct data structures:
1. Create a dictionary of mappings of tuples of other values in the first dictionary to their id.
2. Create two sets of tuples of values in both dictionaries. Then use set operations to get the tuple set you want.
3. Use the dictionary from point 1 to assign ids to those tuples.
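The three steps above could be sketched as follows, assuming the fields to match on are 'transaction' and 'date' and that each combination is unique in biglist1. The sample records and the signature helper are made up for illustration.

```python
def signature(record):
    """Tuple of the non-id values used for matching (assumed fields)."""
    return (record['transaction'], record['date'])


biglist1 = [
    {'transaction': 'rent', 'id': '17', 'date': '2011-01-01'},
    {'transaction': 'food', 'id': '18', 'date': '2011-01-02'},
]
biglist2 = [
    {'transaction': 'food', 'id': '', 'date': '2011-01-02'},
    {'transaction': 'fuel', 'id': '', 'date': '2011-01-03'},
]

# Step 1: map each tuple of values to its id.
id_by_signature = {signature(r): r['id'] for r in biglist1}

# Step 2: build both sets of tuples and intersect them.
common = set(id_by_signature) & {signature(r) for r in biglist2}

# Step 3: assign the known ids to the matching records in biglist2.
for record in biglist2:
    if signature(record) in common:
        record['id'] = id_by_signature[signature(record)]

print(biglist2)
```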
Forgive my rusty python syntax, it's been a while, so consider this partially pseudocode
import operator
# Sort both lists by (date, transaction); the records are dicts, so look fields up by name.
sort_key = operator.itemgetter('date', 'transaction')
biglist1.sort(key=sort_key)
biglist2.sort(key=sort_key)
biglist3 = []
i1 = 0
i2 = 0
while i1 < len(biglist1) and i2 < len(biglist2):
    key1 = sort_key(biglist1[i1])
    key2 = sort_key(biglist2[i2])
    if key1 == key2:
        # Match found: keep the record that carries the id and advance both lists.
        biglist3.append(biglist1[i1])
        i1 += 1
        i2 += 1
    elif key1 < key2:
        i1 += 1
    else:
        i2 += 1
This sorts both lists into the same order, by (date,transaction). Then it walks through them side by side, stepping through each looking for relatively adjacent matches. It assumes that (date,transaction) is unique, and that I am not completely off my rocker with regards to tuple sorting and comparison.
In O(m*n)...
for item in biglist2:
    for transaction in biglist1:
        if (item['transaction'] == transaction['transaction'] and
                item['date'] == transaction['date'] and
                item['foo'] == transaction['foo']):
            list_transactionnamematches.append(transaction)
The approach I would probably take to this is to make a very, very lightweight class with one instance variable and one method. The instance variable is a pointer to a dictionary; the method overrides the built-in special method __hash__(self), returning a value calculated from all the values in the dictionary except id.
From there the solution seems fairly obvious: Create two initially empty dictionaries: N and M (for no-matches and matches.) Loop over each list exactly once, and for each of these dictionaries representing a transaction (let's call it a Tx_dict), create an instance of the new class (a Tx_ptr). Then test for an item matching this Tx_ptr in N and M: if there is no matching item in N, insert the current Tx_ptr into N; if there is a matching item in N but no matching item in M, insert the current Tx_ptr into M with the Tx_ptr itself as a key and a list containing the Tx_ptr as the value; if there is a matching item in N and in M, append the current Tx_ptr to the value associated with that key in M.
After you've gone through every item once, your dictionary M will contain pointers to all the transactions which match other transactions, all neatly grouped together into lists for you.
Edit: Oops! Obviously, the correct action if there is a matching Tx_ptr in N but not in M is to insert a key-value pair into M with the current Tx_ptr as the key and as the value, a list of the current Tx_ptr and the Tx_ptr that was already in N.
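A rough sketch of this idea follows. The class name TxPtr and the function group_matches are illustrative. The sketch also defines __eq__ alongside __hash__, because dictionary lookups need both to treat two wrappers as the same key, and it assumes all transaction values are hashable.

```python
class TxPtr:
    """Lightweight wrapper that hashes a transaction dict on everything but 'id'."""

    def __init__(self, tx_dict):
        self.tx = tx_dict
        self._key = tuple(sorted((k, v) for k, v in tx_dict.items() if k != 'id'))

    def __hash__(self):
        return hash(self._key)

    def __eq__(self, other):
        return isinstance(other, TxPtr) and self._key == other._key


def group_matches(*lists):
    """Return M: wrapper -> list of all wrappers that match it."""
    N = {}  # transactions seen so far
    M = {}  # transactions seen at least twice, grouped
    for records in lists:
        for tx in records:
            ptr = TxPtr(tx)
            if ptr not in N:
                N[ptr] = ptr
            elif ptr not in M:
                M[ptr] = [N[ptr], ptr]  # the earlier one plus the current one
            else:
                M[ptr].append(ptr)
    return M


a = [{'transaction': 't1', 'id': '7', 'date': 'd1'},
     {'transaction': 't2', 'id': '8', 'date': 'd2'}]
b = [{'transaction': 't1', 'id': '', 'date': 'd1'},
     {'transaction': 't3', 'id': '', 'date': 'd3'}]
M = group_matches(a, b)
for group in M.values():
    print([ptr.tx for ptr in group])
```

Each list is traversed exactly once, so the work grows with the total number of transactions rather than with the product of the list lengths.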
Have a look at Psyco. It's a Python compiler that can create very fast, optimized machine code from your source.
http://sourceforge.net/projects/psyco/
While this isn't a direct solution to your code's efficiency issues, it could still help speed things up without needing to write any new code. That said, I'd still highly recommend optimizing your code as much as possible and then using Psyco to squeeze out whatever extra speed it can.
Part of their guide specifically talks about using it to speed up list, string, and numeric computation heavy functions.
http://psyco.sourceforge.net/psycoguide/node8.html
I'm also a newbie. My code is structured in much the same way as his.
for A in biglist:
    for B in biglist:
        if (A.get('somekey') != B.get('somekey') and  # don't match to itself
                len(set(A.get('list')) - set(B.get('list'))) > 10):
            pass  # [do stuff...]
This takes hours to run through a list of 10000 dictionaries. Each dictionary contains lots of stuff but I could potentially pull out just the ids ('somekey') and lists ('list') and rewrite as a single dictionary of 10000 key:value pairs.
Question: how much faster would that be? And I assume this is faster than using a list of lists, right?