Python Matching Multiple Keys / Unique Pairs to a Value

What would be the fastest, most efficient way to map multiple values to one value? For a use-case example, say you are multiplying two numbers and you want to remember whether you have multiplied those numbers before. Instead of making a giant X-by-Y matrix and filling it out, it would be nice to query a dict to see if dict[2,3] == 6 or dict[3,2] == 6. This would be especially useful for more than 2 values.
I have seen an answer similar to what I'm asking here, but would this be O(n) time or O(1)?
print value for matching multiple key
for key in responses:
    if user_message in key:
        print(responses[key])
Thanks!

Seems like the easiest way to do this is to sort the values before putting them in the dict, then sort the x, y, ... values before looking them up. And note that you need to use tuples as dictionary keys (lists are mutable and therefore unhashable).
the_dict = {(2, 3, 4): 24, (4, 5, 6): 120}
nums = tuple(sorted([6, 4, 5]))
if nums in the_dict:
    print(the_dict[nums])
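Applied to the multiplication example from the question, a minimal sketch might look like this (multiply_memo and multiply are hypothetical names, not from the original post):

multiply_memo = {}

def multiply(*nums):
    key = tuple(sorted(nums))  # (3, 2) and (2, 3) normalize to the same key
    if key not in multiply_memo:
        result = 1
        for n in nums:
            result *= n
        multiply_memo[key] = result
    return multiply_memo[key]

Note that a dict membership test like nums in the_dict is average O(1) in the number of stored entries; the for key in responses loop quoted above is O(n), since it scans every key.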

Related

How to find the most repeated element in a list in Python?

I'm currently using the Counter() method for this. But the issue I'm facing is that when there are multiple elements with the same count, I get the key of whichever element occurs first in the list.
from collections import Counter

a = [1, 3, 2, 2, 3]
coun = Counter(a)
print(coun.most_common(1))
Output: [(3, 2)]
a = [1, 2, 3, 2, 3]
coun = Counter(a)
print(coun.most_common(1))
Output: [(2, 2)]
I want to get the lower key among the tied counts instead of the one that occurs first, i.e. 2 here, irrespective of the order. I could sort the list, but I'm concerned that sorting can use up a lot of time.
Please help.
Sorry for the formatting mess.
Depending on the number of duplicates you are expecting, you could simply check more of the most_common values. Assuming that no more than 100 values share exactly the same count, you could simply do:
print(sorted(coun.most_common(100))[0])
You could use a different value than 100, of course. But now the list to sort is at most 100 tuples, which of course isn't a problem.
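A variant that avoids picking a cutoff, offered as a sketch rather than part of the original answer: take the max over all counted items, breaking count ties on the smaller key (this assumes the keys are numbers, as in the question):

from collections import Counter

a = [1, 3, 2, 2, 3]
coun = Counter(a)
# highest count wins; among equal counts, the smaller key wins
key, count = max(coun.items(), key=lambda kv: (kv[1], -kv[0]))
print(key, count)  # 2 2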

heapq.nsmallest on nested dictionary innermost value

I have a dictionary of form
myNestedDict=collections.defaultdict(dict)
with example data:
{'509582': {'509533': 65.499},
 '509583': {'509534': -35.499, '509568': -325.499},
 '509584': {'509576': 0, '509576': -1337}}
And I am trying to get it to return both keys associated with each of the n smallest values.
So for this example, if I am looking for the 2 smallest: heapq.nsmallest(2, myNestedDict, key=???)
I would like to return the dictionary:
{'509584':{'509576': -1337},
'509583':{'509568': -325.499} }
or, since I actually don't really need the value anymore, it could return a non-nested dictionary if that's easier:
{'509584':'509576',
'509583':'509568'}
As you can see, I can't figure out a proper key design for heapq.nsmallest to sort on the innermost value. Any help hugely appreciated. Thanks
NOTE - this dictionary is many millions of records, so efficiency is important.
edit - this right here is what I have that actually runs, but it's sorting on the first key only; I need to sort on the innermost value. Note: they say itemgetter is way faster than lambda for this multi-million-entry dictionary.
heapq.nsmallest(2, myNestedDict.items(), key=itemgetter(0))
I solved this by converting my nested dictionary into a list of tuples using a list comprehension and then using heapq.nsmallest(2, myTupleList, key=itemgetter(2)).
Performance is manageable because it only takes 0.5 sec to convert the dictionary to a list for a 1 million row example.
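A minimal sketch of that flatten-then-select approach, reconstructed from the description above (variable names are assumptions):

import heapq
from operator import itemgetter

# flatten {outer: {inner: value}} into (outer, inner, value) tuples
myTupleList = [(outer, inner, value)
               for outer, inner_dict in myNestedDict.items()
               for inner, value in inner_dict.items()]

smallest = heapq.nsmallest(2, myTupleList, key=itemgetter(2))
result = {outer: inner for outer, inner, _ in smallest}
# e.g. {'509584': '509576', '509583': '509568'} for the sample data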

Python Spark split list into sublists divided by the sum of value inside elements

I'm trying to split a list of objects in Python into sublists based on the cumulative value of one of the parameters in the object. Let me present it with an example:
I have a list of objects like this:
[{x:1, y:2}, {x:3, y:2}, ..., {x:5, y: 1}]
and I want to divide this list into sub-lists where the total sum of x values inside a sublist will be the same (or roughly the same) so the result could look like this:
[[{x:3, y:1}, {x:3, y:1}, {x:4, y:1}], [{x:2, y:1}, {x:2, y:1}, {x:6, y:1}]]
Where the sum of the x's is equal to 10. The objects I am working with are a little more complicated, and my x's are float values. So I want to aggregate values from the ordered list until the sum of the x's is >= 10, and then start building the next sub-list.
In my case, the first list of elements is an ordered list, and the summation has to take place on the ordered list.
I've done something like this already in C#, where I iterate through all my elements and keep a running counter of the x value. I sum the x values of consecutive objects until the counter hits my threshold, then I create a new sub-list and reset the counter.
Now I want to reimplement it in Python and then use it with Spark, so I am looking for a somewhat more "functional" implementation, maybe something that works nicely with the map-reduce framework. I can't figure out anything other than the iterative approach.
If you have any suggestions, or possible solutions, I would welcome all constructive comments.
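For reference, a minimal Python sketch of the iterative approach described above (plain Python rather than Spark; the threshold of 10 and the function name are assumptions taken from the example):

def split_by_cumulative_x(items, threshold=10.0):
    # group consecutive items into sublists whose x values sum to >= threshold
    sublists, current, total = [], [], 0.0
    for item in items:
        current.append(item)
        total += item['x']
        if total >= threshold:
            sublists.append(current)
            current, total = [], 0.0
    if current:  # keep any remainder that never reached the threshold
        sublists.append(current)
    return sublists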

Insert a number into a list

I have an ordered dictionary like the following:
source = ([('a', [1,2,3,4,5,6,7,11,13,17]), ('b', [1,2,3,12])])
I want to calculate the length of each key's value first, then take the square root of it; say that is L.
Insert L at the positions that are divisible by L without remainder, and insert "1" after every other number.
For example, for source['a'] = [1,2,3,4,5,6,7,11,13,17] the length of it is 9.
Thus the sqrt of len(source['a']) is 3.
Insert the number 3 at each position that is exactly divisible by 3 (e.g. position 3, position 6, position 9); if a number's position cannot be divided exactly by 3, insert 1 after it.
To get a result like the following:
result = ([('a', ["1,1","2,1","3,3","4,1","5,1","6,3","7,1","11,1","13,3","10,1"]), ('b', ["1,1","2,2","3,1","12,2"])])
I don't know how to change the item in the list to a string pair. BTW, this is not my homework assignment; I was trying to build a boolean retrieval engine. The source data is too big, so I just created a simple sample here to explain what I want to achieve :)
As this seems to be homework, I will try to help you with the part you are facing a problem with:
I dont know how to change the item in the list to a string pair.
As the entire list needs to be updated, it's better to recreate it rather than update it in place, though updating in place is possible since lists are mutable.
Consider a list
lst = [1,2,3,4,5]
To convert it to a list of strings, you can use a list comprehension:
lst = [str(e) for e in lst]
You may also use the built-in map as map(str, lst), but you need to remember that in Py3.x, map returns a map object, so it needs to be handled accordingly.
A condition in a comprehension is best expressed as a conditional expression:
<TRUE-STATEMENT> if <condition> else <FALSE-STATEMENT>
To get the index of any item in a list, your best bet is the built-in enumerate.
If you need to create a formatted string from a sequence of items, it's suggested to use a format string:
"{},{}".format(a, b)
The length of any sequence, including a list, can be calculated with the built-in len.
You can use the ** operator with a fractional power, or use the math module and invoke the sqrt function, to calculate the square root.
Now you just have to combine each of the above suggestions to solve your problem.
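A hedged sketch that combines those suggestions (this assumes 1-based positions as in the question, and that the "10,1" in the expected output is a typo for "17,1"):

import math

source = [('a', [1, 2, 3, 4, 5, 6, 7, 11, 13, 17]), ('b', [1, 2, 3, 12])]

result = []
for key, values in source:
    L = int(math.sqrt(len(values)))  # e.g. len 4 -> L == 2
    pairs = ["{},{}".format(v, L if pos % L == 0 else 1)
             for pos, v in enumerate(values, start=1)]
    result.append((key, pairs))
# result[1] == ('b', ['1,1', '2,2', '3,1', '12,2'])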

Comparing massive lists of dictionaries in python

I never actually thought I'd run into speed issues with Python, but I have. I'm trying to compare really big lists of dictionaries to each other based on the dictionary values. I compare two lists, with the first like so:
biglist1 = [{'transaction': 'somevalue', 'id': 'somevalue', 'date': 'somevalue', ...}, {'transaction': 'somevalue', 'id': 'somevalue', 'date': 'somevalue', ...}, ...]
With 'somevalue' standing for a user-generated string, int, or decimal. Now, the second list is pretty similar, except the id values are always empty, as they have not been assigned yet.
biglist2 = [{'transaction': 'somevalue', 'id': '', 'date': 'somevalue', ...}, {'transaction': 'somevalue', 'id': '', 'date': 'somevalue', ...}, ...]
So I want to get a list of the dictionaries in biglist2 that match the dictionaries in biglist1 for all other keys except id.
I've been doing
for item in biglist2:
    for transaction in biglist1:
        if item['transaction'] == transaction['transaction']:
            list_transactionnamematches.append(transaction)

for item in biglist2:
    for transaction in list_transactionnamematches:
        if item['date'] == transaction['date']:
            list_transactionnamematches.append(transaction)
... and so on, not comparing id values, until I get a final list of matches. Since the lists can be really big (around 3000+ items each), this takes quite some time for python to loop through.
I'm guessing this isn't really how this kind of comparison should be done. Any ideas?
Index on the fields you want to use for lookup. O(n+m)
matches = []
biglist1_indexed = {}
for item in biglist1:
    biglist1_indexed[(item["transaction"], item["date"])] = item
for item in biglist2:
    if (item["transaction"], item["date"]) in biglist1_indexed:
        matches.append(item)
This is probably thousands of times faster than what you're doing now.
What you want to do is use the correct data structures:
1. Create a dictionary mapping tuples of the other values in the first list's dictionaries to their ids.
2. Create two sets of those value tuples, one per list. Then use set operations to get the tuple set you want.
3. Use the dictionary from point 1 to assign ids to those tuples.
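A minimal sketch of that recipe, assuming for illustration that the non-id fields are just 'transaction' and 'date' (the real dictionaries have more keys):

KEYS = ('transaction', 'date')  # every field except 'id'

# 1. map value tuples from biglist1 to their ids
id_by_values = {tuple(d[k] for k in KEYS): d['id'] for d in biglist1}

# 2. intersect the two sets of value tuples
set1 = {tuple(d[k] for k in KEYS) for d in biglist1}
set2 = {tuple(d[k] for k in KEYS) for d in biglist2}
common = set1 & set2

# 3. assign ids to the matching dictionaries in biglist2
for d in biglist2:
    values = tuple(d[k] for k in KEYS)
    if values in common:
        d['id'] = id_by_values[values]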
Forgive my rusty Python syntax, it's been a while, so consider this partially pseudocode
import operator

biglist1.sort(key=operator.itemgetter('date', 'transaction'))
biglist2.sort(key=operator.itemgetter('date', 'transaction'))
biglist3 = []
i1 = 0
i2 = 0
while i1 < len(biglist1) and i2 < len(biglist2):
    key1 = (biglist1[i1]['date'], biglist1[i1]['transaction'])
    key2 = (biglist2[i2]['date'], biglist2[i2]['transaction'])
    if key1 == key2:
        biglist3.append(biglist1[i1])
        i1 += 1
        i2 += 1
    elif key1 < key2:
        i1 += 1
    else:  # key1 > key2 is the only remaining case, if I did the tuple comparison correctly
        i2 += 1
This sorts both lists into the same order, by (date,transaction). Then it walks through them side by side, stepping through each looking for relatively adjacent matches. It assumes that (date,transaction) is unique, and that I am not completely off my rocker with regards to tuple sorting and comparison.
In O(m*n)...
for item in biglist2:
    for transaction in biglist1:
        if (item['transaction'] == transaction['transaction'] and
                item['date'] == transaction['date'] and
                item['foo'] == transaction['foo']):
            list_transactionnamematches.append(transaction)
The approach I would probably take to this is to make a very, very lightweight class with one instance variable and one method. The instance variable is a pointer to a dictionary; the method overrides the built-in special method __hash__(self), returning a value calculated from all the values in the dictionary except id.
From there the solution seems fairly obvious: Create two initially empty dictionaries: N and M (for no-matches and matches.) Loop over each list exactly once, and for each of these dictionaries representing a transaction (let's call it a Tx_dict), create an instance of the new class (a Tx_ptr). Then test for an item matching this Tx_ptr in N and M: if there is no matching item in N, insert the current Tx_ptr into N; if there is a matching item in N but no matching item in M, insert the current Tx_ptr into M with the Tx_ptr itself as a key and a list containing the Tx_ptr as the value; if there is a matching item in N and in M, append the current Tx_ptr to the value associated with that key in M.
After you've gone through every item once, your dictionary M will contain pointers to all the transactions which match other transactions, all neatly grouped together into lists for you.
Edit: Oops! Obviously, the correct action if there is a matching Tx_ptr in N but not in M is to insert a key-value pair into M with the current Tx_ptr as the key and as the value, a list of the current Tx_ptr and the Tx_ptr that was already in N.
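A rough sketch of that wrapper class, with assumed names (note that for the dictionary lookups to work, __eq__ must be overridden alongside __hash__, which the answer doesn't spell out):

class TxPtr:
    # wraps a transaction dict; hashes and compares on everything except 'id'
    def __init__(self, tx):
        self.tx = tx
        self._key = tuple(sorted((k, v) for k, v in tx.items() if k != 'id'))
    def __hash__(self):
        return hash(self._key)
    def __eq__(self, other):
        return self._key == other._key

N, M = {}, {}  # no-matches and matches
for tx in biglist1 + biglist2:
    ptr = TxPtr(tx)
    if ptr not in N:
        N[ptr] = ptr
    elif ptr not in M:
        M[ptr] = [N[ptr], ptr]  # per the edit above, include the Tx_ptr already in N
    else:
        M[ptr].append(ptr)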
Have a look at Psyco. It's a Python compiler that can create very fast, optimized machine code from your source.
http://sourceforge.net/projects/psyco/
While this isn't a direct solution to your code's efficiency issues, it could still help speed things up without requiring any new code. That said, I'd still highly recommend optimizing your code as much as possible AND using Psyco to squeeze as much speed out of it as possible.
Part of their guide specifically talks about using it to speed up list-, string-, and numeric-computation-heavy functions.
http://psyco.sourceforge.net/psycoguide/node8.html
I'm also a newbie. My code is structured in much the same way as his.
for A in biglist:
    for B in biglist:
        if (A.get('somekey') != B.get('somekey') and  # don't match to itself
                len(set(A.get('list')) - set(B.get('list'))) > 10):
            pass  # [do stuff...]
This takes hours to run through a list of 10000 dictionaries. Each dictionary contains lots of stuff, but I could potentially pull out just the ids ('somekey') and lists ('list') and rewrite it as a single dictionary of 10000 key:value pairs.
Question: how much faster would that be? And I assume this is faster than using a list of lists, right?
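Regarding pulling out just the ids and lists: a hedged sketch of that idea, which at least stops rebuilding every set inside the O(n^2) loop:

# build each set once instead of on every comparison
sets_by_id = {d['somekey']: set(d['list']) for d in biglist}

for id_a, set_a in sets_by_id.items():
    for id_b, set_b in sets_by_id.items():
        if id_a != id_b and len(set_a - set_b) > 10:
            pass  # [do stuff...]

The double loop is still O(n^2) over the 10000 entries, but each set is constructed once instead of on every inner iteration, which removes a large constant factor.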
