Comparing massive lists of dictionaries in Python

I never actually thought I'd run into speed-issues with python, but I have. I'm trying to compare really big lists of dictionaries to each other based on the dictionary values. I compare two lists, with the first like so
biglist1 = [{'transaction': 'somevalue', 'id': 'somevalue', 'date': 'somevalue', ...}, {'transaction': 'somevalue', 'id': 'somevalue', 'date': 'somevalue', ...}, ...]
With 'somevalue' standing for a user-generated string, int or decimal. Now, the second list is pretty similar, except the id-values are always empty, as they have not been assigned yet.
biglist2 = [{'transaction': 'somevalue', 'id': '', 'date': 'somevalue', ...}, {'transaction': 'somevalue', 'id': '', 'date': 'somevalue', ...}, ...]
So I want to get a list of the dictionaries in biglist2 that match the dictionaries in biglist1 for all other keys except id.
I've been doing
list_transactionnamematches = []
for item in biglist2:
    for transaction in biglist1:
        if item['transaction'] == transaction['transaction']:
            list_transactionnamematches.append(transaction)

list_datematches = []
for item in biglist2:
    for transaction in list_transactionnamematches:
        if item['date'] == transaction['date']:
            list_datematches.append(transaction)
... and so on, never comparing id values, until I get a final list of matches. Since the lists can be really big (around 3000+ items each), this takes quite some time for Python to loop through.
I'm guessing this isn't really how this kind of comparison should be done. Any ideas?

Index on the fields you want to use for lookup; the whole thing is then O(n+m).
matches = []
biglist1_indexed = {}

for item in biglist1:
    biglist1_indexed[(item["transaction"], item["date"])] = item

for item in biglist2:
    if (item["transaction"], item["date"]) in biglist1_indexed:
        matches.append(item)
This is probably thousands of times faster than what you're doing now.

What you want to do is use the correct data structures:

1. Create a dictionary mapping tuples of the other values (everything except id) in the first list's dictionaries to their id.
2. Create two sets of those tuples, one for each list, and use set operations to get the tuple set you want.
3. Use the dictionary from point 1 to assign ids to those tuples. A minimal sketch follows.
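Something like this, assuming 'transaction' and 'date' are the only keys besides 'id' (extend the key tuples with your other fields as needed):

id_by_key = {(d['transaction'], d['date']): d['id'] for d in biglist1}

keys1 = set(id_by_key)                                     # tuples from the first list
keys2 = {(d['transaction'], d['date']) for d in biglist2}  # tuples from the second list

# The intersection is the set of tuples present in both lists;
# the dictionary from step 1 then recovers the matching ids.
for key in keys1 & keys2:
    print(key, '->', id_by_key[key])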

Forgive my rusty Python syntax, it's been a while; here is the idea, cleaned up into working code:
import operator

# Sort both lists the same way, by (date, transaction).
sort_key = operator.itemgetter('date', 'transaction')
biglist1.sort(key=sort_key)
biglist2.sort(key=sort_key)

biglist3 = []
i1 = 0
i2 = 0
while i1 < len(biglist1) and i2 < len(biglist2):
    key1 = sort_key(biglist1[i1])
    key2 = sort_key(biglist2[i2])
    if key1 == key2:
        biglist3.append(biglist1[i1])
        i1 += 1
        i2 += 1
    elif key1 < key2:
        i1 += 1
    elif key1 > key2:
        i2 += 1
    else:
        print("this won't happen if I did the tuple comparison correctly")
This sorts both lists into the same order, by (date,transaction). Then it walks through them side by side, stepping through each looking for relatively adjacent matches. It assumes that (date,transaction) is unique, and that I am not completely off my rocker with regards to tuple sorting and comparison.

In O(m*n)...
for item in biglist2:
    for transaction in biglist1:
        if (item['transaction'] == transaction['transaction'] and
                item['date'] == transaction['date'] and
                item['foo'] == transaction['foo']):
            list_transactionnamematches.append(transaction)

The approach I would probably take to this is to make a very, very lightweight class with one instance variable and one method. The instance variable is a pointer to a dictionary; the method overrides the built-in special method __hash__(self), returning a value calculated from all the values in the dictionary except id.
From there the solution seems fairly obvious. Create two initially empty dictionaries: N and M (for no-matches and matches). Loop over each list exactly once, and for each dictionary representing a transaction (call it a Tx_dict), create an instance of the new class (a Tx_ptr). Then test for an item matching this Tx_ptr in N and M: if there is no matching item in N, insert the current Tx_ptr into N; if there is a matching item in N but none in M, insert the current Tx_ptr into M, with the Tx_ptr itself as the key and a list containing the Tx_ptr as the value; if there is a matching item in both N and M, append the current Tx_ptr to the list associated with that key in M.
After you've gone through every item once, your dictionary M will contain pointers to all the transactions which match other transactions, all neatly grouped together into lists for you.
Edit: Oops! Obviously, the correct action if there is a matching Tx_ptr in N but not in M is to insert a key-value pair into M with the current Tx_ptr as the key and as the value, a list of the current Tx_ptr and the Tx_ptr that was already in N.
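A minimal sketch of that class as described (note that to actually use it as a dictionary key you need __eq__ as well as __hash__, so both are defined here; everything except the 'id' key name is illustrative):

class TxPtr:
    def __init__(self, tx_dict):
        self.tx = tx_dict  # reference to the underlying transaction dict

    def _key(self):
        # All values except 'id', in a deterministic key order.
        return tuple(v for k, v in sorted(self.tx.items()) if k != 'id')

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        return self._key() == other._key()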

Have a look at Psyco. It's a specializing compiler for Python that can generate very fast, optimized machine code from your source.
http://sourceforge.net/projects/psyco/
While this isn't a direct solution to your code's efficiency issues, it could still help speed things up without requiring any new code. That said, I'd still highly recommend optimizing your code as much as possible AND using Psyco to squeeze as much speed out of it as you can.
Part of their guide specifically talks about using it to speed up functions heavy on list, string, and numeric computation.
http://psyco.sourceforge.net/psycoguide/node8.html
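For reference, typical usage was just a couple of lines at the top of your program (Psyco is Python 2 only and no longer maintained, so treat this as a historical sketch):

import psyco
psyco.full()  # JIT-compile every function as it runs

# ... the rest of your program, unchanged ...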

I'm also a newbie. My code is structured in much the same way as his.
for A in biglist:
    for B in biglist:
        if (A.get('somekey') != B.get('somekey') and  # don't match to itself
                len(set(A.get('list')) - set(B.get('list'))) > 10):
            pass  # do stuff...
This takes hours to run through a list of 10000 dictionaries. Each dictionary contains lots of stuff but I could potentially pull out just the ids ('somekey') and lists ('list') and rewrite as a single dictionary of 10000 key:value pairs.
Question: how much faster would that be? And I assume this is faster than using a list of lists, right?

Related

Python Matching Multiple Keys / Unique Pairs to a Value

What would be the fastest, most efficient way to grab and map multiple values to one value? For a use-case example, say you are multiplying two numbers and you want to remember whether you have multiplied those numbers before. Instead of making a giant X-by-Y matrix and filling it out, it would be nice to query a dict to see if dict[2,3] == 6 or dict[3,2] == 6. This would be especially useful for more than 2 values.
I have seen an answer similar to what I'm asking here, but would this be O(n) time or O(1)?
print value for matching multiple key
for key in responses:
    if user_message in key:
        print(responses[key])
Thanks!
Seems like the easiest way to do this is to sort the values before putting them in the dict, then sort the x, y, ... values before looking them up. Note that you need tuples as dictionary keys (lists are mutable, hence unhashable). A direct lookup on such a tuple key is O(1) on average, whereas looping over every key as in the snippet above is O(n).
the_dict = {(2, 3, 4): 24, (4, 5, 6): 120}
nums = tuple(sorted([6, 4, 5]))
if nums in the_dict:
    print(the_dict[nums])

How to implement dicts/sets as opposed to a list search, to increase speed

I am making a program that has to search through very long lists, and I have seen people suggesting that using sets and dicts speeds it up massively. However, I am at a loss as to how to make it work within my code. Currently, the program does this:
indexes = []
print("Collecting indexes...")
for term in sliced_5:
    indexes.append(hex_crypted.index(term))
The code searches through the hex_crypted list, which contains 1,000,000+ terms, finds the index of each term, and then appends it to the 'indexes' list.
I simply need to speed up this process. Thanks for any help.
You want to build a lookup table so you don't need to repeatedly loop over hex_crypted. Then you can simply look up each term in the table.
print("Collecting indexes...")
lookup = {term: index for (index, term) in enumerate(hex_crypted)}
indexes = [lookup[term] for term in sliced_5]
The fastest method, if you only need membership tests, is to call set() on the list to turn it into a set, but I don't think that is what you want to do in this case:
hex_crypted_set = set(hex_crypted)
If you need to keep that index for some reason, you'll want to instead build a dictionary first.
hex_crypted_dict = {}
for index, term in enumerate(hex_crypted):
    hex_crypted_dict[term] = index
And then to get that index you just search the dict:
indexes = []
for term in sliced_5:
    indexes.append(hex_crypted_dict[term])
You will end up with the appropriate indexes, corresponding to the original long list, while iterating over that long list only once instead of once per lookup, which is much better performance.
The first step is to generate a dict, for example:
hex_crypted_dict = {v: i for i, v in enumerate(hex_crypted)}
Then your code changes to:
indexes = []
hex_crypted_dict = {v: i for i, v in enumerate(hex_crypted)}
print("Collecting indexes...")
for term in sliced_5:
    indexes.append(hex_crypted_dict[term])

Python dictionary vs list, which is faster?

I was coding a Project Euler problem, and I ran into a question that sparked my curiosity. I have two snippets of code: one uses lists, the other uses dictionaries.
using lists:
n = 100000
num = []
suma = 0
for i in range(n, 1, -1):
    tmp = tuple(set([n for n in factors(i)]))
    if len(tmp) != 2: continue
    if tmp not in num:
        num.append(tmp)
        suma += i
using dictionaries:
n = 100000
num = {}
suma = 0
for i in range(n, 1, -1):
    tmp = tuple(set([n for n in factors(i)]))
    if len(tmp) != 2: continue
    if tmp not in num:
        num[tmp] = i
        suma += i
I am only concerned about performance. Why does the second example, using dictionaries, run so much faster than the first example with lists?
I tested these two snippets with n=1000000: the first ran in 1032 seconds and the second in just 3.3 seconds, hundreds of times faster. Amazin'!
In Python, the average time complexity of a dictionary key lookup is O(1), since dictionaries are implemented as hash tables. The average time complexity of lookup in a list is O(n). In your code, this makes a difference in the line if tmp not in num:, since in the list case Python needs to search through the whole list to detect membership, whereas in the dict case it does not, except in the degenerate worst case of everything hashing to the same bucket.
For more details, check out TimeComplexity.
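If you want to see the membership-test difference in isolation, here is a quick timing sketch (illustrative only; absolute numbers depend on your machine):

import timeit

data_list = list(range(100000))
data_set = set(data_list)

# Worst case for the list: the sought item is at the very end.
print(timeit.timeit('99999 in data_list', globals=globals(), number=1000))
print(timeit.timeit('99999 in data_set', globals=globals(), number=1000))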
If it's about speed, you should not create any lists:
n = 100000
pairs = ((frozenset(factors(i)), i) for i in range(2, n+1))
num = {k: v for k, v in pairs if len(k) == 2}
suma = sum(num.values())
I am almost positive that the "magic sauce" of using a dictionary lies in the fact that the dictionary is a hash table of key->value pairs.
With a list, you're dealing with an array, which means the for loop has to start at index 0 and walk through every record.
The dictionary can jump straight to the key->value pair in question via its hash and return it, hence the speed...
Basically, testing for membership in a set of key->value pairs is a lot quicker than searching an entire list for a value. The larger your list gets, the slower it will be... though this isn't always the case; there are scenarios where a list will be faster. But I believe this may be the answer you're looking for.
In a list, the code if tmp not in num: is O(n), while it is O(1) on average in a dict.
Edit: the dict is based on hashing, so it is much quicker than a linear list search. Thanks @user2357112 for pointing this out.

Text search elements in a big Python list

With a list that looks something like:
cell_lines = ["LN18_CENTRAL_NERVOUS_SYSTEM","769P_KIDNEY","786O_KIDNEY"]
With my dabbling in regular expressions, I can't figure out a compelling way to search individual strings in a list besides looping through each element and performing the search.
How can I retrieve indices containing "KIDNEY" in an efficient way (since I have a list of length thousands)?
Use a list comprehension:
[line for line in cell_lines if "KIDNEY" in line]
This is O(n), since we check every item in the list for containing "KIDNEY".
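Since the question asks for the indices rather than the strings themselves, here is a small variant using enumerate (assuming indices are what you want back):

kidney_indexes = [i for i, line in enumerate(cell_lines) if "KIDNEY" in line]
print(kidney_indexes)  # [1, 2] for the sample list above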
If you would need to make similar queries like this often, you should probably think about reorganizing your data and have a dictionary grouped by categories like KIDNEY:
{
    "KIDNEY": ["769P_KIDNEY", "786O_KIDNEY"],
    "NERVOUS_SYSTEM": ["LN18_CENTRAL_NERVOUS_SYSTEM"]
}
In this case, every "by category" lookup would take "constant" time.
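One way to build that grouping, assuming you have a known list of category substrings to test against (the category list here is an assumption):

from collections import defaultdict

categories = ["KIDNEY", "NERVOUS_SYSTEM"]
by_category = defaultdict(list)

for line in cell_lines:
    for category in categories:
        if category in line:
            by_category[category].append(line)

print(by_category["KIDNEY"])  # ['769P_KIDNEY', '786O_KIDNEY']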
You can use a set instead of a list since it performs lookups in constant time.
from bisect import bisect_left

def bi_contains(lst, item):
    """Efficient `item in lst` for sorted lists."""
    # If item is larger than the last element it's not in the list, but bisect
    # would return len(lst) as the insertion index, so check that first. Otherwise,
    # if the item is in the list, it has to be at index bisect_left(lst, item).
    return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
Slightly modifying the above code will give you pretty good efficiency.
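Hypothetical usage with the cell_lines data from the question (the list must be sorted first, since bisect assumes sorted input):

names = sorted(cell_lines)
print(bi_contains(names, "769P_KIDNEY"))  # True
print(bi_contains(names, "786O_BRAIN"))   # False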
Here's a list of the data structures available in Python along with the time complexities.
https://wiki.python.org/moin/TimeComplexity

finding first item in a list whose first item in a tuple is matched

I have a list of several thousand unordered tuples that are of the format
(mainValue, (value, value, value, value))
Given a main value (which may or may not be present), is there a 'nice' way, other than iterating through every item and incrementing a counter, to produce a list of the indexes of matching tuples, like this:
index = 0
destMatches = []
for destEntry in destList:
    if destEntry[0] == sourceMatch:
        destMatches.append(index)
    index = index + 1
So I can compare the sub values against another set, and remove the best match from the list if necessary.
This works fine, but it just seems like Python would have a better way!
Edit:
While writing the original question, I realised that I could use a dictionary keyed on the first value (in fact this list lives inside another dictionary), but even after dropping that part of the question, I still wanted to know how to do it with tuples.
With a list comprehension, your for loop can be reduced to this expression:
destMatches = [i for i,destEntry in enumerate(destList) if destEntry[0] == sourceMatch]
You can also use the built-in filter()1 function to filter your data (note that this gives you the matching entries themselves, not their indexes):
destMatches = filter(lambda destEntry: destEntry[0] == sourceMatch, destList)
1: In Python 3 filter is a class and returns a filter object.
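So on Python 3, if you need an actual list, wrap the result in list() (a one-line sketch using the names from the question):

destMatches = list(filter(lambda destEntry: destEntry[0] == sourceMatch, destList))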
