Pythonic Way to Compare Values in Two Lists of Dictionaries - python

I'm new to Python and am still trying to tear myself away from C++ coding techniques while in Python, so please forgive me if this is a trivial question. I can't seem to find the most Pythonic way of doing this.
I have two lists of dicts. The individual dicts in both lists may contain nested dicts. (It's actually some Yelp data, if you're curious.) The first list of dicts contains entries like this:
{'business_id': 'JwUE5GmEO-sH1FuwJgKBlQ',
'categories': ['Restaurants'],
'type': 'business'
...}
The second list of dicts contains entries like this:
{'business_id': 'vcNAWiLM4dR7D2nwwJ7nCA',
'date': '2010-03-22',
'review_id': 'RF6UnRTtG7tWMcrO2GEoAg',
'stars': 2,
'text': "This is a basic review",
...}
What I would like to do is extract all the entries in the second list that match specific categories in the first list. For example, if I'm interested in restaurants, I only want the entries in the second list where the business_id matches the business_id in the first list and the word Restaurants appears in the list of values for categories.
If I had these two lists as tables in SQL, I'd do a join on the business_id attribute then just a simple filter to get the rows I want (where Restaurants IN categories, or something similar).
These two lists are extremely large, so I'm running into both efficiency and memory space issues. Before I go and shove all of this into a SQL database, can anyone give me some pointers? I've messed around with Pandas some, so I do have some limited experience with that. I was having trouble with the merge process.

Suppose your lists are called l1 and l2:
All elements from l1:
[each for each in l1]
All elements from l1 with the Restaurant category:
[each for each in l1
if 'Restaurants' in each['categories']]
All elements from l2 matching id with elements from l1 with the Restaurant category:
[x for each in l1 for x in l2
if 'Restaurants' in each['categories']
and x['business_id'] == each['business_id'] ]
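If the lists are large, the double comprehension above does a full pass over l2 for every matching element of l1. A minimal sketch of the same join using a set of ids built first (sample data made up here, assuming the same 'business_id' and 'categories' keys):

```python
l1 = [{'business_id': 'a', 'categories': ['Restaurants']},
      {'business_id': 'b', 'categories': ['Bars']}]
l2 = [{'business_id': 'a', 'stars': 4},
      {'business_id': 'b', 'stars': 2}]

# Collect restaurant ids once, then filter l2 with O(1) membership tests.
restaurant_ids = {each['business_id'] for each in l1
                  if 'Restaurants' in each['categories']}
matches = [x for x in l2 if x['business_id'] in restaurant_ids]
```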

Let's define sample lists of dictionaries:
first = [
{'business_id':100, 'categories':['Restaurants']},
{'business_id':101, 'categories':['Printer']},
{'business_id':102, 'categories':['Restaurants']},
]
second = [
{'business_id':100, 'stars':5},
{'business_id':101, 'stars':4},
{'business_id':102, 'stars':3},
]
We can extract the items of interest in two steps. The first step is to collect the list of business ids that belong to restaurants:
ids = [d['business_id'] for d in first if 'Restaurants' in d['categories']]
The second step is to get the dicts that correspond to those ids:
[d for d in second if d['business_id'] in ids]
This results in:
[{'business_id': 100, 'stars': 5}, {'business_id': 102, 'stars': 3}]

Python programmers like using list comprehensions to express both their logic and their design.
List comprehensions lead to terser, more compact expressions. You're right to think of them quite a lot like a query language.
x = [comparison(a, b) for (a, b) in zip(A, B)]
x = [comparison(a, b) for (a, b) in itertools.product(A, B)]
x = [comparison(a, b) for a in A for b in B if test(a, b)]
x = [comparison(a, b) for X in Y for (a, b) in X if test(a, b, X)]
...are all patterns that I use.
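As a minimal illustration of the zip and cross-product patterns above (the data and comparisons here are made up):

```python
A = [1, 2, 3]
B = [1, 4, 3]

# Pairwise comparison of elements at the same position.
same_position = [a == b for (a, b) in zip(A, B)]

# Cross-product comparison, filtered.
close_pairs = [(a, b) for a in A for b in B if abs(a - b) <= 1]
```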

This is pretty tricky, and I had fun with it. This is what I'd do:
def match_fields(business, review):
    return (business['business_id'] == review['business_id']
            and 'Restaurants' in business['categories'])
def search_businesses(review):
    # any() consumes an iterable; the generator binds the given review to match_fields
    return any(match_fields(business, review) for business in business_list)
answer = filter(search_businesses, review_list)
This is the most readable way I found. I'm not terribly fond of list comprehensions that go past one line, and three lines is really pushing it. If you want this to look more terse, just use shorter variable names. I favor long ones for clarity.
I defined a function that returns true if an entry can be matched between lists, and a second function that helps me search through the review list. I then can say: get rid of any review that doesn't have a matching entry in the businesses list. This pattern works well with arbitrary checks between lists.

As a variation on the list-comprehension-only approaches, it may be more efficient to use a set and a generator comprehension. This is especially true if the size of your first list is very large or if the total number of restaurants is very large.
restaurant_ids = set(biz['business_id'] for biz in first if 'Restaurants' in biz['categories'])
restaurant_data = [rest for rest in second if rest['business_id'] in restaurant_ids]
Note the brute force list comprehension approach is O(len(first)*len(second)), but it uses no additional memory storage whereas this approach is O(len(first)+len(second)) and uses O(number_of_restaurants) extra memory for the set.

You could do:
restaurant_ids = [biz['business_id'] for biz in list1 if 'Restaurants' in biz['categories']]
restaurant_data = [rest for rest in list2 if rest['business_id'] in restaurant_ids]
Then restaurant_data would contain all of the dictionaries from list2 that contain restaurant data.


Refer to a list in Python from within the [ ] operator

While trying to optimize a one-liner that extracts values from a list of dicts (all strings that come after keyword/ til the next /) I got stuck in the following:
imagine this list of dicts:
my_list = [{'name':'val_1', 'url':'foo/bar/keyword/bla_1/etc'}.......{'name':'val_z', 'url':'foo/bar/keyword/bla_n/etc'}]
The numbers in val_ and bla_ are not related.
I am trying to extract all the bla_x (they could be anything, no pattern possible) words using list comprehension for 0 < x < n+1. I managed to get the index of those elements in the split string
[d['url'].split('/').index('keyword') + 1 for d in my_list]
But what I need is to access that value and not only its index, so I thought about something like this:
[d['url'].split('/')[the_resulting_split_list.index('keyword') + 1] for d in my_list]
As far as my knowledge goes, that seems impossible to do. Is there an alternative way to reach my goal of getting an output of ['bla_1', 'bla_2', ....., 'bla_n'] without having to run the split('/') operation twice in the list comprehension?
Let's not care about exceptions for now and assume input data is always correct.
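One possible way to avoid the double split, sketched under the stated assumption that the input is always well-formed: a single-element for clause binds the split result to a name inside the comprehension, so split('/') runs only once per dict.

```python
my_list = [
    {'name': 'val_1', 'url': 'foo/bar/keyword/bla_1/etc'},
    {'name': 'val_2', 'url': 'foo/bar/keyword/bla_2/etc'},
]

# 'for parts in [...]' iterates over a one-element list, effectively an
# assignment inside the comprehension.
result = [parts[parts.index('keyword') + 1]
          for d in my_list
          for parts in [d['url'].split('/')]]
```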

Python list comprehension with if else conditions

I have a small (<100) list of chemical names called detected_chems .
And a second much larger (>1000) iterable; a dictionary chem_db containing chemical names as the key, and a dictionary of chemical properties as the value. Like this:
{'chemicalx':{'property1':'smells','property2':'poisonous'},
'chemicaly':{'property1':'stinks','property2':'toxic'}}
I am trying to match all the detected chemicals with those in the database and pull their properties.
I have studied these questions/answers but can't seem to apply it to my case (sorry)
Is it possible to use 'else' in a list comprehension?
if/else in a list comprehension?
Python Nested List Comprehension with If Else
So I am making a list of results res, but instead of nested for loops with an if x in condition, I've created this.
res = [{chem:chem_db[chem]}
for det_chem in detected_chems
for chem in chem_db.keys()
if det_chem in chem]
This works to an extent!
What I (think) am doing here is creating a list of dictionaries, which will have the key:value pair of chemical names (keys) and information about the chemicals (as a dictionary itself, as values), if the detected chemical is found somewhere in the chemical database (chem_db).
The problem is not all the detected chemicals are found in the database. This is probably because of misspelling or name variation (e.g. they include numbers) or something similar.
So to solve the problem I need to identify which detected chemicals are not being matched. I thought this might be a solution:
not_matched=[]
res = [{chem:chem_db[chem]}
for det_chem in detected_chems
for chem in chem_db.keys()
if det_chem in chem else not_matched.append(det_chem)]
I am getting a syntax error, due to the else not_matched.append(det_chem) part.
I have two questions:
1) Where should I put the else condition to avoid the syntax error?
2) Can the not_matched list be built within the list comprehension, so I don't create that empty list first.
res = [{chem:chem_db[chem]}
for det_chem in detected_chems
for chem in chem_db.keys()
if det_chem in chem else print(det_chem)]
What I'd like to achieve is something like:
in: len(detected_chems)
out: 20
in: len(res)
out: 18
in: len(not_matched)
out: 2
in: print(not_matched)
out: ['chemical_strange_character$$','chemical___WeirdSPELLING']
That will help me troubleshoot the matching.
You could use:
if det_chem in chem or not_matched.append(det_chem)
That being said, if you clean up a bit as per the comments, there is a much more efficient way of doing what you want. The explanation of the above is that append returns None, so the whole if-condition evaluates to False (but the item is still appended to the not_matched list).
Re: efficiency:
res = [{det_chem:chem_db[det_chem]}
for det_chem in detected_chems
if det_chem in chem_db or not_matched.append(det_chem)]
This should be drastically faster. The for loop over the dictionary keys is an O(n) operation, while dictionaries are used precisely because lookup is O(1); so instead of retrieving the keys and comparing them one by one, we use the det_chem in chem_db lookup, which is hash based.
Bonus: dict comprehension (to address question 2)
I am not sure why a list of one-key-dicts is built but probably what needed is a dict comprehension as in:
chem_db = {1: 2, 4: 5}
detected_chems = [1, 3]
not_matched = []
res = {det_chem: chem_db[det_chem] for det_chem in detected_chems if
det_chem in chem_db or not_matched.append(det_chem)}
# output
print(res) # {1: 2}
print(not_matched) # [3]
No way I can think of to build the not_matched list while also building res using a single list/dict comprehension.
A list comprehension formally consists of up to three parts. Let's show them in an example:
[2 * i for i in range(10) if i % 3 == 0]
The first part is an expression, and it may be (or contain) the ternary operator (x if y else z).
The second part is a list (or more lists, in nested for loops) from which values are selected for a variable.
The third part (optional) is a filter (for the selecting in part 2), and the else clause is not allowed here!
So if you want to use the else branch, you have to put it into the first part, for example
[2 * i if i < 5 else 3 * i for i in range(10) if i % 3 == 0]
Your syntax error comes from the fact that comprehensions do not accept an else clause in the filter part.
You could eventually use the ... if ... else ... ternary operator to determine the value to put in your comprehension result. Something like below:
not_matched=[]
res = [{chem:chem_db[chem]} if det_chem in chem else not_matched.append(det_chem)
for det_chem in detected_chems
for chem in chem_db.keys()]
But it would be a bad idea, since you would then have None in your res for each non-match. This is because the ... if ... else ... operator always returns a value, and in your case the value would be the return value of the list.append method (= None).
You could then filter the res list to remove the None values but... meh...
A better solution would be to simply keep your first comprehension and get the difference between the original chem list and the res list:
not_matched = set(chems).difference(<the already matched chems>)
Note that I used a <the already matched chems> placeholder instead of a real chunk of code because the way you store your res is not practical at all. Indeed, it is a list of single-key dictionaries, which is nonsense: the role of a dictionary is to hold multiple values identified by keys.
A solution to this would be to make res a dictionary instead of a list, using a dict comprehension:
res = {chem: chem_db[chem]
for det_chem in detected_chems
for chem in chem_db.keys()
if det_chem in chem}
Doing this, the <the already matched chems> placeholder could be replaced by res.keys()
As an addition, even if comprehensions are a really cool feature in a lot of cases, they are not a miraculous feature which should be used everywhere.
And nested comprehensions are a real pain to read and should be avoided (in my opinion at least).
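For comparison, the same matching written as a plain loop, which fills not_matched as it goes and is arguably easier to read than the comprehension with the append trick (toy data assumed here):

```python
chem_db = {
    'chemicalx': {'property1': 'smells', 'property2': 'poisonous'},
    'chemicaly': {'property1': 'stinks', 'property2': 'toxic'},
}
detected_chems = ['chemicalx', 'mystery_chem']

res = {}
not_matched = []
for det_chem in detected_chems:
    if det_chem in chem_db:  # O(1) hash lookup against the dict's keys
        res[det_chem] = chem_db[det_chem]
    else:
        not_matched.append(det_chem)
```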
The sample code below gives you the desired output. It uses a dictionary comprehension instead of a list comprehension to capture the matched items as a dictionary, because you would want a dictionary of matched chemical items rather than a list: from a dictionary of matched items, it is easier to get their properties later. Also, you don't need to use chem_db.keys(), because the "in" operator itself searches the target in the entire sequence (be it a list or dictionary), and if the sequence is a dict, it matches the target against all the keys of the dictionary.
Code:
detected_chems=['chemical_strange_character$$','chemical___WeirdSPELLING','chem3','chem4','chem5','chem6','chem7','chem8','chem9','chem10','chem11','chem12','chem13','chem14','chem15','chem16','chem17','chem18','chem19','chem20']
chem_db = {'chem1':{'property1':'smells','property2':'poisonous'},'chem2':{'property1':'stinks','property2':'toxic'},'chem3':{'property1':'smells','property2':'poisonous'},'chem4':{'property1':'smells','property2':'poisonous'},'chem5':{'property1':'smells','property2':'poisonous'},'chem6':{'property1':'smells','property2':'poisonous'},'chem7':{'property1':'smells','property2':'poisonous'},'chem8':{'property1':'smells','property2':'poisonous'},'chem9':{'property1':'smells','property2':'poisonous'},'chem10':{'property1':'smells','property2':'poisonous'},'chem11':{'property1':'smells','property2':'poisonous'},'chem12':{'property1':'smells','property2':'poisonous'},'chem13':{'property1':'smells','property2':'poisonous'},'chem14':{'property1':'smells','property2':'poisonous'},'chem15':{'property1':'smells','property2':'poisonous'},'chem16':{'property1':'smells','property2':'poisonous'},'chem17':{'property1':'smells','property2':'poisonous'},'chem18':{'property1':'smells','property2':'poisonous'},'chem19':{'property1':'smells','property2':'poisonous'},'chem20':{'property1':'smells','property2':'poisonous'}}
not_matched = []
res = {det_chem:chem_db[det_chem]
for det_chem in detected_chems
if det_chem in chem_db or not_matched.append(det_chem)}
print(len(detected_chems))
print(len(res))
print(len(not_matched))
print(not_matched)
The output:
20
18
2
['chemical_strange_character$$', 'chemical___WeirdSPELLING']
If any further info is needed, check out: dictionary in python

Searching through a list of dictionaries to see if a key/value exists in any dictionary

so I have a list of dictionaries,
ex:
[{'title':'Green eggs and ham', 'author':'dr seuss'}, {'title':'matilda', 'author':'roald dahl'}]
What is the best way to search if outliers by malcolm gladwell exists in any of those dictionaries?
I was thinking of brute force checking each title and author, but I feel like there's gotta be a better way.
If you need all key-value pairs to match, you can just use in and have the list do the searching for you:
if {'title': 'outliers', 'author': 'malcolm gladwell'} in yourlist:
Otherwise, with no other indices, you'll have to 'manually' search the list. You can use the any function with a generator expression to make the test efficient enough (e.g. stop searching when a match is found), plus dictionary view objects to test for subsets of key-value pairs:
search = {'title': 'outliers', 'author': 'malcolm gladwell'}.viewitems()
if any(search <= d.viewitems() for d in yourlist):
would match even if dictionaries in yourlist have more keys than just title and author.
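As an aside, viewitems() is Python 2 only; in Python 3, dict.items() already returns a view that supports the same subset test. A minimal sketch (sample data made up):

```python
yourlist = [
    {'title': 'green eggs and ham', 'author': 'dr seuss'},
    {'title': 'outliers', 'author': 'malcolm gladwell', 'year': 2008},
]

# dict.items() views support set operations like <= (subset) in Python 3.
search = {'title': 'outliers', 'author': 'malcolm gladwell'}.items()
found = any(search <= d.items() for d in yourlist)
```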
You can avoid full scans by using extra indices:
authors = {}
titles = {}
for d in yourlist:
    authors.setdefault(d['author'], []).append(d)
    titles.setdefault(d['title'], []).append(d)
creates extra mappings by specific keys in the dictionaries. Now you can test for individual elements:
if any(d['title'] == 'outliers' for d in authors.get('malcolm gladwell', [])):
is a limited search just through all books by Malcolm Gladwell.
The titles and authors dictionaries map author and title strings to lists of the same dictionaries, shared with the yourlist list. However, adding or removing dictionaries from one such structure does require updating all structures. This is where a relational database comes in handy, as it is really good at keeping such indexes for you and will automatically keep these up-to-date.

Python's list comprehensions and other better practices

This relates to a project to convert a 2-way ANOVA program in SAS to Python.
I pretty much started trying to learn the language Thursday, so I know I have a lot of room for improvement. If I'm missing something blatantly obvious, by all means, let me know. I haven't got Sage up and running yet, nor numpy, so right now, this is all quite vanilla Python 2.6.1. (portable)
Primary query: Need a good set of list comprehensions that can extract the data in lists of samples in lists by factor A, by factor B, overall, and in groups of each level of factors A&B (AxB).
After some work, the data is in the following form (3 layers of nested lists):
response[a][b][n]
(meaning [a1 [b1 [n1, ... ,nN] ...[bB [n1, ...nN]]], ... ,[aA [b1 [n1, ... ,nN] ...[bB [n1, ...nN]]]
Hopefully that's clear.)
Factor levels in my example case: A=3 (0-2), B=8 (0-7), N=8 (0-7)
byA= [[a[i] for i in range(b)] for a[b] in response]
(Can someone explain why this syntax works? I stumbled into it trying to see what the parser would accept. I haven't seen that syntax attached to that behavior elsewhere, but it's really nice. Any good links on sites or books on the topic would be appreciated. Edit: Persistence of variables between runs explained this oddity. It doesn't work.)
byB=lstcrunch([[Bs[i] for i in range(len(Bs)) ]for Bs in response])
(It bears noting that zip(*response) almost does what I want. The above version isn't actually working, as I recall. I haven't run it through a careful test yet.)
byAxB= [item for sublist in response for item in sublist]
(Stolen from a response by Alex Martelli on this site. Again could someone explain why? List comprehension syntax is not very well explained in the texts I've been reading.)
ByO= [item for sublist in byAxB for item in sublist]
(Obviously, I simply reused the former comprehension here, 'cause it did what I need.)
I'd like these to end up the same datatypes, at least when looped through by the factor in question, s.t. that same average/sum/SS/et cetera functions can be applied and used.
This could easily be replaced by something cleaner:
def lstcrunch(Dlist):
    """Returns a list containing the entire
    contents of whatever is imported,
    reduced by one level.
    If a rectangular array, it reduces a dimension by one.
    lstcrunch(DataSet[a][b]) -> DataOutput[a]
    [[1, 2], [[2, 3], [2, 4]]] -> [1, 2, [2, 3], [2, 4]]
    """
    flat = []
    if islist(Dlist):  # 1D top-level list
        for i in Dlist:
            if islist(i):
                flat += i
            else:
                flat.append(i)
        return flat
    else:
        return [Dlist]
Oh, while I'm on the topic, what's the preferred way of identifying a variable as a list?
I have been using:
def islist(a):
    "Returns 'True' if input is a list and 'False' otherwise"
    return type(a) == type([])
Parting query:
Is there a way to explicitly force a shallow copy to convert to a deep copy? Or, similarly, when copying into a variable, is there a way of declaring that the assignment is supposed to replace the reference, too, and not merely the value (s.t. the assignment won't propagate to other shallow copies)? The reverse might be useful from time to time as well, so being able to control when it does or doesn't occur sounds really nice.
(I really stepped all over myself when I prepared my table for inserting by calling:
response=[[[0]*N]*B]*A
)
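To see why that expression bites: * copies references, so every row aliases the same inner list, and one assignment shows up in every row. A small demonstration, with a per-row comprehension as the fix:

```python
A, B, N = 2, 2, 2

shared = [[[0] * N] * B] * A          # all rows alias the same inner list
shared[0][0][0] = 99                  # one write appears everywhere

independent = [[[0] * N for _ in range(B)] for _ in range(A)]
independent[0][0][0] = 99             # only one cell changes
```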
Edit:
Further investigation lead to most of this working fine. I've since made the class and tested it. it works fine. I'll leave the list comprehension forms intact for reference.
def byB(array_a_b_c):
    y = range(len(array_a_b_c))
    x = range(len(array_a_b_c[0]))
    return [[array_a_b_c[i][j][k]
             for k in range(len(array_a_b_c[0][0]))
             for i in y]
            for j in x]
def byA(array_a_b_c):
    return [[repn for rowB in rowA for repn in rowB]
            for rowA in array_a_b_c]
def byAxB(array_a_b_c):
    return [rowB for rowA in array_a_b_c
            for rowB in rowA]
def byO(array_a_b_c):
    return [rep
            for rowA in array_a_b_c
            for rowB in rowA
            for rep in rowB]
def gen3d(row, col, inner):
    """Produces a 3d nested array without any naughty shallow copies.
    [row[col[inner]] named s.t. the outer can be split on, per lprn for easy display"""
    return [[[k for k in range(inner)]
             for i in range(col)]
            for j in range(row)]
def lprn(X):
    """This prints a list by lines.
    Not fancy, but works"""
    if isiterable(X):
        for line in X: print line
    else:
        print X
def isiterable(a):
    return hasattr(a, "__iter__")
Thanks to everyone who responded. Already see a noticeable improvement in code quality due to improvements in my gnosis. Further thoughts are still appreciated, of course.
byAxB= [item for sublist in response for item in sublist] Again could someone explain why?
I am sure A.M. will be able to give you a good explanation. Here is my stab at it while waiting for him to turn up.
I would approach this from left to right. Take these four words:
for sublist in response
I hope you can see the resemblance to a regular for loop. These four words are doing the ground work for performing some action on each sublist in response. It appears that response is a list of lists. In that case sublist would be a list for each iteration through response.
for item in sublist
This is again another for loop in the making. Given that we first heard about sublist in the previous "loop" this would indicate that we are now traversing through sublist, one item at a time. If I were to write these loops out without comprehensions it would look like this:
for sublist in response:
    for item in sublist:
Next, we look at the remaining words. [, item and ]. This effectively means, collect items in a list and return the resulting list.
Whenever you have trouble creating or understanding list iterations write the relevant for loops out and then compress them:
result = []
for sublist in response:
    for item in sublist:
        result.append(item)
This will compress to:
[
item
for sublist in response
for item in sublist
]
List comprehension syntax is not very well explained in the texts I've been reading
Dive Into Python has a section dedicated to list comprehensions. There is also this nice tutorial to read through.
Update
I forgot to say something. List comprehensions are another way of achieving what has been traditionally done using map and filter. It would be a good idea to understand how map and filter work if you want to improve your comprehension-fu.
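A minimal illustration of that equivalence, using a filter-then-double example:

```python
nums = range(10)

# Comprehension form: filter and transform in one expression.
comp = [2 * i for i in nums if i % 3 == 0]

# Equivalent map/filter form.
mapped = list(map(lambda i: 2 * i, filter(lambda i: i % 3 == 0, nums)))
```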
For the copy part, look into the copy module. Python simply uses references after the first object is created, so a change made through another "copy" propagates back to the original; the copy module makes real copies of objects, and you can choose between several copy modes.
It is sometimes tricky to produce the right level of recursion in your data structure; however, I think in your case it should be relatively simple. To test it out as we go, we need some sample data, say:
data = [[a,
         [b,
          range(1, 9)]]
        for b in range(8)
        for a in range(3)]
print 'Origin'
print(data)
print 'Flat'
## from this we see how to produce the c data flat
print([(a, b, c) for a, [b, c] in data])
print "Sum of data in third level = %f" % sum(point for a, [b, c] in data for point in c)
print "Sum of all data = %f" % sum(a + b + sum(c) for a, [b, c] in data)
For the type check: generally you should avoid it, but if you must (for instance when you do not want to recurse into strings), you can do it like this:
if not isinstance(data, basestring) : ....
If you need to flatten a structure by one level, you can find useful code in the Python documentation (other ways to express it are chain(*listOfLists) and the list comprehension [d for sublist in listOfLists for d in sublist]):
from itertools import chain
def flatten(listOfLists):
    "Flatten one level of nesting"
    return chain.from_iterable(listOfLists)
This does not work, though, if you have data at different depths. For a heavyweight flattener see: http://www.python.org/workshops/1994-11/flatten.py
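For data nested at mixed depths, a small recursive flattener can be sketched like this (it treats anything that is not a list or tuple as an atom, so strings stay intact):

```python
def flatten_deep(items):
    """Recursively flatten nested lists/tuples of arbitrary depth."""
    flat = []
    for item in items:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten_deep(item))
        else:
            flat.append(item)
    return flat
```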

Comparing massive lists of dictionaries in python

I never actually thought I'd run into speed-issues with python, but I have. I'm trying to compare really big lists of dictionaries to each other based on the dictionary values. I compare two lists, with the first like so
biglist1=[{'transaction':'somevalue', 'id':'somevalue', 'date':'somevalue' ...}, {'transaction':'somevalue', 'id':'somevalue', 'date':'somevalue' ...}, ...]
With 'somevalue' standing for a user-generated string, int or decimal. Now, the second list is pretty similar, except the id-values are always empty, as they have not been assigned yet.
biglist2=[{'transaction':'somevalue', 'id':'', 'date':'somevalue' ...}, {'transaction':'somevalue', 'id':'', 'date':'somevalue' ...}, ...]
So I want to get a list of the dictionaries in biglist2 that match the dictionaries in biglist1 for all other keys except id.
I've been doing
for item in biglist2:
    for transaction in biglist1:
        if item['transaction'] == transaction['transaction']:
            list_transactionnamematches.append(transaction)
for item in biglist2:
    for transaction in list_transactionnamematches:
        if item['date'] == transaction['date']:
            list_transactionnamematches.append(transaction)
... and so on, not comparing id values, until I get a final list of matches. Since the lists can be really big (around 3000+ items each), this takes quite some time for python to loop through.
I'm guessing this isn't really how this kind of comparison should be done. Any ideas?
Index on the fields you want to use for lookup. O(n+m)
matches = []
biglist1_indexed = {}
for item in biglist1:
    biglist1_indexed[(item["transaction"], item["date"])] = item
for item in biglist2:
    if (item["transaction"], item["date"]) in biglist1_indexed:
        matches.append(item)
This is probably thousands of times faster than what you're doing now.
What you want to do is to use correct data structures:
Create a dictionary of mappings of tuples of other values in the first dictionary to their id.
Create two sets of tuples of values in both dictionaries. Then use set operations to get the tuple set you want.
Use the dictionary from the point 1 to assign ids to those tuples.
Forgive my rusty python syntax, it's been a while, so consider this partially pseudocode
import operator
sort_key = operator.itemgetter('date', 'transaction')
biglist1.sort(key=sort_key)
biglist2.sort(key=sort_key)
biglist3 = []
i1 = 0
i2 = 0
while i1 < len(biglist1) and i2 < len(biglist2):
    key1 = sort_key(biglist1[i1])
    key2 = sort_key(biglist2[i2])
    if key1 == key2:
        biglist3.append(biglist1[i1])
        i1 += 1
        i2 += 1
    elif key1 < key2:
        i1 += 1
    else:
        i2 += 1
This sorts both lists into the same order, by (date,transaction). Then it walks through them side by side, stepping through each looking for relatively adjacent matches. It assumes that (date,transaction) is unique, and that I am not completely off my rocker with regards to tuple sorting and comparison.
In O(m*n)...
for item in biglist2:
    for transaction in biglist1:
        if (item['transaction'] == transaction['transaction'] and
                item['date'] == transaction['date'] and
                item['foo'] == transaction['foo']):
            list_transactionnamematches.append(transaction)
The approach I would probably take to this is to make a very, very lightweight class with one instance variable and one method. The instance variable is a pointer to a dictionary; the method overrides the built-in special method __hash__(self), returning a value calculated from all the values in the dictionary except id.
From there the solution seems fairly obvious: Create two initially empty dictionaries: N and M (for no-matches and matches.) Loop over each list exactly once, and for each of these dictionaries representing a transaction (let's call it a Tx_dict), create an instance of the new class (a Tx_ptr). Then test for an item matching this Tx_ptr in N and M: if there is no matching item in N, insert the current Tx_ptr into N; if there is a matching item in N but no matching item in M, insert the current Tx_ptr into M with the Tx_ptr itself as a key and a list containing the Tx_ptr as the value; if there is a matching item in N and in M, append the current Tx_ptr to the value associated with that key in M.
After you've gone through every item once, your dictionary M will contain pointers to all the transactions which match other transactions, all neatly grouped together into lists for you.
Edit: Oops! Obviously, the correct action if there is a matching Tx_ptr in N but not in M is to insert a key-value pair into M with the current Tx_ptr as the key and as the value, a list of the current Tx_ptr and the Tx_ptr that was already in N.
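A minimal sketch of such a wrapper class (the TxPtr naming and field layout are assumptions from the description above); note it also needs __eq__, since hash-based lookups compare by content as well as by hash:

```python
class TxPtr(object):
    """Wraps a transaction dict; hashes and compares on every field except 'id'."""
    def __init__(self, tx):
        self.tx = tx
        # Sorted tuple of (key, value) pairs, excluding 'id', as the identity.
        self._key = tuple(sorted(
            (k, v) for k, v in tx.items() if k != 'id'))

    def __hash__(self):
        return hash(self._key)

    def __eq__(self, other):
        return self._key == other._key

# Two transactions that differ only in 'id' compare (and hash) as equal.
a = TxPtr({'transaction': 't1', 'id': '42', 'date': '2010-03-22'})
b = TxPtr({'transaction': 't1', 'id': '', 'date': '2010-03-22'})
```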
Have a look at Psyco. Its a Python compiler that can create very fast, optimized machine code from your source.
http://sourceforge.net/projects/psyco/
While this isn't a direct solution to your code's efficiency issues, it could still help speed things up without needing to write any new code. That said, I'd still highly recommend optimizing your code as much as possible AND use Psyco to squeeze as much speed out of it as possible.
Part of their guide specifically talks about using it to speed up list, string, and numeric computation heavy functions.
http://psyco.sourceforge.net/psycoguide/node8.html
I'm also a newbie. My code is structured in much the same way as his.
for A in biglist:
    for B in biglist:
        if (A.get('somekey') != B.get('somekey') and  # don't match to itself
                len(set(A.get('list')) - set(B.get('list'))) > 10):
            [do stuff...]
This takes hours to run through a list of 10000 dictionaries. Each dictionary contains lots of stuff but I could potentially pull out just the ids ('somekey') and lists ('list') and rewrite as a single dictionary of 10000 key:value pairs.
Question: how much faster would that be? And I assume this is faster than using a list of lists, right?
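Replying to the follow-up: a rough sketch of that idea, building each set once up front instead of on every inner iteration (field names taken from the snippet above; the > 10 threshold is lowered to > 1 so the toy data produces matches):

```python
biglist = [
    {'somekey': 1, 'list': ['a', 'b', 'c']},
    {'somekey': 2, 'list': ['a', 'x', 'y']},
]

# Build each set exactly once, keyed by id.
sets_by_id = {d['somekey']: set(d['list']) for d in biglist}

# The quadratic loop remains, but the per-pair work is now cheap set arithmetic.
pairs = [(i, j)
         for i in sets_by_id for j in sets_by_id
         if i != j and len(sets_by_id[i] - sets_by_id[j]) > 1]
```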
