Optimizing a nested for loop with two lists - python

I have a program that searches through two separate lists, let's call them list1 and list2.
I only want to print the instances where list1 and list2 have matching items. The thing is, not all items in both lists match each other, but the first, third and fourth items should.
If they match, I want the complete items (including the mismatching parts) to be appended to two corresponding lists.
I have written the following code:
for item in list1:
    for item2 in list2:
        if (item[0] and item[2:4]) == (item[0] and item2[2:4]):
            newlist1.append(item)
            newlist2.append(item2)
            break
This works, but it's quite inefficient. For some of the larger files I'm looking through, it can take more than 10 seconds to complete the match, and it should ideally take at most half that.
What I'm thinking is that it shouldn't have to start over from the beginning of list2 each time; it should be enough to continue from the last point where there was a match. But I don't know how to write that in code.

Your condition (item[0] and item[2:4])==(item[0] and item2[2:4]) is wrong.
Apart from the fact that the second item[0] should probably be item2[0], here is what (item[0] and item[2:4]) actually does (analogously for (item2[0] and item2[2:4])):
if item[0] is 0 (falsy), the and expression returns item[0] itself, i.e. 0
if item[0] is non-zero (truthy), it returns whatever item[2:4] is
And this is then compared to the result of the second term. Thus, [0,1,1,1] would "equal" [0,2,2,2], and [1,1,1,1] would "equal" [2,1,1,1].
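A quick interpreter session confirms this, using the first pair of example values from above: both sides evaluate to 0, so the lists spuriously compare equal.
>>> item, item2 = [0, 1, 1, 1], [0, 2, 2, 2]
>>> (item[0] and item[2:4]) == (item2[0] and item2[2:4])
True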
Try using tuples instead:
if (item[0], item[2:4]) == (item2[0], item2[2:4]):
Or use operator.itemgetter as suggested in the other answer.
To speed up the pairwise matching of items from both lists, put the items from the first list into a dictionary, using those tuples as keys, then iterate over the other list and look up the matching items in the dictionary. The complexity will be O(n+m) instead of O(n*m) (n and m being the lengths of the lists).
import operator

key = operator.itemgetter(0, 2, 3)
list1_dict = {}
for item in list1:
    list1_dict.setdefault(key(item), []).append(item)
for item2 in list2:
    for item in list1_dict.get(key(item2), []):
        newlist1.append(item)
        newlist2.append(item2)
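For reference, itemgetter(0, 2, 3) extracts those three positions as a tuple; tuples are hashable, which is what makes them usable as dictionary keys here:
>>> import operator
>>> key = operator.itemgetter(0, 2, 3)
>>> key([1, 9, 5, 6])
(1, 5, 6)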

from operator import itemgetter

getter = itemgetter(0, 2, 3)
for item, item2 in zip(list1, list2):
    if getter(item) == getter(item2):
        newlist1.append(item)
        newlist2.append(item2)
This may reduce the time complexity a bit, though note that zip compares items pairwise by position, so it only finds matches that are already aligned in the two lists.

Related

Python reduce a large list for speed in a for loop

I'm trying to take a list of 4 million entries and, rather than iterate over them all, reduce the list inside the for loop that is enumerating it.
The reduction criteria are found in the loop: some later my_huge_list elements contain a combination of 2 consecutive elements that allows them to be discarded immediately.
Here I'm going to remove sublists with 1,2 and A,B in them from my_huge_list.
Please note I don't know in advance that 1,2 and A,B are illegal until I go into my for loop.
output_list = []
my_huge_list = [[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4],[0,1,2,3,4],[A,B],[0,1,3,A,B],[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4]...] # to 4m assorted entries
for sublist in my_huge_list[:]:
    pair = None
    for item_index in sublist[:-1]: # Edit for Barmar: each item in sublist is actually an object with attributes about allowed neighbors.
        if sublist[item_index + 1] in sublist[item_index].attributes['excludes_neighbors_list']:
            pair = [sublist[item_index], sublist[item_index + 1]] # TODO build a list of pairs
    if pair != None: # Don't want pair in any item of output_list
        my_huge_list = [x for x in my_huge_list if not ','.join(pair) in str(x)] # This list comprehension's sole function is to reduce my_huge_list from a 4m item list to 1.7m items
        #if '1, 2' in str(sublist): # Don't want 1,2 in any item of output_list
        #my_huge_list = [x for x in my_huge_list if not '1, 2' in str(x)] # This list comprehension's sole function is to reduce my_huge_list
        #elif 'A, B' in str(sublist): # Don't want A,B in any item of output_list
        #my_huge_list = [x for x in my_huge_list if not 'A, B' in str(x)] # This list comprehension's sole function is to reduce my_huge_list from a 1.7m item list to 1.1m items
    else:
        output_list.append(sublist)

>>> my_huge_list
[[0,1,3,4],[0,1,3,4],[0,1,3,4],[0,1,3,4]...]
So the 'for loop' unfortunately does not seem to get any faster because my_huge_list is still iterated over all 4m entries, even though it was quickly reduced by the list comprehension.
[The my_huge_list does not need to be processed in any order and does not need to be retained after this loop.]
[I have considered making the for loop into a sub-function and using map and also the shallow copy but can't figure this architecture out.]
[I'm sure by testing that the removal of list elements by list comprehension is quicker than brute-forcing all 4m sublists.]
Thanks!
Here's my dig on it:
my_huge_list = [[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4],[0,1,2,3,4],['A','B'],[0,1,3,'A','B'],[0,'A','B'],[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4]] # to 4m assorted entries
# ... do whatever and return the unwanted list ... #
# ... if needed, convert the returned items into lists before putting them into unwanted ... #
unwanted = [[1, 2], ['A', 'B']]
index = 0
while index < len(my_huge_list):
    sublist = my_huge_list[index]
    next = True
    for u in unwanted:
        # check every run of len(u) consecutive elements for a match
        if u in [sublist[j:j+len(u)] for j in range(len(sublist)-len(u)+1)] or u == sublist:
            my_huge_list.pop(index)
            next = False
            break  # this sublist is already removed; don't test it against further patterns
    index += next
print(my_huge_list)
# [[0, 1, 3, 4], [0, 1, 3, 4], [0, 1, 3, 4], [0, 1, 3, 4]]
It's not elegant but it gets the job done. A huge caveat is that modifying a list while iterating over it is bad karma (pros will probably shake their heads at me), but with a size of 4 million you can understand I'm trying to save some memory by modifying in place.
This also scales: if you have multiple unwanted patterns of different sizes, it should still catch them in your huge list. If your pattern size is 1, match the element type used in my_huge_list: e.g. if my_huge_list contains a [1], your unwanted entry should be [1] as well; if the element is a string instead of a list, you'll need that string in unwanted. An int/float will, however, break the current code, since you can't iterate over it, but you can add extra handling before iterating through unwanted.
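The core of the check is the sliding-window comprehension: it lists every run of len(u) consecutive elements, so u is found wherever it occurs. A quick interpreter check with one of the sample sublists:
>>> sublist, u = [0, 1, 3, 'A', 'B'], ['A', 'B']
>>> [sublist[j:j+len(u)] for j in range(len(sublist)-len(u)+1)]
[[0, 1], [1, 3], [3, 'A'], ['A', 'B']]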
You're iterating over your huge list repeatedly: once in the main for loop, and then, every time you find an invalid element, again in a list comprehension that removes all of the invalid elements.
It would be better to simply filter those elements out of the list once with a list comprehension.
def sublist_contains(l, pair):
    # True if pair occurs as two consecutive elements of l
    for i in range(len(l) - 1):
        if l[i] == pair[0] and l[i+1] == pair[1]:
            return True
    return False

output_list = [sublist for sublist in my_huge_list
               if not (sublist_contains(sublist, ['A', 'B']) or sublist_contains(sublist, [1, 2]))]
My sublist_contains() function assumes it's always just two elements in a row you have to test for. You can replace this with a more general function if necessary. See elegant find sub-list in list
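A quick check with a small stand-in list (the values here are made up) shows the filter keeping only sublists without either forbidden pair:
my_huge_list = [[0, 1, 2, 3], [0, 1, 3, 4], ['A', 'B'], [0, 2, 1, 4]]
output_list = [sublist for sublist in my_huge_list
               if not (sublist_contains(sublist, ['A', 'B'])
                       or sublist_contains(sublist, [1, 2]))]
print(output_list)  # [[0, 1, 3, 4], [0, 2, 1, 4]]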

filling a list with the indexes of another list

I have two lists, list1 and list2, and I want to get all the indexes of the elements of list1 that are also in the second one.
for i in list1:
    print(i)  # this works fine
    Test_features_index.append(list1.index(i for i in list2))  # here not that well
Running this doesn't work; here is what I get:
<ipython-input-35-8d7ff70a8be0> in <module>()
----> 1 Test_features_index.append(list1.index(i for i in list2))
ValueError: <generator object <genexpr> at 0x0000021710BBA7D8> is not in list
Any idea how to do that? I wanted to avoid a for loop, but I'm not sure that's possible.
You are trying to find the index of a generator expression, which is supposedly in your list. Besides, calling list.index repeatedly is not very performant, since each call scans the list (its entire length, in the worst case).
You can instead use a list comprehension with enumerate:
set2 = set(list2)
Test_features_index = [i for i, x in enumerate(list1) if x in set2]
Using a set for the lookup of shared items ensures O(1) lookup time, as opposed to O(n) for lists.
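A small demo with made-up lists (the values are only illustrative):
list1 = ['a', 'b', 'c', 'd']
list2 = ['b', 'd', 'x']
set2 = set(list2)
Test_features_index = [i for i, x in enumerate(list1) if x in set2]
print(Test_features_index)  # [1, 3]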

2d-list calculations

I have two 2-dimensional lists. Each list item contains a list with a string ID and an integer. I want to subtract the integers from each other where the string ID matches.
List 1:
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
List 2:
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
I want to end up with
difference = [['ID_001',500],['ID_002',1000],['ID_003',2000]]
Notice that the elements aren't necessarily in the same order in both lists. Both lists will be the same length and there is an integer corresponding to each ID in both lists.
I would also like this to be done efficiently as both lists will have thousands of records.
from collections import defaultdict

diffs = defaultdict(int)
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]

for pair in list1:
    diffs[pair[0]] = pair[1]
for pair in list2:
    diffs[pair[0]] -= pair[1]

differences = [[k, abs(v)] for k, v in diffs.items()]
print(differences)
I was curious so I ran a few timeits comparing my answer to Jim's. They seem to run in about the same time. You can cut the runtime of mine in half if you're willing to accept the output as a dictionary, however.
His is, of course, more Pythonic, if that's important to you.
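For reference, a sketch of that dictionary-output variant (my reading of the remark, not the answerer's code): skip building the list of pairs and keep a dict instead.
# hypothetical variant: keep the result as a dict rather than a list of pairs
differences = {k: abs(v) for k, v in diffs.items()}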
You could achieve this by using a list comprehension:
diff = [(i[0], abs(i[1] - j[1])) for i,j in zip(sorted(list1), sorted(list2))]
This first sorts both lists with sorted so that their entries line up (not with list.sort(), which sorts in place), and then creates tuples pairing the corresponding entries from the two lists, e.g. ['ID_001', 1000] with ['ID_001', 500], by feeding the sorted lists to zip.
Finally:
(i[0], abs(i[1] - j[1]))
returns i[0], the ID of each entry, while abs(i[1] - j[1]) computes the absolute difference of the values. These are added as a tuple to the final result list (note the parentheses surrounding them).
In general, sorted might slow you down if you have a large amount of data (it is O(n log n)), though how much depends on how disordered the data is.
Other than that, zip creates an iterator, so memory-wise it doesn't affect you. Speed-wise, list comprehensions tend to be quite efficient and in most cases are your best option.
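As a quick check, running the one-liner on the sample lists from the question produces the expected differences:
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
diff = [(i[0], abs(i[1] - j[1])) for i, j in zip(sorted(list1), sorted(list2))]
print(diff)  # [('ID_001', 500), ('ID_002', 1000), ('ID_003', 2000)]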

Python list index splitting and manipulation

My question seems simple, but for a novice to Python like myself this is starting to get too complex for me, so here's the situation:
I need to take a list such as:
L = [(a, b, c), (d, e, d), (etc, etc, etc), (etc, etc, etc)]
and make each index an individual list so that I may pull elements from each index specifically. The problem is that the list I am actually working with contains hundreds of indices such as the ones above and I cannot make something like:
L_new = list(L['insert specific index here'])
for each one, as that would mean filling up memory with hundreds of lists corresponding to individual indices of the first list, which would be far too time- and memory-consuming from my point of view. So my question is this: how can I separate those indices and then pull individual parts from them without needing to create hundreds of individual lists (or at least without needing hundreds of individual lines to create them)?
I might be misreading your question, but I'm inclined to say that you don't actually have to do anything to be able to index your tuples. See my comment, but: L[0][0] will give "a", L[0][1] will give "b", L[2][1] will give "etc" etc...
If you really want a clean way to turn this into a list of lists you could use a list comprehension:
cast = [list(entry) for entry in L]
In response to your comment: if you want to access across dimensions I would suggest list comprehension. For your comment specifically:
crosscut = [entry[0] for entry in L]
In response to comment 2: This is largely a part of a really useful operation called slicing. Specifically to do the referenced operation you would do this:
multiple_index = [entry[0:3] for entry in L]
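A small demo of those comprehensions with placeholder data (the values here are made up):
L = [(1, 2, 3), (4, 5, 6)]
crosscut = [entry[0] for entry in L]          # [1, 4]
multiple_index = [entry[0:3] for entry in L]  # [(1, 2, 3), (4, 5, 6)]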
Depending on your readability preferences there are actually a number of possibilities here:
list_of_lists = []
for sublist in L:
    list_of_lists.append(list(sublist))

# or with an explicit iterator, wrapped in a generator function
# (next(iterator) is the Python 3 spelling; iterator.next() is Python 2 only)
def as_lists(L):
    iterator = iter(L)
    for _ in range(len(L)):
        yield list(next(iterator))  # lazy evaluation: one converted sublist per step
What you have there is a list of tuples; access them like a list of lists:
L[3][2]
will get the third element (index 2) from the fourth tuple (index 3) in your list L.
Two ways of using inner lists:
for index, sublist in enumerate(L):
    # do something with sublist
    pass
or with an iterator:
iterator = iter(L)
sublist = next(iterator)  # <-- yields the first sublist
In both cases, sublist elements can be reached via
direct indexing:
sublist[2]
or iteration:
iterator = iter(sublist)
next(iterator)  # <-- yields the first elem of sublist
for elem in sublist:
    # do something with elem
    pass

append/extend list in loop

I would like to extend a list while looping over it:
for idx in xrange(len(a_list)):
    item = a_list[idx]
    a_list.extend(fun(item))
(fun is a function that returns a list.)
Question:
Is this already the best way to do it, or is something nicer and more compact possible?
Remarks:
from matplotlib.cbook import flatten
a_list.extend(flatten(fun(item) for item in a_list))
should work but I do not want my code to depend on matplotlib.
for item in a_list:
    a_list.extend(fun(item))
would be nice enough for my taste but seems to cause an infinite loop.
Context:
I have a large number of nodes (in a dict), and some of them are special because they are on the boundary.
'a_list' contains the keys of these special/boundary nodes. Sometimes nodes are added, and then every new node that is on the boundary needs to be added to 'a_list'. The new boundary nodes can be determined from the old boundary nodes (expressed here by 'fun'), and every boundary node can add several new nodes.
Have you tried list comprehensions? This works by creating a separate list in memory, then assigning it to your original list once the comprehension is complete. Basically it's the same as your second example, but instead of importing a flattening function, it flattens via stacked list comprehensions. [edit Matthias: changed + to +=]
a_list += [x for lst in [fun(item) for item in a_list] for x in lst]
EDIT: To explain what's going on.
So the first thing that will happen is this part in the middle of the above code:
[fun(item) for item in a_list]
This will apply fun to every item in a_list and add it to a new list. Problem is, because fun(item) returns a list, now we have a list of lists. So we run a second (stacked) list comprehension to loop through all the lists in our new list that we just created in the original comprehension:
for lst in [fun(item) for item in a_list]
This will allow us to loop through all the lists in order. So then:
[x for lst in [fun(item) for item in a_list] for x in lst]
This means take every x (that is, every item) in every lst (all the lists we created in our original comprehension) and add it to a new list.
Hope this is clearer. If not, I'm always willing to elaborate further.
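A quick sanity check of the stacked comprehension, with a hypothetical stand-in for fun (the question doesn't define it):
fun = lambda x: [x, x * 10]  # stand-in: returns a list, as fun does in the question
a_list = [1, 2]
a_list += [x for lst in [fun(item) for item in a_list] for x in lst]
print(a_list)  # [1, 2, 1, 10, 2, 20]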
Using itertools, it can be written as:
import itertools
a_list += itertools.chain(*itertools.imap(fun, a_list))
or, if you're aiming for code golf:
a_list += sum(map(fun, a_list), [])
Alternatively, just write it out:
new_elements = map(fun, a_list)  # itertools.imap in Python 2.x
for ne in new_elements:
    a_list.extend(ne)
As you want to extend the list, but loop only over the original list, you can loop over a copy instead of the original:
for item in a_list[:]:
    a_list.extend(fun(item))
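The slice a_list[:] takes a snapshot, so the loop sees only the original items while extend appends to the live list. A quick check with a hypothetical stand-in for fun:
fun = lambda x: [x + 100]  # stand-in for the question's fun
a_list = [1, 2]
for item in a_list[:]:
    a_list.extend(fun(item))
print(a_list)  # [1, 2, 101, 102]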
Using a generator:
original_list = [1, 2]
original_list.extend(x for x in original_list[:])
# [1, 2, 1, 2]
