Python: remove lots of items from a list

I am in the final stretch of a project I have been working on. Everything is running smoothly but I have a bottleneck that I am having trouble working around.
I have a list of tuples. The list ranges in length from roughly 40,000 to 1,000,000 records. I also have a dictionary in which every (value, key) pair appears as a tuple in the list.
So, I might have
myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}
I want to remove each (v, k) tuple from the list.
Currently I am doing:
for k, v in myDict.iteritems():
    myList.remove((v, k))
Removing 838 tuples from a list containing 20,000 tuples takes anywhere from 3 to 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000, so I need this to be faster.
Is there a better way to do this?
I can provide code used to test, plus pickled data from the actual application if needed.

You'll have to measure, but I can imagine this being more performant:
myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)
because the lookup happens in the dict, which is better suited for this kind of thing. Note, though, that this will create a new list before removing the old one, so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggests might be in order.
Edit: Be careful, though, if None is actually in your list -- you'd have to use a different "placeholder."
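For instance, here is a minimal sketch of the placeholder idea (assuming Python 2, to match the iteritems usage in the question); a fresh object() can never compare equal to any real value:
sentinel = object()  # unique placeholder; cannot collide with list values
myList = filter(lambda x: myDict.get(x[1], sentinel) != x[0], myList)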

To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:
totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]
The preparation of the set is a small one-time cost, which saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assigning to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name -- you really want to rebind the contents!-).
I don't have your test data around to do the time measurement myself, alas, but let me know how it plays out on your test data!
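For concreteness, here is the set-based approach run on the question's sample data (purely illustrative, Python 2):
myList = [(20000, 11), (16000, 4), (14000, 9)]
myDict = {11: 20000, 9: 14000}
totoss = set((v, k) for (k, v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]  # rebinds the contents in place
print myList   # [(16000, 4)]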
If the values are not hashable (e.g. they're sub-lists), fastest is probably:
sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[1], sentinel) != x[0]]
or maybe (shouldn't make a big difference either way, but I suspect the previous one is better -- indexing is cheaper than unpacking and repacking):
sentinel = object()
myList[:] = [(a, b) for (a, b) in myList if myDict.get(b, sentinel) != a]
In these two variants the sentinel idiom is used to ward against values of None (which is not a problem for the preferred set-based approach -- if values are hashable!) as it's going to be way cheaper than if b not in myDict or myDict[b] != a (which requires two lookups into myDict).

Every time you call myList.remove, Python has to scan over the entire list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.
Have you tried doing the "inverse" operation of:
newMyList = [(v, k) for (v, k) in myList if k not in myDict]
But I'm really not sure how well that would scale, either, since you would be making a copy of the original list -- could potentially be a lot of memory usage there.
Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.

[(i, j) for i, j in myList if myDict.get(j) != i]

Try something like this:
myListSet = set(myList)
myDictSet = set(zip(myDict.values(), myDict.keys()))
myList = list(myListSet - myDictSet)
This will convert myList to a set, swap the keys/values in myDict and put them into a set, then find the difference, turn it back into a list, and assign it back to myList. Note that this approach discards both the original ordering and any duplicate tuples, so only use it if neither matters. :)

The problem looks to me to be the fact that you are using a list as the container you are trying to remove from: finding an item in a list is a linear operation (O(n)), since it has to iterate over the whole list until it finds a match.
If you could swap the list for some other container (set?) which uses a hash() of each item to index them, then each lookup could be performed much more quickly.
The following code shows how you could do this using a combination of ideas offered by myself and Nick on this thread:
list_set = set(original_list)
dict_set = set(zip(original_dict.values(), original_dict.keys()))
difference_set = list_set - dict_set   # keep this as a set, for O(1) lookups

final_list = []
for item in original_list:             # preserves the original ordering
    if item in difference_set:
        final_list.append(item)

pairs = set(zip(myDict.values(), myDict.keys()))  # build the set once, not per item
[i for i in myList if i not in pairs]

A list containing a million 2-tuples is not large on most machines running Python. However, if you absolutely must do the removal in situ, here is a clean way of doing it properly:
def filter_by_dict(my_list, my_dict):
    sentinel = object()
    # iterate backwards so each del doesn't shift the indices still to visit
    for i in xrange(len(my_list) - 1, -1, -1):
        key = my_list[i][1]
        if my_dict.get(key, sentinel) is not sentinel:
            del my_list[i]
Update: actually each del costs O(n), shuffling the list pointers down using C's memmove(), so with d dels it's O(n*d), not O(n**2). Note that (1) the OP suggests that d is roughly 0.01 * n, and (2) the O(n*d) effort is just copying one pointer to somewhere else in memory ... so this method could in fact be somewhat faster than a quick glance would indicate. Benchmarks, anyone?
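Here is a rough timeit sketch one could use (the sizes and random data are made up to match the question's scale; filter_by_dict is the function above):
import random, timeit

n, d = 1000000, 10000
myList = [(random.randrange(10 ** 6), i) for i in xrange(n)]
# pick d (value, key) pairs from the list to build the dict of items to remove
myDict = dict((k, v) for (v, k) in random.sample(myList, d))

setup = "from __main__ import myList, myDict, filter_by_dict"
# copy the list each run so every timing starts from the same input
print timeit.timeit("filter_by_dict(list(myList), myDict)", setup, number=3)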
What are you going to do with the list after you have removed the items that are in the dict? Is it possible to piggy-back the dict-filtering onto the next step?

Related

List of lists updates an entire column when given a specific element [duplicate]


What's the most efficient way to perform a multiple match lookup in a python dictionary?

I'm looking to maximally optimize the runtime for this chunk of code:
aDictionary= {"key":["value", "value2", ...
rests = list(map((lambda key: Resp(key=key)),
                 [key for key, values in aDictionary.items()
                  if (test1 in values or test2 in values)]))
Using Python 3, and willing to throw as much memory at it as possible. I'm considering running the two dictionary lookups on separate processes for speedup (does that make sense?). Any other optimization ideas are welcome.
values can definitely be sorted and turned into a set; it is precomputed and very, very large.
Always len(values) >> len(tests), though both grow over time.
len(tests) grows very slowly, and has new values for each execution.
Currently the entries are strings (I'm considering a string-to-integer mapping).
For starters, there is no reason to use map when you are already using a list comprehension, so you can remove that, as well as the outer list call:
rests = [Resp(key=key) for key, values in aDictionary.items()
         if (test1 in values or test2 in values)]
A second possible optimization might be to turn each list of values into a set. That would take up time initially, but it would change your lookups (in uses) from linear time to constant time. You might need to create a separate helper function for that. Something like:
def anyIn(checking, checkingAgainst):
    checkingAgainst = set(checkingAgainst)
    for val in checking:
        if val in checkingAgainst:
            return True
    return False
Then you could change the end of your list comprehension to read
...if anyIn([test1, test2], values)]
But again, this would probably only be worth it if you had more than two values you were checking, or if the list of values in values is very long.
If tests are sufficiently numerous, it will surely pay off to switch to set operations:
tests = set([test1, test2, ...])
resps = map(Resp, (k for k, values in dic.items() if not tests.isdisjoint(values)))
# resps is a lazy iterable, not a list, and it uses a generator
# inside, thus saving the overhead of building an intermediate list.
Turning the dict values into sets would not gain anything: the conversion would be O(N), with N being the combined size of all the values lists, while the disjoint check above only iterates over each values list until it encounters one of the tests, with O(1) lookup per element.
map is possibly more performant compared to a comprehension if you do not have to use lambda, e.g. if key can be used as the first positional argument in Resp's __init__, but certainly not with the lambda! (Python List Comprehension Vs. Map). Otherwise, a generator or comprehension will be better:
resps = (Resp(key=k) for k, values in dic.items() if not tests.isdisjoint(values))
#resps = [Resp(key=k) for k, values in dic.items() if not tests.isdisjoint(values)]
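As a tiny illustration of the isdisjoint check (the sample data is made up): it stops at the first shared element and never materializes the right-hand argument as a set:
tests = {"t1", "t2"}
print(not tests.isdisjoint(["a", "t1", "b"]))  # True: "t1" is shared
print(not tests.isdisjoint(["a", "b", "c"]))   # False: no overlap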

What is the fastest way to add data to a list without duplication in python (2.5)

I have about half a million items that need to be placed in a list; I can't have duplications, and if an item is already there I need to get its index. So far I have:
if Item in List:
    ItemNumber = List.index(Item)
else:
    List.append(Item)
    ItemNumber = List.index(Item)
The problem is that as the list grows it gets progressively slower until at some point it just isn't worth doing. I am limited to python 2.5 because it is an embedded system.
You can use a set (in CPython since version 2.4) to efficiently look up duplicate values. If you really need an indexed system as well, you can use both a set and list.
Doing your lookups using a set will remove the overhead of if Item in List, but not that of List.index(Item)
Please note ItemNumber=List.index(Item) will be very inefficient to do after List.append(Item). You know the length of the list, so your index can be retrieved with ItemNumber = len(List)-1.
To completely remove the overhead of List.index (because that method searches through the list, which is very inefficient on larger lists), you can use a dict mapping Items back to their index.
I might rewrite it as follows:
# earlier in the program, NOT inside the loop
Dup = {}

# inside your loop to add items:
if Item in Dup:
    ItemNumber = Dup[Item]
else:
    List.append(Item)
    Dup[Item] = ItemNumber = len(List) - 1
If you really need to keep the data in an array, I'd use a separate dictionary to keep track of duplicates. This requires twice as much memory, but won't slow down significantly.
existing = dict()

if Item in existing:
    ItemNumber = existing[Item]
else:
    ItemNumber = existing[Item] = len(List)
    List.append(Item)
However, if you don't need to save the order of items you should just use a set instead. This will take almost as little space as a list, yet will be as fast as a dictionary.
Items = set()
# ...
Items.add(Item) # will do nothing if Item is already added
Both of these require that your object is hashable. In Python, most types are hashable unless they are a container whose contents can be modified. For example: lists are not hashable because you can modify their contents, but tuples are hashable because you cannot.
If you were trying to store values that aren't hashable, there isn't a fast general solution.
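To illustrate the hashability point (Python 2 syntax, matching the question's 2.5 constraint):
s = set()
s.add((1, 2))        # fine: tuples are immutable, hence hashable
try:
    s.add([1, 2])    # lists are mutable, hence unhashable
except TypeError, e:
    print e          # reports that lists are unhashable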
You can improve the check a lot:
check = set(List)
for Item in NewList:
    if Item in check:
        ItemNumber = List.index(Item)   # note: index() is still O(n)
    else:
        ItemNumber = len(List)
        List.append(Item)
        check.add(Item)                 # keep the lookup set in sync with the list
Or, even better, if order is not important you can do this:
oldlist = set(List)
addlist = set(AddList)
newlist = list(oldlist | addlist)
And if you need to loop over the items that were duplicated:
for item in (oldlist & addlist):
    pass # do stuff

Python: fastest way to create a list of n lists

So I was wondering how to best create a list of blank lists:
[[],[],[]...]
Because of how Python works with lists in memory, this doesn't work:
[[]]*n
This does create [[],[],...] but each element is the same list:
d = [[]]*n
d[0].append(1)
#[[1],[1],...]
Something like a list comprehension works:
d = [[] for x in xrange(0,n)]
But this uses the Python VM for looping. Is there any way to use an implied loop (taking advantage of it being written in C)?
d = []
map(lambda n: d.append([]),xrange(0,10))
This is actually slower. :(
Probably the only way which is marginally faster than
d = [[] for x in xrange(n)]
is
from itertools import repeat
d = [[] for i in repeat(None, n)]
It does not have to create a new int object in every iteration and is about 15% faster on my machine.
Edit: Using NumPy, you can avoid the Python loop using
d = numpy.empty((n, 0)).tolist()
but this is actually 2.5 times slower than the list comprehension.
The list comprehensions actually are implemented more efficiently than explicit looping (see the dis output for example functions), and the map way has to invoke an opaque callable object on every iteration, which incurs considerable overhead.
Regardless, [[] for _dummy in xrange(n)] is the right way to do it and none of the tiny (if existent at all) speed differences between various other ways should matter. Unless of course you spend most of your time doing this - but in that case, you should work on your algorithms instead. How often do you create these lists?
Here are two methods: one sweet and simple (and conceptual), the other more formal and extensible to a variety of situations after reading a dataset.
Method 1: Conceptual
X2=[]
X1=[1,2,3]
X2.append(X1)
X3=[4,5,6]
X2.append(X3)
X2 now holds [[1,2,3],[4,5,6]], i.e. a list of lists.
Method 2: Formal and extensible
Another elegant way is to store a list as a list of lists of numbers read from a file. (The file here holds the dataset train.)
train is a dataset with, say, 50 rows and 20 columns; i.e. train[0] gives me the first row of a csv file, train[1] gives me the second row, and so on. I am interested in splitting the 50-row dataset into one list per row, excluding column 0 (which is my explained variable here, so it must be removed from the original train dataset), and then stacking the rows up list after list, i.e. a list of lists. Here's the code that does that.
Note that I am reading from "1" in the inner loop since I am interested in explanatory variables only, and I re-initialize X1=[] in the outer loop, else X2.append(X1[0:(len(train[0])-1)]) would append the same X1 over and over again; besides, it is more memory efficient.
X2 = []
for j in range(0, len(train)):
    X1 = []
    for k in range(1, len(train[0])):
        txt2 = train[j][k]
        X1.append(txt2)
    X2.append(X1[0:(len(train[0]) - 1)])
To create a list of lists, use the syntax below:
x = [[] for i in range(10)]
This creates a one-dimensional list of 10 empty lists; to initialize each inner list with a number, put the number inside the inner brackets ([[number] for i in range(length)]) and set the length via range(length).
To create a list of lists of lists, use the syntax below:
x = [[[0] for i in range(3)] for i in range(10)]
This initializes a 10x3 structure in which every innermost list holds the value 0.
To access/manipulate an element (note the extra level of nesting):
x[1][2][0] = value
So I did some speed comparisons to get the fastest way.
List comprehensions are indeed very fast. The only way to get close is to avoid bytecode being executed during construction of the list.
My first attempt was the following method, which would appear to be faster in principle:
l = [[]]
for _ in range(n): l.extend(map(list,l))
(produces a list of length 2**n, of course)
This construction is twice as slow as the list comprehension, according to timeit, for both short and long (a million) lists.
My second attempt was to use starmap to call the list constructor for me. There is one construction which appears to run the list constructor at top speed, but is still slower, though only by a tiny amount:
from itertools import starmap
l = list(starmap(list,[()]*(1<<n)))
Interestingly enough, the execution time suggests that it is the final list call that makes the starmap solution slow, since its execution time is almost exactly equal to the speed of:
l = list([] for _ in range(1<<n))
My third attempt came when I realized that list(()) also produces a list, so I tried the apparently simple:
l = list(map(list, [()]*(1<<n)))
but this was slower than the starmap call.
Conclusion: for the speed maniacs:
Do use the list comprehension.
Only call functions, if you have to.
Use builtins.
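If you want to reproduce the comparison, here is a rough timeit sketch (Python 2, since the thread uses xrange; absolute numbers will vary by machine and interpreter version):
import timeit

n = 100000
print timeit.timeit("[[] for _ in xrange(n)]",
                    "from __main__ import n", number=100)
print timeit.timeit("[[] for _ in repeat(None, n)]",
                    "from itertools import repeat\nfrom __main__ import n",
                    number=100)
print timeit.timeit("list(map(list, [()] * n))",
                    "from __main__ import n", number=100)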

Python's list comprehensions and other better practices

This relates to a project to convert a 2-way ANOVA program in SAS to Python.
I pretty much started trying to learn the language Thursday, so I know I have a lot of room for improvement. If I'm missing something blatantly obvious, by all means, let me know. I haven't got Sage up and running yet, nor numpy, so right now this is all quite vanilla, portable Python 2.6.1.
Primary query: I need a good set of list comprehensions that can extract the data into lists of samples by factor A, by factor B, overall, and in groups for each combination of levels of factors A and B (AxB).
After some work, the data is in the following form (3 layers of nested lists):
response[a][b][n]
(meaning [a1 [b1 [n1, ... ,nN] ...[bB [n1, ...nN]]], ... ,[aA [b1 [n1, ... ,nN] ...[bB [n1, ...nN]]]
Hopefully that's clear.)
Factor levels in my example case: A=3 (0-2), B=8 (0-7), N=8 (0-7)
byA= [[a[i] for i in range(b)] for a[b] in response]
(Can someone explain why this syntax works? I stumbled into it trying to see what the parser would accept. I haven't seen that syntax attached to that behavior elsewhere, but it's really nice. Any good links on sites or books on the topic would be appreciated. Edit: Persistence of variables between runs explained this oddity. It doesn't work.)
byB = lstcrunch([[Bs[i] for i in range(len(Bs))] for Bs in response])
(It bears noting that zip(*response) almost does what I want. The above version isn't actually working, as I recall. I haven't run it through a careful test yet.)
byAxB= [item for sublist in response for item in sublist]
(Stolen from a response by Alex Martelli on this site. Again could someone explain why? List comprehension syntax is not very well explained in the texts I've been reading.)
ByO= [item for sublist in byAxB for item in sublist]
(Obviously, I simply reused the former comprehension here, 'cause it did what I need. Edit:)
I'd like these to end up the same datatypes, at least when looped through by the factor in question, s.t. that same average/sum/SS/et cetera functions can be applied and used.
This could easily be replaced by something cleaner:
def lstcrunch(Dlist):
    """Returns a list containing the entire
    contents of whatever is imported,
    reduced by one level.
    If a rectangular array, it reduces a dimension by one.
    lstcrunch(DataSet[a][b]) -> DataOutput[a]
    [[1, 2], [[2, 3], [2, 4]]] -> [1, 2, [2, 3], [2, 4]]
    """
    flat = []
    if islist(Dlist):  # 1D top-level list
        for i in Dlist:
            if islist(i):
                flat += i
            else:
                flat.append(i)
        return flat
    else:
        return [Dlist]
Oh, while I'm on the topic, what's the preferred way of identifying a variable as a list?
I have been using:
def islist(a):
    "Returns 'True' if input is a list and 'False' otherwise"
    return type(a) == type([])
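As a side note, a more idiomatic check uses isinstance, which also accepts list subclasses (a suggestion, not part of the original code):
def islist(a):
    """Return True if a is a list (or a list subclass)."""
    return isinstance(a, list)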
Parting query:
Is there a way to explicitly force a shallow copy to become a deep copy? Or, similarly, when copying into a variable, is there a way of declaring that the assignment should replace the reference, too, and not merely the value, so that the assignment won't propagate to other shallow copies? Being able to control when that does or doesn't occur sounds really useful.
(I really stepped all over myself when I prepared my table for inserting by calling:
response=[[[0]*N]*B]*A
)
Edit:
Further investigation led to most of this working fine. I've since made the class and tested it; it works fine. I'll leave the list comprehension forms intact for reference.
def byB(array_a_b_c):
    y = range(len(array_a_b_c))
    x = range(len(array_a_b_c[0]))
    return [[array_a_b_c[i][j][k]
             for k in range(len(array_a_b_c[0][0]))
             for i in y]
            for j in x]

def byA(array_a_b_c):
    return [[repn for rowB in rowA for repn in rowB]
            for rowA in array_a_b_c]

def byAxB(array_a_b_c):
    return [rowB for rowA in array_a_b_c
            for rowB in rowA]

def byO(array_a_b_c):
    return [rep
            for rowA in array_a_b_c
            for rowB in rowA
            for rep in rowB]

def gen3d(row, col, inner):
    """Produces a 3d nested array without any naughty shallow copies.
    [row[col[inner]]], named s.t. the outer can be split on, per lprn for easy display"""
    return [[[k for k in range(inner)]
             for i in range(col)]
            for j in range(row)]

def lprn(X):
    """This prints a list by lines.
    Not fancy, but works"""
    if isiterable(X):
        for line in X:
            print line
    else:
        print X

def isiterable(a):
    return hasattr(a, "__iter__")
Thanks to everyone who responded. Already see a noticeable improvement in code quality due to improvements in my gnosis. Further thoughts are still appreciated, of course.
byAxB = [item for sublist in response for item in sublist] -- again, could someone explain why?
I am sure A.M. will be able to give you a good explanation. Here is my stab at it while waiting for him to turn up.
I would approach this from left to right. Take these four words:
for sublist in response
I hope you can see the resemblance to a regular for loop. These four words are doing the ground work for performing some action on each sublist in response. It appears that response is a list of lists. In that case sublist would be a list for each iteration through response.
for item in sublist
This is again another for loop in the making. Given that we first heard about sublist in the previous "loop" this would indicate that we are now traversing through sublist, one item at a time. If I were to write these loops out without comprehensions it would look like this:
for sublist in response:
    for item in sublist:
Next, we look at the remaining words. [, item and ]. This effectively means, collect items in a list and return the resulting list.
Whenever you have trouble creating or understanding list iterations write the relevant for loops out and then compress them:
result = []
for sublist in response:
    for item in sublist:
        result.append(item)
This will compress to:
[
item
for sublist in response
for item in sublist
]
List comprehension syntax is not very well explained in the texts I've been reading
Dive Into Python has a section dedicated to list comprehensions. There is also this nice tutorial to read through.
Update
I forgot to say something. List comprehensions are another way of achieving what has been traditionally done using map and filter. It would be a good idea to understand how map and filter work if you want to improve your comprehension-fu.
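For example, these two produce the same list (Python 2), which may help build that intuition:
evens_squared = [x * x for x in range(10) if x % 2 == 0]
evens_squared_mf = map(lambda x: x * x,
                       filter(lambda x: x % 2 == 0, range(10)))
assert evens_squared == evens_squared_mf   # [0, 4, 16, 36, 64]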
For the copy part, look into the copy module. Python simply uses references after the first object is created, so any change through other "copies" propagates back to the original; the copy module makes real copies of objects, and you can choose between shallow and deep copy modes.
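A short illustration of the difference (the copy module is in the standard library):
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)      # new outer list, but the same inner lists
deep = copy.deepcopy(a)     # fully independent nested copy
a[0].append(99)
print shallow[0]            # [1, 2, 99] -- the inner list is shared
print deep[0]               # [1, 2]     -- unaffected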
It is sometimes tricky to produce the right level of recursion in your data structure; however, I think in your case it should be relatively simple. To test it out as we go, we need some sample data, say:
data = [[a,
         [b,
          range(1, 9)]]
        for b in range(8)
        for a in range(3)]

print 'Origin'
print(data)
print 'Flat'
## from this we see how to produce the (a, b, c) data flat
print([(a, b, c) for a, [b, c] in data])
print "Sum of data in third level = %f" % sum(point for a, [b, c] in data for point in c)
print "Sum of all data = %f" % sum(a + b + sum(c) for a, [b, c] in data)
For the type check: generally you should avoid it, but if you must (for instance when you do not want to recurse into strings), you can do it like this:
if not isinstance(data, basestring): ....
If you need to flatten the structure, you can find useful code in the Python documentation (other ways to express it are chain(*listOfLists), or the list comprehension [d for sublist in listOfLists for d in sublist]):
from itertools import chain

def flatten(listOfLists):
    "Flatten one level of nesting"
    return chain.from_iterable(listOfLists)
This does not work, though, if you have data at different depths. For a heavyweight flattener see: http://www.python.org/workshops/1994-11/flatten.py
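For mixed depths, a small recursive flattener can be sketched like this (Python 2; the basestring check keeps it from descending into strings, and recursion depth is limited by the interpreter):
def deep_flatten(x):
    result = []
    for item in x:
        if hasattr(item, "__iter__") and not isinstance(item, basestring):
            result.extend(deep_flatten(item))
        else:
            result.append(item)
    return result

print deep_flatten([[1, 2], [[2, 3], [2, 4]], 5])   # [1, 2, 2, 3, 2, 4, 5]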
