I have multiple lists. I want to merge the elements sequentially one-by-one.
Example:
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
Result should be:
d = [1, 4, 7, 2, 5, 8, 3, 6, 9]
One way of doing it is:
d = []
for i, j, k in zip(a, b, c):
    d.extend([i, j, k])
Is this efficient? What is the most efficient way to do this?
A one-liner could be
import itertools
list(itertools.chain.from_iterable(zip(a,b,c)))
A variant of your method is
d = []
for i, j, k in zip(a, b, c):
    d += [i, j, k]
Out of curiosity, I've just used timeit to compare your method, that variant, my one-liner, and also the one in Olvin's comment (let's call it the compound version), and the verdict is:
yours: 1.06-1.08
my variant (with += instead of extend): 0.94-0.97
my one-liner: 1.10-1.12
Olvin's one-liner: 1.28-1.34
Sometimes, the nicest methods aren't the fastest.
Timings may change for longer lists, though.
The fact that += is faster than .extend is quite interesting (since .extend changes the list in place, while += builds a new one and then replaces the old one; instinct would say that extending in place should be faster than rebuilding, but maybe memory management says otherwise).
But, well, so far, the fastest one is my second version (with +=), which, incidentally, is also the one I find the most boring, among all solutions seen here.
Edit
Since that ranking bothered me (itertools iterators are supposed to be faster, since they are a little less interpreted and a little more compiled), I tried again with longer lists. And then it's another story:
a = list(range(1000))
b = list(range(1000, 2000))
c = list(range(2000, 3000))
And then the timeit verdict (with 100 times fewer runs than before) is:
Your method: 1.91
My += variant: 1.59
My one-liner: 0.98
Olvin's one-liner: 1.88
So, at least, itertools does win in the long run (with big enough data).
The victory of += over .extend is confirmed (I don't really know the internals of memory management, but, coming from the C world, I would say that sometimes a fresh malloc and copy is faster than constantly realloc'ing; maybe that's a naive view of what happens under the hood in Python's interpreter, but, well, += is faster than .extend for this usage in the long run).
Olvin's method is roughly equivalent to yours, which surprises me a little, because it is essentially the compound version of the same thing. I would have thought that, while building up a compound list, Python could skip some steps in the intermediate representation that it cannot skip in your method, where all the intermediate versions of the list (the one with just [1, 4, 7], then the one with [1, 4, 7, 2, 5, 8], etc.) actually exist in the interpreter. Maybe the 0.03 difference between Olvin's method and yours is because of that (it is not just noise; with this size, the timings are quite stable, and so is the 0.03 difference). But I would have thought the difference would be bigger.
But well, even if the timing differences surprise me, the ranking of the methods makes more sense with big lists: itertools > += > [compound] > .extend.
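For reference, here is a minimal sketch of how such a comparison could be rerun with timeit (the list sizes and repetition count here are arbitrary choices, not the exact ones used above):
import itertools
import timeit

a = list(range(1000))
b = list(range(1000, 2000))
c = list(range(2000, 3000))

def with_extend():
    d = []
    for i, j, k in zip(a, b, c):
        d.extend([i, j, k])
    return d

def with_iadd():
    d = []
    for i, j, k in zip(a, b, c):
        d += [i, j, k]
    return d

def with_itertools():
    return list(itertools.chain.from_iterable(zip(a, b, c)))

# timeit accepts a callable; each call reports the total time for `number` runs.
for fn in (with_extend, with_iadd, with_itertools):
    print(fn.__name__, timeit.timeit(fn, number=10000))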
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
flat = zip(a,b,c)
d = [x for tpl in flat for x in tpl]
This list comprehension is the same as:
flat_list = []
for sublist in flat:
    for item in sublist:
        flat_list.append(item)
Related
I have an array:
arr = [5,5,5,5,5,5]
I want to increment a particular range in the arr by 'n'. So if n=2 and the range is [2,5].
The array should look like this:
arr = [5,5,7,7,7,5]
I need to do this without a for loop, for a problem I'm trying to solve.
Tried:
arr[2:5] = [n]*3
but that obviously replaces the entries and becomes:
arr = [5,5,2,2,2,5]
Any suggestions would be highly appreciated.
n = 2
arr_range = slice(2, 5)
arr = [5,5,7,7,7,5]
arr[arr_range] = map(lambda x: x+n, arr[arr_range])
# arr
# [5, 5, 9, 9, 9, 5]
But I would recommend using numpy...
import numpy as np
n = 2
arr_range = slice(2, 5)
arr = np.array([5,5,7,7,7,5])
arr[arr_range] += n
You actually have a list, not an array. If you convert it to a Numpy array it is simple.
>>> import numpy as np
>>> n = 3
>>> arr = np.array([5,5,5,5,5,5])
>>> arr[2:5] += n
>>> arr
array([5, 5, 8, 8, 8, 5])
You have basically two options (for code see below):
Use slice assignment via a list comprehension (a[:] = [x+1 for x in a]),
Use a for-loop (even though you exclude this in your question, I don't see a legitimate reason for doing so).
They come with pros and cons. Let's assume you are going to replace some fraction of the list items (as opposed to a fixed number of items). The for-loop runs in Python and hence might be slower, but it has O(1) memory usage. The list comprehension and slice assignment both operate in C (assuming you are using CPython), but they have O(N) memory usage due to the temporary list.
Using a generator doesn't buy anything since it is converted to a list anyway before the assignment happens (this is necessary because if the generator had fewer or more items than the slice, the list would need to be resized accordingly; see the source code).
Using a map adds even more overhead since it needs to call the mapped function on every item.
The following is a performance comparison of the different methods. The for-loop is fastest for very small lists since it has minimal overhead (just the range object). For more than about a dozen items, the list comprehension clearly outperforms the other methods and especially for larger lists (len(a) > 3e5) the difference to the generator becomes noticeable (the generator cannot provide information about its size, so the generated list needs to be resized as more items are fetched). For very large lists the difference between for-loop and list comprehension seems to shrink again since the memory overhead tends to outweigh the loop cost, but reaching that point would require unusually large lists (where you'd be better off using something like Numpy anyway).
This is the code using the perfplot package:
import numpy
import perfplot

def use_generator(a):
    i = slice(0, len(a)//2)
    a[i] = (x+1 for x in a[i])

def use_map(a):
    i = slice(0, len(a)//2)
    a[i] = map(lambda x: x+1, a[i])

def use_list(a):
    i = slice(0, len(a)//2)
    a[i] = [x+1 for x in a[i]]

def use_loop(a):
    for i in range(len(a)//2):
        a[i] += 1

perfplot.show(
    setup=lambda n: [0]*n,
    kernels=[use_generator, use_map, use_list, use_loop],
    n_range=[2**k for k in range(1, 26)],
    xlabel="len(a)",
    equality_check=None,
)
For example, if I call
L = [3,4,2,1,5]
L = sorted(L)
I get a sorted list. Now, in the future, if I want to perform some other kind of sort on L, does Python automatically know "this list has been sorted before and not modified since, so we can perform some internal optimizations on how we perform this other kind of sort" such as a reverse-sort, etc?
Nope, it doesn't. The sorting algorithm is designed to exploit (partially) sorted inputs, but the list itself doesn't "remember" being sorted in any way.
(This is actually a CPython implementation detail, and future versions/different implementations could cache the fact that a list was just sorted. However, I'm not convinced that could be done without slowing down all operations that modify the list, such as append.)
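(As a rough illustration of the first point, here is a sketch showing Timsort exploiting already-sorted input; the exact numbers will vary from machine to machine:)
import random
import timeit

data = list(range(100000))
shuffled = data[:]
random.shuffle(shuffled)

# sorted() copies its input, so neither list is modified between runs.
print("already sorted:", timeit.timeit(lambda: sorted(data), number=100))
print("shuffled:", timeit.timeit(lambda: sorted(shuffled), number=100))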
As the commenters pointed out, normal Python lists are inherently ordered and efficiently sortable (thanks, Timsort!), but do not remember or maintain sorting status.
If you want lists that invariably retain their sorted status, you can install the SortedContainers package from PyPI.
>>> from sortedcontainers import SortedList
>>> L = SortedList([3,4,2,1,5])
>>> L
SortedList([1, 2, 3, 4, 5])
>>> L.add(3.3)
>>> L
SortedList([1, 2, 3, 3.3, 4, 5])
Note the normal append method becomes add, because the item isn't added on the end. It's added wherever appropriate given the sort order. There is also a SortedListWithKey type that allows you to set your sort key/order explicitly.
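For example, a small sketch (assuming a reasonably recent sortedcontainers version, where passing key= to SortedList gives you a key-ordered list):
from sortedcontainers import SortedList

# Keep items ordered by descending value; the key is applied on every add.
L = SortedList([3, 4, 2, 1, 5], key=lambda x: -x)
L.add(3.3)
print(list(L))  # [5, 4, 3.3, 3, 2, 1]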
Some of this, at least the specific reverse sort question, could be done using numpy:
import numpy as np

L = np.array([3, 4, 2, 1, 5])
a = np.argsort(L)    # indices that would sort L
b = L[a]             # normal (ascending) sort
r = L[a[::-1]]       # reverse sort

print(L)  # [3 4 2 1 5]
print(b)  # [1 2 3 4 5]
print(r)  # [5 4 3 2 1]
That is, here we just do the sort once (to create a, the sorting indices), and then we can manipulate a to do various other sorts, like the normal sort b and the reverse sort r. Many others would be similarly easy, like taking every other element.
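For instance, continuing the example above, "every other element" of the sorted order is just another slice of the index array:
import numpy as np

L = np.array([3, 4, 2, 1, 5])
a = np.argsort(L)   # indices that would sort L
print(L[a[::2]])    # every other element of the sorted order: [1 3 5]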
Is there a better way to remove the last N elements of a list?
for i in range(0, n):
    lst.pop()
Works for n >= 1
>>> L = [1,2,3, 4, 5]
>>> n=2
>>> del L[-n:]
>>> L
[1, 2, 3]
If you wish to remove the last n elements, in other words keep the first len(lst) - n elements:
lst = lst[:len(lst)-n]
Note: this is not an in-place operation; it creates a shallow copy.
As Vincenzooo correctly says, the pythonic lst[:-n] does not work when n==0.
The following works for all n>=0:
lst = lst[:-n or None]
I like this solution because it is kind of readable in English too: "return a slice omitting the last n elements or none (if none needs to be omitted)".
This solution works because of the following:
x or y evaluates to x when x is logically true (e.g., when it is not 0, "", False, None, ...) and to y otherwise. So -n or None is -n when n!=0 and None when n==0.
When slicing, None is equivalent to omitting the value, so lst[:None] is the same as lst[:] (see here).
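A quick check of both cases:
>>> lst = [1, 2, 3, 4, 5]
>>> n = 2
>>> lst[:-n or None]
[1, 2, 3]
>>> n = 0
>>> lst[:-n or None]
[1, 2, 3, 4, 5]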
As noted by #swK, this solution creates a new list (but immediately discards the old one unless it's referenced elsewhere) rather than editing the original one. This is often not a problem in terms of performance, as creating a new list in one go is often faster than removing one element at a time (unless n << len(lst)). It is also often not a problem in terms of space, as usually the members of the list take more space than the list itself (unless it's a list of small objects like bytes, or the list has many duplicated entries). Please also note that this solution is not exactly equivalent to the OP's: if the original list is referenced by other variables, this solution will not modify (shorten) those other references, unlike the OP's code.
A possible solution (in the same style as my original one) that works for n>=0 but: a) does not create a copy of the list; and b) also affects other references to the same list, could be the following:
lst[-n:n and None] = []
This is definitely not readable and should not be used. Actually, even my original solution requires too much understanding of the language to be quickly read and unambiguously understood by everyone. I wouldn't use either in any real code, and I think the best solution is that by #wonder.mice: a[len(a)-n:] = [].
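For completeness, a small sketch of that in-place variant wrapped in a helper (drop_last is just an illustrative name), showing it behaves correctly for n >= 0, including n == 0:
def drop_last(lst, n):
    """Remove the last n elements of lst in place (n >= 0); illustrative helper, not from the original post."""
    lst[len(lst) - n:] = []

a = [1, 2, 3, 4, 5]
drop_last(a, 2)
print(a)  # [1, 2, 3]
drop_last(a, 0)
print(a)  # [1, 2, 3] -- unchanged, the list is not accidentally cleared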
Just del it like this:
del lst[-n:]
I see this was asked long ago, but none of the answers did it for me; what if we want to get a list without the last N elements, but keep the original one? You just do list[:-n]. If you need to handle cases where n may equal 0, you do list[:-n or None].
>>> a = [1,2,3,4,5,6,7]
>>> b = a[:-4]
>>> b
[1, 2, 3]
>>> a
[1, 2, 3, 4, 5, 6, 7]
As simple as that.
You should be using this:
a[len(a)-n:] = []
or this:
del a[len(a)-n:]
It's much faster, since it really removes items from the existing list. The alternative (a = a[:len(a)-1]) creates a new list object and is less efficient.
>>> timeit.timeit("a = a[:len(a)-1]\na.append(1)", setup="a=range(100)", number=10000000)
6.833014965057373
>>> timeit.timeit("a[len(a)-1:] = []\na.append(1)", setup="a=range(100)", number=10000000)
2.0737061500549316
>>> timeit.timeit("a[-1:] = []\na.append(1)", setup="a=range(100)", number=10000000)
1.507638931274414
>>> timeit.timeit("del a[-1:]\na.append(1)", setup="a=range(100)", number=10000000)
1.2029790878295898
If 0 < n you can use a[-n:] = [] or del a[-n:] which is even faster.
This is one of the cases in which being pythonic doesn't work for me and can give hidden bugs or mess.
None of the solutions above works for the case n=0.
Using l[:len(l)-n] works in the general case:
l = list(range(4))
for n in [2, 1, 0]:  # test values for the number of points to cut
    print(n, l[:len(l)-n])
This is useful for example inside a function to trim edges of a vector, where you want to leave the possibility not to cut anything.
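For instance, a sketch of such a trimming helper (trim_edges is an illustrative name, not from the original post):
def trim_edges(vec, left=0, right=0):
    """Return vec without `left` leading and `right` trailing elements; both counts may be 0."""
    return vec[left:len(vec) - right]

print(trim_edges([0, 1, 2, 3, 4, 5], left=1, right=2))  # [1, 2, 3]
print(trim_edges([0, 1, 2, 3, 4, 5]))                   # [0, 1, 2, 3, 4, 5]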
I have a list such as [[4,5,6],[2,3,1]]. Now I want to sort the list based on list[1], i.e. the output should be [[6,4,5],[1,2,3]]. So basically I am sorting 2,3,1 and reordering list[0] accordingly.
While searching I found a function which sorts based on the first element of every sublist, but not this. Also, I do not want to recreate the list as [[4,2],[5,3],[6,1]] and then use that function.
Since [4, 5, 6] and [2, 3, 1] serve two different purposes, I will make a function taking two arguments: the list to be reordered, and the list whose sorting will decide the order. I'll only return the reordered list.
This answer has timings of three different solutions for creating a permutation list for a sort. Using the fastest option gives this solution:
def pyargsort(seq):
    return sorted(range(len(seq)), key=seq.__getitem__)

def using_pyargsort(a, b):
    "Reorder the list a the same way as list b would be reordered by a normal sort"
    return [a[i] for i in pyargsort(b)]

print(using_pyargsort([4, 5, 6], [2, 3, 1]))  # [6, 4, 5]
The pyargsort method is inspired by the numpy argsort method, which does the same thing much faster. Numpy also has advanced indexing operations whereby an array can be used as an index, making possible very quick reordering of an array.
So if your need for speed is great, one would assume that this numpy solution would be faster:
import numpy as np

def using_numpy(a, b):
    "Reorder the list a the same way as list b would be reordered by a normal sort"
    return np.array(a)[np.argsort(b)].tolist()

print(using_numpy([4, 5, 6], [2, 3, 1]))  # [6, 4, 5]
However, for short lists (length < 1000), this solution is in fact slower than the first. This is because we're first converting the a and b lists to array and then converting the result back to list before returning. If we instead assume you're using numpy arrays throughout your application so that we do not need to convert back and forth, we get this solution:
def all_numpy(a, b):
    "Reorder array a the same way as array b would be reordered by a normal sort"
    return a[np.argsort(b)]

print(all_numpy(np.array([4, 5, 6]), np.array([2, 3, 1])))  # [6 4 5]
The all_numpy function executes up to 10 times faster than the using_pyargsort function.
The following logarithmic graph compares these three solutions with the two alternative solutions from the other answers. The arguments are two randomly shuffled ranges of equal length, and the functions all receive identically ordered lists. I'm timing only the time the function takes to execute. For illustrative purposes I've added an extra graph line for each numpy solution where the 60 ms overhead for loading numpy is added to the time.
As we can see, the all-numpy solution beats the others by an order of magnitude. Converting from python list and back slows the using_numpy solution down considerably in comparison, but it still beats pure python for large lists.
For a list length of about 1,000,000, using_pyargsort takes 2.0 seconds, using_numpy + overhead is only 1.3 seconds, while all_numpy + overhead is 0.3 seconds.
The sorting you describe is not very easy to accomplish. The only way that I can think of to do it is to use zip to create the list you say you don't want to create:
lst = [[4,5,6],[2,3,1]]
# key = operator.itemgetter(1) works too, and may be slightly faster ...
transpose_sort = sorted(zip(*lst),key = lambda x: x[1])
lst = zip(*transpose_sort)
Is there a reason for this constraint?
(Also note that you could do this all in one line if you really want to:
lst = zip(*sorted(zip(*lst),key = lambda x: x[1]))
This also results in a list of tuples. If you really want a list of lists, you can map the result:
lst = map(list, lst)
Or a list comprehension would work as well:
lst = [ list(x) for x in lst ]
If the second list doesn't contain duplicates, you could just do this:
l = [[4,5,6],[2,3,1]] #the list
l1 = l[1][:] #a copy of the to-be-sorted sublist
l[1].sort() #sort the sublist
l[0] = [l[0][l1.index(x)] for x in l[1]] #order the first sublist accordingly
(As this keeps a copy of the sublist l[1], it might be a bad idea if your input list is huge.)
How about this one:
a = [[4,5,6],[2,3,1]]
[a[0][i] for i in sorted(range(len(a[1])), key=lambda x: a[1][x])]
It uses the same approach numpy takes, without having to use numpy and without the zip stuff.
Neither using numpy nor the zipping around seems to be the cheapest way for giant structures. Unfortunately the .sort() method is built into the list type and uses hard-wired access to the elements in the list (overriding __getitem__() or similar does not have any effect here).
So you can implement your own sort() which sorts two or more lists according to the values in one; this is basically what numpy does.
Or you can create a list of values to sort, sort that, and recreate the sorted original list out of it.
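A minimal sketch of that last idea (sort a list of indices by the key values, then rebuild every list in that order; cosort is just an illustrative name):
def cosort(key_list, *other_lists):
    """Sort key_list and reorder every list in other_lists the same way."""
    order = sorted(range(len(key_list)), key=key_list.__getitem__)
    return [[lst[i] for i in order] for lst in (key_list, *other_lists)]

print(cosort([2, 3, 1], [4, 5, 6]))  # [[1, 2, 3], [6, 4, 5]]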
# I have 3 lists:
L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]
# I want to create another that is L1 minus L2's members and L3's members, so:
L4 = (L1 - L2) - L3 # Of course this isn't going to work
I'm wondering, what is the "correct" way to do this. I can do it many different ways, but Python's style guide says there should be only 1 correct way of doing each thing. I've never known what this was.
Here are some tries:
L4 = [ n for n in L1 if (n not in L2) and (n not in L3) ] # parens for clarity
tmpset = set( L2 + L3 )
L4 = [ n for n in L1 if n not in tmpset ]
Now that I have had a moment to think, I realize that the L2 + L3 thing creates a temporary list that immediately gets thrown away. So an even better way is:
tmpset = set(L2)
tmpset.update(L3)
L4 = [ n for n in L1 if n not in tmpset ]
Update: I see some extravagant claims being thrown around about performance, and I want to assert that my solution was already as fast as possible. Creating intermediate results, whether they be intermediate lists or intermediate iterators that then have to be called into repeatedly, will always be slower than simply giving L2 and L3 to the set to iterate over directly, as I have done here.
$ python -m timeit \
-s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2)' \
'ts = set(L2); ts.update(L3); L4 = [ n for n in L1 if n not in ts ]'
10000 loops, best of 3: 39.7 usec per loop
All other alternatives (that I can think of) will necessarily be slower than this. Doing the loops ourselves, for example, rather than letting the set() constructor do them, adds expense:
$ python -m timeit \
-s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2)' \
'unwanted = frozenset(item for lst in (L2, L3) for item in lst); L4 = [ n for n in L1 if n not in unwanted ]'
10000 loops, best of 3: 46.4 usec per loop
Using iterators, with all of the state-saving and callbacks they involve, will obviously be even more expensive:
$ python -m timeit \
-s 'L1=range(300);L2=range(30,70,2);L3=range(120,220,2);from itertools import ifilterfalse, chain' \
'L4 = list(ifilterfalse(frozenset(chain(L2, L3)).__contains__, L1))'
10000 loops, best of 3: 47.1 usec per loop
So I believe that the answer I gave last night is still far and away (for values of "far and away" greater than around 5µsec, obviously) the best, unless the questioner will have duplicates in L1 and wants them removed once each for every time the duplicate appears in one of the other lists.
update::: post contains a reference to false allegations of inferior performance of sets compared to frozensets. I maintain that it's still sensible to use a frozenset in this instance, even though there's no need to hash the set itself, just because it's more correct semantically. Though, in practice, I might not bother typing the extra 6 characters. I'm not feeling motivated to go through and edit the post, so just be advised that the "allegations" link links to some incorrectly run tests. The gory details are hashed out in the comments. :::update
The second chunk of code posted by Brandon Craig Rhodes is quite good, but as he didn't respond to my suggestion about using a frozenset (well, not when I started writing this, anyway), I'm going to go ahead and post it myself.
The whole basis of the undertaking at hand is to check whether each of a series of values (L1) is in another set of values; that set of values is the contents of L2 and L3. The use of the word "set" in that sentence is telling: even though L2 and L3 are lists, we don't really care about their list-like properties, like the order that their values are in or how many of each they contain. We just care about the set (there it is again) of values they collectively contain.
If that set of values is stored as a list, you have to go through the list elements one by one, checking each one. It's relatively time-consuming, and it's bad semantics: again, it's a "set" of values, not a list. So Python has these neat set types that hold a bunch of unique values, and can quickly tell you if some value is in them or not. This works in pretty much the same way that python's dict types work when you're looking up a key.
The difference between sets and frozensets is that sets are mutable, meaning that they can be modified after creation. Documentation on both types is here.
Since the set we need to create, the union of the values stored in L2 and L3, is not going to be modified once created, it's semantically appropriate to use an immutable data type. This also allegedly has some performance benefits. Well, it makes sense that it would have some advantage; otherwise, why would Python have frozenset as a builtin?
update...
Brandon has answered this question: the real advantage of frozen sets is that their immutability makes it possible for them to be hashable, allowing them to be dictionary keys or members of other sets.
I ran some informal timing tests comparing the speed for creation of and lookup on relatively large (3000-element) frozen and mutable sets; there wasn't much difference. This conflicts with the above link, but supports what Brandon says about them being identical but for the aspect of mutability.
...update
Now, because frozensets are immutable, they don't have an update method. Brandon used the set.update method to avoid creating and then discarding a temporary list en route to set creation; I'm going to take a different approach.
items = (item for lst in (L2, L3) for item in lst)
This generator expression makes items an iterator over, consecutively, the contents of L2 and L3. Not only that, but it does it without creating a whole list-full of intermediate objects. Using nested for expressions in generators is a bit confusing, but I manage to keep it sorted out by remembering that they nest in the same order that they would if you wrote actual for loops, e.g.
def get_items(lists):
for lst in lists:
for item in lst:
yield item
That generator function is equivalent to the generator expression that we assigned to items. Well, except that it's a parametrized function definition instead of a direct assignment to a variable.
Anyway, enough digression. The big deal with generators is that they don't actually do anything. Well, at least not right away: they just set up work to be done later, when the generator expression is iterated. This is formally referred to as being lazy. We're going to do that (well, I am, anyway) by passing items to the frozenset function, which iterates over it and returns a frosty cold frozenset.
unwanted = frozenset(items)
You could actually combine the last two lines, by putting the generator expression right inside the call to frozenset:
unwanted = frozenset(item for lst in (L2, L3) for item in lst)
This neat syntactical trick works as long as the iterator created by the generator expression is the only parameter to the function you're calling. Otherwise you have to write it in its usual separate set of parentheses, just like you were passing a tuple as an argument to the function.
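For example:
# Fine: the generator expression is the only argument.
total = sum(x * x for x in range(5))

# With a second argument, the generator expression needs its own parentheses.
total = sum((x * x for x in range(5)), 10)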
Now we can build a new list in the same way that Brandon did, with a list comprehension. These use the same syntax as generator expressions, and do basically the same thing, except that they are eager instead of lazy (again, these are actual technical terms), so they get right to work iterating over the items and creating a list from them.
L4 = [item for item in L1 if item not in unwanted]
This is equivalent to passing a generator expression to list, e.g.
L4 = list(item for item in L1 if item not in unwanted)
but more idiomatic.
So this will create the list L4, containing the elements of L1 which weren't in either L2 or L3, maintaining the order that they were originally in and the number of them that there were.
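With the question's data, for example:
>>> L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> L2 = [4, 7, 8]
>>> L3 = [5, 2, 9]
>>> unwanted = frozenset(item for lst in (L2, L3) for item in lst)
>>> [item for item in L1 if item not in unwanted]
[1, 3, 6]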
If you just want to know which values are in L1 but not in L2 or L3, it's much easier: you just create that set:
L1_unique_values = set(L1) - unwanted
You can make a list out of it, as does st0le, but that might not really be what you want. If you really do want the set of values that are only found in L1, you might have a very good reason to keep that set as a set, or indeed a frozenset:
L1_unique_values = frozenset(L1) - unwanted
...Annnnd, now for something completely different:
from itertools import ifilterfalse, chain
L4 = list(ifilterfalse(frozenset(chain(L2, L3)).__contains__, L1))
Assuming your individual lists won't contain duplicates, use set and difference:
L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]
print(list(set(L1) - set(L2) - set(L3)))
This may be less pythonesque than the list-comprehension answer, but has a simpler look to it:
l1 = [ ... ]
l2 = [ ... ]
diff = list(l1)  # this copies the list
for element in l2:
    diff.remove(element)
The advantage here is that we preserve order of the list, and if there are duplicate elements, we remove only one for each time it appears in l2.
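For instance, a small illustration of that duplicate behaviour (note that list.remove raises ValueError if an element of l2 is not present in the copy):
>>> l1 = [1, 2, 2, 2, 3]
>>> l2 = [2, 2]
>>> diff = list(l1)
>>> for element in l2:
...     diff.remove(element)
...
>>> diff
[1, 2, 3]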
Doing such operations on lists can hurt your program's performance very quickly. With each remove, the list has to move elements around (and possibly reallocate), which can be expensive if you have a very large list. So I would suggest this:
I am assuming your list has unique elements. Otherwise you would need to maintain a list of duplicate values in your dict. Anyway, for the data you provided, here it is:
METHOD 1
d = dict()
for x in L1:
    d[x] = True

# Check if L2 data is in 'd'
for x in L2:
    if x in d:
        d[x] = False
for x in L3:
    if x in d:
        d[x] = False

# Finally retrieve all keys with value True.
final_list = [x for x in d if d[x]]
METHOD 2
If all that looks like too much code, you could try using set. But this way your list will lose all duplicate elements.
final_set = set.difference(set(L1),set(L2),set(L3))
final_list = list(final_set)
I think intuited's answer is way too long for such a simple problem, and the standard library already has a function (itertools.chain) to chain two lists lazily, as a generator.
The procedure is as follows:
Use itertools.chain to chain L2 and L3 without creating a memory-consuming copy
Create a set from that (in this case, a frozenset will do because we don't change it after creation)
Use a list comprehension to filter out the elements of L1 that are also in L2 or L3. Since set/frozenset lookup (x in someset) is O(1), this will be very fast.
And now the code:
L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L2 = [4, 7, 8]
L3 = [5, 2, 9]
from itertools import chain
tmp = frozenset(chain(L2, L3))
L4 = [x for x in L1 if x not in tmp] # [1, 3, 6]
This should be one of the fastest, simplest and least memory-consuming solution.