A similar question, Cost of len() function, has already been asked. However, that question looks at the cost of len itself.
Suppose I have code that calls len(List) many times. Each call is O(1); reading a variable is also O(1), and so is assigning to one.
As a side note, I find that n_files = len(Files) is somewhat more readable than repeated len(Files) in my code. So, that is already an incentive for me to do this.
You could also argue that somewhere in the code Files might be modified, so n_files would no longer be correct, but that is not the case here.
My question is:
Is there a number of calls to len(Files) after which accessing n_files
will be faster?
A few results (time, in seconds, for one million calls), with a ten-element list, using Python 2.7.10 on Windows 7; store is whether we store the length or keep calling len, and alias is whether or not we create a local alias for len:
Store  Alias   n=1     n=10    n=100
Yes    Yes     0.862   1.379    6.669
Yes    No      0.792   1.337    6.543
No     Yes     0.914   1.924   11.616
No     No      0.879   1.987   12.617
and a thousand-element list:
Store  Alias   n=1     n=10    n=100
Yes    Yes     0.877   1.369    6.661
Yes    No      0.785   1.299    6.808
No     Yes     0.926   1.886   11.720
No     No      0.891   1.948   12.843
Conclusions:
Storing the result is more efficient than calling len repeatedly, even for n == 1;
Creating a local alias for len can make a small improvement for larger n where we aren't storing the result, but not as much as just storing the result would; and
The influence of the length of the list is negligible, suggesting that whether or not the integers are interned isn't making any difference.
Test script:
def test(n, l, store, alias):
    if alias:
        len_ = len
        len_l = len_(l)
    else:
        len_l = len(l)
    for _ in range(n):
        if store:
            _ = len_l
        elif alias:
            _ = len_(l)
        else:
            _ = len(l)

if __name__ == '__main__':
    from itertools import product
    from timeit import timeit

    setup = 'from __main__ import test, l'

    for n, l, store, alias in product(
        (1, 10, 100),
        ([None]*10,),
        (True, False),
        (True, False),
    ):
        test_case = 'test({!r}, l, {!r}, {!r})'.format(n, store, alias)
        print test_case, len(l),
        print timeit(test_case, setup=setup)
Function calls in Python are costly, so if you are 100% sure that the size of Files will not change while you are reading its length from the variable, you can use the variable, especially if that is also more readable for you.
An example performance test comparing a call to len(list) with reading the length from a variable gives the following result:
In [36]: l = list(range(100000))
In [37]: n_l = len(l)
In [40]: %timeit newn = len(l)
10000000 loops, best of 3: 92.8 ns per loop
In [41]: %timeit new_n = n_l
10000000 loops, best of 3: 33.1 ns per loop
Accessing the variable is always faster than using len().
Using l = len(li) is faster:
python -m timeit -s "li = [1, 2, 3]" "len(li)"
1000000 loops, best of 3: 0.239 usec per loop
python -m timeit -s "li = [1, 2, 3]; l = len(li)" "l"
10000000 loops, best of 3: 0.0949 usec per loop
Using len(Files) instead of n_files is likely to be slower. Yes, you have to look up n_files, but in the former case you have to look up both len and Files, and then on top of that call a function that "calculates" the length of Files.
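To see where the extra work comes from, you can compare the bytecode of the two approaches with the standard dis module. This is just an illustrative sketch; the exact opcodes vary between CPython versions:
import dis

def use_len(files):
    return len(files)   # looks up the name len, looks up files, then calls len

def use_var(n_files):
    return n_files      # a single local variable load

dis.dis(use_len)   # LOAD_GLOBAL len, LOAD_FAST files, CALL_FUNCTION, ...
dis.dis(use_var)   # LOAD_FAST n_files, RETURN_VALUE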
Related
This may not be useful. It's just a challenge I have set for myself.
Let's say you have a big array. What can you do so that the program does not benefit from caching or cache-line prefetching, for example by making sure that the next memory access can only be determined after the previous access finishes?
So we have our array:
array = [0] * 10000000
What would be the best way to deoptimize the memory access if you had to access all elements in a loop? The idea is to increase the access time of each memory location as much as possible.
I'm not looking for a solution which proposes to do "something else" (which takes time) before doing the next access. The idea is really to increase the access time as much as possible. I guess we have to traverse the array in a certain way (perhaps randomly? I'm still looking into it).
I did not expect any difference, but in fact accessing the elements in random order is significantly slower than accessing them in order or in reverse order (which are both about the same).
>>> import random
>>> N = 10**5
>>> arr = [random.randint(0, 1000) for _ in range(N)]
>>> srt = list(range(N))
>>> rvd = srt[::-1]
>>> rnd = random.sample(srt, N)
>>> %timeit sum(arr[i] for i in srt)
10 loops, best of 5: 24.9 ms per loop
>>> %timeit sum(arr[i] for i in rvd)
10 loops, best of 5: 25.7 ms per loop
>>> %timeit sum(arr[i] for i in rnd)
10 loops, best of 5: 59.2 ms per loop
And it really seems to be the randomness. Just accessing indices out of order, but with a pattern, e.g. as [0, N-1, 2, N-3, ...] or [0, N/2, 1, N/2+1, ...], is just as fast as accessing them in order:
>>> alt1 = [i if i % 2 == 0 else N - i for i in range(N)]
>>> alt2 = [i for p in zip(srt[:N//2], srt[N//2:]) for i in p]
>>> %timeit sum(arr[i] for i in alt1)
10 loops, best of 5: 24.5 ms per loop
>>> %timeit sum(arr[i] for i in alt2)
10 loops, best of 5: 24.1 ms per loop
Interestingly, just iterating the shuffled indices (and calculating their sum as with the array above) is also slower than doing the same with the sorted indices, but not as much. Of the ~35ms difference between srt and rnd, ~10ms seem to come from iterating the randomized indices, and ~25ms for actually accessing the indices in random order.
>>> %timeit sum(i for i in srt)
100 loops, best of 5: 19.7 ms per loop
>>> %timeit sum(i for i in rnd)
10 loops, best of 5: 30.5 ms per loop
>>> %timeit sum(arr[i] for i in srt)
10 loops, best of 5: 24.5 ms per loop
>>> %timeit sum(arr[i] for i in rnd)
10 loops, best of 5: 56 ms per loop
(IPython 5.8.0 / Python 3.7.3 on a rather old laptop running Linux)
Python interns small integers, so use integers > 255. The * operator just adds references to the number already in the list when expanding it, so use unique values instead. Caches hate randomness, so go random.
import random

array = list(range(256, 10000256))
while array:
    array.pop(random.randint(0, len(array)-1))
A note on interning small integers: when you create an integer in your program, say 12345, Python creates an object of a few dozen bytes on the heap. This is expensive. So numbers between -5 and 256 are pre-built in CPython to optimize common small-number operations. By avoiding these numbers you force Python to allocate integers on the heap, spreading out the amount of memory you touch and reducing cache efficiency.
If you use a single number in the array, as in [1234] * 100000, that single number is referenced many times. If you use unique numbers, they are all individually allocated on the heap, increasing the memory footprint. And when they are removed from the list, Python has to touch each object to decrement its reference count, which pulls its memory location into the cache, evicting something else.
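A quick, illustrative check of both effects on CPython (small-integer caching and the shared references created by *). The int('...') calls are only there to defeat constant folding; the exact cached range is an implementation detail:
# Small integers come from CPython's shared cache; larger ones are fresh heap objects.
a = int('100')
b = int('100')
print(a is b)        # True: both names refer to the cached small-int object

c = int('100000')
d = int('100000')
print(c is d)        # False: two separate heap objects

# List repetition copies references, not objects.
repeated = [1234] * 5
print(len({id(x) for x in repeated}))   # 1: one object, referenced five times

unique = [1000 + i for i in range(5)]
print(len({id(x) for x in unique}))     # 5: five distinct objects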
I had a Python program which reads lines from files and puts them into a dict. Simplified, it looks like this:
data = {'file_name': ''}
with open('file_name') as in_fd:
    for line in in_fd:
        data['file_name'] += line
I found it took hours to finish.
Then I made a small change to the program:
data = {'file_name': []}
with open('file_name') as in_fd:
    for line in in_fd:
        data['file_name'].append(line)
data['file_name'] = ''.join(data['file_name'])
It finished in seconds.
I thought it was += that made the program slow, but it seems not. Please take a look at the results of the following test.
I knew we could use list.append and ''.join to improve performance when concatenating strings. But I never expected such a performance gap between append-and-join and add-and-assign.
So I did some more tests and finally found that it is the dict update operation that makes the program insanely slow. Here is a script:
import time

LOOPS = 10000
WORD = 'ABC'*100

s1 = time.time()
buf1 = []
for i in xrange(LOOPS):
    buf1.append(WORD)
ss = ''.join(buf1)

s2 = time.time()
buf2 = ''
for i in xrange(LOOPS):
    buf2 += WORD

s3 = time.time()
buf3 = {'1': ''}
for i in xrange(LOOPS):
    buf3['1'] += WORD

s4 = time.time()
buf4 = {'1': []}
for i in xrange(LOOPS):
    buf4['1'].append(WORD)
buf4['1'] = ''.join(buf4['1'])

s5 = time.time()

print s2-s1, s3-s2, s4-s3, s5-s4
On my laptop (mac pro 2013 mid, OS X 10.9.5, CPython 2.7.10), its output is:
0.00299620628357 0.00415587425232 3.49465799332 0.00231599807739
Inspired by juanpa.arrivillaga's comments, I made a small change to the second loop:
trivial_reference = []
buf2 = ''
for i in xrange(LOOPS):
    buf2 += WORD
    trivial_reference.append(buf2)  # add a trivial reference to defeat the optimization
After the change, the second loop now takes 19 seconds to complete. So it is just an optimization issue, as juanpa.arrivillaga said.
+= performs really badly when building large strings, but it can be efficient in one case in CPython, mentioned below.
For reliably fast string concatenation, use str.join().
From String Concatenation section under Python Performance Tips:
Avoid this:
s = ""
for substring in list:
    s += substring
Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.
Why is s += x faster than s['1'] += x or s[0] += x?
From Note 6:
CPython implementation detail: If s and t are both strings, some
Python implementations such as CPython can usually perform an in-place
optimization for assignments of the form s = s + t or s += t. When
applicable, this optimization makes quadratic run-time much less
likely. This optimization is both version and implementation
dependent. For performance sensitive code, it is preferable to use the
str.join() method which assures consistent linear concatenation
performance across versions and implementations.
The optimization in CPython's case is that if a string has only one reference, it can be resized in place.
/* Note that we don't have to modify *unicode for unshared Unicode
objects, since we can modify them in-place. */
Now, the latter two are not simple in-place additions. In fact, they are not in-place additions at all.
s[0] += x
is equivalent to:
temp = s[0]  # Extra reference: s[0] and temp both point to the same string now.
temp += x
s[0] = temp
Example:
>>> lst = [1, 2, 3]
>>> def func():
... lst[0] = 90
... return 100
...
>>> lst[0] += func()
>>> print lst
[101, 2, 3] # Not [190, 2, 3]
But in general, never rely on s += x to concatenate strings; always use str.join on a collection of strings.
Timings
LOOPS = 1000
WORD = 'ABC'*100

def list_append():
    buf1 = [WORD for _ in xrange(LOOPS)]
    return ''.join(buf1)

def str_concat():
    buf2 = ''
    for i in xrange(LOOPS):
        buf2 += WORD

def dict_val_concat():
    buf3 = {'1': ''}
    for i in xrange(LOOPS):
        buf3['1'] += WORD
    return buf3['1']

def list_val_concat():
    buf4 = ['']
    for i in xrange(LOOPS):
        buf4[0] += WORD
    return buf4[0]

def val_pop_concat():
    buf5 = ['']
    for i in xrange(LOOPS):
        val = buf5.pop()
        val += WORD
        buf5.append(val)
    return buf5[0]

def val_assign_concat():
    buf6 = ['']
    for i in xrange(LOOPS):
        val = buf6[0]
        val += WORD
        buf6[0] = val
    return buf6[0]
>>> %timeit list_append()
1000 loops, best of 3: 1.31 ms per loop
>>> %timeit str_concat()
100 loops, best of 3: 3.09 ms per loop
>>> %run so.py
>>> %timeit list_append()
10000 loops, best of 3: 71.2 us per loop
>>> %timeit str_concat()
1000 loops, best of 3: 276 us per loop
>>> %timeit dict_val_concat()
100 loops, best of 3: 9.66 ms per loop
>>> %timeit list_val_concat()
100 loops, best of 3: 9.64 ms per loop
>>> %timeit val_pop_concat()
1000 loops, best of 3: 556 us per loop
>>> %timeit val_assign_concat()
100 loops, best of 3: 9.31 ms per loop
val_pop_concat is fast here because by using pop() we drop the list's reference to that string, so CPython can resize it in place (guessed correctly by niemmi in the comments).
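A rough way to see the reference counts involved, using sys.getrefcount (this sketch is not from the original answer; the absolute numbers include the temporary reference created by the call itself, so only the difference between the two prints matters):
import sys

buf = ['x' * 1000]

# Reading by index keeps the list's reference alive alongside val.
val = buf[0]
print(sys.getrefcount(val))   # at least 3: the list, val, and the call's own argument

# pop() removes the list's reference, leaving val as (almost) the sole owner,
# which is what lets CPython's in-place += resize kick in.
val = buf.pop()
print(sys.getrefcount(val))   # one less than above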
Suppose you want to write a function which yields a list of objects, and you know in advance the length n of that list.
In Python a list supports indexed access in O(1), so it is arguably a good idea to pre-allocate the list and access it by index instead of allocating an empty list and using the append() method. This is because we avoid the burden of expanding the list whenever there is not enough space.
If I'm using Python, performance is probably not that relevant anyway, but what is the better way of pre-allocating a list?
I know these possible candidates:
[None] * n → allocating two lists
[None for x in range(n)] — or xrange in python2 → building another object
Is one significantly better than the other?
What if we are in the case n = len(input)? Since input exists already, would [None for x in input] have better performances w.r.t. [None] * len(input)?
Between those two options the first is clearly better, as no Python-level for loop is involved.
>>> %timeit [None] * 100
1000000 loops, best of 3: 469 ns per loop
>>> %timeit [None for x in range(100)]
100000 loops, best of 3: 4.8 us per loop
Update:
list.append has amortized O(1) complexity too, and it can even be a better choice than pre-creating the list if you first bind the list.append method to a local variable.
>>> n = 10**3

>>> %%timeit
lis = [None]*n
for _ in range(n):
    lis[_] = _
...
10000 loops, best of 3: 73.2 us per loop

>>> %%timeit
lis = []
for _ in range(n):
    lis.append(_)
...
10000 loops, best of 3: 92.2 us per loop

>>> %%timeit
lis = []; app = lis.append
for _ in range(n):
    app(_)
...
10000 loops, best of 3: 59.4 us per loop

>>> n = 10**6

>>> %%timeit
lis = [None]*n
for _ in range(n):
    lis[_] = _
...
10 loops, best of 3: 106 ms per loop

>>> %%timeit
lis = []
for _ in range(n):
    lis.append(_)
...
10 loops, best of 3: 122 ms per loop

>>> %%timeit
lis = []; app = lis.append
for _ in range(n):
    app(_)
...
10 loops, best of 3: 91.8 ms per loop
When you append an item to a list, Python 'over-allocates', see the source-code of the list object. This means that for example when adding 1 item to a list of 8 items, it actually makes room for 8 new items, and uses only the first one of those. The next 7 appends are then 'for free'.
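You can observe the over-allocation indirectly with sys.getsizeof: the reported size grows in occasional jumps rather than on every append. (A small sketch; the exact growth pattern is a CPython implementation detail and varies between versions.)
import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(20):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != last_size:
        # The size only changes when CPython actually reallocates the backing array.
        print('len=%d size=%d bytes' % (len(lst), size))
        last_size = size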
In many languages (e.g. old versions of Matlab, the newer JIT might be better) you are always told that you need to pre-allocate your vectors, since appending during a loop is very expensive. In the worst case, appending of a single item to a list of length n can cost O(n) time, since you might have to create a bigger list and copy all the existing items over. If you need to do this on every iteration, the overall cost of adding n items is O(n^2), ouch. Python's pre-allocation scheme spreads the cost of growing the array over many single appends (see amortized costs), effectively making the cost of a single append O(1) and the overall cost of adding n items O(n).
Additionally, the overhead of the rest of your Python code is usually so large, that the tiny speedup that can be obtained by pre-allocating is insignificant. So in most cases, simply forget about pre-allocating, unless your profiler tells you that appending to a list is a bottleneck.
The other answers show some profiling of the list preallocation itself, but this is useless. The only thing that matters is profiling your complete code, with all your calculations inside your loop, with and without pre-allocation. If my prediction is right, the difference is so small that the computation time you win is dwarfed by the time spent thinking about, writing and maintaining the extra lines to pre-allocate your list.
Obviously, the first version. Let me explain why.
When you do [None] * n, Python internally creates a list object of size n and copies the same object (here None) to all the memory locations (this is why you should use this method only when you are dealing with immutable objects). So memory allocation is done only once. After that, a single pass through the list stores the reference in every slot. list_repeat is the C function which implements this kind of list creation.
/* Creates the list of the specified size */
np = (PyListObject *) PyList_New(size);
....
...
items = np->ob_item;
if (Py_SIZE(a) == 1) {
    elem = a->ob_item[0];
    for (i = 0; i < n; i++) {
        items[i] = elem;   // Copies the same item
        Py_INCREF(elem);
    }
    return (PyObject *) np;
}
When you use a list comprehension to build a list, Python cannot know the final size of the list being created, so it initially allocates a chunk of memory and stores the objects in it one by one. When the list grows beyond the allocated length, it has to allocate memory again and continue creating and storing objects in the list.
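A small sketch of the caveat mentioned above about [None] * n and mutability: repetition copies references, so using it with a mutable element gives n aliases of one object, whereas a comprehension creates n independent objects:
# Repetition: three references to the SAME list object.
rows_shared = [[]] * 3
rows_shared[0].append('x')
print(rows_shared)        # [['x'], ['x'], ['x']]

# Comprehension: three independent list objects.
rows_independent = [[] for _ in range(3)]
rows_independent[0].append('x')
print(rows_independent)   # [['x'], [], []]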
This is a problem I've come across a lot lately. Google doesn't seem to have an answer so I bring it to the good people of stack overflow.
I am looking for a simple way to populate a list with the output of a function. Something like this:
fill(random.random(), 3) #=> [0.04095623, 0.39761869, 0.46227642]
Here are other ways I've found to do this. But I'm not really happy with them, as they seem inefficient.
results = []
for x in xrange(3): results.append(random.random())
#results => [0.04095623, 0.39761869, 0.46227642]
and
map(lambda x: random.random(), [None] * 3)
#=> [0.04095623, 0.39761869, 0.46227642]
Suggestions?
Thanks for all the answers. I knew there was a more python-esque way.
And to the efficiency questions...
$ python --version
Python 2.7.1+
$ python -m timeit "import random" "map(lambda x: random.random(), [None] * 3)"
1000000 loops, best of 3: 1.65 usec per loop
$ python -m timeit "import random" "results = []" "for x in xrange(3): results.append(random.random())"
1000000 loops, best of 3: 1.41 usec per loop
$ python -m timeit "import random" "[random.random() for x in xrange(3)]"
1000000 loops, best of 3: 1.09 usec per loop
How about a list comprehension?
[random.random() for x in xrange(3)]
Also, in many cases, you need the values just once. In these cases, a generator expression which computes the values just-in-time and does not require a memory allocation is preferable:
results = (random.random() for x in xrange(3))
for r in results:
    ...
# results is "used up" now.
# We could have used results_list = list(results) to convert the generator.
By the way, in Python 3.x, xrange has been replaced by range. In Python 2.x, range allocates the memory and calculates all values beforehand (like a list comprehension), whereas xrange calculates the values just-in-time and does not allocate memory (it's a generator).
Why do you think they are inefficient?
There is another way to do it, a list comprehension:
listt= [random.random() for i in range(3)]
something more generic...
from random import random
fill = lambda func, num: [func() for x in xrange(num)]
# for generating tuples:
fill = lambda func, num: (func() for x in xrange(num))
# then just call:
fill(random, 4)
# or...
fill(lambda : 1+2*random(), 4)
list = [random.random() for i in xrange(3)]
list = [random.random() for i in [0]*3]
list = [i() for i in [random.random]*3]
Or:
fill =lambda f,n: [f() for i in xrange(n)]
fill(random.random , 3 ) #=> [0.04095623, 0.39761869, 0.46227642]
A list comprehension is probably clearest, but for the itertools aficionado:
>>> list(itertools.islice(iter(random.random, None), 3))
[0.42565379345946064, 0.41754360645917354, 0.797286438646947]
A quick check with timeit shows that the itertools version is ever so slightly faster for more than 10 items, but still go with whatever seems clearest to you:
C:\Python32>python lib\timeit.py -s "import random, itertools" "list(itertools.islice(iter(random.random, None), 10))"
100000 loops, best of 3: 2.93 usec per loop
C:\Python32>python lib\timeit.py -s "import random, itertools" "[random.random() for _ in range(10)]"
100000 loops, best of 3: 3.19 usec per loop
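For readers unfamiliar with the two-argument form of iter used above: iter(callable, sentinel) returns an iterator that keeps calling callable() until it returns sentinel. random.random() never returns None, so that iterator is effectively infinite and islice takes the first three values. A minimal illustration with a callable that does hit its sentinel:
import itertools
import random

counter = itertools.count()
next_value = lambda: next(counter)

# Call next_value() repeatedly until it returns 5, then stop.
print(list(iter(next_value, 5)))                             # [0, 1, 2, 3, 4]

# random.random never hits its sentinel, so we slice instead:
print(list(itertools.islice(iter(random.random, None), 3)))  # three random floats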
I want to zip up a list of entities with a new entity to generate a list of coordinates (2-tuples), but I want to ensure that for each (i, j), i < j is always true.
However, I am not extremely pleased with my current solutions:
from itertools import repeat

mems = range(1, 10, 2)
mem = 8

def ij(i, j):
    if i < j:
        return (i, j)
    else:
        return (j, i)

def zipij(m=mem, ms=mems, f=ij):
    return map(lambda i: f(i, m), ms)

def zipij2(m=mem, ms=mems):
    return map(lambda i: tuple(sorted([i, m])), ms)

def zipij3(m=mem, ms=mems):
    return [tuple(sorted([i, m])) for i in ms]

def zipij4(m=mem, ms=mems):
    mems = zip(ms, repeat(m))
    half1 = [(i, j) for i, j in mems if i < j]
    half2 = [(j, i) for i, j in mems[len(half1):]]
    return half1 + half2

def zipij5(m=mem, ms=mems):
    mems = zip(ms, repeat(m))
    return [(i, j) for i, j in mems if i < j] + [(j, i) for i, j in mems if i > j]
Output for above:
>>> print zipij() # or zipij{2-5}
[(1, 8), (3, 8), (5, 8), (7, 8), (8, 9)]
Instead of normally:
>>> print zip(mems, repeat(mem))
[(1, 8), (3, 8), (5, 8), (7, 8), (9, 8)]
Timings: snipped (no longer relevant, see much faster results in answers below)
For len(mems) == 5 there is no real issue with any solution, but in zipij5(), for instance, the second list comprehension needlessly goes back over the first four values even though the first comprehension already determined that i < j holds for them.
For my purposes, I'm positive that len(mems) will never exceed ~10000, if that helps form any answers for what solution is best. To explain my use case a bit (I find it interesting), I will be storing a sparse, upper-triangular, similarity matrix of sorts, and so I need the coordinate (i, j) to not be duplicated at (j, i). I say of sorts because I will be utilizing the new Counter() object in 2.7 to perform quasi matrix-matrix and matrix-vector addition. I then simply feed counter_obj.update() a list of 2-tuples and it increments those coordinates how many times they occur. SciPy sparse matrices ran about 50x slower, to my dismay, for my use cases... so I quickly ditched those.
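As an illustration of that use case (not part of the original question), the Counter bookkeeping looks roughly like this, assuming the (i, j) pairs are always stored with i < j:
from collections import Counter

similarity = Counter()

# Each batch of canonical (i, j) coordinates simply increments those cells.
similarity.update([(1, 8), (3, 8), (5, 8)])
similarity.update([(3, 8), (8, 9)])

print(similarity[(3, 8)])   # 2: seen in both batches
print(similarity[(8, 3)])   # 0: the (j, i) duplicate never appears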
So anyway, I was surprised by my results... The first methods I came up with were zipij4 and zipij5, and yet they are still the fastest, despite building a normal zip() and then generating a new zip after changing the values. I'm still rather new to Python, relatively speaking (Alex Martelli, can you hear me?), so here are my naive conclusions:
tuple(sorted([i, j])) is extremely expensive (Why is that?)
map(lambda ...) seems to always do worse than a list comp (I think I've read this and it makes sense)
Somehow zipij5() isn't much slower despite going over the list twice to check for i-j inequality. (Why is this?)
And lastly, I would like to know which is considered most efficient... or if there are any other fast and memory-inexpensive ways that I haven't yet thought of. Thank you.
Current Best Solutions
## Most BRIEF, Quickest with UNSORTED input list:
## truppo's
def zipij9(m=mem, ms=mems):
    return [(i, m) if i < m else (m, i) for i in ms]

## Quickest with pre-SORTED input list:
## Michal's
def zipij10(m=mem, ms=mems):
    i = binsearch(m, ms)   ## See Michal's answer for binsearch()
    return zip(ms[:i], repeat(m)) + zip(repeat(m), ms[i:])
Timings
# Michal's
Presorted - 410µs per loop
Unsorted - 2.09ms per loop ## Due solely to the expensive sorted()
# truppo's
Presorted - 880µs per loop
Unsorted - 896µs per loop ## No sorted() needed
Timings were using mems = range(1, 10000, 2), which is only ~5000 in length. sorted() will probably become worse at higher values, and with lists that are more shuffled. random.shuffle() was used for the "Unsorted" timings.
Current version:
(Fastest at the time of posting with Python 2.6.4 on my machine.)
Update 3: Since we're going all out, let's do a binary search -- in a way which doesn't require injecting m into mems:
def binsearch(x, lst):
    low, high = -1, len(lst)
    while low < high:
        i = (high - low) // 2
        if i > 0:
            i += low
            if lst[i] < x:
                low = i
            else:
                high = i
        else:
            i = high
            high = low
    return i

def zipij(m=mem, ms=mems):
    i = binsearch(m, ms)
    return zip(ms[:i], repeat(m)) + zip(repeat(m), ms[i:])
This runs in 828 µs = 0.828 ms on my machine vs the OP's current solution's 1.14 ms. Input list assumed sorted (and the test case is the usual one, of course).
This binary search implementation returns the index of the first element in the given list which is not smaller than the object being searched for. Thus there's no need to inject m into mems and sort the whole thing (like in the OP's current solution with .index(m)) or walk through the beginning of the list step by step (like I did previously) to find the offset at which it should be divided.
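As an aside (not part of the original answer), the standard library's bisect.bisect_left computes exactly that "first element not smaller than x" index for a sorted list, so it could presumably replace the hand-rolled binsearch:
from bisect import bisect_left

ms = [1, 3, 5, 7, 9]
print(binsearch(8, ms))      # 4
print(bisect_left(ms, 8))    # 4: same split index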
Earlier attempts:
How about this? (Proposed solution next to In [25] below, 2.42 ms to zipij5's 3.13 ms.)
In [24]: timeit zipij5(m = mem, ms = mems)
100 loops, best of 3: 3.13 ms per loop
In [25]: timeit [(i, j) if i < j else (j, i) for (i, j) in zip(mems, repeat(mem))]
100 loops, best of 3: 2.42 ms per loop
In [27]: [(i, j) if i < j else (j, i) for (i, j) in zip(mems, repeat(mem))] == zipij5(m=mem, ms=mems)
Out[27]: True
Update: This appears to be just about exactly as fast as the OP's self-answer. It seems more straightforward, though.
Update 2: An implementation of the OP's proposed simplified solution:
def zipij(m=mem, ms=mems):
    split_at = 0
    for item in ms:
        if item < m:
            split_at += 1
        else:
            break
    return [(item, m) for item in ms[:split_at]] + [(m, item) for item in ms[split_at:]]
In [54]: timeit zipij()
1000 loops, best of 3: 1.15 ms per loop
Also, truppo's solution runs in 1.36 ms on my machine. I guess the above is the fastest so far. Note you need to sort mems before passing them into this function! If you're generating it with range, it is of course already sorted, though.
Why not just inline your ij()-function?
def zipij(m=mem, ms=mems):
    return [(i, m) if i < m else (m, i) for i in ms]
(This runs in 0.64 ms instead of 2.12 ms on my computer)
Some benchmarks:
zipit.py:
from itertools import repeat

mems = range(1, 50000, 2)
mem = 8

def zipij7(m=mem, ms=mems):
    cpy = sorted(ms + [m])
    loc = cpy.index(m)
    return zip(ms[:(loc)], repeat(m)) + zip(repeat(m), ms[(loc):])

def zipinline(m=mem, ms=mems):
    return [(i, m) if i < m else (m, i) for i in ms]
Sorted:
>python -m timeit -s "import zipit" "zipit.zipinline()"
100 loops, best of 3: 4.44 msec per loop
>python -m timeit -s "import zipit" "zipit.zipij7()"
100 loops, best of 3: 4.8 msec per loop
Unsorted:
>python -m timeit -s "import zipit, random; random.shuffle(zipit.mems)" "zipit.zipinline()"
100 loops, best of 3: 4.65 msec per loop
>python -m timeit -s "import zipit, random; random.shuffle(zipit.mems)" "zipit.zipij7()"
100 loops, best of 3: 17.1 msec per loop
Most recent version:
def zipij7(m=mem, ms=mems):
    cpy = sorted(ms + [m])
    loc = cpy.index(m)
    return zip(ms[:(loc)], repeat(m)) + zip(repeat(m), ms[(loc):])
It benchmarks slightly faster for me than truppo's, and about 30% slower than Michal's. (Looking into that now.)
I may have found my answer (for now). It seems I forgot to make a list comprehension version of zipij():
def zipij1(m=mem, ms=mems, f=ij):
    return [f(i, m) for i in ms]
It still relies on my silly ij() helper function, so it doesn't win the award for brevity, certainly, but timings have improved:
# 10000
1.27s
# 50000
6.74s
So it is now my current "winner", and also does not need to generate more than one list, or use a lot of function calls, other than the ij() helper, so I believe it would also be the most efficient.
However, I think this could still be improved... I think that making N ij() function calls (where N is the length of the resultant list) is not needed:
Find at what index mem would fit into mems when ordered
Split mems at that index into two parts
Do zip(part1, repeat(mem))
Add zip(repeat(mem), part2) to it
It'd basically be an improvement on zipij4(), and this avoids N extra function calls, but I am not sure of the speed/memory benefits over the cost of brevity. I will maybe add that version to this answer if I figure it out.
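A minimal sketch of those four steps (not from the original answer), using the standard-library bisect module to find the split index and keeping the Python 2 behaviour used throughout this question, where zip returns a list:
from bisect import bisect_left
from itertools import repeat

mems = range(1, 10, 2)   # assumed pre-sorted, as in the question
mem = 8

def zipij_split(m=mem, ms=mems):
    split_at = bisect_left(ms, m)                 # 1. where m would fit in the sorted ms
    lower, upper = ms[:split_at], ms[split_at:]   # 2. split ms at that index
    return zip(lower, repeat(m)) + zip(repeat(m), upper)   # 3. + 4. pair each half

print zipij_split()   # [(1, 8), (3, 8), (5, 8), (7, 8), (8, 9)]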